SVD Text Mining

B_Miner Member Posts: 72 Contributor II
Hi all-

Is there any way to determine the best number of dimensions for SVD applied to a text mining data set (say term frequencies) that will be used in clustering?

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    as usual, you cannot say how well something will work before you have tested it. For a classification or regression task, one mostly applies X-Validation to get a performance estimate. Clusterings can be evaluated using cluster measures. You could optimize the number of dimensions by iterating over this parameter and testing every value. There are operators for this, all starting with Optimize Parameters.

    But anyway, I doubt that SVD will work very well on text data sets, simply because it might take far too long to compute the Singular Value Decomposition of the huge matrices that frequently occur in text mining.

    Greetings,
      Sebastian
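
The sweep described above (try several numbers of SVD dimensions, cluster, score each clustering) can be sketched outside RapidMiner. This is only an illustration in Python with NumPy, not the Optimize Parameters operators: the synthetic data, the tiny `kmeans` helper, and the choice of the silhouette width as the cluster measure are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def mean_silhouette(X, labels):
    """Average silhouette width (between -1 and 1, higher is better)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        others = [c for c in np.unique(labels) if c != labels[i]]
        if not same.any() or not others:
            scores.append(0.0)
            continue
        a = D[i, same].mean()                              # cohesion
        b = min(D[i, labels == c].mean() for c in others)  # separation
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical data: 3 groups of 50 examples with 30 attributes each.
rng = np.random.default_rng(42)
true_centers = rng.normal(scale=5.0, size=(3, 30))
X = np.vstack([c + rng.normal(size=(50, 30)) for c in true_centers])
X = X - X.mean(axis=0)

# One SVD up front; the first k columns of U * s are the k-dimensional data.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

results = {}
for k in (1, 2, 3, 5, 10, 20):
    Z = U[:, :k] * s[:k]  # identical to X @ Vt[:k].T
    results[k] = mean_silhouette(Z, kmeans(Z, n_clusters=3))

best_k = max(results, key=results.get)
for k, score in results.items():
    print(f"dimensions={k:2d}  silhouette={score:.3f}")
print("best number of dimensions in this sweep:", best_k)
```

Note that a silhouette computed in the reduced space is not strictly comparable across different numbers of dimensions, so a result like this is a starting point rather than a verdict.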
  • B_Miner Member Posts: 72 Contributor II
    Hi Sebastian,

    There seems to be a lot of literature about the use of SVD with text, but indeed the time might be prohibitive. Is there a way in RM to get the singular values themselves (I have read one can plot their squares and see where they level off to determine the best # of dimensions)?

    Thanks!
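
The "plot the squared singular values and see where they level off" heuristic mentioned above can be tried outside RapidMiner. The following is a minimal NumPy sketch, not a RapidMiner feature: the matrix is a hypothetical stand-in for a term-frequency table (5 latent topics plus a little noise), and the 99% cutoff is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a term-frequency matrix:
# 200 documents x 50 terms, generated from 5 latent topics plus small noise.
topics = rng.random((5, 50))
weights = rng.random((200, 5))
X = weights @ topics + 0.01 * rng.standard_normal((200, 50))

# Singular values only (returned in descending order); no U or V needed.
s = np.linalg.svd(X, compute_uv=False)

squared = s ** 2
cumulative = np.cumsum(squared) / squared.sum()

# Smallest number of dimensions whose squared singular values
# account for at least 99% of the total (arbitrary threshold).
k = int(np.searchsorted(cumulative, 0.99) + 1)

for i in range(10):
    print(f"{i + 1:2d}  sigma^2={squared[i]:12.4f}  cumulative={cumulative[i]:.4f}")
print("dimensions needed for 99%:", k)
```

In the printed table the squared values should drop sharply after the fifth entry, which is the "levelling off" the heuristic looks for. On real text data the drop is usually much less clear-cut.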
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I'm not sure about this, but aren't they displayed in the model of the Singular Value Decomposition?

    Greetings,
      Sebastian
  • B_Miner Member Posts: 72 Contributor II
    I didn't think so... I can't get anything from the output to show up visually. Are you thinking of exporting the model and looking at the XML? Is that the best way: start with a really small model that I know the answer to and then try to find the right nodes in the XML? Something like this:


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="100" width="145">
          <operator activated="true" class="generate_massive_data" expanded="true" height="60" name="Generate Massive Data" width="90" x="45" y="30">
            <parameter key="number_examples" value="1000"/>
            <parameter key="number_attributes" value="10"/>
          </operator>
          <operator activated="true" class="singular_value_decomposition" expanded="true" height="94" name="SVD" width="90" x="219" y="69">
            <parameter key="return_preprocessing_model" value="true"/>
          </operator>
          <operator activated="true" class="write_model" expanded="true" height="60" name="Write Model" width="90" x="390" y="161">
            <parameter key="model_file" value="C:\Documents and Settings\Owner\Desktop\out.mod"/>
            <parameter key="output_type" value="XML"/>
          </operator>
          <connect from_op="Generate Massive Data" from_port="output" to_op="SVD" to_port="example set input"/>
          <connect from_op="SVD" from_port="preprocessing model" to_op="Write Model" to_port="input"/>
          <connect from_op="Write Model" from_port="through" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    as I see now, all this information is discarded in RapidMiner and hence currently cannot be shown in the visualization of the preprocessing model. If you take a look at the result of the PCA, it is indeed possible to use these values for display. I could add that relatively easily, I guess, but there won't be time for it this week or next. I'm very busy with customer projects that, by the way, are about text mining. So I'm curious about the runtime of SVD with many features and the amount of memory it needs. As far as I can see, it needs the complete matrix, so it will crash with my roughly 40,000 word attributes. Do you have any experience with that? What about the classification performance: is it worth implementing a special SVD for sparse matrices?

    Greetings,
    Sebastian
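
To put a rough number on the memory concern above, here is a back-of-the-envelope calculation. The 40,000 attributes come from the post; the document count of 10,000 is purely hypothetical, and whether a given SVD implementation actually materializes a full square factor is an assumption, not something confirmed for RapidMiner.

```python
def dense_gib(n_rows, n_cols, bytes_per_value=8):
    """Memory in GiB for a dense matrix of 64-bit floating point values."""
    return n_rows * n_cols * bytes_per_value / 2 ** 30

n_terms = 40_000  # word attributes, from the post
n_docs = 10_000   # hypothetical number of documents

# The data matrix itself, stored densely.
data_gib = dense_gib(n_docs, n_terms)

# A full terms x terms factor, if the implementation forms one (assumption).
square_gib = dense_gib(n_terms, n_terms)

print(f"dense data matrix:    {data_gib:.2f} GiB")
print(f"terms x terms factor: {square_gib:.2f} GiB")
```

Since most entries of a term-frequency matrix are zero, a sparse representation would store only a small fraction of these values.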
  • siamak_want Member Posts: 98 Contributor II
    Hi, Sebastian.

    Regarding your question: "is it worth implementing a special SVD for sparse matrices?"

    I think absolutely YES, it is worth it for sure. A couple of days ago I tested the SVD operator on my text data set with 23,000 features on a relatively high-performance machine. The algorithm took 10 hours to finish!!!

    As far as I know, the LSA algorithm tackles this problem. It just uses an approximation of the term-document matrix (which is much smaller than the original matrix). So I kindly suggest that the RM team try to embed an LSA operator in RM.

    cheers.
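
One way to get only the leading singular triplets without a full decomposition is a randomized range finder followed by a small SVD. The sketch below is one possible approach in NumPy, not necessarily what any LSA implementation or RapidMiner uses; the function name `randomized_svd`, its parameters, and the test matrix are all hypothetical. The matrix is touched only through matrix products, which is the property that makes this family of methods attractive for sparse data.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_power_iter=2, seed=0):
    """Approximate the top-k singular triplets of A.

    Projects A onto a random low-dimensional subspace, refines that
    subspace with a few power iterations, then runs an exact SVD on the
    small projected matrix.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    sketch = min(k + oversample, n_rows, n_cols)

    # Orthonormal basis Q that approximately spans the range of A.
    Q = np.linalg.qr(A @ rng.standard_normal((n_cols, sketch)))[0]
    for _ in range(n_power_iter):
        Q = np.linalg.qr(A.T @ Q)[0]
        Q = np.linalg.qr(A @ Q)[0]

    # B is only (sketch x n_cols), so its exact SVD is cheap.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# Hypothetical low-rank matrix: 300 documents x 80 terms, rank 5.
rng = np.random.default_rng(1)
A = rng.random((300, 5)) @ rng.random((5, 80))

U, s, Vt = randomized_svd(A, k=5)
exact = np.linalg.svd(A, compute_uv=False)[:5]

print("approximate:", np.round(s, 4))
print("exact:      ", np.round(exact, 4))
```

For a matrix that is only approximately low rank, the quality of the result depends on how quickly the singular values decay and on the `oversample` and `n_power_iter` settings.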
