"text mining output"

sheridany · August 2009

I am trying to text mine 30K email excerpts collated into one file. I know something is wrong because the frequency counts for words that I would expect to be frequent are coming up as zero.

id id integer avg = 1 +/- 0 [1.000 ; 1.000] 0.0
label label nominal mode = bp (1), least = bp (1) bp (1) 0.0
regular Dear real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular Wells real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular Fargo real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular online real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular bill real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular transactions real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular National real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular Benefit real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular Life real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular Insurance real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular Company real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular another real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular Both real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular were real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular deducted real avg = 0 +/- 0 [0.000 ; 0.000] 0.0
regular checking real avg = 0 +/- 0 [0.000 ; 0.000] 0.0

The log also references an issue with the example set at the end even though I have set it to overwrite.

P Aug 13, 2009 11:57:53 AM: Process:
Root[1] (Process)
+- TextInput[1] (TextInput)
| +- StringTokenizer[2] (StringTokenizer)
| +- StopwordFilterFile[2] (StopwordFilterFile)
| +- TokenLengthFilter[0] (TokenLengthFilter)
+- ExampleSetWriter[1] (ExampleSetWriter)
P Aug 13, 2009 11:57:53 AM: [Warning] TextInput: Warning: Encoding unknown. Using default.
P Aug 13, 2009 11:57:56 AM: [Warning] TextInput: The original example example set already contains an attribute named "label". This is likely to cause trouble. Please rename the attribute in the original example set.
P Aug 13, 2009 11:57:56 AM: [Warning] TextInput: There is a term that equals the class attribute, renaming it
P Aug 13, 2009 11:57:56 AM: [Warning] TextInput: Warning: Encoding unknown. Using default.
P Aug 13, 2009 11:57:59 AM: Process:
Root[1] (Process)
+- TextInput[1] (TextInput)
| +- StringTokenizer[2] (StringTokenizer)
| +- StopwordFilterFile[2] (StopwordFilterFile)
| +- TokenLengthFilter[2] (TokenLengthFilter)
+- ExampleSetWriter[1] (ExampleSetWriter)
P Aug 13, 2009 11:57:59 AM: Produced output:
IOContainer (1 objects):
SimpleExampleSet:
1 examples,
34729 regular attributes,
special attributes = {
id = #0: id (integer/single_value)
label = #34730: label (nominal/single_value)/values=[bp]
}
(created by TextInput)
P Aug 13, 2009 11:57:59 AM: [NOTE] Process finished successfully after 5 s
G Aug 13, 2009 11:57:59 AM: [NOTE] Cannot use plotter 'Scatter Matrix': Data table must have between 0 and 50 columns, was 34730.
G Aug 13, 2009 11:57:59 AM: [NOTE] Cannot use plotter 'Survey': Data table must have between 0 and 100 columns, was 34730.
G Aug 13, 2009 11:58:00 AM: [NOTE] Cannot use plotter 'Andrews Curves': Data table must have between 0 and 1000 columns, was 34730.
G Aug 13, 2009 11:58:00 AM: [NOTE] Cannot use plotter 'Quartile Color Matrix': Data table must have between 0 and 100 columns, was 34730.
G Aug 13, 2009 11:58:00 AM: [NOTE] Cannot use plotter 'RadViz': Data table must have between 0 and 1000 columns, was 34730.
G Aug 13, 2009 11:58:00 AM: [NOTE] Cannot use plotter 'GridViz': Data table must have between 0 and 10000 columns, was 34730.

Lastly, how can I use visualization to see frequent terms, words, etc.?

land · August 2009

Hi,
could you please post the complete process here inside a code area? Press the # button to create one. Otherwise I cannot say anything about the problem with the zeros.

Unfortunately, direct visualization of the term frequencies will only be available in the next version. But you could switch from TFIDF to term occurrences and then aggregate the complete example set. You would then have the total number of occurrences for each word.

Greetings,
Sebastian
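
The "switch to occurrences and aggregate" idea can be sketched outside RapidMiner as follows. This is only a minimal illustration in Python, not the plugin's implementation: the regex tokenizer, the function names, and the sample texts are all assumptions made for the example.

```python
import re
from collections import Counter


def term_occurrences(text):
    """Count how often each token occurs in one text (simple regex tokenizer, an assumption)."""
    return Counter(re.findall(r"[A-Za-z]+", text))


def aggregate(texts):
    """Sum the per-text occurrence counts into one total count per term."""
    total = Counter()
    for text in texts:
        total.update(term_occurrences(text))
    return total


# Hypothetical sample texts, not taken from the poster's data.
sample_texts = [
    "Dear Wells Fargo, my online bill pay failed",
    "online bill pay transactions were deducted twice",
]

totals = aggregate(sample_texts)
# Print the three most frequent terms with their total counts.
for term, count in totals.most_common(3):
    print(term, count)
```

Sorting the aggregated totals like this gives a plain list of the most frequent terms, which may be enough until a built-in visualization is available.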

sheridany · August 2009

<operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
        <list key="texts">
          <parameter key="bp"	value="C:\Documents and Settings\youngs\Desktop\rapidminer data file"/>
        </list>
        <parameter key="default_content_language"	value="english"/>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="StopwordFilterFile" class="StopwordFilterFile">
            <parameter key="file"	value="C:\Documents and Settings\youngs\Desktop\stopwordfile"/>
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
        </operator>
    </operator>
    <operator name="ExampleSetWriter" class="ExampleSetWriter">
        <parameter key="example_set_file"	value="C:\Documents and Settings\youngs\Desktop\testrm1.dat"/>
        <parameter key="overwrite_mode"	value="overwrite"/>
    </operator>
    <operator name="ExampleVisualizer" class="ExampleVisualizer">
    </operator>
</operator>

land · August 2009

Hi,
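the problem is really simple: you have loaded your complete data as ONE example. The TextInput operator reads each file found in the specified directory as a single example, so your one collated file becomes one example. With TFIDF encoding, every value will then be zero, because each term occurs in the only document and its inverse document frequency is therefore log(1/1) = 0.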

Greetings,
Sebastian
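
To see why a single example leads to all zeros, here is a minimal sketch of textbook TF-IDF in Python. It assumes the weight tf * log(N / df) and a whitespace tokenizer; the exact formula and normalization used by the text plugin may differ, and the sample texts are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    counts = [Counter(doc.split()) for doc in docs]

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for c in counts:
        df.update(c.keys())

    weights = []
    for c in counts:
        length = sum(c.values())
        weights.append(
            {term: (n / length) * math.log(n_docs / df[term]) for term, n in c.items()}
        )
    return weights


# One collated document: N = 1 and df = 1 for every term, so log(1/1) = 0.
print(tfidf(["dear wells fargo online bill"]))

# Two separate documents: terms unique to one document get a positive weight.
print(tfidf(["online bill pay", "online checking account"]))
```

In the second call, "online" still gets weight zero because it appears in both documents, while the other terms do not.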

sheridany · August 2009

Are you saying that each individual line needs to be a separate file?

land · August 2009

I'm saying that each independent text has to be a single file if you want to load it with TextInput.
If it's stored in something like CSV, you could load it as an example set, change the attribute type to string using the Nominal2String operator, and then use StringTextInput. This one will treat each row of the example set as one text.

Greetings,
Sebastian
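
If the one-file-per-text route is preferred, the collated file could be split before loading it with TextInput. The following is a rough Python sketch under stated assumptions: the data is a CSV file with one text per row, the text sits in the first column, and the function name and output file naming are made up for this example.

```python
import csv
from pathlib import Path


def split_rows_to_files(csv_path, out_dir, text_column=0):
    """Write the text of each non-empty CSV row to its own file; return the number written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Skip blank lines and rows without usable text in the chosen column.
            if len(row) <= text_column or not row[text_column].strip():
                continue
            target = out / f"text_{written:05d}.txt"
            target.write_text(row[text_column], encoding="utf-8")
            written += 1
    return written


# Example call with hypothetical paths:
# split_rows_to_files("emails.csv", "emails_as_files")
```

The resulting directory can then be given to TextInput, so that each email excerpt becomes its own example.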

sheridany · August 2009

I am still struggling to get the entire text collection loaded. I currently have the text data in a CSV file. I am getting this message in the log:

Aug 27, 2009 3:49:44 PM: [Warning] StringTextInput: File C:\Program Files\Rapid-I\RapidMiner\no longer wanted bill pay not found. Assuming the text is directly encoded as document source...

for each and every record.

Here is my XML:

<operator name="Root" class="Process" expanded="yes">
    <description text="#ylt#h3#ygt#Reading texts from string attributes#ylt#/h3#ygt##ylt#p#ygt#In some cases, the text that should be processed is not stored in a file, but is directly provided by an application through an example set. In this case you can use the StringTextInput operator much in the same fashion as you would use the usual TextInput operator, just that the incoming ExampleSet now directly contains string attributes (special value type) that represent the text to be processed.#ylt#/p#ygt#"/>
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="C:\Documents and Settings\youngs\Desktop\rapidminer data file\rapidnew.aml"/>
        <parameter key="column_separators" value="\t"/>
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <parameter key="vector_creation" value="TermOccurrences"/>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
    </operator>
</operator>

land · August 2009

Hi,
you could raise the log verbosity threshold in the process root operator so that these warnings are no longer logged. Unfortunately, nobody knows why the text plugin does this. RapidMiner 5 comes with a new TextPlugin version, redesigned from scratch, that does not show this behavior.

Greetings,
Sebastian
