"Help with Word List Operator"

ronmac · November 2010

I am trying to add the WordList Operator to this Word Vector code I am working on. I cannot enable it properly. I would appreciate any suggestions on implementing the WordList Operator. I wanted to add it to the end so I can get a list of each word with a count.

Thanks,
Ron McEwan

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.11" expanded="true" name="Process">
    <process expanded="true" height="431" width="413">
      <operator activated="true" class="web:get_webpage" compatibility="5.0.4" expanded="true" height="60" name="Get Page" width="90" x="55" y="46">
        <parameter key="url" value="http://seekingalpha.com/news/market_currents?source=refreshed"/>
        <list key="query_parameters"/>
      </operator>
      <operator activated="true" class="text:tokenize" compatibility="5.0.7" expanded="true" height="60" name="Tokenize" width="90" x="202" y="41"/>
      <operator activated="true" class="text:extract_length" compatibility="5.0.7" expanded="true" height="60" name="Extract Length" width="90" x="112" y="165"/>
      <operator activated="true" class="text:extract_token_number" compatibility="5.0.7" expanded="true" height="60" name="Extract Token Number" width="90" x="246" y="165"/>
      <connect from_op="Get Page" from_port="output" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_op="Extract Length" to_port="document"/>
      <connect from_op="Extract Length" from_port="document" to_op="Extract Token Number" to_port="document"/>
      <connect from_op="Extract Token Number" from_port="document" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

el_chief · November 2010

Check out my blog this week. I've got 5 videos all about text mining, and they should answer your question.

ronmac · November 2010

Thanks, looking forward to it.

colo · November 2010

Hi Ron,

It seems you haven't created a word vector yet. You can do this with the "Process Documents" operator. If you only need the term occurrences, it's a very simple extension of your example code:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.11" expanded="true" name="Process">
    <process expanded="true" height="431" width="681">
      <operator activated="true" class="web:get_webpage" compatibility="5.0.4" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
        <parameter key="url" value="http://seekingalpha.com/news/market_currents?source=refreshed"/>
        <list key="query_parameters"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.0.6" expanded="true" height="94" name="Process Documents" width="90" x="313" y="30">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <process expanded="true" height="607" width="786">
          <operator activated="true" class="text:tokenize" compatibility="5.0.6" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:extract_length" compatibility="5.0.6" expanded="true" height="60" name="Extract Length" width="90" x="246" y="30"/>
          <operator activated="true" class="text:extract_token_number" compatibility="5.0.6" expanded="true" height="60" name="Extract Token Number" width="90" x="380" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Extract Length" to_port="document"/>
          <connect from_op="Extract Length" from_port="document" to_op="Extract Token Number" to_port="document"/>
          <connect from_op="Extract Token Number" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
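
For anyone unsure what a term-occurrence word list looks like, here is a rough Python sketch of the idea. It is only an illustration and not how RapidMiner implements it: the function name `term_occurrences`, the sample sentence, and the letters-only splitting rule are all assumptions made for this example.

```python
import re
from collections import Counter


def term_occurrences(text):
    """Return a Counter mapping each token to how often it appears in text."""
    # Assumption: tokens are runs of ASCII letters, so we split on anything else
    # and drop the empty strings that splitting can leave at the edges.
    tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]
    return Counter(tokens)


# Hypothetical sample text standing in for the downloaded page.
sample = "Stocks rise as oil falls. Oil stocks rise again."
word_list = term_occurrences(sample)

# Print each word with its count, in alphabetical order.
for word, count in sorted(word_list.items()):
    print(word, count)
```

Note that the sketch is case-sensitive, so "Oil" and "oil" are counted as separate words; lowercasing the text before counting would merge them.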

Regards,
Matthias

ronmac · November 2010

Thanks. The example was very helpful.

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

"Help with Word List Operator"

Answers