"Help with Word List Operator"

ronmac Member Posts: 11 Contributor II
edited June 2019 in Help
I am trying to add the WordList Operator to this Word Vector code I am working on. I cannot enable it properly. I would appreciate any suggestions on implementing the WordList Operator. I wanted to add it to the end so I can get a list of each word with a count.

Thanks,
Ron McEwan
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.11" expanded="true" name="Process">
    <process expanded="true" height="431" width="413">
      <operator activated="true" class="web:get_webpage" compatibility="5.0.4" expanded="true" height="60" name="Get Page" width="90" x="55" y="46">
        <parameter key="url" value="http://seekingalpha.com/news/market_currents?source=refreshed"/>
        <list key="query_parameters"/>
      </operator>
      <operator activated="true" class="text:tokenize" compatibility="5.0.7" expanded="true" height="60" name="Tokenize" width="90" x="202" y="41"/>
      <operator activated="true" class="text:extract_length" compatibility="5.0.7" expanded="true" height="60" name="Extract Length" width="90" x="112" y="165"/>
      <operator activated="true" class="text:extract_token_number" compatibility="5.0.7" expanded="true" height="60" name="Extract Token Number" width="90" x="246" y="165"/>
      <connect from_op="Get Page" from_port="output" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_op="Extract Length" to_port="document"/>
      <connect from_op="Extract Length" from_port="document" to_op="Extract Token Number" to_port="document"/>
      <connect from_op="Extract Token Number" from_port="document" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Answers

  • el_chief Member Posts: 63 Contributor II
    Check out my blog this week. I've got 5 videos all about text mining, and they should answer your question.
  • ronmac Member Posts: 11 Contributor II
    Thanks, looking forward to it.
  • colo Member Posts: 236 Maven
    Hi Ron,

    It seems you haven't created a word vector yet. The "Process Documents" operator does this for you. If you only need the term occurrences, it is a very simple extension of your example code:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.11" expanded="true" name="Process">
        <process expanded="true" height="431" width="681">
          <operator activated="true" class="web:get_webpage" compatibility="5.0.4" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
            <parameter key="url" value="http://seekingalpha.com/news/market_currents?source=refreshed"/>
            <list key="query_parameters"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.0.6" expanded="true" height="94" name="Process Documents" width="90" x="313" y="30">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <process expanded="true" height="607" width="786">
              <operator activated="true" class="text:tokenize" compatibility="5.0.6" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <operator activated="true" class="text:extract_length" compatibility="5.0.6" expanded="true" height="60" name="Extract Length" width="90" x="246" y="30"/>
              <operator activated="true" class="text:extract_token_number" compatibility="5.0.6" expanded="true" height="60" name="Extract Token Number" width="90" x="380" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Extract Length" to_port="document"/>
              <connect from_op="Extract Length" from_port="document" to_op="Extract Token Number" to_port="document"/>
              <connect from_op="Extract Token Number" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Regards,
    Matthias
  • ronmac Member Posts: 11 Contributor II
    Thanks. The example was very helpful.
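
To get the per-word counts asked about in the original question, one option is to pass the word list that "Process Documents" produces through a "WordList to Data" operator. The fragment below is a sketch, not a tested process. It assumes the Text Processing extension provides the class text:wordlist_to_data with a "word list" input port and an "example set" output port, and that "Process Documents" exposes a "word list" output port. The compatibility value and the x/y coordinates are placeholders. The operator would go after the "Process Documents" operator inside the outer <process> element of Matthias's reply, and the two connections alongside the existing <connect> lines.

```xml
<!-- Assumed operator: turns a word list into an example set with one row per word -->
<operator activated="true" class="text:wordlist_to_data" compatibility="5.0.6" expanded="true" height="76" name="WordList to Data" width="90" x="447" y="120"/>
<!-- Feed the word list output of Process Documents into the converter -->
<connect from_op="Process Documents" from_port="word list" to_op="WordList to Data" to_port="word list"/>
<!-- Deliver the converted table as a second process result -->
<connect from_op="WordList to Data" from_port="example set" to_port="result 2"/>
```

If the class and port names match your installation, the second result should contain each word together with its occurrence counts, while the first result remains the term-occurrence vector.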