
Help on using an existing word list with the Process Documents operator

VR Member Posts: 6 Contributor II
edited September 2019 in Help
Hello,

For some reason I am unable to get the Process Documents operator to create a word vector when I feed an existing word list into it. When I run the process, the Process Documents operator creates a word vector in which the value of every attribute (supplied by the word list) is zero for all examples. This should be impossible, as I created the word list from the same data set. Is there a bug, or am I doing something wrong?

XML pasted below. Thanks for your help!

Regards
V



<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="5.3.015" expanded="true" height="60" name="Read Excel" width="90" x="45" y="165">
        <parameter key="excel_file" value="C:\Users\Documents\data.xlsx"/>
        <parameter key="imported_cell_range" value="A1:P5446"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="ID Auto.true.polynominal.id"/>
          <parameter key="3" value="Text.true.text.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.3.015" expanded="true" height="76" name="Nominal to Text" width="90" x="246" y="165">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|Text"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="5.3.015" expanded="true" height="60" name="Retrieve word list (2)" width="90" x="514" y="30">
        <parameter key="repository_entry" value="//Local Repository/data/word list"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="5.3.002" expanded="true" height="60" name="Data to Documents" width="90" x="447" y="165">
        <list key="specify_weights"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents" width="90" x="581" y="165">
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prune_below_percent" value="1.0"/>
        <parameter key="prune_above_percent" value="100.0"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="246" y="210"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Retrieve word list (2)" from_port="output" to_op="Process Documents" to_port="word list"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Answers

  • VR Member Posts: 6 Contributor II
    Further information on my earlier post: the problem occurs only when I try to create a TF-IDF vector. I get non-zero values for all the other vector types (Term Frequency, etc.).

    Thanks
    Vidya
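    One explanation consistent with this symptom (a guess, not a confirmed diagnosis of the operator's internals): TF-IDF multiplies the term frequency by idf(t) = log(N / df(t)). If every term on the supplied word list occurs in every document, then df(t) = N, the IDF factor is log(1) = 0, and every TF-IDF weight collapses to zero even though the raw term frequencies are non-zero. A minimal arithmetic sketch in Python (the document count 5446 is taken from the Excel cell range in the pasted process; the term frequency is hypothetical):

```python
import math

# Toy arithmetic only; this does not inspect RapidMiner's implementation.
N = 5446      # total documents (count taken from the Excel range A1:P5446)
df = 5446     # hypothetical: the word-list term occurs in every document
tf = 3        # hypothetical non-zero term frequency in one document

idf = math.log(N / df)   # log(1) = 0 whenever df == N
print(tf * idf)          # 0.0: non-zero TF, but zero TF-IDF
```

    This would explain why Term Frequency vectors look fine while TF-IDF is all zeros for the same word list.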
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist
    Hi,

    Are you doing exactly the same things inside Process Documents in both cases? Does each have just a Tokenize operator inside?

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • VR Member Posts: 6 Contributor II
    Hi Martin,

    No, I am not using the same Tokenize setup - I cannot - because here is how I do it:

    1. In the first step, I tokenize the data set and extract a word list (word list 1).
    2. Since I subsequently only want to use words with selected parts of speech to generate the vector, I have written word list 1 from step 1 to a file and run a POS tokenization on it to extract a subset of tokens into word list 2.
    3. The process I sent you uses word list 2 and gives me zero values in the TF-IDF vector for the same data set as in step 1. Interestingly, when I tried the Generate TF-IDF operator after Process Documents (but then generating a Term Frequency vector first), I do get non-zero values.
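    Another failure mode worth ruling out (a sketch, not RapidMiner's actual implementation): if the externally produced word list 2 contains token strings that differ from the tokens generated inside Process Documents (for example, the POS step changed casing or stemming), then no list entry ever matches and every weight is zero. A toy pure-Python TF-IDF with a fixed, externally supplied vocabulary illustrates the effect; all data and names here are hypothetical:

```python
import math

def tfidf(docs, vocab):
    """Toy TF-IDF over a fixed, externally supplied vocabulary.
    Vocabulary terms that never match a token score 0.0 everywhere."""
    n = len(docs)
    tokenized = [d.split() for d in docs]  # naive whitespace tokenizer
    # document frequency of each vocabulary term in this corpus
    df = {t: sum(t in toks for toks in tokenized) for t in vocab}
    rows = []
    for toks in tokenized:
        row = {}
        for t in vocab:
            tf = toks.count(t)
            idf = math.log(n / df[t]) if df[t] else 0.0
            row[t] = tf * idf
        rows.append(row)
    return rows

# Vocabulary built from a *differently processed* copy of the corpus
# (here, different casing), so no term ever matches a token:
docs = ["the quick brown fox", "the lazy dog"]
vocab = ["The", "Quick", "Dog"]   # mismatched casing
print(tfidf(docs, vocab))         # every weight is 0.0
```

    Comparing a few entries of word list 2 against the tokens actually produced by the Tokenize operator inside Process Documents would confirm or rule this out.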

    Let me know if this is unclear and thanks for your help!

    Regards
    V