RapidMiner

RapidMiner

Creating a comparing white list of words to a wordlist from a data mined webpage

SOLVED
Highlighted
Contributor II

Creating a comparing white list of words to a wordlist from a data mined webpage

Currently, I need to compare a wordlist that I have created, 'White List' to a word list that RapidMiner has created from a webpage. I have the wordlist from the webpage tokenized, and filtered. What I want to do is import a wordlist I have created into the process so that I can compare the wordlist I have made to the output of the process that works so far so that I can create a matching scheme. E.g., the 'White List' contains imaging while the word list from the output contains imaging, thus creating a match and moving that into a new output file.

 

If you need more information, let me know.

 

Ian

7 REPLIES
RMStaff

Re: Creating a comparing white list of words to a wordlist from a data mined webpage

Hi Ian,

 

you can go for wordlist to data twice, join the resulting tables and then do standard Filtering operations?

 

Best,

Martin

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Contributor II

Re: Creating a comparing white list of words to a wordlist from a data mined webpage

[ Edited ]

Ok, so I've created both wordlists, but I am confused on how to compare them now. Everytime I join them or create a union it does not compare the lists. I.e. it will delete words in common and neither count them as a term occurence or create a comparing wordlist. 

RMStaff

Re: Creating a comparing white list of words to a wordlist from a data mined webpage

Did you use wordlist to data and did an outer join on the word?

Best,

MArtin

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Contributor II

Re: Creating a comparing white list of words to a wordlist from a data mined webpage

Oh man, can you tell I'm new? Smiley Very Happy

 

Ok, so I see what your saying about the join outer operation but my two word lists do not have attribute id's and I do not how to set them. Everytime I go from Word list to Data into a join operation, no attribute list is set, and when I turn off use id attribute on the join block it just feeds out useless information. So, how do I give my data attributes within the tables during the prcoess? If I can do this, I do beleive the join block would work.

 

Ian

RMStaff

Re: Creating a comparing white list of words to a wordlist from a data mined webpage

Hi Idunn,

 

have a look at the attached process, it is working well for me.

 

~Martin

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data_user_specification" compatibility="7.3.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="85">
        <list key="attribute_values">
          <parameter key="text" value="&quot;this is a text&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.3.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="85"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.2.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="380" y="85">
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="7.2.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="85"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:wordlist_to_data" compatibility="7.2.000" expanded="true" height="82" name="WordList to Data" width="90" x="514" y="85"/>
      <operator activated="true" class="generate_data_user_specification" compatibility="7.3.001" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="45" y="238">
        <list key="attribute_values">
          <parameter key="text" value="&quot;this is another text&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.3.001" expanded="true" height="82" name="Nominal to Text (2)" width="90" x="179" y="238"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.2.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="380" y="238">
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="7.2.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="179" y="85"/>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:wordlist_to_data" compatibility="7.2.000" expanded="true" height="82" name="WordList to Data (2)" width="90" x="514" y="238"/>
      <operator activated="true" class="join" compatibility="7.3.001" expanded="true" height="82" name="Join" width="90" x="648" y="136">
        <parameter key="remove_double_attributes" value="false"/>
        <parameter key="join_type" value="outer"/>
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="word" value="word"/>
        </list>
      </operator>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
      <connect from_op="WordList to Data" from_port="example set" to_op="Join" to_port="left"/>
      <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="word list" to_op="WordList to Data (2)" to_port="word list"/>
      <connect from_op="WordList to Data (2)" from_port="example set" to_op="Join" to_port="right"/>
      <connect from_op="Join" from_port="join" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Contributor II

Re: Creating a comparing white list of words to a wordlist from a data mined webpage

You've solved it! I had practically everything excpet, nominal to text.

 

Could you explain this block to me in this form? Why did the program not work until I had that exact operation on? Does it just give attributes to tables?

 

Idunn

RMStaff

Re: Creating a comparing white list of words to a wordlist from a data mined webpage

Hi,

 

you mean Nominal to Text? It converts Attributes from Polynominal types (red, green, blue / yes, no) to text types (= unique strings). This type is needed for text mining.

 

~Martin

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner