set categories by finding words in a document

vijen · February 2012

Hello everyone,

I am new to RapidMiner but enjoying the ride so far. I am stuck on a couple of issues.
First, I have a set of 3 categories, each defined by 5 words. If a document's text contains all 5 words of a category, I would like to assign the document to that category.
In other words, I would like to go through my dataset, search each document for the 5 words of every category, and assign the document to the category whose 5 words are all found.
Is there a way to do that in RapidMiner?

Cheers,

D

MariusHelf · March 2012

Hi,

You should use the Text Processing extension to tokenize your documents. Set the vector_creation parameter of the Process Documents operator to Binary Term Occurrences. You end up with an example set that has the documents as rows and the tokens as columns; a value greater than 0 in a column means that the word appeared in the corresponding document. You can then use Generate Attributes to create a new attribute that checks whether the 5 words are present and writes the result. Have a look at the attached process.

Best,
Marius

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.002" expanded="true" name="Process">
    <process expanded="true" height="431" width="748">
      <operator activated="true" class="text:create_document" compatibility="5.2.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
        <parameter key="text" value="this is a test text which contains an indicator word."/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.2.001" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="210">
        <parameter key="text" value="blabla blubb blubb"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.2.001" expanded="true" height="112" name="Process Documents" width="90" x="179" y="30">
        <parameter key="vector_creation" value="Binary Term Occurrences"/>
        <process expanded="true" height="639" width="757">
          <operator activated="true" class="text:tokenize" compatibility="5.2.001" expanded="true" height="60" name="Tokenize" width="90" x="112" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="5.2.002" expanded="true" height="76" name="Generate Attributes" width="90" x="380" y="30">
        <list key="function_descriptions">
          <parameter key="is_in_cat1" value="if(indicator == 1, &quot;yes&quot;, &quot;no&quot;)"/>
        </list>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="5.2.002" expanded="true" height="76" name="Generate Attributes (2)" width="90" x="514" y="30">
        <list key="function_descriptions">
          <parameter key="is_in_cat2" value="if(indicator == 1 &amp;&amp; word == 1, &quot;yes&quot;, &quot;no&quot;)"/>
        </list>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="5.2.002" expanded="true" height="76" name="Generate Attributes (3)" width="90" x="648" y="30">
        <list key="function_descriptions">
          <parameter key="is_in_cat3" value="if(blabla == 1, &quot;yes&quot;, &quot;no&quot;)"/>
        </list>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 2"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
      <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Generate Attributes (3)" to_port="example set input"/>
      <connect from_op="Generate Attributes (3)" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
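
The demo process only checks one or two words per category. As a rough sketch of how the same Generate Attributes pattern might extend to a category defined by five words, the fragment below ANDs five checks together. The attribute names alpha, beta, gamma, delta and epsilon are hypothetical placeholders for your own category words, and the operator name is made up for illustration; the expression syntax is copied from the is_in_cat2 entry above.

```xml
<!-- Sketch only: alpha .. epsilon are hypothetical token attributes, replace them with your 5 words -->
<operator activated="true" class="generate_attributes" compatibility="5.2.002" expanded="true" name="Generate Attributes (five words)">
  <list key="function_descriptions">
    <parameter key="is_in_cat1" value="if(alpha == 1 &amp;&amp; beta == 1 &amp;&amp; gamma == 1 &amp;&amp; delta == 1 &amp;&amp; epsilon == 1, &quot;yes&quot;, &quot;no&quot;)"/>
  </list>
</operator>
```

Two things to watch for, both assumptions on my side rather than something the demo process shows: a token column should only exist if the word occurs in at least one document, so an expression that references a word missing from the whole dataset will probably fail; and the tokens keep their original case unless you add a case transformation inside Process Documents, so "Alpha" and "alpha" would end up as different attributes.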
