extract information

meersehn Member Posts: 2 Contributor I
edited November 2018 in Help
Hello,

I have got a txt file with more than 100 articles (containing date, headline, text, author).
I want the program to list all terms ending in -ing, -ion, etc. Afterwards I want the program to sort the terms by frequency and alphabetically. Unfortunately I am a beginner and I don't know how to proceed.
The following steps are working at the moment:

1. Read txt
2. Tokenize
3. Delete stopwords


After these steps RapidMiner gives me the text without stopwords. But how can I make the program give me
only the words with particular endings? Is "extract information" the correct input?

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    The Filter Tokens operator is what you are looking for. Set "condition" to "matches" and enter a regular expression like
    .*ion|.*ing
    This should give you the expected results.

    Best
    Marius
  • meersehn Member Posts: 2 Contributor I
    Thank you so much!
    That helped me a lot!

    Can somebody tell me if it is possible to remove the duplicates?
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Why would you want to do that? Anyway, you can set the vector creation of Process Documents to "Binary Term Occurrences", which only indicates whether a token is present in a document or not, as in the process below.

    Best, Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
        <process expanded="true" height="190" width="480">
          <operator activated="true" class="text:create_document" compatibility="5.2.001" expanded="true" height="60" name="Create Document" width="90" x="112" y="30">
            <parameter key="text" value="a a b a c b"/>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="5.2.001" expanded="true" height="60" name="Create Document (2)" width="90" x="112" y="120">
            <parameter key="text" value="a b d a a d d d"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.2.001" expanded="true" height="112" name="Process Documents" width="90" x="246" y="30">
            <parameter key="vector_creation" value="Binary Term Occurrences"/>
            <process expanded="true" height="536" width="950">
              <operator activated="true" class="text:tokenize" compatibility="5.2.001" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 2"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
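
For readers who want to check what the thread's steps should produce, here is a minimal Python sketch outside RapidMiner. It is not part of the process above. The sample tokens are hypothetical, the helper names `filter_tokens` and `rank_terms` are made up for illustration, and it assumes that the "matches" condition requires the whole token to match the regular expression.

```python
import re
from collections import Counter

# Hypothetical sample tokens standing in for tokenized, stopword-free text.
tokens = ["running", "nation", "cat", "running", "station", "sing", "nation", "running"]

# Same pattern as suggested in the thread. fullmatch is used on the
# assumption that "matches" means the whole token must match.
SUFFIX_PATTERN = re.compile(r".*ion|.*ing")


def filter_tokens(tokens):
    """Keep only tokens that match the suffix pattern as a whole."""
    return [t for t in tokens if SUFFIX_PATTERN.fullmatch(t)]


def rank_terms(tokens):
    """Count each distinct term, then sort by descending frequency,
    breaking ties alphabetically."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


matching = filter_tokens(tokens)

# Frequency list: duplicates are collapsed into one entry with a count.
ranked = rank_terms(matching)

# Presence only, comparable in spirit to binary term occurrences:
# each term appears once, regardless of how often it occurred.
unique_terms = sorted(set(matching))

for term, count in ranked:
    print(term, count)
```

Sorting on the pair (negative count, term) gives one ordering that covers both requests from the original question: most frequent terms first, alphabetical order among terms with equal counts.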