
Token Metadata

erdnusserdnuss Member Posts: 8 Contributor II
edited November 2018 in Help
Hello.

I'm working on a task I couldn't find any help for on the web. I am trying to extract the most frequent words of a document collection and associate each of them with all the sentences and all the documents they appear in. I would like the result to be a tree-like structure, e.g. in an Excel file containing all the information, with output like this:

Token1----hyperlink--------Sentence1- - -hyperlink - - -Doc  --\
                                              Sentence2- - - - - -                Doc    \ (order with reference to highest TF/IDF-value of token)
                                              Sentence3- - - - - - -              Doc--/ 

Token2----hyperlink--------Sentence1- - - - - Doc
                                              Sentence2
...
etc. You get the idea.

To accomplish this, I tried adding the sentences and the documents as meta information to the tokens (using the Sentence Tokenizer and Word Tokenizer operators) and writing that meta information to Excel (Write Excel operator). The result is far too redundant: it contains 13245 examples from only 16 documents, so I think scaling this process up is going to be quite hard. I also wonder whether meta information can be added at different "levels": specifically, can I add the document as meta information to the sentences it contains, and then add this "package" to the tokens as meta information?

I am not very familiar with data structures or RapidMiner and hope this is possible. Here's my process so far:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
    <process expanded="true" height="530" width="748">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="75">
        <list key="text_directories">
          <parameter key="stirling" value="C:\Users\Marc\Desktop\Data\Stirling"/>
        </list>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <connect from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="information_extraction:sentence_tokenizer" compatibility="1.0.000" expanded="true" height="76" name="SentenceTokenizer" width="90" x="179" y="165">
        <parameter key="optionalAttribute" value="text"/>
        <parameter key="new token-name" value="Sentences"/>
      </operator>
      <operator activated="true" class="information_extraction:word_tokenizer" compatibility="1.0.000" expanded="true" height="76" name="WordTokenizer" width="90" x="313" y="255">
        <parameter key="optionalAttribute" value="Sentences"/>
        <parameter key="new token-name" value="Words"/>
      </operator>
      <operator activated="true" class="write_excel" compatibility="5.3.000" expanded="true" height="76" name="Write Excel" width="90" x="581" y="300">
        <parameter key="excel_file" value="C:\Users\Marc\Desktop\Data\Excel_Result\result.xls"/>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_op="SentenceTokenizer" to_port="example set input"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 1"/>
      <connect from_op="SentenceTokenizer" from_port="example set output" to_op="WordTokenizer" to_port="example set input"/>
      <connect from_op="WordTokenizer" from_port="example set output" to_op="Write Excel" to_port="input"/>
      <connect from_op="WordTokenizer" from_port="original example set output" to_port="result 2"/>
      <connect from_op="Write Excel" from_port="through" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
thank you in advance :)
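
The layered structure asked about above (document attached to sentence, sentence attached to token) can be sketched outside RapidMiner as an inverted index. The following is only a minimal Python illustration under stated assumptions, not a RapidMiner process: the two-document corpus, the file names, and the helper names (`split_sentences`, `build_index`, `tree_for`) are all made up, the sentence splitter is deliberately naive, and the score is plain `tf * log(N / df)`. Each sentence is stored once together with its document, and tokens only hold sentence ids, which avoids repeating the sentence and document text once per token occurrence.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for the real files.
docs = {
    "doc_a.txt": "The Stirling engine is old. A Stirling engine needs heat.",
    "doc_b.txt": "Heat moves the piston. The engine runs quietly.",
}

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Lowercased alphabetic tokens only.
    return re.findall(r"[a-z]+", sentence.lower())

def build_index(docs):
    sentences = []               # sentence id -> (document name, sentence text)
    postings = defaultdict(set)  # token -> ids of the sentences containing it
    tf = defaultdict(Counter)    # token -> {document name: occurrence count}
    for name, text in docs.items():
        for sent in split_sentences(text):
            sid = len(sentences)
            sentences.append((name, sent))
            for tok in tokenize(sent):
                postings[tok].add(sid)
                tf[tok][name] += 1
    return sentences, postings, tf

def tree_for(token, sentences, postings, tf, n_docs):
    # One (score, document, sentence) row per sentence containing the token,
    # ordered by the token's TF-IDF score in the sentence's document.
    rows = []
    for sid in sorted(postings.get(token, ())):
        doc, sent = sentences[sid]
        score = tf[token][doc] * math.log(n_docs / len(tf[token]))
        rows.append((score, doc, sent))
    rows.sort(key=lambda row: -row[0])
    return rows

sentences, postings, tf = build_index(docs)
for score, doc, sent in tree_for("stirling", sentences, postings, tf, len(docs)):
    print(f"stirling | {sent} | {doc} | {score:.3f}")
```

Writing one row per (token, sentence) pair and referring to sentences by id keeps the output closer to the tree layout sketched in the question than one row per token occurrence would.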

Answers

  • erdnusserdnuss Member Posts: 8 Contributor II
    Does really nobody have any idea?

    OK, maybe you can at least help me with this:

    I want to extract strings containing a given word, like "stirling", using the Extract Information operator. Say I want the word and the 5 words before and after it added as metadata.
    I tried the string matching setting and the regex

    ^.*stirling.*$

    but it gives me an error ("Process Failed. No group 1"):
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.005">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.005" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:process_document_from_file" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="75">
            <list key="text_directories">
              <parameter key="stirling" value="C:\Users\Marc\Desktop\Data\Stirling"/>
            </list>
            <parameter key="create_word_vector" value="false"/>
            <parameter key="keep_text" value="true"/>
            <process expanded="true">
              <operator activated="true" class="text:extract_information" compatibility="5.3.000" expanded="true" height="60" name="Extract Information" width="90" x="112" y="30">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries">
                  <parameter key="Kontext" value="^.*stirling.*$"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="document" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
  • erdnusserdnuss Member Posts: 8 Contributor II
    hellooo :(
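
On the "No group 1" error quoted above: that wording is what Java's regex `Matcher` reports when group 1 is requested from a pattern that has no capturing group, and `^.*stirling.*$` has none. A likely cause, stated here as an assumption and not checked against the operator's source, is that Extract Information reads the first capturing group of each regular expression query, so wrapping the wanted text in parentheses may be enough. The sketch below shows the same distinction with Python's `re` module (where the equivalent failure is an `IndexError`), plus one way to capture the word with up to 5 words on either side. The helper name `context_windows` and the sample sentence are made up for illustration.

```python
import re

def context_windows(text, keyword, width=5):
    # Capture the keyword plus up to `width` words before and after it.
    # The outer parentheses make the whole window capturing group 1.
    pattern = re.compile(
        r"((?:\w+\W+){0,%d}\b%s\b(?:\W+\w+){0,%d})"
        % (width, re.escape(keyword), width),
        re.IGNORECASE,
    )
    return [m.group(1) for m in pattern.finditer(text)]

line = "The Stirling engine was patented in 1816."

# No parentheses: the match succeeds, but there is no group 1 to read.
no_group = re.search(r"^.*stirling.*$", line, re.IGNORECASE)
try:
    no_group.group(1)
except IndexError as err:
    print("without parentheses:", err)

# With parentheses, group 1 holds the matched line.
with_group = re.search(r"^(.*stirling.*)$", line, re.IGNORECASE)
print("with parentheses:", with_group.group(1))

print(context_windows("a b c d e f Stirling, g h i j k l", "stirling"))
```

In the process above, the corresponding change would be to set the `Kontext` query to something like `^(.*stirling.*)$`, or to a windowed pattern of the kind built in `context_windows`, again assuming the operator returns group 1.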