RapidMiner

Text Processing - How to track which are the exact documents contain the word?

Contributor II

Hi all,

I have run the text mining operators and obtained an ExampleSet (from WordList to Data) and a WordList (from Process Documents from Files). The results show the number of occurrences of each word. How can I determine which documents each word in the result belongs to?

Example: the word "apple" appears 100 times in 80 documents. How can I find out exactly which documents contain the word "apple"? What am I missing here? Is there a solution?


Thanks in advance.

Regards.
4 REPLIES
Super Contributor

Re: Text Processing - How to track which are the exact documents contain the word?

Hello

Take a look at the following process. Each row of the example set output carries a label identifying its document, and because the documents are processed with Term Occurrences as the vector creation method, each row shows the word counts for that document.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document" width="90" x="112" y="165">
        <parameter key="text" value="apple banana lemon&#10;peach &#10;strawberry&#10;raspberry&#10;apple&#10;cherry&#10;melon"/>
        <parameter key="add label" value="true"/>
        <parameter key="label_value" value="doc1"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document (2)" width="90" x="112" y="255">
        <parameter key="text" value=" banana lemon&#10;peach &#10;strawberry&#10;raspberry&#10;&#10;cherry&#10;melon"/>
        <parameter key="add label" value="true"/>
        <parameter key="label_value" value="doc2"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document (3)" width="90" x="112" y="390">
        <parameter key="text" value="apple banana lemon&#10;peach &#10;strawberry&#10;raspberry&#10;apple&#10;cherry&#10;melon apple banana lemon&#10;peach &#10;strawberry&#10;raspberry&#10;apple&#10;cherry&#10;melon"/>
        <parameter key="add label" value="true"/>
        <parameter key="label_value" value="doc3"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="130" name="Process Documents" width="90" x="380" y="165">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 2"/>
      <connect from_op="Create Document (3)" from_port="output" to_op="Process Documents" to_port="documents 3"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
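
To make the shape of that output concrete, here is a minimal Python sketch (not RapidMiner code) that mimics it for the three sample documents. It assumes tokens are split on non-letter characters; `term_occurrences` and `documents_containing` are names made up for this illustration.

```python
import re
from collections import Counter

# The three sample documents from the process above, keyed by their label.
documents = {
    "doc1": "apple banana lemon\npeach \nstrawberry\nraspberry\napple\ncherry\nmelon",
    "doc2": " banana lemon\npeach \nstrawberry\nraspberry\n\ncherry\nmelon",
    "doc3": ("apple banana lemon\npeach \nstrawberry\nraspberry\napple\ncherry\nmelon "
             "apple banana lemon\npeach \nstrawberry\nraspberry\napple\ncherry\nmelon"),
}


def term_occurrences(text):
    """Split on non-letter characters (an assumption) and count each token."""
    tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]
    return Counter(tokens)


# One row per document, one count per word: the same shape as the example set.
rows = {label: term_occurrences(text) for label, text in documents.items()}


def documents_containing(word):
    """Return the labels of the documents whose count for `word` is above zero."""
    return [label for label, counts in rows.items() if counts[word] > 0]


apple_docs = documents_containing("apple")
print(apple_docs)
```

The documents that contain a word are simply the rows whose count for that word is greater than zero, so the label column answers the "which documents" question.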


regards

Andrew
Contributor II

Re: Text Processing - How to track which are the exact documents contain the word?

Thank you for the reply.

What if many documents are processed?
(With just a few documents, I can use the "Create Document" operator and label each of them.)

For example, the WordList result looks like this:

Word          Total Occurrences    In Documents
Apple         200                  180
Orange        150                  130
Strawberry    90                   50

The result reveals that "Apple" appears 200 times in 180 documents.

Is there any way to find out from the analysis result which 180 documents those are (e.g. Doc. 10, Doc. 16, Doc. 45)?

Regards,
Tan
Super Contributor

Re: Text Processing - How to track which are the exact documents contain the word?

If you are using the "Process Documents from Files" operator, the file name of each document appears in the output example set when the option "add meta information" is set to true. The attribute name is metadata_file.
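
As a rough illustration of how that attribute answers the question, here is a small Python sketch. It assumes the example set has been exported to CSV with a metadata_file column plus one term-occurrence column per word; the file names and counts are made up for the example, and `files_containing` is a hypothetical helper.

```python
import csv
import io

# Made-up sample standing in for an exported example set (assumption:
# one row per file, a metadata_file column, one count column per word).
EXPORTED = """metadata_file,apple,orange,strawberry
doc10.txt,2,0,1
doc16.txt,1,3,0
doc27.txt,0,1,0
doc45.txt,1,0,0
"""


def files_containing(csv_text, word):
    """Return the metadata_file values of rows whose count for `word` is above zero."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["metadata_file"] for row in reader if float(row[word]) > 0]


apple_files = files_containing(EXPORTED, "apple")
print(apple_files)
```

The same selection can be made inside RapidMiner by filtering the example set on the word's attribute being greater than zero and reading off metadata_file.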

Andrew
Contributor II

Re: Text Processing - How to track which are the exact documents contain the word?

Thanks for the solution, Andrew!

Best Regards.