"Text Processing - How to track which are the exact documents contain the word?"

Tan_Koon_Chin Member Posts: 4 Contributor I
edited June 2019 in Help
Hi all,

I ran the text mining operators and obtained an ExampleSet (WordList to Data) and a WordList (Process Documents from Files). The results show the number of occurrences of each word. How can I determine which documents each word in the result belongs to?

Example: the word "apple" appears 100 times in 80 documents. How can I find out exactly which documents contain the word "apple"? What am I missing here? Is there a solution?


Thanks in advance.

Regards.

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    Take a look at the following process. The example set output contains a label identifying each document, and because the documents are processed with "Term Occurrences" as the vector creation method, you can see the word counts for each document.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document" width="90" x="112" y="165">
            <parameter key="text" value="apple banana lemon&#10;peach &#10;strawberry&#10;raspberry&#10;apple&#10;cherry&#10;melon"/>
            <parameter key="add label" value="true"/>
            <parameter key="label_value" value="doc1"/>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document (2)" width="90" x="112" y="255">
            <parameter key="text" value=" banana lemon&#10;peach &#10;strawberry&#10;raspberry&#10;&#10;cherry&#10;melon"/>
            <parameter key="add label" value="true"/>
            <parameter key="label_value" value="doc2"/>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document (3)" width="90" x="112" y="390">
            <parameter key="text" value="apple banana lemon&#10;peach &#10;strawberry&#10;raspberry&#10;apple&#10;cherry&#10;melon apple banana lemon&#10;peach &#10;strawberry&#10;raspberry&#10;apple&#10;cherry&#10;melon"/>
            <parameter key="add label" value="true"/>
            <parameter key="label_value" value="doc3"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="130" name="Process Documents" width="90" x="380" y="165">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 2"/>
          <connect from_op="Create Document (3)" from_port="output" to_op="Process Documents" to_port="documents 3"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    regards

    Andrew
  • Tan_Koon_Chin Member Posts: 4 Contributor I
    Thank you for your reply.

    What if many documents have been processed?
    (With only a few documents, I could use the "Create Document" operator and label each of them.)

    For example, the WordList result looks like this:

    Word          Total Occurrence    In Documents
    Apple         200                 180
    Orange        150                 130
    Strawberry    90                  50

    The result reveals that "Apple" appears 200 times in 180 documents.

    Is there any way to find out from the analysis result which 180 documents those are? (E.g. Doc. 10, Doc. 16, Doc. 45)

    Regards,
    Tan
  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    If you are using the "Process Documents from Files" operator, the file name of each document will appear in the output example set if the option "add meta information" is set to true. The attribute name is metadata_file.

    Andrew
  • Tan_Koon_Chin Member Posts: 4 Contributor I
    Thanks for the solution, Andrew!

    Best Regards.
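
For readers who want to see the idea behind the answers outside RapidMiner, here is a minimal Python sketch. It is only an illustration, not the RapidMiner implementation: the `documents` dictionary is a hypothetical stand-in for the example set (one entry per document, keyed by its label or `metadata_file` name), the helper names are made up for this example, and tokens are split on whitespace, which for this sample text should give the same tokens as the Tokenize step in the process above. The point is that once each row keeps its document identifier, finding the documents that contain "apple" is just a filter on occurrences greater than zero.

```python
from collections import Counter

# Hypothetical stand-in for the example set: one entry per document,
# keyed by its label (or by its metadata_file name when reading from files).
# The texts are the three sample documents from the process above.
BASE = "apple banana lemon\npeach \nstrawberry\nraspberry\napple\ncherry\nmelon"
documents = {
    "doc1": BASE,
    "doc2": " banana lemon\npeach \nstrawberry\nraspberry\n\ncherry\nmelon",
    "doc3": BASE + " " + BASE,
}


def term_occurrences(text):
    """Count how often each whitespace-separated token occurs in one document."""
    return Counter(text.split())


def documents_containing(word, docs):
    """Return {document name: occurrences} for documents where the word occurs."""
    hits = {}
    for name, text in docs.items():
        count = term_occurrences(text)[word]  # a Counter returns 0 for missing words
        if count > 0:
            hits[name] = count
    return hits


print(documents_containing("apple", documents))
# prints {'doc1': 2, 'doc3': 4}
```

In RapidMiner itself, the equivalent step would be to filter the example set on the word's attribute being greater than zero and then read the label or metadata_file column of the remaining rows.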