Text Analysis on documents collection coming from a CSV

gustavo_velho Member Posts: 3 Contributor I
edited November 2018 in Help

Hello!

 

I'm new to RapidMiner, and my main focus is text analysis of social media posts. I have a CSV file with several columns, where each row is a post (document) and one column holds the text body. How can I run text analysis on only that column while keeping all the other columns for further analysis, since they are still relevant?

 

Right now I have a process like:

 

Read CSV -> Select Attributes (to select only the body column) -> Data to Documents -> Process Documents (Tokenize, Transform Cases, Generate n-Grams, etc.) -> WordList to Data

 

This lets me see the list of the most common words/n-grams, but I lose all the related data for each document. I would like to, for example, filter the documents that contain a specific n-gram or word. Any tips would be helpful.

 

Thanks!

 

Gustavo Velho

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,122  RM Data Scientist

    Gustavo,

     

    simply enable "Keep text" in the Process Documents operator. That way the upper output port of the operator should give you an additional attribute containing the text, together with your bag of words.
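
To make the idea concrete outside of RapidMiner, here is a minimal Python sketch of what a bag-of-words table with the text kept looks like: one row per document, one count column per word, plus the original text. This is not RapidMiner code; the function names, the `text` column name, and the sample posts are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(rows, text_column="CONTENT"):
    """Return one row per document: word counts plus the original text.

    The extra "text" key plays the role of the kept text attribute.
    (Assumption: no token is itself called "text", otherwise the
    count column would overwrite it.)
    """
    vocabulary = sorted({tok for row in rows for tok in tokenize(row[text_column])})
    result = []
    for row in rows:
        counts = Counter(tokenize(row[text_column]))
        out = {"text": row[text_column]}
        # Counter returns 0 for words that do not occur in this document.
        out.update({word: counts[word] for word in vocabulary})
        result.append(out)
    return result


# Invented sample posts, only for demonstration.
rows = [
    {"AUTHOR": "A", "CONTENT": "Great phone, great battery"},
    {"AUTHOR": "B", "CONTENT": "Battery died fast"},
]
table = bag_of_words(rows)
```

Each row of `table` now carries both the word counts and the text they came from, so a row can be inspected or filtered without going back to the source file.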

     

    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • gustavo_velho Member Posts: 3 Contributor I

    Thanks Martin! That seems to make sense, I'll test it. But let me add this: what about the other data belonging to a document? I have a file like:

    AUTHOR | DATE  | CONTENT        | SOURCE
    A      | 10/26 | Lorem Ipsum... | http://source.com
    B      | 10/27 | Lorem Ipsum... | http://source.com

     

    I see that RapidMiner offers several other statistics, so I would also like to benefit from those after the text analysis.

     

    Thanks again!

    Gustavo Velho

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,122  RM Data Scientist

    Hi,

    Process Documents should preserve the ID attribute as well. That way you can simply join the resulting bag-of-words example set with the original one. Process Documents may also preserve all special roles, but I would need to check this.
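
The join-on-ID idea can be sketched in plain Python as follows. It assumes every row carries a unique `id` (in RapidMiner, the Generate ID operator can add one before the text processing). The helper names, the sample content, and the underscore-joined n-gram format are assumptions made for this illustration, not RapidMiner's actual output.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Build n-grams as underscore-joined strings, e.g. 'battery_life'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Original table: all columns plus an id (sample content is invented).
documents = [
    {"id": 1, "AUTHOR": "A", "DATE": "10/26",
     "CONTENT": "The battery life is great", "SOURCE": "http://source.com"},
    {"id": 2, "AUTHOR": "B", "DATE": "10/27",
     "CONTENT": "Battery died after one day", "SOURCE": "http://source.com"},
]

# Text-analysis side: terms (words and bigrams) per document, keyed by id.
terms_by_id = {}
for doc in documents:
    tokens = tokenize(doc["CONTENT"])
    terms_by_id[doc["id"]] = set(tokens) | set(ngrams(tokens, 2))


def join_on_id(documents, terms_by_id):
    """Inner join: attach each document's terms to its original columns."""
    return [dict(doc, terms=terms_by_id[doc["id"]])
            for doc in documents if doc["id"] in terms_by_id]


def filter_by_term(joined, term):
    """Keep only the rows whose document contains the given word or n-gram."""
    return [row for row in joined if term in row["terms"]]


joined = join_on_id(documents, terms_by_id)
hits = filter_by_term(joined, "battery_life")
```

Because the join keeps every original column, each row in `hits` still has its AUTHOR, DATE, and SOURCE, so the filtered documents can be passed on to further analysis.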

     

    ~Martin

  • gustavo_velho Member Posts: 3 Contributor I

    Thanks Martin! That makes sense. I was coming to the same conclusion, that I would need to join the documents table with the word list or something similar. :)

     

    I've been using other tools for text analysis, and now I'm starting to test RapidMiner. It seems to have a better tokenization process so far, so let's see how the rest goes.

     

    Appreciate your help!

     

    Gustavo
