Term frequency from Excel file

federica_gatto9 Member Posts: 7 Contributor I
edited June 2019 in Help

Hi everyone,

I have an Excel list of customer reviews and I would like to get the frequency of each word. I tried using Generate TFIDF directly, but it treats the whole text of each example as a single value instead of computing a frequency for each word.

Since I also want to tokenize and remove stopwords, and these operators only accept documents, I am not sure how to convert the Excel data into documents. Process Documents from Data gives me a word list, which still doesn't work, and with Extract Document I can only select one example, and in the end the text is still treated as a whole.

I hope I have explained my problem clearly!

Best regards,

Federica

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    @federica_gatto9 Please use a Read Excel operator to load the data, then Select Attributes to select the column with the text, then a Nominal to Text operator to convert it to the Text type that the Process Documents from Data operator can read. Then take the result from the EXA output port of the Process Documents from Data operator.
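For readers who want to see the idea behind that pipeline outside RapidMiner, here is a minimal Python sketch of per-review word counting with tokenization and stopword removal. It is not what the operators do internally; the sample reviews, the `STOPWORDS` set, and the helper names `tokenize` and `term_counts` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample reviews; in practice these would come from the text
# column of the Excel file (for example loaded with pandas.read_excel).
reviews = [
    "The delivery was fast and the product is great",
    "Great product, but the delivery was slow",
]

# Tiny illustrative stopword list (an assumption, not RapidMiner's list).
STOPWORDS = {"the", "and", "is", "was", "but", "a", "an", "of"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts(text):
    """Count each non-stopword token in a single review."""
    return Counter(t for t in tokenize(text) if t not in STOPWORDS)


# One Counter per review (word frequencies per example) ...
per_review = [term_counts(r) for r in reviews]
# ... and one Counter for the whole collection.
overall = sum(per_review, Counter())

print(per_review[0])
print(overall.most_common(3))
```

The point of the sketch is the order of the steps: each review is tokenized and filtered before counting, so the counts are per word rather than per whole text.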

  • federica_gatto9 Member Posts: 7 Contributor I

    Hi,

    The attribute I want to analyze is already set as text. I solved the problem: I had to put the tokenize and stopword operators inside Process Documents from Data, not after it. Another question: how should the results be interpreted? For example, if a word has Min: 0 and Max: 0.864, what does 0.864 mean?

    Thank you!

    Federica

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    @federica_gatto9 I don't know what you're doing, so I can't help you interpret the output. It would be best to post a screenshot at the very least. Normally we'd ask you to post your XML process and some sample data. 

  • federica_gatto9 Member Posts: 7 Contributor I

    You can find attached a picture of the process and one of the results in the statistics window. The numbers (min, max, average) are what I cannot interpret. I hope the pictures help.

    Best regards,

    Federica

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    It would be much better to post the actual process and data, since very little can be learned from the pictures.  For example, you have an operator labeled "Generate TF-IDF" but I have no idea what it is or what it is doing, since generating the TF-IDF vector is automatically part of the output (if selected) from Process Documents.

    But in general, the values you are seeing should be the values for the word vector calculations, presumably based on the TF-IDF method.  You can read about it here: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

    It is an adjusted frequency value that always lies between 0 and 1. Generally, a higher value means that the specific document is more relevant for that term, a lower value means it is less relevant, and a zero value means that the document does not contain that term at all.
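To make the interpretation concrete, here is a small Python sketch of one common TF-IDF variant: term frequency times log inverse document frequency, with each document vector scaled to unit length. The thread does not state which exact formula RapidMiner applies, so the weighting, the normalization step, the sample `docs`, and the function name `tf_idf_vectors` are assumptions. Under this variant every value falls between 0 and 1, which matches the range described above.

```python
import math
from collections import Counter

# Hypothetical tokenized reviews (already lowercased, stopwords removed).
docs = [
    ["great", "product", "fast", "delivery"],
    ["great", "product", "slow", "delivery"],
    ["terrible", "support"],
]


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Relative term frequency times log inverse document frequency.
        weights = {
            t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()
        }
        # Divide by the Euclidean norm so all weights land in [0, 1].
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append(
            {t: (w / norm if norm else 0.0) for t, w in weights.items()}
        )
    return vectors


vectors = tf_idf_vectors(docs)
for vec in vectors:
    print({t: round(w, 3) for t, w in vec.items()})
```

In this sketch "fast" gets a higher weight in the first review than "great", because "fast" occurs in only one document while "great" occurs in two. A column maximum such as 0.864 would then mean that in at least one review the word is both frequent and rare elsewhere, and a minimum of 0 means that at least one review does not contain it.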

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts