How to split strings contained in a text column of csv file into words

Ayushi_Aggarwal Member Posts: 4 Newbie
I am currently reading a CSV file that has the columns review (text), n1, n2, n3, and overall (text).
I use Select Attributes to keep only the review column, which gives me output in RapidMiner of this form:
Row                                   Review
1                                        Poor service
2                                        There were torn seats

What I want to do is split the contents of the Review column into individual words, such as: Poor, service, There, etc.
I am using Process Documents from Data > Tokenize, but I am not getting the required output.

Please help.

Answers

  • David_A Moderator, Employee, RMResearcher, Member Posts: 148  RM Research
    Hi,

    If you don't necessarily have to use the Text extension, you could simply use the "Split" operator (not to be confused with "Split Data") with a regular expression. Something simple like \s+|\W+ should do the trick: it splits on whitespace or on runs of non-word characters (anything other than letters, digits, and underscores).

    Best,
    David
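
For reference, here is a minimal Python sketch of what the suggested pattern does to the sample reviews from the question. It runs outside RapidMiner and uses Python's `re` module, whose regex dialect may differ in details from the one the Split operator uses; `split_words` is a hypothetical helper name. Because `\W` already matches whitespace, `\W+` alone would give the same result here.

```python
import re

# Pattern suggested above: split on runs of whitespace or runs of non-word characters.
SPLIT_PATTERN = re.compile(r"\s+|\W+")

def split_words(text):
    """Split text on the pattern and drop empty strings left by adjacent separators."""
    return [part for part in SPLIT_PATTERN.split(text) if part]

# Sample rows taken from the question.
reviews = ["Poor service", "There were torn seats"]
words_per_review = [split_words(review) for review in reviews]
print(words_per_review)
```

Filtering out empty strings matters when a review ends in punctuation or has punctuation between spaces, since splitting there would otherwise leave empty entries.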

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,088   Unicorn
    Can you be clearer about why Tokenize is not giving you what you expect? What are you getting? If you share your process and a data sample, it will be easier to troubleshoot. In general, Tokenize should do exactly what you are asking for: take a text column and split it into individual words.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
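
To make the expected result concrete, the sketch below reproduces it in plain Python rather than in RapidMiner: it reads a CSV with the columns described in the question and splits the review column into words, one list per row. The sample data, the `tokenize` helper, and the letters-only rule are assumptions for illustration. They approximate splitting on non-letter characters and are not RapidMiner's actual implementation.

```python
import csv
import io
import re

# Hypothetical sample data mirroring the columns named in the question.
SAMPLE_CSV = """review,n1,n2,n3,overall
Poor service,1,2,1,bad
There were torn seats,2,1,1,bad
"""

def tokenize(text):
    """Return maximal runs of letters, so digits and punctuation act as separators."""
    return re.findall(r"[^\W\d_]+", text)

# Read the CSV by header name and tokenize only the review column.
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
tokens_by_row = [tokenize(row["review"]) for row in reader]
print(tokens_by_row)
```

If the actual output differs from this shape (for example, one column per word with counts instead of a list of words), that difference is worth stating when sharing the process.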