Which operator?

johnnyjohnny Member Posts: 1 Contributor I
edited November 2018 in Help

I am considering using RapidMiner for a piece of PhD research on web forums, and I'm feeling my way around the program.

What I want to do is use RapidMiner to test a large data set drawn from web forum databases to see three things:
a) how often certain phrases that I am interested in appear;
b) whether this frequency declines over time (based on the date of posting in the forum);
c) whether references to these phrases are favourable.

My dataset consists of several CSV files, each containing 7 columns and thousands of rows. Each row holds the details of one forum posting plus the complete text of that posting, so the "Message" field can be hundreds of words long. The columns are: "MessageID", "ThreadID", "ThreadName", "MemberID", "MemberName", "P_Date", "Message".

My question is, which operator should I use to load this kind of CSV that would allow me to use all seven columns?

I am using both RapidMiner 4.6 and 5 to see which is easier to learn, and would appreciate any guidance members have on this.


  • landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    I would recommend RapidMiner 5.0. It not only lowers the learning curve a lot, but also has the more advanced text processing capabilities.
    You can load your data with the Read CSV operator or simply import it using the wizards (File / Import Data). After this you will be able to use the Process Documents from Data operator of the Text Processing Extension to analyse each individual text. By default this operator generates texts from all attributes of type text, so you might want to change the type of the attribute that stores the message text with the Nominal to Text operator.
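
If you want to sanity-check phrase counts outside RapidMiner, a rough sketch in plain Python might look like the following. It is not a RapidMiner process: the function name `count_phrases_by_month`, the sample rows, and the `P_Date` format (`YYYY-MM-DD`) are all assumptions made for illustration, so adjust them to match your real files.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def count_phrases_by_month(csv_file, phrases, date_format="%Y-%m-%d"):
    """Count case-insensitive phrase occurrences per (month, phrase).

    Assumes a header row naming the seven columns from the question and
    a P_Date value matching date_format (an assumption, not a known fact).
    """
    counts = Counter()
    reader = csv.DictReader(csv_file)  # keeps all seven columns by name
    for row in reader:
        month = datetime.strptime(row["P_Date"], date_format).strftime("%Y-%m")
        text = row["Message"].lower()
        for phrase in phrases:
            # str.count counts non-overlapping occurrences of the phrase
            counts[(month, phrase)] += text.count(phrase.lower())
    return counts


# Hypothetical sample data standing in for one of the CSV files.
sample = io.StringIO(
    "MessageID,ThreadID,ThreadName,MemberID,MemberName,P_Date,Message\n"
    '1,10,Intro,100,alice,2009-01-15,"Open source is great, open source wins"\n'
    '2,10,Intro,101,bob,2009-02-03,"I tried open source once"\n'
)

result = count_phrases_by_month(sample, ["open source"])
for (month, phrase), n in sorted(result.items()):
    print(month, phrase, n)
```

For a real file, pass `open("yourfile.csv", newline="", encoding="utf-8")` instead of the in-memory sample. The per-month counts address points a) and b) of the question; judging whether a mention is favourable (point c) needs a sentiment step that this sketch does not attempt.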
