"Grouping Text Files"

noah977 Member Posts: 32 Maven
edited May 2019 in Help

Next challenge in my attempt to learn RM.

I have a collection of text documents. (Maybe 1,000)

IF possible, I would like to use RM for the following:

    1) Automatically cluster them.  (I've seen great screenshots of the results, but have no idea how to do it.)

    2) Do some kind of "best feature" extraction: use TF-IDF or another algorithm to find significant 2-word and 3-word features.

    3) Maybe do some kind of sentiment analysis.  (I read a great press release about how RM was used for looking at consumer opinion of a laundry detergent.  That is amazing.  How was this done?)
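To make item 2 concrete, here is a minimal sketch in plain Python (not RM, and not how the text plugin implements it) of TF-IDF weighting over 2-word and 3-word features. The function names `ngrams`, `tfidf_ngrams` and `top_features`, the simple tokenizer, and the sample documents are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def ngrams(text, n_values=(2, 3)):
    """Return lowercase word n-grams (as strings) for each n in n_values."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + n])
            for n in n_values
            for i in range(len(words) - n + 1)]


def tfidf_ngrams(docs, n_values=(2, 3)):
    """Return one {ngram: tf-idf weight} dict per document."""
    counts = [Counter(ngrams(doc, n_values)) for doc in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    weights = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid dividing by zero on short docs
        weights.append({gram: (tf / total) * math.log(n_docs / df[gram])
                        for gram, tf in c.items()})
    return weights


def top_features(doc_weights, k=3):
    """Return the k highest-weighted n-grams (ties broken alphabetically)."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:k]]


# Made-up sample documents.
sample_docs = [
    "stock price fell after merger",
    "stock price rose on earnings",
    "product injury litigation filed",
]
sample_weights = tfidf_ngrams(sample_docs)
```

A phrase such as "stock price" that occurs in several documents gets a lower weight than a phrase unique to one document, which is the property that makes the top-weighted n-grams usable as "significant" features.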




  • TobiasMalbrecht Moderator, Employee, Member Posts: 295 RM Product Management

    Well, all of the things you mentioned are (of course ;)) possible with RM. But explaining all this to you here in general would be tantamount to writing a tutorial, or actually a book, about text mining and sentiment analysis. I am sure you will understand that this is beyond the scope of this forum...

    If you want to learn how to set up such text mining and sentiment analysis processes with RM really fast, I would highly recommend attending one of the training courses we offer. If you are very spontaneous, it might interest you that there is a training course on that topic at the beginning of December. As far as I remember, there is still a place available in the course. If you are interested, feel free to contact us. You may send an email to malbrecht@rapid-i.com and I will provide you with some more details.

    Otherwise there is of course the possibility of learning from the example processes which are shipped with the text plugin. However, learning things by yourself might take more of your time and might still leave you with open questions ...

  • noah977 Member Posts: 32 Maven

    Thank you for the information about your next class.  Unfortunately time and expense prohibit me from attending.  If you ever have a seminar again in California, I would be interested.

    Do you offer any kind of phone consulting?  It would be very helpful to buy one or two hours of time over the phone to discuss some basic project ideas.

  • noah977 Member Posts: 32 Maven

    I took your advice and looked through the examples with the text plugin.  I understand how to implement the plugin, load pages, create vector models, etc.

    My next question is about clustering.  If I choose "K-means", I must define the number of clusters in advance.  Ideally, I would like to feed a number of documents into RM and then get back as many clusters as necessary.  Is there some other tool that intelligently looks at the data (vectors from documents) and creates as many clusters as "necessary" to represent the data?

    Secondly, what kind of output options do I have?  Again, the ideal would be a list of documents for each cluster along with the key features of the cluster.  Perhaps to a text file?
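For illustration only, the kind of report described here could look like the following plain Python sketch. It is not an RM operator or export option; `write_cluster_report` is a hypothetical helper, and the labels, file names and weights are made-up sample data standing in for the output of whatever clustering and weighting steps were run.

```python
import io
from collections import defaultdict


def write_cluster_report(labels, doc_names, doc_weights, out, top_k=3):
    """Write one block per cluster: its documents and its top features.

    labels      -- cluster id for each document (from any clustering step)
    doc_names   -- name of each document
    doc_weights -- one {feature: weight} dict per document (e.g. TF-IDF)
    out         -- any writable text stream (an open file, sys.stdout, ...)
    """
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    for label in sorted(members):
        # Sum the feature weights over all documents in the cluster.
        totals = defaultdict(float)
        for idx in members[label]:
            for feature, weight in doc_weights[idx].items():
                totals[feature] += weight
        top = sorted(totals, key=lambda f: (-totals[f], f))[:top_k]

        out.write(f"Cluster {label} ({len(members[label])} documents)\n")
        out.write("  key features: " + ", ".join(top) + "\n")
        for idx in members[label]:
            out.write(f"  - {doc_names[idx]}\n")


# Made-up sample data; pass an open file instead of StringIO to get a text file.
report = io.StringIO()
write_cluster_report(
    labels=[0, 1, 0],
    doc_names=["a.txt", "b.txt", "c.txt"],
    doc_weights=[
        {"litigation": 0.9, "product": 0.2},
        {"earnings": 0.8},
        {"litigation": 0.5, "injury": 0.4},
    ],
    out=report,
)
```

Summing weights per cluster is just one simple way to pick "key features"; averaging, or comparing against the weights outside the cluster, would be equally defensible choices.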


  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn

    If you try to define "necessary", you will see that there cannot be any reasonable criterion for selecting the number of clusters automatically. Every program that does so just applies a heuristic, which may turn out to be good or bad depending on the circumstances and the problem. RapidMiner does not hide this problem and forces the user to think about the number of clusters needed.
    If you want to learn about clustering, check the samples that ship with RapidMiner itself...

  • noah977 Member Posts: 32 Maven

    Thank you for the explanation.

    I guess I should explain more about my goals and why I can't specify the number of clusters in advance.

    I am looking at a batch of documents.  Perhaps 1000-10000.  My goal is to use RM to find common "themes" amongst the documents.

    For example:
    1) Cluster of 125 documents all with highly weighted phrases of "litigation", "Product", "injury"
    2) Cluster of 57 documents with features of "Announcement", "Earnings", "Friday"
    3) Cluster of 357 documents with features of "Press Release", "merger with IBM", "stock price"

    My HOPE was that I could use RM to generate good TFIDF weights for tokens in the documents and then group them accordingly.  The logic would be that it would form groups of documents with a similarity score > X  (X would be an adjustable variable.)

    Is this possible?
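To show what the threshold idea amounts to, here is a minimal plain Python sketch (not an RM process). It assumes unsmoothed TF-IDF weights, cosine similarity as the "similarity score", and single-link grouping: two documents land in the same group if a chain of pairs with similarity > X connects them. All function names and the sample tokens are made up for illustration. The number of groups is then a consequence of X rather than something fixed in advance, but X is itself a heuristic choice.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Return one sparse {token: tf-idf weight} dict per tokenized document."""
    n_docs = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return [{t: count * math.log(n_docs / df[t])
             for t, count in Counter(tokens).items()}
            for tokens in token_lists]


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def group_by_threshold(vectors, x):
    """Group document indices linked by pairwise similarity greater than x."""
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > x:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up sample tokens echoing the example themes above.
sample_tokens = [
    ["litigation", "product", "injury"],
    ["litigation", "injury", "court"],
    ["earnings", "announcement", "friday"],
    ["earnings", "friday", "stock"],
]
sample_vectors = tfidf_vectors(sample_tokens)
loose_groups = group_by_threshold(sample_vectors, 0.2)
strict_groups = group_by_threshold(sample_vectors, 0.5)
```

Raising X splits the documents into more, tighter groups, and lowering it merges them. Note that this sketch compares every pair of documents, so its cost grows quadratically with the size of the collection, and single-link grouping can chain loosely related documents together.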