Text Mining - Document Similarity & Clustering

AB9200 Member Posts: 2 Contributor I
edited December 2018 in Help

Hi everyone

I'm new to this forum, to RapidMiner, and to text mining, so I need your help:

I have a large number of documents (.txt), each containing a specific question about solving a problem and the corresponding answer.

My objective is, given a new question, to identify the closest existing questions (all of them are in Italian) so that I can suggest a possible solution based on the answers given to those similar questions.

I have downloaded the Text Mining Extension, and I imagine I have to use the "Process Documents from Files" operator first (Tokenize, Filter Stopwords (Italian), Transform Cases, Stem...) and then probably the "Document to Similarity" and "Clustering" operators.
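For readers who want to see what those preprocessing steps do outside RapidMiner, here is a minimal Python sketch, not the RapidMiner operators themselves. It lowercases, tokenizes on letters, and filters stopwords. The `ITALIAN_STOPWORDS` set and the `preprocess` function are hypothetical names introduced for illustration, the stopword list is deliberately tiny, and stemming is left out because it would need an external library.

```python
import re

# Tiny illustrative stopword list (an assumption; a real list is much longer).
ITALIAN_STOPWORDS = {
    "il", "lo", "la", "i", "gli", "le", "un", "una", "di", "del", "della",
    "a", "da", "in", "con", "su", "per", "e", "o", "che", "non", "come",
}


def preprocess(text):
    """Lowercase the text, split it into letter-only tokens, drop stopwords."""
    # [^\W\d_]+ matches runs of Unicode letters, so accented characters survive.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in ITALIAN_STOPWORDS]


print(preprocess("Come posso reimpostare la password del router?"))
```

Each question file would go through the same function before any similarity or clustering step, so that all documents share one vocabulary.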

Could you please give me some hints?

 

Thanks a lot!


Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,446 Community Manager

    hello @AB9200 welcome to the community.

     

    So this sounds like a classic text classification problem. You want the algorithm to take a new question and classify it based on what it has "learned" from similar questions it has seen before. To do this, you need a "training set" of questions that you have already classified by other means (manually is fine). Once you have a training set, you can use one of the "Process Documents" operators to generate TF-IDF word vectors and build your ML model on them. There are good resources in our Training section on building ML classification models, and many text mining resources on our YouTube channel.
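As a rough illustration of what TF-IDF word vectors are, here is a small Python sketch using one common textbook weighting, term frequency times log(N / document frequency). It is an assumption that this matches what you want to inspect; RapidMiner's operators may normalize differently. The function name `tfidf_vectors` and the sample documents are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


# Hypothetical, already tokenized questions.
docs = [
    ["stampante", "non", "stampa"],
    ["stampante", "offline"],
    ["password", "dimenticata"],
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

Terms that occur in few documents get higher weights, and a term that occurs in every document gets weight zero, which is why TF-IDF vectors tend to work better than raw counts as model input.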

     

    Scott

     

  • AB9200 Member Posts: 2 Contributor I

    Hi @sgenzer,

     

    thank you so much for the reply. That is exactly what I would like to do, but I am looking for a way to solve the problem without a "training set". Is that possible? Maybe by doing some clustering, calculating document similarity, or using topic modeling... I don't know exactly.
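To make the document similarity idea concrete, here is a minimal Python sketch of retrieval without a training set: each question becomes a vector of raw word counts, and the stored questions are ranked by cosine similarity to the new one. This is a plain illustration rather than a RapidMiner process, it uses raw counts instead of TF-IDF to stay short, and the names `cosine` and `most_similar` and the sample questions are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse {term: count} vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(new_question, known_questions, top_k=2):
    """Return the top_k (score, question) pairs, most similar first."""
    query = Counter(new_question.lower().split())
    scored = [
        (cosine(query, Counter(q.lower().split())), q) for q in known_questions
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical stored questions.
known = [
    "la stampante non stampa",
    "ho dimenticato la password",
    "il computer non si accende",
]
results = most_similar("la stampante non funziona", known, top_k=1)
print(results)
```

The answers attached to the top-ranked questions would then be the suggested solutions. In practice, stopword filtering and a better weighting scheme would likely improve the ranking, since common words such as "la" and "non" contribute to the score here.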

     

    Thank you again.

     

     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,446 Community Manager

    hello @AB9200 - so, to quote Euclid: "There is no royal road to geometry." In other words, sometimes you just need to roll up your sleeves and put in the time to get a good solution.

     

    If you want to look at an unsupervised approach, I would recommend watching my recent webinar on topic analysis using the new LDA operator. I walk you through how to do this step-by-step.


    Scott

     

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 757 Unicorn

    Hi all,

     

    Euclid would have made a good... data scientist!

     

    Regards,

     

    Lionel
