Text Mining - Document Similarity & Clustering

AB9200 · July 2018

Hi everyone

I'm new to this forum, to RapidMiner, and to text mining, so I need your help.

I have a large number of documents (.txt), each containing a specific question about solving a problem and the corresponding answer.

My objective is, given a new question, to identify the closest existing ones (all the questions are in Italian) so that I can suggest a possible solution based on the answers given to those similar questions.

I have downloaded the Text Mining Extension, and I imagine I have to use the "Process Documents from Files" operator (Tokenize, Filter Stopwords (Italian), Transform Cases, Stem...) first and then probably use the "Document to Similarity" and "Clustering" operators.

Could you please give me some hints?

Thanks a lot!

sgenzer · July 2018

Hello @AB9200, welcome to the community.

This sounds like a classic text classification problem. You want the algorithm to take a new question and classify it based on how it has "learned" to classify similar questions before. To do this, you need a "training set" of questions that you have already classified by other means (manually, if need be). Once you have a training set, you can use one of the "Process Documents" operators to generate TF-IDF word vectors and build your ML model on them. There are good resources in our Training section about how to build ML classification models, and many resources on text mining on our YouTube channel.
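For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of "TF-IDF word vectors plus a labelled training set". It is not what the "Process Documents" operators do internally. The stopword list, the sample questions, the labels, and all function names are made up for illustration, and the "model" is just a nearest-neighbour lookup by cosine similarity, with no stemming.

```python
import math
import re
from collections import Counter

# Tiny hypothetical stopword list; a real Italian list would be much longer.
STOPWORDS = {"il", "la", "le", "non", "di", "del", "come", "un", "una", "e", "per"}

def tokenize(text):
    """Lowercase, keep runs of letters, drop stopwords (no stemming in this sketch)."""
    words = re.findall(r"[a-zàèéìòù]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def fit_idf(token_lists):
    """Smoothed inverse document frequency for every term seen in training."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in training are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}

def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(labelled_questions):
    """Build IDF weights and one TF-IDF vector per labelled question."""
    token_lists = [tokenize(question) for question, _ in labelled_questions]
    idf = fit_idf(token_lists)
    vectors = [(tfidf(tokens, idf), label)
               for tokens, (_, label) in zip(token_lists, labelled_questions)]
    return idf, vectors

def classify(question, model):
    """Return the label of the most similar training question, or None if nothing overlaps."""
    idf, vectors = model
    query = tfidf(tokenize(question), idf)
    score, label = max(((cosine(query, vec), lab) for vec, lab in vectors),
                       key=lambda pair: pair[0])
    return label if score > 0 else None

# Hypothetical, manually labelled training set.
TRAINING = [
    ("La stampante non stampa le pagine", "stampante"),
    ("La stampante mostra un errore di carta", "stampante"),
    ("Non riesco ad accedere con la password", "accesso"),
    ("Ho dimenticato la password del mio account", "accesso"),
]

model = train(TRAINING)
print(classify("Come posso reimpostare la password?", model))
```

A real project would swap the nearest-neighbour step for a proper learner and validate it on held-out questions; the point here is only how labelled questions become TF-IDF vectors that a model can compare.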

Scott

AB9200 · July 2018

Hi @sgenzer,

Thank you so much for the reply. That is exactly what I would like to do, but I am looking for a way to solve the problem without using a "training set". Is that possible? Maybe by doing some clustering, calculating document similarity, or using topic modeling... I don't know exactly.

Thank you again.

sgenzer · July 2018

hello @AB9200 - so, to quote Euclid: "There is no royal road to geometry." In other words, sometimes you just need to roll up your sleeves and put in the time to get a good solution.

If you want to look at an unsupervised approach, I would recommend watching my recent webinar on topic analysis using the new LDA operator. I walk you through how to do this step-by-step.
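The document-similarity idea raised earlier in the thread is also unsupervised, and it is easy to sketch outside RapidMiner. The Python below is not LDA and not the RapidMiner operator. It simply ranks stored questions by TF-IDF cosine similarity to a new question and returns their answers. The question/answer pairs and all function names are hypothetical, and stopword filtering and stemming are left out to keep it short.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase and keep runs of letters (including common Italian accented vowels)."""
    return re.findall(r"[a-zàèéìòù]+", text.lower())

def unit_vector(token_list, idf):
    """Unit-length sparse TF-IDF vector; terms without an IDF weight are skipped."""
    weights = {t: c * idf[t] for t, c in Counter(token_list).items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def build_index(documents):
    """documents: list of (question, answer). Only the question text is indexed."""
    token_lists = [tokens(question) for question, _ in documents]
    n_docs = len(documents)
    doc_freq = Counter(t for tl in token_lists for t in set(tl))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return idf, [unit_vector(tl, idf) for tl in token_lists]

def most_similar(query, documents, index, top_k=2):
    """Return up to top_k (score, (question, answer)) pairs, best match first."""
    idf, vectors = index
    q = unit_vector(tokens(query), idf)
    # Dot product of unit vectors equals cosine similarity.
    scored = [(sum(w * v.get(t, 0.0) for t, w in q.items()), i)
              for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [(round(score, 3), documents[i]) for score, i in scored[:top_k] if score > 0]

# Hypothetical question/answer pairs standing in for the .txt files.
QA_PAIRS = [
    ("Come resetto la password del portale?",
     "Usa il link 'Password dimenticata' nella pagina di login."),
    ("La stampante non stampa, cosa faccio?",
     "Controlla la carta e riavvia la stampante."),
    ("Il portale non carica la pagina",
     "Svuota la cache del browser."),
]

index = build_index(QA_PAIRS)
for score, (question, answer) in most_similar("Ho dimenticato la password", QA_PAIRS, index):
    print(score, "|", question, "->", answer)
```

No labels are needed for this, but the quality of the suggestions depends entirely on how well word overlap reflects real similarity between questions, which is where stopword removal, stemming, or topic models can help.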

Scott

lionelderkrikor · July 2018

Hi all,

Euclid would have made a good... data scientist!

Regards,

Lionel

Altair RapidMiner Community
