Topic extraction on RapidMiner

BadBoy20 Member Posts: 5 Contributor II
edited November 2018 in Help
Hello everyone. I am new to RapidMiner.

I've been googling but haven't found a way to do this yet. Is there a way for RapidMiner to detect the topic of a set of documents and extract it? Is there also a way to compute how similar each document is to a specific keyword, that is, how well it matches? If so, could someone explain how, or link me to a relevant thread? Thanks.

Answers

  • DocMusher Member Posts: 333 Unicorn
    Hi,
    Although I am no expert in text mining, your question can be solved by following the usual text-mining workflow, as described for instance at http://vancouverdata.blogspot.be/2010/11/text-analytics-with-rapidminer-loading.html. The topic of a document is related to its tags, if available, or to the keywords you quantify through text mining.
    Cheers
    Sven
  • BadBoy20 Member Posts: 5 Contributor II
    Is there a way to find out how closely a document relates to a certain topic? Let's say I have a document about Shanghai that mentions China a few times. I want to see whether that document relates to China, and how closely.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,

    there are most likely solutions for you inside RapidMiner. I would say there are basically three ways to go:

    - Supervised learning

    If you have documents with a tag (e.g. China), you can go for supervised learning and build a model for each tag that detects that topic. If you have tagged data, I would go this way. The tutorial above should help you with this.

    - Clustering
    If you do not have tagged examples, you can go for clustering, which groups similar documents together. Most likely you want to use either K-Means or K-Medoids for this task. The open questions here are: How many topics are we searching for? How do we interpret the results? And, as with tags, a text might belong to more than one topic (e.g. Hotel and China).

    - Simple similarity
    You can calculate a similarity between two texts using cross distances. This can be helpful in a lot of cases.
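
To make the "simple similarity" option concrete, here is a minimal Python sketch of the idea, not a RapidMiner process. It weights words with TF-IDF and scores each document against a keyword using cosine similarity. The sample headlines, the query, and the helper names (`tokenize`, `tfidf_vectors`, `cosine`) are all made up for illustration, and the smoothed IDF formula is just one common choice.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    # A real pipeline would usually also drop stop words and apply stemming.
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical example data.
docs = [
    "Shanghai port volumes rise as China exports recover",
    "China central bank holds rates, Shanghai stocks gain",
    "Hotel bookings in Paris climb ahead of summer",
]
query = "china"

# Vectorize the documents and the keyword together so they share IDF weights.
vectors = tfidf_vectors(docs + [query])
query_vec = vectors[-1]
scores = [cosine(vec, query_vec) for vec in vectors[:-1]]

for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```

A document that never mentions the keyword scores exactly zero here, which is the main weakness of matching on a single word: a text about Shanghai that never says "China" would be missed unless the query is expanded with related terms.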

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • BadBoy20 Member Posts: 5 Contributor II
    Thank you for that reply. Supervised learning with tags is out of the question because there are no tags. Simple similarity would be way too slow. I think my best bet is to use clustering, which I have prior experience with.

    The analysis I am doing involves downloading thousands of documents from a database via a headline (heading) search. The thing is, just because the heading contains a certain word does not mean the document is about that topic, hence the topic search. My idea is to use RapidMiner to cluster the documents with a suitable value of k and then take the cluster with the most documents as the most topical one. The reasoning is this: if, say, 10,000 documents in a database all have the word "china" in the title, then the cluster whose documents are most closely related probably has something to do with the heading/search term. The documents are financial. From your experience, is this a viable way to interpret the topic of financial documents through clustering? Thank you for your advice.
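
The "cluster, then keep the largest cluster" idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not a RapidMiner process: the five headlines are invented, k is fixed at 2, documents are length-normalized word-count vectors, and the k-means below uses a deterministic farthest-point start instead of the random start most implementations use.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def bag_of_words(docs):
    # Dense, length-normalized word-count vectors over a shared vocabulary.
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [float(counts[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    # Deterministic start: first point, then repeatedly the point that is
    # farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical headlines that all matched a search for "china".
docs = [
    "China bank cuts rates",
    "China bank holds rates",
    "China bank rates outlook",
    "China hotel bookings surge",
    "China hotel bookings slow",
]

labels = kmeans(bag_of_words(docs), k=2)
largest_cluster, size = Counter(labels).most_common(1)[0]
largest_docs = [d for d, lab in zip(docs, labels) if lab == largest_cluster]

print("labels:", labels)
print("largest cluster:", largest_cluster, "with", size, "documents")
for doc in largest_docs:
    print(" ", doc)
```

Note what the toy data shows: the search term itself ("china") appears in every document, so it cannot separate the clusters. The largest cluster is simply the most common sub-topic among the hits, which may or may not be the topic the search was aimed at.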

    Cheers,
    BadBoy20
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi again,

    a small tip: It is often useful to add a supervised-learning feature selection step after your clustering. It tells you which words make a cluster different from the others. I would use a one-vs-all strategy here.
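
As a rough illustration of the one-vs-all idea in Python (not the RapidMiner operators themselves): treat "is in this cluster" as a binary label and score each word by how much more often it occurs inside the cluster than outside. The difference in document frequency used here is a deliberately simple stand-in for a proper feature-weighting method such as information gain or chi-squared, and the headlines, cluster labels, and the function name `distinctive_words` are hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def distinctive_words(docs, labels, cluster, top_n=3):
    # One-vs-all: documents in `cluster` are positives, all others negatives.
    inside = [set(tokenize(d)) for d, lab in zip(docs, labels) if lab == cluster]
    outside = [set(tokenize(d)) for d, lab in zip(docs, labels) if lab != cluster]
    in_df = Counter(term for toks in inside for term in toks)
    out_df = Counter(term for toks in outside for term in toks)
    scores = {}
    for term in set(in_df) | set(out_df):
        share_in = in_df[term] / len(inside) if inside else 0.0
        share_out = out_df[term] / len(outside) if outside else 0.0
        # Positive score: the word is more typical of this cluster than the rest.
        scores[term] = share_in - share_out
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical headlines with assumed cluster assignments from a prior clustering.
docs = [
    "China bank cuts rates",
    "China bank holds rates",
    "China bank rates outlook",
    "China hotel bookings surge",
    "China hotel bookings slow",
]
labels = [0, 0, 0, 1, 1]

for cluster in sorted(set(labels)):
    words = [term for term, _ in distinctive_words(docs, labels, cluster, top_n=2)]
    print(f"cluster {cluster}: {words}")
```

A word that appears in every document, like the search term, gets a score of zero for every cluster, so the ranking surfaces what is specific to each cluster.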

    cheers,
    Martin