
Text Mining - Word Similarity

jansudes Member Posts: 4 Contributor I
edited June 2019 in Help
Hey,

I want to find the similarity between words used in a collection of articles, i.e. which words are used together more often than others. There are tools like Automap and WordStat that can do that, but the first doesn't handle non-English letters (which is important in my case) and the latter is expensive!

I'm trying RM now and I noticed that it has a document similarity operator, but not one at the word level. I gave association rules a shot, but the rules it found didn't make much sense for my articles, e.g. also-->able with probability 0.75.

So I've decided to construct my own similarity model as below:

Process Documents from files ==> Wordlist to Data ==> Data to Similarity ==> Similarity to Data ==> Write Excel

The resulting table includes the similarities between words as I wanted, but every pair is counted twice. For example, the similarity between word #1068 and word #963 appears twice, like this:

FIRST_ID  SECOND_ID  DISTANCE
963       1068       103
1068      963        103

This makes my result table twice as large as it should be, and it complicates the visualisations.
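One possible way to drop the mirrored rows, sketched here in Python outside RapidMiner: treat each pair of IDs as unordered and keep only the first row seen for it. The function name `drop_mirrored_pairs` and the tuple layout are assumptions for illustration, mirroring the FIRST_ID / SECOND_ID / DISTANCE columns above; this is not a RapidMiner operator.

```python
def drop_mirrored_pairs(rows):
    """Keep one row per unordered pair of word IDs.

    rows: iterable of (first_id, second_id, distance) tuples, assumed to
    follow the FIRST_ID / SECOND_ID / DISTANCE layout of the exported table.
    """
    seen = set()
    unique = []
    for first_id, second_id, distance in rows:
        # Order the two IDs so (963, 1068) and (1068, 963) share one key.
        key = (min(first_id, second_id), max(first_id, second_id))
        if key in seen:
            continue
        seen.add(key)
        unique.append((key[0], key[1], distance))
    return unique


rows = [
    (963, 1068, 103),
    (1068, 963, 103),
]
print(drop_mirrored_pairs(rows))  # [(963, 1068, 103)]
```

This relies on the measure being symmetric, as it is in the two rows shown. The same effect can be had by simply keeping only the rows where FIRST_ID is smaller than SECOND_ID.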

I couldn't find a thread about this double counting in the forum, so I could use some help.

Thank you

Answers

  • jansudes Member Posts: 4 Contributor I
    Hey,

    :)

    Well, actually my intention is to find word co-occurrences within a collection of documents. Has anyone done such a project in RapidMiner?


  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    in Process Documents, did you remove stopwords with the Filter Stopwords operator? That will most likely remove frequent words such as "also", "and", "I" etc. and thus clean up your association rules a bit.
    Furthermore, to use FPGrowth and Association Rules you most probably want to use the "binary occurrences" mode for the word vector creation in Process Documents.

    Best regards,
    Marius
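To make the two suggestions above concrete, here is a minimal Python sketch (not a RapidMiner process) of document-level word co-occurrence counting. It assumes whitespace tokenization, a tiny hypothetical stopword list, and binary occurrences, meaning a word counts at most once per document. The names `STOPWORDS` and `cooccurrence_counts` are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list (an assumption, not RapidMiner's built-in list).
STOPWORDS = {"also", "and", "i", "the", "is", "a"}


def cooccurrence_counts(documents, stopwords=STOPWORDS):
    """Count, for each unordered word pair, how many documents contain both words."""
    counts = Counter()
    for text in documents:
        # Binary occurrence: a set records only whether a word appears at all.
        words = {w for w in text.lower().split() if w not in stopwords}
        # Sorting puts every pair in one canonical order, so no mirrored duplicates.
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1
    return counts


docs = [
    "text mining is fun",
    "text mining and word similarity",
    "word similarity is useful",
]
counts = cooccurrence_counts(docs)
print(counts[("mining", "text")])      # 2
print(counts[("similarity", "word")])  # 2
```

Because each pair is stored in sorted order, every pair appears only once in the result. Python strings are Unicode, so non-English letters pass through `lower()` and `split()` unchanged in kind, although real articles would need a proper tokenizer and stopword list for the language in question.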