Text Mining - Documents Similarity (words position)
silviabastos
Member Posts: 2 Learner III
Hello,
I'm looking for a way to get the similarity between documents, but where the word positions are relevant.
I've already implemented the sample with the "Data to Similarity" operator (CosineSimilarity), as in:
https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/How-to-compare-similarity-of-large-number-of-documents/td-p/16002
But I need to take into account the order/position of words, not only frequency or occurrence.
For example:
Example 1: A B C D E F G
Example 2: A X B D Y F G
Example 3: G F E A B C D
Examples 1 and 2 are more similar than Examples 1 and 3, because although Example 3 has exactly the same words as Example 1 (CosineSimilarity = 1), they are in different positions. Example 2 has only two different words (X, Y), plus one word in a different position, but near its original position...
I think it is a difficult problem to explain, and I'm not sure whether RapidMiner can give me a solution.
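To make the problem concrete, here is a minimal Python sketch (just an illustration, not RapidMiner) showing that bag-of-words cosine similarity scores Examples 1 and 3 as identical, because it ignores word order entirely:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity over word-count (bag-of-words) vectors."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

ex1 = "A B C D E F G"
ex2 = "A X B D Y F G"
ex3 = "G F E A B C D"

# Same words in a different order score a perfect 1.0:
print(cosine_similarity(ex1, ex3))  # 1.0
# While the intuitively closer Example 2 scores lower:
print(cosine_similarity(ex1, ex2))  # 5/7 ≈ 0.714
```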
Best regards,
Silvia
Answers
Instead of tokenizing your documents, you may want to take a look at "Data to Similarity", which allows the computation of various types of nominal distances between entities. I am not familiar with all the details of those distance metrics (Dice, Jaccard, Tanimoto, etc.), but it is possible that one or more of them is suitable for your purposes.
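As a quick illustration of one of the metrics mentioned, here is a small Python sketch of Jaccard similarity over token sets (not the RapidMiner operator itself). One caveat worth noting: like cosine, a purely set-based metric still ignores word order, so on the question's toy examples it scores Examples 1 and 3 as identical:

```python
def jaccard(doc_a, doc_b):
    """Jaccard similarity of the token sets: |A ∩ B| / |A ∪ B|.
    Set-based, so word order does not affect the score."""
    a, b = set(doc_a.split()), set(doc_b.split())
    return len(a & b) / len(a | b)

# 5 shared words out of 9 distinct words overall:
print(jaccard("A B C D E F G", "A X B D Y F G"))  # 5/9 ≈ 0.556
# Identical word sets, different order -> still 1.0:
print(jaccard("A B C D E F G", "G F E A B C D"))  # 1.0
```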
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hi @silviabastos
This is a great question. To 'remember' the locations of the key words, you can use "Generate n-Grams" for phrase search, with a maximum term length of 7; of course, it will need more time for text processing.
Suppose you do not have many words in each document, ideally just like the examples shown in your message; we have three documents as simple as
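The idea behind n-grams can be sketched outside RapidMiner too: comparing sets of word bigrams instead of single words makes the score sensitive to adjacency. This is only an illustrative Python sketch, not the Generate n-Grams operator; note that on the toy examples a single inserted word breaks both surrounding bigrams, so it is worth experimenting with different values of n:

```python
def ngrams(doc, n):
    """All contiguous word n-grams of a document, as a set."""
    tokens = doc.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def bigram_jaccard(doc_a, doc_b):
    """Jaccard similarity over word bigrams (order-sensitive)."""
    a, b = ngrams(doc_a, 2), ngrams(doc_b, 2)
    return len(a & b) / len(a | b)

ex1 = "A B C D E F G"
ex2 = "A X B D Y F G"  # insertions break most of ex1's bigrams
ex3 = "G F E A B C D"  # the run "A B C D" keeps three bigrams intact

print(bigram_jaccard(ex1, ex2))  # 1/11 ≈ 0.09
print(bigram_jaccard(ex1, ex3))  # 3/9 ≈ 0.33
```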
You can use the Levenshtein distance offered in Dr. Martin Schmitz's Operator Toolbox extension. https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_operator_toolbox
The Levenshtein distance is calculated as the number of changes needed to convert one string into the other. A common use case for this distance is spell checking.
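The word-level version of this distance can be sketched in a few lines of Python (an illustration of the metric itself, not the Operator Toolbox implementation). On the toy examples from the question it behaves exactly as desired, scoring Example 2 closer to Example 1 than Example 3 is:

```python
def levenshtein(a, b):
    """Edit distance between two token sequences: the minimum number of
    insertions, deletions, and substitutions to turn a into b
    (Wagner-Fischer dynamic programming, one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,        # delete x
                            curr[j - 1] + 1,    # insert y
                            prev[j - 1] + cost  # substitute (or match)
                            ))
        prev = curr
    return prev[-1]

ex1 = "A B C D E F G".split()
ex2 = "A X B D Y F G".split()
ex3 = "G F E A B C D".split()

print(levenshtein(ex1, ex2))  # 3  (closer, as intuition suggests)
print(levenshtein(ex1, ex3))  # 6  (same words, but heavily reordered)
```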
Here is the xml of my process. HTH!
YY
Hi!
I will try both options.
Regarding @yyhuang's solution: I only wrote a small example in the first post; the texts I'm working with are natural language, about 900 words each, so I'm not sure if I can use it.
Regarding @Telcontar120's solution: I made a first attempt, but I didn't get consistent results.
I will work a little more on this and post any problems I find.
Any other solutions are welcome.
Thank you.
Hi @silviabastos
Thanks for the follow-up! Maybe you can try word2vec for documents with 900+ words?
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Trained on a single corpus, the word2vec algorithm will generate one multidimensional vector for each word. These vectors are known to capture semantic meaning, which helps you understand the position and context of each word.
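Once you have word vectors, a common simple way to get a document-level similarity is to average the word vectors of each document and compare documents by cosine similarity. The tiny 2-d "vectors" below are hand-made stand-ins, not real word2vec output, and the words themselves are hypothetical; note also that averaging discards word order, so this is only one possible aggregation:

```python
from math import sqrt

# Hand-made toy "word vectors" (stand-ins for trained word2vec output;
# real embeddings would come from training on a large corpus).
vectors = {
    "dog": (0.9, 0.1), "puppy": (0.8, 0.2),   # semantically close pair
    "car": (0.1, 0.9), "truck": (0.2, 0.8),   # another close pair
}

def doc_vector(words):
    """Average the word vectors to get one vector per document."""
    known = [vectors[w] for w in words if w in vectors]
    return tuple(sum(component) / len(known) for component in zip(*known))

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

d1 = doc_vector("dog puppy".split())
d2 = doc_vector("car truck".split())

# A document about animals is closer to "puppy" than to the vehicle document:
print(cosine(d1, doc_vector(["puppy"])))
print(cosine(d1, d2))
```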
You can install word2vec extensions from marketplace.
HTH!
YY