Analyze two notepad files

Sneimeh · May 2019

Dear all,
I am very new to RapidMiner.
I have been asked to compare the similarity of the content of two Notepad files using RapidMiner.
I have installed it, placed the two Notepad files on the design page, and added the Data to Similarity operator (under Modeling), but I don't know how to connect the two files to the operator or how to see the results. How can I test the similarity of file1.txt and file2.txt in the design area and view the results?

sgenzer · May 2019

Hi @Sneimeh, I'm sorry no one has chimed in here. Can you please post the Notepad files so we can see what they look like?

jacobcybulski · June 2019

If the similarity is to be measured line by line then this is hard. If it is to be measured based on the number of words appearing in both documents then this is easy. What you can do, which may be hard for a newbie, is to do something like this (also see the attached process file):

Image: https://us.v-cdn.net/6030995/uploads/editor/vi/lod78w32f2zg.png

So, read both documents in (here I simply create them and turn them into a single-attribute example set). Then parse the first document into its binary representation (1 if the word appears and 0 if it does not), and use the word list from the first document as the starting word list for parsing the second document, again in a binary representation. In the second parse you will only detect the presence of words which appeared in the first document, so a 1 means the word appeared in both documents and a 0 means the second document does not contain that word. If you want some measure of similarity, you can add up all the ones, which gives you the number of words that appeared in both. Well, this is a starting point.
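To make the counting idea concrete outside RapidMiner, here is a minimal Python sketch of the same logic. It is not the attached process, only an illustration: the function names `tokenize` and `shared_word_count`, the sample sentences, and the simple lowercase word-splitting rule are all assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def shared_word_count(doc1, doc2):
    """Count how many distinct words of doc1 also occur in doc2.

    The word list is taken from doc1 only, mirroring the idea of
    reusing the first document's word list when parsing the second.
    """
    word_list = sorted(tokenize(doc1))   # word list built from doc1
    words2 = tokenize(doc2)
    # Binary representation of doc2 restricted to doc1's word list:
    # 1 if the word appears in both documents, 0 otherwise.
    presence = {word: int(word in words2) for word in word_list}
    return sum(presence.values()), presence


# Hypothetical sample texts standing in for the two Notepad files.
doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the lazy cat sleeps near the quick dog"

shared, presence = shared_word_count(doc1, doc2)
print(shared)     # number of words that appear in both documents
print(presence)   # per-word 1/0 flags
```

To run this on real files, the two strings could be replaced with the contents of file1.txt and file2.txt read via `open(...).read()`.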

Jacob

Telcontar120 · June 2019

You can also use the Cross Distances operator after putting the documents into example sets, as noted in the first part of @jacobcybulski's solution above. This will indicate whether the documents are the same (the distance will be zero); the higher the value, the farther apart they are. Read the explanation and the tutorial of the Cross Distances operator for more information.
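As a rough illustration of what such a distance means, the sketch below computes a Euclidean distance between binary word-presence vectors in Python. This is an assumption-laden stand-in, not the operator itself: the choice of Euclidean distance, the helper names `words` and `euclidean_distance`, and the sample sentences are picked for this example only, and the operator's own settings may differ.

```python
import math
import re


def words(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def euclidean_distance(doc1, doc2):
    """Euclidean distance between binary word-presence vectors."""
    w1, w2 = words(doc1), words(doc2)
    vocabulary = sorted(w1 | w2)              # every word seen in either text
    v1 = [int(w in w1) for w in vocabulary]   # 1/0 vector for doc1
    v2 = [int(w in w2) for w in vocabulary]   # 1/0 vector for doc2
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


# Hypothetical sample texts standing in for the two Notepad files.
doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the lazy cat sleeps near the quick dog"

same = euclidean_distance(doc1, doc1)        # identical texts give 0.0
different = euclidean_distance(doc1, doc2)   # grows with non-shared words
print(same, different)
```

With binary vectors, the squared distance equals the number of words that occur in exactly one of the two documents, so identical word sets always give zero.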
