[Solved] Comparing Examplesets by rows

maxfax · January 2013

Hi, I have two datasets and I would like to calculate the similarity between row 1 of dataset 1 and row 1 of dataset 2, between row 2 and row 2, and so on. Both datasets contain text data.

Thank you very much for your help!

MariusHelf · January 2013

Hi,

how do you define similarity?

Best regards,
Marius

maxfax · January 2013

Hmm, I want to know how similar one message is to the other. I used the Process Documents operator to calculate TF-IDF values for each row of both example sets. I know there are several ways to compute similarity, such as cosine similarity or Jaccard distance, and I am not sure yet which one to use. But I would like to compare just row 1 of example set 1 with row 1 of example set 2. I noticed that the Cross Distances operator always calculates the similarity for every pair of rows across the two datasets, and that is not what I want. I only want row 1 with row 1, and so on.

I hope that makes it a little clearer. Thank you very much for your help.

Best regards,

Max

maxfax · January 2013

In addition, I would like to extract the examples from the example set that match a certain criterion. For example, just like in an SQL statement, I would like to extract those examples where the column Request equals the column Document.

Is this possible, and if so, how?

Thank you very much!

MariusHelf · January 2013

maxfax wrote:

In addition, I would like to extract the examples from the example set that match a certain criterion. For example, just like in an SQL statement, I would like to extract those examples where the column Request equals the column Document.

Is this possible, and if so, how?

Thank you very much!

You can use the operators Generate Attributes and Filter Examples for that: in Generate Attributes, create a new attribute "indicator" with the formula "if(Request == Document, 1, 0)", and then configure the Filter Examples to use the expression filter with an expression like "indicator = 1".
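For readers who want to see the logic outside RapidMiner, here is a minimal Python sketch of the same two steps. It is only an illustration: the sample rows and the helper names `add_indicator` and `filter_indicator` are made up, and only the column names Request and Document and the attribute name indicator come from this thread.

```python
# Hypothetical rows; only the column names Request and Document come from the thread.
rows = [
    {"Request": "reset password", "Document": "reset password"},
    {"Request": "reset password", "Document": "billing address"},
    {"Request": "invoice copy", "Document": "invoice copy"},
]


def add_indicator(rows):
    """Mimic Generate Attributes: indicator is 1 if Request equals Document, else 0."""
    return [
        {**row, "indicator": 1 if row["Request"] == row["Document"] else 0}
        for row in rows
    ]


def filter_indicator(rows):
    """Mimic Filter Examples: keep only the rows whose indicator is 1."""
    return [row for row in rows if row["indicator"] == 1]


matching = filter_indicator(add_indicator(rows))
for row in matching:
    print(row["Request"], "|", row["Document"])
```

The indicator column is kept in the result, just as the generated attribute stays in the example set after filtering.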

MariusHelf · January 2013

Concerning the comparison, you can use Cross Distances nevertheless and then filter only those rows from its result where request and document are equal, as described above.

The Cross Distances operator expects both example sets to contain the same attributes. For text processing, and the Process Documents operators in particular, that means you have to use the same word list to create both TF-IDF sets. You are probably using one Process Documents operator for the left data set and one for the right data set of your comparison. To use the same word list for both, just connect the word list output of the left Process Documents operator to the word list input of the right Process Documents operator.
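The idea can be sketched in plain Python. This is not how RapidMiner implements it; the sample texts, the simple whitespace tokenizer, the particular IDF formula, and all function names are assumptions made for illustration. The point is that the vocabulary and IDF weights are built once from the left data and reused for the right data, so both sets end up with the same attributes. All pairs are then compared, and only the row i with row i results are kept.

```python
import math
from collections import Counter

# Hypothetical example texts; row i on the left belongs to row i on the right.
left_docs = ["reset my password", "send invoice copy", "change billing address"]
right_docs = ["password reset help", "weather forecast today", "update billing address"]


def tokenize(text):
    return text.lower().split()


def build_word_list(docs):
    """Build the vocabulary and IDF weights from one collection only."""
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    vocab = sorted(doc_freq)
    idf = {word: math.log(n / doc_freq[word]) + 1.0 for word in vocab}
    return vocab, idf


def tfidf_vector(text, vocab, idf):
    """Vectorize with a given word list; words outside the list are ignored."""
    counts = Counter(tokenize(text))
    return [counts[word] * idf[word] for word in vocab]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# The word list comes from the left data and is reused for the right data.
vocab, idf = build_word_list(left_docs)
left = [tfidf_vector(doc, vocab, idf) for doc in left_docs]
right = [tfidf_vector(doc, vocab, idf) for doc in right_docs]

# All pairs, comparable to what Cross Distances produces ...
cross = [[cosine(u, v) for v in right] for u in left]
# ... then keep only the entries where the row numbers match.
row_wise = [cross[i][i] for i in range(len(left))]
print(row_wise)
```

If each side built its own word list instead, the two sets of vectors would have different columns and could not be compared position by position.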

If you have problems, please let me know.

Best regards,
Marius

maxfax · January 2013

Thank you very much for your help.

This is what I did, and it works. I thought I could save some processing time by not calculating all the distances that won't be needed afterwards. An example set with 50k rows takes some time ;-)

Anyway, this will work! Thank you!
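On the processing time: comparing every row with every row for 50k rows means 50,000 × 50,000 = 2.5 billion distances, while the row-by-row comparison needs only 50,000. Outside RapidMiner, the row-by-row variant can be sketched as below. This is a rough illustration under assumptions: the vectors are plain term counts stored as dictionaries rather than real TF-IDF output, and `cosine_sparse` and `row_wise_similarity` are made-up helper names.

```python
import math
from collections import Counter


def cosine_sparse(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def row_wise_similarity(left, right):
    """Compare row i of left only with row i of right: n comparisons, not n * n."""
    if len(left) != len(right):
        raise ValueError("both sets need the same number of rows")
    return [cosine_sparse(u, v) for u, v in zip(left, right)]


# Hypothetical term counts standing in for TF-IDF weights.
left = [Counter("reset my password".split()), Counter("send invoice copy".split())]
right = [Counter("password reset help".split()), Counter("weather forecast today".split())]

sims = row_wise_similarity(left, right)
print(sims)
```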
