[Solved] Comparing Examplesets by rows

maxfax Member Posts: 17 Contributor II
edited November 2018 in Help
Hi, I have two datasets and I would like to calculate the similarity between row 1 of dataset 1 and row 1 of dataset 2, between row 2 and row 2, and so on. Both datasets contain text data.

Thank you very much for your help!

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    How do you define similarity?

    Best regards,
    Marius
  • maxfax Member Posts: 17 Contributor II
    Hmm, I want to know how similar one message is to the other. I used the Process Documents operator to calculate TF-IDF values for each row of both example sets. I know there are several ways to compute similarity, such as cosine similarity or Jaccard distance, and I am not sure yet which to use. But I would like to compare only row 1 of example set 1 with row 1 of example set 2. I noticed that the Cross Distances operator always calculates the similarity for the whole dataset, which is not what I want. I only want row 1 with row 1.

    I hope that makes it a little clearer. Thank you very much for your help.

    Best regards,

    Max
  • maxfax Member Posts: 17 Contributor II
    In addition, I would like to extract the data from the example set that matches a certain criterion. For example, just like in an SQL statement, I would like to extract those examples where the column Request equals the column Document.

    Is this possible, and if so, how?

    Thank you very much!
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    maxfax wrote:

    In addition, I would like to extract the data from the example set that matches a certain criterion. For example, just like in an SQL statement, I would like to extract those examples where the column Request equals the column Document.

    Is this possible, and if so, how?

    Thank you very much!

    You can use the operators Generate Attributes and Filter Examples for that. In Generate Attributes, create a new attribute "indicator" with the formula "if(Request == Document, 1, 0)". Then configure Filter Examples to use the expression filter with an expression like "indicator = 1".
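
For readers who want to see the logic of those two steps outside RapidMiner, here is a minimal Python sketch. It is only an analogue, not the operators themselves: the sample rows and the `id` column are made up for illustration, while the column names Request and Document and the attribute name "indicator" are taken from the post above.

```python
# Hypothetical sample rows; only the column names Request and Document
# come from the thread, the values and the id column are invented.
rows = [
    {"id": 1, "Request": "reset password", "Document": "reset password"},
    {"id": 2, "Request": "open ticket", "Document": "close ticket"},
    {"id": 3, "Request": "billing", "Document": "billing"},
]

# Step 1 (analogue of Generate Attributes):
# add an indicator that is 1 when both columns hold the same value.
for row in rows:
    row["indicator"] = 1 if row["Request"] == row["Document"] else 0

# Step 2 (analogue of Filter Examples):
# keep only the rows whose indicator equals 1.
matching = [row for row in rows if row["indicator"] == 1]

print([row["id"] for row in matching])
```

The indicator column is kept as a separate step here only to mirror the two operators; in plain Python the comparison could go directly into the filter.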
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Concerning the comparison, you can still use Cross Distances and then keep only those rows of its result where request and document are equal, as described above.

    The Cross Distances operator expects both example sets to contain the same attributes. For text processing, and the Process Documents operators in particular, that means you have to use the same word list to create both TF-IDF sets. You probably use one Process Documents operator for the left data set and one for the right data set of your comparison. To use the same word list for the right data set as for the left one, connect the word list output of the left Process Documents operator to the corresponding input of the right Process Documents operator.

    If you have problems, please let me know.

    Best regards,
    Marius
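
The shared word list idea can be illustrated with a small Python sketch. This is an assumption-laden analogue, not RapidMiner's implementation: the function names `build_word_list` and `vectorize`, the whitespace tokenizer, the sample texts, and the smoothed IDF formula are all chosen for illustration, and RapidMiner's exact TF-IDF weighting may differ. The point it shows is that the vocabulary and IDF weights are learned from the left data only and then reused for the right data, so both sides end up with the same attributes.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately simple tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_word_list(docs):
    """Learn the vocabulary and IDF weights from docs only."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    vocab = sorted(doc_freq)
    # Smoothed IDF, one common variant (an assumption, not RapidMiner's formula).
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    return vocab, idf


def vectorize(doc, vocab, idf):
    """Map a document onto the given word list; unknown words are dropped."""
    counts = Counter(tokenize(doc))
    return [counts[w] * idf[w] for w in vocab]


# Hypothetical texts standing in for the left and right example sets.
left_docs = ["printer does not print", "reset my password"]
right_docs = ["printer will not print pages", "password reset link"]

# The word list comes from the left side and is reused for the right side.
vocab, idf = build_word_list(left_docs)
left_vectors = [vectorize(d, vocab, idf) for d in left_docs]
right_vectors = [vectorize(d, vocab, idf) for d in right_docs]

print(vocab)
print(len(left_vectors[0]), len(right_vectors[0]))
```

Words that occur only on the right side (here "will", "pages" and "link") have no column in the shared word list, which is the price of keeping both vector spaces identical.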
  • maxfax Member Posts: 17 Contributor II
    Thank you very much for your help :) This is what I did, and it works. I thought I could save some processing time by not calculating all the distances that won't be needed afterwards. An example set with 50k rows takes some time ;-)

    Anyway, this will work. Thank you!
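
On the processing-time concern: comparing only row i with row i needs one similarity per row, whereas a full cross comparison needs one per pair of rows. If both example sets had 50k rows, that would be 50,000 values instead of 2.5 billion. The thread does not show a RapidMiner operator that does this directly, so the following is only a Python sketch of the row-wise idea. It assumes the TF-IDF vectors have already been exported as equal-length lists of numbers; the function names and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def rowwise_similarity(left_vectors, right_vectors):
    """Compare row i of the left set only with row i of the right set."""
    if len(left_vectors) != len(right_vectors):
        raise ValueError("both sets must have the same number of rows")
    return [cosine_similarity(a, b) for a, b in zip(left_vectors, right_vectors)]


# Hypothetical TF-IDF vectors built on a shared word list.
left_vectors = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]
right_vectors = [[2.0, 0.0, 4.0], [1.0, 0.0, 0.0]]

sims = rowwise_similarity(left_vectors, right_vectors)
print(sims)
```

The first pair points in the same direction and the second pair shares no terms, so the two results sit at the opposite ends of the cosine range. Swapping in a Jaccard measure would only require replacing `cosine_similarity`.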