Data to Similarity

limegreenman900 Member Posts: 26 Contributor II
edited November 2018 in Help

Is there any way to stop RapidMiner from comparing ExampleSet A with B and then B with A again? As I have a few thousand files to compare, I am running into memory issues. Is there a possibility of comparing only A with B and skipping the comparison of B with A when using the Data to Similarity operator with Cosine Similarity or Euclidean Distance?

Thanks in advance.

Regards

Best Answers

  • bhupendra_patil Administrator, Employee, Member Posts: 168 RM Data Scientist
    Solution Accepted

    Hello @limegreenman900

    Actually, the operator documentation does mention that if the A>>B similarity is calculated, then B>>A is skipped, so that part is automatic. But given your memory issues, it may be better to try and run on a bigger machine, or see if the workaround I have attached helps.

    It uses loops and macros. With the Cross Distance operator you can provide a reference set and a request set, and it can calculate similarity as well as distance.

    The idea here is to break the job into smaller steps, save intermediate results, clear memory, and then repeat.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Solution Accepted

    Hi,

    yes, I do. The problem is that you do not transfer the word list from the lower to the upper Process Documents. The word list defines which words (= attributes) to generate (and their normalization). This results in different attributes in the upper and lower stream.

    One thing about Cross Distance is that if an attribute is not present in one data set, it is interpreted as missing. You then have a missing value in your distance equation, which makes the whole sum of the Euclidean distance missing.
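    To see why this yields "?", here is a minimal sketch with made-up vectors (plain NumPy, not RapidMiner internals): one missing attribute poisons the entire distance.

    ```python
    import numpy as np

    # Two documents as word vectors; doc_b lacks the second word attribute,
    # so it shows up as missing (NaN, RapidMiner's "?").
    doc_a = np.array([0.2, 0.5, 0.1])
    doc_b = np.array([0.3, np.nan, 0.4])

    # The single NaN propagates through the sum and the square root.
    dist = np.sqrt(np.sum((doc_a - doc_b) ** 2))
    print(dist)  # nan -- the whole Euclidean distance becomes missing
    ```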

     

    Solution: Connect the lower word list port of Process Documents with the upper Process Documents from Files (2). Then it should be fixed.

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Solution Accepted

    No,

    the word list contains the normalization (the IDF part of TF-IDF). If you plug the word list from your training set into the Process Documents operator for your apply set, everything is fine.
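    The same principle expressed in scikit-learn, as an analogy only (an assumption for illustration, not RapidMiner's implementation): the vocabulary and IDF weights are learned once on the training documents and then reused on the apply set.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer

    train_docs = ["the cat sat on the mat", "the dog ran away"]
    apply_docs = ["a cat ran"]

    # fit() learns the vocabulary and the IDF weights -- the "word list"
    vectorizer = TfidfVectorizer()
    vectorizer.fit(train_docs)

    # transform() reuses both, so the two sets get identical attributes
    train_vectors = vectorizer.transform(train_docs)
    apply_vectors = vectorizer.transform(apply_docs)
    print(train_vectors.shape[1] == apply_vectors.shape[1])  # True
    ```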

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • bhupendra_patil Administrator, Employee, Member Posts: 168 RM Data Scientist
    Solution Accepted

     

    @limegreenman900

    Check out the attached file and see if it helps.

    You will notice that by adding Format Numbers, I can get the columns in a nice order.

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

    are you interested in all distances? If you use Cross Distances instead of Data to Similarity, you can reduce the output to the k nearest/farthest neighbours. That might reduce the memory consumption.
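    To illustrate the memory saving (hypothetical data, and scikit-learn standing in for the Cross Distances operator): keeping only the k nearest neighbours stores n x k values instead of the full n x n distance matrix.

    ```python
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    docs = np.random.rand(5000, 300)  # stand-in for 5000 document vectors

    # The full pairwise output would be 5000 x 5000 distances;
    # this keeps only the 10 nearest per document.
    nn = NearestNeighbors(n_neighbors=10, metric="euclidean").fit(docs)
    distances, indices = nn.kneighbors(docs)
    print(distances.shape)  # (5000, 10)
    ```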

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • limegreenman900 Member Posts: 26 Contributor II

    I know, but Cross Distance only gives me the distance between my reference and my request set, right? I would need all distances between my files, but without duplicates... I had a look at my output and I get similarity coefficients for 1 - 2, 1 - 3, 1 - 4, 2 - 1, 3 - 1, 4 - 1. That's too much; I simply need all similarity coefficients between 1-2, 1-3, 1-4, 2-3, 2-4, 3-4. I am using Data to Similarity -> Similarity to Data -> Write Excel. Any idea?
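    (For reference, the desired output is the upper triangle of the similarity matrix, i.e. each unordered pair exactly once; a quick Python sketch of that pair set:)

    ```python
    from itertools import combinations

    files = [1, 2, 3, 4]
    for a, b in combinations(files, 2):
        print(f"{a}-{b}")  # 1-2, 1-3, 1-4, 2-3, 2-4, 3-4 -- no duplicates
    ```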

  • limegreenman900 Member Posts: 26 Contributor II

    @bhupendra_patil: Thanks for your answer, but I don't think that the operator is only calculating one way; I wrote a reply above to @mschmitz's answer. I'll have a look at your rmp file and give you feedback later on whether it helped me.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Yes it does, but you can add the same data set to both ports.

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • limegreenman900 Member Posts: 26 Contributor II

    You are right. I hadn't thought of doing it that way :smileywink:

  • limegreenman900 Member Posts: 26 Contributor II

    @mschmitz: Any idea why the following process with Cross Distance is not working? I am simply trying to compare some files from "Process Documents" against a single file as reference... I always get "?" as output for my similarity coefficient...

  • limegreenman900 Member Posts: 26 Contributor II

    Ah ok, I understand. Thanks for your help on that!

    I expected that the Cross Distance operator would recognize identical attributes on the upper and lower Process Documents operators and therefore conduct its similarity computation automatically. Good to know that I have to connect both operators to get results! :)

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    You can also enforce having the same attributes with a combination of Data to Weights and Select by Weights. Nevertheless, that does not solve the issue of normalization if you use TF-IDF or term frequency.

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • limegreenman900 Member Posts: 26 Contributor II

    Indeed, I am using Term Frequency as the vector creation method. So do I have to insert a "Normalize" operator after my process, as my results won't be normalized by any operator before that?

  • limegreenman900 Member Posts: 26 Contributor II

    @mschmitz: But only if I use TF-IDF, right? However, I am using only TF, as I am interested in the relative frequency of words in my documents.

    I am still using the process I posted a few posts ago. I tried using Euclidean Distance as the similarity measure; however, I receive negative results (is this even possible, as the Euclidean distance is a square-root function, which can't be negative?). Is there any problem within my process or with the way the operator is calculating my distance value?

    Or is this a matter like in this post: http://community.rapidminer.com/t5/RapidMiner-Studio/SOLVED-negative-distances/m-p/21636
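    A side note on the math: the Euclidean distance itself is non-negative by definition,

    $$d(x, y) = \sqrt{\sum_i (x_i - y_i)^2} \geq 0,$$

    so a negative number cannot be a raw distance. A plausible reading (an assumption here, in line with the linked thread) is that the operator reports a negated distance when a distance measure is used as a similarity, so that larger values mean "more similar".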

  • limegreenman900 Member Posts: 26 Contributor II

    @bhupendra_patil: Sorry for the late reply, but it took a while until I had time to test your process. It works fine; however, would you mind giving me a short explanation of what the operators are doing? I.e., why do we use two similarity operators (Cross Distance and Data to Similarity), and what are the two filters inside the Loop operator doing? I would appreciate a short explanation. :)

  • bhupendra_patil Administrator, Employee, Member Posts: 168 RM Data Scientist

    The "data to similarity(2)" was just to compare final results during test phases. you can safely delete it..

     

    Here is how it works 

    First I extract the total number of rows into a macro called "totalrows".

    Then I loop over each example:

    - In the loop, my first "Filter Example Range" gets me the current row.

    - The second "Filter Example Range" gets me all rows from (current row + 1) to the last row.

    - Then I do the "Cross Distance" between the two sets, i.e. one row against all rows below it.

    - Since you were having out-of-memory issues, you can then store this result and free memory after each loop iteration.

    - In your Store operator you can put the macro name in the path: /my/location/to/similarity%{example}

    Once this process completes you will have n-1 data sets, each smaller than the previous one.

    Then you can basically create another process that does Loop Directory >> Append to combine the similarity scores.

    Basically, this technique is "divide and conquer": as you notice, I get a chance to store and clear memory after every row.

    Let me know if this helps.

    I'll document the process when I get a chance.
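    The same divide-and-conquer loop as a Python sketch (hypothetical file paths and random stand-in data; RapidMiner does this with Loop, macros, and Store):

    ```python
    import numpy as np

    docs = np.random.rand(1000, 300)  # stand-in for the document vectors
    n = len(docs)                     # the "totalrows" macro

    for i in range(n - 1):            # loop over each example
        current = docs[i]             # first Filter Example Range: the current row
        rest = docs[i + 1:]           # second one: rows (i+1) .. n
        # Cross Distance: one row against all rows below it
        dists = np.sqrt(((rest - current) ** 2).sum(axis=1))
        # Store the partial result, then free memory before the next iteration
        np.save(f"partial_similarity_{i:04d}.npy", dists)
        del dists
    ```

    Afterwards the n-1 partial files can be appended into one result, just like the Loop Directory >> Append step described above.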

     

  • limegreenman900 Member Posts: 26 Contributor II

    Sorry, I had a look at the wrong output example set. It seems there is some problem with your process, as I get different results with Cross Distance and Similarity to Data, and the looping stops after processing 5 files?!

    Attached you can see my process.

  • limegreenman900 Member Posts: 26 Contributor II

    @bhupendra_patil: I made some changes to your process so that it suits my needs. Is there any chance to get the output as a matrix that has values only above/below the diagonal? Normally I could do this with Data to Similarity --> Similarity to Data --> "Matrix"; however, when using Cross Distance I don't have a port to connect to the similarity port of Similarity to Data.

    I attached my process; that should make it easier.

  • limegreenman900 Member Posts: 26 Contributor II

    @bhupendra_patil: Perfect! Thanks a lot, this is exactly what I tried to do!
    Could you give me a short one- or two-phrase explanation of what the operators are doing? I tried to understand it, but it seems a little confusing to me (especially the Format Numbers in this context).

  • bhupendra_patil Administrator, Employee, Member Posts: 168 RM Data Scientist

    Glad I could help @limegreenman900

    Your key operator here is Pivot, which does the reshaping of the table. The Format Numbers and Rename operators are just there to make the output cleaner; you can disable them and you will see the result without them.

    As you will notice, Pivot creates a new column for each unique value. However, the columns in the resulting set are ordered alphabetically (since column names are internally strings), hence the order you get is 1, 10, 11, 12, 13, 14, 2, 3, 4, 5.

    By using Format Numbers I converted the numbers to three-digit strings, where 1 becomes 001, 2 becomes 002, ... 11 becomes 011. After this, even with string sorting, the order comes out as 001, 002, 003, ... 014 and so on.
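    The sorting effect in two lines (plain Python, just to illustrate the trick):

    ```python
    cols = [str(i) for i in range(1, 15)]
    print(sorted(cols))                           # ['1', '10', '11', ..., '14', '2', ...]
    print(sorted(f"{int(c):03d}" for c in cols))  # ['001', '002', ..., '014']
    ```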

     

    The last operator just renames the columns, getting rid of the prefix the Pivot operator adds to the column names.

    Hope this is helpful. The Format Numbers step is only a trick to get the columns into the right order; otherwise they would come out in plain string order.

  • limegreenman900 Member Posts: 26 Contributor II

    @bhupendra_patil: To be honest, I didn't get the part about why "Format Numbers" is used. I understand what you wrote, but I ran the process twice, once with and once without this operator, and I get exactly the same results in the same order?

    To get it right: if I have my data in the format of:

    Request Set |  Document |  Value

    Doc1 | Doc2 | 0,5

    Doc1 | Doc3 | 0,4

    Doc1 | Doc4 | 0,7

    ....

    Doc2 | Doc3 | 0,7

    Doc2 | Doc4 | 0,2

    Doc2 | Doc5 | 0,9

    ....

    The Pivot operator first groups my request set into columns and transposes the value to each grouped request set?:

    Doc1 | 0,5 | 0,4 | 0,7

    Doc2 | 0,7 | 0,2 | 0,9

    Thanks again for your time spent on my issue. As I said, it works perfectly; I just like to understand what I am doing, which is most helpful when further developing the process :smileywink:

  • bhupendra_patil Administrator, Employee, Member Posts: 168 RM Data Scientist

    You just got lucky with your data set; you happened to have names that lined up correctly.

    If you run the sample I shared as it is, it will show you why I added those operators.

    Also, it basically creates a new column for each unique value in your "Document" column.
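    In pandas terms, the reshaping looks roughly like this (illustrative only, using the values from the example above; not the RapidMiner operator itself):

    ```python
    import pandas as pd

    df = pd.DataFrame({
        "request":  ["Doc1", "Doc1", "Doc1", "Doc2", "Doc2", "Doc2"],
        "document": ["Doc2", "Doc3", "Doc4", "Doc3", "Doc4", "Doc5"],
        "value":    [0.5, 0.4, 0.7, 0.7, 0.2, 0.9],
    })

    # One row per request, one new column per unique "document" value.
    matrix = df.pivot(index="request", columns="document", values="value")
    print(matrix)
    # document  Doc2  Doc3  Doc4  Doc5
    # request
    # Doc1       0.5   0.4   0.7   NaN
    # Doc2       NaN   0.7   0.2   0.9
    ```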
