Comparing two sets of data

amotley (Member; Posts: 17; Contributor II)
edited November 2018 in Help

Help needed!

I have two sets of data imported from my data repository. Both contain exactly the same attributes. Is there a way to compare these two sets without combining them into one table? I have tried all the join operators, but they all combine the sets into one big table. I am trying to find the entries (rows) that are exactly the same in both sets.

Thank you!!

Answers

  • Telcontar120 (Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member; Posts: 1,635; Unicorn)

    Are you simply trying to eliminate the duplicates between the two datasets? If so, merge all the data into a single dataset using the "Append" operator, and then use the "Remove Duplicates" operator to remove duplicates based on any subset of attributes.
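
    For readers outside RapidMiner, the append-then-deduplicate idea can be sketched in plain Python. This is only an illustration, not the operators' implementation: the rows, the attribute names and the helper `remove_duplicates` are all made up for the example.

```python
# Hypothetical stand-ins for the two example sets: lists of dicts with the same keys.
set_a = [
    {"name": "Ann", "city": "Oslo", "age": 31},
    {"name": "Bob", "city": "Rome", "age": 45},
]
set_b = [
    {"name": "Bob", "city": "Rome", "age": 45},
    {"name": "Cy", "city": "Lima", "age": 28},
]


def remove_duplicates(rows, attributes):
    """Keep the first row seen for each distinct combination of `attributes`."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Step 1: stack both sets into one (the role of "Append").
appended = set_a + set_b

# Step 2: drop repeated rows, comparing on a chosen subset of attributes
# (the role of "Remove Duplicates").
deduplicated = remove_duplicates(appended, ["name", "city", "age"])

print(len(appended), len(deduplicated))
```

    Note that this removes the repeated rows rather than returning them, so it answers "what is the union without duplicates", not "which rows are shared".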

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • bhupendra_patil (Administrator, Employee, Member; Posts: 168; RM Data Scientist)

    Hi amotley,

    What you are trying to do would be straightforward if there were an ID column; I assume you don't have one.

    Would it be easy to get some sort of ID in the original ExampleSet, or to generate a new ID by concatenating existing columns? (Concatenating all columns is also a possibility, but may not be a great idea depending on the type and number of attributes you have.)

    If getting or generating an ID is possible, you can use the Intersect operator to get the common rows.
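
    A rough Python sketch of the "generate an ID by concatenation, then intersect" idea follows. The data, the helper names `make_id` and `intersect_by_id`, and the separator are assumptions for illustration; the sketch does not reproduce how the Intersect operator works internally.

```python
# Separator placed between values so that ("ab", "c") and ("a", "bc") do not
# produce the same ID. Assumption: this character never occurs in the data.
SEP = "\x1f"


def make_id(row, attributes):
    """Build a synthetic ID by concatenating the string form of each attribute."""
    return SEP.join(str(row[a]) for a in attributes)


def intersect_by_id(left, right, attributes):
    """Return the rows of `left` whose synthetic ID also occurs in `right`."""
    right_ids = {make_id(r, attributes) for r in right}
    return [r for r in left if make_id(r, attributes) in right_ids]


# Hypothetical example sets with identical attributes.
set_a = [
    {"name": "Ann", "city": "Oslo", "age": 31},
    {"name": "Bob", "city": "Rome", "age": 45},
]
set_b = [
    {"name": "Bob", "city": "Rome", "age": 45},
    {"name": "Cy", "city": "Lima", "age": 28},
]

common = intersect_by_id(set_a, set_b, ["name", "city", "age"])
print(common)
```

    Because every value is turned into text, the number 1 and the string "1" would yield the same ID here, which is one reason concatenating all columns may not suit every mix of attribute types.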

     

     

  • bhupendra_patil (Administrator, Employee, Member; Posts: 168; RM Data Scientist)

    See the attached example; maybe this will work for you.

    Please note that you will need to feed your two inputs into the Append step. The steps before it in my process just generate dummy data.

    Let me know if this worked for you.

  • land (RapidMiner Certified Analyst, RapidMiner Certified Expert, Member; Posts: 2,531; Unicorn)

    Hi all,

    Our new Jackhammer Extension comes with a few new operators performing set operations, including an "Intersect" operator that can work on any number of attributes and seems to be an exact match for your problem. We will release it on Friday, but you will have to get it from our website until the marketplace can host non-open-source extensions. I will also announce the new extension here on the forum.

    Unbenannt.png

    Here you see two data sets: all rows whose values of the selected attributes occur in both (or all) sets are output on the right, with all attributes of the respective input example set.

    Good news: while the cooler features such as on-disk memory and parallel execution are limited in the demo version, the set operators are free to use.
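
    The behaviour described above (match on selected attributes, return each input's own rows with all attributes) might look like this in plain Python. It is a sketch based only on the description in this post, not the extension's code; `intersect_on` and the sample data are hypothetical.

```python
def intersect_on(attributes, *example_sets):
    """For each input set, return the rows whose values on `attributes`
    occur in every one of the input sets. Rows keep all their attributes."""

    def key(row):
        return tuple(row[a] for a in attributes)

    # Value combinations that are present in all sets.
    common = set.intersection(*[{key(r) for r in rows} for rows in example_sets])
    return [[r for r in rows if key(r) in common] for rows in example_sets]


# Hypothetical data: "age" differs between the sets but is not selected for matching.
set_a = [
    {"name": "Ann", "city": "Oslo", "age": 31},
    {"name": "Bob", "city": "Rome", "age": 45},
]
set_b = [
    {"name": "Bob", "city": "Rome", "age": 46},
    {"name": "Cy", "city": "Lima", "age": 28},
]

matches_a, matches_b = intersect_on(["name", "city"], set_a, set_b)
print(matches_a)
print(matches_b)
```

    Selecting every attribute instead of a subset gives the "exactly the same row in both sets" comparison asked about in the original question.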

     

    Greetings,

      Sebastian

     

  • amotley (Member; Posts: 17; Contributor II)

    Is there an operator you use to remove the IDs that were generated in the previous steps?

  • land (RapidMiner Certified Analyst, RapidMiner Certified Expert, Member; Posts: 2,531; Unicorn)

    Sure, use a Select Attributes operator:

    Select the mode "single", select the ID attribute, and then check both "invert selection" and "include special attributes".
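
    As a loose analogy in Python, an inverted selection keeps every attribute except the named one. The function `select_attributes` and the column name `generated_id` are invented for this sketch.

```python
def select_attributes(rows, names, invert=False):
    """Keep only the attributes in `names`, or everything except them if `invert` is True."""
    return [
        {k: v for k, v in row.items() if (k in names) != invert}
        for row in rows
    ]


# Hypothetical rows that still carry a generated ID column.
rows = [
    {"generated_id": "Bob|Rome|45", "name": "Bob", "city": "Rome", "age": 45},
]

without_id = select_attributes(rows, {"generated_id"}, invert=True)
print(without_id)
```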

     

    Greetings,

      Sebastian

  • amotley (Member; Posts: 17; Contributor II)

    The problem is that I am not trying to pick out matching IDs. I am looking for rows in one table whose values across all attributes are exactly the same as those of a row in the other table, and I then want to pull those rows out.

  • bhupendra_patil (Administrator, Employee, Member; Posts: 168; RM Data Scientist)

    So we were discussing this off-thread and found a problem with my implementation.

    Apparently, duplicates that exist within the same data set were not handled.

    The solution was simple: add Remove Duplicates before the Append.

    Please see the attached new workflow.
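
    The attached workflow is not reproduced in this thread, so the following Python sketch only illustrates why the fix matters, under the assumption that common rows are detected as rows occurring twice after the append. The data and the function `rows_in_both` are hypothetical.

```python
from collections import Counter

# Hypothetical example sets as tuples (name, city, age).
# Note that set_a contains the "Ann" row twice and set_b does not contain it at all.
set_a = [("Ann", "Oslo", 31), ("Ann", "Oslo", 31), ("Bob", "Rome", 45)]
set_b = [("Bob", "Rome", 45), ("Cy", "Lima", 28)]


def rows_in_both(first, second):
    """Deduplicate each set on its own, append them, and keep rows that occur twice."""
    unique_first = list(dict.fromkeys(first))    # "Remove Duplicates" on the first set
    unique_second = list(dict.fromkeys(second))  # "Remove Duplicates" on the second set
    counts = Counter(unique_first + unique_second)  # "Append", then count identical rows
    return [row for row, n in counts.items() if n == 2]


# Without the per-set deduplication, the repeated "Ann" row is wrongly counted twice.
naive_counts = Counter(set_a + set_b)
print(naive_counts[("Ann", "Oslo", 31)])

# With it, only the row that really appears in both sets is returned.
print(rows_in_both(set_a, set_b))
```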
