How to remove duplicate data

jmphillipsjmphillips Member Posts: 18 Contributor II
edited August 2021 in Help
Hello community: I have a problem withe duplicate data, this is an example:


                                                Col 1                                                                                                                             Col 2
  
         39-2021                       49-2021
         49-2021                       39-2021

so I want to remove one of those rows.

Help please, you are my only hope.....

Answers

  • BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi @jmphillips,

    this sounds like a good use case for De-Pivot, removing the duplicates, and Pivot again.

    De-Pivot will put all values into subsequent rows and keep the information about the column name (Col 1, Col 2) in another column. You would then remove the duplicates according to your rules and use Pivot to build the original data structure again. 

    Be careful to remove entire "rows" (original rows, which are multiple rows after the De-Pivot). If you don't do that, you will have missing data upon pivoting again.

    Regards,
    Balázs
  • jmphillipsjmphillips Member Posts: 18 Contributor II
    edited August 2021
    Thanks, Balazs, but I dont know how to use those operators....

    :(

  • jmphillipsjmphillips Member Posts: 18 Contributor II
    FIRST_ID     SECOND_ID SIMILARITY
    24260-2016      24413-2016 1.0
    24413-2016      24260-2016 1.0***
    22055-2016      22056-2016 1.0
    22056-2016      22055-2016       1.0
    22057-2016      22058-2016         1.0
    22057-2016      22060-2016 1.0
    22057-2016      22059-2016 1.0
    22058-2016      22057-2016 1.0
    22058-2016      22060-2016 1.0
    22058-2016      22059-2016 1.0
    22060-2016      22057-2016 1.0
    22060-2016      22058-2016 1.0
    22060-2016      22059-2016 1.0
    22059-2016      22057-2016 1.0
    22059-2016         22058-2016 1.0
    22059-2016     22060-2016 1.0

    If I de pivot all SECOND_ID are goit to a new collum, with the FIRST_ID data, so if I then remove duplicates Im going to loose data, the problem si that I want to eliminate for example *** 24413-2016  24260-2016 because I have 24260-2016 24413-2016 the first and the second Id in the case are the same but inverted.


  • BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi,

    another approach would be duplicating the data with Multiply and then joining with Left Outer join on FIRST_ID = SECOND_ID *and* SECOND_ID = FIRST_ID. You could then decide which matches to remove.

    Regards,
    Balázs
  • jmphillipsjmphillips Member Posts: 18 Contributor II
    Balazs, sorry dont know how to do it...  can you sent me the box and parameter to use?  Please.

    Regards.
  • frederic_gomer_frederic_gomer_ Member Posts: 4 Contributor I
    HI,

    jmphillips You can try this process



    Attach file : How_to_remove_duplicate_data.rmp
  • jmphillipsjmphillips Member Posts: 18 Contributor II
    Thanks Frederic for the answer but, row one is asociated at first and second columm

    the second row is the same values but inverted, so I want to remove the complete second row.

    Regards. 



    FIRST_ID
       





     SECOND_ID






    SIMILARITY
    24260-2016 24413-2016  1.0
    24413-2016 24260-2016  1.0
Sign In or Register to comment.