[SOLVED] Remove duplicates selecting which examples must remain

arthurgouveiaarthurgouveia Member Posts: 5 Contributor II
edited November 2018 in Help
Hello.

I have a large customer dataset with some values "duplicate". Let me try to make myself clear: I have a dataset with over 50 attributes ย of over 200k contracts. One of these attributes is contract_status. Some of these statuses are valid and some are invalid. I've created a boolean attribute named is_valid_status.

I'd like to remove duplicates based on a subset of attributes and keep only the examples where is_valid_status is true.

How can I do it?

Thanks

Answers

  • RalfKlinkenbergRalfKlinkenberg Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member, Unconfirmed, University Professor Posts: 68 RM Founder
    Hello Arthur Gouveia,

    you can use the RapidMiner operator called Filter Examples.

    Cheers,
    Ralf
  • MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Arthur,

    in addition to Filter Examples for filtering on is_valid_status you can use the Remove Duplicates operator. It allows to select which attributes are considered for finding duplicates. You should try the operators first on a smaller subset to get a feeling for their settings.

    Best regards,
    Marius
  • arthurgouveiaarthurgouveia Member Posts: 5 Contributor II
    Thank you! It worked almost perfectly. I can't believe I didn't think about the solution you guys gave me....ย  ::)

    But now I have another problem. I found several contracts with valid statuses and I'd like to remove duplicates but keep only the most recent status. I have an attribute named date_processing that I can use to achieve that but I can't figure how.

    Is there any way to remove duplicates keeping only the most recent data?
  • RalfKlinkenbergRalfKlinkenberg Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member, Unconfirmed, University Professor Posts: 68 RM Founder
    Hello Arthur Gouveia,

    this requires several steps:
    • You can use the RapidMiner operator Aggregate to determine the maximum of date_processing for each client (group by client ID).
    • The operator Rename can be used to rename the new attribute max(date_processing) to max_date.
    • With Join you can add the max_date column to the original data table (select the client ID as ID for both tables).
    • With Filter Examples you keep only the data lines where date_processing >= max_date and you are done.
    Cheers,
    Ralf
  • arthurgouveiaarthurgouveia Member Posts: 5 Contributor II
    It didn't work. Almost everything went ok but I couldn't filter just the max_date. It cannot parse value 'max_date' with date pattern yyyy-MM-dd HH:mm:ss Z

    image

    What is amazing is that when I look to the meta data view the max_date attribute is type date_time. I've tried to change the type using Date to Nominal, Date to Numerical, Numerical to Date, Nominal to Date, Guess Types but all of them either don't list max_date at the attribute list or don't make Filter Example work.
  • arthurgouveiaarthurgouveia Member Posts: 5 Contributor II
    I found a solution! I used the Generate Attributes to create a date_dif attribute using the function date_diff(max_date,date_processing). Then I just had to filter the examples where date_dif=0.

    It's working! Thanks!
Sign In or Register to comment.