RapidMiner

RapidMiner

[SOLVED] Remove duplicates selecting which examples must remain

Contributor II

[SOLVED] Remove duplicates selecting which examples must remain

Hello.

I have a large customer dataset with some values "duplicate". Let me try to make myself clear: I have a dataset with over 50 attributes  of over 200k contracts. One of these attributes is contract_status. Some of these statuses are valid and some are invalid. I've created a boolean attribute named is_valid_status.

I'd like to remove duplicates based on a subset of attributes and keep only the examples where is_valid_status is true.

How can I do it?

Thanks
6 REPLIES
Regular Contributor

Re: Remove duplicates selecting which examples must remain

Hello Arthur Gouveia,

you can use the RapidMiner operator called Filter Examples.

Cheers,
Ralf
Super Contributor

Re: Remove duplicates selecting which examples must remain

Hi Arthur,

in addition to Filter Examples for filtering on is_valid_status you can use the Remove Duplicates operator. It allows to select which attributes are considered for finding duplicates. You should try the operators first on a smaller subset to get a feeling for their settings.

Best regards,
Marius
Contributor II

Re: Remove duplicates selecting which examples must remain

Thank you! It worked almost perfectly. I can't believe I didn't think about the solution you guys gave me....  :Smiley Happy

But now I have another problem. I found several contracts with valid statuses and I'd like to remove duplicates but keep only the most recent status. I have an attribute named date_processing that I can use to achieve that but I can't figure how.

Is there any way to remove duplicates keeping only the most recent data?
Regular Contributor

Re: Remove duplicates selecting which examples must remain

Hello Arthur Gouveia,

this requires several steps:

  • You can use the RapidMiner operator Aggregate to determine the maximum of date_processing for each client (group by client ID).

  • The operator Rename can be used to rename the new attribute max(date_processing) to max_date.

  • With Join you can add the max_date column to the original data table (select the client ID as ID for both tables).

  • With Filter Examples you keep only the data lines where date_processing >= max_date and you are done.



Cheers,
Ralf
Contributor II

Re: Remove duplicates selecting which examples must remain

It didn't work. Almost everything went ok but I couldn't filter just the max_date. It cannot parse value 'max_date' with date pattern yyyy-MM-dd HH:mm:ss Z



What is amazing is that when I look to the meta data view the max_date attribute is type date_time. I've tried to change the type using Date to Nominal, Date to Numerical, Numerical to Date, Nominal to Date, Guess Types but all of them either don't list max_date at the attribute list or don't make Filter Example work.
Contributor II

Re: Remove duplicates selecting which examples must remain

I found a solution! I used the Generate Attributes to create a date_dif attribute using the function date_diff(max_date,date_processing). Then I just had to filter the examples where date_dif=0.

It's working! Thanks!