[SOLVED] Remove duplicates selecting which examples must remain

arthurgouveia · January 2014

Hello.

I have a large customer dataset with some duplicate values. Let me try to make myself clear: I have a dataset with over 50 attributes for over 200k contracts. One of these attributes is contract_status. Some of these statuses are valid and some are invalid. I've created a boolean attribute named is_valid_status.

I'd like to remove duplicates based on a subset of attributes and keep only the examples where is_valid_status is true.

How can I do it?

Thanks

RalfKlinkenberg · January 2014

Hello Arthur Gouveia,

you can use the RapidMiner operator called Filter Examples.

Cheers,
Ralf

MariusHelf · January 2014

Hi Arthur,

in addition to Filter Examples for filtering on is_valid_status, you can use the Remove Duplicates operator. It lets you select which attributes are considered when looking for duplicates. You should try the operators on a smaller subset first to get a feeling for their settings.
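For readers who want to see the logic outside RapidMiner, here is a minimal Python sketch of the same two steps: filter on is_valid_status, then drop duplicates on a chosen subset of attributes. The sample rows and the attribute names contract_id and customer_id are made up for illustration, and the sketch keeps the first occurrence of each duplicate, which is an assumption rather than a statement about how the Remove Duplicates operator behaves.

```python
# Hypothetical sample data: contract_id and customer_id are invented names,
# only contract_status and is_valid_status come from the thread.
rows = [
    {"contract_id": 1, "customer_id": "A", "contract_status": "active", "is_valid_status": True},
    {"contract_id": 1, "customer_id": "A", "contract_status": "cancelled", "is_valid_status": False},
    {"contract_id": 1, "customer_id": "A", "contract_status": "renewed", "is_valid_status": True},
    {"contract_id": 2, "customer_id": "B", "contract_status": "active", "is_valid_status": True},
]


def filter_valid(examples):
    """Keep only the examples whose is_valid_status is true."""
    return [r for r in examples if r["is_valid_status"]]


def remove_duplicates(examples, key_attributes):
    """Drop examples that repeat an already seen combination of key_attributes.

    Assumption: the first occurrence is the one that is kept.
    """
    seen = set()
    unique = []
    for r in examples:
        key = tuple(r[a] for a in key_attributes)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


result = remove_duplicates(filter_valid(rows), ["contract_id", "customer_id"])
for r in result:
    print(r["contract_id"], r["contract_status"])
```

Filtering before deduplicating matters here: if the order were reversed, an invalid example could be the one that survives deduplication and would then be filtered out, leaving nothing for that contract.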

Best regards,
Marius

arthurgouveia · January 2014

Thank you! It worked almost perfectly. I can't believe I didn't think about the solution you guys gave me.... ::)

But now I have another problem. I found several contracts with valid statuses, and I'd like to remove duplicates but keep only the most recent status. I have an attribute named date_processing that I can use to achieve that, but I can't figure out how.

Is there any way to remove duplicates keeping only the most recent data?

RalfKlinkenberg · January 2014

Hello Arthur Gouveia,

this requires several steps:

1. You can use the RapidMiner operator Aggregate to determine the maximum of date_processing for each client (group by client ID).
2. The operator Rename can be used to rename the new attribute max(date_processing) to max_date.
3. With Join you can add the max_date column to the original data table (select the client ID as ID for both tables).
4. With Filter Examples you keep only the data lines where date_processing >= max_date and you are done.
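The steps above can be sketched in plain Python to show what each operator contributes. This is only an illustration of the aggregate, join, and filter logic, not RapidMiner code; the sample rows and the attribute name client_id are assumptions.

```python
from datetime import datetime

# Hypothetical sample data: client_id is an invented attribute name.
rows = [
    {"client_id": "A", "contract_status": "active", "date_processing": datetime(2013, 11, 2, 9, 30)},
    {"client_id": "A", "contract_status": "renewed", "date_processing": datetime(2014, 1, 5, 14, 0)},
    {"client_id": "B", "contract_status": "active", "date_processing": datetime(2013, 12, 20, 8, 15)},
]

# Aggregate + Rename: maximum date_processing per client, stored as max_date.
max_date = {}
for r in rows:
    cid = r["client_id"]
    if cid not in max_date or r["date_processing"] > max_date[cid]:
        max_date[cid] = r["date_processing"]

# Join: attach max_date to every original row, matching on client_id.
joined = [dict(r, max_date=max_date[r["client_id"]]) for r in rows]

# Filter Examples: keep rows where date_processing >= max_date.
latest = [r for r in joined if r["date_processing"] >= r["max_date"]]

for r in latest:
    print(r["client_id"], r["contract_status"], r["date_processing"])
```

If two rows of the same client share the maximum date_processing, both pass the filter, so a final deduplication step may still be needed in that case.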

Cheers,
Ralf

arthurgouveia · January 2014

It didn't work. Almost everything went OK, but I couldn't filter on max_date. The filter fails because it cannot parse value 'max_date' with date pattern yyyy-MM-dd HH:mm:ss Z

What is surprising is that when I look at the meta data view, the max_date attribute has type date_time. I've tried to change the type using Date to Nominal, Date to Numerical, Numerical to Date, Nominal to Date, and Guess Types, but all of them either don't list max_date in the attribute list or don't make Filter Examples work.

arthurgouveia · January 2014

I found a solution! I used Generate Attributes to create a date_dif attribute using the function date_diff(max_date,date_processing). Then I just had to filter the examples where date_dif=0.

It's working! Thanks!
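The workaround can be sketched in Python as well: compute the difference between the two dates as a new attribute, then filter on that attribute being zero, so the filter only compares a number with a constant. The helper date_diff_ms below is a hypothetical stand-in; the unit and sign of RapidMiner's date_diff are not stated in this thread, and the zero check does not depend on either. The sample rows and client_id are made up.

```python
from datetime import datetime

# Hypothetical rows as they would look after the join, with max_date attached.
rows = [
    {"client_id": "A", "date_processing": datetime(2013, 11, 2), "max_date": datetime(2014, 1, 5)},
    {"client_id": "A", "date_processing": datetime(2014, 1, 5), "max_date": datetime(2014, 1, 5)},
    {"client_id": "B", "date_processing": datetime(2013, 12, 20), "max_date": datetime(2013, 12, 20)},
]


def date_diff_ms(a, b):
    """Hypothetical stand-in for date_diff: a minus b in whole milliseconds."""
    return int((a - b).total_seconds() * 1000)


# Generate Attributes: add date_dif to every row.
with_diff = [dict(r, date_dif=date_diff_ms(r["max_date"], r["date_processing"])) for r in rows]

# Filter Examples: keep only the rows where date_dif is 0.
kept = [r for r in with_diff if r["date_dif"] == 0]

for r in kept:
    print(r["client_id"], r["date_processing"])
```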

Altair RapidMiner Community