Removing duplicates does nothing
Hi,
I've just carried out the following experiment: I took an Excel table with a few thousand records and a few hundred attributes. I used the 'Remove Duplicates' option and ended up with 500 fewer rows in my dataset. I don't want to use Excel for this though, so I tried the same in RapidMiner. I saved the worksheet as CSV and loaded it into RapidMiner. I checked manually that there are indeed a number of duplicate rows. Then I used the Remove Duplicates operator in RapidMiner, and no rows were removed.
I was thinking about the cause, and I think it's because the dataset contains missing data in various places (for various examples, and for attributes of various types).
Is there any way to remove duplicates by treating the missing values as 'equally missing'? Or is it a bug somewhere? I haven't been able to figure out a solution so far.
Thanks in advance.
Answers
I have also run into this problem: missing numerical values are never counted as equal. It works with missing nominal values, but not with numerical ones. I have posted a fix for this (and also a faster implementation of this operator) to the bug tracker:
http://bugs.rapid-i.com/show_bug.cgi?id=438
Hopefully it will get into the next bugfix release of RapidMiner.
Best, Zoltan
The fix will be included in the upcoming RapidMiner version, with the slight change that there must be a switch to turn Unknown Equalness on and off. Otherwise the behavior would not be consistent with older process versions.
Nevertheless, you could use a trick to work around this until 5.1 is released:
- Replace missings by a non existing value
- Remove Duplicates
- Declare the value used above as Missing.
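
Outside RapidMiner, the three steps above can be sketched in plain Python. This is only an illustration, not the operator's actual implementation: it assumes missing numerical values are represented as NaN (which never compares equal to itself), and the names `SENTINEL`, `remove_duplicates` and `dedupe_with_missing_equal` are made up for this example.

```python
import math

# Hypothetical placeholder; it must not occur anywhere in the real data.
SENTINEL = "__MISSING__"


def is_missing(value):
    """Assumption: a missing numerical value is stored as NaN."""
    return isinstance(value, float) and math.isnan(value)


def remove_duplicates(rows):
    """Keep the first occurrence of each row, comparing cell by cell with ==."""
    kept = []
    for row in rows:
        already_seen = any(
            len(row) == len(other) and all(a == b for a, b in zip(row, other))
            for other in kept
        )
        if not already_seen:
            kept.append(row)
    return kept


def dedupe_with_missing_equal(rows):
    # Step 1: replace missings by a non-existing value.
    replaced = [tuple(SENTINEL if is_missing(v) else v for v in row) for row in rows]
    # Step 2: remove duplicates.
    unique = remove_duplicates(replaced)
    # Step 3: declare the placeholder value as missing again.
    return [tuple(float("nan") if v == SENTINEL else v for v in row) for row in unique]


nan = float("nan")
rows = [(1.0, nan, "a"), (1.0, nan, "a"), (2.0, 3.0, "b")]

naive = remove_duplicates(rows)          # NaN == NaN is False, so nothing is removed
fixed = dedupe_with_missing_equal(rows)  # the two rows with a missing value collapse
print(len(naive), len(fixed))
```

The quadratic comparison loop is kept deliberately simple; it is only meant to show why the placeholder makes the two rows compare equal.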
Greetings,
Sebastian
I was thinking about solving it this way: first read the file without setting the value types (most of my missing values are marked as NULL in the original Excel file), remove the duplicates, save the result, and then load it again, this time setting all the attribute types as needed.
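
Roughly, in Python terms, the idea would look like the sketch below. It is only an illustration with made-up sample data and hypothetical helper names (`read_as_text`, `dedupe_text_rows`, `to_float_or_none`); it assumes the missing marker is the literal text NULL, so two missing cells compare equal as plain strings.

```python
import csv
import io

# Made-up sample data; in practice this would be the CSV saved from Excel.
raw = "id,score,label\n1,NULL,a\n1,NULL,a\n2,3.5,b\n"


def read_as_text(text):
    """Read every cell as a plain string, without assigning value types."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return header, [tuple(row) for row in reader]


def dedupe_text_rows(rows):
    """Drop repeated rows, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def to_float_or_none(cell, missing_marker="NULL"):
    """Assign the numeric type only after deduplication; NULL becomes None."""
    return None if cell == missing_marker else float(cell)


header, rows = read_as_text(raw)
unique = dedupe_text_rows(rows)
typed = [(int(i), to_float_or_none(score), label) for i, score, label in unique]
print(typed)
```
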
Anyway, I'm looking forward to the next release, particularly because:
when I load an Excel file and use the wizard to mark the first row as names, the names are shown in the preview window (which sometimes seems to freeze, though I can still press the Finish button), but I end up with attribute_0, attribute_1, and so on.
Best,
SX
Do you mean that nothing happens when you apply Replace Missing Values to the date attribute? Could it be that the attribute is special but you didn't check "include specials"?
Well, unless you send me a small and executable sample process, I can't say much about this.
Greetings,
Sebastian