
Difference between the results of RapidMiner and Excel when removing duplicates

sshabano Member Posts: 2 Contributor I
edited December 2018 in Help

Good day.
I am new to RM.
I need to remove duplicates from my dataset as a preprocessing step.

My original set has 7621 examples.

I used Excel's "Remove Duplicates" function and got 6830 rows (examples) as a result.

Since I'm running the project in RM, I need to clean my data with its operators. So I used the "Remove Duplicates" operator, chose the "Project name" attribute, and ran the process. As an outcome I got 6854 examples.
My question is: why do the two results differ (6854 via RM vs. 6830 via Excel)?
I have attached my process to this message and would appreciate help with this problem.

Thank you in advance. 

Answers

  • earmijo Member Posts: 270 Unicorn

    Without the dataset, there is no way of knowing for sure. Try running the operator with the option "Include Special Attributes" checked and then unchecked, and compare the results.

  • sshabano Member Posts: 2 Contributor I

    I attached the XML file.

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    The XML file doesn't let us see the actual data, though; it only shows that you retrieve it as part of your process...

    Another thing to test is whether the attribute you are trying to dedupe has any leading or trailing spaces, if it is polynominal. You can run the "Trim" operator in RapidMiner first to make sure there are none.
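
    The gap in question is 24 rows (6854 − 6830). As a rough way to check the whitespace idea outside either tool, the sketch below counts distinct values of a column under different normalizations. The sample values, the function names, and the case-folding variant are all assumptions for illustration; case differences are not raised in this thread and are included only as another candidate worth ruling out.

```python
def distinct_count(values, normalize=lambda s: s):
    """Count distinct values after applying `normalize` to each one."""
    return len({normalize(v) for v in values})


def near_duplicate_groups(values, normalize):
    """Return raw values that differ as written but collapse under `normalize`."""
    groups = {}
    for v in values:
        groups.setdefault(normalize(v), set()).add(v)
    # Keep only keys that more than one distinct raw spelling maps to.
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


# Hypothetical values standing in for the "Project name" column.
project_names = ["Alpha", "Alpha ", " Beta", "Beta", "beta", "Gamma"]

exact = distinct_count(project_names)                      # exact string match
trimmed = distinct_count(project_names, str.strip)         # after trimming spaces
trimmed_folded = distinct_count(                           # trimmed and case-folded
    project_names, lambda s: s.strip().casefold()
)
groups = near_duplicate_groups(project_names, str.strip)

print(exact, trimmed, trimmed_folded)
print(groups)
```

    If the trimmed count on the real column is lower than the exact count, stray spaces account for at least part of the difference, and the groups show which spellings to inspect.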

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts