Duplicate Data but different value in target

k_vishnu772 Member Posts: 34 Contributor I
edited December 2018 in Help

Hi All,

 

I am working with a small data set of 120 rows and 5 features, with a binary target: Valid or Not Valid. Some rows are duplicates in which all the input features are the same but the target value differs, as shown below (this is sample data, not the original data). How will the model treat those rows? Is this ambiguous data? I ran the model and it was not able to classify the Not Valid cases. Only 32 of the 120 cases are Not Valid, and most of them have a duplicate with the same inputs and a Valid result. What should I do?

 

 

Att1    Att2    Att3    Target
F3      G929    P2      Valid
F3      G929    P2      Not Valid
F2      G929    P3      Not Valid
F2      G929    P3      Valid

 

 

Regards,

Vishnu

 

Best Answers

  • Knut-RM Administrator, Employee, Member, University Professor Posts: 111 Administrator
    Solution Accepted

    Given that you have both Valid and Not Valid flags for the same combination of attribute values, how can you expect the model to learn and consequently identify those cases?

    The model needs to find patterns in order to make a prediction. If you do not provide a pattern, there is no real result to be expected. You should go through the data and make sure each combination of attribute values has exactly one label. So you want to use Remove Duplicates. You probably need to sort the rows first in order to keep the "right" one (Valid/Not Valid) after the filtering, or you can do it manually given your small data set.
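
For readers working outside the RapidMiner GUI, the sort-then-deduplicate idea above can be sketched in plain Python. This is only an illustration under stated assumptions: the rows are the sample data from the question, and the choice of "Not Valid" as the label to keep (the `PRIORITY` mapping) and the helper name `keep_preferred` are hypothetical, not something the thread prescribes.

```python
# Sample rows from the question: (Att1, Att2, Att3, Target).
rows = [
    ("F3", "G929", "P2", "Valid"),
    ("F3", "G929", "P2", "Not Valid"),
    ("F2", "G929", "P3", "Not Valid"),
    ("F2", "G929", "P3", "Valid"),
]

# Assumption: "Not Valid" is the label to keep when a combination conflicts.
PRIORITY = {"Not Valid": 0, "Valid": 1}


def keep_preferred(rows, priority):
    """Sort so the preferred label comes first, then keep one row per feature combination."""
    ordered = sorted(rows, key=lambda r: (r[:-1], priority[r[-1]]))
    kept = {}
    for *features, target in ordered:
        # The first label seen for a feature combination wins.
        kept.setdefault(tuple(features), target)
    return [key + (target,) for key, target in kept.items()]


deduplicated = keep_preferred(rows, PRIORITY)
```

With this priority, each of the two feature combinations survives once, labelled "Not Valid". Swapping the priorities would keep "Valid" instead, which is exactly the judgment call the answer leaves to the analyst.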

     

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted

    Sorry, this is a community support forum but not an academic research journal!  And I'm an experienced data scientist but not an academic myself--so this type of thinking is actually somewhat mystifying to me.  There is much about current best practice in data science that you would have a hard time finding specific academic references to substantiate.

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Actually, in these ambiguous cases, you might be better off removing BOTH of the conflicting input records. It somewhat depends on the data and the use case, but the consequence of removing only one duplicate and leaving the other in is that you are teaching the model to associate a particular pattern with one particular outcome that is actually ambiguous in real life. If one outcome is much more important to you than another, this may be sensible (e.g., in fraud detection), but for other types of outcomes, this may lead to undesirable results. So if you have a large enough sample and your misclassification costs are somewhat symmetrical, I would recommend omitting them all.
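
The alternative of dropping every record of an ambiguous combination might look like the following plain-Python sketch. The helper name `drop_conflicting` and the third row are hypothetical and added only so that something unambiguous remains after filtering.

```python
from collections import defaultdict


def drop_conflicting(rows):
    """Remove every row whose feature combination appears with more than one target."""
    labels_by_features = defaultdict(set)
    for *features, target in rows:
        labels_by_features[tuple(features)].add(target)
    return [r for r in rows if len(labels_by_features[tuple(r[:-1])]) == 1]


rows = [
    ("F3", "G929", "P2", "Valid"),
    ("F3", "G929", "P2", "Not Valid"),
    ("F1", "G100", "P1", "Valid"),  # hypothetical unambiguous row, for illustration only
]
clean = drop_conflicting(rows)
```

Here both `F3 / G929 / P2` rows are removed, so neither label is taught for that combination, and only the unambiguous row is left.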

  • k_vishnu772 Member Posts: 34 Contributor I

    @Telcontar120 @Knut-RM

     

    Hi All,

     

    I just want to confirm one thing regarding the duplicates. Suppose I have 10 records that are all duplicates, and 9 of them have the target label pass and 1 has fail. If I remove the duplicates, I will end up with 2 records whose input features are all the same but whose targets differ (one pass and one fail), which is ambiguous. If I don't remove the duplicates, am I giving more weight to those 9 records than to the last one? Is that correct?

     

     

    Regards,

    Vishnu

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Correct.  And if you remove all the ambiguous records (per my suggestion) then you are not giving weight to either side.  
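
The 9-versus-1 scenario discussed above can be counted out in a few lines of plain Python. The input values `"A"` and `"B"` are hypothetical placeholders; only the 9 pass / 1 fail split comes from the question.

```python
from collections import Counter

# Ten records with identical inputs (hypothetical values): 9 "pass", 1 "fail".
records = [("A", "B", "pass")] * 9 + [("A", "B", "fail")]

# Keeping every record: label frequencies act as implicit weights.
kept_all = Counter(target for *_, target in records)
share_pass = kept_all["pass"] / len(records)

# Removing exact duplicates (inputs and target identical) leaves one row per label.
unique_rows = list(dict.fromkeys(records))

# Removing every record of an ambiguous combination leaves nothing for it.
labels = {target for *_, target in records}
remaining = [] if len(labels) > 1 else records
```

Keeping everything gives the pass label nine tenths of the records for this combination, removing exact duplicates leaves one pass row and one fail row, and removing the whole ambiguous group leaves no rows for it, so neither side is weighted.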

  • k_vishnu772 Member Posts: 34 Contributor I

    @Telcontar120 Is there any official page or book where the same information is mentioned? My manager asked me to show a proper reference for this explanation.

     

     

    Regards,

    Vishnu
