Is it possible to get 100% for split validation accuracy?

Joannach0ng Member Posts: 7 Learner I
edited July 31 in Help
Is it possible to get 100% split validation accuracy, and what are the pros of getting 100% accuracy? Thank you.

Answers

  • jmergler Administrator, Moderator, Employee, RapidMiner Certified Analyst, Member, University Professor Posts: 14  University Professor
    Hi @Joannach0ng,
    In my opinion, most of the time this would be alarming. For some problems it may be possible, but for most real business problems it is not. A helpful point of reference is to ask, 'If a team of experts were to look closely at the data, how good would their predictions be?' That can sometimes give you an idea of what a good accuracy might be. For some simple problems it may be at or near 100%; for many business problems it won't be anywhere close.

    If you have 100% accuracy, I would check for attributes that are too closely correlated with the outcome; they may contain information that wouldn't be available until after the outcome is observed. There's some more information about correct validation in this course: https://academy.rapidminer.com/learn/course/applications-use-cases-professional/
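
A rough sketch of such a check, written in Python rather than as a RapidMiner process. The data, the attribute names, and the helper `attributes_that_determine_label` are all made up for illustration. The helper flags any attribute whose value alone fixes the label on every row. That is a crude test, and it will also flag unique IDs, so a flagged attribute is only a candidate for a closer look.

```python
def attributes_that_determine_label(rows, label_key):
    """Return attribute names whose value alone fixes the label for every row."""
    suspects = []
    for name in rows[0]:
        if name == label_key:
            continue
        seen = {}  # attribute value -> set of labels observed with that value
        for row in rows:
            seen.setdefault(row[name], set()).add(row[label_key])
        if all(len(labels) == 1 for labels in seen.values()):
            suspects.append(name)
    return suspects

# Hypothetical churn data (an assumption, not from the thread):
# "refund_issued" would only be recorded after the customer has already left.
rows = [
    {"age": 34, "plan": "basic", "refund_issued": 1, "churned": "yes"},
    {"age": 34, "plan": "pro",   "refund_issued": 0, "churned": "no"},
    {"age": 51, "plan": "basic", "refund_issued": 0, "churned": "no"},
    {"age": 51, "plan": "pro",   "refund_issued": 1, "churned": "yes"},
]

suspects = attributes_that_determine_label(rows, "churned")
print(suspects)
```

Whether a flagged attribute really is leakage still depends on when its value becomes known relative to the outcome, which only someone who knows the data can judge.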

    I'd recommend taking a little time to go through the course. Also, if you have come up with 100% accuracy, are you able to share more about the use-case and data, or the process you are using? We might be able to provide better help.
  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 417   Unicorn
    If you come across this problem, check whether you included any IDs in your data source. This happens especially when you are using Decision Trees (or another tree-based algorithm): the tree overfits and the ID becomes the best way to identify a row, so your model isn't useful, because every single row in production will have an unseen ID.
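
A minimal sketch of that failure mode, in Python with made-up data. The `fit_id_lookup` helper is a hypothetical stand-in for a tree that has split on the ID until every leaf holds one row. It is not how RapidMiner's Decision Tree operator is implemented, but the effect on accuracy is the same.

```python
# Toy rows: (row_id, feature, label). The IDs carry no real information,
# but each one is unique, so it identifies its training row exactly.
train = [(1, 0, "yes"), (2, 1, "no"), (3, 0, "no"), (4, 1, "yes")]
new_rows = [(5, 0, "yes"), (6, 1, "no"), (7, 0, "no"), (8, 1, "yes")]

def fit_id_lookup(rows):
    """Memorize ID -> label, as a tree overfit on the ID effectively does."""
    table = {row_id: label for row_id, _, label in rows}
    # For an ID never seen in training, fall back to the most common label
    # (ties are broken alphabetically so the result is deterministic).
    labels = [label for _, _, label in rows]
    default = max(sorted(set(labels)), key=labels.count)
    return lambda row_id: table.get(row_id, default)

def accuracy(model, rows):
    hits = sum(model(row_id) == label for row_id, _, label in rows)
    return hits / len(rows)

model = fit_id_lookup(train)
train_accuracy = accuracy(model, train)    # every training ID is in the table
new_accuracy = accuracy(model, new_rows)   # none of the new IDs is in the table
print(train_accuracy, new_accuracy)
```

On the rows it memorized the model is perfect; on rows with new IDs it can only guess the default label.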

    my 2 cents.
  • Joannach0ng Member Posts: 7 Learner I
    @jmergler Hi, thank you for your reply! Actually, I was told by my tutor to get a 100% accuracy prediction, so I was wondering whether that is possible. I have tried from 0 to 1, but could I get to 100%? Can adding some operator do so? Thanks!
  • Joannach0ng Member Posts: 7 Learner I
    @rfuentealba Hi, thank you for your reply! Actually, I was told by my tutor to get a 100% accuracy prediction, so I was wondering whether that is possible. I have tried from 0 to 1, but could I get to 100% accuracy by adding some operator? Thanks!
  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 280   Unicorn
    Hi @Joannach0ng

    I am taking the risk of being accused by others of teaching you bad things :smiley: but technically you can achieve it if you train and test the model on exactly the same dataset.


    But still, take the other commenters' concerns into account, because this approach:
    • Makes no sense for any real-life machine learning problem.
    • Is a serious mistake from a data science point of view.
    Are you sure this is exactly what your tutor asked for? If so, I suggest studying the problem in question and convincing your tutor that this is the wrong thing to do.
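
To illustrate why testing on the training data inflates accuracy, here is a small Python sketch. It does not reproduce the RapidMiner process from the post: it uses a 1-nearest-neighbour rule (my choice for the example) on made-up data, because that learner makes the effect easy to see. Scored against its own training set, each point's nearest neighbour is the point itself.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x (1-nearest neighbour)."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def accuracy(train, test):
    hits = sum(predict_1nn(train, x) == label for x, label in test)
    return hits / len(test)

# Toy one-attribute data with noisy labels (invented for this example).
data = [(1.0, "a"), (2.0, "b"), (3.0, "a"), (4.0, "b"),
        (5.0, "a"), (6.0, "a"), (7.0, "b"), (8.0, "b")]

# Train and test on exactly the same rows: every point matches itself.
same_data_accuracy = accuracy(data, data)

# A simple split instead: even-indexed rows for training, the rest for testing.
train, test = data[::2], data[1::2]
split_accuracy = accuracy(train, test)

print(same_data_accuracy, split_accuracy)
```

The perfect score in the first case says nothing about new rows; only the split figure is an estimate of how the model would do on data it has not seen.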
  • varunm1 Moderator, Member Posts: 840   Unicorn
    @kypexin Your solution perfectly fits the tutor's requirements :wink:
    Regards,
    Varun
    Rapidminer Wisdom 2020 (User Track): Call for proposals 

    https://www.varunmandalapu.com/
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,256   Unicorn
    I want to echo the many cautions here. In real life, 100% accuracy on any test dataset is almost always an indicator that some leakage is occurring: an ID, or a surrogate for the label that would not really be available at the time of the prediction. It should be viewed very skeptically, not as a realistic goal.

    One possible exception is a small number of examples in the test dataset combined with a large number of attributes in the model, in which case your model can be "over-specified" (basically, too many attributes lead to some unique combination serving as a kind of ID for making the predictions). It can also happen by random chance if you simply have too few examples in the test set (e.g., imagine the reductio of 1 test case, which would then be either 100% or 0% accurate!).
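
The random-chance case can be put into numbers. The sketch below assumes a model whose true accuracy is 0.9 (an arbitrary figure picked for illustration) and test examples whose errors are independent. Under those assumptions the probability of a perfect score on n examples is 0.9 raised to the power n.

```python
def prob_perfect_score(true_accuracy, n_test):
    """Chance of getting all n_test examples right, assuming independent errors."""
    return true_accuracy ** n_test

# Hypothetical model with 90% true accuracy, scored on test sets of growing size.
for n in (1, 5, 20, 100):
    print(n, prob_perfect_score(0.9, n))
```

With a single test case a perfect score is the likely outcome, while with 100 cases it becomes very unlikely, so a 100% result on a tiny test set is weak evidence either way.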

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 417   Unicorn
    Now that you mention it, I had a requirement like this once, years ago, before I was even around here. If you are familiar with logic gates, you know how they work. Otherwise, there is an explanation here.

    The thing is that I had a dataset with some 12 attributes working like this (for the sake of reducing complexity, I'm going to explain with an OR logic gate):

    a1 a2 ax
    0 0 0
    0 1 1
    1 0 1
    1 1 1

    The idea was to build a program that could act like that, because the original program was compiled C with no source available, and the logic controller it ran on needed a replacement. I ended up training a decision tree because I had no clue what the order of the logic gates could be, and the replacement logic controller ended up being an old computer.
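
A small Python sketch of the OR-gate case from the table above. The tree builder here is a deliberately simplified stand-in: it splits on attributes in a fixed order rather than by information gain, and it is not the algorithm used in the original project. It also assumes the table lists every input combination, which is what makes a 100% score legitimate in this setting.

```python
def build_tree(rows, attrs):
    """Grow a tree on 0/1 attributes until every leaf holds a single label."""
    labels = {label for _, label in rows}
    if len(labels) == 1 or not attrs:
        # Pure leaf, or nothing left to split on: predict the majority label.
        all_labels = [label for _, label in rows]
        return max(sorted(labels), key=all_labels.count)
    attr, rest = attrs[0], attrs[1:]
    branches = {}
    for value in (0, 1):
        subset = [(x, y) for x, y in rows if x[attr] == value]
        # If no row has this value, reuse the parent's rows for the branch.
        branches[value] = build_tree(subset or rows, rest)
    return (attr, branches)

def predict(tree, x):
    # Inner nodes are (attribute_index, branches) tuples; leaves are labels.
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[x[attr]]
    return tree

# Full truth table of the OR gate: inputs (a1, a2) -> output ax.
truth_table = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

tree = build_tree(truth_table, attrs=[0, 1])
gate_accuracy = sum(predict(tree, x) == y for x, y in truth_table) / len(truth_table)
print(tree, gate_accuracy)
```

Because the gate is deterministic and every possible input is in the table, there is no unseen data left for the model to get wrong, unlike the business problems discussed earlier in the thread.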

    Not the most elegant solution, but a hell of a win for data science.

    All the best,

    Rodrigo.