Is it possible to get 100% for split validation accuracy ?

Joannach0ng Member Posts: 7 Learner I
edited July 2019 in Help
Is it possible to get 100% split validation accuracy, and what are the pros of getting 100% accuracy? Thank you.

Answers

  • jmergler Administrator, Moderator, Employee, RapidMiner Certified Analyst, Member, University Professor Posts: 19  Maven
    Hi @Joannach0ng,
    In my opinion, most of the time this would be alarming. For some problems it may be possible, but for most real business problems it is not. A helpful point of reference is to ask, 'If a team of experts were to look closely at the data, how good would they be at making their predictions?' That can sometimes give you an idea of what a good accuracy might be. For some simple problems it may be at or near 100%; for many problems in business it won't be anywhere close.

    If you have 100% accuracy, I would check for attributes that are too closely correlated with the outcome; they may contain information that wouldn't be available until after the outcome is observed. There's some more information about correct validation in this course: https://academy.rapidminer.com/learn/course/applications-use-cases-professional/
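
A rough way to screen for such attributes is to measure how well each attribute predicts the label on its own. The sketch below is only an illustration in plain Python, not a RapidMiner process; the churn data, the attribute names, and the 0.99 threshold are all made up for the example.

```python
from collections import Counter, defaultdict

def single_attribute_accuracy(rows, attribute, label):
    """Accuracy of predicting the label from one attribute alone,
    using the majority label observed for each attribute value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[label]] += 1
    # For each attribute value, the majority label is right this many times.
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(rows)

# Hypothetical data: "refund_issued" is only recorded after a customer has left,
# so it would not be available at prediction time.
rows = [
    {"plan": "basic", "refund_issued": "yes", "churned": "yes"},
    {"plan": "basic", "refund_issued": "no", "churned": "no"},
    {"plan": "pro", "refund_issued": "yes", "churned": "yes"},
    {"plan": "pro", "refund_issued": "no", "churned": "no"},
    {"plan": "basic", "refund_issued": "no", "churned": "no"},
    {"plan": "pro", "refund_issued": "yes", "churned": "yes"},
]

scores = {a: single_attribute_accuracy(rows, a, "churned")
          for a in ("plan", "refund_issued")}
# Assumed threshold: anything that is almost perfect by itself deserves a closer look.
suspects = [a for a, s in scores.items() if s >= 0.99]
print(scores)
print(suspects)
```

Here `refund_issued` alone reproduces the label perfectly, so it is flagged. Note that this in-sample check also flags any attribute whose values are all unique, such as a row ID, which is usually worth removing as well.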

    I'd recommend taking a little time to go through the course. Also, if you have come up with 100% accuracy, are you able to share more about the use-case and data, or the process you are using? We might be able to provide better help.
  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 505   Unicorn
    If you come across this problem, check whether you included any IDs in your data source. This happens especially when you are using Decision Trees (or another tree-based algorithm): the tree tries to overfit, and the best way to identify a row becomes the ID. The resulting model isn't useful, because every single row in production will have an unseen ID.

    my 2 cents.
  • Joannach0ng Member Posts: 7 Learner I
    @jmergler Hi, thank you for your reply! Actually, I was told by my tutor to get a prediction with 100% accuracy, so I was wondering whether it is possible. I have tried from 0 to 1, but could I get to 100%? Can adding some operator do so? Thanks!
  • Joannach0ng Member Posts: 7 Learner I
    @rfuentealba Hi, thank you for your reply! Actually, I was told by my tutor to get a prediction with 100% accuracy, so I was wondering whether it is possible. I have tried from 0 to 1, but could I get to 100% accuracy by adding some operator? Thanks!
  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 286   Unicorn
    Hi @Joannach0ng

    I am taking the risk of being accused by others of teaching you bad things :smiley: but technically you can achieve it if you train and test the model on exactly the same dataset.
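
To see why testing on the training data guarantees a perfect score while saying nothing about real performance, here is a minimal sketch in plain Python rather than a RapidMiner process. The "model" simply memorizes every training row, the data is synthetic noise, and the helper names `train_memorizer` and `accuracy` are made up for the illustration.

```python
import random

def train_memorizer(examples):
    """'Train' by storing every (features -> label) pair that was seen."""
    return {features: label for features, label in examples}

def accuracy(model, examples, default="no"):
    """Fraction of examples whose stored (or default) label matches the true label."""
    hits = sum(model.get(features, default) == label for features, label in examples)
    return hits / len(examples)

rng = random.Random(0)
# Synthetic data: unique feature tuples with labels drawn at random,
# so there is nothing real to learn.
data = [((i, rng.random()), rng.choice(["yes", "no"])) for i in range(200)]
train, test = data[:140], data[140:]

model = train_memorizer(train)
print("tested on the training data:", accuracy(model, train))  # always 1.0
print("tested on held-out data:   ", accuracy(model, test))
```

The first number is 1.0 by construction, because every tested row was memorized. The second number is the honest one: with random labels it can only reflect how often the default guess happens to be right.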


    But still, take the other commenters' concerns into account, because this:
    • Makes no sense for any real-life machine learning problem.
    • Is a serious mistake from a data science point of view.
    Are you sure this is exactly what your tutor asked for? If so, I suggest studying the problem in question and convincing your tutor that this is the wrong goal.
  • varunm1 Moderator, Member Posts: 1,185   Unicorn
    @kypexin Your solution perfectly fits the tutor's requirements :wink:
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,368   Unicorn
    I want to echo the many cautions here. In real life, 100% accuracy on any test dataset is almost always an indicator that some leakage is occurring: an ID, or a surrogate for the label that would not really be available at the time of the prediction. It should be viewed very skeptically, not as a realistic goal.

    One possible exception might be if you have a small number of examples in the test dataset but a large number of attributes in the model, in which case your model can be "over-specified" (basically too many attributes will lead to some unique combination serving as a kind of id to make the predictions).  Or if you just have too few examples in the test set altogether (e.g., imagine the reductio of 1 test case, which would then either be 100% accurate or 0%!) this can also happen by random chance.
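
The random-chance case can be made concrete with a little arithmetic. Assuming the test cases are independent and the model's true accuracy is some value p (0.8 below is just an assumed figure), the probability of getting all n test cases right is p to the power n. A quick sketch in Python:

```python
def prob_all_correct(true_accuracy, n_test):
    """Probability that all n_test independent test cases are predicted correctly."""
    return true_accuracy ** n_test

# Assumed true accuracy of 0.8; only the test set size changes.
for n in (1, 5, 20, 100):
    print(n, prob_all_correct(0.8, n))
```

With a single test case a perfect score happens 80% of the time, and with five cases still about a third of the time, while with a hundred cases it is practically impossible. A 100% result on a handful of rows therefore says very little.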

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 505   Unicorn
    Now that you mention it, I had a requirement like this once, years ago, before I was even a member here. If you are familiar with logic gates, you know how they work. Otherwise, there is an explanation here.

    The thing is that I had a dataset with some 12 attributes working like this (for the sake of reducing complexity, I'm going to explain with an OR logic gate):

    a1 a2 ax

    The idea was to build a program that could act like that, because the original program was compiled from C, there was no source code, and the logic controller it ran on needed a replacement. I ended up training a decision tree because I had no clue what the order of the logic gates could be, and the replacement for the logic controller ended up being an old computer.
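
For readers who want to see why 100% is legitimate in a case like this, here is a small sketch in Python. It uses the column names a1, a2, and ax from above, but the two-input size and the lookup-table "model" are simplifications for illustration; the real case used a decision tree on about 12 attributes.

```python
from itertools import product

def or_gate(*inputs):
    """Output of an OR gate: 1 if any input is 1, otherwise 0."""
    return int(any(inputs))

# Full truth table for two inputs, with columns a1, a2, ax.
truth_table = [(a1, a2, or_gate(a1, a2)) for a1, a2 in product((0, 1), repeat=2)]
for row in truth_table:
    print(*row)

# Stand-in for the trained model: a lookup over every possible input combination.
model = {(a1, a2): ax for a1, a2, ax in truth_table}
inputs = list(product((0, 1), repeat=2))
accuracy = sum(model[i] == or_gate(*i) for i in inputs) / len(inputs)
print("accuracy:", accuracy)
```

Because the target is a deterministic function and the data covers every possible input, a perfect score here reflects the problem itself rather than leakage.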

    Not the most elegant solution, but a hell of a win for data science.

    All the best,

    Rodrigo.