Why does Auto Model only use a portion of the available cases?

ramsey_hilton Member Posts: 1 Learner I
edited June 2019 in Help

When I run Auto Model on a data set for a binary classification problem, it shows me the accuracy each classification method achieved while trying to optimize the input parameters. The problem is that the numbers in the results table don't add up to the actual number of cases I loaded into the software. The table below covers 45 cases, but the data set has approximately 224 entries, so there could be far more model evaluations in this table to show how robust the model is when all of the data is taken into account. It only seems to have used about 20% of the available cases. Why is this, and is there anything I can do to change it?

 

(The table doesn't render in this format, but it basically said the model predicted 38 cases as X and got all of them right, and 7 as Y and got all of them right.)

 


Answers

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn

    Hi @ramsey_hilton,

     

    Let's see how to explain this.

     

    Remember when you were a child, and your mom gave you a red apple and a green apple, and you learned that red apples are... red, and green apples are green? Then you saw round red apples and square red apples, and found that these were apples too, just different kinds of apples? Now, how many times did you have to ask whether something was an apple once you had learned? The first apples I mentioned are your training set, and the ones you saw after you learned what an apple looks like are your testing set. Once your algorithm has been trained on a percentage of your data, it should be able to recognize the testing set with some precision.

     

    There are many concepts here:

     

    • Unless you are using machine learning creatively, you should never train your algorithm on all the data you have: always hold back a control group so you can see how accurate your predictions are (see the sketch after this list).
    • If you give your algorithm too little data, it may go "underfitting", meaning it falls short when trying to predict new behaviours. If, on the other hand, your algorithm matches the training data too closely, it may go "overfitting", meaning it won't answer accurately when something is even slightly different.
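    Auto Model does this split for you inside RapidMiner, so there is no code to write, but it may help to see the idea spelled out. Here is a minimal sketch in Python with scikit-learn, purely for illustration (the column names, the toy data, and the 20% hold-out fraction are all assumptions for the example, not what Auto Model literally does internally):

    ```python
    # Minimal hold-out split sketch, assuming a pandas DataFrame with a
    # binary label column. The names "feature_a" and "label" are made up.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Toy data standing in for the ~224 rows mentioned in the question.
    df = pd.DataFrame({
        "feature_a": range(224),
        "label": ["X" if i % 5 else "Y" for i in range(224)],
    })

    # Keep a hold-out set the model never sees during training; the 0.2
    # fraction here roughly matches the ~20% of cases observed in the
    # Auto Model results table.
    train, test = train_test_split(
        df, test_size=0.2, random_state=42, stratify=df["label"]
    )

    print(len(train), "rows for training,", len(test), "rows held out for testing")
    ```

    The hold-out rows are what the accuracy table is computed on, which is why it covers far fewer cases than the full data set.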

    Validating and optimizing your algorithms is a skill you need. Auto Model does that for you: it optimizes the algorithm to give you the best possible accuracy, reported in the form of a confusion matrix, like this one:

     

    (Screenshot: example confusion matrix from Auto Model)

    With this confusion matrix, you can see how good your algorithm is. In this example, my algorithm reaches a bit more than 80% accuracy, which might be good or bad depending on your use case (e.g. if you are doing data science to predict rain, 75% may be good enough, but if you are identifying cancer cells or preventing fraud, you want to strive for 95%+ accuracy).
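    To connect this back to your table: accuracy is just the diagonal of the confusion matrix divided by the total number of test cases. A minimal sketch, again in Python with scikit-learn purely for illustration (the labels and predictions below simply mirror the numbers you reported):

    ```python
    # Reading accuracy off a confusion matrix, using the numbers from the
    # original post: 38 cases predicted X (all correct) and 7 predicted Y
    # (all correct), for 45 test cases in total.
    from sklearn.metrics import confusion_matrix, accuracy_score

    y_true = ["X"] * 38 + ["Y"] * 7
    y_pred = ["X"] * 38 + ["Y"] * 7  # a perfect result, like the table above

    print(confusion_matrix(y_true, y_pred, labels=["X", "Y"]))
    # [[38  0]
    #  [ 0  7]]
    print(accuracy_score(y_true, y_pred))  # 1.0, i.e. 100% on the 45 held-out cases
    ```

    So the 45 cases in your table aren't a bug: they are the held-out test set, and the remaining ~80% of your data was used to train the models.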

     

    Hope this helps,

     
