non-binomial target label column in decision tree to measure accuracy

koknayaya Member Posts: 20 Contributor I
edited June 14 in Help
How can I measure the accuracy of my model if my target label column is not a binomial attribute? It is not a yes/no type; it consists of crime types such as burglary, robbery, fraud, assault, etc.

Best Answers

  • IngoRM Posts: 1,642  RM Founder
    Solution Accepted
    This cannot be said in general and depends on your business problem and how you solve things today.  Think about predicting the outcome of a coin flip.  If you make random guesses, your accuracy would be 50%.  If by using machine learning you can predict the outcome with 51% accuracy, this would be sufficient.  Why?  It does not sound like a good model with only 51% accuracy?  Wrong!  Because you can now start betting against people without this model (who only have 50% accuracy) and will become rich over time :smile:
    It looks like you have multiple classes, let's say 5 for the sake of the argument.  If the classes were equally distributed, a random guess would lead to 20% accuracy or an 80% error rate.  Getting 62% accuracy (or a 38% error rate) might be a fantastic result already - you would have cut your error rate by more than half!  Or not.  Again, without understanding the business problem you want to solve this is impossible to say.
    If, however, you have your 5 classes and one of the classes is the correct class in 62% of all the cases, then a model with 62% accuracy is not very impressive in any case, since always predicting that class (and never anything else) would lead to 62% accuracy already.
    You see there is no easy answer to this, and only you or the owner of the business problem can decide if that is good or not.  But comparing the value to the distribution of the classes is at least a first step to determine if the model learned anything at all or not.
    Hope this helps,
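
The comparison described above can be sketched in a few lines of Python. This is only an illustration: the function names and the label lists are made up for the example and are not part of RapidMiner; the 62% figure is the accuracy discussed in this thread.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def uniform_guess_accuracy(labels):
    """Expected accuracy of guessing uniformly at random among the classes."""
    return 1 / len(set(labels))


def relative_error_reduction(model_accuracy, baseline_accuracy):
    """Fraction of the baseline's error rate that the model removes."""
    baseline_error = 1 - baseline_accuracy
    model_error = 1 - model_accuracy
    return (baseline_error - model_error) / baseline_error


# Hypothetical labels: five equally frequent classes.
balanced = ["a", "b", "c", "d", "e"] * 20
# Hypothetical labels: one class makes up 62% of the examples.
skewed = ["a"] * 62 + ["b"] * 10 + ["c"] * 10 + ["d"] * 10 + ["e"] * 8

model_accuracy = 0.62
print(relative_error_reduction(model_accuracy, majority_baseline_accuracy(balanced)))
print(relative_error_reduction(model_accuracy, majority_baseline_accuracy(skewed)))
```

With the balanced labels the model removes a bit more than half of the baseline error (about 0.525); with the skewed labels the reduction is zero, because always predicting the majority class already reaches 62%.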
  • IngoRM Posts: 1,642  RM Founder
    Solution Accepted
    Again, nobody will be able to tell you if this is good or not.  Only you can decide (or whoever the business owner is).
    A couple of observations though: the classes RAPE and KIDNAPPING are (fortunately!) very rare events.  There is only one "kidnapping" case in the whole test set and I would assume it is extremely rare in the training set as well.  It is very unlikely that any model will ever be able to pick up this pattern if it is that rare.  I would consider removing the class altogether.
    Although the class "rape" is more frequent, the problem here is similar and you again may decide to remove the class from the predictions altogether.  If you do that, you would end up with only four classes ROBBERY, VEHICLE (something), BURGLARY, and DANGEROUS DRUGS.  There is less chance that models are confused if the tiny classes are removed although it will likely not move the needle a lot.  Anyway, every little bit may help.
    Now I would try a couple of different model types (starting with Auto Model first) and see where this gets you.  You can then try to improve the performance of the best model(s) further with additional parameter optimization, feature engineering, or ensemble learning by opening the processes generated by Auto Model as a starting point for those optimizations.
    Finally, out of the roughly 9,000 examples in your test set above, about 4,000 have the true class DANGEROUS DRUGS.  So always predicting this class already delivers roughly 44% accuracy.  A model with 62% accuracy is obviously much better than that, but, again, whether it is good enough depends on the underlying problem and its owners and is not a data science question per se.  Also keep in mind that some prediction errors may be more costly than others.  So accuracy is not the only thing which may be of importance here.
    Welcome to the world of data science - this is where the fun begins now :smiley: 
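
To make the point about unequal error costs concrete, here is a small Python sketch that looks at recall per class and at an average misclassification cost instead of plain accuracy. The labels, predictions, and cost values are invented purely for illustration (they are not the data from this thread), and the helper names are hypothetical.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Share of each true class that the model predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def average_cost(y_true, y_pred, cost, default_cost=1.0):
    """Mean cost per example; correct predictions cost nothing.

    `cost` maps (true_label, predicted_label) pairs to a penalty;
    any other error is charged `default_cost`.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t != p:
            total += cost.get((t, p), default_cost)
    return total / len(y_true)


# Toy data, made up for this example only.
y_true = ["DANGEROUS DRUGS", "DANGEROUS DRUGS", "DANGEROUS DRUGS",
          "BURGLARY", "BURGLARY", "ROBBERY"]
y_pred = ["DANGEROUS DRUGS", "DANGEROUS DRUGS", "BURGLARY",
          "BURGLARY", "DANGEROUS DRUGS", "DANGEROUS DRUGS"]

# Assumed cost: missing a robbery is five times worse than other errors.
cost = {("ROBBERY", "DANGEROUS DRUGS"): 5.0}

print(per_class_recall(y_true, y_pred))
print(average_cost(y_true, y_pred, cost))
```

Here the overall accuracy is 50%, yet the recall for ROBBERY is zero, and the single missed robbery dominates the average cost. Two models with the same accuracy can therefore look very different once costs are taken into account.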


  • koknayaya Member Posts: 20 Contributor I
    Thank you for answering my question! One more thing: is it acceptable if the accuracy of my prediction model is 62%?
  • koknayaya Member Posts: 20 Contributor I
    Thank you for your great answer! However, this is my prediction result :/ The percentages are high only for burglary and dangerous drugs. I want to predict the crime type based on place.
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,213   Unicorn
    One problem is that these classes are not well balanced.  For instance, the kidnapping category is almost completely useless: with only 1 example, it is very unlikely to be picked up by any algorithm.  You might want to consider weighting by class to help the learner.  Another option worth exploring would be combining and consolidating down to fewer categories to start, such as robbery, burglary, drugs, and all other.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
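
Both suggestions above (merging rare categories and weighting by class) can be sketched in plain Python. This is a rough illustration, not RapidMiner's implementation: the helper names, the `min_count` threshold, and the label counts are assumptions made for the example. The weight formula `n / (k * count)` is the common "balanced" heuristic (the same one scikit-learn uses for `class_weight="balanced"`).

```python
from collections import Counter


def consolidate_rare(labels, min_count, other="OTHER"):
    """Replace every class with fewer than `min_count` examples by `other`."""
    counts = Counter(labels)
    return [label if counts[label] >= min_count else other for label in labels]


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_examples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_examples, n_classes = len(labels), len(counts)
    return {label: n_examples / (n_classes * count)
            for label, count in counts.items()}


# Made-up label distribution for illustration only.
labels = ["DANGEROUS DRUGS"] * 6 + ["BURGLARY"] * 3 + ["KIDNAPPING"] * 1

merged = consolidate_rare(labels, min_count=2)
print(Counter(merged))

weights = balanced_class_weights(labels)
print(weights)
```

In this toy example the single KIDNAPPING case is folded into OTHER, and without merging it would receive the largest weight (10/3), which shows why weighting alone cannot rescue a class with only one example.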
  • koknayaya Member Posts: 20 Contributor I
    Thank you for the answer! I'm so excited to learn more!!
  • koknayaya Member Posts: 20 Contributor I
    @IngoRM @lionelderkrikor @Telcontar120 Thank you so much for the amazing answers! They really help!  o:)  <3