Let's say we have a dataset with more than two label values, say three for the sake of simplicity. The label values are unevenly distributed. My question is: what is the best accuracy a random classifier can achieve on such a dataset?
Answers
IngoRM (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor; Posts: 1,751; RM Founder)
Hi,
let's first define what is meant by "random classifier":
Option A: The classifier randomly selects one of the possible label values for each prediction. The predictions may be drawn uniformly or follow a specific distribution; for example, they could be chosen according to the label distribution of the training data.
Option B: The classifier simply always predicts the major class. This is called "Default Learner" in RapidMiner, but I have also heard people call this a random classifier.
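To make the two definitions concrete, here is a minimal sketch in plain Python. The function names and the 60/30/10 toy data are made up for illustration, and this is not how RapidMiner's "Default Learner" is implemented; it only mirrors the two options as described above.

```python
import random
from collections import Counter


def fit_label_distribution(train_labels):
    """Return {label: relative frequency} for the training labels."""
    counts = Counter(train_labels)
    total = len(train_labels)
    return {label: count / total for label, count in counts.items()}


def predict_random(label_probs, n, rng):
    """Option A: draw each prediction at random, weighted by label_probs."""
    labels = list(label_probs)
    weights = [label_probs[label] for label in labels]
    return rng.choices(labels, weights=weights, k=n)


def predict_majority(train_labels, n):
    """Option B: always predict the most frequent training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return [majority_label] * n


def accuracy(true_labels, predictions):
    """Fraction of positions where the prediction equals the true label."""
    hits = sum(t == p for t, p in zip(true_labels, predictions))
    return hits / len(true_labels)


# Hypothetical 3-class data with an uneven label distribution (60/30/10).
train = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
test = ["a"] * 6 + ["b"] * 3 + ["c"] * 1

rng = random.Random(0)  # fixed seed so the run is repeatable
probs = fit_label_distribution(train)
print("Option A:", accuracy(test, predict_random(probs, len(test), rng)))
print("Option B:", accuracy(test, predict_majority(train, len(test))))
```

Passing a uniform dictionary such as `{"a": 1/3, "b": 1/3, "c": 1/3}` to `predict_random` gives the "no specific distribution" variant of Option A.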
For the best accuracy which can be reached I would say:
Option A: 100%. By chance, the classifier can predict all cases correctly. Of course this is less likely as the number of examples grows.
Option B: number of examples in major class / total number of examples.
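To put a number on how unlikely a perfect score is for Option A, assume each prediction is drawn independently of the true label, with probability q_c for class c. The symbols q_c, n_c (number of test examples of class c), k (number of classes) and n (test set size) are introduced here only for illustration.

```latex
P(\text{accuracy} = 100\%) \;=\; \prod_{c=1}^{k} q_c^{\,n_c},
\qquad
\text{and for uniform guessing } \left(q_c = \tfrac{1}{k}\right):\quad
P(\text{accuracy} = 100\%) \;=\; k^{-n},
\quad n = \sum_{c=1}^{k} n_c .
```

With k = 3 classes and only n = 10 test examples, uniform guessing is already perfect with probability 3^(-10), roughly 1.7e-5.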
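Although the best reachable accuracy stays at 100% for Option A, reaching it becomes increasingly unlikely as the test set grows. For larger numbers of test examples, the accuracy settles near its expected value, which is at most the major class fraction (and equals it only if the classifier effectively always predicts the major class).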
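A small simulation can illustrate the large-sample behaviour of Option A. This is a sketch under the assumption that predictions are drawn independently of the true labels; the class probabilities and function names are hypothetical. Under that assumption the expected accuracy is the sum over classes of (label probability × prediction probability), which can never exceed the major class fraction.

```python
import random


def expected_accuracy(label_probs, pred_probs):
    """Expected accuracy when predictions are independent of the true label."""
    return sum(p * pred_probs.get(label, 0.0) for label, p in label_probs.items())


def simulate_accuracy(label_probs, pred_probs, n, rng):
    """Draw n true labels and n independent predictions; return the accuracy."""
    labels = list(label_probs)
    true = rng.choices(labels, weights=[label_probs[l] for l in labels], k=n)
    pred = rng.choices(labels, weights=[pred_probs.get(l, 0.0) for l in labels], k=n)
    return sum(t == p for t, p in zip(true, pred)) / n


# Hypothetical uneven 3-class label distribution.
label_probs = {"a": 0.6, "b": 0.3, "c": 0.1}
uniform = {label: 1 / 3 for label in label_probs}
majority = {"a": 1.0}  # always predicting the major class (Option B)

print("follow label distribution:", expected_accuracy(label_probs, label_probs))
print("uniform guessing:         ", expected_accuracy(label_probs, uniform))
print("always major class:       ", expected_accuracy(label_probs, majority))

# The observed accuracy scatters widely for small n and tightens as n grows.
rng = random.Random(42)
for n in (10, 1_000, 100_000):
    print(n, simulate_accuracy(label_probs, label_probs, n, rng))
```

For this toy distribution the expected values are about 0.46 when following the label distribution, about 0.33 for uniform guessing, and 0.6 when always predicting the major class.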
Cheers,
Ingo
Thanks for posting this intuitive question and giving me a chance to clarify my understanding of random classifiers. May I ask whether the random classifier also tells us anything about the worst performance one can achieve in an n-class problem? Suppose n = 2 for the sake of simplicity and the data is balanced: does a random classifier's performance then tell us that no other classifier can score below 50% on this data? If not, how is it used to assess the quality of a classifier on both balanced and unbalanced data? I hope the question is clear enough to answer; if not, kindly let me know. Thanks!