RapidMiner

Confidence values

Hi friends,

I'm using RapidMiner to do text classification with the SVM (LibSVM), k-NN, and Naive Bayes algorithms. When I get the results on my test data, I'm not sure how each one calculates the confidence values of each instance for each class. Can anyone help me? I need this information for my article.

Thanks in advance.
8 REPLIES
RM Staff

Re: Confidence values

Hi,

this is different for each of those algorithms:

Naive Bayes: the confidence is directly the calculated probability delivered by the algorithm (actually, this is one of the rare cases where the confidence IS a real probability)
k-NN: the confidence is the number of the k neighbors with the predicted class divided by k (the single values are weighted by distance in the case of weighted predictions)
SVM (I am not so sure about LibSVM, which uses another calculation in the multiclass case): for binomial classes, a good estimation of the probability for the positive class, which is also used by RapidMiner, is 1 / (1 + exp(-function_value)), where function_value is the SVM prediction
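In rough Python, the three calculations look like this (just a sketch to illustrate the formulas above, not the actual RapidMiner implementation; all names are illustrative):

import math

# Naive Bayes: the confidence is the class probability itself, normalized over all classes
def naive_bayes_confidence(class_probabilities):
    total = sum(class_probabilities.values())
    return {label: p / total for label, p in class_probabilities.items()}

# k-NN: fraction of the k nearest neighbors that carry the given class
# (with weighted predictions, each neighbor would count with a distance-based weight instead of 1)
def knn_confidence(neighbor_labels, label):
    return sum(1 for l in neighbor_labels if l == label) / len(neighbor_labels)

# SVM, binomial case: squash the decision function value into (0, 1)
def svm_confidence(function_value):
    return 1.0 / (1.0 + math.exp(-function_value))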

Hope that helps,
Ingo

How to load processes in XML from the forum into RapidMiner: Read this!

Re: Confidence values

Thank you very much Ingo!!!

Re: Confidence values

Just one thing... what's the concept of confidence in text classification?


Thanks
RM Staff

Re: Confidence values

Hi,


what's the concept of confidence in text classification?


well, pretty much the same as for all other kinds of classification tasks. The confidence describes how certain a prediction is. Although it is similar to the probability of a prediction of a specific class, it is most often not the same (with the exception of some learners like Naive Bayes).

The same applies to text classification: the confidence of a class value states how certain the model is that a document belongs to that class.

Cheers,
Ingo

How to load processes in XML from the forum into RapidMiner: Read this!

Re: Confidence values

Hi, thank you very much for your help!
I need to clarify some aspects of my project:

I'm using three different methods to classify approximately 3000 documents into 11 categories. The methods are: k-NN, Naive Bayes, and SVM (LibSVM, linear kernel, C-SVC). After submitting the documents for testing, each of the methods generates an output with a confidence (0-1) of the document for each category, and the category chosen is the one with the highest confidence.
What I'm doing is summing the confidences of the document for each category across the 3 models and choosing the label with the highest summed value; I guess this is called bagging, right? Well, the fact is: my accuracy improved by about 2%. I'm still not sure how these confidence values are generated and normalized by RapidMiner in each model, and I need that to support my conclusions. Do I have to normalize the values of each method for them to work together, or can I consider them already normalized, so that my result makes sense?
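To make it concrete, this is roughly what I am doing (a minimal Python sketch; the confidence dictionaries stand for the per-category confidences exported from each trained model, the names are only illustrative):

# per-document confidences from the three models, e.g. {"cat1": 0.7, "cat2": 0.1, ...}
def combine_by_summed_confidence(knn_conf, nb_conf, svm_conf):
    summed = {c: knn_conf[c] + nb_conf[c] + svm_conf[c] for c in knn_conf}
    # the predicted label is the category with the highest summed confidence
    return max(summed, key=summed.get)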

Many thanks in advance!
Newbie jing_ma

Re: Confidence values

Ingo, is there any documentation available to help understand each algorithm's definition of confidence? Thanks!

Jing

RM Staff

Re: Confidence values

Dear Jing,

 

first of all: welcome to the community. There is no documentation on how each of our 250+ learners calculates its confidences. Most of it can be found either in text books or in our code. Is there any specific operator we can help you with?

 

~Martin

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Newbie BenLie

Re: Confidence values

Here, just look at the sample.

Copied from the Help:

Note that in the testing set, the attributes of the first example are Outlook = sunny and Wind = false. Naive Bayes does the calculation for all possible label values and selects the label value that has the maximum calculated probability.


Calculation for label = yes


Find the product of the following:

prior probability of label = yes (i.e. 9/14)
value from the distribution table when Outlook = sunny and label = yes (i.e. 0.223)
value from the distribution table when Wind = false and label = yes (i.e. 0.659)
Thus the answer = 9/14 * 0.223 * 0.659 = 0.094

Calculation for label = no


Find the product of the following:

prior probability of label = no (i.e. 5/14)
value from the distribution table when Outlook = sunny and label = no (i.e. 0.581)
value from the distribution table when Wind = false and label = no (i.e. 0.397)
Thus the answer = 5/14 * 0.581 * 0.397 = 0.082

As the value for label = yes is the maximum of all possible label values, the label is predicted to be yes.

And this is how the confidence is calculated:

 

conf(yes) = 0.094/(0.094+0.082) = 0.534

conf(no) = 0.082/(0.094+0.082) = 0.466

 

Without round-off error you get:

[Attached image: Bayes.PNG, showing the exact confidence values]
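You can reproduce these numbers without intermediate rounding with a few lines of Python (the distribution-table values 0.223, 0.659, 0.581, 0.397 are the ones from the help example above):

# unnormalized scores for both label values
p_yes = 9 / 14 * 0.223 * 0.659   # ~0.0945
p_no = 5 / 14 * 0.581 * 0.397    # ~0.0824

# normalize so that the confidences sum to 1
conf_yes = p_yes / (p_yes + p_no)   # ~0.534
conf_no = p_no / (p_yes + p_no)     # ~0.466
print(conf_yes, conf_no)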