
Community Home : Product Help : RapidMiner Studio Forum : Confidence values


03-19-2012 02:13 PM

Hi friends,

I'm using RapidMiner for text classification with the SVM (LibSVM), k-NN, and Naive Bayes algorithms. When I get the results on my test data, I'm not sure how each one calculates the confidence values of each instance for each class. Can anyone help me? I need this information for my article.

Thanks in advance.


8 REPLIES


03-19-2012 03:54 PM

Hi,

This is different for each of those algorithms:

Naive Bayes: the confidence is directly the calculated probability delivered by the algorithm (actually, this is one of the rare cases where the confidence IS a real probability).

k-NN: the confidence is the number of the k neighbors with the predicted class divided by k (the single values are weighted by distance in the case of weighted predictions).

SVM (I am not so sure about LibSVM, which uses another calculation in the multiclass case): for binomial classes, a good estimate of the probability for the positive class, which is also what RapidMiner uses, is 1 / (1 + exp(-function_value)), where function_value is the raw SVM prediction.

Hope that helps,

Ingo

How to load processes in XML from the forum into RapidMiner: Read this!
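To make the three rules above concrete, here is a minimal Python sketch of how each confidence could be computed. This is an illustration of the formulas as described, not RapidMiner's actual code, and all function names are made up:

```python
import math
from collections import Counter

def knn_confidences(neighbor_labels):
    """Unweighted k-NN: the confidence of a class is the fraction of
    the k nearest neighbors that carry that class label."""
    k = len(neighbor_labels)
    return {label: n / k for label, n in Counter(neighbor_labels).items()}

def svm_confidence(function_value):
    """Binomial SVM: squash the raw decision-function value through a
    logistic sigmoid to estimate the probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-function_value))

def naive_bayes_confidences(joint_scores):
    """Naive Bayes: per-class joint scores (prior times likelihoods),
    normalized to sum to 1, are directly the class probabilities."""
    total = sum(joint_scores.values())
    return {label: s / total for label, s in joint_scores.items()}

print(knn_confidences(["yes", "yes", "no", "yes", "no"]))  # {'yes': 0.6, 'no': 0.4}
print(svm_confidence(0.0))  # 0.5 (example exactly on the decision boundary)
```

Note that only the Naive Bayes output is a calibrated probability; the k-NN fraction and the sigmoid-squashed SVM value are heuristic confidence scores.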



03-19-2012 09:26 PM


03-19-2012 11:52 PM


03-20-2012 05:22 AM

Hi,

Well, pretty much the same as for all other kinds of classification tasks. The confidence describes how certain a prediction is. Although similar to the probability of a prediction of a specific class, it is most often not the same (with the exception of some learners like Naive Bayes).

The same applies to text classification: the confidence of a class value states how certain the model is that a document belongs to that class.

Cheers,

Ingo

How to load processes in XML from the forum into RapidMiner: Read this!

what's the concept of confidence in text classification?


04-03-2012 12:48 PM

Hi, thank you very much for your help!

I need to clarify some aspects of my project:

I'm using three different methods to classify approximately 3000 documents into 11 categories. The methods are: k-NN, Naive Bayes, and SVM (LibSVM, linear kernel, C-SVC). After submitting the test documents, each method generates a confidence value (0-1) for the document on each category, and the chosen category is the one with the highest confidence.

What I'm doing is summing the confidences of the document on each category across the 3 models and choosing the label with the highest total; I guess this is called bagging, right? Well, the fact is: my accuracy improved by about 2%. I'm still not sure how these confidence values are generated and normalized by RapidMiner in each model, and I need to understand that to support my conclusions. Do I have to normalize the values of each method before combining them, or can I consider them already normalized, so that my result makes sense?

Many thanks in advance!
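The combination scheme described above can be sketched as below. Strictly speaking, summing the confidences of different learners is soft voting rather than bagging (bagging trains one learner on bootstrap samples of the data); the category names here are made up for illustration:

```python
def combined_prediction(model_confidences):
    """Soft voting: sum per-class confidences across models and pick
    the class with the highest total."""
    totals = {}
    for conf in model_confidences:
        for label, value in conf.items():
            totals[label] = totals.get(label, 0.0) + value
    return max(totals, key=totals.get), totals

# Hypothetical confidences from k-NN, Naive Bayes, and the SVM
# for one document over two example categories:
knn = {"sports": 0.60, "politics": 0.40}
nb = {"sports": 0.30, "politics": 0.70}
svm = {"sports": 0.55, "politics": 0.45}

label, totals = combined_prediction([knn, nb, svm])
print(label, totals)  # politics {'sports': 1.45, 'politics': 1.55}
```

Note that the chosen label only depends on each model's confidences summing to 1 per document; if one model's scores were on a different scale, it would dominate the vote, which is where the normalization question matters.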



11-14-2016 02:29 PM


11-15-2016 04:47 AM

Dear Jing,

first of all: welcome to the community. There is no documentation on how our 250+ learners calculate confidences. Most of these things can be found either in textbooks or in our code. Is there any specific operator we can help you with?

~Martin

--------------------------------------------------------------------------

Head of Data Science Services at RapidMiner



2 weeks ago

Here, just look at this sample.

Copied from the Help:

Note that in the testing set, the attributes of the first example are Outlook = sunny and Wind = false. Naive Bayes does the calculation for all possible label values and selects the label value that has the maximum calculated probability.

Calculation for label = yes

Find the product of the following:

- the prior probability of label = yes (i.e. 9/14)
- the value from the distribution table when Outlook = sunny and label = yes (i.e. 0.223)
- the value from the distribution table when Wind = false and label = yes (i.e. 0.659)

Thus the answer = 9/14 * 0.223 * 0.659 = 0.094

Calculation for label = no

Find the product of the following:

- the prior probability of label = no (i.e. 5/14)
- the value from the distribution table when Outlook = sunny and label = no (i.e. 0.581)
- the value from the distribution table when Wind = false and label = no (i.e. 0.397)

Thus the answer = 5/14 * 0.581 * 0.397 = 0.082

As the value for label = yes is the maximum over all possible label values, the label is predicted to be yes.

And this is how the confidence is calculated:

conf(yes) = 0.094 / (0.094 + 0.082) = 0.534

conf(no) = 0.082 / (0.094 + 0.082) = 0.466

Without round-off error you get:
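The arithmetic in the worked example can be checked with a short script that computes the two class scores at full precision and then normalizes them into confidences:

```python
# Joint scores for the worked example: prior * likelihood(Outlook) * likelihood(Wind),
# using the values from the distribution table quoted above.
score_yes = 9 / 14 * 0.223 * 0.659
score_no = 5 / 14 * 0.581 * 0.397
total = score_yes + score_no

# Rounded scores match the example: 0.094 and 0.082.
print(round(score_yes, 3), round(score_no, 3))  # 0.094 0.082

# Normalized confidences, matching conf(yes) and conf(no).
print(round(score_yes / total, 3), round(score_no / total, 3))  # 0.534 0.466
```

Because the normalization divides each score by their sum, the two confidences always add up to 1, regardless of rounding in the intermediate steps.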