
# "Which Learning Algorithm to use for probability estimation?"

Member Posts: 60 Contributor II
edited May 2019 in Help
I have several (around 30) attributes that I want to feed into a learning algorithm. The attributes are all numeric. The result I am after is the probability that a single event will or will not happen (I'm only trying to predict the probability of one event, not multiple events / classification). The probability of the event has a non-linear dependence on the attributes. What I mean by this is that sometimes a 70% chance of the event occurring can be inferred from the conditions of several attributes taken as a whole, and sometimes the same 70% chance can be inferred from the condition of one attribute in particular. The example space is huge, so a fast algorithm would be preferred. Can anyone recommend a learning algorithm to use? If it's not part of RM but has an open-source Java library, I'd still consider it.

EDIT/UPDATE: One example of what I am looking for is more commonly known as a probabilistic neural network (link: http://www.statsoft.com/textbook/neural-networks/). The disadvantage of such a network, however, is that the model stores the training data. Does anyone know of a learning algorithm that outputs a probability for each class (in my case only one, maybe 3 eventually) and does not require storing all training examples?
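
To illustrate the storage concern, here is a minimal sketch of a probabilistic-neural-network-style (Parzen window) estimate in plain Python. The function name `pnn_probability`, the Gaussian kernel width `sigma`, and the toy data are assumptions made for illustration; they do not come from RM or from the linked textbook. Note that every prediction loops over all stored training rows.

```python
import math

def pnn_probability(x, train_X, train_y, sigma=1.0):
    """Kernel-based estimate of P(event | x) for labels 0 (no event) and 1 (event).

    Hypothetical helper: the whole training set is needed at prediction time.
    """
    sums = {0: 0.0, 1: 0.0}
    for xi, yi in zip(train_X, train_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        # Each training example contributes a Gaussian kernel to its own class.
        sums[yi] += math.exp(-sq_dist / (2.0 * sigma ** 2))
    total = sums[0] + sums[1]
    # Fall back to 0.5 if x is so far from all examples that every kernel underflows.
    return sums[1] / total if total > 0 else 0.5

# Toy data, invented for illustration only.
train_X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
train_y = [0, 0, 1, 1]

p_near_event = pnn_probability([1.0, 0.9], train_X, train_y, sigma=0.5)
p_near_non_event = pnn_probability([0.0, 0.1], train_X, train_y, sigma=0.5)
```

Prediction cost and memory both grow with the number of training examples, which is the drawback described above.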

• Options
RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
Hi,
you can use Naive Bayes if you want a straightforward probability calculation.
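
As a rough sketch of how that calculation works for numeric attributes, the following plain-Python example fits a Gaussian Naive Bayes model and returns class probabilities. It is not the RM operator; the function names `fit_gaussian_nb` and `predict_proba`, the small variance floor, and the toy data are assumptions for illustration. The fitted model keeps only a prior, a mean, and a variance per attribute and class, so the training examples are not stored.

```python
import math

def fit_gaussian_nb(X, y):
    """Return {label: (prior, means, variances)} estimated from the training rows."""
    model = {}
    n = len(y)
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        columns = list(zip(*rows))
        means = [sum(col) / len(rows) for col in columns]
        # Small floor (an assumption) avoids division by zero for constant attributes.
        variances = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (len(rows) / n, means, variances)
    return model

def predict_proba(model, x):
    """Posterior probability of each class for one example x."""
    log_scores = {}
    for label, (prior, means, variances) in model.items():
        log_p = math.log(prior)
        for v, m, var in zip(x, means, variances):
            # Log of the Gaussian density; attributes are treated as independent.
            log_p += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        log_scores[label] = log_p
    top = max(log_scores.values())  # subtract the max for numerical stability
    exp_scores = {k: math.exp(s - top) for k, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {k: s / total for k, s in exp_scores.items()}

# Toy data, invented for illustration only (1 = event, 0 = no event).
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 4.0], [3.2, 3.8], [2.8, 4.2]]
y = [0, 0, 0, 1, 1, 1]
nb_model = fit_gaussian_nb(X, y)
proba_event_like = predict_proba(nb_model, [3.0, 4.1])
proba_non_event_like = predict_proba(nb_model, [1.0, 2.0])
```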

But I wonder why you have the constraint that the output must be a probability?

Greetings,
Sebastian
• Options
Member Posts: 347 Maven
Hello,

I recommend Logistic Regression since you only have numeric predictors and a binary response variable. It is indeed slower than NaiveBayes, but its output is generally a better approximation of the probability you seek to calculate. NaiveBayes probabilities are not that well calibrated and tend to clump in regions near 0 and 1.
Regarding general model quality (AUC etc.), logistic regression and naive Bayes both perform well.
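
For illustration, a minimal logistic regression can be sketched in plain Python with batch gradient descent. This is not the RM operator or its optimizer; the function names, the learning rate, the epoch count, and the toy data are assumptions. The fitted model consists only of the weights `w` and the intercept `b`, and the predicted probability is the sigmoid of a weighted sum of the attributes.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example drives the gradient.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_probability(w, b, x):
    """Estimated probability that the event occurs for example x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy data with overlapping classes, invented for illustration only.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
p_low = predict_probability(w, b, [0.5])
p_mid = predict_probability(w, b, [2.0])
p_high = predict_probability(w, b, [4.0])
```

Because the classes overlap in this toy data, the predicted probabilities change gradually with the attribute value instead of jumping straight to 0 or 1.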

greetings,

steffen