# Data distribution

Member Posts: 0 Newbie
edited November 2018 in Help
Hi,

I've a general question about data mining.

It is well known that to find a suitable learning algorithm, the distribution
of the data must be known in advance. How is this done in practice? Let's
say I have a dataset consisting of numerical and nominal features and
binary labels. How can I determine its distribution? Can RapidMiner help me
here? :-)

Otherwise, if it is not possible to determine the distribution, how do I find a
good learning algorithm for my data that minimizes the classification error?
By trial and error?

Regards,
Tim

RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
Hi Tim,
the problem with real-world data is: you don't know the underlying distribution. If you did, you wouldn't need to apply any learning algorithm at all.
The task of a learning algorithm is always to try to model this distribution. Naive Bayes attempts this directly by fitting an independent normal distribution to each numerical attribute. A decision tree learner does it by partitioning the space with orthogonal cuts and assigning each resulting subspace one uniform distribution. And so on...
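
To make the Naive Bayes case concrete, here is a minimal sketch in plain Python (not a RapidMiner process). The function names and the toy data are made up for illustration. It assumes purely numerical attributes and estimates one normal distribution per class and attribute, which is the independence assumption described above.

```python
import math
from statistics import mean, pstdev

def fit_gaussian_nb(rows, labels):
    """Estimate a class prior and, per class and attribute, a normal distribution."""
    model = {}
    for cls in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == cls]
        columns = list(zip(*members))
        # A small floor on the standard deviation avoids division by zero.
        params = [(mean(col), max(pstdev(col), 1e-6)) for col in columns]
        model[cls] = (len(members) / len(rows), params)
    return model

def log_normal_pdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and std sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def predict(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    def score(cls):
        prior, params = model[cls]
        return math.log(prior) + sum(
            log_normal_pdf(x, mu, sigma) for x, (mu, sigma) in zip(row, params)
        )
    return max(model, key=score)

# Toy data (hypothetical): two numerical attributes, binary label.
X = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0), (3.0, 4.2), (3.2, 3.9), (2.9, 4.0)]
y = ["neg", "neg", "neg", "pos", "pos", "pos"]

model = fit_gaussian_nb(X, y)
print(predict(model, (1.1, 2.0)))
print(predict(model, (3.1, 4.1)))
```

If the attributes are in fact strongly correlated or far from normal, the fitted model is a poor approximation of the real distribution, and that shows up as classification error.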

So your task on real data is to find the learning algorithm that best approximates the real distribution. This could be done by trial and error, but each learner has its own assumptions. These assumptions are often related and might guide the search for the right algorithm. For example, Linear Regression and SVMs with a linear kernel are both linear models. Rule Learners and Decision Trees both use orthogonal cuts...
But you need deep insight into the statistical methods behind the learners to have this knowledge. Trial and error might be more practical.

And there are many methods within RapidMiner to run the trials of trial and error automatically. XValidation allows you to estimate how well a learner models the underlying distribution. With the OperatorSelector and a ParameterIterator, several learning algorithms can be applied to the same dataset to compare their performance. The ParameterOptimizations are tools to find the best parameters for the learners.
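
Outside of RapidMiner, the same idea can be sketched in a few lines of plain Python. This is only an illustration of the principle, not what the operators above do internally. The helper names, the two toy learners and the synthetic data are all assumptions made for the example. Each learner is evaluated with k-fold cross-validation on the same dataset, and the estimated error rates are then compared.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_error(fit, rows, labels, k=5, seed=0):
    """Estimate the classification error of a learner by k-fold cross-validation."""
    errors = 0
    for test_idx in k_fold_indices(len(rows), k, seed):
        held_out = set(test_idx)
        train_X = [rows[i] for i in range(len(rows)) if i not in held_out]
        train_y = [labels[i] for i in range(len(rows)) if i not in held_out]
        predict = fit(train_X, train_y)  # train on the remaining folds
        errors += sum(predict(rows[i]) != labels[i] for i in test_idx)
    return errors / len(rows)

def fit_majority(rows, labels):
    """Baseline learner: always predict the most frequent training label."""
    majority = max(sorted(set(labels)), key=labels.count)
    return lambda row: majority

def fit_nearest_centroid(rows, labels):
    """Simple learner: predict the class whose mean vector is closest."""
    centroids = {}
    for cls in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = [sum(col) / len(col) for col in zip(*members)]
    def predict(row):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
        )
    return predict

# Synthetic data (hypothetical): two well-separated clusters, binary label.
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
X += [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(20)]
y = [0] * 20 + [1] * 20

learners = {"majority": fit_majority, "nearest_centroid": fit_nearest_centroid}
results = {name: cross_val_error(fit, X, y) for name, fit in learners.items()}
for name, err in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: estimated error {err:.3f}")
```

The learner with the lowest cross-validated error is the one whose assumptions fit this particular dataset best. Tuning a learner's parameters follows the same pattern, with parameter settings taking the place of the learners in the loop.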