
Taking the best probability from several experiments

Lewisham Member Posts: 11 Contributor II
edited November 2018 in Help
Hi everyone,
I'm new to data mining and RapidMiner, and I'm having some difficulty figuring out how to set up an experiment my researcher friend has told me about.

I need to classify records, for which I'm using Nearest Neighbor with nominal values. There are seven possible labels for each record, and each of the 19 attributes in the record is a nominal value. I've been told the data is far too noisy to classify into seven distinct sets in one go, so what I should do is try to classify by running a binary split over each label: "Is Label 1, Is Not Label 1", "Is Label 2, Is Not Label 2"... and then use the result with the highest confidence as the actual label.

e.g. if "Is Label 1" has a confidence of 70% and "Is Label 2" has 90%, I should use Label 2.

I have no idea how to set up this experiment. I don't believe Nearest Neighbor is suitable for this, but I don't know what learner to use. Nor do I know how to set up RapidMiner to run several experiments and choose the best output.

When I asked my friend what I should do, he came back with "I use a very expensive software package with proprietary algorithms, so I'm not sure how you would do it".

Does anyone have any ideas?

Thanks in advance! :)

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi and welcome to RapidMiner,

    what you have is a polynominal (i.e. more than two classes) classification problem. Besides directly using a learning scheme which is capable of working with such a polynominal label, there is also the possibility of dividing the k-class classification problem into k 2-class classification problems in exactly the way you described here (called 1-vs-all in data mining terminology).

    There is a meta learner for this called "Binary2MultiClassLearner" which can be combined with any classification scheme, including nearest neighbors. Just wrap the meta learner around your learning scheme as in the following example:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="target_function" value="gaussian mixture clusters"/>
            <parameter key="number_examples" value="200"/>
            <parameter key="number_of_attributes" value="3"/>
        </operator>
        <operator name="Binary2MultiClassLearner" class="Binary2MultiClassLearner" expanded="yes">
            <operator name="NearestNeighbors" class="NearestNeighbors">
            </operator>
        </operator>
    </operator>

    The meta model will then decide for the label with the highest confidence.

    When I asked my friend what I should do, he came back with "I use a very expensive software package with proprietary algorithms, so I'm not sure how you would do it".
    So now you know what you can answer: "I use a great piece of software called RapidMiner for free - and I can even look inside its algorithms; there is nothing proprietary at all. Your call."  ;)

    All the best,
    Ingo
  • Lewisham Member Posts: 11 Contributor II
    Ingo Mierswa wrote:

    There is a meta learner for this called "Binary2MultiClassLearner" which can be combined with all classification schemes including nearest neighbors.
    This is awesome, thank you!

    When the process finishes, I don't receive any usable output; I just get tabs of "Label 1 vs all other", "Label 2 vs all other" and such, with the text:

    KNNClassification
    KNNClassification (prediction model for label multiclass_working_label)
    When I was using cross-validation, I would get a confusion matrix as output, which was very helpful. What do I wrap the Binary2MultiClassLearner in so that it outputs results? Is it also relevant to combine this with feature selection and cross validation? I've put it back into my previous data flow with feature selection and cross validation, but I don't know if that makes sense or not!

    Thanks ever so much!
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    When the process finishes, I don't receive any usable output; I just get tabs of "Label 1 vs all other", "Label 2 vs all other" and such, with the text:
    There is no real model output for a nearest neighbors learner: it is a lazy model which just stores the training data and performs all calculations at application time. Hence the simple text. Just replace the learner with something different like NaiveBayes or LinearRegression or... and you will get some nicer output.
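
    For instance, exchanging only the inner operator of the example above (just a sketch, the surrounding process stays the same):

    <operator name="Binary2MultiClassLearner" class="Binary2MultiClassLearner" expanded="yes">
        <!-- NaiveBayes produces an explicit model you can inspect, unlike the lazy KNN -->
        <operator name="NaiveBayes" class="NaiveBayes">
        </operator>
    </operator>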

    Is it also relevant to combine this with feature selection and cross validation? I've put it back into my previous data flow with feature selection and cross validation, but I don't know if that makes sense or not!
    Welcome to the world of data mining: there is no definite answer other than "just try it". Usually, for KNN learning, a normalization should be applied beforehand, and feature weighting (or at least feature selection) often improves the performance drastically. Last but not least, you should tune the parameter "k", i.e. the number of neighbors used. Or try a different learning scheme. Or...
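
    Regarding the confusion matrix: you can wrap the Binary2MultiClassLearner into a cross validation just like any other learner. The following is only a rough sketch (untested, so please double-check the operator and parameter names against your RapidMiner version):

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="target_function" value="gaussian mixture clusters"/>
            <parameter key="number_examples" value="200"/>
            <parameter key="number_of_attributes" value="3"/>
        </operator>
        <operator name="XValidation" class="XValidation" expanded="yes">
            <parameter key="number_of_validations" value="10"/>
            <!-- training side: the meta learner around the inner learner -->
            <operator name="Binary2MultiClassLearner" class="Binary2MultiClassLearner" expanded="yes">
                <operator name="NearestNeighbors" class="NearestNeighbors">
                </operator>
            </operator>
            <!-- test side: apply the model and calculate accuracy / the confusion matrix -->
            <operator name="ApplierChain" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier" class="ModelApplier">
                </operator>
                <operator name="ClassificationPerformance" class="ClassificationPerformance">
                    <parameter key="accuracy" value="true"/>
                </operator>
            </operator>
        </operator>
    </operator>

    A FeatureSelection operator could then be wrapped around the XValidation in the same way if you want to combine both.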

    Have fun. Cheers,
    Ingo
  • Lewisham Member Posts: 11 Contributor II
    Thanks Ingo, you're a star.

    Now to play around!