RapidMiner

weighted nearest neighbor + crossvalidation

Hi, I cannot figure out how to integrate learned feature weights into a nearest
neighbor algorithm using 10-fold cross-validation.
Nearest neighbor and cross-validation on their own are no problem, but the
use of weights complicates this a lot. The weights should be learned on the
training data and, via the cross-validation operator, applied to the evaluation data.
Is it possible to do this with the GUI, or do I have to write the cross-validation
myself without employing the cross-validation operator?

Thank you for any help.
1 REPLY
RMStaff

Re: weighted nearest neighbor + crossvalidation

Hi Ulli,

no coding is necessary for this (actually, problems like these were the reason for the modular operator concept of RapidMiner). This can be done with nested cross validations, i.e. an outer cross validation in which the learner is embedded into a feature weighting scheme like EvolutionaryWeighting, which in turn contains an inner cross validation for optimizing the weights. However, it is even more convenient to use the operator "WrapperXValidation" as the outer cross validation for this task. From the operator info dialog (F1) of this operator:


This operator evaluates the performance of feature weighting and selection algorithms. The first inner operator is the algorithm to be evaluated itself. It must return an attribute weights vector which is applied on the test data. This fold is used to create a new model using the second inner operator and retrieve a performance vector using the third inner operator. This performance vector serves as a performance indicator for the actual algorithm. This implementation of a MethodValidationChain works similar to the XValidation.


And here are the inner conditions (also from the operator info dialog):



  • Operator 1 (Wrapper) must be able to handle [ExampleSet] and must deliver [AttributeWeights].

  • Operator 2 (Training) must be able to handle [ExampleSet] and must deliver [Model].

  • Operator 3 (Testing) must be able to handle [ExampleSet, Model] and must deliver [PerformanceVector].
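
To make the division of labour between the three inner operators concrete, here is a minimal Python sketch of the same idea outside RapidMiner. It is not RapidMiner code and not how WrapperXValidation is implemented: all function names are made up for illustration, a simple random search stands in for EvolutionaryWeighting, and the weights are assumed to scale the attribute values inside the Euclidean distance. The point it shows is that the weights are learned on the training part of each outer fold only (using an inner cross validation) and are then evaluated on the held-out part.

```python
import random
from collections import Counter


def make_data(n, seed=0):
    """Hypothetical toy data: attributes 0 and 1 decide the label, 2 and 3 are noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.random() for _ in range(4)]
        rows.append((x, int(x[0] + x[1] > 1.0)))
    return rows


def knn_predict(train, query, weights, k=5):
    """k-NN where each attribute difference is scaled by its weight (an assumption)."""
    def dist(a, b):
        return sum((w * (x - y)) ** 2 for w, x, y in zip(weights, a, b))
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, test, weights, k=5):
    """Build the weighted k-NN 'model' from train and score it on test."""
    hits = sum(knn_predict(train, x, weights, k) == y for x, y in test)
    return hits / len(test)


def folds(data, n):
    """Yield (train, test) pairs for n-fold cross validation."""
    for i in range(n):
        test = data[i::n]
        train = [row for j, row in enumerate(data) if j % n != i]
        yield train, test


def inner_cv(data, weights, n=3):
    """Inner cross validation used only to score a candidate weight vector."""
    scores = [accuracy(tr, te, weights) for tr, te in folds(data, n)]
    return sum(scores) / len(scores)


def learn_weights(train, rng, candidates=10):
    """Role of operator 1 (wrapper): random search stands in for the evolutionary search."""
    n_att = len(train[0][0])
    best_w = [1.0] * n_att
    best_score = inner_cv(train, best_w)
    for _ in range(candidates):
        w = [rng.random() for _ in range(n_att)]
        score = inner_cv(train, w)
        if score > best_score:
            best_w, best_score = w, score
    return best_w


def wrapper_cv(data, n=5, seed=0):
    """Outer loop: learn weights on the training fold, evaluate on the test fold."""
    rng = random.Random(seed)
    scores, all_weights = [], []
    for train, test in folds(data, n):
        w = learn_weights(train, rng)            # operator 1: sees training data only
        scores.append(accuracy(train, test, w))  # operators 2 and 3: train, then test
        all_weights.append(w)
    avg_w = [sum(col) / n for col in zip(*all_weights)]
    return sum(scores) / n, avg_w


if __name__ == "__main__":
    performance, averaged_weights = wrapper_cv(make_data(60, seed=1))
    print(performance, averaged_weights)
```

Because the test part of each outer fold is never seen while the weights are searched, the averaged performance is an estimate for the whole "learn weights, then classify" procedure rather than for one fixed weight vector.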




So this is how you could set up a process for Nearest Neighbors together with evolutionary attribute weighting:


<operator name="Root" class="Process" expanded="yes">
    <operator name="DataGeneration" class="OperatorChain" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="number_examples" value="200"/>
            <parameter key="number_of_attributes" value="3"/>
            <parameter key="target_function" value="sum classification"/>
        </operator>
        <operator name="NoiseGenerator" class="NoiseGenerator">
            <parameter key="label_noise" value="0.0"/>
            <list key="noise">
            </list>
            <parameter key="random_attributes" value="3"/>
        </operator>
        <operator name="Normalization" class="Normalization">
            <parameter key="z_transform" value="false"/>
        </operator>
    </operator>
    <operator name="WrapperXValidation" class="WrapperXValidation" expanded="yes">
        <parameter key="number_of_validations" value="5"/>
        <operator name="EvolutionaryWeighting" class="EvolutionaryWeighting" expanded="yes">
            <parameter key="maximum_number_of_generations" value="20"/>
            <parameter key="p_crossover" value="0.5"/>
            <parameter key="population_size" value="2"/>
            <operator name="XValidation" class="XValidation" expanded="yes">
                <parameter key="number_of_validations" value="5"/>
                <operator name="WeightLearner" class="NearestNeighbors">
                    <parameter key="k" value="5"/>
                </operator>
                <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                    <operator name="ModelApplier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="Performance" class="Performance">
                    </operator>
                </operator>
            </operator>
        </operator>
        <operator name="WeightedModelLearner" class="NearestNeighbors">
            <parameter key="k" value="5"/>
        </operator>
        <operator name="WeightedApplierChain" class="OperatorChain" expanded="yes">
            <operator name="WeightedModelApplier" class="ModelApplier">
                <list key="application_parameters">
                </list>
            </operator>
            <operator name="WeightedPerformance" class="Performance">
            </operator>
        </operator>
    </operator>
</operator>



The process will run for several minutes. After it has finished, the performance is delivered together with a weight vector averaged over all runs. This vector could, for example, be saved and applied to new data sets. In the example above, the weights found should be something like


att1       0.9181578856281039
att3       0.8079093341177875
att2       0.5669022824248217
random1    0.4395652799419607
random2    0.25727249709958755
random     0.047672333763268744
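
As a rough illustration of what "applying" such a weight vector to new data can mean, the following Python sketch multiplies each attribute value by its weight before the data reaches a distance-based learner. This is an assumption about the semantics, not RapidMiner code: the function name `apply_weights` and the dictionary layout are hypothetical, and the numbers are simply the example weights listed above.

```python
# Example weights copied from the listing above (attribute name -> weight).
WEIGHTS = {
    "att1": 0.9181578856281039,
    "att3": 0.8079093341177875,
    "att2": 0.5669022824248217,
    "random1": 0.4395652799419607,
    "random2": 0.25727249709958755,
    "random": 0.047672333763268744,
}


def apply_weights(example, weights):
    """Scale every attribute of one example by its weight (hypothetical helper).

    Attributes with a small weight then contribute little to a Euclidean
    distance; a missing weight raises KeyError instead of being guessed.
    """
    return {name: value * weights[name] for name, value in example.items()}


if __name__ == "__main__":
    new_example = {"att1": 0.5, "att2": 0.5, "att3": 0.5,
                   "random": 0.5, "random1": 0.5, "random2": 0.5}
    print(apply_weights(new_example, WEIGHTS))
```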


Cheers,
Ingo