How to do Y-randomization in RapidMiner?

pengie Member Posts: 21 Maven
edited November 2018 in Help
Hi,

I was wondering how to do Y-randomization in RapidMiner. In Y-randomization, the y value of an example is randomly exchanged with the y value of another example. This is used in the validation of QSAR models, where the performance of the original model (r²) is compared with that of models built on a permuted (randomly shuffled) response.
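To make the shuffling step concrete, here is a minimal sketch in plain Python, outside RapidMiner. The function name `y_randomize` and the toy data are illustrative assumptions, not part of any RapidMiner API. The descriptors stay in place and only the response values are permuted across examples.

```python
import random

def y_randomize(y, seed=None):
    """Return a randomly permuted copy of the response values y.

    Hypothetical helper for illustration: the descriptor rows keep their
    order, so an example may end up paired with another example's y value.
    """
    rng = random.Random(seed)
    shuffled = list(y)      # copy, so the original response stays intact
    rng.shuffle(shuffled)   # shuffle the copy in place
    return shuffled

# Toy data (made up): three descriptor rows and their responses.
X = [[0.1, 1.2], [0.4, 0.9], [0.8, 0.3]]
y = [1.5, 2.0, 3.1]
y_perm = y_randomize(y, seed=0)
```

The permuted response contains exactly the same values as the original, so its distribution is unchanged and only the link between descriptors and response is broken.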

Regards

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    although there is no operator for Y-randomization in RapidMiner yet, we can make use of its modularity. I have created a process that performs Y-randomization. You could encapsulate it within an OperatorChain to use it in your own process.
    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="target_function" value="one third classification"/>
        </operator>
        <operator name="IdTagging" class="IdTagging">
        </operator>
        <operator name="IOMultiplier" class="IOMultiplier">
            <parameter key="io_object" value="ExampleSet"/>
        </operator>
        <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
            <parameter key="attribute_name_regex" value="label|id"/>
            <parameter key="condition_class" value="attribute_name_filter"/>
            <parameter key="keep_subset_only" value="true"/>
            <operator name="NoiseGenerator" class="NoiseGenerator">
                <parameter key="label_noise" value="0.0"/>
                <list key="noise">
                </list>
                <parameter key="random_attributes" value="1"/>
            </operator>
            <operator name="Sorting" class="Sorting">
                <parameter key="attribute_name" value="random"/>
            </operator>
            <operator name="IdTagging (2)" class="IdTagging">
            </operator>
        </operator>
        <operator name="IOSelector" class="IOSelector">
            <parameter key="io_object" value="ExampleSet"/>
            <parameter key="select_which" value="2"/>
        </operator>
        <operator name="ExampleSetJoin" class="ExampleSetJoin">
        </operator>
        <operator name="AttributeFilter (2)" class="AttributeFilter">
            <parameter key="condition_class" value="attribute_name_filter"/>
            <parameter key="invert_filter" value="true"/>
            <parameter key="parameter_string" value="random"/>
        </operator>
    </operator>
    Hope that helps.


    Greetings,
      Sebastian
  • pengie Member Posts: 21 Maven
    Hi,

    Thank you for your help. The code worked perfectly. I am now trying to use RapidMiner to do Y-randomization, train a model, evaluate the model using leave-one-out, and repeat this 100 times to get an average classification error for the Y-randomization. I am using the following code:

    <operator name="Root" class="Process" expanded="yes">
        <parameter key="random_seed" value="-1"/>
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="target_function" value="one third classification"/>
        </operator>
        <operator name="RepeatUntilOperatorChain" class="RepeatUntilOperatorChain" expanded="yes">
            <parameter key="max_iterations" value="100"/>
            <operator name="IdTagging" class="IdTagging">
            </operator>
            <operator name="IOMultiplier" class="IOMultiplier">
                <parameter key="io_object" value="ExampleSet"/>
            </operator>
            <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="no">
                <parameter key="attribute_name_regex" value="label|id"/>
                <parameter key="condition_class" value="attribute_name_filter"/>
                <parameter key="keep_subset_only" value="true"/>
                <operator name="NoiseGenerator" class="NoiseGenerator">
                    <parameter key="label_noise" value="0.0"/>
                    <list key="noise">
                    </list>
                    <parameter key="random_attributes" value="1"/>
                </operator>
                <operator name="Sorting" class="Sorting">
                    <parameter key="attribute_name" value="random"/>
                </operator>
                <operator name="IdTagging (2)" class="IdTagging">
                </operator>
            </operator>
            <operator name="IOSelector" class="IOSelector">
                <parameter key="io_object" value="ExampleSet"/>
                <parameter key="select_which" value="2"/>
            </operator>
            <operator name="ExampleSetJoin" class="ExampleSetJoin">
            </operator>
            <operator name="AttributeFilter (2)" class="AttributeFilter">
                <parameter key="condition_class" value="attribute_name_filter"/>
                <parameter key="invert_filter" value="true"/>
                <parameter key="parameter_string" value="random"/>
            </operator>
            <operator name="XValidation" class="XValidation" expanded="yes">
                <parameter key="leave_one_out" value="true"/>
                <operator name="NearestNeighbors" class="NearestNeighbors">
                    <parameter key="k" value="3"/>
                </operator>
                <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                    <operator name="ModelApplier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="ClassificationPerformance" class="ClassificationPerformance">
                        <list key="class_weights">
                        </list>
                        <parameter key="classification_error" value="true"/>
                    </operator>
                </operator>
            </operator>
        </operator>
    </operator>
    However, it seems to give me an error about RepeatUntilOperatorChain.
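The workflow described in this post (permute the labels, evaluate a 3-nearest-neighbour model with leave-one-out, repeat 100 times, average the classification error) can also be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions: the function names and toy data are made up, and the hand-written k-NN is a stand-in for the NearestNeighbors operator, not a reimplementation of it.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority label among the k nearest training points."""
    # Rank training points by squared Euclidean distance to x.
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loo_error(X, y, k=3):
    """Leave-one-out classification error (fraction misclassified)."""
    errors = 0
    for i in range(len(X)):
        # Hold out example i and train on all the others.
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_X, train_y, X[i], k) != y[i]:
            errors += 1
    return errors / len(X)

def y_randomized_error(X, y, repeats=100, k=3, seed=None):
    """Average leave-one-out error over `repeats` label permutations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        y_perm = list(y)       # fresh copy of the labels each round
        rng.shuffle(y_perm)    # Y-randomization: permute the labels
        total += loo_error(X, y_perm, k)
    return total / repeats

# Toy data (made up): two well-separated clusters.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
     [5.0, 5.1], [5.2, 4.9], [4.8, 5.3], [5.1, 5.0]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]

original_error = loo_error(X, y)
permuted_error = y_randomized_error(X, y, repeats=100, seed=1)
```

On data with real structure, the average error under permuted labels should be clearly worse than the error of the original model, which is the comparison Y-randomization is meant to provide.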
  • TobiasMalbrecht Moderator, Employee, Member Posts: 294 RM Product Management
    Hi,

    just a hint: why not use the IteratingPerformanceAverage operator, which also iterates a predefined number of times and averages the performance vectors produced by the inner operator chain?

    Regards,
    Tobias
  • pengie Member Posts: 21 Maven
    Great hint!

    I ran into another error: "Message: The attribute 'random' does not exist." I did a bit of tracing. It seems that AttributeFilter (2) removes the attribute 'random' after the first iteration, but in the second iteration the NoiseGenerator generates an attribute named 'random1' instead of 'random', which causes the error.

    <operator name="Root" class="Process" expanded="yes">
        <parameter key="random_seed" value="-1"/>
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="target_function" value="one third classification"/>
        </operator>
        <operator name="IteratingPerformanceAverage" class="IteratingPerformanceAverage" expanded="yes">
            <operator name="IdTagging" class="IdTagging">
            </operator>
            <operator name="IOMultiplier" class="IOMultiplier">
                <parameter key="io_object" value="ExampleSet"/>
            </operator>
            <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
                <parameter key="attribute_name_regex" value="label|id"/>
                <parameter key="condition_class" value="attribute_name_filter"/>
                <parameter key="keep_subset_only" value="true"/>
                <operator name="NoiseGenerator" class="NoiseGenerator" breakpoints="after">
                    <parameter key="label_noise" value="0.0"/>
                    <list key="noise">
                    </list>
                    <parameter key="random_attributes" value="1"/>
                </operator>
                <operator name="Sorting" class="Sorting">
                    <parameter key="attribute_name" value="random"/>
                </operator>
                <operator name="IdTagging (2)" class="IdTagging">
                </operator>
            </operator>
            <operator name="IOSelector" class="IOSelector">
                <parameter key="io_object" value="ExampleSet"/>
                <parameter key="select_which" value="2"/>
            </operator>
            <operator name="ExampleSetJoin" class="ExampleSetJoin">
            </operator>
            <operator name="AttributeFilter (2)" class="AttributeFilter">
                <parameter key="condition_class" value="attribute_name_filter"/>
                <parameter key="invert_filter" value="true"/>
                <parameter key="parameter_string" value="random"/>
            </operator>
            <operator name="XValidation" class="XValidation" expanded="yes">
                <parameter key="leave_one_out" value="true"/>
                <operator name="NearestNeighbors" class="NearestNeighbors">
                    <parameter key="k" value="3"/>
                </operator>
                <operator name="OperatorChain" class="OperatorChain" expanded="no">
                    <operator name="ModelApplier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="ClassificationPerformance" class="ClassificationPerformance">
                        <list key="class_weights">
                        </list>
                        <parameter key="classification_error" value="true"/>
                    </operator>
                </operator>
            </operator>
        </operator>
    </operator>
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    try our Permutation operator. I forgot about it myself in the previous solution. So many operators... :)
    <operator name="Root" class="Process" expanded="yes">
        <parameter key="random_seed" value="-1"/>
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="target_function" value="one third classification"/>
        </operator>
        <operator name="IteratingPerformanceAverage" class="IteratingPerformanceAverage" expanded="yes">
            <operator name="IdTagging" class="IdTagging">
            </operator>
            <operator name="IOMultiplier" class="IOMultiplier">
                <parameter key="io_object" value="ExampleSet"/>
            </operator>
            <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
                <parameter key="attribute_name_regex" value="label|id"/>
                <parameter key="condition_class" value="attribute_name_filter"/>
                <parameter key="keep_subset_only" value="true"/>
                <operator name="Permutation" class="Permutation">
                </operator>
                <operator name="IdTagging (2)" class="IdTagging">
                </operator>
            </operator>
            <operator name="IOSelector" class="IOSelector">
                <parameter key="io_object" value="ExampleSet"/>
                <parameter key="select_which" value="2"/>
            </operator>
            <operator name="ExampleSetJoin" class="ExampleSetJoin">
            </operator>
            <operator name="XValidation" class="XValidation" expanded="yes">
                <parameter key="leave_one_out" value="true"/>
                <operator name="NearestNeighbors" class="NearestNeighbors">
                    <parameter key="k" value="3"/>
                </operator>
                <operator name="OperatorChain" class="OperatorChain" expanded="no">
                    <operator name="ModelApplier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="ClassificationPerformance" class="ClassificationPerformance">
                        <list key="class_weights">
                        </list>
                        <parameter key="classification_error" value="true"/>
                    </operator>
                </operator>
            </operator>
        </operator>
    </operator>

    This should help.

    Greetings,
      Sebastian
  • pengie Member Posts: 21 Maven
    Thank you so much. It worked perfectly. ;D

    Just one last question: when I set a breakpoint at ExampleSetJoin, I notice that the id number of the dataset keeps increasing. Why is that, and will it have any impact on memory?
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    no, this won't increase memory consumption. The memory used by an ExampleSet is freed once no ExampleSet refers to it any longer. Keep in mind that it need not be freed immediately: Java frees memory when it considers that appropriate or when it needs the space.

    Greetings,
      Sebastian