Different values for RegressionPerformance for the same data

Legacy User Member Posts: 0 Newbie
edited November 2018 in Help
Hello,

I have a problem: I get different values from RegressionPerformance for the same attribute.
I used Model 1 (with FeatureSelection) and Model 2 (without FeatureSelection, only with an AttributeFilter).

Attribute
Model 1: att3 root_mean_squared_error 0.334 squared_correlation 10.651
Model 2: att3 root_mean_squared_error 0.326 squared_correlation 11.189
???

The same attribute (e.g. att3) has different values for RegressionPerformance in the two models. Can anyone tell me why?

Model 1

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator" breakpoints="after">
        <parameter key="target_function" value="sum"/>
    </operator>
    <operator name="FS" class="FeatureSelection" expanded="yes">
        <parameter key="user_result_individual_selection" value="true"/>
        <parameter key="keep_best" value="64"/>
        <parameter key="maximum_number_of_generations" value="1"/>
        <operator name="BootstrappingValidation" class="BootstrappingValidation" expanded="yes">
            <parameter key="keep_example_set" value="true"/>
            <parameter key="create_complete_model" value="true"/>
            <operator name="LinearRegression" class="LinearRegression">
                <parameter key="feature_selection" value="none"/>
            </operator>
            <operator name="ApplierChain" class="OperatorChain" expanded="yes">
                <operator name="Applier" class="ModelApplier">
                    <parameter key="keep_model" value="true"/>
                    <list key="application_parameters">
                    </list>
                </operator>
                <operator name="RegressionPerformance" class="RegressionPerformance">
                    <parameter key="main_criterion" value="squared_correlation"/>
                    <parameter key="root_mean_squared_error" value="true"/>
                    <parameter key="squared_correlation" value="true"/>
                </operator>
            </operator>
        </operator>
    </operator>
</operator>


Model 2

<operator name="Root" class="Process" expanded="yes">
   <operator name="Daten laden  und vorbereiten" class="OperatorChain" expanded="yes">
       <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
           <parameter key="target_function" value="sum"/>
       </operator>
   </operator>
   <operator name="Attribute identifizieren, Ranking, Correalation" class="OperatorChain" expanded="yes">
       <operator name="AttributeFilter" class="AttributeFilter">
           <parameter key="condition_class" value="attribute_name_filter"/>
           <parameter key="parameter_string" value="att3"/>
       </operator>
       <operator name="BootstrappingValidation" class="BootstrappingValidation" expanded="yes">
           <parameter key="keep_example_set" value="true"/>
           <parameter key="create_complete_model" value="true"/>
           <operator name="LinearRegression" class="LinearRegression">
               <parameter key="feature_selection" value="none"/>
           </operator>
           <operator name="OperatorChain" class="OperatorChain" expanded="yes">
               <operator name="ModelApplier (2)" class="ModelApplier">
                   <list key="application_parameters">
                   </list>
               </operator>
               <operator name="RegressionPerformance" class="RegressionPerformance">
                   <parameter key="root_mean_squared_error" value="true"/>
                   <parameter key="absolute_error" value="true"/>
                   <parameter key="relative_error" value="true"/>
                   <parameter key="correlation" value="true"/>
                   <parameter key="squared_correlation" value="true"/>
                   <parameter key="skip_undefined_labels" value="false"/>
                   <parameter key="use_example_weights" value="false"/>
               </operator>
           </operator>
       </operator>
   </operator>
   <operator name="ModelApplier" class="ModelApplier">
       <parameter key="keep_model" value="true"/>
       <list key="application_parameters">
       </list>
   </operator>
</operator>

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    the quick answer is: because you have two different processes. Even a different order in which the random numbers are consumed can affect the measured performance. You could use local_random_seeds to avoid this.

    Greetings,
      Sebastian
  • Legacy User Member Posts: 0 Newbie
    Hello,

    I don't know what you mean by local_random_seeds.
    I have only added the FeatureSelection to Model 1. As I understand it, this is a way to test all attributes, individually and in combination, to find the best fit
    with a linear model. But this is not a random process in itself.
    I want to find out which attributes are best for predicting the label. For this I look at the performance criteria, such as squared_correlation and root_mean_squared_error.

    Best regards,

    Angela
  • haddock Member Posts: 849 Guru
    I don't know what you mean by local_random_seeds.
    Have you really not thought of searching this forum, say on "local_random"?
  • Legacy User Member Posts: 0 Newbie
    haddock wrote:

    Have you really not thought of searching this forum, say on "local_random"?
    I know what local_random_seeds is!
    It is also one of the features that make RapidMiner so special.
    But please read my entire question ::)

    Even a random process should not alter the quality (the RegressionPerformance criteria) of each value.
    I therefore assume that I cannot compare the RegressionPerformance criteria for specific attributes across two different models.

    Best regards
  • haddock Member Posts: 849 Guru
    But please read my entire question
    I have, and Seb has answered it, and....?

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Angela,
    of course a random sampling of examples affects the measured quality, and BootstrappingValidation performs exactly such a random sampling. Without the same random number sequence, it is not guaranteed that the same examples are selected. For example, if an example that can be fitted perfectly is not selected but an outlier is selected twice, this will affect the performance heavily.
    I would recommend using a local random seed on your BootstrappingValidation operators; this should do the trick.

    Greetings,
      Sebastian
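
    A minimal sketch of how this suggestion might look in the process XML. It assumes the parameter key is local_random_seed (the name used elsewhere in this thread, with -1 as the value it was changed from) and that the same seed is set on the BootstrappingValidation operator in both processes. The seed value 1 is arbitrary, and the inner operators are copied from Model 2 above.

    ```xml
    <operator name="BootstrappingValidation" class="BootstrappingValidation" expanded="yes">
        <parameter key="keep_example_set" value="true"/>
        <parameter key="create_complete_model" value="true"/>
        <!-- Assumption: key name as used in this thread; any fixed value other than -1, identical in both processes -->
        <parameter key="local_random_seed" value="1"/>
        <operator name="LinearRegression" class="LinearRegression">
            <parameter key="feature_selection" value="none"/>
        </operator>
        <operator name="OperatorChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier (2)" class="ModelApplier">
                <list key="application_parameters">
                </list>
            </operator>
            <operator name="RegressionPerformance" class="RegressionPerformance">
                <parameter key="root_mean_squared_error" value="true"/>
                <parameter key="squared_correlation" value="true"/>
            </operator>
        </operator>
    </operator>
    ```

    With a fixed seed, repeated runs of the same process should draw the same bootstrap samples. Whether two differently structured processes then report identical values is not confirmed in this thread.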
  • Angela Member Posts: 4 Contributor I
    Hi Sebastian,

    many thanks for this answer. I changed local_random_seed from -1 to other values (1, 10, 100), but I still get the same values for
    squared_correlation for the attributes.

    But I found another way to get the correct squared_correlation from the importance values.
    Many thanks for your help.

    Angela