[SOLVED] Polynomial Regression - Wrong results

Daniela Member Posts: 2 Contributor I
edited November 2018 in Help
Hi there,

I'm new to RapidMiner and to data mining in general. My task right now is to compare different tools for later use with large data sets and predictive analytics.

I started with some very simple examples and checked the results against R. It worked fine for linear regression with only one attribute.

Now I want RapidMiner to calculate a Polynomial Regression for a data set with two columns, a numeric label (y) and one numeric attribute (x), and 300 entries (x is a sequence from 0 to 30 with steps of 0.1).

The result should be
y = 0.15 x^2 - 7.34 x + 106.38
But it is:
 87.714 * x ^ 1.000
- 90.314 * x ^ 1.000
+ 79.563
I must be missing something very obvious, but I still can't figure it out.
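
Both x terms in the reported model carry the exponent 1.000, so the model contains no quadratic term at all and collapses to a straight line. A minimal sketch of that arithmetic, with the coefficients copied from the output above (the variable names are only illustrative):

```python
from collections import defaultdict

# (coefficient, exponent) pairs copied from the model output in the question.
terms = [(87.714, 1.0), (-90.314, 1.0)]
intercept = 79.563

# Add up the coefficients of all terms that share the same exponent.
combined = defaultdict(float)
for coefficient, exponent in terms:
    combined[exponent] += coefficient

# Only the exponent 1.0 survives, so the model is a line with this slope.
slope = combined[1.0]
print(f"y = {slope:.3f} * x + {intercept}")
```

The combined slope is about -2.6, so what looks like a polynomial is effectively a linear fit of the data.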

process attached:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.003">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="6.0.003" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="60" name="Read CSV" width="90" x="112" y="75">
       <parameter key="csv_file" value="decay_RapidMiner.txt"/>
       <parameter key="first_row_as_names" value="false"/>
       <list key="annotations">
         <parameter key="0" value="Name"/>
       </list>
       <parameter key="encoding" value="windows-1252"/>
       <list key="data_set_meta_data_information">
         <parameter key="0" value="x.true.numeric.attribute"/>
         <parameter key="1" value="y.true.numeric.label"/>
       </list>
     </operator>
     <operator activated="true" class="split_data" compatibility="6.0.003" expanded="true" height="94" name="Split Data" width="90" x="313" y="75">
       <enumeration key="partitions">
         <parameter key="ratio" value="0.7"/>
         <parameter key="ratio" value="0.3"/>
       </enumeration>
       <parameter key="sampling_type" value="stratified sampling"/>
     </operator>
     <operator activated="true" class="polynomial_regression" compatibility="6.0.003" expanded="true" height="76" name="Polynomial Regression" width="90" x="514" y="75">
       <parameter key="replication_factor" value="2"/>
     </operator>
     <operator activated="true" class="apply_model" compatibility="6.0.003" expanded="true" height="76" name="Apply Model" width="90" x="648" y="75">
       <list key="application_parameters"/>
     </operator>
     <connect from_op="Read CSV" from_port="output" to_op="Split Data" to_port="example set"/>
     <connect from_op="Split Data" from_port="partition 1" to_op="Polynomial Regression" to_port="training set"/>
     <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
     <connect from_op="Polynomial Regression" from_port="model" to_op="Apply Model" to_port="model"/>
     <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
     <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
     <portSpacing port="sink_result 3" spacing="0"/>
   </process>
 </operator>
</process>

Thank you so much for your help

Regards,
Daniela

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Daniela,

    Birgit told me that you are using the decay.txt from the zip file. That file contains only 30 data points, which is by far not enough to derive the formula that you posted. If you plot y vs. x, you see that the relationship looks rather linear with a slope of about -3 (visual estimation :-) ), so RapidMiner does not perform that badly :)
    Does R work better on the same data set?
    Do you really have 300 entries in your data? I only see 30 with a step size of 1.

    Best regards,
    Marius
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    One additional remark: the Polynomial Regression operator uses a numerical approach. The algorithm of the Linear Regression operator may be better suited in many cases. Of course, it is then necessary to calculate interaction and quadratic terms manually. You can use the Generate Function Set operator for that. Please have a look at the process below for an example. There, the model recovers the relation almost perfectly.

    Best regards,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="subprocess" compatibility="6.0.008" expanded="true" height="76" name="Generate Data (2)" width="90" x="45" y="30">
            <process expanded="true">
              <operator activated="true" class="generate_data" compatibility="6.0.008" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
                <parameter key="number_examples" value="300"/>
                <parameter key="number_of_attributes" value="1"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="6.0.008" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="30">
                <parameter key="invert_selection" value="true"/>
                <parameter key="include_special_attributes" value="true"/>
              </operator>
              <operator activated="true" class="generate_id" compatibility="6.0.008" expanded="true" height="76" name="Generate ID" width="90" x="313" y="30"/>
              <operator activated="true" class="generate_attributes" compatibility="6.0.008" expanded="true" height="76" name="Generate Attributes" width="90" x="447" y="30">
                <list key="function_descriptions">
                  <parameter key="x" value="id/10"/>
                  <parameter key="y" value="0.15 * x*x - 7.34 *x + 106.38"/>
                </list>
              </operator>
              <operator activated="true" class="materialize_data" compatibility="6.0.008" expanded="true" height="76" name="Materialize Data (2)" width="90" x="581" y="30"/>
              <connect from_op="Generate Data" from_port="output" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
              <connect from_op="Generate ID" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_op="Materialize Data (2)" to_port="example set input"/>
              <connect from_op="Materialize Data (2)" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="set_role" compatibility="6.0.008" expanded="true" height="76" name="Set Role (2)" width="90" x="179" y="30">
            <parameter key="attribute_name" value="y"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_function_set" compatibility="6.0.008" expanded="true" height="76" name="Generate Function Set" width="90" x="313" y="30">
            <parameter key="use_mult" value="true"/>
          </operator>
          <operator activated="true" class="rename_by_constructions" compatibility="6.0.008" expanded="true" height="76" name="Rename by Constructions" width="90" x="447" y="30"/>
          <operator activated="true" class="split_data" compatibility="6.0.008" expanded="true" height="94" name="Split Data" width="90" x="45" y="210">
            <enumeration key="partitions">
              <parameter key="ratio" value="0.7"/>
              <parameter key="ratio" value="0.3"/>
            </enumeration>
            <parameter key="sampling_type" value="stratified sampling"/>
          </operator>
          <operator activated="true" class="linear_regression" compatibility="6.0.008" expanded="true" height="94" name="Linear Regression" width="90" x="179" y="165"/>
          <operator activated="true" class="apply_model" compatibility="6.0.008" expanded="true" height="76" name="Apply Model" width="90" x="313" y="210">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Generate Data (2)" from_port="out 1" to_op="Set Role (2)" to_port="example set input"/>
          <connect from_op="Set Role (2)" from_port="example set output" to_op="Generate Function Set" to_port="example set input"/>
          <connect from_op="Generate Function Set" from_port="example set output" to_op="Rename by Constructions" to_port="example set input"/>
          <connect from_op="Rename by Constructions" from_port="example set output" to_op="Split Data" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 1" to_op="Linear Regression" to_port="training set"/>
          <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Linear Regression" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
          <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="180"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
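
The idea in the process above (add x*x as an extra attribute, then run an ordinary linear regression) can be sketched outside RapidMiner as well. The following is a minimal pure-Python illustration, not RapidMiner's implementation: the helper names `solve_linear_system` and `fit_quadratic` are made up for this example, and it assumes the generated ids run from 1 to 300 so that x = id/10.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs stay untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_quadratic(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x*x via the normal equations."""
    # Each row holds the constant term, x, and the generated attribute x*x.
    rows = [[1.0, x, x * x] for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Same synthetic data as in the process: x = id/10, y from the known formula.
xs = [i / 10 for i in range(1, 301)]
ys = [0.15 * x * x - 7.34 * x + 106.38 for x in xs]

intercept, linear, quadratic = fit_quadratic(xs, ys)
print(round(quadratic, 4), round(linear, 4), round(intercept, 4))
```

Because the label is an exact quadratic function of x, the linear model over the attributes x and x*x has a closed-form least-squares solution and reproduces the coefficients up to floating-point error.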
  • Daniela Member Posts: 2 Contributor I
    Hello Marius,

    thank you so much for your help.

    (R does calculate the formula pretty well with only the 30 data points. But you're right, that's not 300: what I did for RapidMiner was create a dataset with 300 data points following the calculated formula. So basically what you did with the subprocess "Generate Data".)

    So, if I understand correctly, the mistake was using the wrong operator? (I should have guessed that from the option "max iterations"...)


    Best regards,
    Daniela
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Daniela,

    well, I wouldn't call it a mistake, but yes, it seems that that was the problem :-)

    Finally, you can also discover the formula with only 30 data points in RapidMiner using the Linear Regression. You need to disable all integrated feature selection methods of the Linear Regression, though; otherwise the heuristics remove seemingly collinear features. Set the feature selection method to "none" and disable "eliminate colinear features" in the Linear Regression.
    As you can see, with 300 data points the heuristics have enough input to keep all relevant attributes.

    Best regards,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="subprocess" compatibility="6.0.008" expanded="true" height="76" name="Generate Data (2)" width="90" x="45" y="30">
            <process expanded="true">
              <operator activated="true" class="generate_data" compatibility="6.0.008" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
                <parameter key="number_examples" value="30"/>
                <parameter key="number_of_attributes" value="1"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="6.0.008" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="30">
                <parameter key="invert_selection" value="true"/>
                <parameter key="include_special_attributes" value="true"/>
              </operator>
              <operator activated="true" class="generate_id" compatibility="6.0.008" expanded="true" height="76" name="Generate ID" width="90" x="313" y="30"/>
              <operator activated="true" class="generate_attributes" compatibility="6.0.008" expanded="true" height="76" name="Generate Attributes" width="90" x="447" y="30">
                <list key="function_descriptions">
                  <parameter key="x" value="id"/>
                  <parameter key="y" value="0.15 * x*x - 7.34 *x + 106.38"/>
                </list>
              </operator>
              <operator activated="true" class="materialize_data" compatibility="6.0.008" expanded="true" height="76" name="Materialize Data (2)" width="90" x="581" y="30"/>
              <connect from_op="Generate Data" from_port="output" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
              <connect from_op="Generate ID" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_op="Materialize Data (2)" to_port="example set input"/>
              <connect from_op="Materialize Data (2)" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="set_role" compatibility="6.0.008" expanded="true" height="76" name="Set Role (2)" width="90" x="179" y="30">
            <parameter key="attribute_name" value="y"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_function_set" compatibility="6.0.008" expanded="true" height="76" name="Generate Function Set" width="90" x="313" y="30">
            <parameter key="use_mult" value="true"/>
          </operator>
          <operator activated="true" class="rename_by_constructions" compatibility="6.0.008" expanded="true" height="76" name="Rename by Constructions" width="90" x="447" y="30"/>
          <operator activated="true" class="split_data" compatibility="6.0.008" expanded="true" height="94" name="Split Data" width="90" x="45" y="210">
            <enumeration key="partitions">
              <parameter key="ratio" value="0.7"/>
              <parameter key="ratio" value="0.3"/>
            </enumeration>
            <parameter key="sampling_type" value="stratified sampling"/>
          </operator>
          <operator activated="true" class="linear_regression" compatibility="6.0.008" expanded="true" height="94" name="Linear Regression" width="90" x="179" y="165">
            <parameter key="feature_selection" value="none"/>
            <parameter key="eliminate_colinear_features" value="false"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="6.0.008" expanded="true" height="76" name="Apply Model" width="90" x="313" y="210">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Generate Data (2)" from_port="out 1" to_op="Set Role (2)" to_port="example set input"/>
          <connect from_op="Set Role (2)" from_port="example set output" to_op="Generate Function Set" to_port="example set input"/>
          <connect from_op="Generate Function Set" from_port="example set output" to_op="Rename by Constructions" to_port="example set input"/>
          <connect from_op="Rename by Constructions" from_port="example set output" to_op="Split Data" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 1" to_op="Linear Regression" to_port="training set"/>
          <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Linear Regression" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
          <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="180"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
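
To see why x and the generated attribute x*x can look collinear on such a small range, one can compute their Pearson correlation for the 30 points used in the process above. This is only a rough illustration: the exact criterion that RapidMiner's heuristics apply is not described in this thread, and the `pearson` helper below is written just for this sketch.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


# x = id for ids 1..30 (assumed), and the generated product attribute x*x.
xs = [float(i) for i in range(1, 31)]
squares = [x * x for x in xs]

r = pearson(xs, squares)
print(round(r, 3))
```

The correlation comes out above 0.95, so although x*x is not a linear function of x, the two attributes are correlated strongly enough that a heuristic may treat one of them as redundant.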