[SOLVED] Polynomial Regression - Wrong results
Hi there,
I'm new to RapidMiner and to data mining in general. My task right now is to compare different tools for later use with large data sets and predictive analytics.
I started with some very simple examples and checked the results against R. It worked fine for linear regression with only one attribute.
Now I want RapidMiner to calculate the results of a Polynomial Regression for a data set with two columns, a numeric label (y) and one numeric attribute (x), with 300 entries (x is a sequence from 0 to 30 with steps of 0.1).
The result should be
y = 0.15 x^2 - 7.34 x + 106.38
But it is:
87.714 * x ^ 1.000
- 90.314 * x ^ 1.000
+ 79.563
I must be missing something very obvious, still can't figure it out.
process attached:
Thank you so much for your help
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="60" name="Read CSV" width="90" x="112" y="75">
<parameter key="csv_file" value="decay_RapidMiner.txt"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="x.true.numeric.attribute"/>
<parameter key="1" value="y.true.numeric.label"/>
</list>
</operator>
<operator activated="true" class="split_data" compatibility="6.0.003" expanded="true" height="94" name="Split Data" width="90" x="313" y="75">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
<parameter key="sampling_type" value="stratified sampling"/>
</operator>
<operator activated="true" class="polynomial_regression" compatibility="6.0.003" expanded="true" height="76" name="Polynomial Regression" width="90" x="514" y="75">
<parameter key="replication_factor" value="2"/>
</operator>
<operator activated="true" class="apply_model" compatibility="6.0.003" expanded="true" height="76" name="Apply Model" width="90" x="648" y="75">
<list key="application_parameters"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Split Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 1" to_op="Polynomial Regression" to_port="training set"/>
<connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Polynomial Regression" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Regards,
Daniela
Answers
Birgit told me that you are using the decay.txt from the zip file. That file contains only 30 data points, which is by far not enough to derive the formula that you posted. If you plot y vs. x, you see that the data looks rather linear with a coefficient of about -3 (visual estimation :-) ), so RapidMiner does not perform that badly.
Does R work better on the same data set?
Do you really have 300 entries in your data? I only see 30 with a step size of 1.
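The "rather linear with a coefficient of -3" impression can be checked with a small sketch. The actual decay.txt is not available here, so the code below uses a stand-in: 30 noise-free points with x = 0, 1, ..., 29 computed from the formula posted above. Both the x range and the absence of noise are assumptions, and this is plain ordinary least squares, not RapidMiner's operator. Under those assumptions the slope of the best straight line comes out at roughly -3.

```python
# ASSUMPTION: stand-in for decay.txt, which is not part of this thread.
# 30 noise-free points, x = 0..29, generated from the posted formula.
xs = [float(i) for i in range(30)]
ys = [0.15 * x**2 - 7.34 * x + 106.38 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form ordinary least squares for a straight line y = slope * x + intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"straight-line fit: y = {slope:.2f} * x + {intercept:.2f}")
```

So a straight line with a slope near -3 is a plausible description of such data over this range, even though the underlying relation is quadratic.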
Best regards,
Marius
Thank you so much for your help.
(R does calculate the formula pretty well, even with only the 30 data points. But you're right, that's not 300: what I did for RapidMiner was to create a data set with 300 data points following the calculated formula. So basically what you did with the subprocess "Generate Data".)
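For reference, generating such a 300-point data set and fitting it can be sketched outside RapidMiner as follows. This is only an illustration in Python with NumPy, not the "Generate Data" subprocess from the thread. It assumes noise-free points and x = 0.0, 0.1, ..., 29.9 (300 values); a sequence that includes 30.0 would have 301 entries.

```python
import numpy as np

# ASSUMPTION: 300 noise-free points, x = 0.0, 0.1, ..., 29.9,
# with y computed from the formula posted in the question.
x = np.arange(300) * 0.1
y = 0.15 * x**2 - 7.34 * x + 106.38

# Degree-2 least-squares fit; coefficients are returned highest power first.
a, b, c = np.polyfit(x, y, deg=2)

print(f"y = {a:.2f} x^2 {b:+.2f} x {c:+.2f}")
```

On data constructed this way a degree-2 least-squares fit returns the coefficients that were put in, which makes the set a convenient sanity check for any tool under comparison.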
So, if I understand correctly, the mistake was using the wrong operator? (I should have guessed that from the option "max iterations"...)
Best regards,
Daniela
Well, I wouldn't call it a mistake, but yes, it seems that that was the problem :-)
Finally, you can also discover the formula with only 30 data points in RapidMiner using the Linear Regression operator. You need to disable all of its integrated feature selection methods, though, otherwise the heuristics remove seemingly collinear features: set the feature selection method to "none" and disable "eliminate colinear features" in the Linear Regression.
As you can see, with 300 data points the heuristics have enough input to keep all relevant attributes.
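The idea of fitting a polynomial with a plain linear regression can be sketched in Python with NumPy. This is not RapidMiner's implementation, only ordinary least squares on the columns x^2, x and a constant, with no feature selection at all. The data is an assumption: 30 noise-free points with x = 0..29 generated from the formula in the question, since decay.txt is not available here. The sketch also computes the correlation between x and x^2, which is high on this range and is the kind of signal a collinearity heuristic would react to.

```python
import numpy as np

# ASSUMPTION: stand-in for the 30-point file, x = 0..29, noise-free,
# generated from the formula posted in the question.
x = np.arange(30, dtype=float)
y = 0.15 * x**2 - 7.34 * x + 106.38

# x and x^2 are strongly correlated on a positive range like this one.
corr = np.corrcoef(x, x**2)[0, 1]

# Linear regression on the features [x^2, x, 1], keeping every column.
X = np.column_stack([x**2, x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"correlation(x, x^2) = {corr:.3f}")
print("coefficients for x^2, x, intercept:", np.round(coef, 2))
```

With all three columns kept, the least-squares solution recovers the quadratic; dropping either x or x^2 as "redundant" would make that impossible.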
Best regards,
Marius