RapidMiner

How to train linear regression model effectively?

Contributor II

I'm a second-year computer science student, and I'm trying to build a linear regression model to predict house prices (a Kaggle quest). I built my model, but the results are not impressive at all.

 

First, I ran a process to examine the relationships between attributes with the Correlation Matrix operator, which gave me a good grasp of how they relate and which ones I might manipulate later. Then I selected some appropriate attributes based on a mixture of common sense and the results from the correlation matrix. After that, I imputed the missing values, using an Optimize Parameters operator (with a nested Cross Validation operator and k-NN) to find the best k value. Next, I detected and removed outliers.

 

Afterward, I wired up a Cross Validation operator with an ensemble model inside: SVM, Deep Learning, Gradient Boosted Trees, and k-NN as base learners, with Linear Regression as the stacking model learner.

 

However, the results did not seem promising. I ran a few tests, and the RMSE was always around 26000 to 27000, which makes me think my approach may be wrong.

 

Can anyone look at my model and advise?

 

Attributes Relation Process

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.5.001" expanded="true" height="68" name="Retrieve Modified Train" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//My First Prediction/MiniProject/Modified Train"/>
      </operator>
      <operator activated="false" class="multiply" compatibility="7.5.001" expanded="true" height="68" name="Multiply" width="90" x="45" y="187"/>
      <operator activated="false" class="select_attributes" compatibility="7.5.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="238">
        <parameter key="attribute_filter_type" value="value_type"/>
        <parameter key="value_type" value="nominal"/>
      </operator>
      <operator activated="false" class="nominal_to_numerical" compatibility="7.5.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="313" y="238">
        <list key="comparison_groups"/>
      </operator>
      <operator activated="false" class="principal_component_analysis" compatibility="7.5.001" expanded="true" height="103" name="PCA" width="90" x="447" y="238"/>
      <operator activated="true" class="correlation_matrix" compatibility="7.5.001" expanded="true" height="103" name="Correlation Matrix" width="90" x="246" y="34"/>
      <operator activated="true" class="converters:matrix_2_example_set" compatibility="0.2.000" expanded="true" height="82" name="Matrix to ExampleSet" width="90" x="380" y="34">
        <parameter key="pairwise_list" value="true"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.5.001" expanded="true" height="103" name="Filter Examples" width="90" x="514" y="34">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="First Attribute.matches.SalePrice"/>
        </list>
      </operator>
      <connect from_op="Retrieve Modified Train" from_port="output" to_op="Correlation Matrix" to_port="example set"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
      <connect from_op="Nominal to Numerical" from_port="example set output" to_op="PCA" to_port="example set input"/>
      <connect from_op="Correlation Matrix" from_port="matrix" to_op="Matrix to ExampleSet" to_port="matrix"/>
      <connect from_op="Matrix to ExampleSet" from_port="example set" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Main Process

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.5.001" expanded="true" height="68" name="Test Data" width="90" x="45" y="289">
        <parameter key="repository_entry" value="Modified Test"/>
      </operator>
      <operator activated="true" class="nominal_to_numerical" compatibility="7.5.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="179" y="289">
        <parameter key="coding_type" value="unique integers"/>
        <list key="comparison_groups"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.5.001" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="289">
        <parameter key="condition_class" value="no_missing_attributes"/>
        <list key="filters_list"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="7.5.001" expanded="true" height="68" name="Training Data" width="90" x="45" y="34">
        <parameter key="repository_entry" value="Modified Train"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.5.001" expanded="true" height="82" name="Set Role" width="90" x="112" y="136">
        <parameter key="attribute_name" value="SalePrice"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.5.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="SalePrice|1stFlrSF|FullBath|GarageArea|GarageCars|GrLivArea|LotArea|OverallCond|OverallQual|TotRmsAbvGrd|TotalBsmtSF|YearBuilt"/>
      </operator>
      <operator activated="true" class="nominal_to_numerical" compatibility="7.5.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="246" y="136">
        <parameter key="coding_type" value="unique integers"/>
        <list key="comparison_groups"/>
      </operator>
      <operator activated="true" class="impute_missing_values" compatibility="7.5.001" expanded="true" height="68" name="Impute Missing Values" width="90" x="313" y="34">
        <process expanded="true">
          <operator activated="true" class="optimize_parameters_grid" compatibility="7.5.001" expanded="true" height="124" name="Optimize Parameters (Grid)" width="90" x="246" y="34">
            <list key="parameters">
              <parameter key="k-NN.k" value="[1;10;10;linear]"/>
            </list>
            <process expanded="true">
              <operator activated="true" class="concurrency:cross_validation" compatibility="7.5.001" expanded="true" height="145" name="Cross Validation (2)" width="90" x="179" y="34">
                <process expanded="true">
                  <operator activated="true" class="k_nn" compatibility="7.5.001" expanded="true" height="82" name="k-NN" width="90" x="112" y="34">
                    <parameter key="k" value="10"/>
                    <parameter key="weighted_vote" value="true"/>
                  </operator>
                  <connect from_port="training set" to_op="k-NN" to_port="training set"/>
                  <connect from_op="k-NN" from_port="model" to_port="model"/>
                  <portSpacing port="source_training set" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                  <portSpacing port="sink_through 1" spacing="0"/>
                </process>
                <process expanded="true">
                  <operator activated="true" class="apply_model" compatibility="7.5.001" expanded="true" height="82" name="Apply Model (3)" width="90" x="45" y="34">
                    <list key="application_parameters"/>
                  </operator>
                  <operator activated="true" class="performance" compatibility="7.5.001" expanded="true" height="82" name="Performance (2)" width="90" x="179" y="34"/>
                  <connect from_port="model" to_op="Apply Model (3)" to_port="model"/>
                  <connect from_port="test set" to_op="Apply Model (3)" to_port="unlabelled data"/>
                  <connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
                  <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
                  <connect from_op="Performance (2)" from_port="example set" to_port="test set results"/>
                  <portSpacing port="source_model" spacing="0"/>
                  <portSpacing port="source_test set" spacing="0"/>
                  <portSpacing port="source_through 1" spacing="0"/>
                  <portSpacing port="sink_test set results" spacing="0"/>
                  <portSpacing port="sink_performance 1" spacing="0"/>
                  <portSpacing port="sink_performance 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="input 1" to_op="Cross Validation (2)" to_port="example set"/>
              <connect from_op="Cross Validation (2)" from_port="model" to_port="result 1"/>
              <connect from_op="Cross Validation (2)" from_port="performance 1" to_port="performance"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
              <portSpacing port="sink_result 1" spacing="0"/>
              <portSpacing port="sink_result 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="false" class="k_nn" compatibility="7.5.001" expanded="true" height="82" name="k-NN (2)" width="90" x="246" y="340">
            <parameter key="k" value="3"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
          </operator>
          <connect from_port="example set source" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
          <connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_port="model sink"/>
          <portSpacing port="source_example set source" spacing="0"/>
          <portSpacing port="sink_model sink" spacing="0"/>
          <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="206" y="194">Optimise kNN, only improve by a bit though</description>
        </process>
      </operator>
      <operator activated="true" class="detect_outlier_distances" compatibility="7.5.001" expanded="true" height="82" name="Detect Outlier (Distances)" width="90" x="380" y="136">
        <parameter key="number_of_outliers" value="20"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.5.001" expanded="true" height="103" name="Remove Outliers" width="90" x="447" y="34">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="outlier.equals.false"/>
        </list>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="7.5.001" expanded="true" height="145" name="Cross Validation" width="90" x="581" y="34">
        <process expanded="true">
          <operator activated="true" class="stacking" compatibility="7.5.001" expanded="true" height="68" name="Stacking" width="90" x="112" y="34">
            <process expanded="true">
              <operator activated="true" class="support_vector_machine_linear" compatibility="7.5.001" expanded="true" height="82" name="SVM (2)" width="90" x="313" y="34"/>
              <operator activated="true" class="h2o:deep_learning" compatibility="7.5.000" expanded="true" height="82" name="Deep Learning" width="90" x="179" y="85">
                <parameter key="activation" value="ExpRectifier"/>
                <enumeration key="hidden_layer_sizes">
                  <parameter key="hidden_layer_sizes" value="50"/>
                  <parameter key="hidden_layer_sizes" value="50"/>
                </enumeration>
                <enumeration key="hidden_dropout_ratios"/>
                <parameter key="compute_variable_importances" value="true"/>
                <parameter key="missing_values_handling" value="Skip"/>
                <list key="expert_parameters"/>
                <list key="expert_parameters_"/>
              </operator>
              <operator activated="true" class="h2o:gradient_boosted_trees" compatibility="7.5.000" expanded="true" height="103" name="Gradient Boosted Trees" width="90" x="112" y="187">
                <list key="expert_parameters"/>
              </operator>
              <operator activated="true" class="k_nn" compatibility="7.5.001" expanded="true" height="82" name="k-NN (3)" width="90" x="45" y="289">
                <parameter key="k" value="3"/>
              </operator>
              <connect from_port="training set 1" to_op="SVM (2)" to_port="training set"/>
              <connect from_port="training set 2" to_op="Deep Learning" to_port="training set"/>
              <connect from_port="training set 3" to_op="Gradient Boosted Trees" to_port="training set"/>
              <connect from_port="training set 4" to_op="k-NN (3)" to_port="training set"/>
              <connect from_op="SVM (2)" from_port="model" to_port="base model 1"/>
              <connect from_op="Deep Learning" from_port="model" to_port="base model 2"/>
              <connect from_op="Gradient Boosted Trees" from_port="model" to_port="base model 3"/>
              <connect from_op="k-NN (3)" from_port="model" to_port="base model 4"/>
              <portSpacing port="source_training set 1" spacing="0"/>
              <portSpacing port="source_training set 2" spacing="0"/>
              <portSpacing port="source_training set 3" spacing="0"/>
              <portSpacing port="source_training set 4" spacing="0"/>
              <portSpacing port="source_training set 5" spacing="0"/>
              <portSpacing port="sink_base model 1" spacing="0"/>
              <portSpacing port="sink_base model 2" spacing="0"/>
              <portSpacing port="sink_base model 3" spacing="0"/>
              <portSpacing port="sink_base model 4" spacing="0"/>
              <portSpacing port="sink_base model 5" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="linear_regression" compatibility="7.5.001" expanded="true" height="103" name="Linear Regression" width="90" x="179" y="34"/>
              <connect from_port="stacking examples" to_op="Linear Regression" to_port="training set"/>
              <connect from_op="Linear Regression" from_port="model" to_port="stacking model"/>
              <portSpacing port="source_stacking examples" spacing="0"/>
              <portSpacing port="sink_stacking model" spacing="0"/>
            </process>
          </operator>
          <connect from_port="training set" to_op="Stacking" to_port="training set"/>
          <connect from_op="Stacking" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="7.5.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance_regression" compatibility="7.5.001" expanded="true" height="82" name="Performance" width="90" x="179" y="34"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <connect from_op="Performance" from_port="example set" to_port="test set results"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.5.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="581" y="187">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Test Data" from_port="output" to_op="Nominal to Numerical (2)" to_port="example set input"/>
      <connect from_op="Nominal to Numerical (2)" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Training Data" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
      <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Impute Missing Values" to_port="example set in"/>
      <connect from_op="Impute Missing Values" from_port="example set out" to_op="Detect Outlier (Distances)" to_port="example set input"/>
      <connect from_op="Detect Outlier (Distances)" from_port="example set output" to_op="Remove Outliers" to_port="example set input"/>
      <connect from_op="Remove Outliers" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
      <connect from_op="Cross Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Cross Validation" from_port="performance 1" to_port="result 1"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="63"/>
      <portSpacing port="sink_result 2" spacing="63"/>
      <portSpacing port="sink_result 3" spacing="126"/>
    </process>
  </operator>
</process>

 

1 REPLY
Moderator

Re: How to train linear regression model effectively?

My initial thought is that you might have to take another look at your data set, break it up into subsets, train multiple models, make the predictions separately, and then append them into one data set.
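To make the idea concrete, here is a rough sketch in plain Python rather than RapidMiner. It assumes the subsets are defined by a zoning value and that each subset gets its own one-feature least-squares line. The zone codes, areas, and prices are made up for illustration, and `fit_per_group` and `predict` are hypothetical helper names.

```python
from collections import defaultdict

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x on one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope

def fit_per_group(rows):
    """Train one model per zone. rows holds (zone, area, price) tuples."""
    groups = defaultdict(list)
    for zone, area, price in rows:
        groups[zone].append((area, price))
    return {
        zone: fit_line([a for a, _ in pts], [p for _, p in pts])
        for zone, pts in groups.items()
    }

def predict(models, rows):
    """Score each (zone, area) row with its own zone's model, in input order."""
    preds = []
    for zone, area in rows:
        intercept, slope = models[zone]
        preds.append(intercept + slope * area)
    return preds

# Made-up toy data: (zone, living area in SF, sale price)
train = [
    ("RL", 1000, 150000), ("RL", 2000, 250000),
    ("RM", 1000, 100000), ("RM", 2000, 160000),
]
models = fit_per_group(train)
predictions = predict(models, [("RL", 1500), ("RM", 1500)])
print(predictions)
```

The same area gets a different predicted price in each zone, which is the effect a single pooled model would average away. A zone that appears only in the test data would need a fallback model, which this sketch does not handle.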

 

From what I know about the real estate (RE) market, zoning is critical, as are square footage (SF) and price per square foot ($/SF). You probably want to loop across the zoning subsets and see whether the RMSE improves or gets worse. Additionally, you might need to generate a few new features, such as $/SF or even the difference between the year built and the year remodeled. The other nominal data should be converted with dummy coding in the Nominal to Numerical operators. Unique integers implies an order that the categories do not really have, and because your training and test sets go through separate conversions, it can screw up your test set.
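A small sketch of why the coding type matters, again in plain Python and not RapidMiner itself. It assumes integer codes are handed out in order of first appearance, which is an assumption made for illustration and not a statement about how the Nominal to Numerical operator works internally. The zone values are made up.

```python
def unique_integer_coding(values):
    """Assign each category an integer in order of first appearance (assumed)."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping

def dummy_coding(values, categories):
    """One 0/1 column per category, using a fixed category list."""
    return [[1 if v == c else 0 for c in categories] for v in values]

train_zone = ["RL", "RM", "RL", "FV"]
test_zone = ["RM", "FV", "RL"]

# Coding each set separately gives the same category different numbers.
_, train_map = unique_integer_coding(train_zone)
_, test_map = unique_integer_coding(test_zone)
print(train_map)  # {'RL': 0, 'RM': 1, 'FV': 2}
print(test_map)   # {'RM': 0, 'FV': 1, 'RL': 2}

# Dummy coding with the category list taken from the training data
# keeps the columns aligned and imposes no order.
categories = sorted(set(train_zone))
test_dummies = dummy_coding(test_zone, categories)
print(categories)    # ['FV', 'RL', 'RM']
print(test_dummies)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

With integer codes, a model trained on the first mapping would read "RM" in the test set as if it were "RL". Dummy columns built from one shared category list avoid both that mismatch and the artificial ordering.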

 

Nice optimization inside the Impute Missing Values operator. However, you should put a Normalize operator before the k-NN, because k-NN is susceptible to scaling problems. The neat thing about RapidMiner's Cross Validation is that you can put the Normalize on the training side and use a Group Models operator to pass the models to the testing side in order. This way, the training data gets normalized first, which produces a preprocessing model; the k-NN model gets built on the transformed data; and then the preprocessing model gets passed to the testing side, where it converts the test data with the training set's mean before the k-NN model is applied and tested for performance.

[Image: Normalize.png]
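Here is roughly what that Normalize plus Group Models arrangement does, sketched in plain Python. It assumes a z-transformation and a k-NN that averages the labels of the nearest rows. The function names and the toy numbers are made up for illustration.

```python
import math

def fit_z_transform(rows):
    """Learn per-column mean and standard deviation from the training rows only."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def apply_z_transform(rows, stats):
    """Apply previously learned statistics; nothing is refitted here."""
    means, stds = stats
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def knn_predict(train_x, train_y, query, k):
    """Average the labels of the k nearest training rows (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return sum(train_y[i] for i in order[:k]) / k

# Made-up rows: [living area in SF, overall quality 1-10]
train_x = [[1000, 9], [1090, 3], [3000, 9], [3000, 3]]
train_y = [180000, 90000, 300000, 200000]
test_x = [[1100, 9]]

# Training side: fit the preprocessing model, then build k-NN on scaled data.
stats = fit_z_transform(train_x)
train_scaled = apply_z_transform(train_x, stats)
# Testing side: reuse the training statistics before applying k-NN.
test_scaled = apply_z_transform(test_x, stats)

raw = knn_predict(train_x, train_y, test_x[0], k=1)
scaled = knn_predict(train_scaled, train_y, test_scaled[0], k=1)
print(raw, scaled)  # 90000.0 180000.0
```

Without scaling, the area column dominates the distance, so the nearest neighbour is the low-quality house of similar size. After scaling with the training statistics, quality counts as well and the nearest neighbour is the high-quality house.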

 

 

 

The same applies to the Nominal to Numerical conversions. I checked out your Stacking operator, and I think the various algorithms in there could benefit from optimization. For now, I would forget the testing set and just work on the model to get the RMSE down. You can of course optimize for the RMSE too, so I would try that.
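As a rough illustration of optimizing for the RMSE, the plain-Python sketch below scores a few k values for a one-feature k-NN with leave-one-out validation and keeps the one with the lowest RMSE. The data and the helper names are made up, and leave-one-out is swapped in here only to keep the example short; it is not what the Cross Validation operator in your process does by default.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def knn_predict(xs, ys, query, k):
    """Average the labels of the k training points closest to the query."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - query))
    return sum(ys[i] for i in order[:k]) / k

def leave_one_out_rmse(xs, ys, k):
    """Predict each point from all the others and score the result."""
    preds = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        preds.append(knn_predict(rest_x, rest_y, xs[i], k))
    return rmse(ys, preds)

# Made-up toy data: living area in SF and sale price
xs = [1000, 1200, 1400, 1600, 1800, 2000]
ys = [100000, 125000, 138000, 162000, 181000, 199000]

scores = {k: leave_one_out_rmse(xs, ys, k) for k in (1, 2, 3)}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

The RMSE is in the same unit as the label, so a value of 26000 means the predictions are off by roughly that many dollars on a typical house.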