"Question about prediction values and applying weights from results"

ElPato · March 2010

To begin with, I am very new to data mining and RapidMiner. I have been experimenting with the software for several months now, and it has proven to be a fantastic and powerful learning and exploration tool. Thank you for your wonderful efforts on this software application!

I do have a couple of questions, though, about how to accurately replicate the prediction values I am receiving. I am performing both classification and regression tests on labelled data. Overall, I am running experiments where I:
1. Optimize the number of attributes (through either PCA or the genetic/evolutionary "Optimize Selection"). I typically normalize the resulting weights to either 0 or 1 and pass the attributes with a weight of 1 on to the next processing step.
2. Run the same data set with the selected attributes through the same learner used inside "Optimize Selection" (typically SVM) in order to obtain the weights/model for the data with the selected attributes.
3. Apply these weights/model to a new set of unseen data containing just the selected attributes and obtain the performance of the model on that data.

When the test is complete, I view the data set, which displays the selected attributes along with the label value and the predicted value. I also view the weights of the selected attributes. In an effort to replicate the predicted value, I basically perform matrix multiplication with the transposed weight matrix and the attribute value matrix. However, the values I obtain this way are usually nowhere near the predicted value that is displayed. This happens for both the classification and the regression problems. I also realize that the various learners often have a bias term, which I add to or subtract from the values I calculate. Even so, the results are still not near the predicted values. It seems like this should be pretty straightforward, so I know I am missing something.

Is there anyone who might be able to explain how to apply the weights obtained from the various learners in order to obtain accurate prediction values, especially for binominal classification?

Thanks in advance to anyone able to assist!

David

haddock · March 2010

Hi David,

I follow what you are doing up to...

"In an effort to replicate the predicted value, I basically perform matrix multiplication with the transposed weight matrix and the attribute value matrix."

Could you post the XML of your process so that I can see what you mean?

PS. The "Create Formula" operator can handle binominal SVMs, like this...

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Root">
    <description>For many learning tasks, Support Vector Machines are among the best suited learning schemes. They adapt the idea of structural risk minimization and allow for non-linear generalizations with the help of kernel functions.</description>
    <process expanded="true" height="391" width="915">
      <operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="45" y="75">
        <parameter key="target_function" value="random classification"/>
      </operator>
      <operator activated="true" class="support_vector_machine" expanded="true" height="112" name="SVM" width="90" x="179" y="75"/>
      <operator activated="true" class="create_formula" expanded="true" height="76" name="Create Formula" width="90" x="313" y="30"/>
      <connect from_op="Generate Data" from_port="output" to_op="SVM" to_port="training set"/>
      <connect from_op="SVM" from_port="model" to_op="Create Formula" to_port="model"/>
      <connect from_op="SVM" from_port="estimated performance" to_port="result 3"/>
      <connect from_op="Create Formula" from_port="formula" to_port="result 1"/>
      <connect from_op="Create Formula" from_port="model" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

ElPato · March 2010

Thank you for the reply, haddock!

I realize now that, in my post, I tried to sound a lot more educated about data mining than I really am. Sorry about that.

However, you are definitely correct that I want to obtain some formula for reproducing the predicted values seen in RapidMiner. I added the "Create Formula" operator to my process and had it write out the formula for an SVM regression model. The resulting file listed a different formula, with different attribute coefficients, for every single instance I was testing/evaluating! The file was huge. Is this correct? Simply put, I am assuming the SVM regression model is very similar to a linear regression model, where there is a single formula: you plug the attribute values into the formula, multiply them by the corresponding coefficients, add some offset ... and there is your predicted value. Is this not the case?

Below is some of the XML for the process. I didn't include everything because there is a lot. Thanks again for your assistance!

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
      <location/>
      <location/>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Root">
    <parameter key="random_seed" value="1976"/>
    <parameter key="parallelize_main_process" value="true"/>
    <process expanded="true" height="758" width="882">
      <operator activated="true" class="read_database" expanded="true" height="60" name="Read Database" width="90" x="45" y="30">
        <parameter key="connection" value="DataWarehouse"/>
        <parameter key="query" value="blah"/>
      </operator>
      <operator activated="true" class="set_role" expanded="true" height="76" name="Set Role (2)" width="90" x="45" y="255">
        <parameter key="name" value="NET_PROFIT"/>
        <parameter key="target_role" value="label"/>
      </operator>
      <operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes (2)" width="90" x="179" y="255">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="NET_PROFIT|USER1|USER2|USER3|USER4|USER7|USER8|USER9|USER10|USER11|USER12|USER13|USER18|USER19|USER21|USER32|USER33|USER34|USER39|USER40|USER41|USER42|USER43|USER47|USER48|USER49|USER50|USER51|USER52|USER53|USER54|USER55|USER56"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values (2)" width="90" x="313" y="255">
        <list key="columns"/>
      </operator>
      <operator activated="true" class="optimize_selection_evolutionary" expanded="true" height="94" name="Optimize Selection (Evolutionary)" width="90" x="246" y="30">
        <parameter key="maximum_number_of_generations" value="20"/>
        <parameter key="parallelize_evaluation_process" value="true"/>
        <process expanded="true" height="758" width="882">
          <operator activated="true" class="support_vector_machine" expanded="true" height="112" name="SVM (2)" width="90" x="246" y="75">
            <parameter key="kernel_type" value="radial"/>
            <parameter key="max_iterations" value="100"/>
          </operator>
          <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model (2)" width="90" x="447" y="75">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" expanded="true" height="76" name="Performance (2)" width="90" x="673" y="64"/>
          <connect from_port="example set" to_op="SVM (2)" to_port="training set"/>
          <connect from_op="SVM (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_op="SVM (2)" from_port="exampleSet" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
          <connect from_op="Performance (2)" from_port="performance" to_port="performance"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="select_by_weights" expanded="true" height="94" name="Select by Weights" width="90" x="380" y="30"/>
      <operator activated="true" class="support_vector_machine" expanded="true" height="112" name="SVM" width="90" x="514" y="30">
        <parameter key="max_iterations" value="1000"/>
      </operator>
      <operator activated="true" class="read_database" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="435">
        <parameter key="connection" value="DataWarehouse"/>
        <parameter key="query" value="blah"/>
      </operator>
      <operator activated="true" class="set_role" expanded="true" height="76" name="Set Role (3)" width="90" x="45" y="570">
        <parameter key="name" value="NET_PROFIT"/>
        <parameter key="target_role" value="label"/>
      </operator>
      <operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes (3)" width="90" x="179" y="570">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="NET_PROFIT|USER1|USER2|USER3|USER4|USER7|USER8|USER9|USER10|USER11|USER12|USER13|USER18|USER19|USER21|USER32|USER33|USER34|USER39|USER40|USER41|USER42|USER43|USER47|USER48|USER49|USER50|USER51|USER52|USER53|USER54|USER55|USER56"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values (3)" width="90" x="313" y="570">
        <list key="columns"/>
      </operator>
      <operator activated="true" class="select_by_weights" expanded="true" height="94" name="Select by Weights (2)" width="90" x="514" y="480"/>
      <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="648" y="480">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="create_formula" expanded="true" height="76" name="Create Formula" width="90" x="514" y="660"/>
      <operator activated="true" class="write" expanded="true" height="60" name="Write" width="90" x="740" y="624">
        <parameter key="object_file" value="C:\formula.ioo"/>
        <parameter key="output_type" value="XML"/>
      </operator>
      <operator activated="true" class="performance" expanded="true" height="76" name="Performance" width="90" x="648" y="255"/>
      <connect from_op="Read Database" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
      <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Replace Missing Values (2)" to_port="example set input"/>
      <connect from_op="Replace Missing Values (2)" from_port="example set output" to_op="Optimize Selection (Evolutionary)" to_port="example set in"/>
      <connect from_op="Optimize Selection (Evolutionary)" from_port="example set out" to_op="Select by Weights" to_port="example set input"/>
      <connect from_op="Optimize Selection (Evolutionary)" from_port="weights" to_op="Select by Weights" to_port="weights"/>
      <connect from_op="Select by Weights" from_port="example set output" to_op="SVM" to_port="training set"/>
      <connect from_op="Select by Weights" from_port="weights" to_op="Select by Weights (2)" to_port="weights"/>
      <connect from_op="SVM" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="SVM" from_port="weights" to_port="result 3"/>
      <connect from_op="Read Database (2)" from_port="output" to_op="Set Role (3)" to_port="example set input"/>
      <connect from_op="Set Role (3)" from_port="example set output" to_op="Select Attributes (3)" to_port="example set input"/>
      <connect from_op="Select Attributes (3)" from_port="example set output" to_op="Replace Missing Values (3)" to_port="example set input"/>
      <connect from_op="Replace Missing Values (3)" from_port="example set output" to_op="Select by Weights (2)" to_port="example set input"/>
      <connect from_op="Select by Weights (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
      <connect from_op="Apply Model" from_port="model" to_op="Create Formula" to_port="model"/>
      <connect from_op="Create Formula" from_port="formula" to_op="Write" to_port="object"/>
      <connect from_op="Create Formula" from_port="model" to_port="result 4"/>
      <connect from_op="Write" from_port="object" to_port="result 5"/>
      <connect from_op="Performance" from_port="performance" to_port="result 1"/>
      <connect from_op="Performance" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
      <portSpacing port="sink_result 6" spacing="0"/>
    </process>
  </operator>
</process>

land · April 2010

Hi,
the SVM regression model only collapses to a solution similar to linear regression if you use the linear kernel. With any other kernel, the prediction depends on the kernel matrix, and hence the factors change with each test example. Otherwise the SVM wouldn't be so much more flexible than linear regression, would it?
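
To make the difference concrete, here is a minimal sketch in plain Python. It assumes the textbook kernel expansion f(x) = sum_i alpha_i * K(sv_i, x) + b. The support vectors, alpha values, bias, gamma and all function names are made up for illustration; they are not taken from RapidMiner or from the process above.

```python
import math

def linear_kernel(u, v):
    # Plain dot product.
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=1.0):
    # Radial kernel: exp(-gamma * squared Euclidean distance).
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def svm_predict(x, support_vectors, alphas, bias, kernel):
    # Kernel expansion: f(x) = sum_i alpha_i * K(sv_i, x) + bias.
    return sum(a * kernel(sv, x) for sv, a in zip(support_vectors, alphas)) + bias

def collapsed_weights(support_vectors, alphas):
    # Only meaningful for the linear kernel: w = sum_i alpha_i * sv_i.
    n = len(support_vectors[0])
    return [sum(a * sv[j] for sv, a in zip(support_vectors, alphas)) for j in range(n)]

# Hypothetical model parameters and one test example.
support_vectors = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
alphas = [0.5, -1.0, 0.25]
bias = 0.1
x = [2.0, 1.0]

w = collapsed_weights(support_vectors, alphas)

# Linear kernel: the expansion and the single formula w . x + bias agree.
linear_via_kernel = svm_predict(x, support_vectors, alphas, bias, linear_kernel)
linear_via_weights = linear_kernel(w, x) + bias

# Radial kernel: no fixed weight vector reproduces this value for every x.
rbf_prediction = svm_predict(x, support_vectors, alphas, bias, rbf_kernel)

print(linear_via_kernel, linear_via_weights, rbf_prediction)
```

With the linear kernel the two routes give the same number, which is the "single formula" case. With the radial kernel the prediction has to be computed from the support vectors for each test example, so multiplying attribute values by one weight vector will generally not match it.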

Greetings,
Sebastian
