Feature Selection

MuehliMan · June 2010

Hi,

I want to run a backward and/or forward selection in RM5. Unfortunately, the workflow from the samples (04.../09) does not work when opened, so I would like to ask for an example workflow of a forward and/or backward feature selection. If I understand correctly, I have to define a performance evaluation inside the Feature Selection operator, correct?
Would it be possible to do something like an F-value calculation for each added descriptor?

Best regards,
Markus

haddock · June 2010

Hi,

Sadly, 04/09 has problems, as you point out. But if you want to do backward and/or forward selection, isn't 04/10 of any use?

MuehliMan · June 2010

Here is my basic workflow:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Global">
    <parameter key="logfile" value="advanced1.log"/>
    <process expanded="true" height="766" width="1254">
      <operator activated="true" class="read_excel" expanded="true" height="60" name="Read Excel (2)" width="90" x="45" y="120">
        <parameter key="excel_file" value="c:\Data\spreedsheet.xls"/>
        <list key="annotations"/>
      </operator>
      <operator activated="true" class="set_role" expanded="true" height="76" name="Set Role (2)" width="90" x="179" y="120">
        <parameter key="name" value="ID"/>
        <parameter key="target_role" value="id"/>
      </operator>
      <operator activated="true" class="set_role" expanded="true" height="76" name="Set Role (3)" width="90" x="313" y="120">
        <parameter key="name" value="target"/>
        <parameter key="target_role" value="label"/>
      </operator>
      <operator activated="true" class="optimize_selection" expanded="true" height="94" name="FS" width="90" x="581" y="30">
        <process expanded="true" height="668" width="1094">
          <operator activated="true" class="x_validation" expanded="true" height="112" name="XValidation" width="90" x="45" y="75">
            <parameter key="number_of_validations" value="5"/>
            <parameter key="sampling_type" value="shuffled sampling"/>
            <process expanded="true" height="668" width="522">
              <operator activated="true" class="linear_regression" expanded="true" height="76" name="Linear Regression (2)" width="90" x="179" y="120"/>
              <connect from_port="training" to_op="Linear Regression (2)" to_port="training set"/>
              <connect from_op="Linear Regression (2)" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="668" width="522">
              <operator activated="true" class="apply_model" expanded="true" height="76" name="Applier" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_regression" expanded="true" height="76" name="Performance" width="90" x="179" y="120">
                <parameter key="root_mean_squared_error" value="false"/>
                <parameter key="squared_error" value="true"/>
              </operator>
              <connect from_port="model" to_op="Applier" to_port="model"/>
              <connect from_port="test set" to_op="Applier" to_port="unlabelled data"/>
              <connect from_op="Applier" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="log" expanded="true" height="76" name="ProcessLog" width="90" x="581" y="120">
            <parameter key="filename" value="c:\data\fs_forward.log"/>
            <list key="log">
              <parameter key="generation" value="operator.FS.value.generation"/>
              <parameter key="performance" value="operator.FS.value.performance"/>
            </list>
          </operator>
          <connect from_port="example set" to_op="XValidation" to_port="training"/>
          <connect from_op="XValidation" from_port="averagable 1" to_op="ProcessLog" to_port="through 1"/>
          <connect from_op="ProcessLog" from_port="through 1" to_port="performance"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="write_weights" expanded="true" height="60" name="Write Weights" width="90" x="715" y="165">
        <parameter key="attribute_weights_file" value="c:\datar\fs_weights.wgt"/>
      </operator>
      <connect from_op="Read Excel (2)" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Set Role (3)" to_port="example set input"/>
      <connect from_op="Set Role (3)" from_port="example set output" to_op="FS" to_port="example set in"/>
      <connect from_op="FS" from_port="weights" to_op="Write Weights" to_port="input"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

I have some questions about this:

1) What are "good" learners to use in the training, and which ones should I avoid? Is Linear Regression bad for feature selection? Why is KMeans used in the online tutorial?

2) Can I set up the feature selection so that each generation has to improve the performance by some amount (let's say 0.05 R²), and otherwise it stops?

3) Is there a way to write out each selected descriptor right after the run is over (an on-the-fly log file, so to speak)?

Best regards,
Markus

land · June 2010

Hi Markus,
besides the samples (which are partly outdated), I would recommend using the explicit Forward Attribute Selection operator. It is much more efficient than the old Attribute Selection operator and offers exactly what you are looking for: a detailed definition of stopping criteria.
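
To make the idea of a stopping criterion concrete, here is a minimal sketch in plain Python. It is not RapidMiner's implementation and does not use its API. It only illustrates greedy forward selection with a 5-fold cross-validated linear regression, stopping as soon as the best candidate attribute improves the error by less than a chosen fraction. The function names, the `min_rel_gain` threshold (relative reduction in mean squared error rather than an absolute R² gain) and the synthetic data are all assumptions made for illustration. Each accepted attribute is reported through the `log` callback the moment it is chosen, which is one way to get an on-the-fly log.

```python
import random

def fit_linear(X, y):
    """Least-squares fit with intercept, solved via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [A^T A | A^T y]; the tiny ridge term is an assumption
    # added here only to keep the system invertible.
    M = [[sum(r[i] * r[j] for r in rows) + (1e-9 if i == j else 0.0)
          for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda k: abs(M[k][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]

def cv_mse(X, y, cols, k=5, seed=0):
    """Mean squared error of a linear model on columns `cols`, k-fold CV."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)  # shuffled sampling, same folds every call
    total = 0.0
    for fold in range(k):
        test = idx[fold::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        w = fit_linear([[X[i][c] for c in cols] for i in train],
                       [y[i] for i in train])
        for i in test:
            pred = w[0] + sum(wj * X[i][c] for wj, c in zip(w[1:], cols))
            total += (pred - y[i]) ** 2
    return total / len(y)

def forward_select(X, y, names, min_rel_gain=0.05, log=print):
    """Add one attribute per round; stop when the relative gain is too small."""
    selected, remaining = [], list(range(len(names)))
    best = cv_mse(X, y, [])  # baseline: intercept-only model
    while remaining:
        score, col = min((cv_mse(X, y, selected + [c]), c) for c in remaining)
        if best - score < min_rel_gain * best:
            break  # best candidate does not improve enough
        selected.append(col)
        remaining.remove(col)
        best = score
        log(f"round {len(selected)}: added {names[col]}, cv_mse={score:.4f}")
    return [names[c] for c in selected]

# Synthetic demo data (hypothetical): only x0 and x2 influence the label.
rng = random.Random(1)
names = ["x0", "x1", "x2", "x3"]
X = [[rng.gauss(0, 1) for _ in names] for _ in range(200)]
y = [3 * row[0] - 2 * row[2] + rng.gauss(0, 0.1) for row in X]
chosen = forward_select(X, y, names)
print("selected:", chosen)
```

Passing a different `log` callable, for example one that appends each line to a file, would write the selection history while the search is still running.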

There is no general answer to the question of what a good learner is. It depends on your task, on your data and, last but not least, on your patience. Linear regression is a relatively fast learner, and since the learner is applied for each attribute in each round, it should be fast. Whether it suits your data cannot be said in advance. Just try it, and swap in another learner later on to compare. RapidMiner is designed for this kind of experimenting...

What do you mean by descriptor?

Greetings,
Sebastian
