Forward Selection confused result (need answer please)

talebmuhsin Member Posts: 4 Contributor I
Hello Everyone

I am trying to use Forward Selection to select the best attributes from the iris dataset, and the subset (a1, a3, a4) was selected. However, when I look at the performance of each subset in the log file, I can see that even the subset with all 4 features reaches a performance of 1, as shown below:
Number of features    Performance    Attribute names
4.0                   0.86           a3, a4, a1, a2
4.0                   0.93           a3, a4, a1, a2
4.0                   0.93           a3, a4, a1, a2
4.0                   1.0            a3, a4, a1, a2
4.0                   1.0            a3, a4, a1, a2
4.0                   0.86           a3, a4, a1, a2
4.0                   0.93           a3, a4, a1, a2
4.0                   1.0            a3, a4, a1, a2

But it still selects only three features:
Number of features    Performance    Attribute names
3.0                   1.0            a3, a4, a1

Can anybody explain to me why it selects only three features, rather than 4 or 2, although the performance is 1?

I have a real-world dataset with 896 attributes and the same thing happens there: only the best 3 are selected.

Can anybody please tell me whether I am doing some steps wrong or whether this is correct? I am very confused.

Best Regards,

Taleb

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Taleb,

    can you please post your process setup so that I can have a look at it? We have not observed such behavior so far, and Forward Selection should always select the best subset...

    Best regards,
    Marius
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    One idea that comes to mind is that you logged the performance of the Performance operator rather than that of the X-Validation. The reasoning for that is described here: http://rapid-i.com/rapidforum/index.php/topic,6599.msg23304.html#msg23304

    Best regards,
    Marius
  • talebmuhsin Member Posts: 4 Contributor I
    Hi Marius

    Thanks for the reply. Here is my process, so you can see whether I am doing it right or wrong:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.013">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="5.3.013" expanded="true" height="60" name="Retrieve Sonar" width="90" x="45" y="75">
            <parameter key="repository_entry" value="//Samples/data/Sonar"/>
          </operator>
          <operator activated="true" class="optimize_selection_forward" compatibility="5.3.013" expanded="true" height="94" name="Forward Selection" width="90" x="246" y="75">
            <process expanded="true">
              <operator activated="true" class="x_validation" compatibility="5.1.002" expanded="true" height="112" name="Validation" width="90" x="112" y="75">
                <description>A cross validation including a linear regression.</description>
                <parameter key="average_performances_only" value="false"/>
                <process expanded="true">
                  <operator activated="true" class="k_nn" compatibility="5.3.013" expanded="true" height="76" name="k-NN" width="90" x="112" y="30">
                    <parameter key="k" value="9"/>
                  </operator>
                  <connect from_port="training" to_op="k-NN" to_port="training set"/>
                  <connect from_op="k-NN" from_port="model" to_port="model"/>
                  <portSpacing port="source_training" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                  <portSpacing port="sink_through 1" spacing="0"/>
                </process>
                <process expanded="true">
                  <operator activated="true" class="apply_model" compatibility="5.3.013" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                    <list key="application_parameters"/>
                  </operator>
                  <operator activated="true" class="performance" compatibility="5.3.013" expanded="true" height="76" name="Performance" width="90" x="179" y="30"/>
                  <connect from_port="model" to_op="Apply Model" to_port="model"/>
                  <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
                  <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
                  <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
                  <portSpacing port="source_model" spacing="0"/>
                  <portSpacing port="source_test set" spacing="0"/>
                  <portSpacing port="source_through 1" spacing="0"/>
                  <portSpacing port="sink_averagable 1" spacing="0"/>
                  <portSpacing port="sink_averagable 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="log" compatibility="5.3.013" expanded="true" height="76" name="Log" width="90" x="313" y="120">
                <list key="log">
                  <parameter key="Number of attributes" value="operator.Forward Selection.value.number of attributes"/>
                  <parameter key="Feature Names" value="operator.Forward Selection.value.feature_names"/>
                  <parameter key="Validation_Performance" value="operator.Validation.value.performance"/>
                </list>
              </operator>
              <connect from_port="example set" to_op="Validation" to_port="training"/>
              <connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
              <connect from_op="Log" from_port="through 1" to_port="performance"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve Sonar" from_port="output" to_op="Forward Selection" to_port="example set"/>
          <connect from_op="Forward Selection" from_port="example set" to_port="result 3"/>
          <connect from_op="Forward Selection" from_port="attribute weights" to_port="result 1"/>
          <connect from_op="Forward Selection" from_port="performance" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
    Actually, as I said in my previous post, I have a dataset consisting of 2400 samples with 896 features extracted from each. I am working on a white blood cell classification problem and have extracted various features from each cell (shape, texture, color). When I use Forward Selection to get the best subset out of the 896, it gives me only three features. I expected Forward Selection to select at least 40-50 features, but it always gives me 3.

    I have another issue: when I classify the cells using a neural network with the default settings, except for changing the number of hidden layers, I get an accuracy of 96%. Is that normal? I thought that classifying the cells based on all the features would give a low accuracy, because many features could be irrelevant. Could you please assist me in this matter?

    Best Regards,

    Taleb
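
For readers wondering how a forward selection can end up with only three attributes, here is a minimal Python sketch of a greedy forward selection that stops at the first round in which no added attribute improves the score. It is an illustration under stated assumptions, not RapidMiner's implementation: the function names, the `min_gain` parameter, and the numbers in `TOY_SCORES` are all made up for the example, and the score passed in is assumed to be the averaged cross-validation performance of a subset, not the result of a single fold.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def forward_selection(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedy forward selection that stops at the first round without improvement.

    `score` is assumed to return the averaged cross-validation performance
    of a feature subset (hypothetical interface, chosen for this sketch).
    """
    selected: List[str] = []
    best_score = float("-inf")
    remaining = list(features)

    while remaining:
        # Score every subset obtained by adding one remaining feature.
        round_score, round_feature = max(
            ((score(frozenset(selected + [f])), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        # Stop as soon as the best extension does not beat the current subset.
        # A tie (same score with one more feature) also stops the search.
        if round_score <= best_score + min_gain:
            break
        selected.append(round_feature)
        remaining.remove(round_feature)
        best_score = round_score

    return selected, best_score


# Made-up scores for illustration only; they are not results from the thread.
TOY_SCORES = {
    frozenset({"a3"}): 0.92,
    frozenset({"a4"}): 0.90,
    frozenset({"a3", "a4"}): 0.95,
    frozenset({"a1", "a3", "a4"}): 0.96,
    frozenset({"a1", "a2", "a3", "a4"}): 0.96,
}


def toy_score(subset: FrozenSet[str]) -> float:
    # Subsets not listed above get 0.0; they never matter on the greedy path.
    return TOY_SCORES.get(subset, 0.0)


selected, best = forward_selection(["a1", "a2", "a3", "a4"], toy_score)
print(selected, best)
```

In this toy run the search keeps a3, a4 and a1 and then stops, because adding a2 only ties the score of the three-attribute subset. If RapidMiner's Forward Selection is configured to stop "without increase", the same logic would explain a small final subset; as far as I recall the operator also has a `speculative_rounds` parameter that lets the search continue past such a round, but please check the operator documentation for the exact names and defaults.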