Optimal SVM parameters but very different results?

Sasch · October 2012

Hi all,
I am using the grid search parameter optimizer to determine the best parameters (C and gamma) for my SVM. The SVM is embedded in a 10-fold cross-validation.
After the process has finished, I get the parameter set and a performance of 100% (!) (see Code 1).
Code 1:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="467" width="748">
      <operator activated="true" class="read_excel" compatibility="5.2.008" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
        <parameter key="excel_file" value="D:\PSY-DATA\06_HERZRATEN_PROJEKT\HR_KlassDaten.xlsx"/>
        <parameter key="imported_cell_range" value="A1:M4901"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Probandennummer.false.integer.attribute"/>
          <parameter key="1" value="Alter.false.integer.attribute"/>
          <parameter key="2" value="Altersgruppe.false.integer.attribute"/>
          <parameter key="3" value="Geschlecht.false.integer.attribute"/>
          <parameter key="4" value="Geschlechtsfaktor.false.integer.attribute"/>
          <parameter key="5" value="Mens.false.integer.attribute"/>
          <parameter key="6" value="RMSSD(ms).true.real.attribute"/>
          <parameter key="7" value="mean_RR(ms).true.real.attribute"/>
          <parameter key="8" value="std_RR(ms).true.real.attribute"/>
          <parameter key="9" value="mean_HR.true.real.attribute"/>
          <parameter key="10" value="std_HR.true.real.attribute"/>
          <parameter key="11" value="label.false.polynominal.attribute"/>
          <parameter key="12" value="label valenz.true.polynominal.label"/>
        </list>
      </operator>
      <operator activated="true" class="optimize_parameters_grid" compatibility="5.2.008" expanded="true" height="94" name="Optimize Parameters (Grid)" width="90" x="447" y="30">
        <list key="parameters">
          <parameter key="SVM.C" value="0.03125,0.125,0.5,2,8,32,128,512,2048,8192,32768"/>
          <parameter key="SVM.gamma" value="0.000030517578125,0.00012207,0.000488281,0.001953125,0.0078125,0.03125,0.125,0.5,2,8"/>
        </list>
        <process expanded="true" height="487" width="826">
          <operator activated="true" class="x_validation" compatibility="5.1.002" expanded="true" height="112" name="Validation" width="90" x="179" y="75">
            <parameter key="average_performances_only" value="false"/>
            <process expanded="true" height="487" width="346">
              <operator activated="true" class="support_vector_machine_libsvm" compatibility="5.2.008" expanded="true" height="76" name="SVM" width="90" x="246" y="30">
                <parameter key="gamma" value="0.001953125"/>
                <parameter key="C" value="32768"/>
                <parameter key="cache_size" value="250"/>
                <list key="class_weights"/>
                <parameter key="calculate_confidences" value="true"/>
              </operator>
              <connect from_port="training" to_op="SVM" to_port="training set"/>
              <connect from_op="SVM" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="487" width="300">
              <operator activated="true" class="apply_model" compatibility="5.2.008" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="5.2.008" expanded="true" height="76" name="Performance" width="90" x="180" y="30">
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="input 1" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="averagable 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

When I actually apply the parameters I received, I only get 52.05% (see Code 2).
Code 2:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="467" width="748">
      <operator activated="true" class="read_excel" compatibility="5.2.008" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
        <parameter key="excel_file" value="D:\PSY-DATA\06_HERZRATEN_PROJEKT\HR_KlassDaten.xlsx"/>
        <parameter key="imported_cell_range" value="A1:M4901"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Probandennummer.false.integer.attribute"/>
          <parameter key="1" value="Alter.false.integer.attribute"/>
          <parameter key="2" value="Altersgruppe.false.integer.attribute"/>
          <parameter key="3" value="Geschlecht.false.integer.attribute"/>
          <parameter key="4" value="Geschlechtsfaktor.false.integer.attribute"/>
          <parameter key="5" value="Mens.false.integer.attribute"/>
          <parameter key="6" value="RMSSD(ms).true.real.attribute"/>
          <parameter key="7" value="mean_RR(ms).true.real.attribute"/>
          <parameter key="8" value="std_RR(ms).true.real.attribute"/>
          <parameter key="9" value="mean_HR.true.real.attribute"/>
          <parameter key="10" value="std_HR.true.real.attribute"/>
          <parameter key="11" value="label.false.polynominal.attribute"/>
          <parameter key="12" value="label valenz.true.polynominal.label"/>
        </list>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.2.008" expanded="true" height="112" name="Validation" width="90" x="313" y="30">
        <parameter key="average_performances_only" value="false"/>
        <process expanded="true" height="511" width="365">
          <operator activated="true" class="support_vector_machine_libsvm" compatibility="5.2.008" expanded="true" height="76" name="SVM" width="90" x="137" y="30">
            <parameter key="gamma" value="3.0517578125E-5"/>
            <parameter key="C" value="0.03125"/>
            <parameter key="cache_size" value="250"/>
            <list key="class_weights"/>
            <parameter key="calculate_confidences" value="true"/>
          </operator>
          <connect from_port="training" to_op="SVM" to_port="training set"/>
          <connect from_op="SVM" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true" height="511" width="365">
          <operator activated="true" class="apply_model" compatibility="5.2.008" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance_classification" compatibility="5.2.008" expanded="true" height="76" name="Performance" width="90" x="205" y="30">
            <list key="class_weights"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

[The code above only shows my setups for troubleshooting.]

How is that possible?
Or am I doing something wrong? (By the way: my classification task is binary and my two classes are well balanced, with 128 training vectors of 5 features each per class.)

Thanks a lot in advance,
Sasch
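
To make the size of the search in Code 1 concrete, here is a minimal Python sketch (not part of the RapidMiner process) that enumerates the grid. The value lists are copied from the `SVM.C` and `SVM.gamma` entries above; the names `C_VALUES`, `GAMMA_VALUES`, `FOLDS`, `CODE2_PARAMS` and `grid_points` are made up for illustration, and the fold count of 10 is taken from the post rather than from an explicit parameter in the XML.

```python
from itertools import product

# Grid values copied from the "Optimize Parameters (Grid)" operator in Code 1.
C_VALUES = [0.03125, 0.125, 0.5, 2, 8, 32, 128, 512, 2048, 8192, 32768]
GAMMA_VALUES = [0.000030517578125, 0.00012207, 0.000488281, 0.001953125,
                0.0078125, 0.03125, 0.125, 0.5, 2, 8]
FOLDS = 10  # assumption: "10-fold" as stated in the post

# Parameters that were entered by hand in Code 2, as (C, gamma).
CODE2_PARAMS = (0.03125, 3.0517578125e-5)


def grid_points(c_values, gamma_values):
    """Return every (C, gamma) combination, with C varying slowest."""
    return list(product(c_values, gamma_values))


points = grid_points(C_VALUES, GAMMA_VALUES)
fold_level_trainings = len(points) * FOLDS

print("grid combinations:", len(points))
print("fold-level SVM trainings:", fold_level_trainings)
print("Code 2 parameters are the first grid point:", points[0] == CODE2_PARAMS)
```

The grid has 11 × 10 = 110 combinations, so the optimizer runs 110 cross-validations with 1100 fold-level trainings in total. The parameters entered in Code 2 (C = 0.03125, gamma = 3.0517578125E-5) are the first entry of each list, i.e. the first grid point in this enumeration order.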

TonyLaing · October 2012

I'm having exactly the same problem running the grid optimization over a decision tree, on the parameters "minimal size for split", "minimal leaf size" and "maximal depth". When I actually apply the values of the parameters that the optimization tells me are optimal, my overall model precision is about 1% lower than what the optimizer told me the accuracy would have been if I had applied those parameter values.

Sasch · October 2012

Hi TonyLaing,
are you testing your model on a different dataset than the one you trained on? If so, the results may vary slightly.

See these posts (they are about parameter optimization):

http://rapid-i.com/rapidforum/index.php/topic,4034.msg14915.html#msg14915
http://rapid-i.com/rapidforum/index.php/topic,4018.msg14881.html

Maybe they'll help to solve your problem ...(?)

Greetz,
Sasch

MariusHelf · October 2012

Hi, it is normal that the performance on a test set differs a bit from the performance during evaluation. After all, both values are only estimates, and especially on small datasets the statistical fluctuations may have a visible impact.

However, an accuracy of 100% is unusual. Your processes look fine, so you should have a look at your data: how many examples are you using for the optimization? Are the classes balanced? How did you create the sample? Is it drawn from the same distribution as your test data?

Best, Marius
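
As a rough illustration of how large such statistical fluctuations can be, the sketch below computes the binomial standard error of an accuracy estimate. It assumes that each prediction is an independent trial with the same success probability, which is only an approximation for cross-validation; `binomial_standard_error` is a hypothetical helper name, and the example size of 256 is the 2 × 128 examples mentioned in the first post.

```python
import math


def binomial_standard_error(accuracy, n_examples):
    """Standard error of an accuracy estimated from n independent predictions."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


# 2 x 128 examples as described in the first post;
# 0.5 is chance level for two balanced classes.
se_at_chance = binomial_standard_error(0.5, 256)
print(f"standard error at 50% accuracy: {se_at_chance * 100:.1f} percentage points")
```

Under these assumptions, an accuracy near 50% measured on 256 examples carries an uncertainty of roughly 3 percentage points, so differences of a few points between runs are expected, while much larger gaps call for another explanation.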

Sasch · October 2012

Hi Marius,
thanks for taking the time to help me.

-Your processes look fine
=> Uff, thank god, first problem solved, that helped me a lot

- how many examples are you using for the optimization?
=> I have 256 examples, 128 examples for each class => That means the classes are perfectly balanced. I also use 10-fold-validation for accuracy estimation.

- How did you create the sample?
=> Each example consists of 5 features and a label for the condition (negative/positive). In my case, all features are derived from heart rate data (e.g. mean, std, RMSSD etc.)

- Is it drawn from the same distribution as your test data?
=> Yes.

So, when I put the optimal parameters into an SVM and train on the same data with a 10-fold cross-validation, I only get 52% accuracy (from my point of view, this result doesn't reflect the term "slightly differ").

My problem here isn't the 100% accuracy, it's that fatal drop of over 40 percentage points...

Thanks again,
Sasch

MariusHelf · October 2012

Well, but 100% accuracy always creates a kind of suspicion in the heart of a data miner.

That value does not leave much room for fluctuations across the 10 folds. Still, what is the accuracy's standard deviation in the first process?

How much data do you have in total? 256 examples is not very much; if possible, you should really increase it by a factor of 10.

Sasch · October 2012

- Well, but 100% accuracy always creates a kind of suspicion in the heart of a data miner
=> Yes, I know, we always say "god is angry" if you get a 100% result on biosignal data.

- 256 examples is not very much, if possible you should really increase it by a factor of 10
=> We are talking about biosignal data derived from humans. It's hard to get those features for exactly the conditions we examine. Our boss would send us all to hell if we increased the number of examples artificially.

- That value does not leave much room for fluctuations during the 10 folds, however, what's the accuracy's standard deviation in the first process?
=> Perhaps this helps:

http://imageshack.us/photo/my-images/3/61745688.jpg

MariusHelf · October 2012

Which columns does your data contain? Are you accidentally training on an ID attribute or something similar?
What happens if you run the optimization on the test set?

Sasch · October 2012

I use the Read Excel import wizard and import only the 6 columns I need (5 feature columns and 1 label column). I don't think I imported anything by accident, but I'll definitely check that!
What do you mean by your last question?
My second process (Code 2) runs with the optimal parameters on the same data set as the first one. I thought the 10-fold cross-validation would do the rest (splitting into test and training sets and so on)?

MariusHelf · October 2012

Well, if you used the same dataset for both optimization and testing, my last question is already answered.

On such a small dataset the splits created by the X-Validation may have a big impact. Try to set the same local random seed for all X-Validations. That won't improve your analysis, but at least it will make the results comparable, and if the processes are set up correctly, you should get exactly the same performances with equal parameter sets.
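
Why a fixed seed makes two runs comparable can be sketched in plain Python. This is not RapidMiner's implementation: `kfold_indices` is a hypothetical helper that uses simple shuffled (non-stratified) sampling, and the seed values are arbitrary.

```python
import random


def kfold_indices(n_examples, k, seed):
    """Shuffle example indices with a fixed seed and deal them into k folds."""
    rng = random.Random(seed)  # local generator, independent of global state
    indices = list(range(n_examples))
    rng.shuffle(indices)
    # Fold i receives every k-th index starting at position i.
    return [indices[i::k] for i in range(k)]


# 256 examples and 10 folds, as in the thread.
folds_a = kfold_indices(256, 10, seed=42)
folds_b = kfold_indices(256, 10, seed=42)   # same seed as folds_a
folds_c = kfold_indices(256, 10, seed=7)    # different seed

same_seed_identical = folds_a == folds_b
different_seed_identical = folds_a == folds_c

print("same seed, same folds:", same_seed_identical)
print("different seed, same folds:", different_seed_identical)
print("fold sizes:", [len(fold) for fold in folds_a])
```

With the same seed, both runs evaluate a parameter set on identical train/test partitions, so equal parameters should yield equal accuracies. With different seeds the partitions differ, and on 256 examples that alone can shift the estimate.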
