Different results with leave-one-out X-Val

Sasch · June 2010

Hello everyone,

first of all, thanks for your fantastic data mining tool RM.
I'm using version 5.0.008 and I've got a problem:

When I run an X-Validation with leave-one-out on my data set (the random seed in the main process is still 2001), I get different results depending on the local random seed setting.
I've found that when the leave-one-out option is not set, I can set and unset the "use local random seed" option.
Okay so far. When I set "use local random seed" to, let's say, 1000 and then set the leave-one-out option again, I get a result of 69% accuracy.
But if I leave "use local random seed" unset and then set the leave-one-out option again, I get about 74% accuracy.

How can that be?
It seems a bit absurd to me, as this option shouldn't even take effect once the leave-one-out option is set...(?)

Any suggestions, or am I doing something wrong?

Thanx in advance & best regards,
Sasch

land · June 2010

Hi Sasch,
would you be so kind as to provide your process? I will check it then. Please paste it into the code area using the # button.

Greetings,
Sebastian

Sasch · June 2010

Thanx a lot Sebastian,
it's another data set with other accuracies, but it shows the same effect:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Process">
    <process expanded="true" height="455" width="701">
      <operator activated="true" class="retrieve" compatibility="5.0.8" expanded="true" height="60" name="Retrieve" width="90" x="26" y="34">
        <parameter key="repository_entry" value="Prob#05_all"/>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.0.8" expanded="true" height="112" name="Validation" width="90" x="246" y="30">
        <parameter key="average_performances_only" value="false"/>
        <parameter key="leave_one_out" value="true"/>
        <parameter key="use_local_random_seed" value="true"/>
        <parameter key="local_random_seed" value="100"/>
        <process expanded="true" height="473" width="354">
          <operator activated="true" class="support_vector_machine_libsvm" compatibility="5.0.8" expanded="true" height="76" name="SVM" width="90" x="112" y="30">
            <parameter key="gamma" value="1.2207E-4"/>
            <parameter key="C" value="0.5"/>
            <parameter key="cache_size" value="250"/>
            <list key="class_weights"/>
            <parameter key="calculate_confidences" value="true"/>
          </operator>
          <connect from_port="training" to_op="SVM" to_port="training set"/>
          <connect from_op="SVM" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true" height="473" width="354">
          <operator activated="true" class="apply_model" compatibility="5.0.8" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance_classification" compatibility="5.0.8" expanded="true" height="76" name="Performance" width="90" x="179" y="30">
            <list key="class_weights"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

land · July 2010

Hi,
in fact you are right. This behavior results from the way the cross-validation sets are built: instead of treating the case x = n differently, n random sets are simply built, each consisting of one single example. The result is the same unless you are using an algorithm that incorporates randomness, as LibSVM does.
Since the XValidation then consumes the first random numbers of the global random number sequence, LibSVM behaves differently, because it receives different numbers...
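
To illustrate the effect, here is a minimal Python sketch. It is only a toy model and not RapidMiner's actual implementation: the names `run` and `learner_draws` are made up for this example, and the "learner" just returns the random numbers it would receive. It shows that when the fold building draws from the shared global generator, the learner sees a different part of the sequence than when the fold building has its own local seed.

```python
import random

def learner_draws(rng, n=3):
    """Stand-in for a randomized learner: returns the numbers it pulls from rng."""
    return [rng.random() for _ in range(n)]

def run(global_seed, use_local_seed, local_seed=100, n_examples=5):
    # One global generator, shared by everything in the (toy) process.
    global_rng = random.Random(global_seed)
    # The validation either gets its own generator or uses the shared one.
    split_rng = random.Random(local_seed) if use_local_seed else global_rng
    # Building the one-example folds consumes random numbers (here: a shuffle).
    order = list(range(n_examples))
    split_rng.shuffle(order)
    # The learner always draws from the global generator.
    return learner_draws(global_rng)

with_local = run(2001, use_local_seed=True)
without_local = run(2001, use_local_seed=False)

# Same global seed, same folds (every example is its own fold),
# but the learner receives different random numbers.
print(with_local == without_local)
```

In this toy model the set of folds is identical in both runs; only the position in the global random sequence at which the learner starts differs.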

Greetings,
Sebastian

Sasch · July 2010

Hi Sebastian,
thank you for your detailed answer. My second thought was that it depends on the SVM.
But now how shall I deal with it?
Any idea to get that behaviour out of the process?

Have a nice day,
Sasch.

land · July 2010

Hi Sasch,
this depends on what you are trying to achieve. Why does this behavior bother you anyway?

Greetings,
Sebastian

Sasch · July 2010

Hi Sebastian,
that's because my group and I are trying to achieve the best results (accuracies) in the classification of our data.
First we do a grid search for the best SVM parameters (gamma and C), and after applying these we run an x-val again.
(I know about overfitting the model, but in this case it doesn't matter...)
And at that point I noticed the effect with the leave-one-out option.
By the way, we've also got the same problem as in the thread http://rapid-i.com/rapidforum/index.php/topic,214.msg831.html#msg831, but the solution given there doesn't work at all. (But that also doesn't matter.)

So I just want to know which accuracy I should choose, because I don't know which one is the right one.
We need to know this in order to finish our study...

Thanks so much for your patience,
Sasch.

land · July 2010

Hi,
I guess it doesn't matter.

As long as you optimize without a valid performance estimation, the accuracy in a following validation would increase anyway. So go ahead with the higher value. But to get to the point: you can't say. It's just an estimation, and the differences seem to come from the randomness of the process itself... So you might repeat it several times with varying random seeds / settings and average the results to get a valid estimation...
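
A rough Python sketch of that "repeat and average" idea, outside of RapidMiner. The function `validation_accuracy` is a hypothetical placeholder for one complete validation run with a given seed; the value it returns here is made-up noise so that the sketch is runnable, not a real result.

```python
import random
import statistics

def validation_accuracy(seed):
    """Hypothetical stand-in for one full validation run with the given seed.

    Replace the body with a call that runs the real process. The returned
    number is a placeholder with artificial noise, not a measured accuracy.
    """
    rng = random.Random(seed)
    return 0.70 + rng.uniform(-0.03, 0.03)

def averaged_estimate(seeds):
    """Run the validation once per seed and return (mean, standard deviation)."""
    scores = [validation_accuracy(s) for s in seeds]
    mean = statistics.mean(scores)
    # The sample standard deviation needs at least two runs.
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread

mean_acc, spread = averaged_estimate(range(1, 11))
print(f"mean accuracy over 10 seeds: {mean_acc:.3f} (std {spread:.3f})")
```

Reporting the spread alongside the mean shows how much of the difference between two single runs could be explained by the random seed alone.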

Greetings,
Sebastian

Sasch · July 2010

Hello Sebastian,
that's a good idea. Thanks again for your answers and your suggestions.

Regards,
Sasch.
