Different results for X-Validation (libSVM) in version 4.6

alexx Member Posts: 12 Contributor II
edited June 2019 in Help
Dear community,

I am upgrading from rapidminer version 4.6 to 5 and I'm having some difficulties that I hope maybe someone can help me with.

I am using a data set consisting of 40 example set rows with 73 attributes (72 numerical + 1 numerical label). If anyone wants to reproduce the steps, here is the data in Excel format: http://jump.fm/PFMGS.

In rapidminer 4.6 I start the wizard, open x-validation with svm, import my data, and start the process. The result is 100% accuracy. Here are some screenshots: http://img696.imageshack.us/img696/2939/rapidminer4results.png

I tried to reconstruct this in rapidminer 5:
- I imported the data into my repository and created a new process
- Since the imported data was marked nominal by RM, I used the Nominal to Numerical converter on the complete dataset
- The output goes into the X-Validation module (default parameters as in RM 4.6). From there the ave output goes to the results
- Inside the Validation module it looks like this:
-- in the training module there is the libSVM module (C-SVC, rbf kernel, gamma=0, C=32, epsilon = 0.0010, same as in RM 4.6)
-- in the testing module I use Apply Model and then the Performance module (same default values as in RM 4.6)

Executing the process results in 90% accuracy. Screenshots: http://img42.imageshack.us/img42/9720/rapidminer5results.png

Did I make a mistake? Thanks for your help.
Alex

Answers

  • harri678 Member Posts: 34 Contributor II
    Hello Alex,

    did you ever try setting gamma != 0? If I understand correctly, gamma=0 means that it is effectively set to 1 / num_attributes. I would recommend fixing it to the same value in both versions for comparable results (1/72). I also noticed a difference in the random_seed parameter of the X-Validation operator, which could affect the process.
    I'm curious if this changes anything!


    Just my two cents ;)

    Greetings, Harald
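
A minimal Python sketch of the gamma convention Harald describes. It assumes, as his post does, that gamma=0 is shorthand for 1 / num_attributes; the function name `effective_gamma` is made up for illustration and is not part of RapidMiner or libSVM.

```python
def effective_gamma(gamma: float, num_attributes: int) -> float:
    """Return the kernel gamma actually used, assuming 0 means 1/num_attributes."""
    if num_attributes <= 0:
        raise ValueError("num_attributes must be positive")
    # Assumption taken from the thread: gamma == 0 is a placeholder for the default.
    return 1.0 / num_attributes if gamma == 0 else gamma


# 72 regular attributes, as in the data set described in the question.
print(effective_gamma(0.0, 72))   # the implicit default, 1/72
print(effective_gamma(0.5, 72))   # an explicit value is used unchanged
```

Setting the value explicitly in both versions removes any doubt about whether each version counts the same number of attributes when it computes the default.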
  • alexx Member Posts: 12 Contributor II
    Thanks for your reply, Harald.

    I tried different values for gamma and played around with the random seed settings. The accuracy results from versions 4.6 and 5 still differ a lot on the same input. Does anyone know why?
  • Stefan_E Member Posts: 53 Maven
    Alex,

    if you don't do an XValidation - just build one model: Does it differ? - that would implicate the learner (as opposed to the applier).

    Stefan
  • dragoljub Member Posts: 241 Contributor II
    Cross-validation results will always differ slightly, since you are randomly splitting the training set into subsets for training and validation. Unless you can ensure that the splitting is performed exactly the same way in each run, you should expect slightly different results. If you notice a huge difference, there may be something wrong.
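
The effect of the seed on the split can be sketched in a few lines of Python. This is only an illustration of seeded shuffling, not RapidMiner's sampling code: the helper `kfold_indices`, the seed values, and the choice of 10 folds are assumptions, and two different implementations need not produce the same folds even when given the same seed.

```python
import random


def kfold_indices(n_rows: int, k: int, seed: int) -> list[list[int]]:
    """Shuffle row indices with a fixed seed and deal them into k folds."""
    rng = random.Random(seed)          # local generator, independent of global state
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    return [sorted(indices[i::k]) for i in range(k)]


# 40 rows as in the data set from the question; 10 folds is an assumed setting.
folds_a = kfold_indices(40, 10, seed=1992)
folds_b = kfold_indices(40, 10, seed=1992)
folds_c = kfold_indices(40, 10, seed=7)

print(folds_a == folds_b)   # same seed, same implementation: identical folds
print(folds_a == folds_c)   # a different seed will generally give different folds
```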
  • alexx Member Posts: 12 Contributor II
    dragoljub,

    thanks for your input. If I use the same random seed parameters on both versions, I should get the same results in my understanding. Anyway, the results differ not just slightly (100% in RM 4 vs 90% in RM 5).
  • haddock Member Posts: 849 Maven
    Hi Folks,

    If you import the xls and run the following you'll see what the problem is ....
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
     <context>
       <input>
         <location/>
       </input>
       <output>
         <location/>
         <location/>
         <location/>
       </output>
       <macros/>
     </context>
     <operator activated="true" class="process" expanded="true" name="Root">
       <process expanded="true" height="296" width="915">
         <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="27" y="74">
           <parameter key="repository_entry" value="dataset"/>
         </operator>
         <operator activated="true" class="nominal_to_numerical" expanded="true" height="94" name="Nominal to Numerical" width="90" x="380" y="165"/>
         <operator activated="true" class="x_validation" expanded="true" height="112" name="XValidation" width="90" x="648" y="165">
           <parameter key="local_random_seed" value="-1"/>
           <process expanded="true" height="296" width="432">
             <operator activated="true" class="support_vector_machine_libsvm" expanded="true" height="76" name="LibSVMLearner" width="90" x="171" y="30">
               <parameter key="C" value="32.0"/>
               <list key="class_weights"/>
             </operator>
             <connect from_port="training" to_op="LibSVMLearner" to_port="training set"/>
             <connect from_op="LibSVMLearner" from_port="model" to_port="model"/>
             <portSpacing port="source_training" spacing="0"/>
             <portSpacing port="sink_model" spacing="0"/>
             <portSpacing port="sink_through 1" spacing="0"/>
           </process>
           <process expanded="true" height="296" width="432">
             <operator activated="true" class="apply_model" expanded="true" height="76" name="ModelApplier" width="90" x="45" y="30">
               <list key="application_parameters"/>
             </operator>
             <operator activated="true" class="performance" expanded="true" height="76" name="Performance" width="90" x="238" y="30"/>
             <connect from_port="model" to_op="ModelApplier" to_port="model"/>
             <connect from_port="test set" to_op="ModelApplier" to_port="unlabelled data"/>
             <connect from_op="ModelApplier" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
             <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
             <portSpacing port="source_model" spacing="0"/>
             <portSpacing port="source_test set" spacing="0"/>
             <portSpacing port="source_through 1" spacing="0"/>
             <portSpacing port="sink_averagable 1" spacing="0"/>
             <portSpacing port="sink_averagable 2" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Retrieve" from_port="output" to_op="Nominal to Numerical" to_port="example set input"/>
         <connect from_op="Nominal to Numerical" from_port="example set output" to_op="XValidation" to_port="training"/>
         <connect from_op="XValidation" from_port="model" to_port="result 1"/>
         <connect from_op="XValidation" from_port="training" to_port="result 2"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>
    The operator "Nominal to Numerical" has replaced each attribute column with 0-39  :(  The fact that it still produces 90% satisfies our gullibility.
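
A rough Python sketch of why this destroys the data. It assumes the conversion maps each distinct nominal value to an integer index, which matches the 0-39 columns observed above; the exact mapping RapidMiner uses may differ, and `index_encode` is a hypothetical helper.

```python
def index_encode(column):
    """Map each distinct nominal value to an integer index, in order of first appearance."""
    mapping = {}
    encoded = []
    for value in column:
        if value not in mapping:
            mapping[value] = len(mapping)
        encoded.append(mapping[value])
    return encoded


# Strings that look like numbers but were imported as nominal values.
column = ["2.3647619e+000", "9.5738476e-001", "9.6855298e-001"]

print(index_encode(column))          # [0, 1, 2] : the magnitudes are lost
print([float(v) for v in column])    # what parsing the strings as numbers would give
```

With 40 rows of mostly distinct values, index encoding turns every column into something close to a row counter, so the learner no longer sees the measured values at all.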

    PS Rather ironically, if you replace the offending operator with a "Guess Types" operator all is well, like this....
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Root">
        <process expanded="true" height="296" width="915">
          <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="27" y="74">
            <parameter key="repository_entry" value="dataset"/>
          </operator>
          <operator activated="true" breakpoints="after" class="guess_types" expanded="true" height="76" name="Guess Types" width="90" x="447" y="165"/>
          <operator activated="true" class="x_validation" expanded="true" height="112" name="XValidation" width="90" x="648" y="165">
            <parameter key="local_random_seed" value="-1"/>
            <process expanded="true" height="296" width="432">
              <operator activated="true" class="support_vector_machine_libsvm" expanded="true" height="76" name="LibSVMLearner" width="90" x="171" y="30">
                <parameter key="C" value="32.0"/>
                <list key="class_weights"/>
              </operator>
              <connect from_port="training" to_op="LibSVMLearner" to_port="training set"/>
              <connect from_op="LibSVMLearner" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="296" width="432">
              <operator activated="true" class="apply_model" expanded="true" height="76" name="ModelApplier" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" expanded="true" height="76" name="Performance" width="90" x="238" y="30"/>
              <connect from_port="model" to_op="ModelApplier" to_port="model"/>
              <connect from_port="test set" to_op="ModelApplier" to_port="unlabelled data"/>
              <connect from_op="ModelApplier" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Guess Types" to_port="example set input"/>
          <connect from_op="Guess Types" from_port="example set output" to_op="XValidation" to_port="training"/>
          <connect from_op="XValidation" from_port="model" to_port="result 1"/>
          <connect from_op="XValidation" from_port="training" to_port="result 2"/>
          <connect from_op="XValidation" from_port="averagable 1" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
  • alexx Member Posts: 12 Contributor II
    thanks Haddock for finding the problem.

    Is there any way I can fix the "nominal to numerical" operator in rm5? Or any other workaround?
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    what exactly is the problem with the Nominal to Numerical operator? Its behavior is exactly as it was in 4.x if you don't change the default parameter settings. Please remember that in 4.x you had to wrap the Nominal to Numerical operator in an AttributeSubsetPreprocessing operator to restrict the attributes it worked on. You can now either use the equivalent Select Subset operator or simply use the built-in filter.

    Greetings,
      Sebastian
  • alexx Member Posts: 12 Contributor II
    Sebastian,

    thank you for your answer. I imported values from a csv file that looked like this.
    2.3647619e+000,9.5738476e-001,9.6855298e-001,...
    Unfortunately the real values were recognized as nominal so I wanted to use the nominal to numerical operator to mark them as numerical. But that operator simply converted the values to numerical 1, 2, 3 and so on. So I guess I just misunderstood the intention of the operator. I needed a 'real' converter.

    My problem still remains. I cannot import the data as numerical, but at least I could figure out why. My data is in scientific notation (Matlab standard). A value with the exp != 000 is correctly imported as numerical (real), whereas a value with the exponent == 000 is imported as nominal.

    so
    2.6855298e-001
    is correctly imported as numerical

    and
    2.3647619e+000
    is incorrectly imported as nominal.

    I would really appreciate if anyone has a solution for me. Again, RM4 correctly imports those values as numerical :(
    Thanks!
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    please replace the nominal to numerical operator by the parse numbers operator. That will help you solve your problem.

    Greetings,
      Sebastian
  • alexx Member Posts: 12 Contributor II
    Sebastian,

    thanks for your help. Unfortunately that did not solve the problem. The Parse Numbers operator still labels numbers like 2.3647619e+000 as nominal, but I want them to be numerical/real.

    See screenshot: http://img684.imageshack.us/img684/7505/nominalnumericalproblem.png

    Any idea how I can achieve that?
  • haddock Member Posts: 849 Maven
    Hi Folks,


    http://rapid-i.com/rapidforum/index.php/topic,1791.msg7012.html#msg7012

    Using the solution so darkly hidden therein on this csv data..

    2.6855298e-001,2.3647619e+000
    2.3647619e+000,2.6855298e-001

    I find that the numbers are read as reals by the following code...
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="-20" width="-50">
          <operator activated="true" class="read_csv" expanded="true" height="60" name="Read CSV" width="90" x="28" y="43">
            <parameter key="file_name" value="C:\Documents and Settings\Alien\My Documents\rm_workspace\R5 Forum\scients.csv"/>
          </operator>
          <operator activated="true" class="guess_types" expanded="true" height="76" name="Guess Types" width="90" x="169" y="43"/>
          <connect from_op="Read CSV" from_port="output" to_op="Guess Types" to_port="example set input"/>
          <connect from_op="Guess Types" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

  • alexx Member Posts: 12 Contributor II
    Haddock,

    thank you for your help. Your solution works partially... I'm getting weird behavior here:

    In your example, the values are labeled as real in the results workspace (screenshot: http://img140.imageshack.us/img140/6470/88436391.png)

    but I need to work with the values in the process. THERE the same values in that example are labeled nominal (screenshot: http://img179.imageshack.us/img179/6517/18861165.png)

    So in the process I cannot use the values as input for libSVM etc. I really don't understand this, maybe someone can explain/post a solution?
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi Alexx,

    the reason is quite simple: everything is fine and this is just the way "Guess Types" behaves. It guesses the types from the real data (which is not available during the meta data transformation) and not from the meta data. That means the meta data cannot be correctly updated during process design. I would recommend running Haddock's process and storing the data in the RM repository. There, you will easily see that the type is correct. Then just use the data from the repository and feed it into the learner, and everything will be fine.

    Alternatively, you could simply feed the data into the LibSVM after the transformation process. It will complain, but you can disable those complaints in the preferences: simply activate "general.capabilities.warn". However, the best way is to use the repository here.

    Cheers,
    Ingo
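
The idea that a type guess needs the actual values can be sketched in Python. This is an illustration only, not the operator's real logic: `guess_type` is a hypothetical helper, and the three type names are chosen to mirror the ones used in this thread.

```python
def guess_type(values):
    """Guess a column type from its actual values: integer, real, or nominal."""

    def all_parse(parser):
        # True only if every value in the column can be parsed.
        try:
            for v in values:
                parser(v)
        except ValueError:
            return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "real"
    return "nominal"


print(guess_type(["2.6855298e-001", "2.3647619e+000"]))  # real
print(guess_type(["1", "2", "3"]))                        # integer
print(guess_type(["a", "2.5"]))                           # nominal
```

The function cannot answer without looking at the values themselves, which is why a design-time view that only knows the declared column types still shows them as nominal until the process has actually run.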
  • alexx Member Posts: 12 Contributor II
    Thank you for your help. By disabling the complaints I could get it to work the way I wanted.

    An import wizard like the one in RM4 would make this a lot easier. I hope something like that finds its way into the new release. I'm very much looking forward to that ;)
  • dragoljub Member Posts: 241 Contributor II
    If you want to avoid the headache you can just have MATLAB generate CSV files in decimal, without using the scientific notation. RM should be able to handle the scientific notation, but I think that you should have no problem reading your results as decimal.

    -Gagi
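
If re-exporting from MATLAB is inconvenient, the same workaround can be applied to an existing file. The following Python sketch rewrites scientific-notation cells as plain decimals; the helper name `to_plain_decimal` is made up for illustration, and the sample row is the one from Haddock's test file above.

```python
from decimal import Decimal


def to_plain_decimal(text: str) -> str:
    """Rewrite a number such as 2.3647619e+000 in plain decimal notation."""
    # Decimal keeps the digits exactly, so no floating-point rounding is introduced.
    return format(Decimal(text), "f")


row = "2.6855298e-001,2.3647619e+000"
converted = ",".join(to_plain_decimal(cell) for cell in row.split(","))
print(converted)   # 0.26855298,2.3647619
```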
  • alexx Member Posts: 12 Contributor II
    Gagi,
    sure I could do that. But IMO rapidminer should have no problem reading scientific notation. I'll stick with the headache solution until there is an improved import utility in RM.
    Thanks to everyone for helping me out with that one.