Problem/bug - CV and parameter optimization

marcin_blachnik Member Posts: 61 Guru
edited July 2019 in Help
Hello,

Below is a typical process with an embedded parameter optimization subprocess, but it does not work correctly. It looks like a bug.

Best

Marcin

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.000" expanded="true" name="Process">
    <process expanded="true" height="361" width="804">
      <operator activated="true" class="retrieve" compatibility="5.2.000" expanded="true" height="60" name="Retrieve" width="90" x="49" y="88">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.2.000" expanded="true" height="112" name="Validation" width="90" x="370" y="108">
        <description>A cross-validation evaluating a decision tree model.</description>
        <process expanded="true" height="483" width="547">
          <operator activated="true" class="multiply" compatibility="5.2.000" expanded="true" height="94" name="Multiply" width="90" x="30" y="120"/>
          <operator activated="true" class="optimize_parameters_grid" compatibility="5.2.000" expanded="true" height="94" name="Optimize Parameters (Grid)" width="90" x="179" y="30">
            <list key="parameters">
              <parameter key="SVM_Opti.C" value="[0.001;1000;3;logarithmic]"/>
              <parameter key="SVM_Opti.gamma" value="[0.01;1;3;linear]"/>
              <parameter key="Normalize_Opti.method" value="Z-transformation,range transformation"/>
            </list>
            <process expanded="true" height="483" width="844">
              <operator activated="true" class="x_validation" compatibility="5.1.002" expanded="true" height="112" name="Validation (2)" width="90" x="190" y="48">
                <description>A cross-validation evaluating a decision tree model.</description>
                <process expanded="true" height="654" width="466">
                  <operator activated="true" class="normalize" compatibility="5.2.000" expanded="true" height="94" name="Normalize_Opti" width="90" x="39" y="261">
                    <parameter key="method" value="range transformation"/>
                  </operator>
                  <operator activated="true" class="support_vector_machine_libsvm" compatibility="5.2.000" expanded="true" height="76" name="SVM_Opti" width="90" x="208" y="264">
                    <parameter key="gamma" value="1.0"/>
                    <parameter key="C" value="1000.0"/>
                    <list key="class_weights"/>
                  </operator>
                  <operator activated="true" class="group_models" compatibility="5.2.000" expanded="true" height="94" name="Group Models" width="90" x="231" y="19"/>
                  <connect from_port="training" to_op="Normalize_Opti" to_port="example set input"/>
                  <connect from_op="Normalize_Opti" from_port="example set output" to_op="SVM_Opti" to_port="training set"/>
                  <connect from_op="Normalize_Opti" from_port="preprocessing model" to_op="Group Models" to_port="models in 1"/>
                  <connect from_op="SVM_Opti" from_port="model" to_op="Group Models" to_port="models in 2"/>
                  <connect from_op="Group Models" from_port="model out" to_port="model"/>
                  <portSpacing port="source_training" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                  <portSpacing port="sink_through 1" spacing="0"/>
                </process>
                <process expanded="true" height="654" width="466">
                  <operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model (2)" width="90" x="45" y="30">
                    <list key="application_parameters"/>
                  </operator>
                  <operator activated="true" class="performance" compatibility="5.2.000" expanded="true" height="76" name="Performance (2)" width="90" x="179" y="30"/>
                  <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
                  <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
                  <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
                  <connect from_op="Performance (2)" from_port="performance" to_port="averagable 1"/>
                  <portSpacing port="source_model" spacing="0"/>
                  <portSpacing port="source_test set" spacing="0"/>
                  <portSpacing port="source_through 1" spacing="0"/>
                  <portSpacing port="sink_averagable 1" spacing="0"/>
                  <portSpacing port="sink_averagable 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="log" compatibility="5.2.000" expanded="true" height="76" name="Log" width="90" x="417" y="176">
                <list key="log">
                  <parameter key="normalize" value="operator.Normalize_Opti.parameter.method"/>
                  <parameter key="C" value="operator.SVM_Opti.parameter.C"/>
                  <parameter key="gamma" value="operator.SVM_Opti.parameter.gamma"/>
                  <parameter key="acc" value="operator.Validation (2).value.performance"/>
                  <parameter key="num" value="operator.Optimize Parameters (Grid).value.applycount"/>
                </list>
              </operator>
              <connect from_port="input 1" to_op="Validation (2)" to_port="training"/>
              <connect from_op="Validation (2)" from_port="averagable 1" to_op="Log" to_port="through 1"/>
              <connect from_op="Log" from_port="through 1" to_port="performance"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
              <portSpacing port="sink_result 1" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="set_parameters" compatibility="5.2.000" expanded="true" height="60" name="Set Parameters" width="90" x="313" y="30">
            <list key="name_map">
              <parameter key="SVM_Opti" value="SVM"/>
              <parameter key="Normalize_Opti" value="Normalize"/>
            </list>
          </operator>
          <operator activated="true" class="normalize" compatibility="5.2.000" expanded="true" height="94" name="Normalize" width="90" x="78" y="370"/>
          <operator activated="true" class="support_vector_machine_libsvm" compatibility="5.2.000" expanded="true" height="76" name="SVM" width="90" x="246" y="165">
            <parameter key="gamma" value="1.0"/>
            <parameter key="C" value="1000.0"/>
            <list key="class_weights"/>
          </operator>
          <operator activated="true" class="group_models" compatibility="5.2.000" expanded="true" height="94" name="Group Models (2)" width="90" x="380" y="300"/>
          <connect from_port="training" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Normalize" to_port="example set input"/>
          <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_op="Set Parameters" to_port="parameter set"/>
          <connect from_op="Normalize" from_port="example set output" to_op="SVM" to_port="training set"/>
          <connect from_op="Normalize" from_port="preprocessing model" to_op="Group Models (2)" to_port="models in 1"/>
          <connect from_op="SVM" from_port="model" to_op="Group Models (2)" to_port="models in 2"/>
          <connect from_op="Group Models (2)" from_port="model out" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true" height="483" width="397">
          <operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.2.000" expanded="true" height="76" name="Performance" width="90" x="180" y="30"/>
          <operator activated="true" class="log" compatibility="5.2.000" expanded="true" height="76" name="Log (2)" width="90" x="199" y="190">
            <list key="log">
              <parameter key="normalize" value="operator.Normalize.parameter.method"/>
              <parameter key="C" value="operator.SVM.parameter.C"/>
              <parameter key="gamma" value="operator.SVM.parameter.gamma"/>
              <parameter key="acc" value="operator.Performance.value.performance"/>
            </list>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_op="Log (2)" to_port="through 1"/>
          <connect from_op="Log (2)" from_port="through 1" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="2"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Answers

  • Nils_Woehler Member Posts: 463 Maven
    Hi Marcin,

    What is not working for you? When I run the process, everything seems to work. At least I receive a performance vector as a result.

    Best,
    Nils
  • marcin_blachnik Member Posts: 61 Guru
    The problem is with the performance. On the Iris data set, the accuracy of the inner CV in the parameter optimization step is around 97% (which is fine), but the final accuracy of the outer CV is only 64%!
    It looks like the normalization step is not applied in the testing subprocess of the outer CV.

  • Nils_Woehler Member Posts: 463 Maven
    Hi,

    This behavior isn't a bug. If you check "create view" in all Normalize and Apply Model operators, everything works as expected.
    If you do NOT check "create view", every Normalize and Apply Model operator works directly on the underlying data of the loaded Iris data set. The SVM can't handle this, because the data changes every time it is normalized.

    Best,
    Nils
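
    In the process XML, the "create view" checkbox described above would presumably show up as a boolean parameter on each affected operator. The fragment below is only a sketch: the key name `create_view` is an assumption that does not appear in the posted process, while the operator names and layout attributes are copied from it.

    ```xml
    <!-- Sketch only: assumes the "create view" option is stored under the key create_view -->
    <operator activated="true" class="normalize" compatibility="5.2.000" expanded="true" height="94" name="Normalize" width="90" x="78" y="370">
      <parameter key="create_view" value="true"/>
    </operator>
    <operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
      <list key="application_parameters"/>
      <parameter key="create_view" value="true"/>
    </operator>
    ```

    Under the same assumption, the parameter would also be set on Normalize_Opti and Apply Model (2), since the advice is to check the option in all Normalize and Apply Model operators.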
  • marcin_blachnik Member Posts: 61 Guru
    Thank you very much for your explanation. However, this is very inefficient, because the normalization is then computed on the fly, which is computationally expensive. I have found another solution: running Materialize Data before the parameter optimization process.

    In my opinion, the current behavior of Normalize, and of other RM operators as well, is very dangerous. I have used this process for many real problems, and now I don't know which results are good and which are bad.
    Moreover, after running this process the output exampleTable contains 534 attributes, which cannot be garbage-collected by the JVM. That may lead to out-of-memory problems.
    An alternative solution is to create a new exampleTable whenever an operator modifies the data. This can also be inefficient in the case of a single numerical attribute and many categorical ones, but at least it would be an error-free solution.

    Best regards
    Marcin
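
    The Materialize Data workaround mentioned in this post could be wired into the posted process roughly as follows. This is a sketch under assumptions: the class name `materialize_data` and the port names `example set input` / `example set output` are not taken from the thread, and the coordinates are arbitrary.

    ```xml
    <!-- Sketch only: class name and port names are assumptions; x/y are arbitrary -->
    <operator activated="true" class="materialize_data" compatibility="5.2.000" expanded="true" height="76" name="Materialize Data" width="90" x="45" y="30"/>
    <!-- These two connections would replace the direct connection from Multiply to Optimize Parameters (Grid) -->
    <connect from_op="Multiply" from_port="output 1" to_op="Materialize Data" to_port="example set input"/>
    <connect from_op="Materialize Data" from_port="example set output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
    ```

    The intent is that the optimization subprocess receives its own fresh copy of the training data, so normalization inside the inner loop no longer alters the table that the outer cross-validation uses.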
  • Nils_Woehler Member Posts: 463 Maven
    Yes, that's correct. Materialize Data is a better approach :-)

    But where exactly do you get an exampleTable with 534 attributes? When I run the process with Materialize Data and then look at the output port of Validation,
    there are only 4 regular and 2 special attributes.

    Best,
    Nils
  • marcin_blachnik Member Posts: 61 Guru
    Those results were obtained in the basic (incorrect) scenario, without "create view" and without Materialize Data. The process can easily be corrected so that the results are right, but the problem with the number of attributes remains, for example when one checks "create view" in the first CV operator.

    I think this may also be the problem described in the thread http://rapid-i.com/rapidforum/index.php/topic,4195.0.html . So any operator that adds new attributes to the exampleTable, such as normalization or PCA, should be used very carefully, especially inside loop operators and parameter optimization. My observation is that whenever one uses such an operator inside a loop or an optimization, the subprocess should start with Materialize Data. The only open question is which operators add new attributes. What is your view on this?

    Thank you very much for all your answers
    Marcin
  • Nils_Woehler Member Posts: 463 Maven
    Hi,

    I just ran the incorrect process again and still got only 4 attributes. Did you use the newest version, 5.2.1? The problem in the thread you mentioned should have been fixed in 5.2.1.
    If you did use version 5.2.1, where exactly did you notice the large number of attributes? Did you see them in the result view, as an example set result of the first CV?

    Best,
    Nils
  • marcin_blachnik Member Posts: 61 Guru
    For that experiment I used version 5.1.013, which I use for development. I obtained those results while debugging RM.

    Marcin