"LearningCurve edit: Training ratio bugged?"

wessel Member Posts: 537  Guru
edited May 23 in Help
Dear All,

How can I make a smooth learning curve, one that shows the result averaged over many runs?

edit:
The training_ratio parameter, i.e. the maximum ratio of the data that is used for training, doesn't seem to work.
In the results the maximum fraction is 0.95, although training_ratio was set to 0.2.
Changing training_ratio to 0.6 changes nothing!

edit:
I found 07_meta\04_LearningCurve.xml
and modified this XML as follows:

<?xml version="1.0" encoding="windows-1252"?>
<process version="4.6">

 <operator name="Root" class="Process" expanded="yes">
     <description text="This process plots the learning curve, i.e. the performance with respect to the number of examples which is used for learning."/>
     <parameter key="logverbosity" value="warning"/>
     <parameter key="random_seed" value="2004"/>
     <parameter key="send_mail" value="never"/>
     <parameter key="process_duration_for_mail" value="30"/>
     <parameter key="encoding" value="SYSTEM"/>
     <operator name="ArffExampleSource" class="ArffExampleSource">
         <parameter key="data_file" value="D:\wessel\Desktop\CYT_rest.arff"/>
         <parameter key="label_attribute" value="class"/>
         <parameter key="datamanagement" value="double_array"/>
         <parameter key="decimal_point_character" value="."/>
         <parameter key="sample_ratio" value="1.0"/>
         <parameter key="sample_size" value="-1"/>
         <parameter key="local_random_seed" value="-1"/>
     </operator>
     <operator name="LearningCurve" class="LearningCurve" expanded="yes">
         <parameter key="training_ratio" value="0.5"/>
         <parameter key="step_fraction" value="0.01"/>
         <parameter key="start_fraction" value="-1.0"/>
         <parameter key="sampling_type" value="shuffled sampling"/>
         <parameter key="local_random_seed" value="-1"/>
         <operator name="W-J48" class="W-J48">
             <parameter key="keep_example_set" value="false"/>
             <parameter key="U" value="false"/>
             <parameter key="C" value="0.25"/>
             <parameter key="M" value="2.0"/>
             <parameter key="R" value="false"/>
             <parameter key="B" value="false"/>
             <parameter key="S" value="false"/>
             <parameter key="L" value="false"/>
             <parameter key="A" value="false"/>
         </operator>
         <operator name="ApplierChain" class="OperatorChain" expanded="yes">
             <operator name="ModelApplier" class="ModelApplier">
                 <parameter key="keep_model" value="false"/>
                 <list key="application_parameters">
                 </list>
                 <parameter key="create_view" value="false"/>
             </operator>
             <operator name="Performance" class="Performance">
                 <parameter key="keep_example_set" value="false"/>
                 <parameter key="use_example_weights" value="true"/>
             </operator>
         </operator>
         <operator name="ProcessLog" class="ProcessLog">
             <list key="log">
               <parameter key="fraction" value="operator.LearningCurve.value.fraction"/>
               <parameter key="performance" value="operator.LearningCurve.value.performance"/>
             </list>
             <parameter key="sorting_type" value="none"/>
             <parameter key="sorting_k" value="100"/>
             <parameter key="persistent" value="false"/>
         </operator>
     </operator>
 </operator>

</process>

But this does not give the results I want.
The learning curve is way too chaotic!
It seems the results are not averaged over different runs:
http://student.science.uva.nl/~wluijben/learning_curve_in_need_of_smoothing.jpg

Old question:

How can I make a Learning Curve?

Let's say I have a dataset of 100 examples.
I wish to split this data into 10 folds.
In normal cross-validation there are 10 runs, each training on 9 folds and testing on 1,
which gives 1 average result + standard deviation.

Now I wish to add an extra iteration inside each run
that varies the number of folds used for training.
This should give N average results + N standard deviations, one for each number of folds used.
(Preferably it should output the amount of training data used, not the number of folds.)
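
The nested scheme described above can be sketched outside RapidMiner. This is a minimal Python illustration, not RapidMiner code: make_folds, cv_learning_curve and the evaluate callback are names made up for this sketch, and evaluate stands in for whatever learner and performance measure is actually used.

```python
import random
import statistics


def make_folds(n_examples, n_folds, seed=0):
    """Shuffle the example indices and deal them into n_folds folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cv_learning_curve(n_examples, n_folds, evaluate, seed=0):
    """Return one (mean_train_size, mean_perf, stdev_perf) row per number of training folds.

    evaluate(train_idx, test_idx) is a caller-supplied placeholder that trains
    a model on train_idx and returns its performance on test_idx.
    """
    folds = make_folds(n_examples, n_folds, seed)
    sizes = {}   # number of training folds -> training-set sizes, one per run
    scores = {}  # number of training folds -> performances, one per run
    for test_fold in range(n_folds):  # outer loop: ordinary cross-validation
        test_idx = folds[test_fold]
        train_idx = []
        used = 0
        for i, fold in enumerate(folds):  # inner loop: grow the training set
            if i == test_fold:
                continue
            train_idx = train_idx + fold
            used += 1
            sizes.setdefault(used, []).append(len(train_idx))
            scores.setdefault(used, []).append(evaluate(train_idx, test_idx))
    return [
        (statistics.mean(sizes[k]), statistics.mean(scores[k]), statistics.stdev(scores[k]))
        for k in sorted(scores)
    ]


if __name__ == "__main__":
    # Toy stand-in for a learner: "performance" is just the share of data used.
    for row in cv_learning_curve(100, 10, lambda train, test: len(train) / 100):
        print(row)
```

With 100 examples and 10 folds this yields 9 rows, one per number of training folds, each averaged over the 10 runs and reported with the training-set size rather than the fold count.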

Regards,

Wessel

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,527   Unicorn
    Hi,
    I think you have found a missing option: performing several runs for each training ratio to lower the variance in the plot. You might file this as a feature request in the bug tracker, if you want.
    You might solve this problem with your own process, replacing the LearningCurve operator with a Parameter Iteration with the parameter keep_output switched on, a sampling operator and, after all that, an AverageBuilder.

    Greetings,
      Sebastian
  • wessel Member Posts: 537  Guru
    Is it also possible for me to change the Java code of LearningCurve?
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,661  RM Founder
    Hi,

    Sure, you could implement that yourself. Or you can try the following process:

    <?xml version="1.0" encoding="windows-1252"?>
    <process version="4.6">

      <operator name="Root" class="Process" expanded="yes">
          <description text="This process plots the learning curve, i.e. the performance with respect to the number of examples which is used for learning."/>
          <parameter key="logverbosity"        value="warning"/>
          <parameter key="random_seed"        value="2004"/>
          <parameter key="send_mail"        value="never"/>
          <parameter key="process_duration_for_mail"        value="30"/>
          <parameter key="encoding"        value="SYSTEM"/>
          <operator name="ArffExampleSource" class="ArffExampleSource">
              <parameter key="data_file"        value="C:\Dokumente und Einstellungen\Mierswa\Eigene Dateien\rm_workspace\sample\data\iris.arff"/>
              <parameter key="label_attribute"        value="class"/>
              <parameter key="datamanagement"        value="double_array"/>
              <parameter key="decimal_point_character"        value="."/>
              <parameter key="sample_ratio"        value="1.0"/>
              <parameter key="sample_size"        value="-1"/>
              <parameter key="local_random_seed"        value="-1"/>
          </operator>
          <operator name="IOStorer" class="IOStorer">
              <parameter key="name"        value="data"/>
              <parameter key="io_object"        value="ExampleSet"/>
              <parameter key="store_which"        value="1"/>
              <parameter key="remove_from_process"        value="true"/>
          </operator>
          <operator name="IteratingOperatorChain" class="IteratingOperatorChain" expanded="no">
              <parameter key="iterations"        value="10"/>
              <parameter key="timeout"        value="-1"/>
              <operator name="IORetriever" class="IORetriever">
                  <parameter key="name"        value="data"/>
                  <parameter key="io_object"        value="ExampleSet"/>
                  <parameter key="remove_from_store"        value="false"/>
              </operator>
              <operator name="LearningCurve" class="LearningCurve" expanded="yes">
                  <parameter key="training_ratio"        value="0.5"/>
                  <parameter key="step_fraction"        value="0.05"/>
                  <parameter key="start_fraction"        value="-1.0"/>
                  <parameter key="sampling_type"        value="shuffled sampling"/>
                  <parameter key="local_random_seed"        value="-1"/>
                  <operator name="W-J48" class="W-J48">
                      <parameter key="keep_example_set"        value="false"/>
                      <parameter key="U"        value="false"/>
                      <parameter key="C"        value="0.25"/>
                      <parameter key="M"        value="2.0"/>
                      <parameter key="R"        value="false"/>
                      <parameter key="B"        value="false"/>
                      <parameter key="S"        value="false"/>
                      <parameter key="L"        value="false"/>
                      <parameter key="A"        value="false"/>
                  </operator>
                  <operator name="ApplierChain" class="OperatorChain" expanded="yes">
                      <operator name="ModelApplier" class="ModelApplier">
                          <parameter key="keep_model"        value="false"/>
                          <list key="application_parameters">
                          </list>
                          <parameter key="create_view"        value="false"/>
                      </operator>
                      <operator name="Performance" class="Performance">
                          <parameter key="keep_example_set"        value="false"/>
                          <parameter key="use_example_weights"        value="true"/>
                      </operator>
                  </operator>
                  <operator name="ProcessLog" class="ProcessLog">
                      <list key="log">
                        <parameter key="fraction"        value="operator.LearningCurve.value.fraction"/>
                        <parameter key="performance"        value="operator.LearningCurve.value.performance"/>
                        <parameter key="iteration"        value="operator.IteratingOperatorChain.value.iteration"/>
                      </list>
                      <parameter key="sorting_type"        value="none"/>
                      <parameter key="sorting_k"        value="100"/>
                      <parameter key="persistent"        value="false"/>
                  </operator>
              </operator>
          </operator>
          <operator name="ProcessLog2ExampleSet" class="ProcessLog2ExampleSet">
          </operator>
          <operator name="ClearProcessLog" class="ClearProcessLog">
              <parameter key="log_name"        value="ProcessLog"/>
              <parameter key="delete_table"        value="true"/>
          </operator>
          <operator name="ExampleFilter" class="ExampleFilter">
              <parameter key="condition_class"        value="attribute_value_filter"/>
              <parameter key="parameter_string"        value="fraction &gt;= 0.1"/>
              <parameter key="invert_filter"        value="false"/>
          </operator>
          <operator name="Example2AttributePivoting" class="Example2AttributePivoting">
              <parameter key="keep_example_set"        value="false"/>
              <parameter key="group_attribute"        value="iteration"/>
              <parameter key="index_attribute"        value="fraction"/>
              <parameter key="consider_weights"        value="true"/>
              <parameter key="weight_aggregation"        value="sum"/>
          </operator>
      </operator>

    </process>
    This delivers a new data set that can be used to produce an image like the attached one. Or you can calculate the average values by aggregation. Or...

    image
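
The aggregation mentioned above can also be done outside RapidMiner once the log is exported. The following is a rough Python sketch under that assumption: average_by_fraction is a made-up helper name, the rows mirror the three logged columns (fraction, performance, iteration), and the numbers are invented for illustration, not results of the process.

```python
import statistics
from collections import defaultdict


def average_by_fraction(rows):
    """Group (fraction, performance, iteration) rows by fraction.

    Returns a list of (fraction, mean, stdev, n_runs), sorted by fraction.
    """
    grouped = defaultdict(list)
    for fraction, performance, _iteration in rows:
        # Round so that 0.15000000000000002 and 0.15 land in the same group.
        grouped[round(fraction, 6)].append(performance)
    result = []
    for fraction in sorted(grouped):
        values = grouped[fraction]
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        result.append((fraction, statistics.mean(values), spread, len(values)))
    return result


# Made-up log rows for illustration only, not output of the process above.
rows = [
    (0.10, 0.60, 1), (0.10, 0.70, 2),
    (0.15, 0.80, 1), (0.15, 0.90, 2),
]
for fraction, mean, spread, n_runs in average_by_fraction(rows):
    print(f"{fraction:.2f}  mean={mean:.3f}  stdev={spread:.3f}  runs={n_runs}")
```

Plotting the mean per fraction gives one smooth curve instead of one noisy curve per iteration, and the standard deviation can be drawn as error bars.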

    Have fun,
    Ingo
    RapidMiner Wisdom 2020
    February 11th and 12th 2020 in Boston, MA, USA

  • wessel Member Posts: 537  Guru
    Woa, very nice!
    Thanks so much!

    Is there any way to reconstruct how many training examples and testing examples were used inside an iteration?

    I ask because I fear that using "the rest" for testing is not fair.
    You first want to split the data into a training set and a test set,
    and then, for each fraction, train on only that fraction of the training set, thus keeping the test set constant.

    Where can I see the Java code for LearningCurve?

    edit: By setting a breakpoint inside ModelApplier I can see the number of training / test examples used.
    The training set looks constant..
    The operator information on LearningCurve is really confusing!

    Regards,

    Wessel
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,661  RM Founder
    Hi,

    This is exactly what happens. First the data is divided into two parts according to the parameter "training_ratio" of the LearningCurve operator. Then the different fractions are taken from the training part for training, while the test data is kept constant. Just try adding additional logging as in the process below, or work with breakpoints, and you can see exactly what happens:

    <operator name="Root" class="Process" expanded="yes">
        <description text="This process plots the learning curve, i.e. the performance with respect to the number of examples which is used for learning."/>
        <parameter key="logverbosity" value="warning"/>
        <parameter key="random_seed" value="2004"/>
        <operator name="ArffExampleSource" class="ArffExampleSource">
            <parameter key="data_file" value="C:\Dokumente und Einstellungen\Mierswa\Eigene Dateien\rm_workspace\sample\data\iris.arff"/>
            <parameter key="label_attribute" value="class"/>
        </operator>
        <operator name="IOStorer" class="IOStorer">
            <parameter key="name" value="data"/>
            <parameter key="io_object" value="ExampleSet"/>
        </operator>
        <operator name="IteratingOperatorChain" class="IteratingOperatorChain" expanded="no">
            <parameter key="iterations" value="10"/>
            <operator name="IORetriever" class="IORetriever">
                <parameter key="name" value="data"/>
                <parameter key="io_object" value="ExampleSet"/>
                <parameter key="remove_from_store" value="false"/>
            </operator>
            <operator name="LearningCurve" class="LearningCurve" expanded="no">
                <parameter key="training_ratio" value="0.5"/>
                <parameter key="sampling_type" value="shuffled sampling"/>
                <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                    <operator name="DataMacroDefinition" class="DataMacroDefinition">
                        <parameter key="macro" value="training_size"/>
                    </operator>
                    <operator name="W-J48" class="W-J48">
                    </operator>
                </operator>
                <operator name="ApplierChain" class="OperatorChain" expanded="yes">
                    <operator name="DataMacroDefinition (2)" class="DataMacroDefinition">
                        <parameter key="macro" value="test_data"/>
                    </operator>
                    <operator name="ModelApplier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="Performance" class="Performance">
                    </operator>
                </operator>
                <operator name="ProcessLog" class="ProcessLog">
                    <list key="log">
                      <parameter key="fraction" value="operator.LearningCurve.value.fraction"/>
                      <parameter key="performance" value="operator.LearningCurve.value.performance"/>
                      <parameter key="iteration" value="operator.IteratingOperatorChain.value.iteration"/>
                    </list>
                </operator>
                <operator name="ProcessLog (2)" class="ProcessLog">
                    <list key="log">
                      <parameter key="training_size" value="operator.DataMacroDefinition.value.macro_value"/>
                      <parameter key="test_size" value="operator.DataMacroDefinition (2).value.macro_value"/>
                      <parameter key="iteration" value="operator.IteratingOperatorChain.value.iteration"/>
                    </list>
                </operator>
            </operator>
        </operator>
        <operator name="ProcessLog2ExampleSet" class="ProcessLog2ExampleSet">
            <parameter key="log_name" value="ProcessLog"/>
        </operator>
        <operator name="ClearProcessLog" class="ClearProcessLog">
            <parameter key="log_name" value="ProcessLog"/>
            <parameter key="delete_table" value="true"/>
        </operator>
        <operator name="ExampleFilter" class="ExampleFilter" breakpoints="after">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="fraction &gt;= 0.1"/>
        </operator>
        <operator name="Example2AttributePivoting" class="Example2AttributePivoting">
            <parameter key="group_attribute" value="iteration"/>
            <parameter key="index_attribute" value="fraction"/>
        </operator>
    </operator>
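
The splitting scheme described in this post (divide once by training_ratio, then take growing fractions of the training part while the test part stays fixed) can be mimicked in a few lines of Python. This is only a sketch of the described behaviour, not the operator's source code: learning_curve_splits is a made-up name, and the rounding of the set sizes is an assumption.

```python
import random


def learning_curve_splits(n_examples, training_ratio, fractions, seed=0):
    """Yield (fraction, train_idx, test_idx) with a test set that never changes.

    Rounding the sizes with round() is an assumption of this sketch.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # analogous to "shuffled sampling"
    n_pool = round(n_examples * training_ratio)
    train_pool = indices[:n_pool]  # the part governed by training_ratio
    test_idx = indices[n_pool:]    # the rest, used for testing at every fraction
    for fraction in fractions:
        n_train = max(1, round(len(train_pool) * fraction))
        yield fraction, train_pool[:n_train], test_idx


if __name__ == "__main__":
    for fraction, train_idx, test_idx in learning_curve_splits(150, 0.5, [0.05, 0.2, 1.0]):
        print(fraction, len(train_idx), len(test_idx))
```

Logging the two sizes per fraction, as the DataMacroDefinition operators do in the process above, is the quickest way to check which part is actually held constant.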

    Cheers,
    Ingo

  • wessel Member Posts: 537  Guru
    Thank you so much.
    This is completely what I wanted.

    There is a strange anomaly though, which is hidden in your screenshot because you use the example filter fraction >= 0.1.
    At fraction 0.05, using only 4 training examples, performance is better than when using 100 training examples!
    Why is the performance of fraction 0.05 so good?

    Regards,

    Wessel