How to plot a Learning Curve for a given model?

meloamaury Member Posts: 8 Contributor II
edited December 2018 in Help

Hi all,

 

I am new to RapidMiner Studio and I am trying to figure out how to plot a learning curve for a given model (basically, plot the training and testing performance as a function of the number of examples). In principle the learning curve would be a good indicator of the robustness of the model (showing the bias versus variance trade-off).

I could not find an operator in RapidMiner, or any video examples, covering this. I tried collecting the values with the Log operator after my Cross Validation operator in order to plot them afterwards, but without success.

Any guidance would be very much appreciated.

 

Best,

Amaury

Best Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    Solution Accepted

    The learning curve operator has been deprecated since about v7.3. Let me see if I can find a process that creates this for you.

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Solution Accepted

    Hi,

     

    I use the "Loop Parameters" operator for this; the inner "Sample" operator uses ratios between 5% and 100%. Make sure that you evaluate the model with a cross-validation that uses a fixed local random seed, since otherwise the influence of the data splits might be bigger than that of the additional examples.

     

    Below is a process which you can use as a building block for this.

     

    Hope this helps,

    Ingo

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data" compatibility="7.5.001" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
    <parameter key="target_function" value="global and local models classification"/>
    <parameter key="number_examples" value="10000"/>
    <parameter key="number_of_attributes" value="2"/>
    </operator>
    <operator activated="true" class="add_noise" compatibility="7.5.001" expanded="true" height="103" name="Add Noise" width="90" x="179" y="34">
    <parameter key="random_attributes" value="20"/>
    <list key="noise"/>
    </operator>
    <operator activated="true" class="loop_parameters" compatibility="7.5.001" expanded="true" height="82" name="Loop Parameters" width="90" x="313" y="34">
    <list key="parameters">
    <parameter key="Sample.sample_ratio" value="[0.05;1.0;19;linear]"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="sample" compatibility="7.5.001" expanded="true" height="82" name="Sample" width="90" x="45" y="34">
    <parameter key="sample" value="relative"/>
    <parameter key="sample_ratio" value="1.0"/>
    <list key="sample_size_per_class"/>
    <list key="sample_ratio_per_class"/>
    <list key="sample_probability_per_class"/>
    </operator>
    <operator activated="true" class="concurrency:cross_validation" compatibility="7.5.001" expanded="true" height="145" name="Cross Validation" width="90" x="179" y="34">
    <parameter key="use_local_random_seed" value="true"/>
    <process expanded="true">
    <operator activated="true" class="naive_bayes" compatibility="7.5.001" expanded="true" height="82" name="Naive Bayes" width="90" x="45" y="34"/>
    <connect from_port="training set" to_op="Naive Bayes" to_port="training set"/>
    <connect from_op="Naive Bayes" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.5.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance" compatibility="7.5.001" expanded="true" height="82" name="Performance" width="90" x="179" y="34"/>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="log" compatibility="7.5.001" expanded="true" height="82" name="Log" width="90" x="313" y="34">
    <list key="log">
    <parameter key="ratio" value="operator.Sample.parameter.sample_ratio"/>
    <parameter key="performance" value="operator.Cross Validation.value.performance main criterion"/>
    </list>
    </operator>
    <connect from_port="input 1" to_op="Sample" to_port="example set input"/>
    <connect from_op="Sample" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_op="Log" to_port="through 1"/>
    <connect from_op="Log" from_port="through 1" to_port="performance"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Generate Data" from_port="output" to_op="Add Noise" to_port="example set input"/>
    <connect from_op="Add Noise" from_port="example set output" to_op="Loop Parameters" to_port="input 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    </process>
    </operator>
    </process>
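
For readers who want to see the logic of the process above outside RapidMiner, here is a minimal Python sketch (standard library only). It is not a translation of the operators: the toy data, the nearest-centroid learner, and the seed value are hypothetical stand-ins, and the assumption that the grid "[0.05;1.0;19;linear]" expands to 20 evenly spaced ratios (steps + 1 values) should be checked against your RapidMiner version. What it does mirror is the idea: loop over sample ratios, cross-validate each sample with a fixed seed, and log one (ratio, performance) row per iteration.

```python
import random
import statistics

def sample_ratios(start=0.05, stop=1.0, steps=19):
    """Linear grid; ASSUMED to yield steps + 1 values like the loop's grid."""
    return [round(start + i * (stop - start) / steps, 4) for i in range(steps + 1)]

def make_data(n, seed=0):
    """Hypothetical toy data: one noisy feature, two classes."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(1.0 if label else -1.0, 1.5), label))
    return data

def fit_centroids(train):
    """Stand-in learner (not Naive Bayes): per-class mean of the feature."""
    means = {}
    for cls in (0, 1):
        values = [x for x, y in train if y == cls]
        means[cls] = statistics.fmean(values) if values else 0.0
    return means

def predict(means, x):
    """Predict the class whose mean is closest to x."""
    return min(means, key=lambda cls: abs(x - means[cls]))

def cross_val_accuracy(data, folds=10, seed=1992):
    """k-fold cross-validation; the fixed seed makes the splits reproducible."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    scores = []
    for k in range(folds):
        test_idx = set(idx[k::folds])
        train = [data[i] for i in idx if i not in test_idx]
        test = [data[i] for i in test_idx]
        model = fit_centroids(train)
        scores.append(sum(predict(model, x) == y for x, y in test) / len(test))
    return statistics.fmean(scores)

def learning_curve(data, ratios, folds=10, seed=1992):
    """One (ratio, number of examples, accuracy) row per sample ratio."""
    rng = random.Random(seed)
    rows = []
    for ratio in ratios:
        # Keep at least one example per fold, and never more than we have.
        n = min(len(data), max(folds, round(ratio * len(data))))
        subset = rng.sample(data, n)
        rows.append((ratio, n, cross_val_accuracy(subset, folds, seed)))
    return rows

if __name__ == "__main__":
    for ratio, n, acc in learning_curve(make_data(2000), sample_ratios()):
        print(f"{ratio:.2f}\t{n}\t{acc:.3f}")
```

Plotting the first (or second) column against the third gives the curve. Because the seed is fixed, rerunning the loop reproduces the same rows, so differences along the curve come from the sample size rather than from a different split.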

Answers

  • pschlunder Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 96 RM Research

    Hi Amaury,

    One option would be to embed your validation process in a loop operator that iterates over the number of example chunks you'd like to test. Within the loop operator you can use Generate Macro to set the number of examples for the given iteration, e.g. with the function expression "eval(%{iteration})*100" stored as a macro called "stop", so that the validation is applied to a growing set of examples in steps of 100. Afterwards, select the examples with the Filter Example Range operator, set to start at example 1 and to use %{stop} as the last example. Add a Log operator (as you've tried already) after your validation process and log both %{stop} (the x-axis of your plot) and the desired performance. You can extract the desired performance from the Cross Validation operator by choosing Cross Validation, value and performance 1 within the Log operator. This will be the y-axis of your plot.

    After running such a process you'll retrieve an example set containing the data you logged. Choose a scatter plot with your score on the y-axis and the number of examples on the x-axis. Done.

     

    Your process could look something like this:

    [Image: 01_process.png, General process]

    [Image: 02_in_loop.png, Inside the loop operator]

    [Image: 03_plot.png, Plotting of the resulting ExampleSet]

     

    Hope this solves your problem,

    Philipp

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    You can also simply add the number of examples (the absolute sample size of the Sample operator) as a parameter in the Optimization operator, set the appropriate range of examples, and then log the performance output for each of the sample runs.

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • yasunotkt Member Posts: 7 Contributor I

    Hi Philipp,

    Thank you for your kindness.

    I had the same question about plotting a logged value such as RMSE.

    I succeeded in plotting the log.

    Yasuno,T.   

  • meloamaury Member Posts: 8 Contributor II

    Hi All,

     

    Thanks a lot for your input. I actually posted this question and forgot about it. Now that I have checked again, here is the solution!

    I very much appreciate your time and kindness!

    Amaury

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Excellent! Welcome back @meloamaury :)


    Scott
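
As a small illustration of the chunk-based variant Philipp describes in this thread (a "stop" macro equal to the iteration number times 100, then examples 1 to %{stop}), here is a hedged Python sketch. The function names and the `score` callable are hypothetical placeholders for the validation step; only the arithmetic of the stop values and the 1-based, inclusive range follow the description in the thread.

```python
def chunk_stops(n_examples, chunk=100):
    """Last-example index per iteration: iteration * chunk (1-based iterations)."""
    iterations = n_examples // chunk
    return [iteration * chunk for iteration in range(1, iterations + 1)]

def example_range(examples, first, last):
    """1-based, inclusive range, as the thread describes for Filter Example Range."""
    return examples[first - 1:last]

def log_curve(examples, score, chunk=100):
    """Collect (stop, performance) rows: the x and y columns of the plot.

    `score` is a hypothetical callable standing in for the validation process.
    """
    return [(stop, score(example_range(examples, 1, stop)))
            for stop in chunk_stops(len(examples), chunk)]

if __name__ == "__main__":
    # Dummy score (subset size) just to show the shape of the logged rows.
    print(log_curve(list(range(350)), score=len))
```

One thing to keep in mind with this variant: taking the first N examples only gives a fair curve if the example set is not sorted (for instance by label), so shuffling beforehand may be needed.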

     

     
