Optimise Parameters

k_vishnu772 Member Posts: 34 Contributor I
edited November 2018 in Help

Hi All,

I am new to RapidMiner. I am using Optimize Parameters to find the best maximum depth and number of trees for my gradient boosted trees, and I got a maximum depth of 2 and 220 trees. How would I know if my model is overfitting?

Can I trust that the result of Optimize Parameters also takes care of overfitting?

Best Answers

  • Thomas_Ott Posts: 1,761   Unicorn
    Solution Accepted

    @Telcontar120 @k_vishnu772 If I remember correctly, GBT builds multiple trees like Random Forest, but it builds them one after another: each new tree is fit to the errors of the trees before it, and the scaled-down predictions are added together rather than averaged. Keeping the individual trees small is part of what limits overfitting. In conjunction with Cross Validation, I think the probability of overfitting is greatly reduced.

     

    Additionally, you're using a max depth of 2 so that really generalizes the tree too. 
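For anyone curious how boosting with shallow trees behaves, here is a minimal sketch in plain Python. It is not RapidMiner's implementation: it assumes squared-error loss, depth-1 trees (stumps) instead of depth 2, a made-up noisy sine dataset, and a hypothetical learning rate of 0.1. Trees are added one after another, each fit to the residuals left by the previous ones, and the training and test errors are compared for different numbers of trees.

```python
import math
import random

def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimising squared error."""
    best = None
    values = sorted(set(xs))
    for threshold in values[:-1]:  # splitting at the largest value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def stump_value(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees, learning_rate=0.1):
    """Fit stumps sequentially, each one on the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_value(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps

def mse(base, stumps, data, learning_rate=0.1):
    """Mean squared error of the boosted ensemble on (x, y) pairs."""
    total = 0.0
    for x, y in data:
        pred = base + sum(learning_rate * stump_value(s, x) for s in stumps)
        total += (y - pred) ** 2
    return total / len(data)

# Made-up data (an assumption for illustration): noisy sine wave.
rng = random.Random(0)
points = [(x, math.sin(x) + rng.gauss(0, 0.3))
          for x in (rng.uniform(0, 10) for _ in range(200))]
train, test = points[:150], points[150:]

base, stumps = boost([x for x, _ in train], [y for _, y in train], n_trees=100)

# The first n stumps are exactly the model you would get with n_trees=n.
train_errors = {n: mse(base, stumps[:n], train) for n in (1, 10, 100)}
test_errors = {n: mse(base, stumps[:n], test) for n in (1, 10, 100)}
for n in (1, 10, 100):
    print(n, "trees: train MSE", round(train_errors[n], 4),
          "test MSE", round(test_errors[n], 4))
```

The training error can only go down as trees are added, so it says little about overfitting on its own. The thing to watch is whether the test error stops improving or starts rising while the training error keeps falling.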

  • JEdward Posts: 564   Unicorn
    Solution Accepted

    Another way to check for overfitting is to keep a separate test dataset and use it to validate your model's predictions.

    This test set should not be part of the modelling or optimization stages.  You can also sample from the test set and average the results to estimate model accuracy, but that's not strictly necessary... I just liked adding an extra loop in this example.

     

    With a significance test between training and testing performance you can then see how much the models differ. 

    Is your optimized model's cross-validated performance significantly better than its performance on the test data?  If so, maybe your model is overfitting and will degrade faster in real-world usage.

    Have a play with the example below.  Note, you do need more data to have a viable hold-out test set so it might not be practical for every use case. 

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_direct_mailing_data" compatibility="8.2.000" expanded="true" height="68" name="Generate Direct Mailing Data" width="90" x="45" y="85">
    <parameter key="number_examples" value="100000"/>
    <description align="center" color="transparent" colored="false" width="126">More data means less chance of overfitting. &lt;br/&gt;Especially if having a separate holdout. Try using a smaller value &amp;amp; watch the results.</description>
    </operator>
    <operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="103" name="Split Data" width="90" x="179" y="85">
    <enumeration key="partitions">
    <parameter key="ratio" value="0.8"/>
    <parameter key="ratio" value="0.2"/>
    </enumeration>
    </operator>
    <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="8.2.000" expanded="true" height="124" name="Optimize Parameters (Grid)" width="90" x="313" y="34">
    <list key="parameters">
    <parameter key="Random Forest.number_of_trees" value="[2;5;10;linear]"/>
    <parameter key="Random Forest.maximal_depth" value="[2;5;10;linear]"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="concurrency:cross_validation" compatibility="8.2.000" expanded="true" height="145" name="Cross Validation (2)" width="90" x="112" y="85">
    <parameter key="number_of_folds" value="5"/>
    <process expanded="true">
    <operator activated="true" class="concurrency:parallel_random_forest" compatibility="8.2.000" expanded="true" height="103" name="Random Forest" width="90" x="179" y="34">
    <parameter key="number_of_trees" value="5"/>
    <parameter key="maximal_depth" value="5"/>
    </operator>
    <connect from_port="training set" to_op="Random Forest" to_port="training set"/>
    <connect from_op="Random Forest" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="8.2.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="85">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance" compatibility="8.2.000" expanded="true" height="82" name="Performance Training" width="90" x="224" y="85"/>
    <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance Training" to_port="labelled data"/>
    <connect from_op="Performance Training" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">Only a 5-fold because I want it to finish quickly.</description>
    </operator>
    <connect from_port="input 1" to_op="Cross Validation (2)" to_port="example set"/>
    <connect from_op="Cross Validation (2)" from_port="model" to_port="model"/>
    <connect from_op="Cross Validation (2)" from_port="performance 1" to_port="performance"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="concurrency:loop" compatibility="8.2.000" expanded="true" height="103" name="Loop" width="90" x="313" y="238">
    <parameter key="number_of_iterations" value="20"/>
    <process expanded="true">
    <operator activated="true" class="sample" compatibility="8.2.000" expanded="true" height="82" name="Sample" width="90" x="45" y="136">
    <parameter key="sample" value="relative"/>
    <list key="sample_size_per_class"/>
    <list key="sample_ratio_per_class"/>
    <list key="sample_probability_per_class"/>
    <parameter key="use_local_random_seed" value="true"/>
    <parameter key="local_random_seed" value="%{iteration}"/>
    <description align="center" color="transparent" colored="false" width="126">Random Subsets of test data. Random Seed is based on iteration.</description>
    </operator>
    <operator activated="true" class="apply_model" compatibility="8.2.000" expanded="true" height="82" name="Apply Model (4)" width="90" x="179" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance" compatibility="8.2.000" expanded="true" height="82" name="Performance Testing" width="90" x="380" y="34"/>
    <connect from_port="input 1" to_op="Apply Model (4)" to_port="model"/>
    <connect from_port="input 2" to_op="Sample" to_port="example set input"/>
    <connect from_op="Sample" from_port="example set output" to_op="Apply Model (4)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (4)" from_port="labelled data" to_op="Performance Testing" to_port="labelled data"/>
    <connect from_op="Performance Testing" from_port="performance" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="source_input 3" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">There are much better ways to do this loop, but I haven't had enough caffeine yet.</description>
    </operator>
    <operator activated="true" class="average" compatibility="8.2.000" expanded="true" height="82" name="Average" width="90" x="447" y="187"/>
    <operator activated="true" class="t_test" compatibility="8.2.000" expanded="true" height="124" name="T-Test" width="90" x="581" y="85">
    <description align="center" color="transparent" colored="false" width="126">A significant difference between training &amp;amp; test data indicates the model might be overfitted.</description>
    </operator>
    <connect from_op="Generate Direct Mailing Data" from_port="output" to_op="Split Data" to_port="example set"/>
    <connect from_op="Split Data" from_port="partition 1" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
    <connect from_op="Split Data" from_port="partition 2" to_op="Loop" to_port="input 2"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_op="T-Test" to_port="performance 1"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="model" to_op="Loop" to_port="input 1"/>
    <connect from_op="Loop" from_port="output 1" to_op="Average" to_port="averagable 1"/>
    <connect from_op="Average" from_port="average" to_op="T-Test" to_port="performance 2"/>
    <connect from_op="T-Test" from_port="significance" to_port="result 1"/>
    <connect from_op="T-Test" from_port="performance 1" to_port="result 2"/>
    <connect from_op="T-Test" from_port="performance 2" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>
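The comparison at the end of the process above (cross-validated performance versus averaged hold-out performance) can be sketched outside RapidMiner too. The snippet below is only an illustration of the idea, not what the T-Test operator does internally: it computes Welch's t statistic and its approximate degrees of freedom for two lists of accuracies. The accuracy values are hypothetical placeholders, not results from the process.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical numbers, for illustration only.
cv_fold_accuracies = [0.91, 0.92, 0.90, 0.93, 0.91]       # one value per CV fold
test_sample_accuracies = [0.85, 0.86, 0.84, 0.87, 0.85]   # one value per test sample

t, df = welch_t(cv_fold_accuracies, test_sample_accuracies)
print("t =", round(t, 2), "df =", round(df, 1))
```

A large positive t means the cross-validated accuracy sits well above the hold-out accuracy, which is the pattern to be suspicious of. Turning t and df into a p-value needs a t-distribution table or a statistics library.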
     

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,226   Unicorn

    Parameter optimization is definitely not a preventative measure against overfitting.  In fact, arguably it may be more likely to find an overfit model, depending on the complexity of the algorithm you are using and the number of parameters being tuned.  The best solution against overfitting is the robust and thorough practice of cross-validation, as detailed in many blog posts and articles on the RapidMiner website and the community posts.  Cross-validation is essential to avoiding overfitting.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
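One common way to keep parameter tuning from flattering the final estimate is nested cross-validation: the tuning loop only ever sees the training part of an outer fold, and the outer fold's held-out part scores whatever the tuning picked. The sketch below is an assumption-laden illustration, not something the posters describe: it uses a toy 1-D k-nearest-neighbours classifier in place of GBT, made-up data with 20% label noise, and a hypothetical grid of three values.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and deal it into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0

def accuracy(train, test, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def split(data, held_out):
    """Return (training rows, held-out rows) for one fold."""
    held = set(held_out)
    train = [row for i, row in enumerate(data) if i not in held]
    test = [data[i] for i in held_out]
    return train, test

def cross_val_accuracy(data, k_neighbours, folds=5):
    scores = []
    for held_out in k_fold_indices(len(data), folds):
        train, test = split(data, held_out)
        scores.append(accuracy(train, test, k_neighbours))
    return sum(scores) / len(scores)

def nested_cv(data, grid, outer_folds=5, inner_folds=5):
    outer_scores = []
    for held_out in k_fold_indices(len(data), outer_folds, seed=1):
        outer_train, outer_test = split(data, held_out)
        # Tuning (the inner cross-validation) never sees outer_test.
        best_k = max(grid, key=lambda k: cross_val_accuracy(outer_train, k, inner_folds))
        outer_scores.append(accuracy(outer_train, outer_test, best_k))
    return sum(outer_scores) / len(outer_scores)

# Made-up data (an assumption): label is 1 when x > 0.5, with 20% of labels flipped.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.random()
    label = 1 if x > 0.5 else 0
    if rng.random() < 0.2:
        label = 1 - label
    data.append((x, label))

nested_score = nested_cv(data, grid=[1, 5, 15])
print("nested CV accuracy:", round(nested_score, 3))
```

The inner cross-validation score of the winning parameter set tends to be optimistic because it was used to choose the winner. The outer score is the one to report.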
  • k_vishnu772 Member Posts: 34 Contributor I

    I did use Cross Validation inside Optimize Parameters to get the best set of parameters. In that case, am I safe from overfitting?
