Gradient Boosted Tree Algorithm performance

varunm1 Member Posts: 199 Unicorn
I am working with Gradient Boosted Trees (GBT), and they perform better (5-fold CV) than other learners on most of my datasets, with high metrics such as AUC (1.0) and kappa (0.971). I can relate these results to GBT capabilities such as regularization and sequential learning. I even set aside 30 percent of the data for testing after the five-fold cross-validation and got a kappa of 0.974 on this unseen data.
My question is: are there any cautions or factors to consider when using a GBT and interpreting its results, and how well does GBT perform in real applications?

Thanks
Regards,
Varun

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 1,952  RM Data Scientist
    Well, the usual overtraining concern for any complex algorithm applies, which you are already considering. GBTs are usually the best off-the-shelf algorithm, especially if you have nominal data in your data set.

    BR,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • varunm1 Member Posts: 199 Unicorn
    Thanks, @Telcontar120. I am a bit skeptical about the result, so I did some basic checks such as attribute correlations and a hold-out dataset (30%) for testing after CV; the test set looks good as well. I will check other methods too to see if there are any issues.

    Regards,
    Varun
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 1,952  RM Data Scientist
    Are you sure that your hold-out set is really independent and does not contain pseudo-duplicates?

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • varunm1 Member Posts: 199 Unicorn
    edited March 12
    Yes, they are different subjects. I tested this by setting some subjects aside for testing and applying cross-validation to the remaining data. The cross-validation results look pretty good, but the test results on the held-out subjects are poor.
    Regards,
    Varun
  • varunm1 Member Posts: 199 Unicorn
    edited March 12
    @mschmitz thanks. I have gone through the experience you posted in the link below, and it fits this scenario as well. @lionelderkrikor thanks for solving the filter operator issue.
    https://towardsdatascience.com/when-cross-validation-fails-9bd5a57f07b5
    Regards,
    Varun
  • varunm1 Member Posts: 199 Unicorn
    edited March 15
    Is there a way to do subject-wise cross-validation in RapidMiner rather than the default record-wise cross-validation? Subject-wise cross-validation splits the data into folds based on a Subject ID column instead of assigning records to folds at random. Leave-one-subject-out cross-validation is recommended for medical diagnosis.

    This is especially useful when there are multiple samples per subject in the dataset.
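    For readers outside RapidMiner: this subject-wise split is what scikit-learn calls a group-based split. A minimal sketch with GroupKFold (the arrays below are toy stand-ins, not the poster's real data):

```python
# Sketch of subject-wise 5-fold CV using scikit-learn's GroupKFold.
# Assumption: toy data; 10 subjects with 4 samples each stand in for
# the real subject-level dataset described in the thread.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))           # 40 samples, 3 features
y = rng.integers(0, 2, size=40)        # binary label
groups = np.repeat(np.arange(10), 4)   # subject ID per sample

# each fold holds out whole subjects, never individual records
folds = list(GroupKFold(n_splits=5).split(X, y, groups))
for train_idx, test_idx in folds:
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```

    Each fold then holds out whole subjects, so no subject contributes samples to both the training and the test side of the same fold.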

    Thanks a lot for your support.
    Regards,
    Varun
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,536  RM Founder
    Hi,
    There used to be an operator called "Batch Validation" that could have been used for this, but it was removed in version 7.3 when we introduced parallel processing for all the validation operators.  With that operator you would have specified a "batch" attribute defining the splits for the cross-validation; in your case, this would have been the subject IDs (or, more likely, groups of subjects).
    Anyway, since this operator is history, below is a simple process that achieves the same thing.  I use the passenger class of Titanic to define the groups.  In your case, you would use groups containing the same subject(s).
    If there is much demand for such an operator in the future, I am sure we can bring it back, but for now this should be a good workaround...
    Hope this helps,
    Ingo
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="34">
            <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="9.2.000" expanded="true" height="82" name="Set Role" width="90" x="179" y="34">
            <parameter key="attribute_name" value="Survived"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="concurrency:loop_values" compatibility="9.2.000" expanded="true" height="82" name="Loop Values" width="90" x="313" y="34">
            <parameter key="attribute" value="Passenger Class"/>
            <parameter key="iteration_macro" value="loop_value"/>
            <parameter key="reuse_results" value="false"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="filter_examples" compatibility="9.2.000" expanded="true" height="103" name="Filter Examples" width="90" x="45" y="85">
                <parameter key="parameter_expression" value=""/>
                <parameter key="condition_class" value="custom_filters"/>
                <parameter key="invert_filter" value="false"/>
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="Passenger Class.equals.%{loop_value}"/>
                </list>
                <parameter key="filters_logic_and" value="true"/>
                <parameter key="filters_check_metadata" value="true"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="9.2.000" expanded="true" height="82" name="Select Attributes (2)" width="90" x="179" y="136">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="Passenger Class"/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="true"/>
                <parameter key="include_special_attributes" value="false"/>
              </operator>
              <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.2.000" expanded="true" height="103" name="Decision Tree" width="90" x="313" y="136">
                <parameter key="criterion" value="gain_ratio"/>
                <parameter key="maximal_depth" value="10"/>
                <parameter key="apply_pruning" value="true"/>
                <parameter key="confidence" value="0.1"/>
                <parameter key="apply_prepruning" value="true"/>
                <parameter key="minimal_gain" value="0.01"/>
                <parameter key="minimal_leaf_size" value="2"/>
                <parameter key="minimal_size_for_split" value="4"/>
                <parameter key="number_of_prepruning_alternatives" value="3"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="9.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="Passenger Class"/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="true"/>
                <parameter key="include_special_attributes" value="false"/>
              </operator>
              <operator activated="true" class="apply_model" compatibility="9.2.000" expanded="true" height="82" name="Apply Model" width="90" x="447" y="34">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="9.2.000" expanded="true" height="82" name="Performance" width="90" x="581" y="34">
                <parameter key="main_criterion" value="first"/>
                <parameter key="accuracy" value="true"/>
                <parameter key="classification_error" value="false"/>
                <parameter key="kappa" value="false"/>
                <parameter key="weighted_mean_recall" value="false"/>
                <parameter key="weighted_mean_precision" value="false"/>
                <parameter key="spearman_rho" value="false"/>
                <parameter key="kendall_tau" value="false"/>
                <parameter key="absolute_error" value="false"/>
                <parameter key="relative_error" value="false"/>
                <parameter key="relative_error_lenient" value="false"/>
                <parameter key="relative_error_strict" value="false"/>
                <parameter key="normalized_absolute_error" value="false"/>
                <parameter key="root_mean_squared_error" value="false"/>
                <parameter key="root_relative_squared_error" value="false"/>
                <parameter key="squared_error" value="false"/>
                <parameter key="correlation" value="false"/>
                <parameter key="squared_correlation" value="false"/>
                <parameter key="cross-entropy" value="false"/>
                <parameter key="margin" value="false"/>
                <parameter key="soft_margin_loss" value="false"/>
                <parameter key="logistic_loss" value="false"/>
                <parameter key="skip_undefined_labels" value="true"/>
                <parameter key="use_example_weights" value="true"/>
                <list key="class_weights"/>
              </operator>
              <connect from_port="input 1" to_op="Filter Examples" to_port="example set input"/>
              <connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Filter Examples" from_port="unmatched example set" to_op="Select Attributes (2)" to_port="example set input"/>
              <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
              <connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="output 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="average" compatibility="9.2.000" expanded="true" height="82" name="Average" width="90" x="447" y="34"/>
          <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Loop Values" to_port="input 1"/>
          <connect from_op="Loop Values" from_port="output 1" to_op="Average" to_port="averagable 1"/>
          <connect from_op="Average" from_port="average" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • varunm1 Member Posts: 199 Unicorn
    Thanks a lot, @IngoRM, this workaround is helpful.
    Regards,
    Varun
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,046 Unicorn
    Doesn't the standard cross-validation still have the "split on batch attribute" option available as an advanced parameter?  Isn't that doing the same thing?
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • varunm1 Member Posts: 199 Unicorn
    Hello @Telcontar120

    Thanks for your response; it looks like it does the same thing. I tested the CV with "split on batch attribute", and the performance metrics match the process Ingo provided. Any suggestions for doing a similar CV with a set number of folds (5 or 10) rather than testing on each individual batch? Once I select "split on batch attribute" in the CV operator, the option for the number of folds disappears.
    Regards,
    Varun
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,046 Unicorn
    Right, once you are using your own batches, you will have as many folds as there are unique values of your batch attribute.  I am not entirely sure what you mean by doing multiple folds while also using a batch attribute that specifies the records in each fold.  If you mean you have a set number of batches and want cross-validation performed on each batch (rather than just one model per batch), you could simply put a conventional cross-validation inside the testing side of the outer cross-validation that splits on batch.  That should do the trick.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • varunm1 Member Posts: 199 Unicorn
    edited March 16
    Thank you, @Telcontar120. I have an ID column with 87 unique ID values (1,500 samples). I want to perform 5-fold cross-validation based on ID values rather than samples. If I select "split on batch attribute" and give the ID column the batch role, it performs leave-one-subject-out cross-validation in my case.  But if I want 5-fold CV based on ID (samples for about 70 IDs in training and 17 in testing per fold), I don't see an option for this in the CV operator.

    Sorry if it is confusing.
    Regards,
    Varun
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 1,952  RM Data Scientist
    Hi @varunm1
    you can use Generate Attributes with
    batchid = id % 5
    then use Set Role to give this new attribute the "batch" role and use the batch option of the cross-validation.
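    A quick sanity check of this modulo trick (assuming the subject IDs are numeric; the ID list below is a toy example):

```python
# Assumption: numeric subject IDs, as in Martin's batchid = id % 5 tip.
# Every sample of a subject gets the same batch, so the split stays
# subject-wise even with multiple samples per subject.
subject_ids = [3, 3, 3, 17, 17, 42, 42, 86]   # toy sample-level ID column
batch = [sid % 5 for sid in subject_ids]       # batch per sample, values 0..4

# collect the batches seen per subject; each subject maps to exactly one batch
by_subject = {}
for sid, b in zip(subject_ids, batch):
    by_subject.setdefault(sid, set()).add(b)
```

    One caveat: the five batches are only roughly balanced when the IDs are spread evenly over the modulus; with clustered IDs, the batch sizes can become uneven.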
    BR,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany