How do I create these layers (image) with stacking?

MarlaBot Administrator, Moderator, Employee, Member Posts: 57 Community Manager
edited December 2018 in Help
@Jack530 wants to know the answer to this question.

"Hi. My goal is to create a highly accurate linear regression model, such as the one shown in the linked image (the image shows logistic regression). My question is how, probably with the Stacking operator, do I create these layers? I currently start with Read CSV (a training set with all corresponding roles/labels). How do I get from there to the model in the picture, and eventually to a highly accurate model (being aware of cross-validation, parameter optimization, and all other important operators)?

These are the steps that should be taken according to the image, and I am stuck on how to implement them in RapidMiner:
1. A diverse set of models with an assembled result
2. Subsequently, the output of 1. is the input to this phase: assembling models (bagging?)
3. Lastly, the output of 2. is the input of the eventual model (so it is built on the predictions of all preliminary models).

I would greatly appreciate advice."


    Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    This is an incredibly complicated architecture.  My interpretation of the diagram is that there are 64 separate individual models and 15 ensemble models in stage 1, all of which are built separately.   Maybe they are built on different segments of the underlying data (e.g., a segmented scorecard approach?  it is impossible to tell from the diagram).  These predictions are then passed into stage 2, which has 2 ensemble models built off them.  This results in another set of predictions, which are passed into a single LR in stage 3.

    This is all completely do-able in RapidMiner, albeit very tedious.  Is there a definite need to replicate this same structure?  You say you want to create a "highly accurate" model but there is no guarantee that the final model in this architecture is significantly more accurate than a single LR model---that totally depends on the data.  
    I would suggest you start by building a single LR model and assessing its performance against some other popular ML algorithms like k-NN or GBT.  This is easy to do with AutoModel.  You might then consider a simpler ensemble solution using a Voting or Stacking operator.  But there seems to be no need to go directly to the layout depicted in this diagram.
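
    As a rough illustration of that workflow outside RapidMiner, the sketch below compares a few candidate learners by cross-validated RMSE and then tries a simple voting ensemble that averages their predictions. Everything in it is an assumption made for illustration: the synthetic one-feature data, the toy learners (`fit_mean`, `fit_line`, `fit_knn`), and the helper names. It is not RapidMiner code and not the AutoModel procedure.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def fit_mean(xs, ys):
    """Baseline learner: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """One-feature least-squares linear regression."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_knn(xs, ys, k=3):
    """k-nearest-neighbour regression on a single feature."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def fit_vote(xs, ys):
    """Voting for regression: average the predictions of the member models."""
    members = [fit(xs, ys) for fit in (fit_line, fit_knn)]
    return lambda x: sum(m(x) for m in members) / len(members)


def cross_val_rmse(fit, xs, ys, folds=5):
    """Average RMSE over `folds` interleaved train/test splits."""
    scores = []
    for f in range(folds):
        train = [i for i in range(len(xs)) if i % folds != f]
        test = [i for i in range(len(xs)) if i % folds == f]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(rmse([ys[i] for i in test], [model(xs[i]) for i in test]))
    return sum(scores) / folds


# Synthetic data (an assumption): a linear trend plus a non-linear term and noise.
rng = random.Random(2001)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 5.0 * math.sin(x) + rng.gauss(0, 1.0) for x in xs]

candidates = {
    "mean baseline": fit_mean,
    "linear regression": fit_line,
    "k-NN": fit_knn,
    "vote (LR + k-NN)": fit_vote,
}
results = {name: cross_val_rmse(fit, xs, ys) for name, fit in candidates.items()}
for name, score in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:20s} RMSE = {score:.3f}")
```

    Whether the averaged ensemble beats its best member depends entirely on the data, which is why every candidate, including the ensemble, is scored with the same cross-validation.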

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
    Jack530 Member Posts: 1 Contributor I
    Thank you very much for responding this quickly to my post.

    First of all, there is absolutely no need to replicate the structure of the image. It was attached just to display a structure that could define the best algorithm. (The picture attached to the initial post was an extremely complex version; I may be looking for a much more general one.)

    However, allow me to explain my mission. I want to build the best possible predictive algorithm for predicting house prices.  

    First, I did indeed look at which models perform best. I have listed the results of this run below.

    My goal is to combine/structure (some of) the models to lower our eventual RMSE.

    My first question is which models you would include (would you include all of them?). Moreover, how would this look in RapidMiner? Which structure (Stacking, Bagging, or Voting) would you recommend, and how do I implement it in practice, so that it leads to the lowest possible RMSE on our test dataset (which of course depends on the dataset)? Altogether, how do I connect the models, and in what order/structure, in RapidMiner (by trial and error, of course, checking which combinations produce lower RMSEs than others) so that I can work toward the lowest possible RMSE?

    Again, I really appreciate your time and effort in answering this question. I know it is a pretty in-depth question, so thanks a lot.


    • Generalized Linear Model: 0.312
    • Deep Learning: 0.255
    • Decision Tree: 0.261
    • Random Forest: 0.295
    • Gradient Boosted Tree: 0.197
    Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    I would start by using a simple stacking model, with a DT (Decision Tree) as the stacking learner and the other models shown here as your base learners.  Here's a sample process (you obviously will need to change the data and the learners; this is from the Stacking tutorial process):
    <?xml version="1.0" encoding="UTF-8"?><process version="9.1.000-BETA2">
      <operator activated="true" class="process" compatibility="9.1.000-BETA2" expanded="true" name="Root" origin="GENERATED_TUTORIAL">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.1.000-BETA2" expanded="true" height="68" name="Sonar" origin="GENERATED_TUTORIAL" width="90" x="45" y="34">
            <parameter key="repository_entry" value="//Samples/data/Sonar"/>
          </operator>
          <operator activated="true" class="concurrency:cross_validation" compatibility="9.1.000-BETA2" expanded="true" height="145" name="Cross Validation (2)" width="90" x="313" y="34">
            <parameter key="split_on_batch_attribute" value="false"/>
            <parameter key="leave_one_out" value="false"/>
            <parameter key="number_of_folds" value="10"/>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="stacking" compatibility="9.1.000-BETA2" expanded="true" height="68" name="Stacking" origin="GENERATED_TUTORIAL" width="90" x="112" y="34">
                <parameter key="keep_all_attributes" value="true"/>
                <process expanded="true">
                  <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.1.000-BETA2" expanded="true" height="103" name="Decision Tree" origin="GENERATED_TUTORIAL" width="90" x="112" y="34">
                    <parameter key="criterion" value="gain_ratio"/>
                    <parameter key="maximal_depth" value="20"/>
                    <parameter key="apply_pruning" value="true"/>
                    <parameter key="confidence" value="0.25"/>
                    <parameter key="apply_prepruning" value="true"/>
                    <parameter key="minimal_gain" value="0.1"/>
                    <parameter key="minimal_leaf_size" value="2"/>
                    <parameter key="minimal_size_for_split" value="4"/>
                    <parameter key="number_of_prepruning_alternatives" value="3"/>
                  </operator>
                  <operator activated="true" class="k_nn" compatibility="9.1.000-BETA2" expanded="true" height="82" name="K-NN" origin="GENERATED_TUTORIAL" width="90" x="112" y="187">
                    <parameter key="k" value="5"/>
                    <parameter key="weighted_vote" value="false"/>
                    <parameter key="measure_types" value="MixedMeasures"/>
                    <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
                    <parameter key="nominal_measure" value="NominalDistance"/>
                    <parameter key="numerical_measure" value="EuclideanDistance"/>
                    <parameter key="divergence" value="GeneralizedIDivergence"/>
                    <parameter key="kernel_type" value="radial"/>
                    <parameter key="kernel_gamma" value="1.0"/>
                    <parameter key="kernel_sigma1" value="1.0"/>
                    <parameter key="kernel_sigma2" value="0.0"/>
                    <parameter key="kernel_sigma3" value="2.0"/>
                    <parameter key="kernel_degree" value="3.0"/>
                    <parameter key="kernel_shift" value="1.0"/>
                    <parameter key="kernel_a" value="1.0"/>
                    <parameter key="kernel_b" value="0.0"/>
                  </operator>
                  <operator activated="true" class="linear_regression" compatibility="9.1.000-BETA2" expanded="true" height="103" name="Linear Regression" origin="GENERATED_TUTORIAL" width="90" x="112" y="289">
                    <parameter key="feature_selection" value="M5 prime"/>
                    <parameter key="alpha" value="0.05"/>
                    <parameter key="max_iterations" value="10"/>
                    <parameter key="forward_alpha" value="0.05"/>
                    <parameter key="backward_alpha" value="0.05"/>
                    <parameter key="eliminate_colinear_features" value="true"/>
                    <parameter key="min_tolerance" value="0.05"/>
                    <parameter key="use_bias" value="true"/>
                    <parameter key="ridge" value="1.0E-8"/>
                  </operator>
                  <connect from_port="training set 1" to_op="Decision Tree" to_port="training set"/>
                  <connect from_port="training set 2" to_op="K-NN" to_port="training set"/>
                  <connect from_port="training set 3" to_op="Linear Regression" to_port="training set"/>
                  <connect from_op="Decision Tree" from_port="model" to_port="base model 1"/>
                  <connect from_op="K-NN" from_port="model" to_port="base model 2"/>
                  <connect from_op="Linear Regression" from_port="model" to_port="base model 3"/>
                  <portSpacing port="source_training set 1" spacing="0"/>
                  <portSpacing port="source_training set 2" spacing="105"/>
                  <portSpacing port="source_training set 3" spacing="84"/>
                  <portSpacing port="source_training set 4" spacing="0"/>
                  <portSpacing port="sink_base model 1" spacing="0"/>
                  <portSpacing port="sink_base model 2" spacing="105"/>
                  <portSpacing port="sink_base model 3" spacing="84"/>
                  <portSpacing port="sink_base model 4" spacing="0"/>
                </process>
                <process expanded="true">
                  <operator activated="true" class="naive_bayes" compatibility="9.1.000-BETA2" expanded="true" height="82" name="Naive Bayes" origin="GENERATED_TUTORIAL" width="90" x="179" y="34">
                    <parameter key="laplace_correction" value="true"/>
                  </operator>
                  <connect from_port="stacking examples" to_op="Naive Bayes" to_port="training set"/>
                  <connect from_op="Naive Bayes" from_port="model" to_port="stacking model"/>
                  <portSpacing port="source_stacking examples" spacing="0"/>
                  <portSpacing port="sink_stacking model" spacing="0"/>
                </process>
              </operator>
              <connect from_port="training set" to_op="Stacking" to_port="training set"/>
              <connect from_op="Stacking" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.1.000-BETA2" expanded="true" height="82" name="Apply Model" origin="GENERATED_TUTORIAL" width="90" x="45" y="34">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance" compatibility="9.1.000-BETA2" expanded="true" height="82" name="Performance" origin="GENERATED_TUTORIAL" width="90" x="179" y="34">
                <parameter key="use_example_weights" value="true"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
              <connect from_op="Performance" from_port="example set" to_port="test set results"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Sonar" from_port="output" to_op="Cross Validation (2)" to_port="example set"/>
          <connect from_op="Cross Validation (2)" from_port="model" to_port="result 1"/>
          <connect from_op="Cross Validation (2)" from_port="test result set" to_port="result 2"/>
          <connect from_op="Cross Validation (2)" from_port="performance 1" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="21"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
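
    The sample process above is a classification example (Sonar data, Naive Bayes as the stacking learner), while predicting house prices is a regression task. To show the stacking idea in a regression setting, here is a minimal sketch in plain Python. It is not RapidMiner code and makes no claim about how the Stacking operator works internally. The synthetic data, the two toy base learners, the linear meta-learner, and all function names are assumptions chosen to keep the example short.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def fit_line(xs, ys):
    """Base learner 1: one-feature least-squares linear regression."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_knn(xs, ys, k=3):
    """Base learner 2: k-nearest-neighbour regression on a single feature."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def out_of_fold(fit, xs, ys, folds=5):
    """Predict every training row with a model that never saw that row."""
    preds = [0.0] * len(xs)
    for f in range(folds):
        train = [i for i in range(len(xs)) if i % folds != f]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(f, len(xs), folds):
            preds[i] = model(xs[i])
    return preds


def fit_meta(p1, p2, ys):
    """Meta-learner: least squares for y ~ w1*p1 + w2*p2 (2x2 normal equations).

    Assumes p1 and p2 are not collinear, so the determinant is non-zero.
    """
    a11 = sum(a * a for a in p1)
    a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(b * b for b in p2)
    b1 = sum(a * y for a, y in zip(p1, ys))
    b2 = sum(b * y for b, y in zip(p2, ys))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


def fit_stack(xs, ys):
    """Stacking: train the meta-learner on the base learners' out-of-fold predictions."""
    base_fits = (fit_line, fit_knn)
    p1, p2 = (out_of_fold(fit, xs, ys) for fit in base_fits)
    w1, w2 = fit_meta(p1, p2, ys)
    # Refit the base learners on all training rows for the final model.
    line, knn = (fit(xs, ys) for fit in base_fits)
    return lambda x: w1 * line(x) + w2 * knn(x)


# Synthetic data (an assumption): a linear trend plus a non-linear term and noise.
rng = random.Random(2001)
xs = [rng.uniform(0, 10) for _ in range(300)]
ys = [3.0 * x + 5.0 * math.sin(x) + rng.gauss(0, 1.0) for x in xs]
train_x, train_y, test_x, test_y = xs[:200], ys[:200], xs[200:], ys[200:]

scores = {}
for name, fit in [("linear regression", fit_line), ("k-NN", fit_knn), ("stacked", fit_stack)]:
    model = fit(train_x, train_y)
    scores[name] = rmse(test_y, [model(x) for x in test_x])
    print(f"{name:20s} hold-out RMSE = {scores[name]:.3f}")
```

    The meta-learner is fitted on out-of-fold predictions so that it does not simply reward the base learner that overfits the training rows most. Whether the stacked model beats the best single learner still depends on the data, so it is worth scoring it on the same hold-out set as the individual models.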
