Stacking with different preprocessing for the base learners

ekiprop Member Posts: 4 Contributor I
edited November 2018 in Help

Hi everyone. I am new to RapidMiner.

I am trying to implement stacking for a dataset. However, whenever I attempt to apply different preprocessing for each algorithm, I get an error.

My question is: can one apply further preprocessing to the input within the Stacking operator before feeding it into the algorithm operators?

Thanks in advance

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,316  RM Data Scientist

    With the help of Group Models you can do a lot in this regard. What did you have in mind?

     

    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • ekiprop Member Posts: 4 Contributor I

    What I had in mind was to use the Stacking operator to blend several algorithms, e.g. SVM, Decision Tree and Naive Bayes. However, I would like to apply a different set of preprocessing steps for each model. My plan was to feed the data into the Stacking operator and then do the preprocessing inside it (different preprocessing for each algorithm) before passing the training data to each learner (SVM, Decision Tree and Naive Bayes), as shown here: My Stacking.png

     

    When I attempted this, I got the following error:

    Input ExampleSet does not match the training ExampleSet. Attribute 'Age' is is of value type real but it should be 'nominal' or a supertype.

    Does this mean that if I use the Stacking operator, I need to do all my preprocessing outside of it, and none within it?

    What options do I have if I want to use the Stacking operator but apply different preprocessing to the data for each base model? An example process would be really helpful.

    Thanks.

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,316  RM Data Scientist

    Dear Ekiprop,

     

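    For the lower two learners you can easily use Group Models to solve it: group each preprocessing model with the model trained on the preprocessed data, and pass the grouped model to the base model port. What is the preprocessing for the Decision Tree? That operator has no preprocessing model, so I am a bit confused about what it does :).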

     

    Attached is a 7.2 process showing Group Models in Stacking on the Golf data. We recommend 7.2 not only for its features, but also for stability.

     

    ~Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.2.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.2.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="85">
    <parameter key="repository_entry" value="//Samples/data/Golf"/>
    </operator>
    <operator activated="true" class="stacking" compatibility="7.2.001" expanded="true" height="68" name="Stacking" width="90" x="313" y="85">
    <process expanded="true">
    <operator activated="true" class="parallel_decision_tree" compatibility="7.2.001" expanded="true" height="82" name="Decision Tree" width="90" x="313" y="34"/>
    <operator activated="true" class="nominal_to_numerical" compatibility="7.2.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="179" y="187">
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="support_vector_machine" compatibility="7.2.001" expanded="true" height="124" name="SVM" width="90" x="313" y="136"/>
    <operator activated="true" class="group_models" compatibility="7.2.001" expanded="true" height="103" name="Group Models" width="90" x="447" y="187"/>
    <operator activated="true" class="nominal_to_numerical" compatibility="7.2.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="179" y="340">
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="naive_bayes" compatibility="7.2.001" expanded="true" height="82" name="Naive Bayes" width="90" x="313" y="289"/>
    <operator activated="true" class="group_models" compatibility="7.2.001" expanded="true" height="103" name="Group Models (2)" width="90" x="447" y="340"/>
    <connect from_port="training set 1" to_op="Decision Tree" to_port="training set"/>
    <connect from_port="training set 2" to_op="Nominal to Numerical" to_port="example set input"/>
    <connect from_port="training set 3" to_op="Nominal to Numerical (2)" to_port="example set input"/>
    <connect from_op="Decision Tree" from_port="model" to_port="base model 1"/>
    <connect from_op="Nominal to Numerical" from_port="example set output" to_op="SVM" to_port="training set"/>
    <connect from_op="Nominal to Numerical" from_port="preprocessing model" to_op="Group Models" to_port="models in 1"/>
    <connect from_op="SVM" from_port="model" to_op="Group Models" to_port="models in 2"/>
    <connect from_op="Group Models" from_port="model out" to_port="base model 2"/>
    <connect from_op="Nominal to Numerical (2)" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
    <connect from_op="Nominal to Numerical (2)" from_port="preprocessing model" to_op="Group Models (2)" to_port="models in 1"/>
    <connect from_op="Naive Bayes" from_port="model" to_op="Group Models (2)" to_port="models in 2"/>
    <connect from_op="Group Models (2)" from_port="model out" to_port="base model 3"/>
    <portSpacing port="source_training set 1" spacing="0"/>
    <portSpacing port="source_training set 2" spacing="0"/>
    <portSpacing port="source_training set 3" spacing="0"/>
    <portSpacing port="source_training set 4" spacing="0"/>
    <portSpacing port="sink_base model 1" spacing="0"/>
    <portSpacing port="sink_base model 2" spacing="0"/>
    <portSpacing port="sink_base model 3" spacing="0"/>
    <portSpacing port="sink_base model 4" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="parallel_decision_tree" compatibility="7.2.001" expanded="true" height="82" name="Decision Tree (2)" width="90" x="112" y="34"/>
    <connect from_port="stacking examples" to_op="Decision Tree (2)" to_port="training set"/>
    <connect from_op="Decision Tree (2)" from_port="model" to_port="stacking model"/>
    <portSpacing port="source_stacking examples" spacing="0"/>
    <portSpacing port="sink_stacking model" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve Golf" from_port="output" to_op="Stacking" to_port="training set"/>
    <connect from_op="Stacking" from_port="model" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
  • ekiprop Member Posts: 4 Contributor I

    Thank you, mschmitz. That's just what I needed. :)

  • ekiprop Member Posts: 4 Contributor I

    I have modified your process as shown in the code below. I would like to evaluate the performance of the stacked ensemble, but when I run it, I get the "Attributes do not match" error on the Apply Model operator. How can I solve this error?

    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.2.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="85">
    <parameter key="repository_entry" value="//Samples/data/Golf"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.2.001" expanded="true" height="103" name="Multiply" width="90" x="179" y="34"/>
    <operator activated="true" class="split_validation" compatibility="7.2.001" expanded="true" height="145" name="Validation" width="90" x="313" y="85">
    <parameter key="create_complete_model" value="false"/>
    <parameter key="split" value="relative"/>
    <parameter key="split_ratio" value="0.7"/>
    <parameter key="training_set_size" value="100"/>
    <parameter key="test_set_size" value="-1"/>
    <parameter key="sampling_type" value="shuffled sampling"/>
    <parameter key="use_local_random_seed" value="false"/>
    <parameter key="local_random_seed" value="1992"/>
    <process expanded="true">
    <operator activated="true" class="stacking" compatibility="7.2.001" expanded="true" height="68" name="Stacking" width="90" x="112" y="34">
    <parameter key="keep_all_attributes" value="true"/>
    <process expanded="true">
    <operator activated="true" class="parallel_decision_tree" compatibility="7.2.001" expanded="true" height="82" name="Decision Tree" width="90" x="313" y="34">
    <parameter key="criterion" value="gain_ratio"/>
    <parameter key="maximal_depth" value="20"/>
    <parameter key="apply_pruning" value="true"/>
    <parameter key="confidence" value="0.25"/>
    <parameter key="apply_prepruning" value="true"/>
    <parameter key="minimal_gain" value="0.1"/>
    <parameter key="minimal_leaf_size" value="2"/>
    <parameter key="minimal_size_for_split" value="4"/>
    <parameter key="number_of_prepruning_alternatives" value="3"/>
    </operator>
    <operator activated="true" class="nominal_to_numerical" compatibility="7.2.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="179" y="187">
    <parameter key="return_preprocessing_model" value="false"/>
    <parameter key="create_view" value="false"/>
    <parameter key="attribute_filter_type" value="all"/>
    <parameter key="attribute" value=""/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="nominal"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="file_path"/>
    <parameter key="block_type" value="single_value"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="single_value"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="false"/>
    <parameter key="coding_type" value="dummy coding"/>
    <parameter key="use_comparison_groups" value="false"/>
    <list key="comparison_groups"/>
    <parameter key="unexpected_value_handling" value="all 0 and warning"/>
    <parameter key="use_underscore_in_name" value="false"/>
    </operator>
    <operator activated="true" class="support_vector_machine" compatibility="7.2.001" expanded="true" height="124" name="SVM" width="90" x="313" y="136">
    <parameter key="kernel_type" value="dot"/>
    <parameter key="kernel_gamma" value="1.0"/>
    <parameter key="kernel_sigma1" value="1.0"/>
    <parameter key="kernel_sigma2" value="0.0"/>
    <parameter key="kernel_sigma3" value="2.0"/>
    <parameter key="kernel_shift" value="1.0"/>
    <parameter key="kernel_degree" value="2.0"/>
    <parameter key="kernel_a" value="1.0"/>
    <parameter key="kernel_b" value="0.0"/>
    <parameter key="kernel_cache" value="200"/>
    <parameter key="C" value="0.0"/>
    <parameter key="convergence_epsilon" value="0.001"/>
    <parameter key="max_iterations" value="100000"/>
    <parameter key="scale" value="true"/>
    <parameter key="calculate_weights" value="true"/>
    <parameter key="return_optimization_performance" value="true"/>
    <parameter key="L_pos" value="1.0"/>
    <parameter key="L_neg" value="1.0"/>
    <parameter key="epsilon" value="0.0"/>
    <parameter key="epsilon_plus" value="0.0"/>
    <parameter key="epsilon_minus" value="0.0"/>
    <parameter key="balance_cost" value="false"/>
    <parameter key="quadratic_loss_pos" value="false"/>
    <parameter key="quadratic_loss_neg" value="false"/>
    <parameter key="estimate_performance" value="false"/>
    </operator>
    <operator activated="true" class="group_models" compatibility="7.2.001" expanded="true" height="103" name="Group Models" width="90" x="447" y="187"/>
    <operator activated="true" class="nominal_to_numerical" compatibility="7.2.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="179" y="340">
    <parameter key="return_preprocessing_model" value="false"/>
    <parameter key="create_view" value="false"/>
    <parameter key="attribute_filter_type" value="all"/>
    <parameter key="attribute" value=""/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="nominal"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="file_path"/>
    <parameter key="block_type" value="single_value"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="single_value"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="false"/>
    <parameter key="coding_type" value="dummy coding"/>
    <parameter key="use_comparison_groups" value="false"/>
    <list key="comparison_groups"/>
    <parameter key="unexpected_value_handling" value="all 0 and warning"/>
    <parameter key="use_underscore_in_name" value="false"/>
    </operator>
    <operator activated="true" class="naive_bayes" compatibility="7.2.001" expanded="true" height="82" name="Naive Bayes" width="90" x="313" y="289">
    <parameter key="laplace_correction" value="true"/>
    </operator>
    <operator activated="true" class="group_models" compatibility="7.2.001" expanded="true" height="103" name="Group Models (2)" width="90" x="447" y="340"/>
    <connect from_port="training set 1" to_op="Decision Tree" to_port="training set"/>
    <connect from_port="training set 2" to_op="Nominal to Numerical" to_port="example set input"/>
    <connect from_port="training set 3" to_op="Nominal to Numerical (2)" to_port="example set input"/>
    <connect from_op="Decision Tree" from_port="model" to_port="base model 1"/>
    <connect from_op="Nominal to Numerical" from_port="example set output" to_op="SVM" to_port="training set"/>
    <connect from_op="Nominal to Numerical" from_port="preprocessing model" to_op="Group Models" to_port="models in 1"/>
    <connect from_op="SVM" from_port="model" to_op="Group Models" to_port="models in 2"/>
    <connect from_op="Group Models" from_port="model out" to_port="base model 2"/>
    <connect from_op="Nominal to Numerical (2)" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
    <connect from_op="Nominal to Numerical (2)" from_port="preprocessing model" to_op="Group Models (2)" to_port="models in 1"/>
    <connect from_op="Naive Bayes" from_port="model" to_op="Group Models (2)" to_port="models in 2"/>
    <connect from_op="Group Models (2)" from_port="model out" to_port="base model 3"/>
    <portSpacing port="source_training set 1" spacing="0"/>
    <portSpacing port="source_training set 2" spacing="0"/>
    <portSpacing port="source_training set 3" spacing="0"/>
    <portSpacing port="source_training set 4" spacing="0"/>
    <portSpacing port="sink_base model 1" spacing="0"/>
    <portSpacing port="sink_base model 2" spacing="0"/>
    <portSpacing port="sink_base model 3" spacing="0"/>
    <portSpacing port="sink_base model 4" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="parallel_decision_tree" compatibility="7.2.001" expanded="true" height="82" name="Decision Tree (2)" width="90" x="112" y="34">
    <parameter key="criterion" value="gain_ratio"/>
    <parameter key="maximal_depth" value="20"/>
    <parameter key="apply_pruning" value="true"/>
    <parameter key="confidence" value="0.25"/>
    <parameter key="apply_prepruning" value="true"/>
    <parameter key="minimal_gain" value="0.1"/>
    <parameter key="minimal_leaf_size" value="2"/>
    <parameter key="minimal_size_for_split" value="4"/>
    <parameter key="number_of_prepruning_alternatives" value="3"/>
    </operator>
    <connect from_port="stacking examples" to_op="Decision Tree (2)" to_port="training set"/>
    <connect from_op="Decision Tree (2)" from_port="model" to_port="stacking model"/>
    <portSpacing port="source_stacking examples" spacing="0"/>
    <portSpacing port="sink_stacking model" spacing="0"/>
    </process>
    </operator>
    <connect from_port="training" to_op="Stacking" to_port="training set"/>
    <connect from_op="Stacking" from_port="model" to_port="model"/>
    <portSpacing port="source_training" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.2.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
    <list key="application_parameters"/>
    <parameter key="create_view" value="false"/>
    </operator>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_port="averagable 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_averagable 1" spacing="0"/>
    <portSpacing port="sink_averagable 2" spacing="0"/>
    <portSpacing port="sink_averagable 3" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve Golf" from_port="output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_port="result 1"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Validation" to_port="training"/>
    <connect from_op="Validation" from_port="averagable 1" to_port="result 2"/>
    <connect from_op="Validation" from_port="averagable 2" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>
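
    In the process above, the testing subprocess of Validation connects the "labelled data" output of Apply Model directly to "averagable 1", with no Performance operator in between. For evaluating the stacked ensemble, a testing subprocess would usually look roughly like the sketch below. This is only a sketch under assumptions: the operator class performance_classification and its port names "labelled data" and "performance" do not appear anywhere in this thread, and layout attributes are omitted. It does not by itself explain or fix the "Attributes do not match" error.

    ```xml
    <!-- Sketch of the Validation testing subprocess: Apply Model followed by Performance. -->
    <!-- Assumption: class "performance_classification" and the ports "labelled data" /
         "performance" are not taken from the thread. -->
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.2.001" expanded="true" name="Apply Model">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.2.001" expanded="true" name="Performance"/>
    <!-- Model and test data come from the training side of Validation. -->
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <!-- The scored examples go to Performance, and only the performance vector is returned. -->
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
    </process>
    ```

    With this wiring, the "averagable 1" output of Validation carries a performance vector rather than the scored example set.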