Problem with generalized linear model (lambda search)

scottchung64 Member Posts: 1 Learner I
edited December 2018 in Product Ideas

Hi all,

I'm trying to do classification using a generalized linear model.

With the default settings, the lambda value is chosen by H2O (as described in the documentation).

However, I found that if I use lambda search, the performance is much better.

I don't understand the difference between these two methods.

Does the better performance with lambda search come from overfitting?

 

Thanks!

Best,

Scott


Open for Voting · Last Updated

31 May 2019 - redesignated as Feature Request - moved to Product Ideas PROD-821

Best Answer

  • yyhuang Posts: 239  RM Data Scientist
    Solution Accepted

    Hi @scottchung64,

     

    You are correct. Lambda search is used to control regularization, which adds penalties to the model-building process in order to avoid overfitting. GLM needs to find good values for the two regularization parameters, alpha and lambda. The lambda parameter controls the amount of regularization applied to the model.
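For reference, the elastic net objective that GLM implementations of this kind minimize is commonly written as below. This is a sketch of the standard form, not a statement about the exact internals of the operator; here ℓ denotes the log-likelihood, N the number of examples, and β the coefficient vector (all notation introduced for illustration).

```latex
\min_{\beta,\,\beta_0}\; -\frac{1}{N}\,\ell(\beta, \beta_0)
  \;+\; \lambda \left( \alpha \,\lVert \beta \rVert_1
  \;+\; \frac{1-\alpha}{2}\, \lVert \beta \rVert_2^2 \right)
```

With lambda = 0 the penalty vanishes and the model is unregularized. For lambda > 0, alpha = 1 leaves only the L1 (Lasso) term and alpha = 0 leaves only the L2 (Ridge) term.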

     

    When you activate lambda search in the GLM operator, training takes longer because the operator has to search for the best lambda value.
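To make the idea concrete, here is a minimal sketch of a lambda search in plain Python. It is not the H2O implementation: the helper names (`lambda_path`, `fit_ridge_1d`, `search_lambda`), the `min_ratio` default, the toy data, and the one-dimensional ridge model are all assumptions chosen to keep the example short. It builds a log-spaced path of lambda values and keeps the one with the lowest error on held-out data.

```python
import math

def lambda_path(lambda_max, n_lambdas, min_ratio=1e-4):
    """Log-spaced lambdas from lambda_max down to lambda_max * min_ratio."""
    if n_lambdas < 2:
        return [lambda_max]
    step = math.log(min_ratio) / (n_lambdas - 1)
    return [lambda_max * math.exp(i * step) for i in range(n_lambdas)]

def fit_ridge_1d(xs, ys, lam):
    """Closed-form slope of a 1-D ridge regression without intercept."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam * n)

def mse(xs, ys, beta):
    """Mean squared error of the prediction beta * x."""
    return sum((y - beta * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def search_lambda(train, valid, lambdas):
    """Fit one model per lambda and return (lambda, validation error, slope)
    for the lambda with the lowest error on the held-out data."""
    best = None
    for lam in lambdas:
        beta = fit_ridge_1d(train[0], train[1], lam)
        err = mse(valid[0], valid[1], beta)
        if best is None or err < best[1]:
            best = (lam, err, beta)
    return best

# Toy data (made up for illustration): y is roughly 2 * x.
train = ([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
valid = ([5.0, 6.0], [10.1, 11.9])

path = lambda_path(lambda_max=10.0, n_lambdas=5)
best_lambda, best_error, best_beta = search_lambda(train, valid, path)
```

Because the winning lambda is judged on data the model was not fitted on, a gain from lambda search does not have to mean overfitting. Checking the result with a cross validation around the whole model is still a sensible safeguard.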

     

    YY

Comments

  • staskhalitov Member Posts: 3 Contributor I

    Is it possible to initiate an alpha search?

     

    I see this: "Providing multiple alpha values via the advanced parameters triggers a search."

    But how do I actually provide multiple values? What is the format?

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 239  RM Data Scientist

    Hi @staskhalitov,

     

    Good point. You will need to edit the "expert parameters" list:

    [Screenshots: alpha_search.PNG, alpha_list.PNG]

     

     

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.002">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.002" expanded="true" name="Process" origin="GENERATED_TUTORIAL">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="9.0.002" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
    </operator>
    <operator activated="true" class="h2o:generalized_linear_model" compatibility="7.2.000" expanded="true" height="124" name="Generalized Linear Model" origin="GENERATED_TUTORIAL" width="90" x="179" y="34">
    <parameter key="lambda_search" value="true"/>
    <parameter key="number_of_lambdas" value="3"/>
    <parameter key="alpha" value="0.6"/>
    <list key="beta_constraints"/>
    <list key="expert_parameters">
    <parameter key="additional_alphas" value="0.2"/>
    <parameter key="additional_alphas" value="0.1"/>
    </list>
    </operator>
    <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" origin="GENERATED_TUTORIAL" width="90" x="380" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="9.0.002" expanded="true" height="82" name="Performance" origin="GENERATED_TUTORIAL" width="90" x="514" y="85">
    <list key="class_weights"/>
    </operator>
    <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Generalized Linear Model" to_port="training set"/>
    <connect from_op="Generalized Linear Model" from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_op="Generalized Linear Model" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
    <connect from_op="Performance" from_port="performance" to_port="result 1"/>
    <connect from_op="Performance" from_port="example set" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>

    Hope it helps.

     

    YY

  • staskhalitov Member Posts: 3 Contributor I

    I tried your XML, but it seems the model just uses whatever value of alpha is set in the initial settings (0.6 in your example).

    It doesn't look like it considered the additional alphas (0.2 and 0.1) from the expert parameters.

     

    How do I actually initiate an alpha search as described here?

    alpha
    Description: The alpha parameter controls the distribution between the L1 (Lasso) and L2 (Ridge regression) penalties. A value of 1.0 for alpha represents Lasso, and an alpha value of 0.0 produces Ridge regression. Providing multiple alpha values via the advanced parameters triggers a search. Default is 0.0 for the L-BFGS solver, else 0.5.
    Range: real; 0.0-1.0
    Optional: true

     

     

    If I leave the initial alpha (0.6) blank and keep the additional alphas in the expert parameters, I get an error.

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 239  RM Data Scientist

    Hi @staskhalitov,

     

    Thanks for the follow-up! Great catch. I double-checked the model descriptions, and unfortunately the additional alpha values are not used for an alpha search. We are investigating the bug. @phellinger 

     

    In the meantime, you can run a grid search manually with a loop. Here is an example:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.002">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.002" expanded="true" name="Process" origin="GENERATED_TUTORIAL">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="9.0.002" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="313" y="187">
    <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
    </operator>
    <operator activated="true" class="generate_data" compatibility="9.0.002" expanded="true" height="68" name="Generate Data" width="90" x="179" y="34">
    <parameter key="target_function" value="grid function"/>
    <parameter key="number_examples" value="5"/>
    <parameter key="number_of_attributes" value="1"/>
    <parameter key="attributes_lower_bound" value="0.0"/>
    <parameter key="attributes_upper_bound" value="1.0"/>
    </operator>
    <operator activated="true" class="numerical_to_polynominal" compatibility="9.0.002" expanded="true" height="82" name="Numerical to Polynominal" width="90" x="313" y="34"/>
    <operator activated="true" class="concurrency:loop_values" compatibility="9.0.002" expanded="true" height="124" name="Loop Values" width="90" x="514" y="34">
    <parameter key="attribute" value="att1"/>
    <parameter key="iteration_macro" value="alpha"/>
    <parameter key="enable_parallel_execution" value="false"/>
    <process expanded="true">
    <operator activated="true" class="concurrency:cross_validation" compatibility="9.0.002" expanded="true" height="145" name="Cross Validation" width="90" x="112" y="34">
    <process expanded="true">
    <operator activated="true" class="h2o:generalized_linear_model" compatibility="7.2.000" expanded="true" height="124" name="Generalized Linear Model" origin="GENERATED_TUTORIAL" width="90" x="112" y="85">
    <parameter key="alpha" value="%{alpha}"/>
    <parameter key="standardize" value="false"/>
    <list key="beta_constraints"/>
    <list key="expert_parameters">
    <parameter key="additional_alphas" value="0.3"/>
    <parameter key="additional_alphas" value="0.1"/>
    <parameter key="additional_alphas" value="0.55"/>
    <parameter key="keep_cross_validation_predictions" value="true"/>
    </list>
    </operator>
    <connect from_port="training set" to_op="Generalized Linear Model" to_port="training set"/>
    <connect from_op="Generalized Linear Model" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" origin="GENERATED_TUTORIAL" width="90" x="112" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_binominal_classification" compatibility="9.0.002" expanded="true" height="82" name="Performance" width="90" x="246" y="34">
    <parameter key="classification_error" value="true"/>
    <parameter key="kappa" value="true"/>
    <parameter key="AUC" value="true"/>
    <parameter key="recall" value="true"/>
    <parameter key="f_measure" value="true"/>
    </operator>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="performance_to_data" compatibility="9.0.002" expanded="true" height="82" name="Performance to Data" width="90" x="313" y="136"/>
    <operator activated="true" class="generate_attributes" compatibility="9.0.002" expanded="true" height="82" name="Generate Attributes" width="90" x="447" y="85">
    <list key="function_descriptions">
    <parameter key="ALPHA" value="%{alpha}"/>
    </list>
    </operator>
    <connect from_port="input 2" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Cross Validation" from_port="model" to_port="output 1"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_op="Performance to Data" to_port="performance vector"/>
    <connect from_op="Performance to Data" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Performance to Data" from_port="performance vector" to_port="output 3"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_port="output 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="source_input 3" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="21"/>
    <portSpacing port="sink_output 3" spacing="42"/>
    <portSpacing port="sink_output 4" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Loop Values" to_port="input 2"/>
    <connect from_op="Generate Data" from_port="output" to_op="Numerical to Polynominal" to_port="example set input"/>
    <connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Loop Values" to_port="input 1"/>
    <connect from_op="Loop Values" from_port="output 1" to_port="result 1"/>
    <connect from_op="Loop Values" from_port="output 2" to_port="result 2"/>
    <connect from_op="Loop Values" from_port="output 3" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>
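The same loop can be sketched outside of RapidMiner in a few lines of Python. This is only an illustration of the manual grid search idea: `alpha_grid`, `grid_search_alpha`, and `toy_score` are hypothetical names, the evenly spaced grid is an assumption about what the Generate Data step produces, and `toy_score` is a made-up stand-in for training a GLM with a given alpha and returning its cross-validated performance.

```python
def alpha_grid(n_values, lower=0.0, upper=1.0):
    """Evenly spaced alpha candidates between lower and upper (inclusive)."""
    if n_values < 2:
        return [lower]
    step = (upper - lower) / (n_values - 1)
    return [lower + i * step for i in range(n_values)]

def grid_search_alpha(alphas, score_fn):
    """Score every alpha and return (best alpha, best score, all results).

    score_fn is expected to train and validate one model for the given
    alpha and return a single number where higher is better.
    """
    results = [(alpha, score_fn(alpha)) for alpha in alphas]
    best_alpha, best_score = max(results, key=lambda pair: pair[1])
    return best_alpha, best_score, results

def toy_score(alpha):
    """Hypothetical stand-in for a cross-validated metric such as AUC.

    In a real run this would fit a GLM with this alpha inside a cross
    validation, as the Cross Validation operator does in the process above.
    """
    return 0.8 - (alpha - 0.25) ** 2

alphas = alpha_grid(5)
best_alpha, best_score, results = grid_search_alpha(alphas, toy_score)
```

Keeping every (alpha, score) pair in `results` mirrors the process above, which writes the alpha value next to each performance row so the runs can be compared afterwards.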

    Best,

     

    YY
