Logistic regression: Select or change reference group

chris1 Member Posts: 5 Contributor I
edited December 2018 in Help

I am new to RapidMiner but have been working with logistic regression in SAS for years.  When working with categorical attributes in logistic regression, how does RapidMiner choose which category is the reference category?  Is it possible to assign a different reference category?

 

For example, say I have Race in my model with five possible values (white, black, asian, other, and unknown), and RapidMiner assigns a weight of 0 to black, with all other weights being relative to black. I want to change it so that asian or white is the reference group with a weight of 0.  Is there a way to do this?

 

Thanks.

Best Answer

  • earmijo Member Posts: 270 Unicorn
    Solution Accepted

    One solution is to create the dummy variables yourself.

     

    In this first example, I let RM choose the reference categories (they turn out to be Female for Sex and First for Passenger Class).

     

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="238">
    <parameter key="repository_entry" value="//Samples/data/Titanic"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.5.003" expanded="true" height="82" name="Set Role" width="90" x="179" y="238">
    <parameter key="attribute_name" value="Survived"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="238">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Age|Sex|Passenger Class"/>
    </operator>
    <operator activated="true" class="h2o:logistic_regression" compatibility="7.5.000" expanded="true" height="103" name="Logistic Regression" width="90" x="514" y="238"/>
    <connect from_op="Retrieve Titanic" from_port="output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Logistic Regression" to_port="training set"/>
    <connect from_op="Logistic Regression" from_port="model" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Then you get:

    Screen Shot 2017-08-11 at 11.32.38 AM.png

     

    Say you want the reference categories to be Male and Third Class. You have to create the dummies yourself with Nominal to Numerical and specify comparison groups. This gives you more control, but it takes more work.

     

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="238">
    <parameter key="repository_entry" value="//Samples/data/Titanic"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.5.003" expanded="true" height="82" name="Set Role" width="90" x="179" y="238">
    <parameter key="attribute_name" value="Survived"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="238">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Age|Sex|Passenger Class"/>
    </operator>
    <operator activated="true" class="nominal_to_numerical" compatibility="7.5.003" expanded="true" height="103" name="Nominal to Numerical" width="90" x="447" y="238">
    <parameter key="use_comparison_groups" value="true"/>
    <list key="comparison_groups">
    <parameter key="Sex" value="Male"/>
    <parameter key="Passenger Class" value="Third"/>
    </list>
    </operator>
    <operator activated="true" class="h2o:logistic_regression" compatibility="7.5.000" expanded="true" height="103" name="Logistic Regression" width="90" x="648" y="238"/>
    <connect from_op="Retrieve Titanic" from_port="output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
    <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Logistic Regression" to_port="training set"/>
    <connect from_op="Logistic Regression" from_port="model" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Then you get:

     

    Screen Shot 2017-08-11 at 11.32.20 AM.png
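To make the idea behind comparison groups concrete, here is a minimal plain-Python sketch of dummy coding with a chosen reference category. It is not RapidMiner code: the function name `dummy_code` and the column naming scheme are assumptions made for illustration. The reference category gets no column of its own, so its rows are all zeros and every fitted coefficient is read relative to it.

```python
def dummy_code(values, name, reference):
    """Dummy-code nominal values, leaving out the chosen reference category.

    Hypothetical helper for illustration; not part of RapidMiner.
    """
    categories = sorted(set(values))
    if reference not in categories:
        raise ValueError(f"reference {reference!r} does not occur in the data")
    # Every category except the reference gets its own 0/1 column.
    kept = [c for c in categories if c != reference]
    rows = [{f"{name} = {c}": int(v == c) for c in kept} for v in values]
    return kept, rows


passenger_class = ["First", "Third", "Second", "Third"]
kept, rows = dummy_code(passenger_class, "Passenger Class", reference="Third")

for value, row in zip(passenger_class, rows):
    # Rows for the reference category ("Third") contain only zeros.
    print(value, row)
```

Calling the same function with `reference="First"` produces the columns for Second and Third instead, which corresponds to the default result in the first example.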

     

     

    You can reproduce the original result by setting the comparison groups to Female and First:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="238">
    <parameter key="repository_entry" value="//Samples/data/Titanic"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.5.003" expanded="true" height="82" name="Set Role" width="90" x="179" y="238">
    <parameter key="attribute_name" value="Survived"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.5.003" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="238">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Age|Sex|Passenger Class"/>
    </operator>
    <operator activated="true" class="nominal_to_numerical" compatibility="7.5.003" expanded="true" height="103" name="Nominal to Numerical" width="90" x="447" y="187">
    <parameter key="use_comparison_groups" value="true"/>
    <list key="comparison_groups">
    <parameter key="Sex" value="Female"/>
    <parameter key="Passenger Class" value="First"/>
    </list>
    </operator>
    <operator activated="true" class="h2o:logistic_regression" compatibility="7.5.000" expanded="true" height="103" name="Logistic Regression" width="90" x="648" y="238"/>
    <connect from_op="Retrieve Titanic" from_port="output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
    <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Logistic Regression" to_port="training set"/>
    <connect from_op="Logistic Regression" from_port="model" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    And you get:

     

    Screen Shot 2017-08-11 at 11.31.51 AM.png

     

     

     

     

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Hi Chris,

     

    The way to control which target variable you want to learn is to use the Set Role operator. Just select the attribute name and set the target role parameter to 'label'.

  • chris1 Member Posts: 5 Contributor I

    Thanks for the reply, but I think I didn't state my question clearly.  I have the label set correctly; that's not the issue.  What I'm trying to determine is which level of a categorical independent variable in the model is set as the reference group, i.e. the level that has a weight of zero.  The weights/coefficients that the model generates for the other levels are relative to that reference group.

     

    In my particular model, race is one of the independent variables.  When I run the model, RapidMiner sets the "black" group as the reference group for the categorical race variable.  All the race coefficients in the model are then relative to the "black" group.  Instead I want to set the "white" group as the reference group and show the coefficients for each race category relative to the "white" group.  Some races have positive coefficient values right now relative to black but may have negative coefficient values when compared to the white group.  Race isn't the only categorical predictor that I have in the model; it's just the one I'm using in my example since it's easily understood.

     

    Does that help clear up what I'm trying to do?

     

    Thanks.
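The point about signs flipping when the reference group changes can be illustrated with a small sketch. For an unregularized maximum-likelihood fit, switching the reference level only shifts the coefficients of that attribute: each one has the old coefficient of the new reference subtracted, and the intercept absorbs the same amount. The numbers below are made up, and `rebase` is a hypothetical helper, not a RapidMiner function. With regularization switched on, a refitted model need not match this shift exactly.

```python
def rebase(coefs, intercept, new_reference):
    """Re-express dummy-coded coefficients relative to a different reference level.

    Hypothetical helper; assumes an unregularized logistic regression fit.
    """
    shift = coefs[new_reference]
    # Subtract the new reference's old coefficient from every level ...
    new_coefs = {level: b - shift for level, b in coefs.items()}
    # ... and add it to the intercept so every prediction stays unchanged.
    return new_coefs, intercept + shift


# Made-up coefficients with "black" as the reference group (weight 0).
coefs = {"black": 0.0, "white": 0.5, "asian": 0.25, "other": -0.25, "unknown": 1.0}
intercept = -1.0

new_coefs, new_intercept = rebase(coefs, intercept, new_reference="white")
print(new_coefs)
print(new_intercept)
```

In this made-up example, "asian" is positive relative to "black" (0.25) but negative relative to "white" (-0.25), while the linear predictor for every level is the same before and after.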

  • chris1 Member Posts: 5 Contributor I

    That's perfect, exactly what I was trying to do.  Thanks for your help!

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    You can also use the "Nominal to Numerical" operator and use the "effect coding" option, which allows you to specify your own comparison groups.  

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
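For readers comparing the two coding schemes, the following is a minimal sketch of textbook effect coding in plain Python. The helper `effect_code` is hypothetical and only illustrates the general definition; it is not taken from RapidMiner. Unlike dummy coding, rows that belong to the comparison group are coded -1 in every column, so the fitted coefficients are read as deviations from the average across levels rather than as differences from the comparison group.

```python
def effect_code(values, comparison_group):
    """Effect-code nominal values: the comparison group becomes -1 in every column.

    Hypothetical helper illustrating the textbook definition.
    """
    if comparison_group not in values:
        raise ValueError(f"{comparison_group!r} does not occur in the data")
    levels = [c for c in sorted(set(values)) if c != comparison_group]
    rows = []
    for v in values:
        if v == comparison_group:
            rows.append([-1] * len(levels))
        else:
            rows.append([int(v == c) for c in levels])
    return levels, rows


sex = ["Male", "Female", "Male"]
levels, rows = effect_code(sex, comparison_group="Male")
print(levels)
print(rows)
```

If coefficients relative to one specific reference group are what you want, as in the original question, dummy coding with comparison groups is the more direct fit.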