Columns with too many values

Chemical_eng Member Posts: 16 Contributor II
Hello. 

I am using AutoModel for a regression problem (my target is continuous). I have 3 input parameters with categorical values: one has 27 values, another has 16, and the third has 107. I have toggled off the option "Remove columns with too many values". Does this ensure that one-hot encoding is performed correctly for the column with 107 values?

What does it mean when the generalized linear model shows a coefficient of 0 for many categories? Is it not taking their impact into account?

Thanks

Best Answer

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    edited June 2022 Solution Accepted
    Hi @Chemical_eng,

    Thanks for sharing your experience using AutoML for a regression problem. 
    "I have toggled off the option 'Remove columns with too many values'. Does this ensure that one-hot encoding is performed correctly for the column with 107 values?"
    Yes and no. By default, RapidMiner AutoML uses the "Target Encoding" operator only to remove attributes with too many values; it performs no encoding. The GLM algorithm, however, handles categorical columns directly by one-hot encoding them internally, so you don't have to convert nominal attributes to numerical ones beforehand for GLM. We strongly recommend against one-hot encoding categorical columns with many levels into many binary columns, as this is very inefficient. That is why the target encoding step comes before the GLM's internal one-hot encoding.

    I tested the Titanic data in AutoML to predict the passenger fare.
    open the process here

    In Design view, you can locate the operator that handles nominal attributes (another tip: activate the Tree view). Here it is.

    Inside the subprocess "Basic Feature Engineering", you will find "Target Encoding" rather than one-hot encoding, as shown in my example. If you turn on "Remove columns with too many values" with the maximum number of values set to 10, the Target Encoding model removes the attribute "Life boat" but, by default, performs no encoding. You can customize this by replacing it with one-hot encoding operators.
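
To make the difference between the two encodings concrete, here is a minimal plain-Python sketch. It is not RapidMiner's implementation: the toy data and the helper names `one_hot` and `target_encode` are made up for illustration, and the target encoding shown is the simplest variant (mean of the target per level), which may differ from what the Target Encoding operator actually does.

```python
from collections import defaultdict

# Toy data, invented for illustration: one nominal column and a numeric target.
categories = ["A", "B", "A", "C", "B", "A"]
target = [10.0, 20.0, 14.0, 30.0, 22.0, 12.0]


def one_hot(values):
    """Return (levels, rows): one binary column per distinct level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


def target_encode(values, y):
    """Replace each level with the mean target observed for that level."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, y):
        sums[v] += t
        counts[v] += 1
    means = {level: sums[level] / counts[level] for level in sums}
    return [means[v] for v in values], means


levels, one_hot_rows = one_hot(categories)
encoded, level_means = target_encode(categories, target)

print(len(one_hot_rows[0]))  # number of binary columns equals number of levels
print(encoded)               # a single numeric column
```

With 107 distinct values, the one-hot version would produce 107 binary columns, while the target-encoded version stays a single numeric column.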


    "What does it mean when the generalized linear model shows a coefficient of 0 for many categories? Is it not taking their impact into account?"
    The many zero coefficients usually come from the "regularization" in GLM. Simply put, regularization reduces the number of predictors in the model in order to reduce the variance of the prediction error, handle correlated predictors, and avoid overfitting. https://en.wikipedia.org/wiki/Regularization_(mathematics)

    Again, in the process view, you can toggle off the option of regularization. 
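
As a rough illustration of why regularization can produce coefficients that are exactly 0, here is a small plain-Python sketch of soft-thresholding, the shrinkage step behind an L1 (lasso) penalty. This assumes the GLM's regularization includes an L1 component, which the post above does not state; the coefficient values, the `penalty` value, and the `level_*` names are hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values within [-penalty, penalty] become 0.0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


# Hypothetical unpenalized coefficients for five levels of a nominal attribute.
unpenalized = {
    "level_a": 2.5,
    "level_b": -0.3,
    "level_c": 0.1,
    "level_d": -1.8,
    "level_e": 0.4,
}
penalty = 0.5  # assumed penalty strength; 0.0 would leave every coefficient unchanged

penalized = {name: soft_threshold(coef, penalty) for name, coef in unpenalized.items()}
zeroed = [name for name, coef in penalized.items() if coef == 0.0]

print(penalized)
print(zeroed)  # levels whose coefficient was shrunk to exactly zero
```

Levels with a weak effect are set to exactly zero while stronger ones are only shrunk. With `penalty = 0.0`, which corresponds to switching regularization off, no coefficient is forced to zero.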

    Hope it helps.

    Cheers,
    YY

Answers

  • Chemical_eng Member Posts: 16 Contributor II
    Many thanks for this answer 


  • Chemical_eng Member Posts: 16 Contributor II
    I performed the procedure, but when I open the Model Simulator operator results, it shows me one input variable per category (as if the encoding had been left in place). This is not what I want.
  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    Thank you @Chemical_eng! The Model Simulator from AutoML uses the data as it was before the one-hot encoding handled by GLM.

    As the screenshot shows, there is a dropdown list of all possible values of the categorical variable.

    If you are available for a follow-up, I could walk you through the details in a quick call. 

  • Chemical_eng Member Posts: 16 Contributor II
    Yes, I would like to have a call, because after switching to one-hot encoding my simulator does not show it that way. How can we arrange this? Thanks.