Columns with too many values

Chemical_eng · June 2022

Hello.

I am using AutoModel for a regression problem (my target is continuous). I have 3 input parameters with categorical values. One of them has 27 values, another has 16, and the third has 107. I have toggled off the option "Remove columns with too many values". Does this ensure that the one-hot encoding is performed correctly for the column with 107 values?

What does it mean when the generalized linear model shows a coefficient of 0 for many categories? Is it not taking their impact into account?

Thanks

yyhuang · June 2022

Hi @Chemical_eng,

Thanks for sharing your experience using AutoML for a regression problem.

I have toggled off the option "Remove columns with too many values". Does this ensure that the one-hot encoding is performed correctly for the column with 107 values?

Yes and no. By default, RapidMiner AutoML uses the "Target Encoding" operator to remove attributes with too many values, and it performs no encoding. However, the GLM algorithm itself handles categorical columns directly by one-hot encoding them internally, so you don't have to transform nominal attributes to numerical ones beforehand for GLM. We strongly recommend against one-hot encoding categorical columns with many levels into many binary columns, as this is very inefficient. That is why we perform target encoding before the GLM's internal one-hot encoding.
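To make the difference between the two encodings concrete, here is a minimal sketch in plain Python. It is not RapidMiner's implementation: the column values and targets are made-up toy data, and the target encoding shown is the simplest variant (the mean of the target per level), whereas real implementations typically add smoothing or cross-validation.

```python
from statistics import mean

# Hypothetical toy data: one nominal column and a continuous target.
levels = ["A", "B", "A", "C", "B", "A"]
target = [10.0, 20.0, 14.0, 30.0, 22.0, 12.0]

distinct = sorted(set(levels))

# One-hot encoding: one binary column per distinct level,
# so a column with 107 levels would become 107 columns.
one_hot = [[1 if lv == d else 0 for d in distinct] for lv in levels]

# Simple target (mean) encoding: a single numeric column that holds
# the mean target value observed for each level.
level_means = {
    d: mean(t for lv, t in zip(levels, target) if lv == d) for d in distinct
}
target_encoded = [level_means[lv] for lv in levels]

print(len(one_hot[0]), "one-hot columns")
print(target_encoded)
```

The one-hot version grows with the number of levels, while the target-encoded version always stays a single column. That is the efficiency argument made above.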

I tested the Titanic data in AutoML to predict the passenger fare.
open the process here

Image: https://us.v-cdn.net/6030995/uploads/editor/7c/6wqmyp3mx4vg.png

In Design view, you can locate the operator that handles nominal attributes (another tip: activate the Tree view). Here it is.

Image: https://us.v-cdn.net/6030995/uploads/editor/z0/b9tgi8236zcq.png

Inside the "Basic Feature Engineering" subprocess, you can find "Target Encoding" instead of one-hot encoding, as shown in my example. If you turn on "Remove columns with too many values" with the maximum number of values set to 10, the Target Encoding model will remove the attribute "Life boat", but by default it applies no encoding. Here you can customize it by replacing it with one-hot encoding operators.

What does it mean when the generalized linear model shows a coefficient of 0 for many categories? Is it not taking their impact into account?

The large number of zero coefficients usually comes from the "regularization" in GLM. Simply put, regularization is used to reduce the number of predictors in the model, to reduce the variance of the prediction error, to handle correlated predictors, and to avoid overfitting. https://en.wikipedia.org/wiki/Regularization_(mathematics)
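As a rough illustration of why a penalty can produce coefficients that are exactly zero, here is a small sketch of soft-thresholding. This is the closed-form lasso (L1) solution in the idealized case of orthonormal predictors. It assumes the GLM penalty includes an L1 component, and the coefficient values and penalty strength `lam` are hypothetical, not taken from any real model.

```python
def soft_threshold(coef, lam):
    """Shrink coef toward zero by lam; return exactly 0.0 when |coef| <= lam."""
    if coef > lam:
        return coef - lam
    if coef < -lam:
        return coef + lam
    return 0.0

# Hypothetical unpenalized effects for five category levels.
unpenalized = {
    "level_a": 2.5,
    "level_b": -0.25,
    "level_c": 0.5,
    "level_d": -3.0,
    "level_e": 0.125,
}
lam = 0.5  # hypothetical penalty strength

# Apply the L1 shrinkage to every coefficient.
penalized = {name: soft_threshold(c, lam) for name, c in unpenalized.items()}

# Levels whose effect was too small to survive the penalty.
zeroed = [name for name, c in penalized.items() if c == 0.0]

print(penalized)
print(zeroed)
```

Levels with a weak effect are set to exactly zero, while strong effects are kept but shrunk. A zero coefficient therefore means the penalized model judged that level's separate effect too small to keep, not that the column was ignored as a whole.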

Image: https://us.v-cdn.net/6030995/uploads/editor/q8/qbcxa0ow49p5.png

Again, in the process view, you can toggle off the option of regularization.

Hope it helps.

Cheers,
YY

Chemical_eng · July 2022

Many thanks for this answer

Chemical_eng · July 2022

I performed the procedure, but when I open the model simulator operator results, it shows me one input variable per category (as if it kept the encoding). This is not what I want.

yyhuang · July 2022

Thank you @Chemical_eng! The model simulator from AutoML uses the data as it is before the one-hot encoding that GLM handles internally.

As the screenshot shows, there is a dropdown list of all possible values of the categorical variable.

If you are available for a follow-up, I could walk you through the details in a quick call.

Image: https://us.v-cdn.net/6030995/uploads/editor/6t/q0yfudanqahc.png

Chemical_eng · July 2022

Yes, I would like to have a call, because after switching to one-hot encoding my simulator does not show it that way. How can we arrange this? Thanks.
