How to deal with high cardinality variables on a Regression problem in RapidMiner

tlg265 Member Posts: 1 Contributor I
Hello, I'm working on a Regression problem with a dataset that looks like:
> str(myds)
'data.frame':   841500 obs. of  30 variables:
 $ score                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ amount_sms_received       : int  0 0 0 0 0 0 3 0 0 3 ...
 $ amount_emails_received    : int  3 36 3 12 0 63 9 6 6 3 ...
 $ distance_from_server      : int  17 17 7 7 7 14 10 7 34 10 ...
 $ age                       : int  17 44 16 16 30 29 26 18 19 43 ...
 $ points_earned             : int  929 655 286 357 571 833 476 414 726 857 ...
 $ registrationYYYY          : Factor w/ 2 levels ...
 $ registrationDateMM        : Factor w/ 9 levels ...
 $ registrationDateDD        : Factor w/ 31 levels ...
 $ registrationDateHH        : Factor w/ 24 levels ...
 $ registrationDateWeekDay   : Factor w/ 7 levels ...
 $ catVar_05                 : Factor w/ 2 levels ...
 $ catVar_06                 : Factor w/ 140 levels ...
 $ catVar_07                 : Factor w/ 21 levels ...
 $ catVar_08                 : Factor w/ 1582 levels ...
 $ catVar_09                 : Factor w/ 70 levels ...
 $ catVar_10                 : Factor w/ 755 levels ...
 $ catVar_11                 : Factor w/ 23 levels ...
 $ catVar_12                 : Factor w/ 129 levels ...
 $ catVar_13                 : Factor w/ 15 levels ...
 $ city                      : Factor w/ 22750 levels ...
 $ state                     : Factor w/ 55 levels ...
 $ zip                       : Factor w/ 26659 levels ...
 $ catVar_17                 : Factor w/ 2 levels ...
 $ catVar_18                 : Factor w/ 2 levels ...
 $ catVar_19                 : Factor w/ 3 levels ...
 $ catVar_20                 : Factor w/ 6 levels ...
 $ catVar_21                 : Factor w/ 2 levels ...
 $ catVar_22                 : Factor w/ 4 levels ...
 $ catVar_23                 : Factor w/ 5 levels ...

My goal is to predict the target variable: "score".

I'm using R, but I also want to use RapidMiner. Based on what I have read so far, the two tools work well together.

At http://mod.rapidminer.com/#app I described the dataset shown above, and it recommended KNN for predicting the target variable "score".

My main concern is the high-cardinality variables "city" and "zip".

One of the ways to deal with that is by using "Target Encoding" (aka: "Mean Encoding"). But as stated here:

https://maxhalford.github.io/blog/target-encoding-done-the-right-way/

"The problem of target encoding has a name: over-fitting. Indeed relying on an average value isn’t always a good idea when the number of values used in the average is low. You’ve got to keep in mind that the dataset you’re training on is a sample of a larger set. This means that whatever artifacts you may find in the training set might not hold true when applied to another dataset (i.e. the test set)."

It looks like the way to handle that side effect is "Regularization".
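
For illustration, one common form of this regularization is additive smoothing: each category's mean is pulled toward the global mean, and the pull is stronger when the category has few rows. The sketch below is a minimal plain-Python version under that assumption. The function name `smoothed_target_encoding`, the weight `m`, and the toy data are hypothetical and are not taken from vtreat or RapidMiner.

```python
from collections import defaultdict

def smoothed_target_encoding(categories, targets, m=10.0):
    """Map each category to (sum_y + m * global_mean) / (count + m).

    m is a hypothetical smoothing weight: m = 0 gives the plain
    per-category mean, larger m shrinks small groups toward the
    global mean.
    """
    if len(categories) != len(targets) or not targets:
        raise ValueError("categories and targets must be non-empty and equal in length")
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1
    return {
        category: (sums[category] + m * global_mean) / (counts[category] + m)
        for category in counts
    }

# Toy data: city "B" appears only once, so its raw mean is unreliable.
cities = ["A", "A", "A", "A", "B"]
scores = [1.0, 1.0, 1.0, 1.0, 6.0]

plain = smoothed_target_encoding(cities, scores, m=0.0)
smoothed = smoothed_target_encoding(cities, scores, m=1.0)
```

With `m=1.0`, the single-row city "B" moves noticeably toward the global mean, while the four-row city "A" moves only slightly. A suitable value for `m` would normally be tuned by validation.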

I have been using R, and one of the most popular packages to deal with this is: "vtreat" which is used here:

https://www.r-bloggers.com/vtreat-prepare-data/

That package is certainly awesome, but I think it is going to take me a while to become familiar with it.

My question is: can RapidMiner do "Target Encoding" as well, applying "Regularization" at the same time? Its very intuitive UI would probably help.

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,270   Unicorn
    You can accomplish "Target Encoding" in RapidMiner by using the Aggregate operator where you use the X categories as your "group by" and average of Y as your aggregation function, and then join that back into your dataset as a new attribute.
    (I'd have to think through the implications of how this is really different in a regression context rather than simply converting the original categorical attribute using Nominal to Numerical with dummy encoding and throwing all those into your regression.  At first blush they don't seem likely to lead to dramatically different results.)
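
Outside RapidMiner, the same aggregate-then-join logic can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Aggregate operator itself. The helper names `aggregate_mean` and `join_encoding`, the new attribute name `state_mean_score`, and the toy rows are hypothetical.

```python
from collections import defaultdict

def aggregate_mean(rows, group_by, target):
    """Group rows by one attribute and return {category: mean of target}."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [sum, count]
    for row in rows:
        acc = totals[row[group_by]]
        acc[0] += row[target]
        acc[1] += 1
    return {category: s / n for category, (s, n) in totals.items()}

def join_encoding(rows, group_by, means, new_name):
    """Join the per-category means back onto each row as a new attribute."""
    return [{**row, new_name: means[row[group_by]]} for row in rows]

# Toy example set with one nominal attribute and the numeric label.
rows = [
    {"state": "NY", "score": 2.0},
    {"state": "NY", "score": 4.0},
    {"state": "CA", "score": 1.0},
]

state_means = aggregate_mean(rows, "state", "score")
encoded = join_encoding(rows, "state", state_means, "state_mean_score")
```

To avoid leaking the label, the means would normally be computed on the training rows only and then joined onto the validation or test rows.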

    Either way, that's not a good idea with high cardinality nominal variables because of overfitting.  And I wouldn't rely on regularization (which is a supported option in the GLM operator in RapidMiner) to fix it.

    In my view, you are much better off doing appropriate feature engineering upfront as part of modeling pre-processing to combine related categories (for any attribute with high cardinality).  For example, instead of using the full zip code, use only the first 1-3 characters and combine that way (an easy transformation in RapidMiner) to get larger, more representative samples.  Or use a City to Metro Region mapping dataset (a bit more complicated, but still do-able) to join in the metro region, and use that instead.  RapidMiner has plenty of binning and combining operators to support this kind of ETL (join, discretize, map, replace, replace rare values, generate attributes, etc.)
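
As a rough sketch of two of these ideas (truncating the zip code and folding rare categories together), here is a plain-Python version. It is not a RapidMiner process. The helper names `zip_prefix` and `replace_rare`, the `min_count` threshold, and the "OTHER" label are hypothetical choices for illustration.

```python
from collections import Counter

def zip_prefix(zip_code, n_chars=3):
    """Keep only the first n_chars characters of a zip code.

    The zip code is treated as text so that leading zeros survive.
    """
    return str(zip_code)[:n_chars]

def replace_rare(values, min_count, other_label="OTHER"):
    """Replace categories seen fewer than min_count times with one label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Toy data: five distinct zip codes collapse to three prefixes,
# and the prefixes seen only once are merged into "OTHER".
zips = ["10001", "10002", "10003", "94105", "60601"]
prefixes = [zip_prefix(z) for z in zips]
grouped = replace_rare(prefixes, min_count=2)
```

Either step reduces the number of levels, so each remaining level is backed by more rows. The prefix length and the count threshold are judgment calls that depend on the data.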




    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts