Automatic feature engineering: results interpretation

kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 277   Unicorn
Hi there, 

I played a bit with the Automatic Feature Engineering operator and would like to clear up a few points; specifically, how to interpret the results.

I used 'balance for accuracy' = 1 and the 'feature selection' option (no generation) on a dataset of 14,000 examples / 142 regular features, split 80/20 into train and test. Inside the feature selection operator I used a GLM learner on a numeric label (so this is a linear regression) with RMSE as the optimization criterion.

This is the output I got in the progress dialog:



In total, 5 feature sets were generated, with the following fitness / complexity values, respectively:

0.408 / 142  
0.408 / 62
0.410 / 59
0.458 / 55
0.466 / 50

In terms of RMSE minimization, the first two sets are optimal (the leftmost points on the right graph). However, the first one is identical to the original set, which means ALL features were used.
  1. Why did the optimization operator still choose the bigger set (142) rather than the smaller one (62), given that the fitness is equal for both?
  2. Is there a way to make the optimizer choose the set that is both the most accurate AND the smallest in a situation like the one above?
  3. If the most accurate feature set includes all the features, does that mean they all contribute to the predictions, so no feature can be removed without increasing the error?
  4. How do I interpret the left graph? I understand it shows the trade-off between error and complexity for different feature sets, but how exactly do I read it? Why does the upper line (in blue) show complexity 142 and an error close to 0.490 (logically the highest error, so the 'worst' option)? Conversely, the lower line (in orange) sits around 0.408 (the lowest value) but its complexity is around 20. In other words, I cannot find the correspondence between the left and right charts.
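
Questions 1 and 2 can also be explored outside the operator. The following is a minimal Python sketch, assuming only the five fitness / complexity pairs listed above as displayed (rounded to three decimals). It is not how the Automatic Feature Engineering operator selects its result; the function names and the tolerance `tol` are hypothetical and chosen for illustration only. It shows which sets are non-dominated on the displayed values and which set is the smallest among those within a small tolerance of the best error.

```python
# (RMSE, number of features) exactly as displayed in the post, rounded to 3 decimals
feature_sets = [(0.408, 142), (0.408, 62), (0.410, 59), (0.458, 55), (0.466, 50)]


def smallest_near_best(sets, tol=5e-4):
    """Return the least complex set whose error is within `tol` of the best error.

    `tol` is an illustrative assumption, not an operator parameter.
    """
    best_error = min(err for err, _ in sets)
    candidates = [s for s in sets if s[0] - best_error <= tol]
    return min(candidates, key=lambda s: s[1])


def pareto_front(sets):
    """Keep the sets that no other set beats or ties on both error and complexity
    while being strictly better on at least one of them."""
    front = []
    for err, size in sets:
        dominated = any(
            (e <= err and c <= size) and (e < err or c < size)
            for e, c in sets
        )
        if not dominated:
            front.append((err, size))
    return front


print(smallest_near_best(feature_sets))  # (0.408, 62)
print(pareto_front(feature_sets))        # the 142-feature set drops out
```

On the rounded values the 142-feature set is dominated by the 62-feature set, since the error is equal and the complexity is lower. That only holds if the two errors really are equal, which the three-decimal display cannot confirm.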

Answers

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 745   Unicorn
    Hi @kypexin,

    This is just a hypothesis for question 1:

    You have set "Balance for accuracy" = 1, so RM will select the most accurate feature set.
    Maybe the fitness with 142 features is something like 0.407777..., which is in fact lower than the fitness
    with 62 features (maybe something like 0.4079999...).

    Regards,

    Lionel 
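
The rounding hypothesis is easy to check with a short Python sketch. The two unrounded RMSE values are the hypothetical ones suggested in the answer above, not numbers from the actual run, and the variable names are made up for illustration.

```python
# Hypothetical unrounded RMSE values (taken from the hypothesis, not from the real run)
rmse_142 = 0.407777   # assumed error with all 142 features
rmse_62 = 0.4079999   # assumed error with 62 features

# What a three-decimal display would show for each value
shown_142 = f"{rmse_142:.3f}"
shown_62 = f"{rmse_62:.3f}"

print(shown_142, shown_62)   # both print as 0.408
print(rmse_142 < rmse_62)    # True: the 142-feature set is still strictly more accurate
```

If the real values behave like this, an optimizer that weighs accuracy alone would prefer the 142-feature set even though both errors look identical in the dialog.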
  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 277   Unicorn
    Thanks @lionelderkrikor and @IngoRM!

    I always forget about this real-number rounding thing. All the other explanations seem pretty straightforward, so I am now more confident in interpreting automatic feature engineering results.