Discretization before or after Feature Selection?

green_tea Member Posts: 11 Contributor I
edited January 2019 in Help
Hello RapidMiner community,
I posted this question yesterday evening as well, but it somehow disappeared after I edited it. I'm not sure whether it will come back, so I thought I would ask again.

I have the following situation: I have a labelled dataset with 80+ features and ~3 million rows. I want to do feature selection to get the ~10 most relevant features. The resulting features have to be discretized, as I can only work with a limited number of distinct values. For example, if a feature has values between 0 and 100, I will have to discretize it into 2-5 bins. Now I am unsure whether I have to discretize all 80 variables first and then do the feature selection, or whether I can do the discretization only on the 10 most relevant features. How would this affect my result? I greatly appreciate your answers and explanations!
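
To make the two orderings concrete, here is a rough sketch of what I mean outside of RapidMiner (a scikit-learn example on synthetic data; the selection criterion, bin settings and sample size are just placeholders, not my actual setup):

```python
# Sketch of the two orderings, assuming scikit-learn and synthetic data.
# Selector, bin count and sample size are illustrative placeholders only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_classification(n_samples=5000, n_features=80, n_informative=10,
                           random_state=0)

# Ordering A: discretize all 80 features first, then select on the binned data.
X_binned = KBinsDiscretizer(n_bins=5, encode="ordinal",
                            strategy="quantile").fit_transform(X)
selector_a = SelectKBest(mutual_info_classif, k=10).fit(X_binned, y)

# Ordering B: select on the raw features, then discretize only the 10 survivors.
selector_b = SelectKBest(mutual_info_classif, k=10).fit(X, y)
X_top10_binned = KBinsDiscretizer(n_bins=5, encode="ordinal",
                                  strategy="quantile").fit_transform(selector_b.transform(X))

# The two orderings can pick different feature sets, because binning changes
# the scores the selector sees.
print(sorted(np.flatnonzero(selector_a.get_support())))
print(sorted(np.flatnonzero(selector_b.get_support())))
```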

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    I would agree with @lionelderkrikor, but with a bit less "force". I think it's statistically legal to do both. But I don't see any reason to do a FS on a different feature representation than the one you use for learning?

    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Maerkli Member Posts: 84 Guru
    Hello Green_Tea,

    Martin and Lionel are two RapidMiner authorities - I can't contradict them. However, I would recommend having a look at this training given by Markus Hofmann, another RM senior person: https://www.youtube.com/watch?v=Nmo5puHRBwE
    Maerkli

  • green_tea Member Posts: 11 Contributor I
    First of all, thanks for the very fast replies and explanations!
    As @mschmitz asked:
    "But I don't see any reason to do a FS on a different feature representation than you use for learning?"
    I will actually not use the resulting dataset for learning, but will combine the selected features into an "activity key" that I have to use for another tool. That is also the reason why I have to discretize the features, as too many different possibilities would limit the usability of that key. By doing the discretization afterwards, I would save a lot of work.
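
    In case it helps, a hypothetical sketch of what I mean by "activity key" (pandas assumed; the column names and bin edges are made up):

    ```python
    # Build an "activity key" by concatenating the binned values of the
    # selected features (column names and bin edges are made up).
    import pandas as pd

    df = pd.DataFrame({
        "feat_a": [12.3, 55.0, 98.7],
        "feat_b": [0.4, 0.9, 0.1],
    })

    # Bin each selected feature into a small number of labelled buckets ...
    df["feat_a_bin"] = pd.cut(df["feat_a"], bins=[0, 50, 100], labels=["low", "high"])
    df["feat_b_bin"] = pd.cut(df["feat_b"], bins=[0, 0.5, 1.0], labels=["low", "high"])

    # ... and join the bucket labels into one key. With raw values almost every
    # row would get its own key; with 2-5 bins per feature the key space stays small.
    df["activity_key"] = df["feat_a_bin"].astype(str) + "_" + df["feat_b_bin"].astype(str)
    print(df["activity_key"].tolist())   # ['low_low', 'high_high', 'high_low']
    ```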


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    For the Dec-Tree example: if you discretize first, you enforce specific split points (your bin boundaries). This changes what the tree can do. You reduce the tree's ability to find its own split points, which is effectively a quasi-pre-pruning. Thus it makes a big difference for a tree whether you discretize before or not.
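
    For illustration, a tiny sketch of this effect (scikit-learn assumed, synthetic 1-D data, not a RapidMiner process):

    ```python
    # On binned data the tree can only split at the bin boundaries you chose,
    # which acts like pre-pruning. Synthetic 1-D example, scikit-learn assumed.
    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 100, size=(5000, 1))
    y = (X[:, 0] > 37.2).astype(int)          # the "true" split is at 37.2

    tree_raw = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(tree_raw.tree_.threshold[0])        # close to 37.2: the tree finds it

    # Discretize into 3 equal-width bins first: boundaries at ~33.3 and ~66.7.
    X_binned = KBinsDiscretizer(n_bins=3, encode="ordinal",
                                strategy="uniform").fit_transform(X)
    tree_binned = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_binned, y)
    print(tree_binned.tree_.threshold[0])     # can only split between the bin codes,
                                              # i.e. at your boundaries, not at 37.2
    ```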

    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    @mschmitz Agreed, that is why I said that, if using a tree method, it would be better to do the modeling first and then use the splits found by the tree for the discretization. Sorry if that was not clear.
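
    A rough sketch of that idea (scikit-learn assumed, not the RapidMiner operators; the data and parameters are made up):

    ```python
    # Fit a shallow tree first, then reuse the thresholds it found as bin edges.
    # Data and parameters are illustrative only.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=5000, n_features=1, n_informative=1,
                               n_redundant=0, n_clusters_per_class=1, random_state=0)

    # Limiting the number of leaves caps how many bins you end up with (here <= 4).
    tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X, y)

    # Thresholds of the internal nodes become the bin boundaries (-2 marks leaves).
    thresholds = np.sort(tree.tree_.threshold[tree.tree_.threshold != -2])
    X_binned = np.digitize(X[:, 0], thresholds)   # supervised, tree-driven bins
    print(thresholds)
    ```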

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • green_tea Member Posts: 11 Contributor I
    Thanks for the input!
    I decided to discretize first and am doing the feature selection right now. I will probably also do the same evaluation without discretization to see how much of a difference it makes.