"Assessing features performance on different datasets"

ollestrat · January 2011

Hello,

My question is:
How can I identify the features that work best across several different datasets? Such features have to be robust and transferable, i.e. independent of the specific characteristics of any individual dataset.

My data:
- two-class problem
- 7 datasets with about 50 identical numerical features (ranges can differ significantly, but the goal is not to find robust thresholds but rather to identify the key features that perform well across all datasets)
- Each dataset with about 5000 instances for training and testing

My ideas so far:
- select an optimal feature subset for each of the 7 datasets (e.g. by wrapper feature selection) and simply count the occurrences over all 7 results
- also, calculate the "information gain" of the features for each individual dataset. The average over all 7 tests should reveal the robust features (? ..hopefully).
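
The two ideas above can be sketched in plain Python (outside RapidMiner) roughly as follows. This is only an illustration under assumptions: the function names, the `(columns, labels)` data layout and the equal-width binning with 10 bins are hypothetical choices, not something taken from RapidMiner. Binning is done relative to each dataset's own value range, so differing ranges between datasets do not affect the score.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, bins=10):
    """Information gain of one numeric feature after equal-width binning.

    The bin edges are derived from this dataset's own min/max (an assumption
    made here so that datasets with different ranges stay comparable).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # constant feature: everything in one bin
    groups = {}
    for v, y in zip(values, labels):
        b = min(int((v - lo) / width), bins - 1)
        groups.setdefault(b, []).append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def selection_counts(subsets):
    """Idea 1: count how often each feature appears across per-dataset subsets."""
    return Counter(name for subset in subsets for name in subset)

def mean_information_gain(datasets, bins=10):
    """Idea 2: average each feature's information gain over all datasets.

    `datasets` is a list of (columns, labels) pairs, where `columns`
    maps a feature name to its list of values.
    """
    totals = Counter()
    for columns, labels in datasets:
        for name, values in columns.items():
            totals[name] += information_gain(values, labels, bins)
    return {name: total / len(datasets) for name, total in totals.items()}
```

A feature that is selected in most of the 7 subsets and also has a high mean information gain would be a candidate for a robust feature. Note that a plain mean can hide a feature that is excellent on a few datasets and useless on the rest.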

Do you think these ideas are worth pursuing? Can you point me to potential problems, improvements, RapidMiner algorithms etc., as I'm relatively new to RM and data mining?

Thanks and Greetings
ollestrat

IngoRM · January 2011

Hi ollestrat,

your ideas make sense. I just want to add that you could probably make your estimate of the best multi-purpose feature set even more robust if you take into account not only the optimal feature set but the results of multiple feature selection runs for each data set. More often than not, the feature set is overfitted during the selection process, and a wrapper validation approach with an inner and an outer cross validation helps to overcome this issue.
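
To make the inner/outer idea concrete, here is a minimal Python sketch of a nested wrapper validation. It is not the RapidMiner implementation: the nearest-centroid classifier, the greedy forward selection and all function names are assumptions chosen to keep the example short. The selection only ever sees the outer training rows and is scored by the inner cross validation; the outer fold then gives a performance estimate that the selection never touched, and each outer fold yields its own feature subset.

```python
import random

def folds(n, k, seed=0):
    """Split indices 0..n-1 into k shuffled, roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def centroid_accuracy(X, y, train, test, feats):
    """Accuracy of a nearest-centroid classifier restricted to `feats`."""
    cents = {}
    for label in set(y[i] for i in train):
        rows = [X[i] for i in train if y[i] == label]
        cents[label] = [sum(r[f] for r in rows) / len(rows) for f in feats]

    def predict(row):
        return min(cents, key=lambda c: sum((row[f] - m) ** 2
                                            for f, m in zip(feats, cents[c])))

    return sum(predict(X[i]) == y[i] for i in test) / len(test)

def cv_accuracy(X, y, rows, feats, k=3):
    """Inner loop: mean k-fold accuracy computed only on the given rows."""
    scores = []
    for part in folds(len(rows), k):
        held = set(part)
        test = [rows[j] for j in part]
        train = [rows[j] for j in range(len(rows)) if j not in held]
        scores.append(centroid_accuracy(X, y, train, test, feats))
    return sum(scores) / k

def forward_selection(X, y, rows, n_features):
    """Greedy wrapper selection, scored by the inner cross validation."""
    chosen, best = [], 0.0
    while True:
        trials = [(cv_accuracy(X, y, rows, chosen + [f]), f)
                  for f in range(n_features) if f not in chosen]
        if not trials or max(trials)[0] <= best:
            return chosen  # no remaining feature improves the inner score
        best, f = max(trials)
        chosen.append(f)

def nested_selection(X, y, outer_k=5):
    """Outer loop: select on the training part, score on the untouched fold."""
    results = []
    for part in folds(len(X), outer_k, seed=1):
        held = set(part)
        train = [i for i in range(len(X)) if i not in held]
        feats = forward_selection(X, y, train, len(X[0]))
        results.append((feats, centroid_accuracy(X, y, train, part, feats)))
    return results
```

`nested_selection` returns one `(selected_features, outer_accuracy)` pair per outer fold, so the variation between the selected subsets becomes visible directly.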

Hardly anybody knows that the RapidMiner cross validation operators can average not only the performance but also other averagable results such as feature weights. So I would suggest calculating those averaged feature weights for each data set and then averaging those results again over all data sets. A different aggregation function than the plain average might work even better here.
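
Outside of RapidMiner, this two-level averaging could look roughly like the Python sketch below. The function names and the input layout (one list of per-fold `{feature: weight}` dicts per data set) are assumptions made for illustration. With 0/1 selection weights, the per-data-set average is simply the fraction of folds in which a feature was selected.

```python
from statistics import mean, median

def average_weights(weight_dicts, aggregate=mean):
    """Combine several {feature: weight} dicts feature by feature."""
    names = weight_dicts[0].keys()
    return {name: aggregate([w[name] for w in weight_dicts]) for name in names}

def robust_feature_weights(per_dataset_fold_weights, aggregate=mean):
    """Average over folds within each data set, then aggregate across data sets.

    `aggregate` is the cross-data-set aggregation function; mean, median
    and min are all valid choices because each takes a list of numbers.
    """
    per_dataset = [average_weights(fold_weights)
                   for fold_weights in per_dataset_fold_weights]
    return average_weights(per_dataset, aggregate)
```

Passing `aggregate=median` makes the result less sensitive to a single unusual data set, and `aggregate=min` ranks features by their worst data set, which is closer to the goal of features that work everywhere.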

Just my 2c, cheers,
Ingo

ollestrat · January 2011

Thank you Ingo for your helpful remarks.

I set up a workflow that repeats the FSS on every dataset 10 times (10-fold "Wrapper X-Validation"), and indeed the subsets vary to a fair degree, as you supposed, so averaging the subsets seems to be a good choice.

However, I didn't quite get how I can benefit from assessing the performance of the feature subsets with a further classifier (as it is nested twice: "Optimize Selection" within the "Wrapper X-Validation"). "Optimize Selection" is a wrapper method, and the "Wrapper X-Validation" again requires a classifier. Choosing exactly the same classifier will not lead to significantly different performance values compared to the FSS performance evaluation within "Optimize Selection". And choosing a different classifier does not make sense either, as a wrapper FSS is inherently biased towards its selected classifier. I'm probably misunderstanding something here.

Greetings

ollestrat

wessel · January 2011

Option 1.
Is it possible to simply merge all the datasets?

Option 2.
Get ideas from this presentation:
The joint boost algorithm.
http://courses.engr.illinois.edu/ece598/ffl/paper_presentations/HaoTang_JointBoosting.pdf

ollestrat · January 2011

Merging all datasets could be an option. However, the current setup uses a Random Forest classifier and a tree importance measure. Merging all datasets in a preprocessing step would result in more complexity, i.e. many more tree branches and greater depth would be required, as each dataset is different (the metadata would reveal significant differences in the mean values and standard deviations of the features; some are only biased, some differ in their distribution etc.). In the end my hardware setup could not handle it, but that's only the practical issue. Theoretically..maybe.

Concerning JointBoosting: I didn't get it at first glance and need to take a closer look at it. Thank you though.

wessel · January 2011

Joint boost is a meta algorithm, like boosting, bagging or stacking.

It's designed to find shared features among different concepts within different datasets.

The slides show that tackling the harder problem of learning all concepts at once
yields better results than learning to discriminate between each concept and the rest on separate datasets.
