# Prediction with Optional Features

RapidMiner Certified Analyst, Member Posts: 12 Contributor II
edited December 2018 in Help

Here's a question/scenario that has me going "hmm" ... I am faced with a regression problem where my dataset has examples with attributes {A, B, C} and other examples have attributes {A, B, C, D, E}. I'm scratching my head as I consider different ways to model the data to ultimately predict the target variable.

I understand at a basic level that my regression formula can't be Y = f(A,B,C,D,E) unless I have a way to impute/default the values "D" and "E" for those examples without those features. My thought process is "my model can make a more accurate prediction when it has more information" which is the hypothesis I want to prove with this data.
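One way to check that hypothesis is to score the same holdout rows (only rows that actually have D and E) with a model trained on {A, B, C} and a model trained on {A, B, C, D, E}, then compare the errors. A minimal Python sketch, assuming the predictions already exist; the `rmse` helper and all numbers below are hypothetical placeholders, not results from any real dataset:

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error over paired values."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Placeholder numbers for illustration only.
actual = [10.0, 14.0, 9.0, 12.0]         # target on holdout rows that have D and E
pred_reduced = [12.0, 11.0, 10.0, 9.0]   # from a model using only A, B, C
pred_full = [11.0, 13.0, 9.0, 12.0]      # from a model using A, B, C, D, E

rmse_reduced = rmse(actual, pred_reduced)
rmse_full = rmse(actual, pred_full)
print(f"reduced: {rmse_reduced:.3f}  full: {rmse_full:.3f}")
```

Both models have to be evaluated on the same rows; otherwise a difference in error could come from the rows rather than from the extra attributes.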

Does anybody have experience developing a model (or models) when some of the attributes are "optional"?

• RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
Solution Accepted

You have a couple of different approaches here:

1. Build a model on all the attributes, trained only on records that have every attribute populated.
2. Build a model on all the attributes and use missing value replacement (multiple options here) for any that are missing.
3. Build a model on only the smaller set of attributes that are common to all examples.
4. Build two separate models: one for the examples with the larger attribute set and one for the examples with the smaller attribute set.
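The four options differ mainly in how the training rows are prepared. A minimal Python sketch of that preparation, assuming each example is a dict in which a missing D or E is stored as `None`; the toy rows, the helper names, and the choice of mean imputation for option 2 are all assumptions for illustration, not RapidMiner operators:

```python
from statistics import mean

COMMON = ["A", "B", "C"]   # attributes every example has
OPTIONAL = ["D", "E"]      # attributes only some examples have

# Hypothetical toy data; None marks an attribute that was not recorded.
rows = [
    {"A": 1.0, "B": 2.0, "C": 0.5, "D": 3.0, "E": 1.0, "Y": 10.0},
    {"A": 2.0, "B": 1.0, "C": 0.1, "D": None, "E": None, "Y": 7.0},
    {"A": 0.5, "B": 3.0, "C": 0.9, "D": 5.0, "E": 2.0, "Y": 14.0},
    {"A": 1.5, "B": 2.5, "C": 0.3, "D": None, "E": None, "Y": 8.0},
]

def is_complete(row):
    """True when every optional attribute is populated."""
    return all(row[name] is not None for name in OPTIONAL)

# Option 1: keep only the records with all attributes populated.
complete_cases = [r for r in rows if is_complete(r)]

# Option 2: replace missing values (here with the mean of the observed values).
def impute_means(data):
    fill = {n: mean(r[n] for r in data if r[n] is not None) for n in OPTIONAL}
    return [
        {**r, **{n: r[n] if r[n] is not None else fill[n] for n in OPTIONAL}}
        for r in data
    ]

imputed = impute_means(rows)

# Option 3: keep every record but only the common attributes.
common_only = [{k: r[k] for k in COMMON + ["Y"]} for r in rows]

# Option 4: one training set per segment, each feeding its own model.
full_segment = complete_cases
reduced_segment = [
    {k: r[k] for k in COMMON + ["Y"]} for r in rows if not is_complete(r)
]
```

Options 1 and 3 discard information (rows and columns respectively), option 2 invents values, and option 4 splits the training data, which is where the trade-offs come from.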

No one of these approaches is always better than the others; it depends on your application and use case, and each has different pros and cons. Option 1 will usually give the best model, but it will not be able to score examples that lack the extra attributes, while option 3 will give the most broadly applicable model, but it won't be as powerful.

I've had good experience with the last option, which is essentially a segmented scorecard, although it requires enough examples of each type to train a good model separately. The second option is also a good possibility if there is a known reason why the additional attributes are missing and that reason can be used to assign reasonable replacement values.
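The segmented approach can be sketched in plain Python as two models plus a routing step at scoring time. Everything here is an assumption for illustration: the small least-squares fitter stands in for whatever learner you would actually use, the function names are hypothetical, and the data is synthetic and noise-free.

```python
import random

COMMON = ["A", "B", "C"]
OPTIONAL = ["D", "E"]

def fit_linear(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    design = [[1.0] + list(x) for x in X]
    n = len(design[0])
    # Augmented matrix [X'X | X'y].
    M = [
        [sum(r[i] * r[j] for r in design) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(design, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(M[k][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for k in range(n):
            if k != col:
                f = M[k][col]
                M[k] = [a - f * b for a, b in zip(M[k], M[col])]
    return [M[i][n] for i in range(n)]  # [intercept, coef_1, ...]

def predict(coef, x):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))

def has_optional(row):
    """True when the row carries values for all optional attributes."""
    return all(row.get(name) is not None for name in OPTIONAL)

def train_segmented(rows):
    """Fit one model per segment: rows with D and E, and rows without."""
    full = [r for r in rows if has_optional(r)]
    reduced = [r for r in rows if not has_optional(r)]
    full_model = fit_linear(
        [[r[n] for n in COMMON + OPTIONAL] for r in full], [r["Y"] for r in full]
    )
    reduced_model = fit_linear(
        [[r[n] for n in COMMON] for r in reduced], [r["Y"] for r in reduced]
    )
    return full_model, reduced_model

def score(row, full_model, reduced_model):
    """Route each example to the model that matches its attributes."""
    if has_optional(row):
        return predict(full_model, [row[n] for n in COMMON + OPTIONAL])
    return predict(reduced_model, [row[n] for n in COMMON])

# Synthetic data: half of the rows have D and E, half do not.
random.seed(0)
rows = []
for i in range(60):
    r = {n: random.uniform(0, 10) for n in COMMON}
    if i % 2 == 0:
        r.update({n: random.uniform(0, 10) for n in OPTIONAL})
        r["Y"] = 1 + 2 * r["A"] - r["B"] + 0.5 * r["C"] + 3 * r["D"] + 1.5 * r["E"]
    else:
        r["Y"] = 4 + 2 * r["A"] - r["B"] + 0.5 * r["C"]
    rows.append(r)

full_model, reduced_model = train_segmented(rows)
with_de = score({"A": 1.0, "B": 2.0, "C": 4.0, "D": 1.0, "E": 2.0}, full_model, reduced_model)
without_de = score({"A": 1.0, "B": 2.0, "C": 4.0}, full_model, reduced_model)
print(with_de, without_de)
```

Each segment model only ever sees its own kind of example, which is why both segments need enough rows to train on.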

Brian T.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts