Dealing with an important but often missing attribute

keith · April 2009

What is a good way to use an attribute that is important when a value is available, but is missing for a large percentage of the data set?

I have an example set containing data that go back about 20 years. Each example has 20-30 attributes, most of which are available for the entire 20-year span. However, some attributes are only available for recent data (the past 5 years or so) and are missing for all examples before that. These newer attributes, when present, are likely to be strong predictors for the regression problem I'm trying to solve.

My preferred model is nearest neighbors (actually W-LWL), as it's been found to work quite well when using attributes that are available throughout the timespan. However, if I simply fill in the missing values with the average (MissingValueReplenishment), such a large fraction of the dataset ends up sharing a single value that the attribute doesn't get selected or weighted highly.
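To make the averaging problem concrete, here is a small Python sketch on synthetic data (not my real data, and not RapidMiner code). The names `x_new` and `x_old`, the 75% missing fraction (roughly 15 of 20 years), and the use of Pearson correlation as a stand-in for whatever attribute weighting is applied are all assumptions for illustration.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mean_impute(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


rng = random.Random(0)
n = 2000

# Hypothetical data: the target depends strongly on x_new (recent-only)
# and more weakly on x_old (available for the whole span).
x_new = [rng.gauss(0, 1) for _ in range(n)]
x_old = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * a + b + rng.gauss(0, 0.5) for a, b in zip(x_new, x_old)]

# Assume the oldest 75% of examples never recorded x_new.
cutoff = int(0.75 * n)
x_observed = [None] * cutoff + x_new[cutoff:]

imputed = mean_impute(x_observed)

# Correlation on the recent rows only, versus on the full mean-filled column.
corr_recent = pearson(x_new[cutoff:], y[cutoff:])
corr_imputed = pearson(imputed, y)

print(f"recent rows only: {corr_recent:.2f}")
print(f"after mean fill:  {corr_imputed:.2f}")
```

On the recent rows alone the attribute tracks the target closely, but once three quarters of the column is one constant, its correlation over the whole set drops sharply, which is the effect I'm seeing.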

Is there an alternate way of modeling this such that it would take advantage of these useful-but-rare attributes only when they are present?
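As a rough sketch of the kind of behavior I mean (an assumption about one possible approach, not W-LWL and not an existing RapidMiner operator): a plain k-nearest-neighbor regressor that measures distance only over the attributes present in the query, and only considers training examples that have those same attributes. The function name `knn_predict_available` and the toy data are hypothetical.

```python
import math


def knn_predict_available(rows, targets, query, k=3):
    """k-NN regression using only the attributes present in the query.

    Missing values are represented as None. Training rows that lack any
    attribute the query has are excluded from the neighbor pool.
    """
    idx = [i for i, v in enumerate(query) if v is not None]
    pool = [
        (row, target)
        for row, target in zip(rows, targets)
        if all(row[i] is not None for i in idx)
    ]
    if not idx or not pool:
        raise ValueError("no usable attributes or no comparable training rows")

    def distance(row):
        # Euclidean distance restricted to the query's available attributes.
        return math.sqrt(sum((row[i] - query[i]) ** 2 for i in idx))

    nearest = sorted(pool, key=lambda pair: distance(pair[0]))[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical toy data: [always_available, recent_only]
train_rows = [
    [1.0, None],
    [1.1, None],
    [1.0, 5.0],
    [1.0, 0.0],
    [3.0, None],
]
train_targets = [10.0, 11.0, 20.0, 2.0, 30.0]

# Query with the recent attribute: only the two complete rows compete.
pred_with = knn_predict_available(train_rows, train_targets, [1.0, 5.0], k=1)

# Query without it: all rows compete, using the first attribute only.
pred_without = knn_predict_available(train_rows, train_targets, [1.0, None], k=3)
```

The obvious cost is that queries with the newer attributes draw neighbors only from the last 5 years or so, so the older history is ignored for exactly those cases.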


Altair RapidMiner Community
