"Is it possible (and correct) to replace missing values while keeping the same distribution of values?"

f_laperna · October 2017

Hi, I have some attributes with missing values and I want to find the best way to replace them.

Usually you can replace them with the average (or the most frequent value), but is it possible in RapidMiner (and, more importantly, is it correct) to replace them while keeping the same distribution as the non-missing values?

Let me explain with an example:

Let's say I have an attribute "Nationality" with this distribution of values:

ENG: 50%

ITA: 22%

DEU: 20%

FRA: 8%

I would like to replace the missing values so that 50% of them become "ENG", 22% become "ITA", and so on.

Note that I don't have any other attributes that tell me more about it and that I could use to better estimate the nationality.

What do you think? Do you have any suggestions or better ways to do it?

Thank you in advance

BalazsBarany · October 2017

Hi!

It might be possible (e.g. by generating random numbers and using Generate Attributes to assign a value depending on whether the number falls between 0.0 and 0.5, 0.5 and 0.72, etc.), but it's certainly not correct.
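For illustration only, here is a minimal Python sketch of the mechanics described above: draw a uniform random number and map it to a label through cumulative thresholds (0.50, 0.72, 0.92, 1.00). It is not a RapidMiner process, and the names `shares`, `label_for` and `fill_missing` are hypothetical. It shows how the replacement could be done, not that it should be done.

```python
import bisect
import random
from itertools import accumulate

# Observed shares of the non-missing values, taken from the example in the question.
shares = {"ENG": 0.50, "ITA": 0.22, "DEU": 0.20, "FRA": 0.08}

labels = list(shares)
# Cumulative upper bounds: ENG ends at 0.50, ITA at 0.72, DEU at 0.92, FRA at 1.00.
bounds = list(accumulate(shares.values()))


def label_for(u):
    """Map a uniform draw u in [0, 1) to the label whose interval contains it."""
    idx = bisect.bisect_right(bounds, u)
    # Guard against floating-point rounding in the last cumulative bound.
    return labels[min(idx, len(labels) - 1)]


def fill_missing(values, rng):
    """Replace None entries with labels drawn according to `shares`."""
    return [label_for(rng.random()) if v is None else v for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
column = ["ENG", None, "ITA", None, None, "DEU"]
filled = fill_missing(column, rng)
print(filled)
```

Known values are left untouched; only the `None` entries receive a randomly drawn label.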

You have data with a known value (people with the attribute value ENG) and data with a missing value. If you randomly assign someone the value ENG without knowing if it is right, you'll get a worse model.

What to do depends on several things. Is a large percentage of the values missing? Then it might be better to just drop the attribute. Might the "missingness" of the value have a meaning of its own? Then you might want to change "missing" to another value like "MISSING Nationality" (if your model requires data without missing values). Are there very few missing nationalities? You might build the model without those examples (if you can accept a model that won't work on new examples with a missing nationality).
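The three options above can be sketched in plain Python, assuming the data is a list of dictionaries with `None` marking a missing value. The helper names (`missing_share`, `drop_attribute`, `flag_missing`, `drop_incomplete`) are hypothetical and are not RapidMiner operators. Which option fits is a judgment call about the data, so the sketch deliberately encodes no thresholds.

```python
MISSING_LABEL = "MISSING Nationality"  # explicit category suggested in the post


def missing_share(rows, attr):
    """Fraction of rows in which `attr` is missing (None or absent)."""
    if not rows:
        return 0.0
    return sum(r.get(attr) is None for r in rows) / len(rows)


def drop_attribute(rows, attr):
    """Option 1: remove the attribute entirely (many values missing)."""
    return [{k: v for k, v in r.items() if k != attr} for r in rows]


def flag_missing(rows, attr, label=MISSING_LABEL):
    """Option 2: turn 'missing' into its own category."""
    return [{**r, attr: label if r.get(attr) is None else r[attr]} for r in rows]


def drop_incomplete(rows, attr):
    """Option 3: keep only the examples where the attribute is known."""
    return [r for r in rows if r.get(attr) is not None]


rows = [
    {"id": 1, "Nationality": "ENG"},
    {"id": 2, "Nationality": None},
    {"id": 3, "Nationality": "ITA"},
    {"id": 4, "Nationality": "DEU"},
]

print(missing_share(rows, "Nationality"))
print(flag_missing(rows, "Nationality")[1])
print(len(drop_incomplete(rows, "Nationality")))
```

None of the helpers modify `rows` in place, so the options can be compared side by side on the same data.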

These are correct approaches. Filling missing values with random data is no better than randomly changing non-missing data. (That might be a sensible thing to do in some circumstances, for example if you'd like to test the robustness of your model. But that would happen in a later phase.)

Regards,

Balázs

f_laperna · October 2017

OK, thank you. I was fairly sure it was not correct to do that, which is why I asked. Since the number of missing values is not large, I will just exclude these records from the model.

Altair RapidMiner Community
