How can I handle missing values for only specific years so I can keep certain examples?

N_28 · April 2021

Dear all,
I am currently writing my master's thesis. I am trying to build a predictive model, but I am really stuck. I no longer know how to handle the missing values in the ExampleSet without removing valuable examples from my data. To give you a better idea of what my data looks like and what I mean, I have attached a small part of my dataset.

For example, row 122 is not useful in my opinion, as only data for 2017 and 2019 is present. Row 226, on the other hand, is only missing 2019. So I thought I could delete rows such as 122, where not enough data is available (only two years), but keep rows such as 226, where only one year (2019) is missing. That way I can keep the indicator G3. Is that possible?

Hence, I want to filter out any example that is missing at least X values between 2014 and 2019. However, I do not know how to do this or which operator I need.

Can anyone please help me out?

Thank you so much in advance.

Image: https://us.v-cdn.net/6030995/uploads/editor/fi/yghiptdnjd6z.png

yyhuang · April 2021

You can generate a missing-value indicator for each year, then sum the indicators to get the total number of missing year columns per example. You can then filter on that count.

I don't have your data, but I put together something similar for your reference. HTH!
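For readers without the attached process, the indicator-and-sum idea can be sketched in plain Python. This is only an illustration of the logic, not the RapidMiner process from this answer: the year column names, the sample values, and the `MAX_MISSING` threshold are all assumptions, since the real dataset is only visible as a screenshot.

```python
# Hypothetical year columns; the real attribute names are not given in the thread.
YEARS = ["2014", "2015", "2016", "2017", "2018", "2019"]
# Assumed threshold: keep examples with at most this many missing years.
MAX_MISSING = 1


def count_missing(row, columns=YEARS):
    """Sum the per-year missing indicators (1 if the value is None, else 0)."""
    return sum(1 if row.get(col) is None else 0 for col in columns)


def filter_rows(rows, max_missing=MAX_MISSING, columns=YEARS):
    """Keep only rows whose missing-year count does not exceed max_missing."""
    return [row for row in rows if count_missing(row, columns) <= max_missing]


# Made-up values that mimic the two rows described in the question.
rows = [
    # Like row 122: only 2017 and 2019 are present, so four years are missing.
    {"id": 122, "2014": None, "2015": None, "2016": None,
     "2017": 3.1, "2018": None, "2019": 2.8},
    # Like row 226: only 2019 is missing.
    {"id": 226, "2014": 1.0, "2015": 1.2, "2016": 1.1,
     "2017": 1.4, "2018": 1.3, "2019": None},
]

kept = filter_rows(rows)
print([row["id"] for row in kept])  # prints [226]
```

Raising `MAX_MISSING` loosens the filter; with a threshold of 4, the row modelled on 122 would be kept as well.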

N_28 · April 2021

Thank you so much, I think this is a great idea! I will try this tomorrow.

N_28 · May 2021

Thank you @yyhuang, it worked!


Altair RapidMiner Community
