How can I handle missing values for only specific years so I can keep certain examples?

N_28 Member Posts: 9 Learner I
Dear all,
I am currently writing my master's thesis. I am trying to build a predictive model, but I am really stuck: I do not know how to handle the missing values in the ExampleSet without removing valuable examples from my data. To give you a better idea of what my data looks like and what I mean, I have attached a small part of my dataset.

For example, row 122 is not useful in my opinion, as only data for 2017 and 2019 is present. Row 226, however, is only missing 2019. So I thought I could delete rows such as 122, where insufficient data is available (only two years), but keep rows such as 226, where only one year (2019) is missing. That way I can keep the indicator G3. Is that possible?

Hence, I want to filter out any example that is missing at least X values between 2014 and 2019. But I do not know how to do this or which operator I need.

Can anyone please help me out?

Thank you so much in advance.

Best Answers

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    Solution Accepted
    You can generate an indicator for the missing value in each year, then sum the indicators to get the total count of missing year columns per example. You can then filter the examples on that count.

    I don't have your data, but I put together something similar for your reference. HTH!
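
The idea in this answer (one missing-value indicator per year, summed per example, then a filter on the sum) can be sketched outside RapidMiner as well. The following is a minimal plain-Python illustration, not the process attached to the answer. The row layout, the example values, the names `count_missing` and `filter_examples`, and the threshold `x=2` are all assumptions made for the sketch.

```python
# Hypothetical data: one dict per example; a missing year key (or None)
# means the value of indicator G3 is missing for that year.
YEARS = range(2014, 2020)  # 2014 through 2019


def count_missing(row):
    """Sum the per-year missing indicators (1 if missing, 0 otherwise)."""
    return sum(1 if row.get(year) is None else 0 for year in YEARS)


def filter_examples(rows, x):
    """Drop every row that is missing at least x yearly values."""
    return [row for row in rows if count_missing(row) < x]


rows = [
    # Like row 122 in the question: only 2017 and 2019 are present.
    {"id": "sparse", 2017: 1.2, 2019: 1.5},
    # Like row 226 in the question: only 2019 is missing.
    {"id": "one_gap", 2014: 0.8, 2015: 0.9, 2016: 1.0, 2017: 1.1, 2018: 1.3},
    {"id": "complete", 2014: 0.5, 2015: 0.6, 2016: 0.7, 2017: 0.8,
     2018: 0.9, 2019: 1.0},
]

kept = filter_examples(rows, x=2)  # x=2 is an assumed threshold
print([row["id"] for row in kept])
```

With the threshold set to 2, the example with four missing years is dropped and the example missing only 2019 is kept. Raising or lowering `x` trades the number of retained examples against completeness.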

  • N_28 Member Posts: 9 Learner I
    Solution Accepted
    Thank you so much, I think this is a great idea! I will try this tomorrow.

