Handling of missing variables

chris92 Member Posts: 6 Contributor I
edited November 2018 in Help

Hi all,

 

I am having some difficulty handling missing values in my data. My dataset uses question marks (?) to mark missing values, and I need an operator that will identify these as missing values and, if possible, ignore them. Does such an operator exist? I do not want to replace the missing values with calculated values; I simply want to declare them as missing. It is important that I do not lose the entire row, as each row has 90-odd attributes with some missing values scattered throughout. I only want to ignore the individual cells that contain a ?. Any assistance in this matter will be greatly appreciated.

 

Regards,

Chris

Answers

  • Lobbie Member Posts: 10 Contributor I

    Hi Chris,

     

    IMHO, how missing values are handled depends on which data mining algorithm you intend to use on the data.  For example, if you intend to use a Decision Tree classifier, you do not need to worry about missing values, as a DT can handle or 'ignore' them.  However, if you intend to do a regression, missing values can be a problem.  In that situation, you can discretise (bin) the attributes and put the missing values into their own 'missing' class before running the regression.

     

    HTH,

    Lobbie
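
For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch (standard library only) of the two steps discussed so far: declaring `?` as missing without dropping any rows, and binning a numeric attribute so that missing cells get their own 'missing' class. The column names, the sample data and the threshold of 30 are made up for illustration, and this is not how RapidMiner implements either step.

```python
import csv
import io

# Hypothetical sample data; "?" marks a missing cell.
RAW = """age,income
25,50000
?,62000
40,?
"""

def parse_cell(text):
    """Return None for the "?" marker, otherwise the value as a float."""
    text = text.strip()
    return None if text == "?" else float(text)

def load_rows(source):
    """Read CSV text and keep every row, storing missing cells as None."""
    reader = csv.DictReader(io.StringIO(source))
    return [{name: parse_cell(value) for name, value in row.items()}
            for row in reader]

def bin_with_missing(value, threshold):
    """Discretise a value into 'low'/'high', with a separate 'missing' class."""
    if value is None:
        return "missing"
    return "low" if value < threshold else "high"

rows = load_rows(RAW)
age_bins = [bin_with_missing(row["age"], threshold=30) for row in rows]
```

All three rows survive loading; only the individual `?` cells become `None`, and the binned column can then be passed to a learner that cannot cope with missing numeric values.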

  • earmijo Member Posts: 270 Unicorn

    If you don't want to drop the cases that contain missing values you have to either:

     

    - Replace them 

    - Impute them

     

    You can replace them with the mean, zero or any other value.  Use the operator "Replace Missing Values" to do this. 

     

    A better way to deal with missing values is to impute them. Basically, you treat each column in turn as the label, use the other columns to learn a model, and then use that model to predict the missing values. The process is iterative and may well take a long time to run (make sure you save the result to a new dataset). Use the operator "Impute Missing Values" to do this. Bear in mind that this is a Nested Operator (you will have to place a model inside it to do the imputation). 
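
To make the difference between replacing and imputing concrete, here is a rough Python sketch using only the standard library. It is a simplification under stated assumptions, not what the RapidMiner operators do internally: the imputation here fits a single least-squares line on one predictor column in one pass, whereas the approach described above iterates over all columns with whatever learner is nested inside. The function names and the data are hypothetical.

```python
from statistics import mean

def replace_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_with_regression(target, predictor):
    """Fill None entries of `target` using a line fitted on `predictor`.

    Only rows where both values are present are used for fitting.
    Rows where the predictor is also missing are left as None.
    """
    pairs = [(x, y) for x, y in zip(predictor, target)
             if x is not None and y is not None]
    x_mean = mean(x for x, _ in pairs)
    y_mean = mean(y for _, y in pairs)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in pairs)
             / sum((x - x_mean) ** 2 for x, _ in pairs))
    intercept = y_mean - slope * x_mean
    return [intercept + slope * x if y is None and x is not None else y
            for x, y in zip(predictor, target)]

# Hypothetical data: y follows 2 * x, with one missing cell.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, None, 8.0]

mean_filled = replace_with_mean(y)
model_filled = impute_with_regression(y, x)
```

In this toy data the mean replacement fills the gap with the average of the observed values (about 4.67), while the model-based fill uses the relationship with `x` and lands on roughly 6, which is why imputation is usually the better choice when columns are correlated.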

     
