Impute Missing Values

mlubicz · May 2021

Working with my students on dealing with missing and imbalanced data in RM, we found that the Impute Missing Values operator, as used in the Tutorial Process for that operator, removes the label role from the class attribute (of the Labor-Negotiations dataset) and transfers it to the duration attribute.

You can easily check the attributes and their roles at the operator's outer input and at the inner input of k-NN (or any other learner inside the operator).

I was not able to explain such behaviour (although of course it is easy to work around it by using Set Role twice).

Does anybody know the formal explanation?

yyhuang · May 2021

As explained in the first reply, the learner is built iteratively on each column. When you impute column A, the operator automatically sets column A as the label, because you need to predict the missing values in column A. When you impute column B in the next round, the learner uses the non-missing values of column B as the label to predict the missing values in column B. This is repeated (with a different column set as the label in each step) for column C, column D, column E, …, until the missing values in all columns have been imputed.
More explanation and implementation details can be found on the GitHub open source page here
https://github.com/rapidminer/rapidminer-studio/blob/master/src/main/java/com/rapidminer/operator/preprocessing/filter/MissingValueImputation.java
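The column-by-column loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the operator's actual Java implementation: the function names (`impute`, `knn_predict`, `distance`), the use of `None` as the missing marker and the simple mixed-type distance are all assumptions made for this example.

```python
import math
from collections import Counter

MISSING = None  # assumption: a missing cell is represented by None


def distance(row_a, row_b, skip):
    """Mixed-type distance over all columns except `skip`; missing cells are ignored."""
    total = 0.0
    for col, a in row_a.items():
        b = row_b[col]
        if col == skip or a is MISSING or b is MISSING:
            continue
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            total += (a - b) ** 2              # numerical: squared difference
        else:
            total += 0.0 if a == b else 1.0    # nominal: 0 if equal, 1 otherwise
    return math.sqrt(total)


def knn_predict(train, query, label, k):
    """Predict `label` for `query` from its k nearest rows in `train`."""
    nearest = sorted(train, key=lambda r: distance(r, query, skip=label))[:k]
    values = [r[label] for r in nearest]
    if all(isinstance(v, (int, float)) for v in values):
        return sum(values) / len(values)           # numerical label: average
    return Counter(values).most_common(1)[0][0]    # nominal label: majority vote


def impute(rows, columns, k=1):
    """Treat each column in turn as the label and fill its missing cells."""
    for label in columns:
        # Rows where the current label is known form the training set.
        train = [r for r in rows if r[label] is not MISSING]
        for row in rows:
            if row[label] is MISSING and train:
                row[label] = knn_predict(train, row, label, k)
    return rows
```

Calling `impute(rows, ["duration", "wage", "class"])` on a list of dictionaries would first use `duration` as the label, then `wage`, then `class`, which mirrors the "different column as label in each step" behaviour described in this reply.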

What is a role? Check this out: https://community.rapidminer.com/discussion/54761/roles-and-labels-a-quick-guide

yyhuang · May 2021

You should insert breakpoints before the learner inside the nested operator and check the refreshed metadata in each iteration. In your screenshot, the metadata is valid for the first iteration only.
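To make the "first iteration only" point concrete, here is a small Python sketch that prints which column holds the label role in each iteration. It is not RapidMiner code: the helper `iterate_label_roles`, the iteration order and the toy column list are assumptions for illustration (only `duration` is taken from this thread).

```python
def iterate_label_roles(columns, missing_counts):
    """Yield one role assignment per column that has missing values."""
    for column in columns:
        if missing_counts.get(column, 0) == 0:
            continue  # nothing to impute, so this column never becomes the label
        yield {c: ("label" if c == column else "regular") for c in columns}


# Toy column list; "wage" and "hours" are made-up names.
columns = ["duration", "wage", "hours"]
missing_counts = {"duration": 1, "wage": 0, "hours": 2}

for step, roles in enumerate(iterate_label_roles(columns, missing_counts), start=1):
    label = next(c for c, role in roles.items() if role == "label")
    print(f"iteration {step}: label = {label}")
```

A static view of the roles corresponds to the first dictionary yielded here; stopping at a breakpoint in every iteration corresponds to walking through all of them.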

mlubicz · May 2021

Thank you for your comprehensive replies, including the pointer to the operator code on GitHub, which clarifies a lot, particularly "* setting one of the regular attributes to label under the assumption that all * attributes are from the same type".
I think we could mark the question as solved from a practical point of view (although it could be interesting to investigate the case when the above assumption does not hold while a learner accepts only attributes of a specific type, like DT for a selected criterion; maybe at least an explanation in the IMV operator description in Help would be helpful, if not enabling IMV to impute missing values for attributes of that specific type).

yyhuang · May 2021

You can treat the "Impute Missing Values" operator as an iterative loop. In each iteration, a machine learning model (here k-NN) is trained to predict the missing values in one of the columns of your data. You don't have to define the label manually with "Set Role" because it is configured automatically.

mlubicz · May 2021

Thank you for the explanation of the operator, which is pretty clear. However, it does not explain the behaviour:

1. before you run the Tutorial process you can check that the class attribute had the label role

Image: https://us.v-cdn.net/6030995/uploads/editor/5g/jz7o56u5n5jc.png

2. however, inside the operator the label role is transferred to the duration attribute

Image: https://us.v-cdn.net/6030995/uploads/editor/8g/aaa3par8knci.png

which is not a problem when you use k-NN

3. but if you want to change the learner to Decision Tree, you first get the suggestion to use the least square criterion

Image: https://us.v-cdn.net/6030995/uploads/editor/mk/k72n8k3dvsv3.png

4. and next get the opposite recommendation

Image: https://us.v-cdn.net/6030995/uploads/editor/g4/n89odh1dniid.png

It would be nice to be able to clarify what is going on.

yyhuang · May 2021

Decision Tree is configured to predict either a nominal label (GAIN_RATIO, INFORMATION_GAIN, GINI_INDEX, ACCURACY) or a numerical label (LEAST_SQUARE), according to the criterion parameter. You cannot use one Decision Tree model (with a fixed criterion) to impute both numerical and nominal columns inside this nested operator. K-NN, however, is a different story: it can handle both numerical and nominal labels. Other candidate models to consider are GLM and GBT.

Check out the operator info here

Image: https://us.v-cdn.net/6030995/uploads/editor/3f/0pyvbtny1nux.png
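The mismatch between a fixed criterion and a changing label type can be illustrated with a short Python sketch. It only shows the selection logic; the helpers `is_numerical`, `criterion_for` and `group_columns_by_criterion` are hypothetical and are not part of any RapidMiner API. The criterion names are the ones mentioned in this reply.

```python
NOMINAL_CRITERIA = ("GAIN_RATIO", "INFORMATION_GAIN", "GINI_INDEX", "ACCURACY")
NUMERICAL_CRITERION = "LEAST_SQUARE"


def is_numerical(values):
    """True if every non-missing value is a number (None marks a missing cell)."""
    present = [v for v in values if v is not None]
    return bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )


def criterion_for(values, nominal_choice="GAIN_RATIO"):
    """Return a criterion name that matches the type of the column used as label."""
    if is_numerical(values):
        return NUMERICAL_CRITERION
    if nominal_choice not in NOMINAL_CRITERIA:
        raise ValueError(f"not a nominal criterion: {nominal_choice}")
    return nominal_choice


def group_columns_by_criterion(table):
    """Map each criterion to the columns it could serve as label."""
    groups = {}
    for name, values in table.items():
        groups.setdefault(criterion_for(values), []).append(name)
    return groups


# Toy data: a numerical and a nominal column, each with one missing value.
table = {"duration": [1, None, 3], "class": ["good", None, "bad"]}
print(group_columns_by_criterion(table))
```

With mixed column types the result always contains more than one group, which is why a single Decision Tree with one fixed criterion cannot serve as the learner for every iteration.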

mlubicz · May 2021

Once more thank you for the explanation.

However, I was not asking about the behavior of the Decision Tree operator and how to solve that problem, but about the automatic assignment of the label role to another attribute (duration) instead of the original class attribute.

If we replace DT with Gradient Boosted Trees, as you suggested, the duration attribute still has the label role.

Even if you run the Tutorial Process for the Impute Missing Values operator from scratch, the inner exa input before the k-NN (as shown in your screenshot) shows duration as the label, not the class attribute.
