Impute Missing Values

mlubiczmlubicz Member, University Professor Posts: 17 University Professor
While working with my students on missing and imbalanced data in RM, we found that the Impute Missing Values operator, as used in the Tutorial Process for that operator, removes the label role from the class attribute (of the Labor-Negotiations dataset) and transfers it to the duration attribute.
You can easily check the attributes and their roles by comparing the operator's outside input with the inside input of k-NN (or any other learner inside the operator).
I was not able to explain this behaviour (although of course it is easy to work around it by using Set Role twice).
Does anybody know the formal explanation?

Best Answers

  • yyhuangyyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    edited May 2021 Solution Accepted
    As explained in the first reply, the learner is built iteratively, one column at a time. When you impute column A, the operator automatically sets column A as the label, because the learner has to predict the missing values in column A. When you impute column B in the next round, the learner uses the non-missing values of column B as the label to predict the missing values in column B. This is repeated (with a different column set as the label in each step) for column C, column D, column E, …, until the missing values in all columns have been imputed.
    More explanation and implementation details can be found on the GitHub open source page here
    https://github.com/rapidminer/rapidminer-studio/blob/master/src/main/java/com/rapidminer/operator/preprocessing/filter/MissingValueImputation.java

    What is a role? Check this out: https://community.rapidminer.com/discussion/54761/roles-and-labels-a-quick-guide
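
The per-column loop described in this reply can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code of the RapidMiner operator: the helper names `knn_predict` and `impute`, the toy table, and the choice to train only on the original (not yet imputed) values are all assumptions made for the example. It handles numerical columns only.

```python
import math

def knn_predict(train_rows, train_labels, query, k=1):
    """Average the labels of the k nearest training rows (numeric features only)."""
    def dist(a, b):
        # Compare only positions where both values are present.
        pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        return math.sqrt(sum((x - y) ** 2 for x, y in pairs)) if pairs else math.inf
    ranked = sorted(zip(train_rows, train_labels), key=lambda rl: dist(rl[0], query))
    nearest = ranked[:k]
    return sum(label for _, label in nearest) / len(nearest)

def impute(table, columns, k=1):
    """Impute each column in turn, treating it as the temporary label."""
    result = [dict(row) for row in table]
    label_order = []  # records which column played the label role in each iteration
    for target in columns:
        label_order.append(target)
        features = [c for c in columns if c != target]
        known = [r for r in table if r[target] is not None]
        if not known:
            continue  # nothing to learn from for this column
        train_rows = [[r[c] for c in features] for r in known]
        train_labels = [r[target] for r in known]
        for i, row in enumerate(table):
            if row[target] is None:
                query = [row[c] for c in features]
                result[i][target] = knn_predict(train_rows, train_labels, query, k)
    return result, label_order

# Toy data with hypothetical column names.
table = [
    {"duration": 1.0, "wage": 2.0},
    {"duration": 2.0, "wage": 4.0},
    {"duration": None, "wage": 4.1},
    {"duration": 1.0, "wage": None},
]
imputed, label_order = impute(table, ["duration", "wage"])
print(label_order)
print(imputed)
```

In this sketch the first column in the list is the label during the first iteration, which is consistent with seeing a regular attribute such as duration carrying the label role when you inspect the learner's input at the start of the loop.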
  • yyhuangyyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    edited May 2021 Solution Accepted
    You should insert breakpoints before the learner inside the nested operator and check the refreshed metadata in each iteration. In your screenshot, the metadata is valid for the first iteration only.
  • mlubiczmlubicz Member, University Professor Posts: 17 University Professor
    Solution Accepted
    Thank you for your comprehensive replies, including directing me to the operator code on GitHub, which clarifies a lot, particularly "* setting one of the regular attributes to label under the assumption that all * attributes are from the same type".
    I think we could mark the question as solved from a practical point of view (although it could be interesting to investigate the case where the above assumption does not hold and the learner accepts only attributes of a specific type, like DT for a selected criterion; maybe at least an explanation in the IMV operator description in Help would be helpful, if not enabling the IMV to impute missing values only for attributes of that specific type).

Answers

  • yyhuangyyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    You can treat the "Impute Missing Values" operator as an iterative loop. In each iteration of the loop, a machine learning model (here k-NN) is trained to predict the missing values in one of the columns of your data. You don't have to define the label manually with "Set Role" because it is configured automatically.
  • mlubiczmlubicz Member, University Professor Posts: 17 University Professor
    Thank you for the explanation of the operator, that is pretty clear. However, it does not explain the behaviour:
    1. Before you run the Tutorial Process, you can check that the class attribute has the label role.
    2. However, inside the operator the label role is transferred to the duration attribute,

    which is not a problem when you use k-NN.
    3. But if you change the learner to Decision Tree, you first get the suggestion to use the least square criterion,
    4. and next you get the opposite recommendation.

    It would be nice to be able to clarify what is going on

  • yyhuangyyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    Decision Tree is configured to predict either a nominal label (GAIN_RATIO, INFORMATION_GAIN, GINI_INDEX, ACCURACY) or a numerical label (LEAST_SQUARE), according to the criterion parameter. You cannot use one Decision Tree model (with a fixed criterion) to impute both numerical and nominal columns inside this nested operator. K-NN, however, is a different story: it can handle both numerical and nominal labels. Other candidate models to consider are GLM and GBT.

    Check out the operator info here
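
The type mismatch described in this reply can be illustrated with a small sketch. It is plain Python, not RapidMiner code: the helpers `is_nominal`, `criterion_fits` and `knn_vote` are hypothetical, the criterion grouping is taken from the reply above, and the column names and values are made up for the example.

```python
from collections import Counter

# Grouping of Decision Tree criteria as described in the reply above.
NOMINAL_CRITERIA = {"GAIN_RATIO", "INFORMATION_GAIN", "GINI_INDEX", "ACCURACY"}
NUMERICAL_CRITERIA = {"LEAST_SQUARE"}

def is_nominal(values):
    """Treat a column as nominal if any non-missing value is a string."""
    return any(isinstance(v, str) for v in values if v is not None)

def criterion_fits(criterion, values):
    """True if a tree with this fixed criterion could use the column as its label."""
    if is_nominal(values):
        return criterion in NOMINAL_CRITERIA
    return criterion in NUMERICAL_CRITERIA

def knn_vote(neighbor_labels):
    """k-NN style prediction: majority vote for nominal labels, mean for numerical ones."""
    if is_nominal(neighbor_labels):
        return Counter(neighbor_labels).most_common(1)[0][0]
    return sum(neighbor_labels) / len(neighbor_labels)

# Illustrative columns of mixed type (values invented for the example).
columns = {
    "duration": [1.0, 2.0, 3.0],
    "pension": ["none", "empl_contr", "none"],
}

for criterion in ("LEAST_SQUARE", "GAIN_RATIO"):
    for name, values in columns.items():
        print(criterion, name, criterion_fits(criterion, values))

print(knn_vote(columns["duration"]))
print(knn_vote(columns["pension"]))
```

Under these assumptions, no single criterion fits every column once the label role moves from a numerical column to a nominal one, which may account for the two opposite suggestions reported earlier in the thread, while a vote-or-average learner adapts to whichever column is currently the label.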


  • mlubiczmlubicz Member, University Professor Posts: 17 University Professor
    Once more thank you for the explanation.
    However I was not asking about the behavior of the Decision Tree operator and how to solve the problem, but about the automatic assignment of the label role to another attribute (duration) instead of the original class attribute.
    If we replace DT with Gradient Boosted Trees, as you suggested, the duration attribute still has the label role.
    Even if you run the Tutorial Process for the Impute Missing Values operator from scratch, the inner exa input before the k-NN (as shown in your screenshot) shows duration as the label, not the class attribute.