Performance of Impute Missing Values

HeikoPaulheim Member Posts: 13 Contributor II
edited November 2018 in Help
Hi,

just by chance, I found out that the Impute Missing Values operator trains a model for every attribute. As far as I understand, it would be entirely sufficient to train a model only for those attributes that actually contain missing values, and the result would be 100% identical. This tweak could speed up the operator by a large factor in many cases.

Best,
Heiko
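To make the proposal concrete, here is a minimal Python sketch, not RapidMiner's actual implementation. It assumes rows are dictionaries with `None` marking a missing value, and it uses a per-attribute mean as a stand-in for whatever learner the operator would train; the names `fit_mean_model` and `impute` are hypothetical. It fails if an attribute has no observed values at all.

```python
from statistics import mean


def fit_mean_model(rows, target):
    """Stand-in 'model': the mean of the observed values of `target`."""
    observed = [r[target] for r in rows if r[target] is not None]
    return mean(observed)


def impute(rows, attributes, only_incomplete=True):
    """Return (imputed copy of rows, number of models trained)."""
    if only_incomplete:
        # Proposed tweak: train only for attributes that have missing values.
        targets = [a for a in attributes if any(r[a] is None for r in rows)]
    else:
        # Behaviour described in the post: one model per attribute.
        targets = list(attributes)
    models = {a: fit_mean_model(rows, a) for a in targets}
    imputed = [
        {a: (models[a] if r[a] is None else r[a]) for a in attributes}
        for r in rows
    ]
    return imputed, len(models)


rows = [
    {"a": 1.0, "b": None, "c": 3.0},
    {"a": 2.0, "b": 4.0, "c": 5.0},
]
full, n_full = impute(rows, ["a", "b", "c"], only_incomplete=False)
lean, n_lean = impute(rows, ["a", "b", "c"], only_incomplete=True)
```

On the same data both variants return identical rows, but the second trains one model instead of three.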

Answers

  • RalfKlinkenberg Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member, Unconfirmed, University Professor Posts: 68 RM Founder
    Hi Heiko,

    yes, this would accelerate the operator. However, please consider the following: some attributes may have no missing values during the training phase but still have missing values during the deployment phase. The operator in its current implementation can handle that, while the accelerated version could not handle missing values in such attributes during deployment and would therefore do only an incomplete job. Since the new data occurring during deployment is not known in advance, you cannot be sure that certain attributes will not have missing values in the future. You therefore need value prediction models for all attributes if you want a robust implementation of this operator.

    If you would like to apply the missing value imputation only to a subset of the attributes, you can combine it with an attribute selection operator and re-join the other attributes later.

    Best wishes,
    Ralf
  • HeikoPaulheim Member Posts: 13 Contributor II
    Hi Ralf,

    this is an interesting argument. However, if the operator checked each attribute on the fly for missing values, it should still work. The models seem to be built at the moment the operator is applied, so I would still have a model for every attribute that needs one. Am I missing anything here?

    The matter would be different, of course, if I trained an imputation model on training data in order to apply it to test data later on. In that case, however, I would expect the Impute Missing Values operator to have a preprocessing model output.

    Best,
    Heiko
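The train-then-deploy scenario discussed in this thread can be sketched as follows. This is an illustrative Python example under stated assumptions, not RapidMiner's API: `ImputationModel` and `fit_imputation_model` are hypothetical names, rows are dictionaries with `None` for a missing value, and a per-attribute mean stands in for a learned prediction model. Here the fitted object is what a preprocessing model output would carry, and it must cover every attribute because the deployment data is not known in advance.

```python
from statistics import mean


class ImputationModel:
    """Hypothetical preprocessing model: one fill value per attribute."""

    def __init__(self, fill_values):
        self.fill_values = fill_values

    def apply(self, rows):
        """Return a copy of rows with missing values filled in."""
        return [
            {a: (self.fill_values[a] if v is None else v) for a, v in r.items()}
            for r in rows
        ]


def fit_imputation_model(train_rows, attributes):
    """Fit a fill value for every attribute, complete or not."""
    fill = {}
    for a in attributes:
        observed = [r[a] for r in train_rows if r[a] is not None]
        fill[a] = mean(observed)
    return ImputationModel(fill)


# Training data has no missing values at all ...
train = [{"a": 1.0, "b": 2.0}, {"a": 3.0, "b": 4.0}]
model = fit_imputation_model(train, ["a", "b"])

# ... but the data seen later does.
deployed = model.apply([{"a": None, "b": 5.0}])
```

If the model is fitted and applied in a single step on the same data, as in the on-the-fly case, restricting training to attributes with missing values loses nothing. The all-attributes requirement only arises once the fitted model is reused on other data.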