The RapidMiner community is on read-only mode until further notice. Technical support via cases will continue to work as is. For any urgent licensing related requests from Students/Faculty members, please use the Altair academic forum here.
Performance of Impute Missing Values
HeikoPaulheim
Member Posts: 13 Contributor II
Hi,
just by chance, I found out that the impute missing values operator trains a model for each attribute - while from my understanding, it would be perfectly enough to train a model only for those attributes that actually contain missing values, with the result being 100% identical. This tweak could improve the operator's performance by a large factor in many cases.
Best,
Heiko
just by chance, I found out that the impute missing values operator trains a model for each attribute - while from my understanding, it would be perfectly enough to train a model only for those attributes that actually contain missing values, with the result being 100% identical. This tweak could improve the operator's performance by a large factor in many cases.
Best,
Heiko
0
Answers
yes, this would accelerate the operator. However, please consider the following: While some attributes may not have missing values during the training phase, they might actually have missing values during the deployment phase. The operator in its current implementation can handle that, while the accelerated version would not be able to handle missing values of such attributes in the deployment phase and hence would perform only an incomplete job. Since the new data occuring during deployment is not known in advance and hence you cannot be sure that certain attributes will not have missing values in the future, you need value prediction models for all attributes, if you want to have a robust implementation of this operator.
If you would like to apply the missing value imputation only to a subset of the attributes, you can combine it with an attribute selection opersator and re-join the other attributes later.
Best wishes,
Ralf
this is an interesting argument. However, if the operator would look into attributes on the fly and decide whether or not they contain missing values, the thing should still work. The models seem to be built right at the moment when the operator is applied, so I would have a model for every attribute I need. Am I missing anything here?
The matter would be different, of course, if I trained an imputation model on training data, to apply it to test data later on. In that case, however, I would expect a preprocessing model output of the impute missing values operator.
Best,
Heiko