10-05-2016 06:03 PM
Right now I am learning RapidMiner as part of a program at Emory; however, I already appreciate how much data-cleaning time RapidMiner saves. One thing I would love to do in RapidMiner is to replace missing values (or even filter) based on a function. What do I mean?
You know how sometimes you can replace a missing value with the statistical average? You know it's appropriate to do that based on knowing where the data came from (the source of the data). Sometimes, especially in science and engineering, we know that one attribute (call it y) really depends on one or two other attributes (call them x), so that the average is in fact an equation based on those attributes. For example, average y = x^2.
[Suggestion] I would like to be able to replace missing values based on a function, like y = x^2, since it would be more accurate than simply using the statistical average. I am learning how to do this in R. Currently, I have not figured out a way to do this, so I was hoping that the developers could add a feature for filtering/replacing based on a function. Or is there a way I could create my own custom operator?
P.S. I can give a real-life example where I ran into this problem while trying to take semi-structured and structured exoplanet data and clean it up. Would you all be interested in me posting that example?
10-05-2016 07:49 PM - edited 10-05-2016 07:50 PM
10-06-2016 09:04 AM
Of course @Thomas_Ott is correct: if you already know the function that you need, then Generate Attributes works perfectly.
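For anyone who wants to see the logic of the known-function case outside RapidMiner, here is a minimal Python sketch. It is not RapidMiner code and not how Generate Attributes is implemented; the helper name `fill_missing_with_function`, the toy rows, and the use of `None` to mark a missing value are all assumptions made for illustration. It only shows the idea: keep y where it exists, and compute it as x^2 where it is missing.

```python
import math


def fill_missing_with_function(rows, target, func):
    """Return copies of rows where a missing target value is replaced by func(row).

    Hypothetical helper: a value counts as missing if it is None or NaN.
    """
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        value = new_row.get(target)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            new_row[target] = func(new_row)
        filled.append(new_row)
    return filled


# Toy data (made up): y is known to follow y = x^2, and two y values are missing.
rows = [
    {"x": 1.0, "y": 1.0},
    {"x": 2.0, "y": None},
    {"x": 3.0, "y": 9.0},
    {"x": 4.0, "y": None},
]

filled = fill_missing_with_function(rows, "y", lambda r: r["x"] ** 2)
print([r["y"] for r in filled])  # [1.0, 4.0, 9.0, 16.0]
```

Existing values are never overwritten, so the function only matters for the rows where y is absent.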
However, if you don't already know the exact function that should describe the missing values, but you do know that you can do better than just assigning the average or the mode or similar, let me draw your attention to the "Impute Missing Values" operator. It's a tool that does exactly what you describe: it fits a small model that predicts the target attribute as a function of the other, non-missing attribute values (training on the examples where the target is not missing) and then applies that model to generate the missing values where needed. What's also nice is that it lets you try different learners in the subprocess, so you can see how different modeling algorithms fill in the missing values.
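To make the idea concrete, here is a rough Python sketch of model-based imputation. It is only an illustration of the general technique, not the operator's actual implementation: the function names `fit_line` and `impute_with_model` are hypothetical, the data is made up, and an ordinary least-squares line with a single predictor stands in for whatever learner you would place in the subprocess.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def impute_with_model(rows, feature, target):
    """Train on rows where target is present, then predict it where it is missing."""
    known = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line(
        [r[feature] for r in known],
        [r[target] for r in known],
    )
    imputed = []
    for r in rows:
        new_row = dict(r)  # copy so the input rows are left untouched
        if new_row[target] is None:
            new_row[target] = intercept + slope * new_row[feature]
        imputed.append(new_row)
    return imputed


# Toy data (made up): the known points lie on y = 2x + 1, and one y is missing.
rows = [
    {"x": 1.0, "y": 3.0},
    {"x": 2.0, "y": 5.0},
    {"x": 3.0, "y": None},
    {"x": 4.0, "y": 9.0},
]

imputed = impute_with_model(rows, "x", "y")
print(imputed[2]["y"])  # close to 7, up to floating-point rounding
```

Swapping `fit_line` for a different model is the analogue of trying a different learner: the training rows stay the same, but the values filled in can change.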