Replace missing values with average in each cluster

painfulover Member Posts: 1 Newbie
Hello,
I'm new to RapidMiner and would like to replace missing values based on clustering. I ran k-means on the columns that have no missing values and divided the original ExampleSet into 5 clusters. Now I would like to know how to replace each row's missing values with the averages of the cluster it belongs to, instead of the averages of the whole attributes. I can only find a way to do the latter, using the Replace Missing Values operator.
Thank you very much.

Best Answers

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Solution Accepted
    Hi!

    This process is a bit involved. The clustering operator gives you a "cluster model" that you can apply to the data with missing values. However, you need to choose an operator that can itself work with missing values. Then you would aggregate the clustered original data (the non-missing data), grouping by cluster to get the averages. You can join the result with the data that has missing values and use e.g. Loop Attributes to fill in the missing values using Generate Attributes with a formula like if(missing(%{attr}), eval("average(" + %{attr} + ")"), %{attr})

    It is much easier to use the Impute Missing Values operator, which automates the selection of missing values, builds a model for predicting them (you can select the model type), and puts the predicted values into the missing cells. There is an example process in the operator help that shows how to use it, with k-NN as the example learning algorithm.

    Regards,
    Balázs
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Solution Accepted
    Hi,
    I would just use Group into Collection to create a collection (list) of example sets, where each example set contains only one cluster. You can then use Loop Collection and, inside it, Replace Missing Values to replace the missing values with the respective cluster means. Afterwards you just Append the resulting collection again.

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
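
For readers who want to see the logic outside of RapidMiner, here is a minimal plain-Python sketch of what both answers boil down to: compute the average of each attribute per cluster, then fill each row's missing cells from its own cluster. This is not RapidMiner code. The function names, the "cluster" column and the sample attributes "income" and "age" are made up for illustration, and missing values are assumed to be represented as None.

```python
from collections import defaultdict
from statistics import mean


def cluster_means(rows, cluster_key, attrs):
    """Average each attribute per cluster, ignoring missing (None) values."""
    values = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for attr in attrs:
            if row[attr] is not None:
                values[row[cluster_key]][attr].append(row[attr])
    return {
        cluster: {attr: mean(vals) for attr, vals in by_attr.items()}
        for cluster, by_attr in values.items()
    }


def fill_with_cluster_means(rows, cluster_key, attrs):
    """Return copies of the rows with None replaced by the cluster average."""
    means = cluster_means(rows, cluster_key, attrs)
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for attr in attrs:
            if new_row[attr] is None:
                # Stays None if every row in the cluster lacks this attribute.
                new_row[attr] = means.get(row[cluster_key], {}).get(attr)
        filled.append(new_row)
    return filled


# Hypothetical data: cluster labels as they might come out of k-means.
rows = [
    {"cluster": "cluster_0", "income": 10.0, "age": 20.0},
    {"cluster": "cluster_0", "income": None, "age": 30.0},
    {"cluster": "cluster_0", "income": 30.0, "age": None},
    {"cluster": "cluster_1", "income": 100.0, "age": None},
    {"cluster": "cluster_1", "income": None, "age": 60.0},
]
filled = fill_with_cluster_means(rows, "cluster", ["income", "age"])
```

The first function plays the role of the aggregate-by-cluster step, and the second one the join-and-fill step. Splitting the data per cluster, filling with the mean and appending the parts again gives the same result.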