Combining Multiple imputation results

faridehbagherza · July 2013

Hi! I used R for multiple imputation and generated 5 imputed versions of my data. For the model, I am using a stacking model with 3 base learners.
I don't know what I should do with these imputed data sets. Should I train every base learner on each imputed data set individually?
That sounds right, but it takes a lot of time to train each base learner on each imputed data set and then train the stacked model on each imputed data set as well!
Anyway, if that's right, how can I combine the five models learned from the 5 imputed data sets?
I mean, there are operators for combining models into a stacking model, or AdaBoost, and so on, but I couldn't find any operator for combining models built from different imputed data sets!

wessel · July 2013

Dear Sir,

Can you upload a process to exemplify your problem?

The process attached below creates missing values for the //Samples/data/Sonar data set.
You can extend this process with your imputation/stacking scheme.

Best regards,

Wessel

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="5.3.008" expanded="true" height="60" name="Retrieve" width="90" x="43" y="31">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="normalize" compatibility="5.3.008" expanded="true" height="94" name="Normalize" width="90" x="180" y="30"/>
<operator activated="true" class="loop_attributes" compatibility="5.3.008" expanded="true" height="76" name="Loop Attributes" width="90" x="315" y="30">
<parameter key="iteration_macro" value="a1"/>
<process expanded="true">
<operator activated="true" class="generate_attributes" compatibility="5.3.008" expanded="true" height="76" name="Generate Attributes (2)" width="90" x="296" y="73">
<list key="function_descriptions">
<parameter key="%{a1}" value="if(rand() > .9, NaN, %{a1})"/>
</list>
<parameter key="use_standard_constants" value="false"/>
</operator>
<connect from_port="example set" to_op="Generate Attributes (2)" to_port="example set input"/>
<connect from_op="Generate Attributes (2)" from_port="example set output" to_port="example set"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_example set" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Normalize" to_port="example set input"/>
<connect from_op="Normalize" from_port="example set output" to_op="Loop Attributes" to_port="example set"/>
<connect from_op="Loop Attributes" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
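
For readers who do not use RapidMiner, the masking step in the process above (`if(rand() > .9, NaN, value)`) can be sketched in plain Python. This is only an illustration: the function name `mask_at_random`, the fixed seed, and the toy data are assumptions and are not part of the original process.

```python
import math
import random


def mask_at_random(rows, missing_rate=0.1, seed=0):
    """Return a copy of rows in which each value is replaced by NaN
    with probability missing_rate (about 10%, as in the process above)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        [math.nan if rng.random() < missing_rate else value for value in row]
        for row in rows
    ]


# Toy numeric data standing in for the normalized Sonar attributes.
rows = [[float(i + j) for j in range(5)] for i in range(200)]
masked = mask_at_random(rows)

n_missing = sum(math.isnan(v) for row in masked for v in row)
print(n_missing, "of", len(rows) * len(rows[0]), "values set to NaN")
```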

faridehbagherza · July 2013

Dear Sir,
I used your sample and completed it with my process!
But the site doesn't let me post the code here because it's too long!
What should I do now?
Regards
Farideh

wessel · July 2013

Hey,

I managed to load the processes that you e-mailed me.
But I still don't understand the problem.

Imputation creates a modified data-set with missing values replaced by imputed values, correct?

Do you wish to compare the performance of different imputation techniques (Machine Learning)?
Or do you wish to create a model that performs well on some real world problem (Data Mining)?

Best regards,

Wessel

edit: The process in the next post validates a stacking model using different imputed data sets as input.

wessel · July 2013

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.012">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.012" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="5.3.012" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="normalize" compatibility="5.3.012" expanded="true" height="94" name="Normalize" width="90" x="180" y="30"/>
<operator activated="true" class="loop_attributes" compatibility="5.3.012" expanded="true" height="76" name="Loop Attributes" width="90" x="315" y="30">
<parameter key="iteration_macro" value="a1"/>
<process expanded="true">
<operator activated="true" class="generate_attributes" compatibility="5.3.012" expanded="true" height="76" name="Generate Attributes (2)" width="90" x="299" y="73">
<list key="function_descriptions">
<parameter key="%{a1}" value="if(rand() > .9, NaN, %{a1})"/>
</list>
<parameter key="use_standard_constants" value="false"/>
</operator>
<connect from_port="example set" to_op="Generate Attributes (2)" to_port="example set input"/>
<connect from_op="Generate Attributes (2)" from_port="example set output" to_port="example set"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_example set" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
<operator activated="true" class="x_validation" compatibility="5.3.012" expanded="true" height="112" name="Validation" width="90" x="458" y="103">
<parameter key="number_of_validations" value="2"/>
<process expanded="true">
<operator activated="true" class="stacking" compatibility="5.3.012" expanded="true" height="60" name="Stacking" width="90" x="174" y="45">
<parameter key="keep_all_attributes" value="false"/>
<process expanded="true">
<operator activated="true" class="impute_missing_values" compatibility="5.3.012" expanded="true" height="60" name="Imp1" width="90" x="170" y="60">
<parameter key="learn_on_complete_cases" value="false"/>
<process expanded="true">
<operator activated="true" class="k_nn" compatibility="5.3.012" expanded="true" height="76" name="k-NN" width="90" x="419" y="30"/>
<connect from_port="example set source" to_op="k-NN" to_port="training set"/>
<connect from_op="k-NN" from_port="model" to_port="model sink"/>
<portSpacing port="source_example set source" spacing="0"/>
<portSpacing port="sink_model sink" spacing="0"/>
</process>
</operator>
<operator activated="true" class="naive_bayes" compatibility="5.3.012" expanded="true" height="76" name="Naive Bayes" width="90" x="336" y="57"/>
<operator activated="true" class="impute_missing_values" compatibility="5.3.012" expanded="true" height="60" name="Imp2" width="90" x="165" y="199">
<parameter key="learn_on_complete_cases" value="false"/>
<process expanded="true">
<operator activated="true" class="linear_regression" compatibility="5.3.012" expanded="true" name="Linear Regression"/>
<connect from_port="example set source" to_op="Linear Regression" to_port="training set"/>
<connect from_op="Linear Regression" from_port="model" to_port="model sink"/>
<portSpacing port="source_example set source" spacing="0"/>
<portSpacing port="sink_model sink" spacing="0"/>
</process>
</operator>
<operator activated="true" class="naive_bayes" compatibility="5.3.012" expanded="true" height="76" name="Naive Bayes (2)" width="90" x="356" y="177"/>
<connect from_port="training set 1" to_op="Imp1" to_port="example set in"/>
<connect from_port="training set 2" to_op="Imp2" to_port="example set in"/>
<connect from_op="Imp1" from_port="example set out" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_port="base model 1"/>
<connect from_op="Imp2" from_port="example set out" to_op="Naive Bayes (2)" to_port="training set"/>
<connect from_op="Naive Bayes (2)" from_port="model" to_port="base model 2"/>
<portSpacing port="source_training set 1" spacing="0"/>
<portSpacing port="source_training set 2" spacing="0"/>
<portSpacing port="source_training set 3" spacing="0"/>
<portSpacing port="sink_base model 1" spacing="0"/>
<portSpacing port="sink_base model 2" spacing="0"/>
<portSpacing port="sink_base model 3" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="decision_tree" compatibility="5.3.012" expanded="true" height="76" name="Decision Tree" width="90" x="45" y="30"/>
<connect from_port="stacking examples" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="stacking model"/>
<portSpacing port="source_stacking examples" spacing="0"/>
<portSpacing port="sink_stacking model" spacing="0"/>
</process>
</operator>
<connect from_port="training" to_op="Stacking" to_port="training set"/>
<connect from_op="Stacking" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="5.3.012" expanded="true" height="76" name="Apply Model" width="90" x="97" y="40">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.3.012" expanded="true" height="76" name="Performance" width="90" x="212" y="45"/>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="apply_model" compatibility="5.3.012" expanded="true" height="76" name="Apply Model (2)" width="90" x="588" y="4">
<list key="application_parameters"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Normalize" to_port="example set input"/>
<connect from_op="Normalize" from_port="example set output" to_op="Loop Attributes" to_port="example set"/>
<connect from_op="Loop Attributes" from_port="example set" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Validation" from_port="training" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 3"/>
<connect from_op="Apply Model (2)" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>

faridehbagherza · August 2013

Hi

Yes, imputation creates a modified data set with missing values replaced by imputed values.

I want my model to perform better when using the 5 imputations than when just one imputation is used.
For example, when you build several regression models from several imputations, there are rules to combine these regressions and extract one model from them. But since I have an ensemble model here, I'm not sure what the best way to combine them is. Voting, or some other way?
Voting gives better performance, but is it the common way to combine models built from different imputations?
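
To make the voting idea concrete, here is a minimal Python sketch of two ways to pool predictions from models trained on different imputed data sets: a majority vote over predicted labels, and an average of predicted confidences. The function names and all the prediction values are made up for illustration; this is not the RapidMiner Vote operator and makes no claim about which pooling rule is standard.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """predictions_per_model: one list of predicted labels per imputation model.
    Returns the most frequent label for each example."""
    return [
        Counter(labels).most_common(1)[0][0]
        for labels in zip(*predictions_per_model)
    ]


def average_confidences(confidences_per_model):
    """confidences_per_model: one list of P(class = "Mine") per imputation model.
    Returns the mean confidence for each example."""
    m = len(confidences_per_model)
    return [sum(ps) / m for ps in zip(*confidences_per_model)]


# Hypothetical predictions of 5 models (one per imputation) on 3 examples.
labels = [
    ["Rock", "Mine", "Mine"],
    ["Rock", "Rock", "Mine"],
    ["Mine", "Rock", "Mine"],
    ["Rock", "Mine", "Mine"],
    ["Rock", "Rock", "Mine"],
]
confidences = [
    [0.2, 0.6, 0.9],
    [0.3, 0.4, 0.8],
    [0.6, 0.4, 0.9],
    [0.1, 0.7, 0.8],
    [0.3, 0.4, 0.6],
]

voted = majority_vote(labels)
averaged = average_confidences(confidences)
print(voted)
print(averaged)
```

With an odd number of imputations and two classes the vote cannot tie. Averaging confidences keeps more information than a hard vote, which may matter for borderline examples such as the second one here.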

This is my first experience with RapidMiner. How was the process overall?
Any suggestions on the whole process?
Thanks again
Regards
Farideh Bagherzadeh

faridehbagherza · August 2013

Here is a sample of what I was talking about, for anyone else who has suggestions for me.
I uploaded 2 codes to pastebin.com:

1st code: http://pastebin.com/vjr8p9a7
2nd code: http://pastebin.com/Zn0aduu5

Here is a little explanation about them:
1. You need the VIM package for R to be able to run them!
2. I uploaded two codes. In the first one I imputed just 1 data set, and in the second one I imputed 5 data sets.
About the first code: in the first Subprocess I trained 3 base learners, and in the second Subprocess I used these 3 learners to train a stacking model!
The stacking model has the best performance of all!
About the second code: in the first Subprocess I used 5 imputations to train 5 stacking models, just as in the first code. Then in the second Subprocess I voted over these 5 models, built from the 5 imputations, to combine the results and gain better performance!
I hope you don't get confused by the process!
Any suggestions on the whole process are welcome!
I mean any other way to combine the results of the imputations instead of voting, and so on!
In these processes I trained all the base learners on all the imputations. Is that the common way?
Thanks in advance.
Regards
Farideh

wessel · August 2013

Hey,

@ "This is my first experience with Rapid miner, How was the process overall?"
You should limit the size of your process, and rely less on recall operators.

The process I uploaded above showed how to embed multiple imputation operators within the Stacking and X-Validation operators.
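
The general idea of keeping imputation inside the validation loop can be sketched in Python as follows. This is not a transcription of the process above: it swaps in simple mean imputation instead of the k-NN and linear regression imputers, and the helper names and toy data are assumptions for illustration only.

```python
import math


def column_means(rows):
    """Mean of each column, ignoring NaN entries (0.0 if a column is all NaN)."""
    means = []
    for column in zip(*rows):
        observed = [v for v in column if not math.isnan(v)]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means


def fill(rows, means):
    """Replace NaN entries by the given column means."""
    return [[m if math.isnan(v) else v for v, m in zip(row, means)] for row in rows]


def folds_with_imputation(rows, k=2):
    """Yield (train, test) pairs where the imputation is learned on the
    training fold only and then applied to both folds."""
    for i in range(k):
        test = rows[i::k]
        train = [row for j, row in enumerate(rows) if j % k != i]
        means = column_means(train)
        yield fill(train, means), fill(test, means)


nan = math.nan
data = [[1.0, nan], [3.0, 4.0], [nan, 6.0], [5.0, 8.0]]
folds = list(folds_with_imputation(data, k=2))
for train, test in folds:
    print("train:", train, "test:", test)
```

Learning the imputation on the training fold only keeps information from the test fold out of the model, which is the usual reason for nesting it inside the validation.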

Alternatively, you can create a new data set that contains multiple copies of each imputed attribute.
As in:
- Load your data
- Generate IDs
- Apply imputation (5x)
- Join results into 1 final data set

I will upload this process below.

Best regards,

Wessel

wessel · August 2013

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.012">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.012" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="5.3.012" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="normalize" compatibility="5.3.012" expanded="true" height="94" name="Normalize" width="90" x="180" y="30"/>
<operator activated="true" class="generate_id" compatibility="5.3.012" expanded="true" height="76" name="Generate ID" width="90" x="304" y="34"/>
<operator activated="true" class="multiply" compatibility="5.3.012" expanded="true" height="94" name="Multiply" width="90" x="435" y="39"/>
<operator activated="true" class="impute_missing_values" compatibility="5.3.012" expanded="true" height="60" name="Imp1" width="90" x="467" y="259">
<parameter key="learn_on_complete_cases" value="false"/>
<process expanded="true">
<operator activated="true" class="k_nn" compatibility="5.3.012" expanded="true" name="k-NN"/>
<connect from_port="example set source" to_op="k-NN" to_port="training set"/>
<connect from_op="k-NN" from_port="model" to_port="model sink"/>
<portSpacing port="source_example set source" spacing="0"/>
<portSpacing port="sink_model sink" spacing="0"/>
</process>
</operator>
<operator activated="true" class="rename_by_replacing" compatibility="5.3.012" expanded="true" height="76" name="Rename by Replacing" width="90" x="592" y="179">
<parameter key="replace_what" value="$"/>
<parameter key="replace_by" value="_2"/>
</operator>
<operator activated="true" class="impute_missing_values" compatibility="5.3.012" expanded="true" height="60" name="Imp2" width="90" x="588" y="39">
<parameter key="learn_on_complete_cases" value="false"/>
<process expanded="true">
<operator activated="true" class="linear_regression" compatibility="5.3.012" expanded="true" height="94" name="Linear Regression" width="90" x="379" y="34"/>
<connect from_port="example set source" to_op="Linear Regression" to_port="training set"/>
<connect from_op="Linear Regression" from_port="model" to_port="model sink"/>
<portSpacing port="source_example set source" spacing="0"/>
<portSpacing port="sink_model sink" spacing="0"/>
</process>
</operator>
<operator activated="true" class="join" compatibility="5.3.012" expanded="true" height="76" name="Join" width="90" x="766" y="123">
<parameter key="remove_double_attributes" value="false"/>
<list key="key_attributes"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Normalize" to_port="example set input"/>
<connect from_op="Normalize" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Imp2" to_port="example set in"/>
<connect from_op="Multiply" from_port="output 2" to_op="Imp1" to_port="example set in"/>
<connect from_op="Imp1" from_port="example set out" to_op="Rename by Replacing" to_port="example set input"/>
<connect from_op="Rename by Replacing" from_port="example set output" to_op="Join" to_port="right"/>
<connect from_op="Imp2" from_port="example set out" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

faridehbagherza · August 2013

Hey

About your first code: I don't want to use just one model (Naive Bayes, as you did) as the base learner. I want 3, and I wasn't sure whether to train all 3 on all the imputations. Besides, using stacking instead of voting (as you did) makes the process even more complicated.
About your second code, the Join operator is not a suitable way for joining your imputations. Because whenever there is a difference it just ignores the value of the right imputation. It's just like using only the left imputation.
Anyway, thanks for your time!
Regards
Farideh

wessel · August 2013

"About your second code, the Join operator is not a suitable way for joining your imputations. Because whenever there is a difference it just ignores the value of the right imputation. "

What?

faridehbagherza · September 2013

Hey there!
Earlier I made a mistake about the "join" operator. I used your suggestion and that worked well.
Thanks a lot, that was a great help to me.
Regards

faridehbagherza · September 2013

Hi Wessel!
I have a question. When we join the multiple imputations using the method you suggested, doesn't the correlation cause a problem?
I mean we are inserting each variable 5 times!
Best Regards
Farideh
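
The concern can be illustrated with a small Python sketch: two imputed copies of the same attribute agree wherever the value was observed, so they are expected to be strongly correlated. The values below are made up for illustration, and whether this causes trouble depends on the learner; Naive Bayes, for instance, assumes the attributes are conditionally independent given the label.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical attribute 'a' after two different imputations.
# Only the third and fifth values were missing and therefore differ.
a_1 = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]
a_2 = [0.1, 0.4, 0.45, 0.8, 0.5, 0.9]

r = pearson(a_1, a_2)
print("correlation between the two imputed copies:", round(r, 3))
```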
