Combining Multiple imputation results
faridehbagherza
Member Posts: 22 Contributor II
Hi! I used R for multiple imputation and created 5 imputed versions of my data. For the model, I am using a stacking model with 3 base learners.
I don't know what I should do with these imputed data sets. Should I train all my base learners on each of them individually?
That sounds right, but it takes a lot of time to train each base learner on each imputed data set and then train the stacked model on each imputed data set as well!
Anyway, if that's right, how can I combine the five models learned from the 5 imputed data sets?
I mean, there are operators for combining models into a stacking model, AdaBoost, and so on, but I couldn't find any operator for combining models built from different imputed data sets!
Answers
Can you upload a process to exemplify your problem?
The process attached below creates missing values in the //Samples/data/Sonar data set.
You can extend this process with your imputation/stacking scheme.
Best regards,
Wessel
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="5.3.008" expanded="true" height="60" name="Retrieve" width="90" x="43" y="31">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="normalize" compatibility="5.3.008" expanded="true" height="94" name="Normalize" width="90" x="180" y="30"/>
<operator activated="true" class="loop_attributes" compatibility="5.3.008" expanded="true" height="76" name="Loop Attributes" width="90" x="315" y="30">
<parameter key="iteration_macro" value="a1"/>
<process expanded="true">
<operator activated="true" class="generate_attributes" compatibility="5.3.008" expanded="true" height="76" name="Generate Attributes (2)" width="90" x="296" y="73">
<list key="function_descriptions">
<parameter key="%{a1}" value="if(rand() > .9, NaN, %{a1})"/>
</list>
<parameter key="use_standard_constants" value="false"/>
</operator>
<connect from_port="example set" to_op="Generate Attributes (2)" to_port="example set input"/>
<connect from_op="Generate Attributes (2)" from_port="example set output" to_port="example set"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_example set" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Normalize" to_port="example set input"/>
<connect from_op="Normalize" from_port="example set output" to_op="Loop Attributes" to_port="example set"/>
<connect from_op="Loop Attributes" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
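For readers without RapidMiner at hand, the missing-value step of the process above (the `if(rand() > .9, NaN, ...)` expression applied to each attribute) can be sketched in plain Python. This is only an illustration under assumptions: the function name `inject_missing`, the toy data, and the seed are made up here and are not part of the original process.

```python
import math
import random

def inject_missing(rows, rate=0.1, seed=0):
    """Return a copy of rows where each cell becomes NaN with probability `rate`."""
    rng = random.Random(seed)
    return [[math.nan if rng.random() < rate else v for v in row] for row in rows]

# Toy numeric table standing in for the Sonar attributes (hypothetical values).
data = [[0.1 * i + j for j in range(5)] for i in range(200)]
noisy = inject_missing(data, rate=0.1, seed=42)

missing = sum(math.isnan(v) for row in noisy for v in row)
print(f"{missing} of {len(data) * 5} cells set to NaN")
```

Roughly one cell in ten ends up missing, which matches the 0.9 threshold used in the process.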
I used your sample and extended it with my process!
But the site doesn't let me post the code here because it's too long!
What should I do now?
Regards
Farideh
I managed to load the processes that you e-mailed me.
But I still don't understand the problem.
Imputation creates a modified data-set with missing values replaced by imputed values, correct?
Do you wish to compare the performance of different imputation techniques (Machine Learning)?
Or do you wish to create a model that performs well on some real world problem (Data Mining)?
Best regards,
Wessel
edit: The process in the next post validates a stacking model using different imputed data sets as input.
<process version="5.3.012">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.012" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="5.3.012" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="normalize" compatibility="5.3.012" expanded="true" height="94" name="Normalize" width="90" x="180" y="30"/>
<operator activated="true" class="loop_attributes" compatibility="5.3.012" expanded="true" height="76" name="Loop Attributes" width="90" x="315" y="30">
<parameter key="iteration_macro" value="a1"/>
<process expanded="true">
<operator activated="true" class="generate_attributes" compatibility="5.3.012" expanded="true" height="76" name="Generate Attributes (2)" width="90" x="299" y="73">
<list key="function_descriptions">
<parameter key="%{a1}" value="if(rand() > .9, NaN, %{a1})"/>
</list>
<parameter key="use_standard_constants" value="false"/>
</operator>
<connect from_port="example set" to_op="Generate Attributes (2)" to_port="example set input"/>
<connect from_op="Generate Attributes (2)" from_port="example set output" to_port="example set"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_example set" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
<operator activated="true" class="x_validation" compatibility="5.3.012" expanded="true" height="112" name="Validation" width="90" x="458" y="103">
<parameter key="number_of_validations" value="2"/>
<process expanded="true">
<operator activated="true" class="stacking" compatibility="5.3.012" expanded="true" height="60" name="Stacking" width="90" x="174" y="45">
<parameter key="keep_all_attributes" value="false"/>
<process expanded="true">
<operator activated="true" class="impute_missing_values" compatibility="5.3.012" expanded="true" height="60" name="Imp1" width="90" x="170" y="60">
<parameter key="learn_on_complete_cases" value="false"/>
<process expanded="true">
<operator activated="true" class="k_nn" compatibility="5.3.012" expanded="true" height="76" name="k-NN" width="90" x="419" y="30"/>
<connect from_port="example set source" to_op="k-NN" to_port="training set"/>
<connect from_op="k-NN" from_port="model" to_port="model sink"/>
<portSpacing port="source_example set source" spacing="0"/>
<portSpacing port="sink_model sink" spacing="0"/>
</process>
</operator>
<operator activated="true" class="naive_bayes" compatibility="5.3.012" expanded="true" height="76" name="Naive Bayes" width="90" x="336" y="57"/>
<operator activated="true" class="impute_missing_values" compatibility="5.3.012" expanded="true" height="60" name="Imp2" width="90" x="165" y="199">
<parameter key="learn_on_complete_cases" value="false"/>
<process expanded="true">
<operator activated="true" class="linear_regression" compatibility="5.3.012" expanded="true" name="Linear Regression"/>
<connect from_port="example set source" to_op="Linear Regression" to_port="training set"/>
<connect from_op="Linear Regression" from_port="model" to_port="model sink"/>
<portSpacing port="source_example set source" spacing="0"/>
<portSpacing port="sink_model sink" spacing="0"/>
</process>
</operator>
<operator activated="true" class="naive_bayes" compatibility="5.3.012" expanded="true" height="76" name="Naive Bayes (2)" width="90" x="356" y="177"/>
<connect from_port="training set 1" to_op="Imp1" to_port="example set in"/>
<connect from_port="training set 2" to_op="Imp2" to_port="example set in"/>
<connect from_op="Imp1" from_port="example set out" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_port="base model 1"/>
<connect from_op="Imp2" from_port="example set out" to_op="Naive Bayes (2)" to_port="training set"/>
<connect from_op="Naive Bayes (2)" from_port="model" to_port="base model 2"/>
<portSpacing port="source_training set 1" spacing="0"/>
<portSpacing port="source_training set 2" spacing="0"/>
<portSpacing port="source_training set 3" spacing="0"/>
<portSpacing port="sink_base model 1" spacing="0"/>
<portSpacing port="sink_base model 2" spacing="0"/>
<portSpacing port="sink_base model 3" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="decision_tree" compatibility="5.3.012" expanded="true" height="76" name="Decision Tree" width="90" x="45" y="30"/>
<connect from_port="stacking examples" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="stacking model"/>
<portSpacing port="source_stacking examples" spacing="0"/>
<portSpacing port="sink_stacking model" spacing="0"/>
</process>
</operator>
<connect from_port="training" to_op="Stacking" to_port="training set"/>
<connect from_op="Stacking" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="5.3.012" expanded="true" height="76" name="Apply Model" width="90" x="97" y="40">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.3.012" expanded="true" height="76" name="Performance" width="90" x="212" y="45"/>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="apply_model" compatibility="5.3.012" expanded="true" height="76" name="Apply Model (2)" width="90" x="588" y="4">
<list key="application_parameters"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Normalize" to_port="example set input"/>
<connect from_op="Normalize" from_port="example set output" to_op="Loop Attributes" to_port="example set"/>
<connect from_op="Loop Attributes" from_port="example set" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Validation" from_port="training" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 3"/>
<connect from_op="Apply Model (2)" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
Yes, imputation creates a modified data set with missing values replaced by imputed values.
I want my model to perform better with the 5 imputations than with just one.
For example, when you build several regression models from several imputations, there are rules for combining these regressions into one model. But since I have an ensemble model here, I'm not sure what the best way to combine them is. Voting, or something else?
Voting gives better performance, but is it the common way to combine models built from different imputations?
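To make the question concrete, here is a minimal sketch of two ways to combine the outputs of models trained on different imputations: majority voting on predicted labels, and averaging predicted confidences before thresholding. The function names, the example predictions, and the 0.5 threshold are all assumptions for illustration (only the Rock/Mine labels come from the Sonar data), and the sketch does not settle which option is the usual one for stacked ensembles.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine m label lists (one per model) into one label per example."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*predictions_per_model)]

def average_confidence(confidences_per_model, threshold=0.5):
    """Average P(Mine) over the m models, then threshold the mean."""
    m = len(confidences_per_model)
    means = [sum(p) / m for p in zip(*confidences_per_model)]
    labels = ["Mine" if p >= threshold else "Rock" for p in means]
    return labels, means

# Hypothetical outputs of 5 models (one per imputed data set) on 3 examples.
model_predictions = [
    ["Mine", "Rock", "Mine"],
    ["Mine", "Rock", "Rock"],
    ["Rock", "Rock", "Mine"],
    ["Mine", "Mine", "Mine"],
    ["Mine", "Rock", "Rock"],
]
model_confidences = [
    [0.9, 0.2, 0.55],
    [0.8, 0.3, 0.45],
    [0.4, 0.1, 0.60],
    [0.7, 0.6, 0.90],
    [0.6, 0.4, 0.30],
]

voted = majority_vote(model_predictions)
averaged_labels, averaged_means = average_confidence(model_confidences)
print(voted)
print(averaged_labels)
```

With an odd number of models and two classes, the vote cannot tie. Averaging confidences keeps more information from each model than a hard vote does, at the cost of needing comparable confidence scores.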
This is my first experience with RapidMiner. How was the process overall?
Any suggestions on the whole process?
Thanks again
Regards
Farideh Bagherzadeh
I uploaded 2 pieces of code to pastebin.com
1st code: http://pastebin.com/vjr8p9a7
2nd code: http://pastebin.com/Zn0aduu5
Here is a short explanation of them:
1. You need the VIM package for R to be able to run them!
2. I uploaded two codes for you! In the first one I imputed just 1 data set, and in the second one I imputed 5 data sets.
About the first code: in the first Subprocess I trained 3 base learners, and in the second Subprocess I used these 3 learners to train a stacking model!
The stacking model has the best performance of all!
About the second code: in the first Subprocess, I used 5 imputations to train 5 stacking models, just as in the first code! Then in the second Subprocess I voted on these 5 models built from the 5 imputations to combine the results and gain better performance!
I hope you don't get confused by the process!
Any suggestions on the whole process would be welcome!
I mean any other way, besides voting, to combine the results of the imputations!
In these processes I trained all the base learners on all the imputations. Is that the common way?
Thanks in advance.
Regards
Farideh
@ "This is my first experience with Rapid miner, How was the process overall?"
You should limit the size of your process, and rely less on recall operators.
The process I uploaded above shows how to embed multiple imputation operators within the Stacking and X-Validation operators.
Alternatively, you can create a new data set that contains multiple copies of each imputed attribute.
As in:
- Load your data
- Generate IDs
- Apply imputation (5x)
- Join results into 1 final data set
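The steps above can be sketched in plain Python. This is an illustration under assumptions, not the RapidMiner process itself: mean and median filling stand in for the k-NN and linear regression learners used inside Impute Missing Values, only two imputations are shown instead of five, and the helper names (`impute`, `add_suffix`, `join_on_id`) and toy values are made up.

```python
import math
import statistics

def impute(rows, columns, fill):
    """Replace NaN in each column with fill(observed values of that column)."""
    out = [dict(r) for r in rows]
    for c in columns:
        value = fill([r[c] for r in rows if not math.isnan(r[c])])
        for r in out:
            if math.isnan(r[c]):
                r[c] = value
    return out

def add_suffix(rows, columns, suffix):
    """Rename the imputed columns so the copies do not collide in the join."""
    return [{(k + suffix if k in columns else k): v for k, v in r.items()}
            for r in rows]

def join_on_id(left, right):
    """Inner join of two row lists on the 'id' key."""
    right_by_id = {r["id"]: r for r in right}
    return [{**l, **right_by_id[l["id"]]} for l in left if l["id"] in right_by_id]

raw = [
    {"a": 1.0, "b": 10.0},
    {"a": math.nan, "b": 20.0},
    {"a": 3.0, "b": math.nan},
    {"a": 8.0, "b": 60.0},
]
data = [{"id": i, **row} for i, row in enumerate(raw, start=1)]  # generate IDs
cols = ["a", "b"]

imp1 = impute(data, cols, statistics.mean)
imp2 = add_suffix(impute(data, cols, statistics.median), cols, "_2")
joined = join_on_id(imp1, imp2)
for row in joined:
    print(row)
```

Each joined row carries both versions of every attribute (`a` and `a_2`, `b` and `b_2`), which is why the renaming step comes before the join.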
I will upload this process below.
Best regards,
Wessel
<process version="5.3.012">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.012" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="5.3.012" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="normalize" compatibility="5.3.012" expanded="true" height="94" name="Normalize" width="90" x="180" y="30"/>
<operator activated="true" class="generate_id" compatibility="5.3.012" expanded="true" height="76" name="Generate ID" width="90" x="304" y="34"/>
<operator activated="true" class="multiply" compatibility="5.3.012" expanded="true" height="94" name="Multiply" width="90" x="435" y="39"/>
<operator activated="true" class="impute_missing_values" compatibility="5.3.012" expanded="true" height="60" name="Imp1" width="90" x="467" y="259">
<parameter key="learn_on_complete_cases" value="false"/>
<process expanded="true">
<operator activated="true" class="k_nn" compatibility="5.3.012" expanded="true" name="k-NN"/>
<connect from_port="example set source" to_op="k-NN" to_port="training set"/>
<connect from_op="k-NN" from_port="model" to_port="model sink"/>
<portSpacing port="source_example set source" spacing="0"/>
<portSpacing port="sink_model sink" spacing="0"/>
</process>
</operator>
<operator activated="true" class="rename_by_replacing" compatibility="5.3.012" expanded="true" height="76" name="Rename by Replacing" width="90" x="592" y="179">
<parameter key="replace_what" value="$"/>
<parameter key="replace_by" value="_2"/>
</operator>
<operator activated="true" class="impute_missing_values" compatibility="5.3.012" expanded="true" height="60" name="Imp2" width="90" x="588" y="39">
<parameter key="learn_on_complete_cases" value="false"/>
<process expanded="true">
<operator activated="true" class="linear_regression" compatibility="5.3.012" expanded="true" height="94" name="Linear Regression" width="90" x="379" y="34"/>
<connect from_port="example set source" to_op="Linear Regression" to_port="training set"/>
<connect from_op="Linear Regression" from_port="model" to_port="model sink"/>
<portSpacing port="source_example set source" spacing="0"/>
<portSpacing port="sink_model sink" spacing="0"/>
</process>
</operator>
<operator activated="true" class="join" compatibility="5.3.012" expanded="true" height="76" name="Join" width="90" x="766" y="123">
<parameter key="remove_double_attributes" value="false"/>
<list key="key_attributes"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Normalize" to_port="example set input"/>
<connect from_op="Normalize" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Imp2" to_port="example set in"/>
<connect from_op="Multiply" from_port="output 2" to_op="Imp1" to_port="example set in"/>
<connect from_op="Imp1" from_port="example set out" to_op="Rename by Replacing" to_port="example set input"/>
<connect from_op="Rename by Replacing" from_port="example set output" to_op="Join" to_port="right"/>
<connect from_op="Imp2" from_port="example set out" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
About your first code: I don't want to use just one model (Naive Bayes, as you did) as the base learner. I want 3, and I wasn't sure whether to train all 3 of them on all the imputations. Besides, using stacking instead of voting (as you did) makes the process even more complicated.
About your second code: the Join operator is not a suitable way to join your imputations, because whenever there is a difference it just ignores the value from the right imputation. It's just like using only the left imputation.
Anyway, thanks for your time!
Regards
Farideh
What?
Earlier I made a mistake about the "join" operator. I used your suggestion and that worked well.
Thanks a lot, that was a great help to me.
Regards
I have a question. When we join the multiple imputations using the method you suggested, doesn't the correlation cause a problem?
I mean we are inserting each variable 5 times!
Best Regards
Farideh
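The size of the correlation in question can be checked with a small experiment. The sketch below is an assumption-laden illustration: it fills about 10% missing cells of one synthetic attribute by random draws from the observed values (a crude stand-in for the R imputations), does so twice with different seeds, and measures the Pearson correlation between the two copies. Whether such correlation actually hurts depends on the learner and is not answered in this thread.

```python
import math
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def random_draw_impute(values, missing, rng):
    """Fill each missing cell with a random draw from the observed values."""
    observed = [v for v, m in zip(values, missing) if not m]
    return [rng.choice(observed) if m else v for v, m in zip(values, missing)]

rng = random.Random(0)
true_values = [rng.gauss(0.0, 1.0) for _ in range(500)]
is_missing = [rng.random() < 0.1 for _ in true_values]

# Two imputed copies of the same attribute, differing only in the filled cells.
copy_1 = random_draw_impute(true_values, is_missing, random.Random(1))
copy_2 = random_draw_impute(true_values, is_missing, random.Random(2))

r = pearson(copy_1, copy_2)
print(f"correlation between the two imputed copies: {r:.3f}")
```

Because the copies agree on every observed cell, the correlation is high. Learners that assume independent attributes, such as Naive Bayes, are the ones most likely to be affected by this kind of duplication.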