A possible bug in the Impute Missing Values operator

suleymansahal Member Posts: 27 Contributor II
edited November 2018 in Help

Could you please check the attached process? The data set contains missing values. The meta data at the output port of the Multiply operator still shows those missing values, but in the process results they have been replaced by the values produced by the Impute Missing Values operator.

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Hey, thanks for pointing this out. It is strange behavior indeed. @Marco_Boeck, can you look into this?

  • suleymansahal Member Posts: 27 Contributor II

    Hi again. Were you able to check this out?

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    I think Marco is out on holiday. Check back early next year.

  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering

    Hi,

    Technically, it's the "Impute Missing Values" operator that is buggy. It changes the underlying data behind the scenes, which is bad. Until this is fixed, you can work around it by adding a "Materialize Data" operator after the "Multiply" operator to get a fresh copy of the actual data:

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.4.000-SNAPSHOT">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.4.000-SNAPSHOT" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.4.000-SNAPSHOT" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="34">
            <parameter key="repository_entry" value="//Samples/data/Titanic"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="7.4.000-SNAPSHOT" expanded="true" height="103" name="Multiply (2)" width="90" x="179" y="85"/>
          <operator activated="true" class="set_role" compatibility="7.4.000-SNAPSHOT" expanded="true" height="82" name="Set Role" width="90" x="313" y="136">
            <parameter key="attribute_name" value="Survived"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="7.4.000-SNAPSHOT" expanded="true" height="103" name="Multiply" width="90" x="447" y="187"/>
          <operator activated="true" class="materialize_data" compatibility="7.4.000-SNAPSHOT" expanded="true" height="82" name="Materialize Data" width="90" x="648" y="187"/>
          <operator activated="true" class="impute_missing_values" compatibility="7.4.000-SNAPSHOT" expanded="true" height="68" name="Impute Missing Values" width="90" x="648" y="289">
            <parameter key="order" value="information gain"/>
            <process expanded="true">
              <operator activated="true" class="k_nn" compatibility="7.4.000-SNAPSHOT" expanded="true" height="82" name="k-NN" width="90" x="179" y="85">
                <parameter key="k" value="3"/>
              </operator>
              <connect from_port="example set source" to_op="k-NN" to_port="training set"/>
              <connect from_op="k-NN" from_port="model" to_port="model sink"/>
              <portSpacing port="source_example set source" spacing="0"/>
              <portSpacing port="sink_model sink" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve Titanic" from_port="output" to_op="Multiply (2)" to_port="input"/>
          <connect from_op="Multiply (2)" from_port="output 1" to_port="result 1"/>
          <connect from_op="Multiply (2)" from_port="output 2" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Materialize Data" to_port="example set input"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Impute Missing Values" to_port="example set in"/>
          <connect from_op="Materialize Data" from_port="example set output" to_port="result 2"/>
          <connect from_op="Impute Missing Values" from_port="example set out" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>

    Regards,

    Marco
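
One way to picture why the workaround in the post above helps is the toy Python sketch below. It is not RapidMiner code: `ExampleTable`, `multiply`, `materialize`, and `impute_in_place` are hypothetical names, and the sketch assumes that the branches coming out of "Multiply" share one underlying table, which is how "changes the underlying data behind the scenes" reads. Under that assumption, an in-place imputation on one branch also changes what the other branch sees, unless that other branch took its own copy first.

```python
import copy


class ExampleTable:
    """Toy stand-in for the data behind an example set (hypothetical)."""

    def __init__(self, rows):
        self.rows = rows


def multiply(table, copies=2):
    # Assumption: every branch gets a reference to the same table, not a copy.
    return [table for _ in range(copies)]


def materialize(table):
    # Builds a fresh, independent copy of the data.
    return ExampleTable(copy.deepcopy(table.rows))


def impute_in_place(table, column, fill_value):
    # Models the buggy behavior: missing values are overwritten in the shared data.
    for row in table.rows:
        if row[column] is None:
            row[column] = fill_value


def count_missing(table, column):
    return sum(1 for row in table.rows if row[column] is None)


def make_source():
    # Made-up rows; None marks a missing value.
    return ExampleTable([{"Age": 29.0}, {"Age": None}, {"Age": 2.0}])


# Without a copy: imputing on one branch also changes the other branch.
branch_a, branch_b = multiply(make_source())
impute_in_place(branch_b, "Age", 30.0)
missing_without_copy = count_missing(branch_a, "Age")

# With a copy taken before imputation: the copied branch keeps its missing value.
branch_a, branch_b = multiply(make_source())
branch_a = materialize(branch_a)
impute_in_place(branch_b, "Age", 30.0)
missing_with_copy = count_missing(branch_a, "Age")

print(missing_without_copy, missing_with_copy)
```

In the sketch the copy is taken before the imputation runs, which matches the process above, where "Materialize Data" sits on the branch that feeds result 2 and "Impute Missing Values" on the other branch.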

  • suleymansahal Member Posts: 27 Contributor II

    Thank you for the quick fix, Marco. As you agreed, it is a serious problem. I hope it can be solved soon.
