Impute Missing Values by KNN

derek_tsui Member Posts: 4 Contributor I
edited May 2020 in Help

Hi Experts,

 

I walked through the tutorial for the 'Impute Missing Values' operator. It uses a k-NN scheme, with the "iterate" and "learn on complete cases" parameters ticked. Is k-NN the default scheme this operator uses for imputation?

 

Thanks,

Derek

Answers

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    Hi Derek,

     

    For the tutorial process, k-NN with the default k of 1 is useful because it simply takes the value from the record nearest (by a distance measure) to the record with the missing value. It's a pretty logical choice for a default.
    However, you are not limited to k-NN. Here is an example that uses a Decision Tree for nominal attributes and a Neural Network for numerical attributes, alongside a k-NN branch for comparison.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.5.001" expanded="true" height="68" name="Labor-Negotiations" width="90" x="112" y="85">
    <parameter key="repository_entry" value="//Samples/data/Labor-Negotiations"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.5.001" expanded="true" height="103" name="Multiply" width="90" x="246" y="85"/>
    <operator activated="true" class="materialize_data" compatibility="7.5.001" expanded="true" height="82" name="DT then NN" width="90" x="380" y="34"/>
    <operator activated="true" class="materialize_data" compatibility="7.5.001" expanded="true" height="82" name="kNN" width="90" x="447" y="187"/>
    <operator activated="true" class="impute_missing_values" compatibility="7.3.001" expanded="true" height="68" name="Impute Missing Values" width="90" x="514" y="34">
    <parameter key="attribute_filter_type" value="value_type"/>
    <parameter key="value_type" value="nominal"/>
    <process expanded="true">
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.5.001" expanded="true" height="82" name="Decision Tree" width="90" x="380" y="34"/>
    <connect from_port="example set source" to_op="Decision Tree" to_port="training set"/>
    <connect from_op="Decision Tree" from_port="model" to_port="model sink"/>
    <portSpacing port="source_example set source" spacing="0"/>
    <portSpacing port="sink_model sink" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="impute_missing_values" compatibility="7.3.001" expanded="true" height="68" name="Impute Missing Values (2)" width="90" x="648" y="34">
    <parameter key="attribute_filter_type" value="value_type"/>
    <parameter key="value_type" value="numeric"/>
    <process expanded="true">
    <operator activated="true" class="neural_net" compatibility="7.5.001" expanded="true" height="82" name="Neural Net" width="90" x="179" y="34">
    <list key="hidden_layers"/>
    </operator>
    <connect from_port="example set source" to_op="Neural Net" to_port="training set"/>
    <connect from_op="Neural Net" from_port="model" to_port="model sink"/>
    <portSpacing port="source_example set source" spacing="0"/>
    <portSpacing port="sink_model sink" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="impute_missing_values" compatibility="7.3.001" expanded="true" height="68" name="Impute Missing Values (3)" width="90" x="581" y="187">
    <parameter key="value_type" value="nominal"/>
    <process expanded="true">
    <operator activated="true" class="k_nn" compatibility="7.5.001" expanded="true" height="82" name="k-NN" width="90" x="112" y="34"/>
    <connect from_port="example set source" to_op="k-NN" to_port="training set"/>
    <connect from_op="k-NN" from_port="model" to_port="model sink"/>
    <portSpacing port="source_example set source" spacing="0"/>
    <portSpacing port="sink_model sink" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Labor-Negotiations" from_port="output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="DT then NN" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_op="kNN" to_port="example set input"/>
    <connect from_op="DT then NN" from_port="example set output" to_op="Impute Missing Values" to_port="example set in"/>
    <connect from_op="kNN" from_port="example set output" to_op="Impute Missing Values (3)" to_port="example set in"/>
    <connect from_op="Impute Missing Values" from_port="example set out" to_op="Impute Missing Values (2)" to_port="example set in"/>
    <connect from_op="Impute Missing Values (2)" from_port="example set out" to_port="result 1"/>
    <connect from_op="Impute Missing Values (3)" from_port="example set out" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
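
The nearest-record idea described in this answer can be sketched outside RapidMiner in a few lines of plain Python. This is only an illustration of 1-NN imputation, not the operator's actual implementation: the function name `impute_1nn`, the column names, and the toy rows are all hypothetical, and the sketch assumes the other columns are numeric and complete.

```python
import math

def impute_1nn(rows, target):
    """Fill missing (None) values in `target` from the nearest complete row.

    Hypothetical helper: distance is Euclidean over all other columns,
    which are assumed to be numeric and free of missing values.
    """
    predictors = [c for c in rows[0] if c != target]
    # Rows that already have a value for the target act as the "training set".
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is None:
            here = [r[p] for p in predictors]
            nearest = min(
                complete,
                key=lambda c: math.dist(here, [c[p] for p in predictors]),
            )
            # Copy the row so the input list is left unchanged.
            r = {**r, target: nearest[target]}
        filled.append(r)
    return filled

# Made-up toy data, not the Labor-Negotiations sample set.
rows = [
    {"wage": 2.0, "hours": 40.0, "col_adj": "none"},
    {"wage": 4.5, "hours": 35.0, "col_adj": "tc"},
    {"wage": 4.4, "hours": 36.0, "col_adj": None},
]
result = impute_1nn(rows, "col_adj")
print(result[2]["col_adj"])
```

With k = 1 the missing value is copied from the single closest record; a larger k would instead take a vote (nominal) or an average (numerical) over several neighbours.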

  • derek_tsui Member Posts: 4 Contributor I

    When I tried to apply this operator with either a Decision Tree or k-NN, both gave the same error message: "Missing attributes: Input ExampleSet has no attributes. Learning schemes cannot be applied without at least one valide attribute." Did I miss anything needed to apply these algorithms?

     

    Thanks,
    Derek

  • ecolix Member Posts: 1 Contributor I

    I'm also facing the same problem. In the Impute Missing Values operator, I selected only a single attribute, "col-adj", as in the process below. It seems the operator selects only that attribute and passes it into its inner process, which then returns the error.

     

    Does this mean we can only impute all missing values at once and don't get to select which column to impute?

     

    Thanks!

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.5.001" expanded="true" height="68" name="Labor-Negotiations" width="90" x="380" y="30">
    <parameter key="repository_entry" value="//Samples/data/Labor-Negotiations"/>
    </operator>
    <operator activated="true" class="impute_missing_values" compatibility="7.3.001" expanded="true" height="68" name="Impute Missing Values" width="90" x="514" y="30">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="col-adj"/>
    <process expanded="true">
    <operator activated="true" class="k_nn" compatibility="7.5.001" expanded="true" height="82" name="k-NN" width="90" x="313" y="30"/>
    <connect from_port="example set source" to_op="k-NN" to_port="training set"/>
    <connect from_op="k-NN" from_port="model" to_port="model sink"/>
    <portSpacing port="source_example set source" spacing="0"/>
    <portSpacing port="sink_model sink" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Labor-Negotiations" from_port="output" to_op="Impute Missing Values" to_port="example set in"/>
    <connect from_op="Impute Missing Values" from_port="example set out" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

     

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    @ecolix Your setup will not produce any results because you select only one column, which removes the additional feature information the k-NN algorithm needs to work out what the missing values are. If you want to replace missing values in a single column, you might want to look at the generic Replace Missing Values operator and set them to a specific value. When using Impute Missing Values, it's best to use the entire data set and not just a single column.
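
To make the reasoning in this answer concrete, here is a small Python sketch of what happens when only the target column reaches the learner. It is a simplified model of the explanation above, not RapidMiner's code; `split_for_learner` and the column names are hypothetical.

```python
def split_for_learner(rows, selected, target):
    """Keep only the `selected` columns, then list what is left to learn from.

    Hypothetical model of an attribute filter placed in front of a learner.
    """
    kept = [{k: r[k] for k in selected} for r in rows]
    # The target is what we want to predict, so it cannot be a predictor.
    predictors = [k for k in selected if k != target]
    return kept, predictors

# Made-up toy rows for illustration.
rows = [
    {"wage": 2.0, "hours": 40.0, "col_adj": "none"},
    {"wage": 4.4, "hours": 36.0, "col_adj": None},
]

# Filtering down to the single column leaves nothing to measure distance on.
_, single = split_for_learner(rows, ["col_adj"], "col_adj")

# Passing every column leaves the other attributes available as predictors.
_, full = split_for_learner(rows, ["wage", "hours", "col_adj"], "col_adj")

print(single, full)
```

An empty predictor list corresponds to the "Input ExampleSet has no attributes" message reported earlier in the thread: the learner receives a label to predict but nothing to predict it from.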
