Random Forest & Missing Values



Could anyone please explain how the RapidMiner implementation of the Random Forest operator handles missing values in attributes?


Re: Random Forest & Missing Values



In both Random Forest and Decision Tree, a missing value is treated as a separate value of its own, for numerical and nominal attributes alike. You can check this yourself with the following process:


<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000">
  <operator activated="true" class="process" compatibility="9.0.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.0.000" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="112" y="34">
        <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
      </operator>
      <!-- Turn every "Female" value of the nominal attribute Sex into a missing value -->
      <operator activated="true" class="declare_missing_value" compatibility="9.0.000" expanded="true" height="82" name="Declare Missing Value" width="90" x="246" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Sex"/>
        <parameter key="mode" value="nominal"/>
        <parameter key="nominal_value" value="Female"/>
      </operator>
      <!-- Turn every Age above 40 into a missing value of the numerical attribute Age -->
      <operator activated="true" class="declare_missing_value" compatibility="9.0.000" expanded="true" height="82" name="Declare Missing Value (2)" width="90" x="447" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Age"/>
        <parameter key="mode" value="expression"/>
        <parameter key="nominal_value" value="Female"/>
        <parameter key="expression_value" value="Age&gt;40"/>
      </operator>
      <operator activated="true" class="concurrency:parallel_random_forest" compatibility="9.0.000" expanded="true" height="103" name="Random Forest" width="90" x="648" y="34"/>
      <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Declare Missing Value" to_port="example set input"/>
      <connect from_op="Declare Missing Value" from_port="example set output" to_op="Declare Missing Value (2)" to_port="example set input"/>
      <connect from_op="Declare Missing Value (2)" from_port="example set output" to_op="Random Forest" to_port="training set"/>
      <connect from_op="Random Forest" from_port="model" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

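Note that for numerical attributes this results in a 3-way split: one branch for values up to the split threshold, one for values above it, and one for missing values.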
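To picture what a 3-way split on a numerical attribute does, here is a minimal Python sketch. It is only an illustration of the idea, not RapidMiner's actual code: the function name `three_way_branch`, the branch labels, and the threshold of 30 are all made up for the example.

```python
import math


def three_way_branch(value, threshold):
    """Return the branch a value would follow at a numerical split.

    Hypothetical helper for illustration only; not RapidMiner's implementation.
    """
    # A missing value (None or NaN) is routed to its own branch
    # instead of being imputed or dropped.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "missing"
    # Present values follow the usual binary comparison against the threshold.
    return "<= threshold" if value <= threshold else "> threshold"


# Made-up ages; None and NaN stand in for missing values.
ages = [22.0, 38.0, None, float("nan"), 35.0]
for age in ages:
    print(age, "->", three_way_branch(age, threshold=30.0))
```

In the process above, every Age over 40 is declared missing, so those rows would all end up in the "missing" branch of any split on Age.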


With Decision Tree models, imputing missing values doesn't improve the model unless you have a very precise way to do it.