Random Forest & Missing Values

Liverpool_Reds Member Posts: 1 Contributor I
edited November 9 in Help

Could anyone please explain how the RapidMiner implementation of the Random Forest operator handles missing values in attributes?

Answers

  • SGolbert RapidMiner Certified Analyst, Member Posts: 184   Unicorn

    Hi,

     

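    In both Random Forest and Decision Tree, a missing value is treated as a separate value of its own, for numerical as well as nominal attributes. You can check this yourself with the following process, which declares Sex = Female and Age > 40 as missing in the Titanic Training sample data and then trains a Random Forest: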

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="9.0.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="9.0.000" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="112" y="34">
    <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
    </operator>
    <operator activated="true" class="declare_missing_value" compatibility="9.0.000" expanded="true" height="82" name="Declare Missing Value" width="90" x="246" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Sex"/>
    <parameter key="mode" value="nominal"/>
    <parameter key="nominal_value" value="Female"/>
    </operator>
    <operator activated="true" class="declare_missing_value" compatibility="9.0.000" expanded="true" height="82" name="Declare Missing Value (2)" width="90" x="447" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Age"/>
    <parameter key="mode" value="expression"/>
    <parameter key="nominal_value" value="Female"/>
    <parameter key="expression_value" value="Age&gt;40"/>
    </operator>
    <operator activated="true" class="concurrency:parallel_random_forest" compatibility="9.0.000" expanded="true" height="103" name="Random Forest" width="90" x="648" y="34"/>
    <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Declare Missing Value" to_port="example set input"/>
    <connect from_op="Declare Missing Value" from_port="example set output" to_op="Declare Missing Value (2)" to_port="example set input"/>
    <connect from_op="Declare Missing Value (2)" from_port="example set output" to_op="Random Forest" to_port="training set"/>
    <connect from_op="Random Forest" from_port="model" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Note that for numerical attributes this results in a 3-way split: one branch on each side of the split value, plus a third branch for the missing values.
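
    To make the 3-way split concrete, here is a minimal sketch in plain Python. It is not RapidMiner's actual code, only an illustration of the idea of giving missing values their own branch. The function names, the toy rows and the split value of 30 are all made up for the example.

```python
import math
from collections import Counter


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def three_way_split(rows, attribute, threshold):
    """Partition rows into three branches: <= threshold, > threshold, missing.

    Hypothetical helper for illustration only; it does not choose the
    threshold, it just applies one that is already given.
    """
    branches = {"<=": [], ">": [], "missing": []}
    for row in rows:
        value = row.get(attribute)
        if is_missing(value):
            branches["missing"].append(row)
        elif value <= threshold:
            branches["<="].append(row)
        else:
            branches[">"].append(row)
    return branches


def majority_label(rows, label):
    """Most frequent label value among the rows of one branch."""
    return Counter(row[label] for row in rows).most_common(1)[0][0]


# Toy data, invented for this example (not taken from the Titanic sample set).
passengers = [
    {"Age": 22.0, "Survived": "No"},
    {"Age": 35.0, "Survived": "Yes"},
    {"Age": None, "Survived": "No"},
    {"Age": 8.0, "Survived": "Yes"},
    {"Age": float("nan"), "Survived": "No"},
]

branches = three_way_split(passengers, "Age", 30.0)
for name, rows in branches.items():
    print(name, len(rows), majority_label(rows, "Survived"))
```

    Each branch then gets its own prediction, so rows with a missing Age are routed to the "missing" branch instead of being dropped or filled in first.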

     

    With Decision Tree models, imputing missing values doesn't improve the model unless you have a very precise way to do it.

     

    Regards,

    Sebastian
