Options

Imputing Nulls using Grouped by values

garg2409garg2409 Member Posts: 2 Contributor I
edited November 2018 in Help

Hello Community,

Can we impute Null values with Grouped by some column of Non Null Values?

Eg : In titanic problem, I want to impute Age column with the mean age grouped by Sex. Now, this logic can be as complex as we want. Is there a functionality in RM to write our own rules for imputing Nulls?

P.S. I have seen the Imputing Missing Values operator. KNN may not work for all cases so we may need custom rules.

Thanks :)

Tagged:

Best Answer

  • Options
    MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,524 RM Data Scientist
    Solution Accepted

    Hi,

    what about loop values, filter example, replace missings? :)

     

     Edit: a quicker way uses Aggregate and join. An example is attached.

     

    ~Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.3.001" expanded="true" height="68" name="Retrieve Titanic" width="90" x="112" y="136">
    <parameter key="repository_entry" value="//Samples/data/Titanic"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.3.001" expanded="true" height="103" name="Multiply" width="90" x="246" y="136"/>
    <operator activated="true" class="aggregate" compatibility="7.3.001" expanded="true" height="82" name="Aggregate" width="90" x="380" y="34">
    <list key="aggregation_attributes">
    <parameter key="Age" value="average"/>
    </list>
    <parameter key="group_by_attributes" value="Sex"/>
    </operator>
    <operator activated="true" class="join" compatibility="7.3.001" expanded="true" height="82" name="Join" width="90" x="581" y="136">
    <parameter key="join_type" value="right"/>
    <parameter key="use_id_attribute_as_key" value="false"/>
    <list key="key_attributes">
    <parameter key="Sex" value="Sex"/>
    </list>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="136">
    <list key="function_descriptions">
    <parameter key="Age" value="if(missing(Age),[average(Age)],Age)"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="916" y="136">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="average(Age)"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <connect from_op="Retrieve Titanic" from_port="output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
    <connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="left"/>
    <connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany

Answers

  • Options
    garg2409garg2409 Member Posts: 2 Contributor I

    Thanks Man! That Worked!  

    Capture.JPG

     

     

     

    capture 2.JPGSub Process

     

    Just one more question :)

    If I wanted to do imputation using group by on two columns (Sex, Passenger Class), will I need to build two Loop Value operators?

     

    EDIT:

    Your edit answers my second query as well. Thanks :)

  • Options
    MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,524 RM Data Scientist

    If you want to go for the Loop Values way with more than one group by col you can create an attribute like

    concat(group_by_att_a,group_by_att_b)

    and use it on it.

     

    While loop values is a bit more flexible i would recommend the aggregate way if possible. It's simply faster.

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
Sign In or Register to comment.