Fill in Missing Values by average of whole row

msacs09msacs09 Member Posts: 50 Contributor II
I'm trying to fill in the missing values by averaging the rows or possibly use some row level computation like regression. I'm wondering if i need to do un-pivot it? I would greatly appreciate a sample process that can do this . Below is my sample row level data 


Answers

  • SGolbertSGolbert RapidMiner Certified Analyst, Member Posts: 329   Unicorn
    edited February 7

    You can do it with the Generate Aggregation operator and a Loop. Something like this:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="read_excel" compatibility="9.1.000" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34">
    <parameter key="excel_file" value="C:\Users\sgolbert\Downloads\Sample_Data_1.xlsx"/>
    <parameter key="sheet_selection" value="sheet number"/>
    <parameter key="sheet_number" value="1"/>
    <parameter key="imported_cell_range" value="A1"/>
    <parameter key="encoding" value="SYSTEM"/>
    <parameter key="first_row_as_names" value="true"/>
    <list key="annotations"/>
    <parameter key="date_format" value=""/>
    <parameter key="time_zone" value="SYSTEM"/>
    <parameter key="locale" value="English (United States)"/>
    <parameter key="read_all_values_as_polynominal" value="false"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="ID.true.integer.id"/>
    <parameter key="1" value="2017-01.true.integer.attribute"/>
    <parameter key="2" value="2017-02.true.integer.attribute"/>
    <parameter key="3" value="2017-03.true.integer.attribute"/>
    <parameter key="4" value="2017-04.true.integer.attribute"/>
    <parameter key="5" value="2017-05.true.integer.attribute"/>
    <parameter key="6" value="2017-06.true.integer.attribute"/>
    <parameter key="7" value="2017-07.true.integer.attribute"/>
    <parameter key="8" value="2017-08.true.integer.attribute"/>
    <parameter key="9" value="2017-09.true.integer.attribute"/>
    <parameter key="10" value="2017-10.true.integer.attribute"/>
    <parameter key="11" value="2017-11.true.integer.attribute"/>
    <parameter key="12" value="2017-12.true.integer.attribute"/>
    <parameter key="13" value="2018-01.true.integer.attribute"/>
    <parameter key="14" value="2018-02.true.integer.attribute"/>
    <parameter key="15" value="2018-03.true.integer.attribute"/>
    <parameter key="16" value="2018-04.true.integer.attribute"/>
    <parameter key="17" value="2018-05.true.integer.attribute"/>
    <parameter key="18" value="2018-06.true.integer.attribute"/>
    <parameter key="19" value="2018-07.true.integer.attribute"/>
    <parameter key="20" value="2018-08.true.integer.attribute"/>
    <parameter key="21" value="2018-09.true.integer.attribute"/>
    <parameter key="22" value="2018-10.true.integer.attribute"/>
    <parameter key="23" value="2018-11.true.integer.attribute"/>
    <parameter key="24" value="2018-12.true.integer.attribute"/>
    </list>
    <parameter key="read_not_matching_values_as_missings" value="false"/>
    <parameter key="datamanagement" value="double_array"/>
    <parameter key="data_management" value="auto"/>
    </operator>
    <operator activated="true" class="generate_aggregation" compatibility="9.1.000" expanded="true" height="82" name="Generate Aggregation" width="90" x="246" y="34">
    <parameter key="attribute_name" value="row_mean"/>
    <parameter key="attribute_filter_type" value="all"/>
    <parameter key="attribute" value=""/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="false"/>
    <parameter key="aggregation_function" value="average"/>
    <parameter key="concatenation_separator" value="|"/>
    <parameter key="keep_all" value="true"/>
    <parameter key="ignore_missings" value="true"/>
    <parameter key="ignore_missing_attributes" value="false"/>
    </operator>
    <operator activated="true" class="numerical_to_real" compatibility="9.1.000" expanded="true" height="82" name="Numerical to Real" width="90" x="380" y="34">
    <parameter key="attribute_filter_type" value="all"/>
    <parameter key="attribute" value=""/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="numeric"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="real"/>
    <parameter key="block_type" value="value_series"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_series_end"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="false"/>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="9.1.000" expanded="true" height="68" name="Extract Macro (2)" width="90" x="514" y="34">
    <parameter key="macro" value="sample_size"/>
    <parameter key="macro_type" value="number_of_examples"/>
    <parameter key="statistics" value="average"/>
    <parameter key="attribute_name" value=""/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="concurrency:loop" compatibility="9.1.000" expanded="true" height="82" name="Loop" width="90" x="648" y="34">
    <parameter key="number_of_iterations" value="%{sample_size}"/>
    <parameter key="iteration_macro" value="iteration"/>
    <parameter key="reuse_results" value="false"/>
    <parameter key="enable_parallel_execution" value="true"/>
    <process expanded="true">
    <operator activated="true" class="filter_example_range" compatibility="9.1.000" expanded="true" height="82" name="Filter Example Range" width="90" x="112" y="34">
    <parameter key="first_example" value="%{iteration}"/>
    <parameter key="last_example" value="%{iteration}"/>
    <parameter key="invert_filter" value="false"/>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="9.1.000" expanded="true" height="68" name="Extract Macro" width="90" x="380" y="34">
    <parameter key="macro" value="replacement"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="statistics" value="average"/>
    <parameter key="attribute_name" value="row_mean"/>
    <parameter key="example_index" value="1"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="replace_missing_values" compatibility="9.1.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="648" y="34">
    <parameter key="return_preprocessing_model" value="false"/>
    <parameter key="create_view" value="false"/>
    <parameter key="attribute_filter_type" value="all"/>
    <parameter key="attribute" value=""/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="false"/>
    <parameter key="default" value="value"/>
    <list key="columns"/>
    <parameter key="replenishment_value" value="%{replacement}"/>
    </operator>
    <connect from_port="input 1" to_op="Filter Example Range" to_port="example set input"/>
    <connect from_op="Filter Example Range" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="Replace Missing Values" to_port="example set input"/>
    <connect from_op="Replace Missing Values" from_port="example set output" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="append" compatibility="9.1.000" expanded="true" height="82" name="Append" width="90" x="782" y="34">
    <parameter key="datamanagement" value="double_array"/>
    <parameter key="data_management" value="auto"/>
    <parameter key="merge_type" value="all"/>
    </operator>
    <connect from_op="Read Excel" from_port="output" to_op="Generate Aggregation" to_port="example set input"/>
    <connect from_op="Generate Aggregation" from_port="example set output" to_op="Numerical to Real" to_port="example set input"/>
    <connect from_op="Numerical to Real" from_port="example set output" to_op="Extract Macro (2)" to_port="example set"/>
    <connect from_op="Extract Macro (2)" from_port="example set" to_op="Loop" to_port="input 1"/>
    <connect from_op="Loop" from_port="output 1" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>


    It is a bit clumsy, but I didn't find an easier way.

    Regards,
    Sebastian


    msacs09eackley29
  • msacs09msacs09 Member Posts: 50 Contributor II
    @SGolbert Thank you. If I may ask, how do i replace the average here with a  regression model such as KNN or some other learner, so that we impute with a regression value as opposed to average, such that we do not repeat the SAME value for every missing row for that ID?

    Thank you for your valuable time.
  • SGolbertSGolbert RapidMiner Certified Analyst, Member Posts: 329   Unicorn
    Hi msacs,

    you could do something very similar, but just use a prediction column (after applying the model) instead of the row_mean column. Another option is the Impute Missing Values operator, that does practically the same, all in one go.

    Do you want to try it out?

    Regards,
    Sebastian

  • msacs09msacs09 Member Posts: 50 Contributor II
    It appears that the imputation operator has a bug. It complains about missing exampleset as reported here. So I was curious if your approach would resolve that issue. 

    https://community.rapidminer.com/discussion/54957/imputing-challenge-with-a-best-possible-ensemble#latest
  • sgenzersgenzer 12Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,383  Community Manager
    nope no bug. See my response in other thread.

    Scott
Sign In or Register to comment.