Fill in Missing Values by average of whole row

msacs09msacs09 Member Posts: 55 Contributor II
edited March 2020 in Help
I'm trying to fill in the missing values by averaging the rows or possibly use some row level computation like regression. I'm wondering if i need to do un-pivot it? I would greatly appreciate a sample process that can do this . Below is my sample row level data 


Tagged:

Answers

  • SGolbertSGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn
    edited February 2019

    You can do it with the Generate Aggregation operator and a Loop. Something like this:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.1.000"><br>  <context><br>    <input/><br>    <output/><br>    <macros/><br>  </context><br>  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process"><br>    <parameter key="logverbosity" value="init"/><br>    <parameter key="random_seed" value="2001"/><br>    <parameter key="send_mail" value="never"/><br>    <parameter key="notification_email" value=""/><br>    <parameter key="process_duration_for_mail" value="30"/><br>    <parameter key="encoding" value="SYSTEM"/><br>    <process expanded="true"><br>      <operator activated="true" class="read_excel" compatibility="9.1.000" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34"><br>        <parameter key="excel_file" value="C:\Users\sgolbert\Downloads\Sample_Data_1.xlsx"/><br>        <parameter key="sheet_selection" value="sheet number"/><br>        <parameter key="sheet_number" value="1"/><br>        <parameter key="imported_cell_range" value="A1"/><br>        <parameter key="encoding" value="SYSTEM"/><br>        <parameter key="first_row_as_names" value="true"/><br>        <list key="annotations"/><br>        <parameter key="date_format" value=""/><br>        <parameter key="time_zone" value="SYSTEM"/><br>        <parameter key="locale" value="English (United States)"/><br>        <parameter key="read_all_values_as_polynominal" value="false"/><br>        <list key="data_set_meta_data_information"><br>          <parameter key="0" value="ID.true.integer.id"/><br>          <parameter key="1" value="2017-01.true.integer.attribute"/><br>          <parameter key="2" value="2017-02.true.integer.attribute"/><br>          <parameter key="3" value="2017-03.true.integer.attribute"/><br>          <parameter key="4" value="2017-04.true.integer.attribute"/><br>          <parameter key="5" value="2017-05.true.integer.attribute"/><br>          <parameter key="6" value="2017-06.true.integer.attribute"/><br>          <parameter key="7" value="2017-07.true.integer.attribute"/><br>          <parameter key="8" value="2017-08.true.integer.attribute"/><br>          <parameter key="9" value="2017-09.true.integer.attribute"/><br>          <parameter key="10" value="2017-10.true.integer.attribute"/><br>          <parameter key="11" value="2017-11.true.integer.attribute"/><br>          <parameter key="12" value="2017-12.true.integer.attribute"/><br>          <parameter key="13" value="2018-01.true.integer.attribute"/><br>          <parameter key="14" value="2018-02.true.integer.attribute"/><br>          <parameter key="15" value="2018-03.true.integer.attribute"/><br>          <parameter key="16" value="2018-04.true.integer.attribute"/><br>          <parameter key="17" value="2018-05.true.integer.attribute"/><br>          <parameter key="18" value="2018-06.true.integer.attribute"/><br>          <parameter key="19" value="2018-07.true.integer.attribute"/><br>          <parameter key="20" value="2018-08.true.integer.attribute"/><br>          <parameter key="21" value="2018-09.true.integer.attribute"/><br>          <parameter key="22" value="2018-10.true.integer.attribute"/><br>          <parameter key="23" value="2018-11.true.integer.attribute"/><br>          <parameter key="24" value="2018-12.true.integer.attribute"/><br>        </list><br>        <parameter key="read_not_matching_values_as_missings" value="false"/><br>        <parameter key="datamanagement" value="double_array"/><br>        <parameter key="data_management" value="auto"/><br>      </operator><br>      <operator activated="true" class="generate_aggregation" compatibility="9.1.000" expanded="true" height="82" name="Generate Aggregation" width="90" x="246" y="34"><br>        <parameter key="attribute_name" value="row_mean"/><br>        <parameter key="attribute_filter_type" value="all"/><br>        <parameter key="attribute" value=""/><br>        <parameter key="attributes" value=""/><br>        <parameter key="use_except_expression" value="false"/><br>        <parameter key="value_type" value="attribute_value"/><br>        <parameter key="use_value_type_exception" value="false"/><br>        <parameter key="except_value_type" value="time"/><br>        <parameter key="block_type" value="attribute_block"/><br>        <parameter key="use_block_type_exception" value="false"/><br>        <parameter key="except_block_type" value="value_matrix_row_start"/><br>        <parameter key="invert_selection" value="false"/><br>        <parameter key="include_special_attributes" value="false"/><br>        <parameter key="aggregation_function" value="average"/><br>        <parameter key="concatenation_separator" value="|"/><br>        <parameter key="keep_all" value="true"/><br>        <parameter key="ignore_missings" value="true"/><br>        <parameter key="ignore_missing_attributes" value="false"/><br>      </operator><br>      <operator activated="true" class="numerical_to_real" compatibility="9.1.000" expanded="true" height="82" name="Numerical to Real" width="90" x="380" y="34"><br>        <parameter key="attribute_filter_type" value="all"/><br>        <parameter key="attribute" value=""/><br>        <parameter key="attributes" value=""/><br>        <parameter key="use_except_expression" value="false"/><br>        <parameter key="value_type" value="numeric"/><br>        <parameter key="use_value_type_exception" value="false"/><br>        <parameter key="except_value_type" value="real"/><br>        <parameter key="block_type" value="value_series"/><br>        <parameter key="use_block_type_exception" value="false"/><br>        <parameter key="except_block_type" value="value_series_end"/><br>        <parameter key="invert_selection" value="false"/><br>        <parameter key="include_special_attributes" value="false"/><br>      </operator><br>      <operator activated="true" class="extract_macro" compatibility="9.1.000" expanded="true" height="68" name="Extract Macro (2)" width="90" x="514" y="34"><br>        <parameter key="macro" value="sample_size"/><br>        <parameter key="macro_type" value="number_of_examples"/><br>        <parameter key="statistics" value="average"/><br>        <parameter key="attribute_name" value=""/><br>        <list key="additional_macros"/><br>      </operator><br>      <operator activated="true" class="concurrency:loop" compatibility="9.1.000" expanded="true" height="82" name="Loop" width="90" x="648" y="34"><br>        <parameter key="number_of_iterations" value="%{sample_size}"/><br>        <parameter key="iteration_macro" value="iteration"/><br>        <parameter key="reuse_results" value="false"/><br>        <parameter key="enable_parallel_execution" value="true"/><br>        <process expanded="true"><br>          <operator activated="true" class="filter_example_range" compatibility="9.1.000" expanded="true" height="82" name="Filter Example Range" width="90" x="112" y="34"><br>            <parameter key="first_example" value="%{iteration}"/><br>            <parameter key="last_example" value="%{iteration}"/><br>            <parameter key="invert_filter" value="false"/><br>          </operator><br>          <operator activated="true" class="extract_macro" compatibility="9.1.000" expanded="true" height="68" name="Extract Macro" width="90" x="380" y="34"><br>            <parameter key="macro" value="replacement"/><br>            <parameter key="macro_type" value="data_value"/><br>            <parameter key="statistics" value="average"/><br>            <parameter key="attribute_name" value="row_mean"/><br>            <parameter key="example_index" value="1"/><br>            <list key="additional_macros"/><br>          </operator><br>          <operator activated="true" class="replace_missing_values" compatibility="9.1.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="648" y="34"><br>            <parameter key="return_preprocessing_model" value="false"/><br>            <parameter key="create_view" value="false"/><br>            <parameter key="attribute_filter_type" value="all"/><br>            <parameter key="attribute" value=""/><br>            <parameter key="attributes" value=""/><br>            <parameter key="use_except_expression" value="false"/><br>            <parameter key="value_type" value="attribute_value"/><br>            <parameter key="use_value_type_exception" value="false"/><br>            <parameter key="except_value_type" value="time"/><br>            <parameter key="block_type" value="attribute_block"/><br>            <parameter key="use_block_type_exception" value="false"/><br>            <parameter key="except_block_type" value="value_matrix_row_start"/><br>            <parameter key="invert_selection" value="false"/><br>            <parameter key="include_special_attributes" value="false"/><br>            <parameter key="default" value="value"/><br>            <list key="columns"/><br>            <parameter key="replenishment_value" value="%{replacement}"/><br>          </operator><br>          <connect from_port="input 1" to_op="Filter Example Range" to_port="example set input"/><br>          <connect from_op="Filter Example Range" from_port="example set output" to_op="Extract Macro" to_port="example set"/><br>          <connect from_op="Extract Macro" from_port="example set" to_op="Replace Missing Values" to_port="example set input"/><br>          <connect from_op="Replace Missing Values" from_port="example set output" to_port="output 1"/><br>          <portSpacing port="source_input 1" spacing="0"/><br>          <portSpacing port="source_input 2" spacing="0"/><br>          <portSpacing port="sink_output 1" spacing="0"/><br>          <portSpacing port="sink_output 2" spacing="0"/><br>        </process><br>      </operator><br>      <operator activated="true" class="append" compatibility="9.1.000" expanded="true" height="82" name="Append" width="90" x="782" y="34"><br>        <parameter key="datamanagement" value="double_array"/><br>        <parameter key="data_management" value="auto"/><br>        <parameter key="merge_type" value="all"/><br>      </operator><br>      <connect from_op="Read Excel" from_port="output" to_op="Generate Aggregation" to_port="example set input"/><br>      <connect from_op="Generate Aggregation" from_port="example set output" to_op="Numerical to Real" to_port="example set input"/><br>      <connect from_op="Numerical to Real" from_port="example set output" to_op="Extract Macro (2)" to_port="example set"/><br>      <connect from_op="Extract Macro (2)" from_port="example set" to_op="Loop" to_port="input 1"/><br>      <connect from_op="Loop" from_port="output 1" to_op="Append" to_port="example set 1"/><br>      <connect from_op="Append" from_port="merged set" to_port="result 1"/><br>      <portSpacing port="source_input 1" spacing="0"/><br>      <portSpacing port="sink_result 1" spacing="0"/><br>      <portSpacing port="sink_result 2" spacing="0"/><br>    </process><br>  </operator><br></process><br><br>

    It is a bit clumsy, but I didn't find an easier way.

    Regards,
    Sebastian


  • msacs09msacs09 Member Posts: 55 Contributor II
    @SGolbert Thank you. If I may ask, how do i replace the average here with a  regression model such as KNN or some other learner, so that we impute with a regression value as opposed to average, such that we do not repeat the SAME value for every missing row for that ID?

    Thank you for your valuable time.
  • SGolbertSGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn
    Hi msacs,

    you could do something very similar, but just use a prediction column (after applying the model) instead of the row_mean column. Another option is the Impute Missing Values operator, that does practically the same, all in one go.

    Do you want to try it out?

    Regards,
    Sebastian

  • msacs09msacs09 Member Posts: 55 Contributor II
    It appears that the imputation operator has a bug. It complains about missing exampleset as reported here. So I was curious if your approach would resolve that issue. 

    https://community.rapidminer.com/discussion/54957/imputing-challenge-with-a-best-possible-ensemble#latest
  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    nope no bug. See my response in other thread.

    Scott
Sign In or Register to comment.