Downsampling of Data (From 1 hour each to 1 day[24hrs] each)

1705410G1705410G Member Posts: 43 Contributor I
edited January 2019 in Help
My data is divided into 3 months of 1 hour each, so there's roughly 2000 over examples. 
How can I merge the data into 1 day(24hrs) each instead? There should be about 90+ examples. 

I thought about using Generate Attribute, but I don't think that would work.

Thank you!

Regards,
AY
Tagged:

Best Answer

Answers

  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @1705410G,

    To better understand, you want downsample your time series ? 
    a value each hour to a value each day ? and take for example the average of the 24 hours as value of the day ?
    I'm correct ?

    Regards,

    Lionel
  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @1705410G,

    I did not find a solution with RapidMiner's native operators, so I propose a solution with a (very simple) Python script.
    To execute this process, you have : 
     - to install Python on your computer.
     - install the Python Scripting extension from the marketplace.

    The process : 

    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000-SNAPSHOT">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.000-SNAPSHOT" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.7.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="85">
            <parameter key="generator_type" value="date_series"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="true"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)">
              <parameter key="date" value="2019-01-01 00:00:00.1.hour"/>
            </list>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Generate ID" width="90" x="179" y="85">
            <parameter key="create_nominal_ids" value="false"/>
            <parameter key="offset" value="0"/>
          </operator>
          <operator activated="true" class="generate_data" compatibility="9.2.000-SNAPSHOT" expanded="true" height="68" name="Generate Data" width="90" x="45" y="187">
            <parameter key="target_function" value="random"/>
            <parameter key="number_examples" value="100"/>
            <parameter key="number_of_attributes" value="2"/>
            <parameter key="attributes_lower_bound" value="-10.0"/>
            <parameter key="attributes_upper_bound" value="10.0"/>
            <parameter key="gaussian_standard_deviation" value="10.0"/>
            <parameter key="largest_radius" value="10.0"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Generate ID (2)" width="90" x="179" y="187">
            <parameter key="create_nominal_ids" value="false"/>
            <parameter key="offset" value="0"/>
          </operator>
          <operator activated="true" breakpoints="after" class="concurrency:join" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Join" width="90" x="313" y="85">
            <parameter key="remove_double_attributes" value="true"/>
            <parameter key="join_type" value="inner"/>
            <parameter key="use_id_attribute_as_key" value="true"/>
            <list key="key_attributes"/>
            <parameter key="keep_both_join_attributes" value="false"/>
          </operator>
          <operator activated="true" class="python_scripting:execute_python" compatibility="9.1.000" expanded="true" height="103" name="Execute Python" width="90" x="581" y="85">
            <parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10;  data = data.resample('D',on = 'date').mean()&#10;  data = data.reset_index(level = ['date'])&#10;  data['date'] = data['date'].astype(str)&#10;    &#10;  return data"/>
            <parameter key="use_default_python" value="true"/>
            <parameter key="package_manager" value="conda (anaconda)"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Set Role" width="90" x="715" y="85">
            <parameter key="attribute_name" value="date"/>
            <parameter key="target_role" value="id"/>
            <list key="set_additional_roles"/>
          </operator>
          <connect from_op="Create ExampleSet" from_port="output" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Join" to_port="left"/>
          <connect from_op="Generate Data" from_port="output" to_op="Generate ID (2)" to_port="example set input"/>
          <connect from_op="Generate ID (2)" from_port="example set output" to_op="Join" to_port="right"/>
          <connect from_op="Join" from_port="join" to_op="Execute Python" to_port="input 1"/>
          <connect from_op="Execute Python" from_port="output 1" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    

    Hope it helps,

    Regards,

    Lionel

    NB : Of course, I am interested to discover a solution 100% RapidMiner for this task...

  • 1705410G1705410G Member Posts: 43 Contributor I
    lionelderkrikor ,

    Yes to this:
    To better understand, you want downsample your time series ? 

    a value each hour to a value each day ? and take for example the average of the 24 hours as value of the day ? I'm correct ?

    I'll give the python a shot. Thank you

  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Hi,
    you can just create a day indicator with Date to Numerical and then use Aggregate. 

    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • 1705410G1705410G Member Posts: 43 Contributor I
    edited January 2019
    @ mschmitz 

    Hi! 
    However, I have 3 months worth of data. By using the day indicator with Date to Numerical. My results will be 1-31, 1-31 and 1-31. So in that case, how will I be able to differentiate which month it belongs to and how it will not be merged incorrectly?

    If day indicator with Date to Numerical still can be use then, what are the parameters of Aggregate I should make changes to?


    Regards,
    AY
  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    if you use Days since Epoch in Date to Numerical then you can get the Days since 1970. Attached is an example. It needs the extension 'Operator Toolbox' to run.
    There is an alternative solution with Process Series, which can be more versatile.

    BR,
    Martin
    <?xml version="1.0" encoding="UTF-8"?><process version="9.1.000"><br>  <context><br>    <input/><br>    <output/><br>    <macros/><br>  </context><br>  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process"><br>    <parameter key="logverbosity" value="init"/><br>    <parameter key="random_seed" value="2001"/><br>    <parameter key="send_mail" value="never"/><br>    <parameter key="notification_email" value=""/><br>    <parameter key="process_duration_for_mail" value="30"/><br>    <parameter key="encoding" value="SYSTEM"/><br>    <process expanded="true"><br>      <operator activated="true" class="subprocess" compatibility="9.1.000" expanded="true" height="82" name="Subprocess" width="90" x="45" y="187"><br>        <process expanded="true"><br>          <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.8.000-SNAPSHOT" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="34"><br>            <parameter key="generator_type" value="date_series"/><br>            <parameter key="number_of_examples" value="5000"/><br>            <parameter key="use_stepsize" value="false"/><br>            <list key="function_descriptions"/><br>            <parameter key="add_id_attribute" value="false"/><br>            <list key="numeric_series_configuration"/><br>            <list key="date_series_configuration"><br>              <parameter key="date" value="2018-01-01 00:00:00.2018-04-01 00:00:00"/><br>            </list><br>            <list key="date_series_configuration (interval)"/><br>            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/><br>            <parameter key="column_separator" value=","/><br>            <parameter key="parse_all_as_nominal" value="false"/><br>            <parameter key="decimal_point_character" value="."/><br>            <parameter key="trim_attribute_names" value="true"/><br>          </operator><br>          <operator activated="true" class="generate_attributes" compatibility="9.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34"><br>            <list key="function_descriptions"><br>              <parameter key="value" value="rand()"/><br>            </list><br>            <parameter key="keep_all" value="true"/><br>          </operator><br>          <connect from_op="Create ExampleSet" from_port="output" to_op="Generate Attributes" to_port="example set input"/><br>          <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/><br>          <portSpacing port="source_in 1" spacing="0"/><br>          <portSpacing port="sink_out 1" spacing="0"/><br>          <portSpacing port="sink_out 2" spacing="0"/><br>        </process><br>        <description align="center" color="transparent" colored="false" width="126">Generate example data</description><br>      </operator><br>      <operator activated="true" class="date_to_numerical" compatibility="9.1.000" expanded="true" height="82" name="Date to Numerical" width="90" x="246" y="187"><br>        <parameter key="attribute_name" value="date"/><br>        <parameter key="time_unit" value="day"/><br>        <parameter key="millisecond_relative_to" value="second"/><br>        <parameter key="second_relative_to" value="minute"/><br>        <parameter key="minute_relative_to" value="hour"/><br>        <parameter key="hour_relative_to" value="day"/><br>        <parameter key="day_relative_to" value="epoch"/><br>        <parameter key="week_relative_to" value="year"/><br>        <parameter key="month_relative_to" value="year"/><br>        <parameter key="quarter_relative_to" value="year"/><br>        <parameter key="half_year_relative_to" value="year"/><br>        <parameter key="year_relative_to" value="era"/><br>        <parameter key="keep_old_attribute" value="true"/><br>      </operator><br>      <operator activated="true" class="aggregate" compatibility="9.1.000" expanded="true" height="82" name="Aggregate" width="90" x="380" y="187"><br>        <parameter key="use_default_aggregation" value="false"/><br>        <parameter key="attribute_filter_type" value="all"/><br>        <parameter key="attribute" value=""/><br>        <parameter key="attributes" value=""/><br>        <parameter key="use_except_expression" value="false"/><br>        <parameter key="value_type" value="attribute_value"/><br>        <parameter key="use_value_type_exception" value="false"/><br>        <parameter key="except_value_type" value="time"/><br>        <parameter key="block_type" value="attribute_block"/><br>        <parameter key="use_block_type_exception" value="false"/><br>        <parameter key="except_block_type" value="value_matrix_row_start"/><br>        <parameter key="invert_selection" value="false"/><br>        <parameter key="include_special_attributes" value="false"/><br>        <parameter key="default_aggregation_function" value="average"/><br>        <list key="aggregation_attributes"><br>          <parameter key="value" value="average"/><br>        </list><br>        <parameter key="group_by_attributes" value="date_day"/><br>        <parameter key="count_all_combinations" value="false"/><br>        <parameter key="only_distinct" value="false"/><br>        <parameter key="ignore_missings" value="true"/><br>      </operator><br>      <operator activated="true" class="aggregate" compatibility="9.1.000" expanded="true" height="82" name="Aggregate (2)" width="90" x="514" y="340"><br>        <parameter key="use_default_aggregation" value="false"/><br>        <parameter key="attribute_filter_type" value="all"/><br>        <parameter key="attribute" value=""/><br>        <parameter key="attributes" value=""/><br>        <parameter key="use_except_expression" value="false"/><br>        <parameter key="value_type" value="attribute_value"/><br>        <parameter key="use_value_type_exception" value="false"/><br>        <parameter key="except_value_type" value="time"/><br>        <parameter key="block_type" value="attribute_block"/><br>        <parameter key="use_block_type_exception" value="false"/><br>        <parameter key="except_block_type" value="value_matrix_row_start"/><br>        <parameter key="invert_selection" value="false"/><br>        <parameter key="include_special_attributes" value="false"/><br>        <parameter key="default_aggregation_function" value="average"/><br>        <list key="aggregation_attributes"><br>          <parameter key="date" value="minimum"/><br>        </list><br>        <parameter key="group_by_attributes" value="date_day"/><br>        <parameter key="count_all_combinations" value="false"/><br>        <parameter key="only_distinct" value="false"/><br>        <parameter key="ignore_missings" value="true"/><br>      </operator><br>      <operator activated="true" class="rename" compatibility="9.1.000" expanded="true" height="82" name="Rename" width="90" x="648" y="340"><br>        <parameter key="old_name" value="minimum(date)"/><br>        <parameter key="new_name" value="FirstDateInWindow"/><br>        <list key="rename_additional_attributes"/><br>      </operator><br>      <operator activated="true" class="concurrency:join" compatibility="9.1.000" expanded="true" height="82" name="Join" width="90" x="782" y="187"><br>        <parameter key="remove_double_attributes" value="true"/><br>        <parameter key="join_type" value="inner"/><br>        <parameter key="use_id_attribute_as_key" value="false"/><br>        <list key="key_attributes"><br>          <parameter key="date_day" value="date_day"/><br>        </list><br>        <parameter key="keep_both_join_attributes" value="false"/><br>      </operator><br>      <connect from_op="Subprocess" from_port="out 1" to_op="Date to Numerical" to_port="example set input"/><br>      <connect from_op="Date to Numerical" from_port="example set output" to_op="Aggregate" to_port="example set input"/><br>      <connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="left"/><br>      <connect from_op="Aggregate" from_port="original" to_op="Aggregate (2)" to_port="example set input"/><br>      <connect from_op="Aggregate (2)" from_port="example set output" to_op="Rename" to_port="example set input"/><br>      <connect from_op="Rename" from_port="example set output" to_op="Join" to_port="right"/><br>      <connect from_op="Join" from_port="join" to_port="result 1"/><br>      <portSpacing port="source_input 1" spacing="0"/><br>      <portSpacing port="sink_result 1" spacing="0"/><br>      <portSpacing port="sink_result 2" spacing="0"/><br>      <description align="center" color="yellow" colored="false" height="182" resized="true" width="320" x="471" y="290">Just to have the first day in Window att later. Might be usefule</description><br>    </process><br>  </operator><br></process><br><br>



    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    You can do this natively in RapidMiner with the new Process Windows operator which is now part of the Time Series operators in Studio.  You just have to decide which value from each day you want to keep: min, max, average, etc.  
    Here is a sample process:
    <?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process" origin="GENERATED_TUTORIAL">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.1.000" expanded="true" height="68" name="Retrieve Lake Huron" origin="GENERATED_TUTORIAL" width="90" x="179" y="34">
            <parameter key="repository_entry" value="//Samples/Time Series/data sets/Lake Huron"/>
          </operator>
          <operator activated="true" class="time_series:process_windows" compatibility="9.1.000" expanded="true" height="82" name="Process Windows" origin="GENERATED_TUTORIAL" width="90" x="380" y="34">
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="numeric"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="real"/>
            <parameter key="block_type" value="value_series"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_series_end"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="has_indices" value="true"/>
            <parameter key="indices_attribute" value="Date"/>
            <parameter key="window_size" value="30"/>
            <parameter key="no_overlapping_windows" value="false"/>
            <parameter key="step_size" value="1"/>
            <parameter key="create_horizon_(labels)" value="true"/>
            <parameter key="horizon_attribute" value="Lake surface level / feet"/>
            <parameter key="horizon_size" value="1"/>
            <parameter key="horizon_offset" value="0"/>
            <parameter key="add_last_index_in_window_attribute" value="true"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="time_series:extract_std_descriptive_features" compatibility="9.1.000" expanded="true" height="82" name="Extract Aggregates" origin="GENERATED_TUTORIAL" width="90" x="447" y="34">
                <parameter key="attribute_filter_type" value="all"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="numeric"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="real"/>
                <parameter key="block_type" value="value_series"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_series_end"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
                <parameter key="add_time_series_name" value="false"/>
                <parameter key="sum" value="true"/>
                <parameter key="mean" value="true"/>
                <parameter key="geometric_mean" value="true"/>
                <parameter key="first_quartile" value="true"/>
                <parameter key="median" value="true"/>
                <parameter key="third_quartile" value="true"/>
                <parameter key="min" value="true"/>
                <parameter key="max" value="true"/>
                <parameter key="std_deviation" value="true"/>
                <parameter key="kurtosis" value="true"/>
                <parameter key="skewness" value="true"/>
                <parameter key="ignore_invalid_values" value="false"/>
              </operator>
              <connect from_port="windowed example set" to_op="Extract Aggregates" to_port="example set"/>
              <connect from_op="Extract Aggregates" from_port="features" to_port="output 1"/>
              <portSpacing port="source_windowed example set" spacing="0"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="9.1.000" expanded="true" height="82" name="Append" origin="GENERATED_TUTORIAL" width="90" x="581" y="34">
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
            <parameter key="merge_type" value="all"/>
          </operator>
          <connect from_op="Retrieve Lake Huron" from_port="output" to_op="Process Windows" to_port="example set"/>
          <connect from_op="Process Windows" from_port="output 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    


    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • 1705410G1705410G Member Posts: 43 Contributor I
    lionelderkrikor ,

    Yes to this:
    To better understand, you want downsample your time series ? 

    a value each hour to a value each day ? and take for example the average of the 24 hours as value of the day ? I'm correct ?

    I'll give the python script a shot. Thank you!

    Regards,
    AY

  • 1705410G1705410G Member Posts: 43 Contributor I
    @ mschmitz 

    Hi! What are the parameters of Aggregate I should make changes to? 


    Regards,
    AY
  • 1705410G1705410G Member Posts: 43 Contributor I
    lionelderkrikor ,

    Yes to this:
    To better understand, you want downsample your time series ? 

    a value each hour to a value each day ? and take for example the average of the 24 hours as value of the day ? I'm correct ?

    I'll give the python script a shot. Thank you!

    Regards,
    AY

  • 1705410G1705410G Member Posts: 43 Contributor I
    @ mschmitz 

    Hi! What are the parameters of Aggregate I should make changes to? 


    Regards,
    AY
  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    @1705410G You should really try the sample process I sent that uses the Process Windows operator, it is designed specifically for this type of use case.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
Sign In or Register to comment.