Downsampling of Data (From 1 hour each to 1 day[24hrs] each)

1705410G · January 2019

My data is divided into 3 months of 1 hour each, so there's roughly 2000 over examples.
How can I merge the data into 1 day(24hrs) each instead? There should be about 90+ examples.

I thought about using Generate Attribute, but I don't think that would work.

Thank you!

Regards,
AY

Knut-RM · January 2019

https://academy.rapidminer.com/learn/video/time-series-calculations-with-process-windows

lionelderkrikor · January 2019

Hi @1705410G,

To better understand, you want downsample your time series ?
a value each hour to a value each day ? and take for example the average of the 24 hours as value of the day ?
I'm correct ?

Regards,

Lionel

lionelderkrikor · January 2019

Hi @1705410G,

I did not find a solution with RapidMiner's native operators, so I propose a solution with a (very simple) Python script.
To execute this process, you have :
- to install Python on your computer.
- install the Python Scripting extension from the marketplace.

The process :

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000-SNAPSHOT">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000-SNAPSHOT" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.7.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="85">
        <parameter key="generator_type" value="date_series"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="true"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)">
          <parameter key="date" value="2019-01-01 00:00:00.1.hour"/>
        </list>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Generate ID" width="90" x="179" y="85">
        <parameter key="create_nominal_ids" value="false"/>
        <parameter key="offset" value="0"/>
      </operator>
      <operator activated="true" class="generate_data" compatibility="9.2.000-SNAPSHOT" expanded="true" height="68" name="Generate Data" width="90" x="45" y="187">
        <parameter key="target_function" value="random"/>
        <parameter key="number_examples" value="100"/>
        <parameter key="number_of_attributes" value="2"/>
        <parameter key="attributes_lower_bound" value="-10.0"/>
        <parameter key="attributes_upper_bound" value="10.0"/>
        <parameter key="gaussian_standard_deviation" value="10.0"/>
        <parameter key="largest_radius" value="10.0"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Generate ID (2)" width="90" x="179" y="187">
        <parameter key="create_nominal_ids" value="false"/>
        <parameter key="offset" value="0"/>
      </operator>
      <operator activated="true" breakpoints="after" class="concurrency:join" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Join" width="90" x="313" y="85">
        <parameter key="remove_double_attributes" value="true"/>
        <parameter key="join_type" value="inner"/>
        <parameter key="use_id_attribute_as_key" value="true"/>
        <list key="key_attributes"/>
        <parameter key="keep_both_join_attributes" value="false"/>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="9.1.000" expanded="true" height="103" name="Execute Python" width="90" x="581" y="85">
        <parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10;  data = data.resample('D',on = 'date').mean()&#10;  data = data.reset_index(level = ['date'])&#10;  data['date'] = data['date'].astype(str)&#10;    &#10;  return data"/>
        <parameter key="use_default_python" value="true"/>
        <parameter key="package_manager" value="conda (anaconda)"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Set Role" width="90" x="715" y="85">
        <parameter key="attribute_name" value="date"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Join" to_port="left"/>
      <connect from_op="Generate Data" from_port="output" to_op="Generate ID (2)" to_port="example set input"/>
      <connect from_op="Generate ID (2)" from_port="example set output" to_op="Join" to_port="right"/>
      <connect from_op="Join" from_port="join" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Hope it helps,

Regards,

Lionel

NB : Of course, I am interested to discover a solution 100% RapidMiner for this task...

1705410G · January 2019

@ lionelderkrikor ,

Yes to this:
To better understand, you want downsample your time series ?
a value each hour to a value each day ? and take for example the average of the 24 hours as value of the day ? I'm correct ?

I'll give the python a shot. Thank you

MartinLiebig · January 2019

Hi,
you can just create a day indicator with Date to Numerical and then use Aggregate.

BR,
Martin

1705410G · January 2019

@ mschmitz

Hi!
However, I have 3 months worth of data. By using the day indicator with Date to Numerical. My results will be 1-31, 1-31 and 1-31. So in that case, how will I be able to differentiate which month it belongs to and how it will not be merged incorrectly?

If day indicator with Date to Numerical still can be use then, what are the parameters of Aggregate I should make changes to?

Regards,
AY

MartinLiebig · January 2019

Hi @1705410G ,

if you use Days since Epoch in Date to Numerical then you can get the Days since 1970. Attached is an example. It needs the extension 'Operator Toolbox' to run.

There is an alternative solution with Process Series, which can be more versatile.

BR,

Martin

<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000"><br>  <context><br>    <input/><br>    <output/><br>    <macros/><br>  </context><br>  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process"><br>    <parameter key="logverbosity" value="init"/><br>    <parameter key="random_seed" value="2001"/><br>    <parameter key="send_mail" value="never"/><br>    <parameter key="notification_email" value=""/><br>    <parameter key="process_duration_for_mail" value="30"/><br>    <parameter key="encoding" value="SYSTEM"/><br>    <process expanded="true"><br>      <operator activated="true" class="subprocess" compatibility="9.1.000" expanded="true" height="82" name="Subprocess" width="90" x="45" y="187"><br>        <process expanded="true"><br>          <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.8.000-SNAPSHOT" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="34"><br>            <parameter key="generator_type" value="date_series"/><br>            <parameter key="number_of_examples" value="5000"/><br>            <parameter key="use_stepsize" value="false"/><br>            <list key="function_descriptions"/><br>            <parameter key="add_id_attribute" value="false"/><br>            <list key="numeric_series_configuration"/><br>            <list key="date_series_configuration"><br>              <parameter key="date" value="2018-01-01 00:00:00.2018-04-01 00:00:00"/><br>            </list><br>            <list key="date_series_configuration (interval)"/><br>            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/><br>            <parameter key="column_separator" value=","/><br>            <parameter key="parse_all_as_nominal" value="false"/><br>            <parameter key="decimal_point_character" value="."/><br>            <parameter key="trim_attribute_names" value="true"/><br>          </operator><br>          <operator activated="true" class="generate_attributes" compatibility="9.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34"><br>            <list key="function_descriptions"><br>              <parameter key="value" value="rand()"/><br>            </list><br>            <parameter key="keep_all" value="true"/><br>          </operator><br>          <connect from_op="Create ExampleSet" from_port="output" to_op="Generate Attributes" to_port="example set input"/><br>          <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/><br>          <portSpacing port="source_in 1" spacing="0"/><br>          <portSpacing port="sink_out 1" spacing="0"/><br>          <portSpacing port="sink_out 2" spacing="0"/><br>        </process><br>        <description align="center" color="transparent" colored="false" width="126">Generate example data</description><br>      </operator><br>      <operator activated="true" class="date_to_numerical" compatibility="9.1.000" expanded="true" height="82" name="Date to Numerical" width="90" x="246" y="187"><br>        <parameter key="attribute_name" value="date"/><br>        <parameter key="time_unit" value="day"/><br>        <parameter key="millisecond_relative_to" value="second"/><br>        <parameter key="second_relative_to" value="minute"/><br>        <parameter key="minute_relative_to" value="hour"/><br>        <parameter key="hour_relative_to" value="day"/><br>        <parameter key="day_relative_to" value="epoch"/><br>        <parameter key="week_relative_to" value="year"/><br>        <parameter key="month_relative_to" value="year"/><br>        <parameter key="quarter_relative_to" value="year"/><br>        <parameter key="half_year_relative_to" value="year"/><br>        <parameter key="year_relative_to" value="era"/><br>        <parameter key="keep_old_attribute" value="true"/><br>      </operator><br>      <operator activated="true" class="aggregate" compatibility="9.1.000" expanded="true" height="82" name="Aggregate" width="90" x="380" y="187"><br>        <parameter key="use_default_aggregation" value="false"/><br>        <parameter key="attribute_filter_type" value="all"/><br>        <parameter key="attribute" value=""/><br>        <parameter key="attributes" value=""/><br>        <parameter key="use_except_expression" value="false"/><br>        <parameter key="value_type" value="attribute_value"/><br>        <parameter key="use_value_type_exception" value="false"/><br>        <parameter key="except_value_type" value="time"/><br>        <parameter key="block_type" value="attribute_block"/><br>        <parameter key="use_block_type_exception" value="false"/><br>        <parameter key="except_block_type" value="value_matrix_row_start"/><br>        <parameter key="invert_selection" value="false"/><br>        <parameter key="include_special_attributes" value="false"/><br>        <parameter key="default_aggregation_function" value="average"/><br>        <list key="aggregation_attributes"><br>          <parameter key="value" value="average"/><br>        </list><br>        <parameter key="group_by_attributes" value="date_day"/><br>        <parameter key="count_all_combinations" value="false"/><br>        <parameter key="only_distinct" value="false"/><br>        <parameter key="ignore_missings" value="true"/><br>      </operator><br>      <operator activated="true" class="aggregate" compatibility="9.1.000" expanded="true" height="82" name="Aggregate (2)" width="90" x="514" y="340"><br>        <parameter key="use_default_aggregation" value="false"/><br>        <parameter key="attribute_filter_type" value="all"/><br>        <parameter key="attribute" value=""/><br>        <parameter key="attributes" value=""/><br>        <parameter key="use_except_expression" value="false"/><br>        <parameter key="value_type" value="attribute_value"/><br>        <parameter key="use_value_type_exception" value="false"/><br>        <parameter key="except_value_type" value="time"/><br>        <parameter key="block_type" value="attribute_block"/><br>        <parameter key="use_block_type_exception" value="false"/><br>        <parameter key="except_block_type" value="value_matrix_row_start"/><br>        <parameter key="invert_selection" value="false"/><br>        <parameter key="include_special_attributes" value="false"/><br>        <parameter key="default_aggregation_function" value="average"/><br>        <list key="aggregation_attributes"><br>          <parameter key="date" value="minimum"/><br>        </list><br>        <parameter key="group_by_attributes" value="date_day"/><br>        <parameter key="count_all_combinations" value="false"/><br>        <parameter key="only_distinct" value="false"/><br>        <parameter key="ignore_missings" value="true"/><br>      </operator><br>      <operator activated="true" class="rename" compatibility="9.1.000" expanded="true" height="82" name="Rename" width="90" x="648" y="340"><br>        <parameter key="old_name" value="minimum(date)"/><br>        <parameter key="new_name" value="FirstDateInWindow"/><br>        <list key="rename_additional_attributes"/><br>      </operator><br>      <operator activated="true" class="concurrency:join" compatibility="9.1.000" expanded="true" height="82" name="Join" width="90" x="782" y="187"><br>        <parameter key="remove_double_attributes" value="true"/><br>        <parameter key="join_type" value="inner"/><br>        <parameter key="use_id_attribute_as_key" value="false"/><br>        <list key="key_attributes"><br>          <parameter key="date_day" value="date_day"/><br>        </list><br>        <parameter key="keep_both_join_attributes" value="false"/><br>      </operator><br>      <connect from_op="Subprocess" from_port="out 1" to_op="Date to Numerical" to_port="example set input"/><br>      <connect from_op="Date to Numerical" from_port="example set output" to_op="Aggregate" to_port="example set input"/><br>      <connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="left"/><br>      <connect from_op="Aggregate" from_port="original" to_op="Aggregate (2)" to_port="example set input"/><br>      <connect from_op="Aggregate (2)" from_port="example set output" to_op="Rename" to_port="example set input"/><br>      <connect from_op="Rename" from_port="example set output" to_op="Join" to_port="right"/><br>      <connect from_op="Join" from_port="join" to_port="result 1"/><br>      <portSpacing port="source_input 1" spacing="0"/><br>      <portSpacing port="sink_result 1" spacing="0"/><br>      <portSpacing port="sink_result 2" spacing="0"/><br>      <description align="center" color="yellow" colored="false" height="182" resized="true" width="320" x="471" y="290">Just to have the first day in Window att later. Might be usefule</description><br>    </process><br>  </operator><br></process><br><br>

Telcontar120 · January 2019

You can do this natively in RapidMiner with the new Process Windows operator which is now part of the Time Series operators in Studio. You just have to decide which value from each day you want to keep: min, max, average, etc.
Here is a sample process:

<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process" origin="GENERATED_TUTORIAL">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.1.000" expanded="true" height="68" name="Retrieve Lake Huron" origin="GENERATED_TUTORIAL" width="90" x="179" y="34">
        <parameter key="repository_entry" value="//Samples/Time Series/data sets/Lake Huron"/>
      </operator>
      <operator activated="true" class="time_series:process_windows" compatibility="9.1.000" expanded="true" height="82" name="Process Windows" origin="GENERATED_TUTORIAL" width="90" x="380" y="34">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="real"/>
        <parameter key="block_type" value="value_series"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_series_end"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="has_indices" value="true"/>
        <parameter key="indices_attribute" value="Date"/>
        <parameter key="window_size" value="30"/>
        <parameter key="no_overlapping_windows" value="false"/>
        <parameter key="step_size" value="1"/>
        <parameter key="create_horizon_(labels)" value="true"/>
        <parameter key="horizon_attribute" value="Lake surface level / feet"/>
        <parameter key="horizon_size" value="1"/>
        <parameter key="horizon_offset" value="0"/>
        <parameter key="add_last_index_in_window_attribute" value="true"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="time_series:extract_std_descriptive_features" compatibility="9.1.000" expanded="true" height="82" name="Extract Aggregates" origin="GENERATED_TUTORIAL" width="90" x="447" y="34">
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="numeric"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="real"/>
            <parameter key="block_type" value="value_series"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_series_end"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="add_time_series_name" value="false"/>
            <parameter key="sum" value="true"/>
            <parameter key="mean" value="true"/>
            <parameter key="geometric_mean" value="true"/>
            <parameter key="first_quartile" value="true"/>
            <parameter key="median" value="true"/>
            <parameter key="third_quartile" value="true"/>
            <parameter key="min" value="true"/>
            <parameter key="max" value="true"/>
            <parameter key="std_deviation" value="true"/>
            <parameter key="kurtosis" value="true"/>
            <parameter key="skewness" value="true"/>
            <parameter key="ignore_invalid_values" value="false"/>
          </operator>
          <connect from_port="windowed example set" to_op="Extract Aggregates" to_port="example set"/>
          <connect from_op="Extract Aggregates" from_port="features" to_port="output 1"/>
          <portSpacing port="source_windowed example set" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="append" compatibility="9.1.000" expanded="true" height="82" name="Append" origin="GENERATED_TUTORIAL" width="90" x="581" y="34">
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="merge_type" value="all"/>
      </operator>
      <connect from_op="Retrieve Lake Huron" from_port="output" to_op="Process Windows" to_port="example set"/>
      <connect from_op="Process Windows" from_port="output 1" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

1705410G · January 2019

@ lionelderkrikor ,

Yes to this:
To better understand, you want downsample your time series ?
a value each hour to a value each day ? and take for example the average of the 24 hours as value of the day ? I'm correct ?

I'll give the python script a shot. Thank you!

Regards,
AY

1705410G · January 2019

@ mschmitz

Hi! What are the parameters of Aggregate I should make changes to?

Regards,
AY

1705410G · January 2019

@ lionelderkrikor ,

Yes to this:
To better understand, you want downsample your time series ?
a value each hour to a value each day ? and take for example the average of the 24 hours as value of the day ? I'm correct ?

I'll give the python script a shot. Thank you!

Regards,
AY

1705410G · January 2019

@ mschmitz

Hi! What are the parameters of Aggregate I should make changes to?

Regards,
AY

Telcontar120 · January 2019

@1705410G You should really try the sample process I sent that uses the Process Windows operator, it is designed specifically for this type of use case.

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Downsampling of Data (From 1 hour each to 1 day[24hrs] each)

Best Answer

Answers