Downsampling Time Series with missing and inconsistent records

senorpinattasenorpinatta Member Posts: 1 Newbie
I have a csv file of dates and temperatures.
For example:

dateTime,                       TMP
2018-08-17 06:41:46,    17.4
2018-08-17 06:42:29,    17.5
2018-08-17 06:43:12,    17.5
2018-08-17 06:43:54,    17.7
2018-08-17 06:44:37,    17.7

This data is quite dense so I'd like to down-sample it from seconds to hours and assign the mean of all values in that hour to the new feature so that it looks something like this:

dateTime,                       TMP.hourlyavg
2018-08-17 06:00:00,    16.9
2018-08-17 07:00:00,    17.5
2018-08-17 08:00:00,    17.6
2018-08-17 09:00:00,    18.7
2018-08-17 10:00:00,    19.7

However, I have two main constraints.

I can't use the Process Windows Operator because my data is not consistent enough (i.e. some hours might have had 15 sensor readings whereas other hours might have 12 or 20).
I also don't want to preserve the current date format instead of converting it into numeric/nominal.

So my question is, how can I downsample from seconds to hours and taking the mean value per hour given these constraints:
1. Inconsistent number of records per hour
2. Preserve date format

A secondary yet significant problem is that there are entire hours that are missing due to the sensor being taken offline for maintenance so downsampling might produce something like.

dateTime,                       TMP.hourlyavg
2018-08-17 06:00:00,    17.4
2018-08-17 07:00:00,    17.5
2018-08-17 08:00:00,    NaN
2018-08-17 09:00:00,    NaN
2018-08-17 10:00:00,    19.7

I'm hoping I'll be able to use the Replace Missing Values Operator or Moving Average Operator to replace NaNs that might occur from the downsampling process. Advice or recommendations on this step are appreciated too.

I looked at this thread but the proposed solution is too complicated for me to follow being new to RapidMiner
community(dot)rapidminer(dot)com/discussion/54821/time-series-gaps-for-arima-how-to-fill-them

Best Answer

  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Solution Accepted
    Hi @senorpinatta,

    1. From memory, there is a thread with @mschmitz and @Telcontar120 solutions, but I'm not able to recover it.
    For my part, I propose a solution with a Python script because it is very easy with this language.
    To execute the process, you need to : 
     - install Python on your computer
     - install the Python Scripting extension in RapidMiner (from the Marketplace)

    The output example set looks like that : 



    2. To fill the NAN values, after downsampling your example set , you can use 
    the Replace Missing Values (Series) operator (you can choose the strategy for replacement - previous valid value, average etc.- ) 

    Here the process : 

    <?xml version="1.0" encoding="UTF-8"?><process version="9.4.000-BETA">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.4.000-BETA" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="read_csv" compatibility="9.4.000-BETA" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
            <parameter key="csv_file" value="D:\Lionel\Formations_DataScience\Rapidminer\Tests_Rapidminer\Resample_time_series\replaceanomtmpseries.csv"/>
            <parameter key="column_separators" value=","/>
            <parameter key="trim_lines" value="false"/>
            <parameter key="use_quotes" value="true"/>
            <parameter key="quotes_character" value="&quot;"/>
            <parameter key="escape_character" value="\"/>
            <parameter key="skip_comments" value="true"/>
            <parameter key="comment_characters" value="#"/>
            <parameter key="starting_row" value="1"/>
            <parameter key="parse_numbers" value="true"/>
            <parameter key="decimal_character" value="."/>
            <parameter key="grouped_digits" value="false"/>
            <parameter key="grouping_character" value=","/>
            <parameter key="infinity_representation" value=""/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="first_row_as_names" value="true"/>
            <list key="annotations"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="encoding" value="windows-1252"/>
            <parameter key="read_all_values_as_polynominal" value="false"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="dateTime.true.date_time.attribute"/>
              <parameter key="1" value="TMP.true.real.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="9.4.000-BETA" expanded="true" height="82" name="Set Role" width="90" x="246" y="34">
            <parameter key="attribute_name" value="dateTime"/>
            <parameter key="target_role" value="id"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="python_scripting:execute_python" compatibility="9.3.000" expanded="true" height="103" name="Execute Python" width="90" x="380" y="34">
            <parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10;  data = data.set_index(data['dateTime'])&#10;  data = data.resample('H').mean()&#10;  data.index = pd.to_datetime(data.index)&#10;  data['dateTime'] = data.index&#10;    &#10;&#10;    # connect 2 output ports to see the results&#10;  return data"/>
            <parameter key="notebook_cell_tag_filter" value=""/>
            <parameter key="use_default_python" value="true"/>
            <parameter key="package_manager" value="conda (anaconda)"/>
          </operator>
          <operator activated="true" class="time_series:replace_missing_values" compatibility="9.4.000-BETA" expanded="true" height="68" name="Replace Missing Values (Series)" width="90" x="514" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="TMP"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="has_indices" value="false"/>
            <parameter key="indices_attribute" value=""/>
            <parameter key="overwrite_attributes" value="true"/>
            <parameter key="new_attributes_postfix" value="_cleaned"/>
            <parameter key="replace_type_numerical" value="previous value"/>
            <parameter key="replace_type_nominal" value="previous value"/>
            <parameter key="replace_type_date_time" value="previous value"/>
            <parameter key="replace_value_numerical" value="0.0"/>
            <parameter key="replace_value_nominal" value="unknown"/>
            <parameter key="skip_other_missings" value="true"/>
            <parameter key="replace_infinity" value="true"/>
            <parameter key="replace_empty_strings" value="true"/>
            <parameter key="ensure_finite_values" value="false"/>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Execute Python" to_port="input 1"/>
          <connect from_op="Execute Python" from_port="output 1" to_op="Replace Missing Values (Series)" to_port="example set"/>
          <connect from_op="Replace Missing Values (Series)" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    
    Hope this helps,

    Regards,

    Lionel

     

Answers

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Indeed there is a RapidMiner native solution as well

    One approach (the one that Lionel refers to, I am sorry, I don't have the link for that one handy either) involves creating more data at the highest frequency for your interval. So if you have a maximum of 20 readings per hour, then you create a data series of date/time stamps once every 3 minutes, and then you merge your actual data into that series using Join.  At that point, you would be able to use the Replace Missing Values to fill in the gaps based on whatever logic seems best to you and then the normal Process Windows approach to aggregate all the data into hourly intervals.

    Alternatively you can just directly use Replace Missing Values after first declaring NaN to be missing with Declare Missing Values.

    In neither case is the format of the date/time stamp an issue because that can always be reformatted according to your desires separately.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
Sign In or Register to comment.