Options

Joining tables based on closest date value

Member Posts: 27 Contributor I
edited December 2018 in Help

Hello everyone,

I possess tow different tables that i want to join, the problem is that there no IDs from both of them to equal and get the join, the only way is to join them based on the closest value from two date columns.

The fisrt table contains this date column 1 :

`01/01/2016 00:0001/01/2016 00:1001/01/2016 00:2001/01/2016 00:3001/01/2016 00:4001/01/2016 00:5001/01/2016 01:0001/01/2016 01:1001/01/2016 01:2001/01/2016 01:3001/01/2016 01:4001/01/2016 01:5001/01/2016 02:0001/01/2016 02:1001/01/2016 02:2001/01/2016 02:3001/01/2016 02:4001/01/2016 02:5001/01/2016 03:0001/01/2016 03:1001/01/2016 03:2001/01/2016 03:3001/01/2016 03:4001/01/2016 03:5001/01/2016 04:0001/01/2016 04:1001/01/2016 04:2001/01/2016 04:3001/01/2016 04:4001/01/2016 04:5001/01/2016 05:0001/01/2016 05:1001/01/2016 05:2001/01/2016 05:30ETC`

And in the other table the column corresponds to :

`01/01/2016 05:0701/01/2016 07:1001/01/2016 08:1901/01/2016 08:2701/01/2016 09:1801/01/2016 10:1301/01/2016 10:2301/01/2016 10:3001/01/2016 10:5701/01/2016 12:2001/01/2016 14:5001/01/2016 14:5401/01/2016 15:0001/01/2016 15:2001/01/2016 16:1201/01/2016 18:2601/01/2016 19:0801/01/2016 20:0001/01/2016 21:1501/01/2016 21:2001/01/2016 22:1001/01/2016 22:1301/01/2016 22:18ETC`

Best regards

• Options
Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

Hi!

If your tables are not too large, you could do a Cartesian join (everything with everything), calculate the difference, take the absolute value (so -7 = 7), group by one of the timestamps, calculate the minimum per group, and join on (timestamp, minimum difference) to only keep this example.

If your tables are larger, you might want to do this in batches, e. g. only select data from an hour (+/- 10 minutes) for one round.

Regards,

Balázs

• Options
Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

Another approach would be to convert the second series to numerics, and then do some mod arithmetic to round to the nearest 10 minutes, and then turn them back into date/times to match the first series.  That should enable you to do the join pretty smoothly.

Brian T.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
• Options
Member Posts: 27 Contributor I

and

If only you can detail your solutions a bit more !!! I've tried Telcontar's suggestion but couldn't get it right !

Best regards

• Options
Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

Here's a simple process doing what I described, it will take a date/time stamp and round it to the nearest 10 minute interval. That can then be used to join to your first dataset.  The first subprocess just creates the data, and the date transformation could actually all be done inside a single Generate Attributes operator, but I split it out using the Date to Numericals so you could better see what was happening.

`<?xml version="1.0" encoding="UTF-8"?><process version="9.0.002">  <context>    <input/>    <output/>    <macros/>  </context>  <operator activated="true" class="process" compatibility="9.0.002" expanded="true" name="Process">    <process expanded="true">      <operator activated="true" class="subprocess" compatibility="9.0.002" expanded="true" height="82" name="Subprocess" width="90" x="45" y="34">        <process expanded="true">          <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.5.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="34">            <parameter key="generator_type" value="comma_separated_text"/>            <list key="function_descriptions"/>            <list key="numeric_series_configuration"/>            <list key="date_series_configuration"/>            <list key="date_series_configuration (interval)"/>            <parameter key="input_csv_text" value="interval, actual&#10;01/01/2016 05:00, 01/01/2016 05:03&#10;01/01/2016 07:10, 01/01/2016 07:11&#10;01/01/2016 08:20, 01/01/2016 08:19&#10;01/01/2016 08:30, 01/01/2016 08:27&#10;01/01/2016 09:40, 01/01/2016 09:38&#10;01/01/2016 10:50, 01/01/2016 10:53&#10;01/02/2016 00:00, 01/01/2016 23:58"/>          </operator>          <operator activated="true" class="nominal_to_date" compatibility="9.0.002" expanded="true" height="82" name="Nominal to Date" width="90" x="179" y="34">            <parameter key="attribute_name" value="interval"/>            <parameter key="date_type" value="date_time"/>            <parameter key="date_format" value="MM/dd/yyyy hh:mm"/>          </operator>          <operator activated="true" class="nominal_to_date" compatibility="9.0.002" expanded="true" height="82" name="Nominal to Date (2)" width="90" x="313" y="34">            <parameter key="attribute_name" value="actual"/>            <parameter key="date_type" value="date_time"/>            <parameter key="date_format" value="MM/dd/yyyy hh:mm"/>          </operator>          <connect from_op="Create ExampleSet" from_port="output" to_op="Nominal to Date" to_port="example set input"/>          <connect from_op="Nominal to Date" from_port="example set output" to_op="Nominal to Date (2)" to_port="example set input"/>          <connect from_op="Nominal to Date (2)" from_port="example set output" to_port="out 1"/>          <portSpacing port="source_in 1" spacing="0"/>          <portSpacing port="sink_out 1" spacing="0"/>          <portSpacing port="sink_out 2" spacing="0"/>        </process>      </operator>      <operator activated="true" class="date_to_numerical" compatibility="9.0.002" expanded="true" height="82" name="Date to Numerical" width="90" x="179" y="34">        <parameter key="attribute_name" value="actual"/>        <parameter key="time_unit" value="minute"/>        <parameter key="keep_old_attribute" value="true"/>      </operator>      <operator activated="true" class="date_to_numerical" compatibility="9.0.002" expanded="true" height="82" name="Date to Numerical (3)" width="90" x="313" y="34">        <parameter key="attribute_name" value="actual"/>        <parameter key="time_unit" value="hour"/>        <parameter key="keep_old_attribute" value="true"/>      </operator>      <operator activated="true" class="generate_attributes" compatibility="9.0.002" expanded="true" height="82" name="Generate Attributes" width="90" x="447" y="34">        <list key="function_descriptions">          <parameter key="actual_minute_indicator" value="(actual_minute&gt;=55) &amp;&amp; (actual_hour==23)"/>          <parameter key="reference_date" value="date_millis(actual)"/>          <parameter key="reference_rounded" value="round(reference_date/600000)*600000"/>          <parameter key="new_date" value="if(actual_minute_indicator==1,reference_rounded+86400000,reference_rounded)"/>        </list>      </operator>      <operator activated="true" class="numerical_to_date" compatibility="9.0.002" expanded="true" height="82" name="Numerical to Date" width="90" x="581" y="34">        <parameter key="attribute_name" value="new_date"/>        <parameter key="keep_old_attribute" value="true"/>      </operator>      <connect from_op="Subprocess" from_port="out 1" to_op="Date to Numerical" to_port="example set input"/>      <connect from_op="Date to Numerical" from_port="example set output" to_op="Date to Numerical (3)" to_port="example set input"/>      <connect from_op="Date to Numerical (3)" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>      <connect from_op="Generate Attributes" from_port="example set output" to_op="Numerical to Date" to_port="example set input"/>      <connect from_op="Numerical to Date" from_port="example set output" to_port="result 1"/>      <portSpacing port="source_input 1" spacing="0"/>      <portSpacing port="sink_result 1" spacing="0"/>      <portSpacing port="sink_result 2" spacing="0"/>    </process>  </operator></process>`
Brian T.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts