date_diff in months

christos_karraschristos_karras Member Posts: 32  Guru
I need to calculate the exact number of months between two dates. Using date_diff and dividing by the number of milliseconds in a month won't work because the number of days varies each month. Is there any way this can be done without using Python/R scripting?
Jasmine_

Answers

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,312   Unicorn
    Isn't there a prior methodological question even if you are willing to use Python or R?  Meaning, since "month" is not a standard unit (as you have already pointed out), you have to decide how to treat months of different lengths, and that's an a priori decision.  For instance, the elapsed time from February 1 to April 1st is not the same as the elapsed time from July 1 to Sept 1st, why would you want to represent it with the same number?
    It seems like you would be much better off using a unit that does have a single consistent length, such as weeks or days, rather than months.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
    Jasmine_
  • christos_karraschristos_karras Member Posts: 32  Guru
    date_add can work with DATE_UNIT_MONTH to add months to a date, which can add a variable number of days depending on which date I'm adding months to, based on knowledge of the calendar rules. For example, adding 1 month to 2019-02-01 will return 2019-03-01, and adding 1 month to 2020-02-01 will return 2020-03-01 even though 2020 is a leap year.

    If it's feasible in one direction (add months to a date), there's no reason for it not to be feasible in the other direction (subtract two dates to get the number of months): something like date_diff([FirstDate], [SecondDate], DATE_UNIT_MONTH)

    I need this as part of other date manipulations, and using exact months is the right level for the data I have.
    Jasmine_
  • sgenzersgenzer 12Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,860  Community Manager
    hi @christos_karras so this is a tricky question. IMHO you need to define how you want to do this. Adding one month is easy; finding the number of months between two dates is much harder. Here's why:

    18 Feb 2020 + one month = 18 Mar 2020   ---> EASY
    18 Feb 2020 - one month = 18 Jan 2020   ---> EASY

    # of months between 18 Feb 2020 and 18 Mar 2020   --->  one month I assume?
    # of months between 18 Feb 2020 and 16 Mar 2020   ---> ???
    # of months between 18 Feb 2020 and 1 Mar 2020   ---> ???
    # of months between 18 Feb 2020 and 9 Aug 2019   ---> ???

    Scott

    Jasmine_
  • christos_karraschristos_karras Member Posts: 32  Guru
    Hi @sgenzer,

    You're right, but after re-reading my question I realized there was some ambiguity because I talked about an "exact number of months". What I meant by this was that I wanted to avoid dividing a number of days by the average number of days in a month (30.416667 if we ignore leap years) because I wanted to avoid incorrect approximations for large enough time scales. But after doing the calculation I found it would not matter for time differences below around 125 years.

    For example: 2014-03-01 - 2014-06-30 = 45900 days, and 45900 / 30.416667 = 1509.04 months. But if I use dateutil.relativedelta in Python, the difference is 1508 months and 1 day: relativedelta.relativedelta(pd.to_datetime("2140-03-01"), pd.to_datetime("2014-06-30")) = 125 years, 8 months, 1 day = 1508.03 months.

    We can simplify the specifications by instead having a function that returns a whole number of months (rounded)

    For your examples:
    18 Feb 2020 + one month = 18 Mar 2020   ---> Yes
    18 Feb 2020 - one month = 18 Jan 2020   ---> Yes

    # of months between 18 Feb 2020 and 18 Mar 2020   --->  1 month
    # of months between 18 Feb 2020 and 16 Mar 2020   ---> 27 days = rounded up to 1 month
    # of months between 18 Feb 2020 and 1 Mar 2020   ---> 12 days = rounded down to 0 months
    # of months between 18 Feb 2020 and 9 Aug 2019   ---> 5 months and 20 days, rounded up to 6 months

    Other examples:

    # of months between 26 Feb 2020 and 03 Mar 2020   ---> 6 days = rounded down to 0 months
    # of months between 26 Feb 2020 and 03 Aug 2020   ---> 5 months and 6 days = rounded down to 5 months

    That being said, I think such a feature would be low priority. I hoped to do as much as possible date manipulations using native RapidMiner features, but I think I will end having to do most date manipulations in Python because I have multiple time zone conversions to do (I need to combine data from different servers that use different time zones, running on RapidMiner Studio which is in another time zone too). And the upcoming new Python Operator Framework makes that option more interesting.

    sgenzerJasmine_
  • sgenzersgenzer 12Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,860  Community Manager
    hi @christos_karras ah ok you're looking for integer months. Got it. So I'm certain you can do this with some kind of formula in Generate Attributes. If I can find an hour to ponder this, I will post.

    On the other hand, I agree 100% that this is a perfect candidate for the Python Operator Framework. This is exactly why we think it will be popular. :wink:

    Scott
    christos_karrasJasmine_
  • sgenzersgenzer 12Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,860  Community Manager
    ok I can never leave things like this alone  :)

    <?xml version="1.0" encoding="UTF-8"?><process version="9.6.000-RC">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.6.000-RC" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="-1"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.6.000-RC" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="34">
            <parameter key="generator_type" value="date series"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="true"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)">
              <parameter key="date" value="2020-02-18 00:00:00.1.day"/>
            </list>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="America/New_York"/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.6.000-RC" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34">
            <list key="function_descriptions">
              <parameter key="date2" value="date_add(date,rint(1000*rand()),DATE_UNIT_DAY)"/>
            </list>
            <parameter key="keep_all" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">date2 (random add from date)</description>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.6.000-RC" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="313" y="34">
            <list key="function_descriptions">
              <parameter key="yearMonth" value="100*date_get(date,DATE_UNIT_YEAR)+date_get(date,DATE_UNIT_MONTH)"/>
              <parameter key="yearMonth2" value="100*date_get(date2,DATE_UNIT_YEAR)+date_get(date2,DATE_UNIT_MONTH)"/>
            </list>
            <parameter key="keep_all" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">yearMonth</description>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.6.000-RC" expanded="true" height="82" name="Generate Attributes (5)" width="90" x="447" y="34">
            <list key="function_descriptions">
              <parameter key="day" value="date_get(date,DATE_UNIT_DAY)"/>
              <parameter key="day2" value="date_get(date2,DATE_UNIT_DAY)"/>
            </list>
            <parameter key="keep_all" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">day</description>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.6.000-RC" expanded="true" height="82" name="Generate Attributes (3)" width="90" x="581" y="34">
            <list key="function_descriptions">
              <parameter key="diffMonth" value="yearMonth2-yearMonth"/>
            </list>
            <parameter key="keep_all" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">diffMonth</description>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.6.000-RC" expanded="true" height="82" name="Generate Attributes (6)" width="90" x="715" y="34">
            <list key="function_descriptions">
              <parameter key="diff" value="if(day2&lt;day,diffMonth-1,diffMonth)"/>
            </list>
            <parameter key="keep_all" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">diff</description>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="9.6.000-RC" expanded="true" height="82" name="Select Attributes" width="90" x="849" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value="date|date2|diff"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <connect from_op="Create ExampleSet" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
          <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Generate Attributes (5)" to_port="example set input"/>
          <connect from_op="Generate Attributes (5)" from_port="example set output" to_op="Generate Attributes (3)" to_port="example set input"/>
          <connect from_op="Generate Attributes (3)" from_port="example set output" to_op="Generate Attributes (6)" to_port="example set input"/>
          <connect from_op="Generate Attributes (6)" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    



    Jasmine_
  • sgenzersgenzer 12Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,860  Community Manager
    of course if you want you can take those five Generate Attributes operators and make one big formula with one operator. I just did it this way for illustration purposes.

    Scott
    Jasmine_
  • christos_karraschristos_karras Member Posts: 32  Guru
    Hi @sgenzer,

    I tried your solution but it's not giving me the right results for every cases. Examples:
    • Feb 20 2020 to Dec 29 2020  --> 10 months OK
    • Feb 18 2020 to Oct 9 2022  --> 207 months, should be 32 months
    • Feb 21 2020 to May 25 2021 --> 103 months, should be 15 months
    I made an updated process that includes your solution (RapidMiner_DiffInMonths_v1), an alternative solution I tested using RapidMiner expressions (RapidMiner_DiffInMonths_v2) and a solution using relativedelta in Python.

    RapidMiner_DiffInMonths_v2 (Note: this doesn't handle fractional month, so this is like applying a floor() function to the delta):
    (date_get(date2, DATE_UNIT_YEAR) -
     date_get(date, DATE_UNIT_YEAR)) * 12
    +
     (date_get(date2, DATE_UNIT_MONTH) -
     date_get(date, DATE_UNIT_MONTH))
    

    Python_DiffInMonths: (this implementation does rounding of months but it is not perfect for rounding differences of 14-15-16 days but should be good enough for my needs)
    from dateutil import relativedelta
    import pandas as pd
    
    def rm_main(data):
    	relative_deltas = data.apply(lambda x: relativedelta.relativedelta(pd.to_datetime(x["date2"]), pd.to_datetime(x["date"])), axis = 1)
    	data["Python_RelativeDelta"] = relative_deltas
    	data["Python_DiffInMonths"] = [round(rd.years * 12 + rd.months + rd.days/30.41667) for rd in relative_deltas]
    	
    	return data
    
    Full process XML:
    <?xml version="1.0" encoding="UTF-8"?><process version="9.6.000-BETA">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.6.000-BETA" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="-1"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.6.000-BETA" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="34">
            <parameter key="generator_type" value="date series"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="true"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)">
              <parameter key="date" value="2020-02-18 00:00:00.1.day"/>
            </list>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="America/New_York"/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.6.000-BETA" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34">
            <list key="function_descriptions">
              <parameter key="date2" value="date_add(date,rint(1000*rand()),DATE_UNIT_DAY)"/>
            </list>
            <parameter key="keep_all" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">date2 (random add from date)</description>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.6.000-BETA" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="313" y="34">
            <list key="function_descriptions">
              <parameter key="yearMonth" value="100*date_get(date,DATE_UNIT_YEAR)+date_get(date,DATE_UNIT_MONTH)"/>
              <parameter key="yearMonth2" value="100*date_get(date2,DATE_UNIT_YEAR)+date_get(date2,DATE_UNIT_MONTH)"/>
            </list>
            <parameter key="keep_all" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">yearMonth</description>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.6.000-BETA" expanded="true" height="82" name="Generate Attributes (5)" width="90" x="447" y="34">
            <list key="function_descriptions">
              <parameter key="day" value="date_get(date,DATE_UNIT_DAY)"/>
              <parameter key="day2" value="date_get(date2,DATE_UNIT_DAY)"/>
            </list>
            <parameter key="keep_all" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">day</description>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.6.000-BETA" expanded="true" height="82" name="Generate Attributes (3)" width="90" x="581" y="34">
            <list key="function_descriptions">
              <parameter key="diffMonth" value="yearMonth2-yearMonth"/>
            </list>
            <parameter key="keep_all" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">diffMonth</description>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.6.000-BETA" expanded="true" height="82" name="Generate Attributes (6)" width="90" x="715" y="34">
            <list key="function_descriptions">
              <parameter key="RapidMiner_DiffInMonths_v1" value="if(day2&lt;day,diffMonth-1,diffMonth)"/>
            </list>
            <parameter key="keep_all" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">diff</description>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.6.000-BETA" expanded="true" height="82" name="Generate Attributes (4)" width="90" x="849" y="34">
            <list key="function_descriptions">
              <parameter key="RapidMiner_DiffInMonths_v2" value="(date_get(date2, DATE_UNIT_YEAR) -&#10; date_get(date, DATE_UNIT_YEAR)) * 12&#10;+&#10; (date_get(date2, DATE_UNIT_MONTH) -&#10; date_get(date, DATE_UNIT_MONTH))"/>
            </list>
            <parameter key="keep_all" value="true"/>
          </operator>
          <operator activated="true" class="python_scripting:execute_python" compatibility="9.5.000" expanded="true" height="103" name="Execute Python" width="90" x="983" y="34">
            <parameter key="script" value="from dateutil import relativedelta&#10;import pandas as pd&#10;&#10;def rm_main(data):&#10;&#9;relative_deltas = data.apply(lambda x: relativedelta.relativedelta(pd.to_datetime(x[&quot;date2&quot;]), pd.to_datetime(x[&quot;date&quot;])), axis = 1)&#10;&#9;data[&quot;Python_RelativeDelta&quot;] = relative_deltas&#10;&#9;data[&quot;Python_DiffInMonths&quot;] = [round(rd.years * 12 + rd.months + rd.days/30.41667) for rd in relative_deltas]&#10;&#9;&#10;&#9;return data&#10;"/>
            <parameter key="notebook_cell_tag_filter" value=""/>
            <parameter key="use_default_python" value="true"/>
            <parameter key="package_manager" value="conda (anaconda)"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="9.6.000-BETA" expanded="true" height="82" name="Select Attributes" width="90" x="1117" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value="date|date2|Python_DiffInMonths|RapidMiner_DiffInMonths_v1|RapidMiner_DiffInMonths_v2|Python_RelativeDelta"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <connect from_op="Create ExampleSet" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
          <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Generate Attributes (5)" to_port="example set input"/>
          <connect from_op="Generate Attributes (5)" from_port="example set output" to_op="Generate Attributes (3)" to_port="example set input"/>
          <connect from_op="Generate Attributes (3)" from_port="example set output" to_op="Generate Attributes (6)" to_port="example set input"/>
          <connect from_op="Generate Attributes (6)" from_port="example set output" to_op="Generate Attributes (4)" to_port="example set input"/>
          <connect from_op="Generate Attributes (4)" from_port="example set output" to_op="Execute Python" to_port="input 1"/>
          <connect from_op="Execute Python" from_port="output 1" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    
  • christos_karraschristos_karras Member Posts: 32  Guru
    By the way, which browser is best to use to post on this forum? While I was editing my answer, the page frequently became unresponsive for several minutes each time, so answering was really a struggle. I tried with Chrome and FireFox.

    Thanks
    Jasmine_abevensee
  • sgenzersgenzer 12Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,860  Community Manager
    thank you @christos_karras - I was sure that I was probably missing some logic somewhere. :smile:

    As for forum viewing, it should work in all browsers. Can you pls be more specific about what your setup is?

    Scott
    Jasmine_
Sign In or Register to comment.