Easy way to calculate metrics of different status

msacs09msacs09 Member Posts: 55 Contributor II
Experts,

we have a data set like below. We wanted to gather some stats on how many days different client(id) stay on each status, before closing.. and also what %age of the total client stay in each stage. A  visual representation would be phenomenal. I would greatly appreciate if you can share a sample process that would does gather these metrics.

id   date        status
1    1/1/18      proposal
1    1/8/18      analysis
1     1/20/18    accepted
1      1/30/18    closed
2      2/1/18      proposal
2      2/10/18     analysis
2      2/12/18      accepted
2     2/18/18       closed
...........
............
Many Thanks in Advance for your time

Answers

  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @msacs09,

    I propose the following process : 
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000-RC">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.000-RC" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" breakpoints="after" class="utility:create_exampleset" compatibility="9.2.000-RC" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="85">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="id,date,status&#10;1,1/1/18,proposal&#10;1,1/8/18,analysis&#10;1,1/20/18 ,accepted&#10;1,1/30/18,closed&#10;2,2/1/18,proposal&#10;2,2/10/18,analysis&#10;2,2/12/18,accepted&#10;2,2/18/18,closed"/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="numerical_to_polynominal" compatibility="9.2.000-RC" expanded="true" height="82" name="Numerical to Polynominal" width="90" x="313" y="85">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="id"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="numeric"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="real"/>
            <parameter key="block_type" value="value_series"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_series_end"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="nominal_to_date" compatibility="9.2.000-RC" expanded="true" height="82" name="Nominal to Date" width="90" x="447" y="85">
            <parameter key="attribute_name" value="date"/>
            <parameter key="date_type" value="date"/>
            <parameter key="date_format" value="MM/dd/yy"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="keep_old_attribute" value="false"/>
          </operator>
          <operator activated="true" class="time_series:lag_series" compatibility="9.2.000-BETA2" expanded="true" height="82" name="Lag" width="90" x="581" y="85">
            <list key="attributes">
              <parameter key="date" value="1"/>
            </list>
            <parameter key="overwrite_attributes" value="false"/>
            <parameter key="extend_exampleset" value="false"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.2.000-RC" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="85">
            <list key="function_descriptions">
              <parameter key="duration" value="date_diff([date-1],date)/(1000*3600*24)"/>
            </list>
            <parameter key="keep_all" value="true"/>
          </operator>
          <operator activated="true" class="aggregate" compatibility="9.2.000-RC" expanded="true" height="82" name="Aggregate" width="90" x="849" y="85">
            <parameter key="use_default_aggregation" value="false"/>
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="default_aggregation_function" value="average"/>
            <list key="aggregation_attributes">
              <parameter key="duration" value="average"/>
            </list>
            <parameter key="group_by_attributes" value="status|id"/>
            <parameter key="count_all_combinations" value="false"/>
            <parameter key="only_distinct" value="false"/>
            <parameter key="ignore_missings" value="true"/>
          </operator>
          <connect from_op="Create ExampleSet" from_port="output" to_op="Numerical to Polynominal" to_port="example set input"/>
          <connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Nominal to Date" to_port="example set input"/>
          <connect from_op="Nominal to Date" from_port="example set output" to_op="Lag" to_port="example set input"/>
          <connect from_op="Lag" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    
    Then you can go to the Visualizations panel and  : 
     - plot the different durations according to the step for each Id client
     - plot for example the average of each step according the step 



    Warning : It works with the last Beta version of RM 9.2

    Regards,

    Lionel
  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Nice graphs, @lionelderkrikor ! How are you liking the new visualization tools?
  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Thanks you @sgenzer,

    OK let's get ready for a (long ?) feedback...

    1. General
    My general feeling is that advanced operations are more intuitive . In deed in previous releases of RM, building some charts in Charts
    panel was very intuitive, but in Advanced Charts panel , that was sometimes tricky. RM 9.2 fixs that.

    2. Possible improvments (from my humble point of view) : 
    a. the only thing that is is not immediatly obvious is to "change color of datapoints" : It tooks me 10 minutes before understand how it works…For me, this method  is good… but it lacks contextual help for this operation (« drag the icons to define color order » for example).


    b. It lacks  a feature which just (re)auto-scale the chart after setting a « Axis min value » and/or a « Axis max value » in the « Axis » menus.

    c. It lacks a "Add new Z-Axis" option with the Scatter 3D chart style (like it exists a "Add new Y-Axis"...)

    3. Bug(s) report : 
    a. With Scatter 3D chart style, there is no effect when we set the Axis color to a new color  : 


    b. Always with Scatter 3D chart style, Y-Axis graduation "disappears" when we set a new Axis min Value and a new Axis max value.


    c. With Boxplot and Deviation chart styles, Y-axis name does not appear. 



    d. Sometimes (random bug), when switching between « Results » panel and « Design » panel,there is a lag of  4 / 5 sec with a black chart screen.
     
    To conclude, I would say that there is (once again) a great work of RapidMiner's team. :)

    Regards,

    Lionel

  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    As you know we always appreciate constructive feedback, @lionelderkrikor. THANK YOU for taking the time to write this up. I'm cc'ing @Marco_Boeck so this goes right to the source. :wink:

    Scott
  • msacs09msacs09 Member Posts: 55 Contributor II
    edited February 2019
    Thank you @lionelderkrikor.

    The challenge with the actual data set is not every id will go through all stages.. some will skip stages and the date is not continuous for all Id's like below. Exact depiction of the true state of the data is below. So not sure if a Lag operator with 1 will help us here? how can we dynamically adjust the lag operator based on the id and available stages or perform a Lag with group by id? or can we completely avoid a lag operator here 

    id   date        status
    1    1/1/18      proposal
    1    1/8/18      analysis
    1     ?            accepted
    1      1/30/18    closed
    2      2/1/18      proposal
    2     2/18/18       closed
    3      1/2/18    proposal
    3      1/29/18      analysis
    3     ?              accepted --> missing 
    3     2/5/18       closed
    ---
    ----
    Also, I see you have converted to milliseconds "date_diff([date-1],date)/(1000*3600*24)" can you provide me the rational please.

    Many Thanks
  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,
    a bit off topic, but your data looks a lot like process mining data. you might want to have a look at ProM.
    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • msacs09msacs09 Member Posts: 55 Contributor II
    Thank you sir. I'm not totally sure how ProM will help me here? Also, this isn't a log data. so I'm kind of confused :-( 
  • Marco_BoeckMarco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    edited February 2019
    Hi @lionelderkrikor

    Thanks for your feedback! Let me reply to your points:
    2 a) Yep, certainly not the best UX. It works better for color axes with linear gradients, I think it's more natural there but we tried to re-use it for the time being. The entire configuration is not final yet, we have some ideas on how to improve it in the future :)

    2 b) The 'x' is a good idea for those boxes! Until that arrives, you can always just completely remove the entire content from any of the textboxes, and it will go right back to the default.

    3 a) Hm indeed, seems like axis lines are not showing for 3D plots, ever.

    3 b) Works for me, do you happen to have a data set for which it does not work?


    3 c) Yep, same as in the old charts. To be honest, I had no idea what to put there, as the attributes are at the bottom already. If you have any good suggestions for Parallel Coordinates, Deviation, and Boxplot, I'd gladly take them.

    3 d) Interesting. Never experienced that on our test systems. What operating system are you using exactly, if I may ask?

    Regards,
    Marco
  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @Marco_Boeck,

    First, thanks for your answers ! Let me precise several points : 

    3 b) Works for me, do you happen to have a data set for which it does not work?
    I'm using the Iris Dataset (in the Samples repository), but I forgot to mention that is when we set a new "Axis min Value" and a new "Axis max value" of the X-Axis (and not the Y-Axis).

    3 c) Yep, same as in the old charts. To be honest, I had no idea what to put there, as the attributes are at the bottom already. If you have any good suggestions for Parallel Coordinates, Deviation, and Boxplot, I'd gladly take them.
    In deed, I forgot to mention that it comes when I used "Group by" option (for example with Boxplots). In this case, the attributes aren't anymore at the bottom...


    3 d) Interesting. Never experienced that on our test systems. What operating system are you using exactly, if I may ask?
    I'm executing Windows 10.

    I'm rigorous, maybe too rigorous... :) , but I think that all these elements will improve the user experience.
    And more seriously, once again, I have the feeling that there was a lot of work for this new visualization concept !  :)

    Regards,

    Lionel


  • msacs09msacs09 Member Posts: 55 Contributor II
    Folks, sorry apparently my question has been hijacked  :D:cold_sweat:
  • rfuentealbarfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn

    So you have a list of clients and a list of project stages, right? How do you store the change between stages? This should be easy to do with a little SQL (hence with RapidMiner operators). If you do a full join between clients and stages, you will have NULL on the stages your clients didn't participate, but it all depends on the shape of your data. Care to elaborate a bit more and see how can I help you?

    Yep, got hijacked. :(
  • msacs09msacs09 Member Posts: 55 Contributor II
    Hi,
    
    I have ended up writing a SQL in the DB. I could do the same in RM, but not sure how to use the below Oracel function in Aggregation operator 
    
    lead(dd ignore nulls) over (partition by id order by dd) - dd as dt_diff. Does RM support "Partion by" ?
  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    very sorry about hijacking your question @msacs09 My bad. :smiley:

    Scott
  • rfuentealbarfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Does RM support "Partion by" ?
    dosomething() PARTITION BY is essentially Filter Example Range or Discretize, followed by what you want to do. At least that's how I understand it.

    All the best

    Rodrigo.
  • msacs09msacs09 Member Posts: 55 Contributor II
    Got it Thank you
Sign In or Register to comment.