Easy way to calculate metrics of different status

msacs09 · February 2019

Experts,

we have a data set like below. We wanted to gather some stats on how many days different client(id) stay on each status, before closing.. and also what %age of the total client stay in each stage. A visual representation would be phenomenal. I would greatly appreciate if you can share a sample process that would does gather these metrics.

id date status
1 1/1/18 proposal
1 1/8/18 analysis
1 1/20/18 accepted
1 1/30/18 closed
2 2/1/18 proposal
2 2/10/18 analysis
2 2/12/18 accepted
2 2/18/18 closed
...........
............
Many Thanks in Advance for your time

lionelderkrikor · February 2019

Hi @msacs09,

I propose the following process :

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000-RC">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000-RC" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" breakpoints="after" class="utility:create_exampleset" compatibility="9.2.000-RC" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="85">
        <parameter key="generator_type" value="comma separated text"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="input_csv_text" value="id,date,status&#10;1,1/1/18,proposal&#10;1,1/8/18,analysis&#10;1,1/20/18 ,accepted&#10;1,1/30/18,closed&#10;2,2/1/18,proposal&#10;2,2/10/18,analysis&#10;2,2/12/18,accepted&#10;2,2/18/18,closed"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="numerical_to_polynominal" compatibility="9.2.000-RC" expanded="true" height="82" name="Numerical to Polynominal" width="90" x="313" y="85">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="id"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="real"/>
        <parameter key="block_type" value="value_series"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_series_end"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="nominal_to_date" compatibility="9.2.000-RC" expanded="true" height="82" name="Nominal to Date" width="90" x="447" y="85">
        <parameter key="attribute_name" value="date"/>
        <parameter key="date_type" value="date"/>
        <parameter key="date_format" value="MM/dd/yy"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="keep_old_attribute" value="false"/>
      </operator>
      <operator activated="true" class="time_series:lag_series" compatibility="9.2.000-BETA2" expanded="true" height="82" name="Lag" width="90" x="581" y="85">
        <list key="attributes">
          <parameter key="date" value="1"/>
        </list>
        <parameter key="overwrite_attributes" value="false"/>
        <parameter key="extend_exampleset" value="false"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.2.000-RC" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="85">
        <list key="function_descriptions">
          <parameter key="duration" value="date_diff([date-1],date)/(1000*3600*24)"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <operator activated="true" class="aggregate" compatibility="9.2.000-RC" expanded="true" height="82" name="Aggregate" width="90" x="849" y="85">
        <parameter key="use_default_aggregation" value="false"/>
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="default_aggregation_function" value="average"/>
        <list key="aggregation_attributes">
          <parameter key="duration" value="average"/>
        </list>
        <parameter key="group_by_attributes" value="status|id"/>
        <parameter key="count_all_combinations" value="false"/>
        <parameter key="only_distinct" value="false"/>
        <parameter key="ignore_missings" value="true"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Numerical to Polynominal" to_port="example set input"/>
      <connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Nominal to Date" to_port="example set input"/>
      <connect from_op="Nominal to Date" from_port="example set output" to_op="Lag" to_port="example set input"/>
      <connect from_op="Lag" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Then you can go to the Visualizations panel and :
- plot the different durations according to the step for each Id client
- plot for example the average of each step according the step

Image: https://us.v-cdn.net/6030995/uploads/editor/gx/qzfz97ozh52r.png

Warning : It works with the last Beta version of RM 9.2

Regards,

Lionel

sgenzer · February 2019

Nice graphs, @lionelderkrikor ! How are you liking the new visualization tools?

lionelderkrikor · February 2019

Thanks you @sgenzer,

OK let's get ready for a (long ?) feedback...

1. General
My general feeling is that advanced operations are more intuitive . In deed in previous releases of RM, building some charts in Charts
panel was very intuitive, but in Advanced Charts panel , that was sometimes tricky. RM 9.2 fixs that.

2. Possible improvments (from my humble point of view) :
a. the only thing that is is not immediatly obvious is to "change color of datapoints" : It tooks me 10 minutes before understand how it works…For me, this method is good… but it lacks contextual help for this operation (« drag the icons to define color order » for example).

Image: https://us.v-cdn.net/6030995/uploads/editor/pf/s45of4vqvpuh.png

b. It lacks a feature which just (re)auto-scale the chart after setting a « Axis min value » and/or a « Axis max value » in the « Axis » menus.

Image: https://us.v-cdn.net/6030995/uploads/editor/qq/esqtz0ax40xj.png

c. It lacks a "Add new Z-Axis" option with the Scatter 3D chart style (like it exists a "Add new Y-Axis"...)

3. Bug(s) report :
a. With Scatter 3D chart style, there is no effect when we set the Axis color to a new color :

Image: https://us.v-cdn.net/6030995/uploads/editor/be/437dsnd11lhm.png

b. Always with Scatter 3D chart style, Y-Axis graduation "disappears" when we set a new Axis min Value and a new Axis max value.

Image: https://us.v-cdn.net/6030995/uploads/editor/zj/6jcuxcorx4ar.png

c. With Boxplot and Deviation chart styles, Y-axis name does not appear.

Image: https://us.v-cdn.net/6030995/uploads/editor/49/vbey317ijg3d.png

d. Sometimes (random bug), when switching between « Results » panel and « Design » panel,there is a lag of 4 / 5 sec with a black chart screen.

Image: https://us.v-cdn.net/6030995/uploads/editor/yg/vwbysqbtwqpu.png

To conclude, I would say that there is (once again) a great work of RapidMiner's team.

Regards,

Lionel

sgenzer · February 2019

As you know we always appreciate constructive feedback, @lionelderkrikor. THANK YOU for taking the time to write this up. I'm cc'ing @Marco_Boeck so this goes right to the source.

Scott

msacs09 · February 2019

Thank you @lionelderkrikor.

The challenge with the actual data set is not every id will go through all stages.. some will skip stages and the date is not continuous for all Id's like below. Exact depiction of the true state of the data is below. So not sure if a Lag operator with 1 will help us here? how can we dynamically adjust the lag operator based on the id and available stages or perform a Lag with group by id? or can we completely avoid a lag operator here

id date status
1 1/1/18 proposal
1 1/8/18 analysis
1 ? accepted
1 1/30/18 closed
2 2/1/18 proposal
2 2/18/18 closed
3 1/2/18 proposal
3 1/29/18 analysis
3 ? accepted --> missing
3 2/5/18 closed
---
----
Also, I see you have converted to milliseconds "date_diff([date-1],date)/(1000*3600*24)" can you provide me the rational please.

Many Thanks

MartinLiebig · February 2019

Hi,
a bit off topic, but your data looks a lot like process mining data. you might want to have a look at ProM.
BR,
Martin

msacs09 · February 2019

Thank you sir. I'm not totally sure how ProM will help me here? Also, this isn't a log data. so I'm kind of confused :-(

Marco_Boeck · February 2019

Hi @lionelderkrikor

Thanks for your feedback! Let me reply to your points:
2 a) Yep, certainly not the best UX. It works better for color axes with linear gradients, I think it's more natural there but we tried to re-use it for the time being. The entire configuration is not final yet, we have some ideas on how to improve it in the future

2 b) The 'x' is a good idea for those boxes! Until that arrives, you can always just completely remove the entire content from any of the textboxes, and it will go right back to the default.

3 a) Hm indeed, seems like axis lines are not showing for 3D plots, ever.

3 b) Works for me, do you happen to have a data set for which it does not work?

Image: https://us.v-cdn.net/6030995/uploads/editor/47/7pzf6ta06fow.png

3 c) Yep, same as in the old charts. To be honest, I had no idea what to put there, as the attributes are at the bottom already. If you have any good suggestions for Parallel Coordinates, Deviation, and Boxplot, I'd gladly take them.

3 d) Interesting. Never experienced that on our test systems. What operating system are you using exactly, if I may ask?

Regards,
Marco

lionelderkrikor · February 2019

Hi @Marco_Boeck,

First, thanks for your answers ! Let me precise several points :

3 b) Works for me, do you happen to have a data set for which it does not work?
I'm using the Iris Dataset (in the Samples repository), but I forgot to mention that is when we set a new "Axis min Value" and a new "Axis max value" of the X-Axis (and not the Y-Axis).

3 c) Yep, same as in the old charts. To be honest, I had no idea what to put there, as the attributes are at the bottom already. If you have any good suggestions for Parallel Coordinates, Deviation, and Boxplot, I'd gladly take them.
In deed, I forgot to mention that it comes when I used "Group by" option (for example with Boxplots). In this case, the attributes aren't anymore at the bottom...

3 d) Interesting. Never experienced that on our test systems. What operating system are you using exactly, if I may ask?
I'm executing Windows 10.

I'm rigorous, maybe too rigorous...

, but I think that all these elements will improve the user experience.
And more seriously, once again, I have the feeling that there was a lot of work for this new visualization concept !

Regards,

Lionel

msacs09 · February 2019

Folks, sorry apparently my question has been hijacked

rfuentealba · February 2019

@msacs09,

So you have a list of clients and a list of project stages, right? How do you store the change between stages? This should be easy to do with a little SQL (hence with RapidMiner operators). If you do a full join between clients and stages, you will have NULL on the stages your clients didn't participate, but it all depends on the shape of your data. Care to elaborate a bit more and see how can I help you?

Yep, got hijacked.

msacs09 · February 2019

Hi,

I have ended up writing a SQL in the DB. I could do the same in RM, but not sure how to use the below Oracel function in Aggregation operator 

lead(dd ignore nulls) over (partition by id order by dd) - dd as dt_diff. Does RM support "Partion by" ?

sgenzer · February 2019

very sorry about hijacking your question @msacs09 My bad.

Scott

rfuentealba · February 2019

Does RM support "Partion by" ?

dosomething() PARTITION BY is essentially Filter Example Range or Discretize, followed by what you want to do. At least that's how I understand it.

All the best

Rodrigo.

msacs09 · February 2019

Got it Thank you

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Easy way to calculate metrics of different status

Answers