GSP operator - min gap

Shamil7 Member Posts: 8 Contributor I
edited November 2018 in Help
I can't tell whether the GSP operator's "min gap" parameter has any effect.
With the data below, changing min gap should influence the generated sequential patterns, but it doesn't!
Min gap should exclude the transactions from patterns if they are close to each other.
The value of min gap doesn't matter: 0, 4, 10.
The result is always the same!
Please help ....
Maybe someone has a working example ...


This is the process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.5.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.5.002" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="6.5.002" expanded="true" height="60" name="Read CSV" width="90" x="45" y="75">
        <parameter key="csv_file" value="\\customer_sample_GSP.csv"/>
        <parameter key="column_separators" value=";"/>
        <parameter key="trim_lines" value="false"/>
        <parameter key="use_quotes" value="true"/>
        <parameter key="quotes_character" value="&quot;"/>
        <parameter key="escape_character" value="\"/>
        <parameter key="skip_comments" value="false"/>
        <parameter key="comment_characters" value="#"/>
        <parameter key="parse_numbers" value="true"/>
        <parameter key="decimal_character" value="."/>
        <parameter key="grouped_digits" value="false"/>
        <parameter key="grouping_character" value=","/>
        <parameter key="date_format" value=""/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="encoding" value="windows-1251"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Customer.true.polynominal.attribute"/>
          <parameter key="1" value="Time.true.integer.attribute"/>
          <parameter key="2" value="Product.true.polynominal.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="true"/>
        <parameter key="datamanagement" value="double_array"/>
      </operator>
      <operator activated="true" class="nominal_to_binominal" compatibility="6.5.002" expanded="true" height="94" name="Nominal2Binominal" width="90" x="246" y="30">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Product"/>
        <parameter key="attributes" value="|a1|a2|a3|a4"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="true"/>
        <parameter key="transform_binominal" value="true"/>
        <parameter key="use_underscore_in_name" value="false"/>
      </operator>
      <operator activated="true" class="generalized_sequential_patterns" compatibility="6.5.002" expanded="true" height="76" name="GSP" width="90" x="380" y="165">
        <parameter key="customer_id" value="Customer"/>
        <parameter key="time_attribute" value="Time"/>
        <parameter key="min_support" value="0.15"/>
        <parameter key="window_size" value="0.0"/>
        <parameter key="max_gap" value="15.0"/>
        <parameter key="min_gap" value="7.0"/>
        <parameter key="positive_value" value="true"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Nominal2Binominal" to_port="example set input"/>
      <connect from_op="Nominal2Binominal" from_port="example set output" to_op="GSP" to_port="example set"/>
      <connect from_op="GSP" from_port="example set" to_port="result 1"/>
      <connect from_op="GSP" from_port="patterns" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

This is the CSV data:

Customer Time Product
Alex 10 bread
Alex 15 butter
Alex 20 caviar
Peter 10 bread
Peter 15 butter
Peter 17 caviar
Peter 20 water
Igor 10 butter
Igor 20 bread
Igor 30 water
Hasan 10 bread
Hasan 20 butter
Hasan 22 caviar
Hasan 50 lemon
Pan 19 butter
Pan 20 bread
Pan 22 caviar
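
To make the expected behaviour concrete, here is a minimal Python sketch (not RapidMiner's implementation) that measures the support of the pattern <bread, butter> on the data above. It assumes a window size of 0 and that consecutive pattern elements must be more than min gap and at most max gap apart in time. The names DATA, supports and support are made up for this illustration.

```python
# Rows from the CSV above: (customer, time, product).
DATA = [
    ("Alex", 10, "bread"), ("Alex", 15, "butter"), ("Alex", 20, "caviar"),
    ("Peter", 10, "bread"), ("Peter", 15, "butter"), ("Peter", 17, "caviar"),
    ("Peter", 20, "water"),
    ("Igor", 10, "butter"), ("Igor", 20, "bread"), ("Igor", 30, "water"),
    ("Hasan", 10, "bread"), ("Hasan", 20, "butter"), ("Hasan", 22, "caviar"),
    ("Hasan", 50, "lemon"),
    ("Pan", 19, "butter"), ("Pan", 20, "bread"), ("Pan", 22, "caviar"),
]


def supports(events, pattern, min_gap, max_gap):
    """True if one customer's (time, item) events contain the pattern in order.

    Assumed rule: each step between consecutive pattern elements must
    satisfy min_gap < gap <= max_gap.
    """
    events = sorted(events)

    def search(index, previous_time):
        if index == len(pattern):
            return True
        for time, item in events:
            if item != pattern[index]:
                continue
            if previous_time is not None:
                gap = time - previous_time
                if not (min_gap < gap <= max_gap):
                    continue
            if search(index + 1, time):
                return True
        return False

    return search(0, None)


def support(data, pattern, min_gap, max_gap):
    """Fraction of customers whose sequence supports the pattern."""
    by_customer = {}
    for customer, time, item in data:
        by_customer.setdefault(customer, []).append((time, item))
    hits = sum(supports(ev, pattern, min_gap, max_gap) for ev in by_customer.values())
    return hits / len(by_customer)


for min_gap in (0, 7):
    print(min_gap, support(DATA, ["bread", "butter"], min_gap, 15))
```

Under these assumptions, <bread, butter> is supported by 3 of 5 customers at min gap 0 (Alex, Peter, Hasan) but only by 1 of 5 at min gap 7 (Hasan), so the operator's output would be expected to change with the parameter.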

Answers

  • Shamil7 Member Posts: 8 Contributor I
    Anybody!! Please help!!!
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,053 RM Data Scientist
    Sorry, I have never worked with this specific option. The only advice I can give you is to have a look at the source code, which is probably not what you want to do.
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • Shamil7 Member Posts: 8 Contributor I
    Thank you for the reply.
    Where can I get the source code of GSP?
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,053 RM Data Scientist
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • Shamil7 Member Posts: 8 Contributor I
    Thank you
  • Shamil7 Member Posts: 8 Contributor I
    I think I have found the cause of the problem. I may be wrong, but min_gap really doesn't seem to be used properly.

  • haddock Member Posts: 849 Guru
    I agree, I can't see what it is used for in the code. Can anyone shed light?

    H
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 574 Unicorn
    It's used in a do while loop in https://github.com/rapidminer/rapidminer-studio/blob/master/src/main/java/com/rapidminer/operator/learner/associations/gsp/DataSequence.java
    That's probably where you need to start looking.
    while (candidateIterator.hasNext()) {
        Transaction currentTransaction = candidateIterator.next();
        TransactionSet currentSet = data.findTransaction(currentTransaction, t, countingInformations);
        if (currentSet != null) {
            double difference = currentSet.getEndTime() - t;
            // matches is empty as indicator for first run! no previous to check
            if (matches.isEmpty() || difference < countingInformations.maxGap && difference > 0) {
                matchesIterator.add(currentSet);
                t = currentSet.getEndTime();
            } else {
                t = currentSet.getEndTime();
                break;
            }
  • Shamil7 Member Posts: 8 Contributor I
    But how is min_gap applied in the GSP operator, then? It has no influence ...
  • haddock Member Posts: 849 Guru
    Hi there JEdward,

    Thanks for your response. It is exactly because I looked at the code that I raised my question. The original question was about min gap. The help says:
    min gap
    This parameter specifies the minimal gap. The min gap parameter causes a customers sequence not to support a pattern, if the transactions containing this pattern are too near in time. Range: real
    As I read it, transactions have to be apart in time by at least this amount, so I agree with Shamil's statement:
    "Min gap should exclude the transactions from patterns if they are close to each other."
    That would mean that setting min gap to a huge number should produce no sequences, but that does not appear to be the case. If I copy Shamil's data into a file and replicate his process as follows, I still get four transactions in the GSPset even with a min gap larger than the largest time value. Here's the XML again:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.5.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.5.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="read_csv" compatibility="6.5.002" expanded="true" height="60" name="Read CSV" width="90" x="45" y="75">
            <parameter key="csv_file" value="/home/cjfpainter/blox.csv"/>
            <parameter key="column_separators" value="\s"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations">
              <parameter key="0" value="Name"/>
            </list>
            <parameter key="encoding" value="UTF-8"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="Customer.true.polynominal.attribute"/>
              <parameter key="1" value="att2.true.attribute_value.attribute"/>
              <parameter key="2" value="att3.true.attribute_value.attribute"/>
              <parameter key="3" value="Time.true.integer.attribute"/>
              <parameter key="4" value="att5.true.attribute_value.attribute"/>
              <parameter key="5" value="att6.true.attribute_value.attribute"/>
              <parameter key="6" value="Product.true.polynominal.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="remove_useless_attributes" compatibility="6.5.002" expanded="true" height="76" name="Remove Useless Attributes" width="90" x="246" y="75"/>
          <operator activated="true" class="nominal_to_binominal" compatibility="6.5.002" expanded="true" height="94" name="Nominal2Binominal" width="90" x="447" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Product"/>
            <parameter key="attributes" value="|a1|a2|a3|a4"/>
            <parameter key="include_special_attributes" value="true"/>
            <parameter key="transform_binominal" value="true"/>
          </operator>
          <operator activated="true" class="generalized_sequential_patterns" compatibility="6.5.002" expanded="true" height="76" name="GSP" width="90" x="380" y="165">
            <parameter key="customer_id" value="Customer"/>
            <parameter key="time_attribute" value="Time"/>
            <parameter key="min_support" value="0.5"/>
            <parameter key="window_size" value="0.0"/>
            <parameter key="max_gap" value="15.0"/>
            <parameter key="min_gap" value="1000.0"/>
            <parameter key="positive_value" value="true"/>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_op="Remove Useless Attributes" to_port="example set input"/>
          <connect from_op="Remove Useless Attributes" from_port="example set output" to_op="Nominal2Binominal" to_port="example set input"/>
          <connect from_op="Nominal2Binominal" from_port="example set output" to_op="GSP" to_port="example set"/>
          <connect from_op="GSP" from_port="example set" to_port="result 1"/>
          <connect from_op="GSP" from_port="patterns" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    The long and short is that it does not do what the help document suggests, so Shamil has a fair question.

    As a point of interest, if you search this forum you'll see that this issue has shown up before, several times. It is probably time for other people to look at the code as well.

  • Shamil7 Member Posts: 8 Contributor I
    I found that max_gap works: as its value increases, more and more transactions take part in forming sequences. But min_gap ...
  • haddock Member Posts: 849 Guru
    Indeed
  • Marco_Boeck Team Lead Software Engineering Administrator, Moderator, Employee, Member, University Professor Posts: 1,963 RM Engineering
    Hi,

    This does indeed look like a bug to me. I have opened a ticket for this.

    Regards,
    Marco
  • haddock Member Posts: 849 Guru
    Cool, thanks to Shamil and Marco!
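
For readers who want to see the difference discussed in this thread in isolation, the following Python sketch compares two gap checks. The first mirrors only the gap part of the condition in the quoted DataSequence.java snippet (the "first run" branch is left out), which consults maxGap but not min gap. The second is a hypothetical variant that also requires the difference to exceed min gap. It is an assumption made for illustration, not RapidMiner's actual fix, and the function names are invented.

```python
def gap_ok_quoted(difference, max_gap):
    """Gap part of the quoted condition: only max_gap is consulted."""
    return 0 < difference < max_gap


def gap_ok_with_min(difference, min_gap, max_gap):
    """Hypothetical variant (assumption): the gap must also exceed min_gap."""
    return max(min_gap, 0) < difference < max_gap


# Distinct time values that occur in the sample data from the first post.
TIMES = [10, 15, 17, 19, 20, 22, 30, 50]
DIFFERENCES = [later - earlier for earlier in TIMES for later in TIMES if later > earlier]


def count_passing(check):
    """Number of positive time differences accepted by the given check."""
    return sum(1 for difference in DIFFERENCES if check(difference))


quoted = count_passing(lambda d: gap_ok_quoted(d, 15))
with_min_7 = count_passing(lambda d: gap_ok_with_min(d, 7, 15))
with_min_1000 = count_passing(lambda d: gap_ok_with_min(d, 1000, 15))
print(quoted, with_min_7, with_min_1000)
```

With the hypothetical check, a min gap of 1000 accepts no time difference at all, which matches the expectation in the thread that a huge min gap should leave no multi-step sequences. The quoted check accepts the same differences whatever min gap is set to.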