The RapidMiner community is on read-only mode until further notice. Technical support via cases will continue to work as is. For any urgent licensing related requests from Students/Faculty members, please use the Altair academic forum here.

[SOLVED] Data ordering

rosanarosana Member Posts: 13 Contributor II
edited November 2018 in Help
Good afternoon,

I’m a new user of RapidMiner. I would like to build the following process with RapidMiner and the data ordering is being quite complicated for me.
I explain bellow what I would like to do:

1. The final goal of my RapidMiner process is to plot curves and correlation matrix between different attributes over a period of time.

2. I will explain later the source of the data. I would like to obtain an “example set” with the following structure:

[move][/move]Pressure Flow Temperature
dd/mm/yy hh:mm P1 F1 T1
dd/mm/yy hh:mm P2 F2 T2
dd/mm/yy hh:mm P3 F3 T3
…. …. … ….
(table 1)

From this table I would be able to plot curves, correlation matrix, etc.

3. The problem is that I don’t have the data stored in such a simple format.
4. I have the data stored in .txt files. (>1000 rows per file). I can read it from the operator “Read CSV”.
5. But the data are stored as follows:

dd/mm/yy hh:01 Pressure P1
dd/mm/yy hh:02 Flow F1
dd/mm/yy hh:02 Pressure P1
dd/mm/yy hh:03 Temperature T1
dd/mm/yy hh:03 Temperature T1
…. …. ….
(table 2)

So, a very different order to what I would like to have.

6. I need to build an “example set” reordering the data (from table 2 to table 1). How can I do that?

7. Besides that I would like to reduce the final number of examples considering only an example per every 5 minutes instead of per minute. Therearefore I would need to have an example set whose examples are averages of several examples included in another “example set”. Again, I dind’t find any operator to do that.

I hope I explained my problem clearly.

Thank you very much for your help.
Best regards! Rosana

Answers

  • MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Rosana,

    your data is in a so called relational format, and you have to use the Pivot operator to convert it to a row-based format.
    The pivot operator groups the data by a single attribute (in your case it would be the timestamp), and creates one column for each value of the "index" attribute (in your data the second row).
    The Create Transactional Data operator creates data with a similar structure to your data. In the attached process, I pivoted it, so you can adapt the process to your data (the Id attribute corresponds to your timestamp).

    Once you managed to pivot your data, please come back here and we will deal with your second problem (5 minutes averages).

    Best, Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
        <process expanded="true" height="434" width="767">
          <operator activated="true" class="generate_transaction_data" compatibility="5.3.000" expanded="true" height="60" name="Generate Transaction Data" width="90" x="45" y="30">
            <parameter key="number_transactions" value="1000"/>
            <parameter key="number_customers" value="100"/>
            <parameter key="number_items" value="3"/>
          </operator>
          <operator activated="true" class="rename" compatibility="5.3.000" expanded="true" height="76" name="Rename" width="90" x="179" y="30">
            <parameter key="old_name" value="Item"/>
            <parameter key="new_name" value="Sensor"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <operator activated="true" class="rename" compatibility="5.3.000" expanded="true" height="76" name="Rename (2)" width="90" x="313" y="30">
            <parameter key="old_name" value="Amount"/>
            <parameter key="new_name" value="Value"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.3.000" expanded="true" height="76" name="Set Role" width="90" x="447" y="30">
            <parameter key="name" value="Id"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="pivot" compatibility="5.3.000" expanded="true" height="76" name="Pivot" width="90" x="581" y="30">
            <parameter key="group_attribute" value="Id"/>
            <parameter key="index_attribute" value="Sensor"/>
          </operator>
          <connect from_op="Generate Transaction Data" from_port="output" to_op="Rename" to_port="example set input"/>
          <connect from_op="Rename" from_port="example set output" to_op="Rename (2)" to_port="example set input"/>
          <connect from_op="Rename (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Pivot" to_port="example set input"/>
          <connect from_op="Pivot" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • rosanarosana Member Posts: 13 Contributor II
    Hi Marius!

    Thank you very much!!! I managed to pivot the data, so we can start with my second problem (5 minutes average).

    Previously, I think I should also point out that I don't have the value of each attribute for each time step. For example, for the time step 10:00:01 I may have the value of pressure but I don't have the values for "flow" and "temperature". Therefore a "?" is indicated in the new pivoted table. I think this may be important when calculating the average per 5 minutes?.

    Finally, I indicated as "date format" in the "Read CSV" Operator "dd/MM/yy hh:mm:ss" (because the measures are done in a seconds scale)... but only the day is displayed in the table: dd/MM/yy... why?

    Again, thank you very much for your help!!

    Best, Rosana

  • wesselwessel Member Posts: 537 Maven
    I think you should not use the CSV operator for parsing date.
    Instead use nominal to date, and select date_time so you get both dates and time.

    Best regards,

    Wessel
  • wesselwessel Member Posts: 537 Maven
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="390" width="359">
          <operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="128" y="134">
            <list key="attribute_values">
              <parameter key="myDate" value="&quot;23/12/2012 14:00:55&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="nominal_to_date" compatibility="5.2.008" expanded="true" height="76" name="Nominal to Date" width="90" x="239" y="133">
            <parameter key="attribute_name" value="myDate"/>
            <parameter key="date_type" value="date_time"/>
            <parameter key="date_format" value="dd/MM/yyyy hh:mm:ss"/>
            <parameter key="keep_old_attribute" value="true"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Nominal to Date" to_port="example set input"/>
          <connect from_op="Nominal to Date" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • rosanarosana Member Posts: 13 Contributor II
    Thank you Wessel!!
    I have improved my process with your recommendation!
    Best regards, Rosana
  • MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    to get the average of an interval you have  to use the Aggregate operator. The problem however is that before that you have to create an attribute which defines to which interval of 5 minutes a row belongs. We can do that with Generate Attributes and the function date_diff. date_diff delivers the time between two dates in milliseconds. What I did in Generate Attributes was to calculate the time between 01-01-1970 and the current date in milliseconds, and then divide that value by the number of milliseconds per 5 minutes (5*60*1000). If you round that with the floor() function, you get the number of 5-minute-intervals since 1970 for the current date. Just have a look at the formula in the process.

    In the Aggregate operator, you can group by that new attribute, and select one or more aggregation attributes. In your case that would be the average of your sensor values. Additionally you may want to select the minimum of the original date attribute to get the first date of the current 5-minutes-interval.

    Happy Mining!
    ~Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
        <process expanded="true" height="235" width="480">
          <operator activated="true" class="subprocess" compatibility="5.3.000" expanded="true" height="76" name="Generate Data" width="90" x="45" y="30">
            <process expanded="true" height="452" width="300">
              <operator activated="true" class="generate_sales_data" compatibility="5.3.000" expanded="true" height="60" name="Generate Sales Data" width="90" x="45" y="30"/>
              <operator activated="true" class="generate_attributes" compatibility="5.3.000" expanded="true" height="76" name="Generate Attributes" width="90" x="180" y="30">
                <list key="function_descriptions">
                  <parameter key="date" value="date_add(date, rand()*24*60, DATE_UNIT_MINUTE)"/>
                </list>
              </operator>
              <connect from_op="Generate Sales Data" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="5.3.000" expanded="true" height="76" name="Generate Attributes (2)" width="90" x="179" y="30">
            <list key="function_descriptions">
              <parameter key="five_minute_intervals_since_1970" value="floor(date_diff(date_parse_custom(&quot;01-01-1970&quot;, &quot;dd-MM-yyyy&quot;), date) / (5*60*1000))"/>
            </list>
          </operator>
          <operator activated="true" class="aggregate" compatibility="5.3.000" expanded="true" height="76" name="Aggregate" width="90" x="313" y="30">
            <list key="aggregation_attributes">
              <parameter key="amount" value="average"/>
              <parameter key="date" value="minimum"/>
            </list>
            <parameter key="group_by_attributes" value="|five_minute_intervals_since_1970"/>
          </operator>
          <connect from_op="Generate Data" from_port="out 1" to_op="Generate Attributes (2)" to_port="example set input"/>
          <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • rosanarosana Member Posts: 13 Contributor II
    Thank you very much Marius!
    It works!!
    Best, rosana
Sign In or Register to comment.