Options

"Timestamps from PostgreSQL: Different results in RapidMiner and RapidAnalytics"

LarsHLarsH Member Posts: 6 Contributor II
edited June 2019 in Help
Hi everyone,

I am facing a strange behavior of timestamps (or big numbers) in RapidMiner and RapidAnalytics when reading them from a PostgreSQL Database. I set up a very simple process, which only reads from the DB, sorts the resulting example set and stores it in the RA repository.

image
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.007">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_database" compatibility="5.3.007" expanded="true" height="60" name="Read Database (2)" width="90" x="246" y="30">
        <parameter key="connection" value="mypsqldb"/>
        <parameter key="query" value="select to_char('01.08.2012 01:01:00+00'::timestamptz,'DD.MM.YYYY HH24:MI:SS TZ') as zeit_string,&#10;'01.08.2012 01:01:00+00'::timestamptz as zeit_ts,&#10;extract(epoch from timestamp with time zone '01.08.2012 01:01:00+00') as zeit_epoch&#10;union&#10;select to_char('01.08.2012 01:02:00+00'::timestamptz,'DD.MM.YYYY HH24:MI:SS TZ') as zeit_string,&#10;'01.08.2012 01:02:00+00'::timestamptz as zeit_ts,&#10;extract(epoch from timestamp with time zone '01.08.2012 01:02:00+00') as zeit_epoch&#10;union&#10;select to_char('01.08.2012 01:03:00+00'::timestamptz,'DD.MM.YYYY HH24:MI:SS TZ') as zeit_string,&#10;'01.08.2012 01:03:00+00'::timestamptz as zeit_ts,&#10;extract(epoch from timestamp with time zone '01.08.2012 01:03:00+00') as zeit_epoch;"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="sort" compatibility="5.3.007" expanded="true" height="76" name="Sort" width="90" x="447" y="30">
        <parameter key="attribute_name" value="zeit_string"/>
      </operator>
      <operator activated="true" class="store" compatibility="5.3.007" expanded="true" height="60" name="Store" width="90" x="648" y="30">
        <parameter key="repository_entry" value="../ExampleSets/Test2"/>
      </operator>
      <connect from_op="Read Database (2)" from_port="output" to_op="Sort" to_port="example set input"/>
      <connect from_op="Sort" from_port="example set output" to_op="Store" to_port="input"/>
      <connect from_op="Store" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
When I execute
select 
to_char('01.08.2012 01:01:00+00'::timestamptz,'DD.MM.YYYY HH24:MI:SS TZ') as zeit_string,
'01.08.2012 01:01:00+00'::timestamptz as zeit_ts,
extract(epoch from timestamp with time zone '01.08.2012 01:01:00+00') as zeit_epoch

union

select to_char('01.08.2012 01:02:00+00'::timestamptz,'DD.MM.YYYY HH24:MI:SS TZ') as zeit_string,
'01.08.2012 01:02:00+00'::timestamptz as zeit_ts,
extract(epoch from timestamp with time zone '01.08.2012 01:02:00+00') as zeit_epoch

union

select to_char('01.08.2012 01:03:00+00'::timestamptz,'DD.MM.YYYY HH24:MI:SS TZ') as zeit_string,
'01.08.2012 01:03:00+00'::timestamptz as zeit_ts,
extract(epoch from timestamp with time zone '01.08.2012 01:03:00+00') as zeit_epoch;
directly from RapidMiner, everything seems fine and I get

image

When running the same process on the RA server, I get a different and obviously wrong result

image

I'm using Rapidminer 5.3.007, RapidAnalytics CE 1.3.007 and PostgreSQL 8.4.4. RapidMiner uses the DB connection I've defined in RapidAnalytics.

Does anyone have an idea of what is going on here?

Thanks for help,
LarsH
Tagged:

Answers

  • Options
    Nils_WoehlerNils_Woehler Member Posts: 463 Maven
    Hi,

    do you have the same SQL driver for RapidMiner and RapidAnalytics?

    Best,
    Nils
  • Options
    LarsHLarsH Member Posts: 6 Contributor II
    Hi!

    Yes, I think so. I didn't change the driver that comes with RM and RA (postgresql-9.1-901.jdbc4.jar).

    Over the weekend, I found out that the problem occurs on Linux and Windows. I've tested the following combinations:

    RM: Ubuntu 12.04 -> RA: Ubuntu 12.04
    RM: PC-BSD 9.1 -> RA: Ubuntu 10.04
    RM: Windows 7 -> RA: Windows 7

    All of them show the same behavior as described above.

    Best,
    Lars
  • Options
    LarsHLarsH Member Posts: 6 Contributor II
    I just found out that my problem isn't only related to PostgreSQL.

    I created an example set called 'Test' via "Import CSV" with the following content (automatically identified as 'integer' by RM):
    1343782860
    Then I executed the following process

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.007">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="retrieve" compatibility="5.3.007" expanded="true" height="60" name="Retrieve Test" width="90" x="45" y="30">
           <parameter key="repository_entry" value="../ExampleSets/Test"/>
         </operator>
         <operator activated="true" class="generate_attributes" compatibility="5.3.007" expanded="true" height="76" name="Generate Attributes" width="90" x="179" y="30">
           <list key="function_descriptions">
             <parameter key="att1_ms" value="att1*1000"/>
           </list>
           <parameter key="keep_all" value="false"/>
         </operator>
         <operator activated="true" class="numerical_to_date" compatibility="5.3.007" expanded="true" height="76" name="Numerical to Date" width="90" x="380" y="30">
           <parameter key="attribute_name" value="att1_ms"/>
         </operator>
         <operator activated="true" class="store" compatibility="5.3.007" expanded="true" height="60" name="Store" width="90" x="581" y="30">
           <parameter key="repository_entry" value="../ExampleSets/Test_ts"/>
         </operator>
         <connect from_op="Retrieve Test" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
         <connect from_op="Generate Attributes" from_port="example set output" to_op="Numerical to Date" to_port="example set input"/>
         <connect from_op="Numerical to Date" from_port="example set output" to_op="Store" to_port="input"/>
         <connect from_op="Store" from_port="through" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
    to get an example set that contains the number from 'Test' as date '01.08.2012 03:01:00 MESZ'. Doing this directly in Rapidminer, everything works well. But in Rapidanalytics, I get the same wrong result like in my PostgreSQL example.

    And I found out another interesting detail: When I declare my CSV number 1343782860 to be 'numeric', Rapidminer imports it as 1343782900. Thats '01.08.2012 03:01:40 MESZ'! Perhaps that's the error that happens during the PostgreSQL import? But how can I fix it?
  • Options
    LarsHLarsH Member Posts: 6 Contributor II
    LarsH wrote:

    And I found out another interesting detail: When I declare my CSV number 1343782860 to be 'numeric', Rapidminer imports it as 1343782900. Thats '01.08.2012 03:01:40 MESZ'! Perhaps that's the error that happens during the PostgreSQL import? But how can I fix it?
    I've found another example for that problem. I created the following process

    image
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.007">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="5.3.007" expanded="true" height="60" name="Generate Data" width="90" x="581" y="30">
            <parameter key="number_examples" value="100000"/>
            <parameter key="number_of_attributes" value="1"/>
            <parameter key="attributes_lower_bound" value="1.2345678912E15"/>
            <parameter key="attributes_upper_bound" value="5.6789012345E15"/>
          </operator>
          <operator activated="true" class="store" compatibility="5.3.007" expanded="true" height="60" name="Store" width="90" x="849" y="30">
            <parameter key="repository_entry" value="test_gen_data"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Store" to_port="input"/>
          <connect from_op="Store" from_port="through" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    and ran it with Rapidminer and Rapidanalytics. Lots of 16-digit random numbers.

    With Rapidminer, I get

    image

    but with Rapidanalytics, the last 7 digits are always zero?!

    image

    Of course, the numbers are "random", but is this behavior really correct?
  • Options
    Nils_WoehlerNils_Woehler Member Posts: 463 Maven
    I think this has to do something with the way Postgres treats numbers. If you run it directly from RapidMiner, pure Java is used.
    No number is concatenated when storing the ExampleSet on disk. But when you run it in RapidAnalytics the results are stored in the Postgres SQL database you selected for RapidAnalytics.
    And there must be a weird problem. Currenlty I have no Postgres DB available, so I can't reproduce the error, but could you please have a look at the setup of your database tables?
    Is there something suspicious?

    Best,
    Nils
  • Options
    LarsHLarsH Member Posts: 6 Contributor II
    Thanks, that's a very good hint I think. I've never used the local repository, even if I run my processes locally with RapidMiner, I store the results in the RapidAnalytics repository. I've found out that these example sets are corrupt as well when I open them in RapidMiner again. I've never seen that, because I thought the example set RapidMiner displays in the result perspective after finishing the local process, would have come from the RA repository...

    I've re-executed my 'random numbers' process and found out that RA stores the 16-digit numbers as 'real's, instead of 'double' or 'numeric'.
    rapidanalytics=> select rm_rownum, att_1, att_2 as label from es_75605_data_1 limit 1;
    rm_rownum |    att_1    |  label 
    -----------+-------------+----------
            1 | 4.00508e+15 | 0.863356
    (1 Zeile)
    Even timestamps are stored as 'real'! That's why some digits (and seconds...) get lost!

    Best,
    Lars

  • Options
    Nils_WoehlerNils_Woehler Member Posts: 463 Maven
    Just for clarification:
    If you run a process from the RapidMiner GUI the result you see in the end is produced on the machine you run RapidMiner on.
    It does not matter if the process is stored an RA or in a local Repository. Nevertheless the 'Store' operator will transfer the locally generated data to your RA instance and save it in the selected RA database if the process is stored on RapidAnalytics.

    Could you please have a look what the data types the data tables have? For me it looks like this on MySQL:

    select rm_rownum, att_1, att_2 as label from `rapidanalytics`.`es_305_data_1` limit 1;

    rm_rownum att_1 label
    1 4.005079079564649e15 0.8633563476940593

    SHOW FIELDS FROM `rapidanalytics`.`es_305_data_1`;

    Field Type Null Key Default Extra
    rm_rownum int(11) NO PRI NULL
    att_1 double YES NULL
    att_2 double YES NULL
    As you can see att_1 and the label values are stored as doubles.
    Thus retrieving the values yields exactly the same results as executing the process from RapidMiner locally does.

    Best,
    Nils
  • Options
    LarsHLarsH Member Posts: 6 Contributor II
    For me it looks like this:
    SELECT a.attnum, a.attname, t.typname, a.attlen, a.atttypmod, 
    a.attnotnull, a.atthasdef FROM pg_class c, pg_attribute a, pg_type t
    WHERE a.attnum > 0 and a.attrelid = c.oid and c.relname = 'es_75605_data_1'
    and a.atttypid = t.oid order by a.attnum;
    attnum |  attname  | typname | attlen | atttypmod | attnotnull | atthasdef
    --------+-----------+---------+--------+-----------+------------+-----------
          1 | rm_rownum | int4    |      4 |        -1 | t          | f
          2 | att_1    | float4  |      4 |        -1 | f          | f
          3 | att_2    | float4  |      4 |        -1 | f          | f
    (3 Zeilen)
  • Options
    BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi,

    assuming that your original problem with the timestamps is still current:

    to_char('01.08.2012 01:01:00+00'::timestamptz,'DD.MM.YYYY HH24:MI:SS TZ')

    Don't do this, it is not portable. It depends on the PostgreSQL configuration setting "datestyle" and also on the settings in the client session. That again can be set by the client and maybe the JDBC connector or RapidAnalytics set it up differently.

    You can always check the value by putting "SHOW datestyle;" into a Read Database operator.

    Your example outputs "08.01.2012" instead of "01.08.2012" in my RapidMiner because my PostgreSQL installation has "iso, mdy" configured - the default configuration.

    See the "Date input" topic in the PostgreSQL documentation for safe syntax variants:
    http://www.postgresql.org/docs/9.1/static/datatype-datetime.html
Sign In or Register to comment.