"Timestamps from PostgreSQL: Different results in RapidMiner and RapidAnalytics"

LarsH · April 2013

Hi everyone,

I am facing a strange behavior of timestamps (or big numbers) in RapidMiner and RapidAnalytics when reading them from a PostgreSQL Database. I set up a very simple process, which only reads from the DB, sorts the resulting example set and stores it in the RA repository.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.007">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_database" compatibility="5.3.007" expanded="true" height="60" name="Read Database (2)" width="90" x="246" y="30">
        <parameter key="connection" value="mypsqldb"/>
        <parameter key="query" value="select to_char('01.08.2012 01:01:00+00'::timestamptz,'DD.MM.YYYY HH24:MI:SS TZ') as zeit_string,&#10;'01.08.2012 01:01:00+00'::timestamptz as zeit_ts,&#10;extract(epoch from timestamp with time zone '01.08.2012 01:01:00+00') as zeit_epoch&#10;union&#10;select to_char('01.08.2012 01:02:00+00'::timestamptz,'DD.MM.YYYY HH24:MI:SS TZ') as zeit_string,&#10;'01.08.2012 01:02:00+00'::timestamptz as zeit_ts,&#10;extract(epoch from timestamp with time zone '01.08.2012 01:02:00+00') as zeit_epoch&#10;union&#10;select to_char('01.08.2012 01:03:00+00'::timestamptz,'DD.MM.YYYY HH24:MI:SS TZ') as zeit_string,&#10;'01.08.2012 01:03:00+00'::timestamptz as zeit_ts,&#10;extract(epoch from timestamp with time zone '01.08.2012 01:03:00+00') as zeit_epoch;"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="sort" compatibility="5.3.007" expanded="true" height="76" name="Sort" width="90" x="447" y="30">
        <parameter key="attribute_name" value="zeit_string"/>
      </operator>
      <operator activated="true" class="store" compatibility="5.3.007" expanded="true" height="60" name="Store" width="90" x="648" y="30">
        <parameter key="repository_entry" value="../ExampleSets/Test2"/>
      </operator>
      <connect from_op="Read Database (2)" from_port="output" to_op="Sort" to_port="example set input"/>
      <connect from_op="Sort" from_port="example set output" to_op="Store" to_port="input"/>
      <connect from_op="Store" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

When I execute

select 
to_char('01.08.2012 01:01:00+00'::timestamptz,'DD.MM.YYYY HH24:MI:SS TZ') as zeit_string,
'01.08.2012 01:01:00+00'::timestamptz as zeit_ts,
extract(epoch from timestamp with time zone '01.08.2012 01:01:00+00') as zeit_epoch

union

select to_char('01.08.2012 01:02:00+00'::timestamptz,'DD.MM.YYYY HH24:MI:SS TZ') as zeit_string,
'01.08.2012 01:02:00+00'::timestamptz as zeit_ts,
extract(epoch from timestamp with time zone '01.08.2012 01:02:00+00') as zeit_epoch

union

select to_char('01.08.2012 01:03:00+00'::timestamptz,'DD.MM.YYYY HH24:MI:SS TZ') as zeit_string,
'01.08.2012 01:03:00+00'::timestamptz as zeit_ts,
extract(epoch from timestamp with time zone '01.08.2012 01:03:00+00') as zeit_epoch;

directly from RapidMiner, everything seems fine and I get

When running the same process on the RA server, I get a different and obviously wrong result

I'm using Rapidminer 5.3.007, RapidAnalytics CE 1.3.007 and PostgreSQL 8.4.4. RapidMiner uses the DB connection I've defined in RapidAnalytics.

Does anyone have an idea of what is going on here?

Thanks for help,
LarsH

Nils_Woehler · April 2013

Hi,

do you have the same SQL driver for RapidMiner and RapidAnalytics?

Best,
Nils

LarsH · April 2013

Hi!

Yes, I think so. I didn't change the driver that comes with RM and RA (postgresql-9.1-901.jdbc4.jar).

Over the weekend, I found out that the problem occurs on Linux and Windows. I've tested the following combinations:

RM: Ubuntu 12.04 -> RA: Ubuntu 12.04
RM: PC-BSD 9.1 -> RA: Ubuntu 10.04
RM: Windows 7 -> RA: Windows 7

All of them show the same behavior as described above.

Best,
Lars

LarsH · April 2013

I just found out that my problem isn't only related to PostgreSQL.

I created an example set called 'Test' via "Import CSV" with the following content (automatically identified as 'integer' by RM):

1343782860

Then I executed the following process


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.007">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="5.3.007" expanded="true" height="60" name="Retrieve Test" width="90" x="45" y="30">
        <parameter key="repository_entry" value="../ExampleSets/Test"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="5.3.007" expanded="true" height="76" name="Generate Attributes" width="90" x="179" y="30">
        <list key="function_descriptions">
          <parameter key="att1_ms" value="att1*1000"/>
        </list>
        <parameter key="keep_all" value="false"/>
      </operator>
      <operator activated="true" class="numerical_to_date" compatibility="5.3.007" expanded="true" height="76" name="Numerical to Date" width="90" x="380" y="30">
        <parameter key="attribute_name" value="att1_ms"/>
      </operator>
      <operator activated="true" class="store" compatibility="5.3.007" expanded="true" height="60" name="Store" width="90" x="581" y="30">
        <parameter key="repository_entry" value="../ExampleSets/Test_ts"/>
      </operator>
      <connect from_op="Retrieve Test" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Numerical to Date" to_port="example set input"/>
      <connect from_op="Numerical to Date" from_port="example set output" to_op="Store" to_port="input"/>
      <connect from_op="Store" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

to get an example set that contains the number from 'Test' as date '01.08.2012 03:01:00 MESZ'. Doing this directly in Rapidminer, everything works well. But in Rapidanalytics, I get the same wrong result like in my PostgreSQL example.

And I found out another interesting detail: When I declare my CSV number 1343782860 to be 'numeric', Rapidminer imports it as 1343782900. Thats '01.08.2012 03:01:40 MESZ'! Perhaps that's the error that happens during the PostgreSQL import? But how can I fix it?

LarsH · April 2013

LarsH wrote:

And I found out another interesting detail: When I declare my CSV number 1343782860 to be 'numeric', Rapidminer imports it as 1343782900. Thats '01.08.2012 03:01:40 MESZ'! Perhaps that's the error that happens during the PostgreSQL import? But how can I fix it?

I've found another example for that problem. I created the following process

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.007">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data" compatibility="5.3.007" expanded="true" height="60" name="Generate Data" width="90" x="581" y="30">
        <parameter key="number_examples" value="100000"/>
        <parameter key="number_of_attributes" value="1"/>
        <parameter key="attributes_lower_bound" value="1.2345678912E15"/>
        <parameter key="attributes_upper_bound" value="5.6789012345E15"/>
      </operator>
      <operator activated="true" class="store" compatibility="5.3.007" expanded="true" height="60" name="Store" width="90" x="849" y="30">
        <parameter key="repository_entry" value="test_gen_data"/>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Store" to_port="input"/>
      <connect from_op="Store" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

and ran it with Rapidminer and Rapidanalytics. Lots of 16-digit random numbers.

With Rapidminer, I get

but with Rapidanalytics, the last 7 digits are always zero?!

Of course, the numbers are "random", but is this behavior really correct?

Nils_Woehler · April 2013

I think this has to do something with the way Postgres treats numbers. If you run it directly from RapidMiner, pure Java is used.
No number is concatenated when storing the ExampleSet on disk. But when you run it in RapidAnalytics the results are stored in the Postgres SQL database you selected for RapidAnalytics.
And there must be a weird problem. Currenlty I have no Postgres DB available, so I can't reproduce the error, but could you please have a look at the setup of your database tables?
Is there something suspicious?

Best,
Nils

LarsH · April 2013

Thanks, that's a very good hint I think. I've never used the local repository, even if I run my processes locally with RapidMiner, I store the results in the RapidAnalytics repository. I've found out that these example sets are corrupt as well when I open them in RapidMiner again. I've never seen that, because I thought the example set RapidMiner displays in the result perspective after finishing the local process, would have come from the RA repository...

I've re-executed my 'random numbers' process and found out that RA stores the 16-digit numbers as 'real's, instead of 'double' or 'numeric'.

rapidanalytics=> select rm_rownum, att_1, att_2 as label from es_75605_data_1 limit 1;
 rm_rownum |    att_1    |  label   
-----------+-------------+----------
         1 | 4.00508e+15 | 0.863356
(1 Zeile)

Even timestamps are stored as 'real'! That's why some digits (and seconds...) get lost!

Best,
Lars

Nils_Woehler · April 2013

Just for clarification:
If you run a process from the RapidMiner GUI the result you see in the end is produced on the machine you run RapidMiner on.
It does not matter if the process is stored an RA or in a local Repository. Nevertheless the 'Store' operator will transfer the locally generated data to your RA instance and save it in the selected RA database if the process is stored on RapidAnalytics.

Could you please have a look what the data types the data tables have? For me it looks like this on MySQL:


select rm_rownum, att_1, att_2 as label from `rapidanalytics`.`es_305_data_1` limit 1;

rm_rownum	att_1	label
1	4.005079079564649e15	0.8633563476940593


SHOW FIELDS FROM `rapidanalytics`.`es_305_data_1`;

Field	Type	Null	Key	Default	Extra
rm_rownum	int(11)	NO	PRI	NULL	
att_1	double	YES		NULL	
att_2	double	YES		NULL

As you can see att_1 and the label values are stored as doubles.
Thus retrieving the values yields exactly the same results as executing the process from RapidMiner locally does.

Best,
Nils

LarsH · April 2013

For me it looks like this:

SELECT a.attnum, a.attname, t.typname, a.attlen, a.atttypmod, 
a.attnotnull, a.atthasdef FROM pg_class c, pg_attribute a, pg_type t 
WHERE a.attnum > 0 and a.attrelid = c.oid and c.relname = 'es_75605_data_1' 
and a.atttypid = t.oid order by a.attnum;
 attnum |  attname  | typname | attlen | atttypmod | attnotnull | atthasdef 
--------+-----------+---------+--------+-----------+------------+-----------
      1 | rm_rownum | int4    |      4 |        -1 | t          | f
      2 | att_1     | float4  |      4 |        -1 | f          | f
      3 | att_2     | float4  |      4 |        -1 | f          | f
(3 Zeilen)

BalazsBarany · April 2013

Hi,

assuming that your original problem with the timestamps is still current:

to_char('01.08.2012 01:01:00+00'::timestamptz,'DD.MM.YYYY HH24:MI:SS TZ')

Don't do this, it is not portable. It depends on the PostgreSQL configuration setting "datestyle" and also on the settings in the client session. That again can be set by the client and maybe the JDBC connector or RapidAnalytics set it up differently.

You can always check the value by putting "SHOW datestyle;" into a Read Database operator.

Your example outputs "08.01.2012" instead of "01.08.2012" in my RapidMiner because my PostgreSQL installation has "iso, mdy" configured - the default configuration.

See the "Date input" topic in the PostgreSQL documentation for safe syntax variants:
http://www.postgresql.org/docs/9.1/static/datatype-datetime.html

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

"Timestamps from PostgreSQL: Different results in RapidMiner and RapidAnalytics"

Answers