"Importing a time series from CSV - lots of funky problems"
I have a moderately large foreign exchange time series .csv file that I am trying to import. I got frustrated trying to do this in RM 5.0 beta, but I decided to give it another try, updated to 5.1 with the time series extension, and used the old repository. This is a EURUSD file with 1-minute data going back about 10 years (over 3.4 million records, 135MB, with 7 fields in the standard format: date, time, open, high, low, close and volume). The date and time were nominal type. I tried to pre-process the data, converting the first two columns to date and time type. RM for some reason decided it needed over 10x the actual file size in RAM, hung, crashed, and so forth. (How are people with big time series, such as hours of 96kHz vibration data, expected to use this? In my admittedly slight experience, R is a lot better with memory and orders of magnitude faster.) So I tried using a "cut last" (100000) as the first step, and it gave me an error about the simple example set format being incompatible with time series (even with a breakpoint on the cut last function), plus a "WARNING: Using deprecated example set stream version 1".
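For anyone hitting the same memory wall, here is a minimal sketch of doing the "cut last" step outside RM, in Python, so that only the final N rows are ever held in memory. The function name `last_rows` and the sample rows are made up for illustration, and it assumes the file has no header row and uses plain comma separators.

```python
import csv
import io
from collections import deque


def last_rows(lines, n):
    """Return the final n parsed CSV rows from an iterable of text lines."""
    # A deque with maxlen drops its oldest entry whenever a new one arrives,
    # so memory use is bounded by n rows rather than by the file size.
    return list(deque(csv.reader(lines), maxlen=n))


# Made-up sample rows in the 7-field layout: date, time, open, high, low, close, volume.
sample = io.StringIO(
    "2010.01.04,01:02,1.4301,1.4305,1.4300,1.4303,12\n"
    "2010.01.04,01:03,1.4303,1.4306,1.4302,1.4304,9\n"
    "2010.01.04,01:04,1.4304,1.4307,1.4303,1.4306,15\n"
)

tail = last_rows(sample, 2)
for row in tail:
    print(row)
```

With a real file, the same function should work on an open file handle, e.g. `with open("EURUSD.csv", newline="") as f: tail = last_rows(f, 100000)` (the filename is a placeholder).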
So I decided to re-import using the wizard, though I couldn't see why or even how the file formats could affect such a simple case.
The .csv import does not seem to work properly. The default separator seems to be the semicolon, which is a strange choice for a comma-delimited format, but that's no big deal. When I try to import the date column as a date with format "yyyy.MM.dd", it turns the dates into date-times, all with 00:00:00 EDT. Only one format can be set, so there is no way to specify the time column's format once the date format is set. If the time format is specified instead (first setting date back to nominal), with HH:mm entered as the date format, the times are also all converted to date-times, in a "Thu Jan 01 HH:mm:ss EST 1970" form. There seems to be no capability to combine the date and time into one temporal index, at least not without using a different program to preprocess the data. After trying out a few type options, all the data disappears from the wizard preview and one must start over. (I can't seem to get it to do it a fourth time. This program is clearly trying to drive me insane. AHA - it happens after trying to reload the data after an error, described below.)
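As a stopgap for the missing combine step, this is a rough sketch of the "different program" preprocessing in Python: it joins the date and time columns into one timestamp per row. The function name `combine_date_time` and the sample rows are invented for illustration; it assumes dates look like 2010.01.04, times like 01:02, and that there is no header row. A file rewritten this way has a single date-time column, which should in principle need only one date format on import, though I have not verified that against the RM wizard.

```python
import csv
import io
from datetime import datetime


def combine_date_time(lines):
    """Yield (timestamp, open, high, low, close, volume) for each CSV row."""
    for date_str, time_str, o, h, l, c, v in csv.reader(lines):
        # "%Y.%m.%d %H:%M" is the Python equivalent of "yyyy.MM.dd HH:mm".
        ts = datetime.strptime(date_str + " " + time_str, "%Y.%m.%d %H:%M")
        yield ts, float(o), float(h), float(l), float(c), int(v)


# Made-up sample rows: date, time, open, high, low, close, volume.
sample = io.StringIO(
    "2010.01.04,01:02,1.4301,1.4305,1.4300,1.4303,12\n"
    "2010.01.05,01:02,1.4410,1.4412,1.4408,1.4409,7\n"
)

rows = list(combine_date_time(sample))

# Write the combined index back out as a single "yyyy-MM-dd HH:mm:ss" column.
out = io.StringIO()
writer = csv.writer(out)
for ts, *values in rows:
    writer.writerow([ts.isoformat(sep=" "), *values])
print(out.getvalue())
```

Note that the two sample rows share the time 01:02 on different days; once date and time are merged, the timestamps are distinct.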
The initial guess is that the date column is polynominal and the time is binominal. With the 100 record preview, that works (though it's certainly strange that a program such as this can't guess date and time formats), but loading more data throws an error - apparently 01:04 occurs more than once in the column, so it just displays all "?" for that column. Now why having the same value twice should be a problem, I don't know, nor why it hit on 01:04 as an example when 01:02 is the first repeat, but the next reload (more than 100 rows) causes all the preview data to disappear. ???
(The forum software has some issues, too - I pushed preview, and my login had timed out. It tried to eat my post, but I logged in again, went back, and there was my text. Then it would preview, but not post, and didn't prompt for a login. Copy all, go back, restart topic from scratch... sigh. Edit: and now I find it did post the first time, but just wouldn't admit it. And I can't delete it despite there being a button supposedly for that purpose. Double sigh.)