"Suggested way to cluster time-series"
I have this problem at hand: a flat file with three columns (id, date, efficiency), actually measuring the fuel consumption of engines, on which I need to perform clustering of the IDs to discover if groups of engine types or engines operated by the same person follow similar patterns. I am facing three major obstacles and I hope someone can help me:
1. The CSV file is big (around 20mil rows). Importing it with the date field as polynomial takes about 20sec but using date as the type hangs the application (without running out of memory) after 4-5min. I believe using a database might solve that, it is strange though. I need the date to be of "date" type so that I can calculate correlation (sample dates are different for each id)
2. I need to split the 20mil rows in order to create a time series for each id. The unique IDs are around 5000, making it impossible to do by hand.
3. I need, for the time series of a given ID, to calculate the correlation coefficients with all other IDs,
To do step 3, I will first perform some kind of temporal abstraction to convert "date" to "days after service" and "efficiency" to "% of efficiency in day 0".
Then, if needed, I can use an interpolation algorithm (which I have, and is quite accurate) to create data for the same days on two given series (say ID1 has valuse for days 1,2,5,7,9,20 and ID2 has days 2,5,8,20 I will create fake values for ID2 days 1,7,9 and drop day 8).
I can do that for any two IDs (provided I have completed step 2 and have separate datasets). But how can I do it in batch? I could use 5000 "day" attributes but how would I construct the dataset from the initial flat file? And still, would that be the optimal solution?