04-10-2017 09:29 AM
I have a .csv file with 100,000 rows and 439 columns. This spreadsheet represents the customers' habits for using a specific service. Each row holds a customer's ID followed by the dates of all of that customer's transactions, in the following format: 1 for Monday, 2 for Tuesday, etc. I need to predict the date of the next transaction for every customer, using these past records.
Here's an example for the format of the database:
customer_id transaction1 transaction2 ... transaction438
1 1 2 3 4 5 6 7 ... 745 746 747
2 2 7 16 20 21 23 28 ... 412
3 1 2 3 4 5 6 7 ... 285 322
4 5 7 8 12 14 19 21 ... 924 925 926
Any ideas which model I should use for this prediction to get the best accuracy?
NOTE: The database has lots of missing values, depending on how frequently each customer orders.
04-10-2017 11:10 AM
This looks like some sort of sales projection analysis. I would look at the process I shared here: http://community.rapidminer.com/t5/RapidMiner-Studio/How-to-get-forecast-values-of-future-from-time-...
You would need to fill in the missing values using the Replace Missing Values operator, and you would need to install the Series extension from our marketplace. Is there seasonality involved?
04-10-2017 11:56 AM
It is a homework assignment at the university; we are learning the basics of RapidMiner. We did similar exercises earlier, but those had a label column in the training data. This time I have no clue how to predict the outcome without that special column. I thought about some sort of pattern analysis, or converting the values to a range from 1 to 7 to simplify the problem, but I couldn't get from there to a real solution.
I think seasonality doesn't matter, because it's just an example.
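One way to get a label column where none exists is to hold out each customer's last known transaction as the label and build the features from the earlier ones. Below is a minimal Python sketch of that idea, outside RapidMiner. It assumes the values are running day numbers in which day 1 is a Monday (the sample rows go well past 7, which suggests this), and that missing cells arrive as None. The function names and the chosen features are hypothetical, for illustration only.

```python
from statistics import median


def to_weekday(day_number):
    """Map a running day number to 1..7, assuming day 1 is a Monday."""
    return (day_number - 1) % 7 + 1


def make_training_example(transactions):
    """Split one customer's sorted day numbers into (features, label).

    The last known transaction is held out as the label; the gaps
    between the earlier transactions become the features.
    """
    days = [d for d in transactions if d is not None]  # drop missing cells
    if len(days) < 3:
        return None  # not enough history for both a gap and a label
    history, label = days[:-1], days[-1]
    gaps = [b - a for a, b in zip(history, history[1:])]
    features = {
        "last_day": history[-1],
        "last_weekday": to_weekday(history[-1]),
        "median_gap": median(gaps),
        "n_transactions": len(history),
    }
    return features, label


def baseline_predict(features):
    """Naive baseline: last seen day plus the customer's median gap."""
    return features["last_day"] + features["median_gap"]


# Customer 2 from the example, truncated, with trailing missing cells.
row = [2, 7, 16, 20, 21, 23, 28, None, None]
features, label = make_training_example(row)
prediction = baseline_predict(features)
```

For this row the held-out label is day 28 and the baseline predicts day 27. Any model trained on these features should at least beat that baseline; once it does, the same features can be recomputed from the full history to predict the genuinely unknown next transaction.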
04-10-2017 01:16 PM
If it's sales, you could sum up the values into a Total Sales per month or week. You can then use the dates as your ID and the Total Sales as your Label.
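As a rough sketch of that aggregation in Python (not a RapidMiner process): the data has no sales amounts, so this counts transactions per week instead, which is an assumption about what "Total Sales" would mean here. It also assumes the values are running day numbers starting at day 1 and that missing cells are None; weekly_totals is a hypothetical helper name.

```python
from collections import Counter


def weekly_totals(rows):
    """Count transactions per week across all customers.

    Week 1 covers days 1-7, week 2 covers days 8-14, and so on.
    """
    counts = Counter()
    for transactions in rows:
        for day in transactions:
            if day is None:
                continue  # skip missing cells
            week = (day - 1) // 7 + 1
            counts[week] += 1
    return dict(sorted(counts.items()))


# Two short, made-up customer rows in the same layout as the example.
rows = [
    [1, 2, 3, 8],
    [2, 7, 16, None],
]
totals = weekly_totals(rows)
```

Here the week number plays the role of the ID and the count plays the role of the Label. Note that this gives one series for the whole customer base, so it forecasts overall volume rather than each customer's next transaction date.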
04-10-2017 05:05 PM
AH! Did you try the Generalized Sequential Patterns operator?