Due to recent updates, all users are required to create an Altair One account to login to the RapidMiner community. Click the Register button to create your account using the same email that you have previously used to login to the RapidMiner community. This will ensure that any previously created content will be synced to your Altair One account. Once you login, you will be asked to provide a username that identifies you to other Community users. Email us at Community with questions.
Cumulative summing
I have a large set of examples of financial deposit amounts each with a client id and timestamp. All I want to do is add an attribute to each example that represents the total deposits done by the client up to that timestamp.
The only way I could find to do this was to sort the dataset by timestamp and client id and then loop through it example by example with macros in hold the cumulative total and reset it each time a new client id is encountered. It works but it is VERY VERY slow - it's been running for 15 hours and still going. Most processes on this dataset take seconds....
Surely there has to be a better way of doing this?! (preferably using native operators)
The only way I could find to do this was to sort the dataset by timestamp and client id and then loop through it example by example with macros in hold the cumulative total and reset it each time a new client id is encountered. It works but it is VERY VERY slow - it's been running for 15 hours and still going. Most processes on this dataset take seconds....
Surely there has to be a better way of doing this?! (preferably using native operators)
0
Best Answer
-
BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 UnicornHi,
what you describe is trivial to do in a SQL database:
select client_id, timestamp, amount,
sum(amount) over (partition by client_id order by timestamp) as deposits
from yourtable
The sum() over () construct is called "Window function" and it has been in the SQL standard for ~ 20 years now. The concept is so popular and powerful that even NoSQL systems (those with reasonable query languages) tend to implement it.
Here's an excellent introduction to window functions: https://www.youtube.com/watch?v=D8Q4n6YXdpk (Applicable to other database systems implementing the standard.)
This example builds partitions (sets of rows that belong together) by the client id and by specifying the order by, you automatically select the application of the aggregation function until the current row inside the current partition. This gives you the running sum.
There is unfortunately no easy way to solve your problem in RapidMiner. If you can, create a local database (SQLite, H2, ...), put your data into a table and use a similar query to get them back.
If you'd like to do it in RapidMiner, it might work faster by looping the client IDs, selecting the current client's subset and doing the cumulative summing inside that loop. (The speed mainly depends on the ratio of clients and transactions.)
(There are operators in Rapidminer related to Windowing but their functionality is very different, as they solve different problems.)
Here's an incomplete implementation of Window Functions (groupwise aggregations but without the running sum functionality that you require) in RapidMiner:
https://github.com/bbarany/rapidminer-windowfunctions
Regards,
Balázs6
Answers
I appreciate that RapidMiner's model reinforces that examples are independent but this feels like quite a common use case in the customer lifetime modelling space so I am surprised there isn't an operator that natively supports this. Would be great addition IMHO.
Going through a database (or having data there in the first place) is not a bad solution though.