Compete in RapidMiner's 3rd Competition: Fantasy Football. Top prize is $750. Deadline December 19.
Yes it's out and we're super excited about it. We call it "Data Science for the Enterprise". Download today and let us know how you like it!
Read about how our community works. Meet other newbies. Get your questions answered fast!
What can be suggested as a good parctical approach while working with VERY big PostgreSQL tables within Studio+Server ecosystem? Is there any way to efficiently handle 5M rows tables making aggregations and joins and not use Radoop?
For example, while preparing the data I need to calculate some metrics which involves few joins and aggregations on a few PostgreSQL tables each from 3 to 5 millions rows. I can pull a whole table and store it in Studio repository, which sometimes results in saved ExampleSets of over 500Mb size. But this usually works (though takes time), and then I can perform needed joins etc within RM process.
I want though to move all these calculations to RM server. Until I added memory to server I was usually getting while Read Database process:
java.lang.OutOfMemoryError: Java heap space
error message, but this was solved. Still, I am not able to query some bigger tables, and I get this:
javax.ejb.EJBException: javax.persistence.PersistenceException: org.hibernate.exception.GenericJDBCException: could not execute statement
The question is, if there is more efficient way to work with this data sizes, except using Radoop or making aggregations within PostgreSQL and fetching the end result only?
Awww, that's a mean constraint in not letting me use the aggregations on PostgreSQL either!
I'm going to break this down "the hard way" for you below, but "the easy way" is to download and try out the Jackhammer extension for RapidMiner. Jackhammer Extension
Here you can do many useful things including shifting data from main memory to disk.
"the hard way"
Okay, well... I need to state in advance that I'd need to know a bit more about your data and JOINS to give full tips, but here's a few pointers.
First thing to consider is that I presume you are bringing the entire table into memory, but this probably includes a lot of unnecessary data for the next steps of the process. You don't need all those data columns in memory the whole time, right? Let's get rid of some using SELECT ATTRIBUTES early in your process AND combine this with the MATERIALIZE DATA and FREE MEMORY operators. This will help your process to be far more efficient in the later operators because your RAM is not cluttered with needing to remember every single customer address line when all it needs to aggregate is the date of transaction & the customer city.
Next, let's think about Radoop and Hadoop and how it works to make large calculations efficient: it breaks the calculations into smaller parts and executes each of these smaller parts on different servers. This is called Map/Reduce in fancy talk, but it's simply just making a big problem into several smaller problems.
In our case there is only one server to execute on, but we can still take the same approach what we need to do is JOIN in such a way that memory usage is kept to a minimum, so we should aim to aggregate our data in small pieces and then add all of these pieces together.
For this we'll can to use a little bit of SQL, in this case LIMIT & OFFSET. This will allow you to pass to PostgreSQL only the records that you have not already processed. https://www.postgresql.org/docs/9.6/static/queries-limit.html
You can then use a control table stored on your repository to bring through only records that you have not aggregated and store only the summary data (the aggregated tables) in your repository RapidMiner. Just topup the records as you get them. For example, if the data you need is monthly aggregations you only need to bring in one month at a time.
HOWEVER, your constraint is to not use any SQL tricks, so we need to do it all on RapidMiner and assuming we're bringing in all the data.
I'm going to assume that you only have 1 large table and several smaller ones. (For example transactions, products, categories).
So we need to get the entire large table and use it in smaller "chunks".
Both the SPLIT DATA operator or the LOOP BATCHES operator, will break your data into smaller chunks for processing which you can then combine with either the STORE/WRITE DATABASE operators or you can also perform your JOINS and aggregations on the smaller dataset and then use a FREE MEMORY operator to reduce the data stored in your RAM at any one time.
The exact order of the operators and which ones work best in which case might change so if going this route you might want to post a few more replies and we'll try to help you get it in the right order for efficiency.
You're using one of the best relational database systems of the world.
Any special reason why you don't want to use the functionality it offers to you?
Relational databases have been developed to do exactly what you want (filter, group, join) for almost 40 years. They're *very* good at these things. This is the reason you're using the database instead of CSV files, isn't it?
You can always go for a balanced approach. If you can reduce your example set to e. g. 25 % by doing a simple join or aggregation in the database, you will have a much better experience in RapidMiner afterwards when you do the rest of the data manipulation there.
Thanks @JEdward a lot for that extended reply!
I will study possible solutions which are applicable to my data specifically.
Thanks @BalazsBarany, I must admit you are totally right about SQL possibilities; just for some reasons I wanted to keep ALL the processes within RapidMiner, but who knows, maybe I will actually end up doing at least part of the manipulations in the database directly.
You are right that keeping all the logic in RapidMiner processes is good for documentation and working together in a team.
However, basic SQL knowledge is usually expected for data scientists. And in the end, you want to solve your problem - if you can do it using the best available tool, everyone will be happy. Even your colleagues if they need to execute your process and don't need to wait minutes or hours for RapidMiner to filter data that could have been filtered in the database.
The exact balance between SQL wizardry and RapidMiner is up to you and your colleagues.
There are things that are ridiculously easy in SQL (e. g. partial aggregations using window functions) and only work so-so in RapidMiner. (Shameless plug: Window Functions in RapidMiner)
(SQL Wizard and RapidMiner fan)
You could also use an Object-relational mapper like SQLAlchemy inside a Python script. I don't know if it is powerful enough to solve your problems, but writting OO Code instead of SQL sounds tempting to me.
Thanks again everyone for replies, for now I ended up with actually doing some necessary aggregations directly in PostgreSQL and fetching into RM process the end result only using a designated query.