ETL and OLAP

Boris_Petukhov · June 2009

Hi All!

I'm doing some R&D and want to see if RapidMiner can be used in the same way MS SSIS and MS SSAS are used.

In other words, I need to be able to do the ETL stuff, build a star schema and publish cubes to the clients.

Users should then be able to connect to the cubes and "slice and dice" the data in any way they need.

Do people use this package for this sort of thing?

Thanks in advance.

Boris Petukhov.

schlenzka · April 2010

Hi Boris,

I guess you have moved on and this answer comes a few months too late for you, but it may help others searching for OLAP and Rapid-i / RM / RapidMiner. RapidMiner (afaik) is primarily a solution focused on data mining and use cases around data mining. It does provide some OLAP processing abilities (see http://www-ai.cs.uni-dortmund.de/LEHRE/VORLESUNGEN/MLRN/WS0809/rm-api/overview-summary.html), but it is not an OLAP engine itself. ETL is not its main purpose either (although it can be (mis)used for ETL). So if you are looking for something to build a relational data model and an OLAP cube, you might turn to hand-coding or to open source ETL tools (e.g. Kettle aka Pentaho Data Integration, Talend or CloverETL) and OLAP or memory-based engines (such as Mondrian and Palo).

Having said this, whether RM is the right tool for you depends on what your use cases are and who your users are. What architecture and technology types you use is, in my opinion, irrelevant, as long as they fulfill your requirements sufficiently.

Regards

ms

Saikrish · April 2010

Hi ms,

I am a newbie doing R&D on predictive analytics for one of our client processes. We already have a huge database full of financial information, and the requirement is to identify our potential customers, active agents and flourishing regions for our marketing department.

I came across RapidMiner, but I can't work out how it is going to help us. Could you kindly help us?

We have to set up datamarts, design an analysis engine and produce graphical output.

Does RapidMiner support this?

IngoRM · April 2010

Hi,

well, yes, but RapidAnalytics (the server behind RapidMiner) might be better suited for this. With RapidMiner / RapidAnalytics you can set up a datamart, design every analysis process you can think of in the field of predictive analytics / data mining, and use those models for scoring or make use of the millions of visualization schemes available within RapidMiner. With RapidAnalytics, you can even define services from those processes which can be directly integrated into your infrastructure, or create the visualizations on the fly for integration.

But since you asked here in the ETL / OLAP thread, I assume that you also want an answer on those topics. The point is that the answer cannot be given in general, as ms has already pointed out. Let me comment a bit on his points:

About ETL

RapidMiner / RapidAnalytics are first of all solutions for (statistical) data analysis and predictive analytics / data mining. So those tools were not designed for performing in-database multi-terabyte transformations in just 1 second but for providing all the tools necessary for performing great analysis processes. What is the difference? In data mining, the modeling step - although it is often only a single step within a process of hundreds of operators - is often the bottleneck. Runtimes are high, sometimes exponential, and the data has to be iterated over many times. This is not the optimal setting for databases, and most solutions hence perform those model calculations in memory instead of in the database simply because it is much faster. Hence, there is no point in loading / transforming much more data than you can model, since you would have to take a sample anyway. In such a setting, the traditional ETL approach is not useless, but it is not necessary: there is no point in transforming terabytes of data on the fly when you cannot model the data later on anyway.

On the other hand, traditional ETL tools sometimes cannot offer calculations which are useful for data mining. A simple example might be an aggregation where the median (instead of a mean, max, min, or count) should be calculated. Many databases offer no built-in median aggregate, and hence it is often not available in ETL tools either. But it is still sometimes useful in data analysis.
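To make the median example concrete, here is a minimal sketch in plain Python (not RapidMiner itself). The function name `median_by_group` and the sample `sales` rows are made up for illustration; only the standard library is used.

```python
from collections import defaultdict
from statistics import median


def median_by_group(rows, key, value):
    """Group dict rows by `key` and return the median of `value` per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {group: median(values) for group, values in groups.items()}


# Hypothetical sample data: one outlier (500) in the "north" region.
sales = [
    {"region": "north", "amount": 10},
    {"region": "north", "amount": 12},
    {"region": "north", "amount": 500},
    {"region": "south", "amount": 7},
    {"region": "south", "amount": 9},
]

result = median_by_group(sales, "region", "amount")
print(result)
```

The outlier barely moves the median for "north", whereas a mean would be dominated by it, which is one reason the median can be the more useful aggregate in analysis work.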

So we have two arguments here. First, it is often not necessary to pipe all the data through an ETL process, since the complete data cannot be modelled anyway. Second, many ETL tools have restrictions for exactly this reason: as much as possible has to be done in a piped fashion and / or directly in the database.

What is the RapidMiner solution I often refer to as "Analytical ETL"? By default, data is retrieved from a database into memory, transformed and modelled there, and the results are written back. This is appropriate if data mining is your primary goal. But there is another powerful option which many RapidMiner users overlook: almost all preprocessing operators also provide an option "create view", which means that the data is not changed and stored but all calculations are made on the fly. If you now read your data and transform it batchwise (which is possible by using the appropriate input operators, or by creating the batches in loops yourself and making use of the limit definitions in your database), scalability is no longer an issue. You can transform data sets of arbitrary size this way.
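The batch + view idea can be sketched outside RapidMiner as follows. This is only an illustration under assumptions: it uses Python's built-in `sqlite3` with an in-memory database, hypothetical tables `source` and `target`, and a made-up transformation (doubling a column). A generator stands in for the "create view" option, since it computes each row only when it is consumed.

```python
import sqlite3


def read_batches(conn, batch_size):
    """Yield one list of rows per batch, using LIMIT/OFFSET in the database."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM source ORDER BY id LIMIT ? OFFSET ?",
            (batch_size, offset),
        ).fetchall()
        if not rows:
            return
        yield rows
        offset += len(rows)


def as_view(rows):
    """Lazy transformation: a row is only computed when it is consumed."""
    return ((row_id, amount * 2) for row_id, amount in rows)


# Hypothetical source and target tables in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)", [(i, float(i)) for i in range(1, 11)]
)

batches_seen = 0
for batch in read_batches(conn, batch_size=4):
    # Only one batch is held in memory at a time.
    conn.executemany("INSERT INTO target VALUES (?, ?)", as_view(batch))
    batches_seen += 1
conn.commit()

total = conn.execute("SELECT SUM(amount) FROM target").fetchone()[0]
print(batches_seen, total)
```

Memory use is bounded by the batch size rather than by the table size. For very large tables, paging on the key (`WHERE id > ?`) instead of `OFFSET` is usually cheaper, because the database does not have to skip over rows it has already delivered.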

So, yes, it is possible to perform ETL processes with RapidMiner and you can even do things which are not possible with other tools around. Are other ETL tools hence useless? Of course not: if your primary goal is ETL and not data mining, or if processing time is a really important issue (don't start with data mining then), then those tools are the way to go. If, however, the primary goal is the analysis, you are often really fine with only using this "Analytical ETL" approach of RapidMiner. In fact, we have done more than 200 projects now and we never had any need for an additional ETL tool but did everything with RapidMiner processes.

About OLAP

Again ms is absolutely right: RapidMiner / RapidAnalytics is not an OLAP engine by itself. We will, however, release a new extension this year which makes working with cubes possible directly within our products. But at the moment the best idea would be to use another tool for OLAP until you end up with a table which can be fed into RapidMiner.

And I fully agree with his statement:

"What architecture and technology types you use is, in my opinion, irrelevant, as long as they fulfill your requirements sufficiently."

Thanks for this discussion. Cheers,
Ingo

Saikrish · April 2010

Thank you Ingo,

for these valuable suggestions and points.

Cheers

Saikrish · April 2010

Hi Ingo,

Following yesterday's discussion, I would like to clarify a few more points with you. At present, our customer data is about 9 million records in the database. Considering scalability, in 4 years or so it may exceed 13 to 14 million entries.

Can RapidMiner / RapidAnalytics cope with such a huge volume of data?

Is it scalable to that extent?

IngoRM · April 2010

Hi,

yes, if you use a 64-bit machine with sufficient memory you can work directly on this amount of data without having to think about batch + view processing at all. Of course it depends on the concrete preprocessing processes, but in general RapidMiner / RapidAnalytics should be able to work directly on data sets of that size. If your hardware is not sufficient, you can always change to the batch + view approach as stated above. So no problem with this.

As a side note: recently, we ourselves successfully processed 120 million transactions - some parts were done per batch, some parts even directly in the database by sending SQL statements for certain preprocessing steps. The data was condensed by those RapidMiner processes so that it fit perfectly into memory, and we were then able to create the desired models. Actually, the processes we created would have been able to process many more tuples than the 120 million we had from our customer - although running time started to become the limiting factor (the complete ETL + modeling process took about 4 days).
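The "condense in the database first, model in memory afterwards" pattern can be illustrated with a small sketch. It is not the process described above: the `transactions` table, its columns and the chosen aggregates are assumptions for illustration, and Python's built-in `sqlite3` stands in for the production database.

```python
import sqlite3

# Hypothetical transaction table; real data would already live in the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 10.0), (1, 30.0), (2, 5.0), (2, 5.0), (2, 20.0), (3, 100.0)],
)

# Push the heavy aggregation down to the database as one SQL statement,
# so that only one condensed row per customer is loaded into memory.
features = conn.execute(
    """
    SELECT customer_id, COUNT(*), SUM(amount), AVG(amount)
    FROM transactions
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()

for row in features:
    print(row)
```

Six transactions shrink to three feature rows here. On real data the reduction from transactions to customers is what lets the condensed table fit into memory for the modeling step.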

Hope that helps. Cheers,
Ingo

Saikrish · April 2010

Hi Ingo,

That's great to hear, and thank you for your prompt responses.
Cheers.

crappy_viking · March 2011

Hello Ingo, and thanks for your explanations. Just a piece of "personal knowledge": it seems that Palo is a real open source M-OLAP (let's say "native OLAP") project, working with GPU acceleration. Is that kind of GPU feature available for the RM Community edition?
http://www.jedox.com/de/produkte/palo-gpu-accelerator.html
Best regards.

IngoRM · March 2011

Hi,

not yet, but there are several groups all over the world working on GPU support right now. I have also seen the first amazing results recently (a speed-up by a factor of several hundred) and I am pretty positive we will hear more about that during the RCOMM 2011 this year in Dublin.

Another interesting - although not open source - option would be the combination of RapidAnalytics with Ingres VectorWise. We are working a lot with Ingres on the integration and have achieved speed-ups of up to a factor of 100 for several mining schemes. Some results about this were presented at last year's RCOMM, by the way.

Cheers,
Ingo
