
distributed data mining support

forest520 Member Posts: 7 Contributor II
edited June 2019 in Help
When data grows larger and larger, data mining algorithms can no longer finish their computations in a reasonable time.
Distributed data mining might be a good solution. There are already data mining algorithms built on the MPI framework, and MapReduce-based ones such as Apache Mahout, which runs on Hadoop.
Google is now offering a Prediction API for data mining, which supports datasets of up to 100 MB. I think large-scale data mining is becoming a popular requirement, and it is also a good chance for RapidMiner to surpass Clementine and the like and become the number-one data mining tool in the world.
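
A minimal sketch of the MapReduce pattern mentioned above, written against the standard Hadoop Mapper/Reducer API rather than taken from Mahout or RapidMiner: it counts class-label frequencies in a large CSV file, a typical building block of distributed mining algorithms. The class names, the assumption that the label is the last column, and the job wiring are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LabelCount {

    // Map phase: emit (label, 1) for every example in this node's input split.
    public static class LabelMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text label = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            label.set(fields[fields.length - 1]); // assumption: label is the last column
            context.write(label, ONE);
        }
    }

    // Reduce phase: sum the partial counts for each label.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "label count");
        job.setJarByClass(LabelCount.class);
        job.setMapperClass(LabelMapper.class);
        job.setCombinerClass(SumReducer.class); // counting is associative, so combine locally
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```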

Answers

  • andydempsey Member Posts: 1 Contributor I
    I agree.
  • Preko Member Posts: 21 Contributor II
    Hi guys,

    At RCOMM 2010 there was a talk by Alexander Arimond on MapReduce integration into RapidMiner.
    Currently it is specific to a few algorithms, but we have already started conversations about extending it into a general distributed extension for RapidMiner. Of course, such an extension will not be out within a matter of weeks, but I would be surprised if we did not have it within a year.
  • dragoljub Member Posts: 241 Contributor II
    This is great news.

    I have experimented with RM on a distributed LSF cluster with hundreds of cores and hundreds of gigabytes of RAM. It does work in its current state for independent computations such as cross validation or parallel parameter optimization (see the fold-level sketch appended after this thread), though I doubt it is optimized for such a system. Keep us posted.

    -Gagi

  • Preko Member Posts: 21 Contributor II
    I think there are (at least) two reasons why people want to use distributed tools in RapidMiner:
    1. The task requires a lot of computation.
    2. The task has a lot of data.

    In the first case, the parallel extension is great. I have never seen it running on multiple computers, but I think it can be done. It is probably not well optimized, but it works.

    The second case is currently not solved in RapidMiner. If the data does not fit into memory, you have very little (almost no) chance of getting anything done. So I think the main goal of this project should be to provide data analysis operators for very large datasets; a single-pass streaming sketch of that idea is appended after this thread. Of course, using many machines makes the computation faster, but my interest is not in optimizing runtime but in handling large datasets.

    Let me know if you agree or disagree!
  • dragoljub Member Posts: 241 Contributor II
    Hi Preko,

    That makes sense. I was under the impression that RM already has some ability to work directly on databases, which could get around some memory problems and handle huge datasets (a plain-JDBC sketch of that streaming pattern is appended after this thread). Interestingly, RAM keeps growing in density and SSDs keep getting faster, so fitting a lot of data into fast memory is becoming more and more feasible. I would think the performance gains would have to be dramatic to justify distributing across multiple computers rather than simply adding a RAM disk. The hard part is porting single-threaded algorithms to multi-threaded ones with a significant speedup, and that again depends on the algorithm: if you cannot break the problem into many independent pieces, there is no hope of distributing it.

    -Gagi
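
Picking up Gagi's point above about independent computations: a minimal sketch (not RapidMiner's actual parallel extension) of why cross validation parallelizes so well. Each fold is a self-contained train/evaluate task, so folds can be farmed out to a thread pool and, by the same logic, to cluster nodes. The Dataset, Learner, and Model interfaces are hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCrossValidation {

    // Placeholder interfaces standing in for whatever learner/data API is in use.
    interface Dataset { Dataset trainSplit(int fold, int k); Dataset testSplit(int fold, int k); }
    interface Learner { Model train(Dataset train); }
    interface Model { double accuracy(Dataset test); }

    // Runs k-fold cross validation with one task per fold, e.g.
    // double acc = crossValidate(data, learner, 10);
    public static double crossValidate(Dataset data, Learner learner, int k)
            throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<Double>> results = new ArrayList<>();
        for (int fold = 0; fold < k; fold++) {
            final int f = fold;
            // Each fold is self-contained: no shared mutable state between tasks.
            results.add(pool.submit((Callable<Double>) () ->
                    learner.train(data.trainSplit(f, k))
                           .accuracy(data.testSplit(f, k))));
        }
        double sum = 0;
        for (Future<Double> r : results) {
            sum += r.get(); // block until each fold finishes
        }
        pool.shutdown();
        return sum / k; // average accuracy over the k folds
    }
}
```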
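
On Preko's second case, a sketch of the out-of-core idea: stream over the data once, keep only constant-size state, and never materialize the full table in RAM. Here Welford's online algorithm computes the mean and variance of one numeric column; the single-column CSV input and command-line path are illustrative assumptions.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamingStats {
    public static void main(String[] args) throws IOException {
        long n = 0;
        double mean = 0.0;
        double m2 = 0.0; // sum of squared deviations from the running mean

        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                double x = Double.parseDouble(line.split(",")[0]);
                // Welford's update: O(1) memory regardless of dataset size.
                n++;
                double delta = x - mean;
                mean += delta / n;
                m2 += delta * (x - mean);
            }
        }
        // Sample variance; assumes n > 1.
        System.out.printf("n=%d mean=%f variance=%f%n", n, mean, m2 / (n - 1));
    }
}
```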
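
And on Gagi's note about working directly on databases, a plain-JDBC sketch of that pattern (the connection URL, credentials, and table/column names are made up): stream the result set with a bounded fetch size so that only a window of rows sits in RAM at a time.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DatabaseScan {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mining", "user", "secret")) {
            // With the PostgreSQL driver, results stream only outside autocommit.
            con.setAutoCommit(false);
            try (Statement st = con.createStatement()) {
                st.setFetchSize(10_000); // fetch in chunks instead of loading everything
                double sum = 0;
                long n = 0;
                try (ResultSet rs = st.executeQuery("SELECT amount FROM transactions")) {
                    while (rs.next()) {
                        sum += rs.getDouble(1);
                        n++;
                    }
                }
                System.out.printf("rows=%d mean=%f%n", n, sum / n);
            }
        }
    }
}
```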