distributed data mining support
As datasets grow larger and larger, data mining algorithms can no longer finish their computations in time.
Distributed data mining could be a good solution. There are already data mining algorithms built on the MPI framework, and MapReduce-based ones such as Apache Mahout, which runs on Hadoop.
Google now offers a Prediction API for data mining that supports 100M datasets. I think large-scale data mining is becoming a common requirement, and in my opinion this is also a good chance for RapidMiner to surpass Clementine and the like and become the No. 1 data mining tool in the world.
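To make the MapReduce idea concrete, here is a minimal sketch of the pattern in plain Python: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group independently. This is only an illustration of the programming model; in a real framework like Hadoop or Mahout the phases run distributed across machines, and the function names here are made up for the example.

```python
from collections import defaultdict

# Toy illustration of the MapReduce pattern: compute per-key sums.
# In a real framework (e.g. Hadoop/Mahout) the map and reduce phases
# run on different machines; here we simulate them locally.

def map_phase(records):
    """Emit (key, value) pairs from raw records."""
    for key, value in records:
        yield key, value

def shuffle(pairs):
    """Group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each key's values independently (hence parallelizable)."""
    return {key: sum(values) for key, values in groups.items()}

records = [("a", 1), ("b", 2), ("a", 3)]
result = reduce_phase(shuffle(map_phase(records)))
print(result)  # {'a': 4, 'b': 2}
```

The key property is that each reducer only ever sees one key's values, which is what lets the reduce phase scale out across machines.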
Answers
At RCOMM 2010, Alexander Arimond gave a talk on integrating MapReduce into RapidMiner.
Currently it is specific to a few algorithms, but we have already started conversations about extending it into a general distributed-computing plugin for RapidMiner. Of course, such an extension is not a matter of weeks, but I would be surprised if we don't have it within a year.
I have experimented with running RM on a distributed LSF cluster with hundreds of cores and hundreds of gigabytes of RAM. It does work in its current state for independent computations such as cross-validation or parallel parameter optimization, though I doubt it's optimized for such a system. Keep us posted.
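The reason cross-validation and parameter optimization parallelize so easily is that each fold (or parameter setting) is evaluated independently. A minimal sketch of that structure, using a trivial mean-predictor in place of a real learner (all names here are illustrative, not RapidMiner's API):

```python
from concurrent.futures import ThreadPoolExecutor

# Each fold's train/test cycle is independent of the others, so the
# folds can run concurrently. A trivial "model" (predict the training
# mean) stands in for a real learner. For CPU-bound learners you would
# use ProcessPoolExecutor (or separate machines) instead of threads.

def evaluate_fold(data, k, fold):
    test = data[fold::k]                                  # hold out every k-th point
    train = [x for i, x in enumerate(data) if i % k != fold]
    prediction = sum(train) / len(train)                  # "train" the mean predictor
    return sum((x - prediction) ** 2 for x in test) / len(test)  # fold MSE

def cross_validate(data, k=5):
    with ThreadPoolExecutor(max_workers=k) as pool:
        errors = list(pool.map(lambda f: evaluate_fold(data, k, f), range(k)))
    return sum(errors) / k                                # average error over folds

data = [float(i % 7) for i in range(100)]
print(cross_validate(data))
```

Because no fold depends on another's result, the speedup is close to linear in the number of workers, which is exactly the "embarrassingly parallel" case a cluster handles well.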
-Gagi
I see two distinct cases here:
1. The task needs a lot of computation.
2. The task involves a lot of data.
In the first case, the parallel extension is great. I have never seen it run on multiple computers, but I think it can be done. It is probably not well optimized, but it works.
The second case is not currently solved in RapidMiner. If the data does not fit into memory, you have very little (almost no) chance of getting anything done. So I think the main goal of this project should be to provide data analysis operators for very large datasets. Of course, using many machines will also make the computation faster, but my interest is not optimizing runtime; it is handling large datasets.
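One standard way to handle data that does not fit into memory is out-of-core (streaming) processing: read the data in fixed-size chunks and maintain running aggregates, so memory use stays constant regardless of dataset size. A minimal sketch of the idea (function and variable names are made up for illustration):

```python
# Out-of-core sketch: compute a mean over a data stream in fixed-size
# chunks. Memory use is bounded by the chunk size, not the dataset size.

def chunks(stream, size):
    """Yield successive lists of at most `size` items from a stream."""
    chunk = []
    for x in stream:
        chunk.append(x)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def streaming_mean(stream, chunk_size=1000):
    """Accumulate count and total chunk by chunk; never hold all data."""
    count, total = 0, 0.0
    for chunk in chunks(stream, chunk_size):
        count += len(chunk)
        total += sum(chunk)
    return total / count

# A generator stands in for a dataset too large for RAM.
big = (float(i % 10) for i in range(1_000_000))
print(streaming_mean(big))  # 4.5
```

Many statistics (counts, sums, means, variances, histograms) admit such incremental updates, which is what makes "operators for very large datasets" feasible without loading everything at once.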
Let me know if you agree or disagree!
That makes sense. I was under the impression that RM already has some ability to work directly on databases, which could get around some memory problems and handle huge datasets. Interestingly, RAM density keeps increasing and SSDs keep getting faster, so fitting lots of data into fast memory is more and more feasible. I would think algorithm performance has to improve dramatically to merit distributed computing across multiple machines, rather than simply adding a RAM disk. The problem is porting single-threaded algorithms to multi-threaded ones with a significant performance gain, but again this depends on the algorithm. If you cannot break the problem down into many independent pieces, there is no hope of distributing it.
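The "work directly on databases" idea amounts to pushing the computation to where the data lives: the database aggregates the rows, and only the summary crosses into application memory. A small sketch using SQLite (table and column names are invented for the example):

```python
import sqlite3

# Push aggregation into the database: only the GROUP BY summary is
# returned to the application, not the raw rows, so memory use does
# not grow with table size.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (label TEXT, value REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 2.0)],
)
rows = conn.execute(
    "SELECT label, AVG(value), COUNT(*) FROM samples "
    "GROUP BY label ORDER BY label"
).fetchall()
print(rows)  # [('a', 2.0, 2), ('b', 2.0, 1)]
conn.close()
```

The same pattern applies to any SQL backend: the heavier the aggregation the database can do, the less data the mining tool has to hold in RAM.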
-Gagi