comparing data mining tools

jesslyn Member Posts: 5 Contributor II
I am currently evaluating RapidMiner, R, SAS Enterprise and Orange. Can someone provide some useful information?

Which software provides better features in terms of 1) scalability, 2) power and flexibility, 3) how well the tools access and manage the data, 4) graphical user-friendliness, and 5) visualization?

I've done some research and found that RapidMiner is better than the other three tools.

I need someone to provide more information about this topic, as I am currently evaluating these tools. Thanks.


    Nils_Woehler Member Posts: 463 Maven

    We are glad you are interested in RapidMiner, but please don't double post. Your questions have been answered here: http://rapid-i.com/rapidforum/index.php/topic,5187.0.html

    IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi Jesslyn,

    I have just answered a couple of your questions here:


    Let me add some information on the new ones:

    1) scalability
    The desktop version of RapidMiner runs, well, on your desktop. Hence, the amount of data you can handle is limited by the amount of memory your desktop system has. Things are of course much better with the server, RapidAnalytics, which usually runs on better hardware. There are also several extensions that improve RapidMiner's scalability: a) an In-DB-Extension for executing processes directly in the database (for many processes, there is then practically no limit anymore), b) a Streaming Extension whose operators avoid loading the data completely into memory, and c) the Radoop Extension, which allows running data transformation and modeling processes on distributed Hadoop clusters.
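To put the memory limit in rough numbers, here is a back-of-the-envelope sketch in Python. It assumes every cell is held as an 8-byte double-precision value and that only a fraction of the machine's memory is available for data. Both figures, and the helper name `estimate_max_rows`, are assumptions for illustration, not statements about RapidMiner's internals.

```python
def estimate_max_rows(memory_bytes, n_columns, bytes_per_cell=8, usable_fraction=0.5):
    """Rough upper bound on the rows of a fully in-memory table.

    Assumptions (hypothetical, adjust to your setup):
      - every cell costs bytes_per_cell bytes (8 = double precision)
      - only usable_fraction of memory is free for data; the rest goes
        to the operating system, the tool itself and intermediate copies
    """
    if n_columns <= 0 or bytes_per_cell <= 0:
        raise ValueError("n_columns and bytes_per_cell must be positive")
    usable_bytes = int(memory_bytes * usable_fraction)
    return usable_bytes // (n_columns * bytes_per_cell)


# Example: a desktop with 8 GiB of RAM and a table with 50 numeric columns.
rows = estimate_max_rows(8 * 1024**3, n_columns=50)
print(f"roughly {rows:,} rows")
```

The estimate scales inversely with the number of columns, so a wide table hits the limit much earlier than a narrow one. Real limits are usually lower, because modeling steps create copies and intermediate results.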

    2) power and flexibility
    This has been partly answered already. Right now, there is no other graphical data mining suite offering more operations and more options for combining them including all necessary control structures like loops, branches, macros (variables), etc. More can be found in the fact sheet.

    3)how well the tools access and manage the data
    Again, please have a look at the fact sheet. There are plenty of operators for connecting to data sources and transforming the data. Actually, many users of RapidMiner do not perform data analysis but ETL processes  ;)

    4) which is more graphical user friendly
    Although this is a matter of taste, I would like to point out that the Rapid-I team has put a lot of effort into better supporting analysts, especially beginners. There are a lot of features like meta data propagation, quick fixes, error detection, online help, operator recommendations etc. to simplify the analyst's life. More, as you might guess already, can be found in the fact sheet.

    5) visualization
    And one last time: the most important visualization techniques are listed in the fact sheet. This is actually an area we are pretty proud of, since RapidMiner offers a really large number of different visualization techniques. And there is the new "Advanced Plotter" section (the documentation for this can be found in our download section).

    Fact Sheet

    Probably, you will find the following fact sheet for RapidMiner and RapidAnalytics interesting:


    jesslyn Member Posts: 5 Contributor II
    Hi Ingo,
    Thanks for replying. I understand that the RapidMiner Community Edition is free software and that there is a size limit on how many rows or records it can handle. However, can I get an estimate of what that limit is? Millions of rows of data? 1 million, 2 million? Thanks.
    awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello jesslyn

    I don't believe the Community Edition has an explicit limit on the maximum number of rows that can be processed. However, there is always a physical resource limit imposed by the machine you are running on. When you hit these limits, there are plenty of approaches for working around them. For example, the stream database and loop batch operators let you process data in batches, at the expense of increased running time, of course. Another option is to use RapidAnalytics and run processes remotely.
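The batching idea can be sketched outside RapidMiner as well. The Python below is a generic illustration, not the RapidMiner operators themselves. It computes a column mean while holding only one batch of rows in memory at a time. The helper names `iter_batches` and `batched_mean` and the sample data are hypothetical.

```python
import csv
import io
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size rows from any row iterator."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def batched_mean(csv_file, column, batch_size=1000):
    """Mean of a numeric column, keeping only one batch in memory at a time."""
    reader = csv.DictReader(csv_file)
    total, count = 0.0, 0
    for batch in iter_batches(reader, batch_size):
        total += sum(float(row[column]) for row in batch)
        count += len(batch)
    return total / count if count else float("nan")


# Hypothetical in-memory data standing in for a large file or database table.
sample = io.StringIO("id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n")
print(batched_mean(sample, "value", batch_size=2))
```

Memory use is bounded by the batch size rather than the table size. The price is the one mentioned above: the data is read sequentially in many small steps, so the run takes longer, and only computations that can be accumulated across batches fit this pattern directly.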

