Difference between WEKA and RapidMiner

Legacy User · December 2008

Hi @all,

I don't know if this is the right category for this topic, but...

Can anyone please tell me what the main differences between WEKA and RapidMiner are, and what makes RapidMiner so special?

Thanks in advance
JJP

IngoRM · December 2008

Hi,

hmm, this will hopefully not turn into just another RapidMiner vs. Weka discussion. But anyway, here are some links:

* In the following thread, Martin has posted his opinion why he and his company preferred RapidMiner and he pointed out some differences:

http://rapid-i.com/rapidforum/index.php/topic,362.0.html

* And a Google search for Weka and RapidMiner would have given you the following link, leading to a statement of mine in the KDnuggets newsletter (I would actually rather not be reminded of this discussion):

http://www.kdnuggets.com/news/2007/n24/5i.html

* There was also a study done for the Data Mining Cup 2007 showing some differences of RapidMiner compared to other open source data mining solutions as well as proprietary ones:

http://www.prudsys.de/Service/Downloads/bin/DMC2007_schieder_tuchemnitz.pdf

Finally, you could also have a look at our KDD 2006 paper, which explains some of the conceptual ideas behind RapidMiner and shows those differences as well. There were also some threads in the old forum at SourceForge discussing some of the differences, if you want to try to find them.

But in any case: why not simply try RapidMiner and find out for yourself? The learning curve might be steep (hey, data mining is a complicated topic after all...) but it's usually worth the effort.

Cheers,
Ingo

crappy_viking · November 2009

Hi @all,

According to the benchmark comparing Weka, RapidMiner and KNIME, RM is a bit weak in data preparation. The solution is "datacleaner", available here:
http://datacleaner.eobjects.org

Pure Java, with query batch optimization, and so efficient that sometimes, for analysis purposes, you do not even need data mining. Clementine has a "Data Quality Audit" showing feature histograms. With a tool such as "datacleaner", it can go back to bed.

c.v.

RalfKlinkenberg · November 2009

Hi "crappy_viking",

thanks for providing the link to the Data Cleaner project. However, I am not aware of any data cleaning or data preprocessing functionality offered by Data Cleaner that is not already provided by RapidMiner. Could you name any?

RapidMiner actually provides significantly more data preprocessing functions and operators than Weka, KNIME, and SPSS Clementine. Feature histograms are also available in RapidMiner and RapidMiner also provides many data cleaning features. If you are not aware of those, I can recommend the RapidMiner training course on Advanced Data Preprocessing for Data Mining with RapidMiner as well as a series of webinars on data preprocessing and data cleaning with RapidMiner.

Best regards,
Ralf

steffen · November 2009

Hello

I think that ETL tools and data mining tools cannot be compared directly.
This can be illustrated by how the data flow is organized, taking Kettle (Pentaho Data Integration) as an example: in iterators. A process in Kettle is a good one if every step processes only one row at a time. This way you can load, process and save the data in small portions instead of loading it all into memory at once (like R, *snicker*). RapidMiner has improved regarding such tasks, but as far as I can see it is still not possible to (e.g.) load data row-wise from a CSV file. I know it is possible to do this by loading data from a database, but then it is not possible to monitor the processed rows, i.e.:

- show the current processing state of the rows
- if a row could not be processed without an error, store it in an extra file to check manually what has happened

If one step does not satisfy this condition (like sorting), the process gets really slow. Pentaho Corp has bought Weka to include the data mining framework in their application (http://wiki.pentaho.com/display/DATAMINING/Using+the+Knowledge+Flow+Plugin), but frankly: I do not think that embedding one dataflow philosophy into another was a good idea.
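The row-wise pattern described above (one row at a time, a visible progress state, and failed rows written to a separate file) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Kettle or RapidMiner implement it: the names process_rows and parse_age, the two-column layout, and the in-memory streams standing in for files are all hypothetical.

```python
import csv
import io


def process_rows(source, clean, rejects, transform, report_every=1000):
    """Stream rows one at a time; write failures to a separate reject sink."""
    reader = csv.reader(source)
    good = csv.writer(clean)
    bad = csv.writer(rejects)
    ok = failed = 0
    for line_no, row in enumerate(reader, start=1):
        try:
            good.writerow(transform(row))
            ok += 1
        except (ValueError, IndexError) as err:
            # Keep the line number, the reason and the raw row for manual checking.
            bad.writerow([line_no, str(err), *row])
            failed += 1
        if line_no % report_every == 0:
            # Progress report: the "current processing state of the rows".
            print(f"{line_no} rows read, {ok} ok, {failed} rejected")
    return ok, failed


def parse_age(row):
    """Hypothetical per-row step: expects a name and an integer age."""
    return [row[0].strip(), int(row[1])]


# In-memory streams stand in for real files in this sketch.
source = io.StringIO("anna,31\nbob,n/a\ncarl,45\n")
clean, rejects = io.StringIO(), io.StringIO()
counts = process_rows(source, clean, rejects, parse_age, report_every=2)
```

Only one row is held in memory at any time, which is why a step that needs all rows at once (like sorting) does not fit this scheme.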

Another point is the separation of data management and data analysis. Departments have to talk to each other, but in general I think these are different areas with different targets and responsibilities.

Conclusion:
I would use ETL tools for cleaning (which does not include steps like discretization, but rather steps like duplicate checking), for managing data, and for shifting data around from one source to another. Let the DW specialists take care of it. But when it comes to solving actual data mining problems, I would ask the DW guys to tell me how to get exactly the data I want and then perform the analysis with RapidMiner.

my (of course subjective) point of view

kind regards,

Steffen

RalfKlinkenberg · November 2009

Hello Steffen,

I agree that the way the data flow is organized is a major differentiator between most ETL and data mining tools. And Kettle and Weka do have different flow logics. However, to some extent, RapidMiner offers both flow logics:

- RapidMiner can load the full data set into memory, if the memory size is sufficient and if you like to operate this way, and perform time-efficient in-memory preprocessing and mining: CSVExampleSource, DatabaseExampleSource, etc. in RapidMiner 4.6 and CSVReader, DatabaseReader, etc. in RapidMiner 5.
- RapidMiner can alternatively read in the data in chunks, e.g. database line by database line or file by file, and thereby work on the database or on large document collections or large file collections: CachedDatabaseExampleSource, FileIterator, etc. in RapidMiner 4.6 and corresponding operators and further iterators in RapidMiner 5.
- In RapidMiner 5, you can also iterate over tables in memory row by row, i.e. iterate over examples.
- Similar to the CachedDatabaseExampleSource, there is a good chance that RapidMiner will also support line-by-line reading of large CSV files in a future version. But you are right, as of now, this is not directly supported yet.
- The already existing iteration and branching operators of RapidMiner would then allow line-wise monitoring of data preprocessing and data cleaning, i.e. different subprocesses could handle correct and incorrect lines, respectively: ProcessBranch and iterators in RapidMiner 4.6 and 5.

My personal point of view and conclusion: For most data preprocessing and data cleaning tasks we have encountered so far in our data mining, text mining, web mining, audio mining, and time series analysis and forecasting applications at Rapid-I, RapidMiner provides all the data preprocessing, cleaning, and transformation operators necessary (see also our list of references to get an idea of the scope of our projects). Furthermore, we keep extending the preprocessing and ETL capabilities of RapidMiner to meet future challenges, and we partner with ETL and data integration tool providers like Talend, Cubeware, Pervasive, etc. to meet any further demands. So, if DataCleaner really offers additional value, it might be a reasonable extension to the aforementioned list of ETL tools.

Best regards,
Ralf

crappy_viking · November 2009

Hi All

Ralf Klinkenberg wrote:

[...]I am not aware of any data cleaning or data preprocessing functionality offered by Data Cleaner that is not already provided by RapidMiner. Could you name any?

Best regards,
Ralf

Datacleaner has very, very handy features for string analysis. Typically, pattern analysis in the "profiler" gives aggregates, a kind of OLAP exampleset where each classifier is in fact the string pattern. If your string is the email "ralf123@hotmail.de", it will be classified in the category "aaaa999@aaaaaaa.aa". It is powerful for two reasons:
- It prepares preprocessing by verifying a few consistency points in your data
- It gives you the main pattern to use in predicates or in regexps when you do linguistic analysis, NER, indexing, etc.
For each string, "String analysis" can give the number of blank spaces (useful for trimming), lower/uppercase counts, and the number of words in a string (string vs nominal).
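The pattern idea can be sketched in Python as follows. This is not DataCleaner's code, only an illustration of the mapping shown in the email example: letters become "a", digits become "9", everything else is kept. Treating uppercase letters the same as lowercase ones is an assumption, and the names string_pattern, profile and string_stats are hypothetical.

```python
from collections import Counter


def string_pattern(value):
    """Map digits to '9' and letters to 'a'; keep all other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("a")  # assumption: letter case is not distinguished
        else:
            out.append(ch)
    return "".join(out)


def profile(values):
    """Aggregate values by pattern, most frequent pattern first."""
    return Counter(string_pattern(v) for v in values)


def string_stats(value):
    """Per-string measures: blanks, words, upper- and lowercase letters."""
    return {
        "blanks": value.count(" "),
        "words": len(value.split()),
        "upper": sum(ch.isupper() for ch in value),
        "lower": sum(ch.islower() for ch in value),
    }


emails = ["ralf123@hotmail.de", "anna456@example.de", "bob@example.com"]
patterns = profile(emails)
stats = string_stats("  Ralf  Klinkenberg ")
```

A pattern that occurs only once in a large column is a good candidate for a consistency check, and the most frequent pattern is a natural starting point for a regexp.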

In another profiler (FEBRL, to name it), you can use distances between words for phonetic indexing, which is widely used in data quality:
- Soundex, Phonex, Phonix, Metaphone, NYSIIS, etc.
- block/canopy indexing

Other distances, such as Jaro-Winkler or Levenshtein, are available here:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
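As an illustration of one of the distances named above, here is a minimal Python sketch of the Levenshtein edit distance (insertions, deletions and substitutions each cost 1). It is the textbook dynamic-programming version, not code taken from FEBRL or from the linked library, and the function name is hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings, using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


# Two spellings of the same surname differ by a single edit.
distance = levenshtein("meier", "meyer")
```

In a duplicate check, pairs of names whose distance falls below a small threshold would be flagged for manual review.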

All this stuff is indeed string data quality and is not in RapidMiner, except for a few algorithms in TextInput (TF-IDF, cosine distance).

c.v.

reoroman · July 2016

http://www.slideshare.net/MayurSurani/data-mining-tools-45159317

This makes a clear statement that RapidMiner is the best.

LaurieMoseley · August 2016

I tried to visit the first suggested site for a comparison between WEKA and RapidMiner

http://rapid-i.com/rapidforum/index.php/topic,362.0.html

and got Error 404 file not found. As it was the first link that I had ever followed from the Community, I thought it reasonable to report the problem.

Best wishes for an exciting venture

Laurie

LaurieMoseley · August 2016

I also tried to follow the link to

http://www.prudsys.de/Service/Downloads/bin/DMC2007_schieder_tuchemnitz.pdf

On that occasion I got

"Seite nicht gefunden" (which I translate as "page not found").

Keep up the good work, but check some of the links

Best wishes

Laurie

IngoRM · August 2016

Hi Laurie,

Thanks for reporting. We migrated from an old forum system to this new community portal a couple of months ago, and unfortunately not all links were automatically replaced during the migration. So whenever you see a link still going to "...rapid-i.com/...", it is unfortunately not going to work.

But anyways: Have fun here in the community,

Ingo
