Difference between WEKA and RapidMiner

Legacy User Member Posts: 0 Newbie
edited November 2018 in Help
Hi @all,

I don't know if this is the right category for this topic, but...

Can anyone please tell me what the main differences between WEKA and RapidMiner are, and what makes RapidMiner so special?


Thanks in advance
JJP

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    Hmm, this will hopefully not turn into just another RapidMiner vs. Weka discussion. But anyway, here are some links:


    * In the following thread, Martin posted his opinion on why he and his company preferred RapidMiner, and he pointed out some differences:

    http://rapid-i.com/rapidforum/index.php/topic,362.0.html


    * And a Google search for Weka and RapidMiner would have given you the following link, leading to a statement of mine in the KDnuggets newsletter (I would actually rather not be reminded of this discussion  ;) ):

    http://www.kdnuggets.com/news/2007/n24/5i.html


    * There was also a study done for the Data Mining Cup 2007 showing some differences of RapidMiner compared to other open source data mining solutions as well as proprietary ones:

    http://www.prudsys.de/Service/Downloads/bin/DMC2007_schieder_tuchemnitz.pdf


    Finally, you could also have a look at our KDD 2006 paper, which explains some conceptual ideas behind RapidMiner, to see those differences as well. There were also some threads in the old forum at SourceForge discussing some of the differences, which you can try to find.

    But in any case: why did you not simply try RapidMiner and find out for yourself? The learning curve might be steep (hey, data mining is a complicated topic after all...) but it's usually worth the effort.

    Cheers,
    Ingo
  • crappy_viking Member Posts: 16 Maven
    Hi @all,

    According to the benchmark of Weka, RapidMiner and KNIME, RM is a bit weak in data preparation. The solution is "datacleaner", here:
    http://datacleaner.eobjects.org

    Pure Java, query batch optimization, so efficient that sometimes, for analysis purposes, you do not even need data mining. Clementine has a "Data Quality Audit" showing features' histograms. With a tool such as "datacleaner", it can go back to bed.

    c.v.
  • RalfKlinkenberg Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member, Unconfirmed, University Professor Posts: 68 RM Founder
    Hi "crappy_viking",

    thanks for providing the link to the Data Cleaner project. However, I am not aware of any data cleaning or data preprocessing functionality offered by Data Cleaner that is not already provided by RapidMiner. Could you name any?

    RapidMiner actually provides significantly more data preprocessing functions and operators than Weka, KNIME, and SPSS Clementine. Feature histograms are also available in RapidMiner and RapidMiner also provides many data cleaning features. If you are not aware of those, I can recommend the RapidMiner training course on Advanced Data Preprocessing for Data Mining with RapidMiner as well as a series of webinars on data preprocessing and data cleaning with RapidMiner.

    Best regards,
    Ralf
  • steffen Member Posts: 347 Maven
    Hello

    I think that ETL tools and Data Mining Tools cannot be compared directly.
    This can be illustrated (example: Kettle (Pentaho Data Integration)) by how the data flow is organized: in iterators. A process in Kettle is a good one if all steps process only one row at a time. This way you can load, process and save the data in small portions instead of loading everything into memory at once (like R, *snicker*). RapidMiner has improved regarding such tasks, but as far as I can see it is still not possible to (e.g.) load data row-wise from a CSV file. I know it is possible to do this by loading data from a database, but then it is not possible to monitor the processed rows, i.e. ...
    • show the current process state of the rows
    • if a row could not be processed without an error, store it in an extra file to check manually what happened
    If one step does not satisfy this condition (like sorting), the process becomes really slow. Pentaho Corp has bought Weka to include the data mining framework in their application (http://wiki.pentaho.com/display/DATAMINING/Using+the+Knowledge+Flow+Plugin), but frankly: I do not think that embedding one dataflow philosophy into another was a good idea.
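To make the row-wise idea above concrete, here is a minimal Python sketch of a step that reads one row at a time, reports its progress, and writes rows that fail into a separate reject file. It is neither Kettle nor RapidMiner code; the names `process_rows`, `parse` and `report_every` are made up for illustration, and in-memory `io.StringIO` buffers stand in for real files.

```python
import csv
import io


def process_rows(source, good_sink, reject_sink, transform, report_every=2):
    """Stream rows one at a time; failed rows go to a separate reject sink."""
    reader = csv.reader(source)
    good = csv.writer(good_sink)
    rejects = csv.writer(reject_sink)
    processed = failed = 0
    for line_no, row in enumerate(reader, start=1):
        try:
            # Only one row is held in memory at any time.
            good.writerow(transform(row))
            processed += 1
        except (ValueError, IndexError) as err:
            # Keep the line number and the error so the row can be checked manually.
            rejects.writerow([line_no, str(err)] + row)
            failed += 1
        if line_no % report_every == 0:
            # Show the current process state of the rows.
            print(f"rows read: {line_no}, ok: {processed}, rejected: {failed}")
    return processed, failed


def parse(row):
    """Hypothetical transformation: a name column and a numeric column."""
    return [row[0].strip(), float(row[1])]


# In-memory stand-ins for an input file, an output file and a reject file.
source = io.StringIO("a,1.5\nb,oops\nc,3\n")
good_out, reject_out = io.StringIO(), io.StringIO()
counts = process_rows(source, good_out, reject_out, parse)
```

A step like sorting cannot be written this way, because it needs to see all rows before it can emit the first one, which is why it slows such a process down.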

    Another point is the separation of data management and data analysis. Departments have to talk to each other, but in general I think these are different areas with different targets and responsibilities.

    Conclusion:
    I would use ETL tools for cleaning (which does not include steps like discretization, but rather steps like duplicate checking), for managing data, and for shifting data around from one source to another. Let the DW specialists take care of it. But when it comes to solving actual data mining problems, I would ask the DW guys to tell me how to get exactly the data I want and then perform the analysis with RapidMiner.

    my (of course subjective) point of view

    kind regards,

    Steffen
  • RalfKlinkenberg Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member, Unconfirmed, University Professor Posts: 68 RM Founder
    Hello Steffen,

    I agree that the way the data flow is organized is a major differentiator between most ETL and data mining tools. And Kettle and Weka do have different flow logics. However, to some extent, RapidMiner offers both flow logics:
       
    • RapidMiner can load the full data set into memory, if the memory size is sufficient and if you like to operate this way, and perform time-efficient in-memory preprocessing and mining: CSVExampleSource, DatabaseExampleSource, etc. in RapidMiner 4.6 and CSVReader, DatabaseReader, etc. in RapidMiner 5.
    • RapidMiner can alternatively read in the data in chunks, e.g. database line by database line or file by file, and thereby work on the database or on large document collections or large file collections: CachedDataBaseExampleSource, FileIterator, etc. in RapidMiner 4.6 and corresponding operators and further iterators in RapidMiner 5.
    • In RapidMiner 5, you can also iterate over tables in memory row by row, i.e. iterate over examples.
    • Similar to the CachedDatabaseExampleSource, there is a good chance that RapidMiner will also support line-by-line reading of large CSV files in a future version. But you are right, as of now, this is not directly supported yet.
    • The already existing iteration and branching operators of RapidMiner would then allow line-wise monitoring of data preprocessing and data cleaning, i.e. different subprocesses could handle correct and incorrect lines, respectively: ProcessBranch and iterators in RapidMiner 4.6 and 5.
    • My personal point of view and conclusion: For most data preprocessing and data cleaning tasks we have encountered so far in our data mining, text mining, web mining, audio mining, and time series analysis and forecasting applications at Rapid-I, RapidMiner provides all the necessary data preprocessing, cleaning, and transformation operators (see also our list of references to get an idea of the scope of our projects). Furthermore, we keep extending the preprocessing and ETL capabilities of RapidMiner to meet future challenges, and we partner with ETL and data integration tool providers like Talend, Cubeware, Pervasive, etc. to meet any further demands. So, if DataCleaner really offers additional value, it might be a reasonable potential extension to the aforementioned list of ETL tools.
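The two flow logics in the list above (loading the full data set into memory versus reading it in chunks) can be contrasted with a small Python sketch. This is not RapidMiner code and does not use any RapidMiner operator; `read_in_chunks`, `mean_full` and `mean_chunked` are hypothetical helper names, and a plain `range` stands in for a database cursor or a file reader.

```python
from itertools import islice


def read_in_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable source."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def mean_full(rows):
    """In-memory logic: materialize the whole data set, then compute."""
    data = list(rows)
    return sum(data) / len(data)


def mean_chunked(rows, chunk_size=3):
    """Chunked logic: keep only running totals plus one chunk in memory."""
    total = 0.0
    count = 0
    for chunk in read_in_chunks(rows, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total / count


values = range(1, 11)  # stand-in for a database cursor or a file reader
chunk_sizes = [len(chunk) for chunk in read_in_chunks(values, 3)]
```

Both functions give the same mean; the difference is only how much data has to be held in memory at once.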
    Best regards,
    Ralf
  • crappy_viking Member Posts: 16 Maven
    Hi All
    Ralf Klinkenberg wrote:

    [...]I am not aware of any data cleaning or data preprocessing functionality offered by Data Cleaner that is not already provided by RapidMiner. Could you name any?


    Best regards,
    Ralf
    Datacleaner has very, very handy features for string analysis. Typically pattern analysis in the "profiler" gives aggregates, a kind of OLAP exampleset where each classifier is in fact the string pattern. If your string is the following email "ralf123@hotmail.de", it will be classified in the category "aaaa999@aaaaaaa.aa". It is powerful for two reasons :
    - It prepares preprocessing by verifying a few consistency points in your data
    - It gives you the main pattern to use in predicates or in regexps when you do linguistics analysis, NER, indexing, etc...
    For each string, "String analysis" can give the number of blank spaces (useful for trimming), Lower/Uppercase, number of words in a string (string vs nominal).
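The pattern analysis described above can be sketched in a few lines of Python. This is not DataCleaner's implementation, only an illustration of the idea; `string_pattern`, `profile` and `string_stats` are hypothetical names, and mapping every letter to "a" regardless of case is an assumption taken from the email example.

```python
from collections import Counter


def string_pattern(text):
    """Map letters to 'a' and digits to '9'; keep all other characters."""
    out = []
    for ch in text:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)


def profile(values):
    """Aggregate a column of strings by pattern, most frequent first."""
    return Counter(string_pattern(value) for value in values)


def string_stats(text):
    """Simple per-string measures: blanks, words, upper and lower case."""
    return {
        "blanks": text.count(" "),
        "words": len(text.split()),
        "upper": sum(ch.isupper() for ch in text),
        "lower": sum(ch.islower() for ch in text),
    }


emails = ["ralf123@hotmail.de", "anna456@example.de", "x@y.org"]
patterns = profile(emails)
stats = string_stats(" Data Cleaner ")
```

The most frequent pattern in `patterns` is a natural starting point for a validation regexp, and the rare patterns are the rows worth inspecting by hand.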

    In another profiler (FEBRL, to name it), you can use distances between words, for the phonetic indexing that is widespread in data quality:
    - soundex, phonex, phonix, metaphone, NYSIIS, etc...
    - block/canopy indexing

    Other distances, such as Jaro-Winkler or Levenshtein, are available here:
    http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
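For readers who have not met these measures, here is a self-contained Python sketch of two of them: a phonetic key (American Soundex) and the Levenshtein edit distance. It is written from the textbook definitions and is not taken from FEBRL or from the library linked above; the function names are my own, and the other measures (phonex, NYSIIS, Jaro-Winkler, ...) are left out.

```python
# Soundex digit for each consonant; vowels, y, h and w have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """American Soundex: first letter plus three digits, e.g. 'R163'."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets the previous code
    return (encoded + "000")[:4]


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]


same_key = soundex("Meyer") == soundex("Meier")
distance = levenshtein("Meyer", "Meier")
```

Blocking on the phonetic key first and computing the more expensive edit distance only within each block is the usual reason for combining the two.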

    All this stuff is indeed string data quality and is not in RapidMiner, except for a few algorithms in TextInput (TF-IDF, cosine distance)

    c.v.
  • reoroman Member Posts: 1 Contributor I

    http://www.slideshare.net/MayurSurani/data-mining-tools-45159317

     

    This makes a clear statement that RapidMiner is the best.

  • LaurieMoseley Member Posts: 3 Contributor I

    I tried to visit the first suggested site for a comparison between WEKA and RapidMiner

     

    http://rapid-i.com/rapidforum/index.php/topic,362.0.html

     

    and got Error 404 file not found. As it was the first link that I had ever followed from the Community, I thought it reasonable to report the problem.

     

    Best wishes for an exciting venture

     

    Laurie

     

     

  • LaurieMoseley Member Posts: 3 Contributor I

    I also tried to follow the link to

     

    http://www.prudsys.de/Service/Downloads/bin/DMC2007_schieder_tuchemnitz.pdf

     

    On that occasion I got

     

    Seite nicht gefunden. (which I translate as "site not found")

     

     

    Keep up the good work, but check some of the links

     

    Best wishes

     

    Laurie

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

    Hi Laurie,

     

    Thanks for reporting.  We migrated from an old forum system to this new community portal a couple of months ago and unfortunately not all links were automatically replaced during this migration process.  So whenever you see a link still going to "...rapid-i.com/...", it is unfortunately not going to work :(

     

    But anyways: Have fun here in the community,

    Ingo
