Towards a Pervasive Data Mining Engine—Architecture Overview

JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
edited November 2018 in Help
I was just reading this paper and thought it an interesting concept & worthy of some discussion. 

They are describing a system which chooses models and runs in the background as a black box for novice users without any intervention. 
It doesn't actually sound as though they are describing anything particularly new. 
In fact, doesn't the ML Wizard extension already do this?  So can the accelerators in 6.5, and so can RapidMiner Server.  (As can many other platforms.)

My worry, and this is not addressed anywhere in the 10 pages of this paper, is the potential danger of such a system, particularly as their use case is hospital data.  What about spurious correlations and bad data inputs?  Their 'black box' decision support system for novices could potentially put lives at risk. 
How can we be sure that the system gets it right without having an expert cast his eye over the created processes? 
What do you guys think of the pros & cons of such a system? 

Personally I would prefer a series of processes created by experts and managed by RM Server. 


    MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,525 RM Data Scientist
    Hi John,

    I had this idea a year ago. As usual with those ideas, Sebastian told me that "RapidMiner" tried this long ago. The project was EU-funded and was called e-lico. See: http://www.e-lico.eu/ The result was, I think, that it is not easy to do this. Maybe it gets better with newer systems.

    I am personally very sceptical when it comes to black boxes. Do you know BigML? Send a dataset and get a model. I am thinking a lot about validation. I think it is way too easy to misuse the system by putting example selection, normalization etc. in front of it. People tend to do this without knowing that it might create a serious bias / validation problem.
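To make the validation problem above concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not something from the paper or from any of the tools mentioned: the data is pure random noise, the "preprocessing" is a simple feature selection step, and the model is a nearest-centroid classifier. All function names and sizes are made up for the example. When the preprocessing sees the whole dataset before cross-validation, the estimate looks far better than it should on data with no signal at all.

```python
import random


def make_noise_data(n_rows, n_cols, rng):
    """Random features and random balanced labels: there is no real signal."""
    X = [[rng.gauss(0, 1) for _ in range(n_cols)] for _ in range(n_rows)]
    y = [i % 2 for i in range(n_rows)]
    rng.shuffle(y)
    return X, y


def centroid(X, members, cols):
    """Mean of the given columns over the given rows."""
    return [sum(X[i][j] for i in members) / len(members) for j in cols]


def top_features(X, y, rows, n_keep):
    """Keep the columns whose class means differ most, using only `rows`."""
    all_cols = range(len(X[0]))
    c0 = centroid(X, [i for i in rows if y[i] == 0], all_cols)
    c1 = centroid(X, [i for i in rows if y[i] == 1], all_cols)
    ranked = sorted(all_cols, key=lambda j: abs(c1[j] - c0[j]), reverse=True)
    return ranked[:n_keep]


def count_hits(X, y, train, test, cols):
    """Nearest-centroid classifier fitted on `train`, scored on `test`."""
    c0 = centroid(X, [i for i in train if y[i] == 0], cols)
    c1 = centroid(X, [i for i in train if y[i] == 1], cols)
    hits = 0
    for i in test:
        x = [X[i][j] for j in cols]
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        hits += int((1 if d1 < d0 else 0) == y[i])
    return hits


def cv_accuracy(X, y, n_folds, n_keep, select_inside_fold):
    rows = list(range(len(X)))
    # Selection on ALL rows has already seen the labels of every test fold.
    leaked_cols = top_features(X, y, rows, n_keep)
    hits = 0
    for fold in range(n_folds):
        test = rows[fold::n_folds]
        train = [i for i in rows if i not in test]
        if select_inside_fold:
            cols = top_features(X, y, train, n_keep)  # training rows only
        else:
            cols = leaked_cols
        hits += count_hits(X, y, train, test, cols)
    return hits / len(rows)


def run(seed):
    """Return (leaky estimate, honest estimate) for one random dataset."""
    rng = random.Random(seed)
    X, y = make_noise_data(n_rows=40, n_cols=1000, rng=rng)
    leaky = cv_accuracy(X, y, n_folds=5, n_keep=10, select_inside_fold=False)
    honest = cv_accuracy(X, y, n_folds=5, n_keep=10, select_inside_fold=True)
    return leaky, honest


leaky, honest = run(seed=0)
print(f"selection before CV: {leaky:.2f}")
print(f"selection inside CV: {honest:.2f}")
```

Since the labels are random, an honest estimate should hover around chance. The only difference between the two numbers is whether the selection step was allowed to look at the rows it is later tested on.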

    Oh, and I am personally convinced that you get superior results by defining a custom-designed performance measure. This is usually not reflected in those machines.
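A small sketch of what a custom performance measure can change, again in plain Python. The labels, the two sets of predictions and the cost values are all hypothetical and chosen only for illustration. The idea is that in a setting like the hospital use case a missed positive may be far more expensive than a false alarm, so a generic measure such as accuracy and a cost-based measure can rank the same two models in opposite order.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def average_cost(y_true, y_pred, cost_false_negative=10.0, cost_false_positive=1.0):
    """Mean misclassification cost per example (lower is better).

    The default costs are assumptions for this example, not values from any study.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            total += cost_false_negative
        elif t == 0 and p == 1:
            total += cost_false_positive
    return total / len(y_true)


# Hypothetical labels: 1 = condition present, 0 = condition absent.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Model A raises no false alarms but misses two positives.
model_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Model B finds every positive but raises three false alarms.
model_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

for name, pred in (("A", model_a), ("B", model_b)):
    print(name, "accuracy:", accuracy(y_true, pred),
          "average cost:", average_cost(y_true, pred))
```

With these made-up numbers, accuracy prefers model A (0.8 against 0.7), while the cost-based measure prefers model B (0.3 against 2.0). An automated system that optimises only the generic measure would pick the model that misses patients.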

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
    JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Yes, it's not easy to do. 

    The MLWizard project had a very nice approach using the meta-data of a dataset to guide model choice.  http://rapid-i.com/rapidforum/index.php/topic,2584.0.html
    The paper on the extension is pretty interesting & it still works in version 7. It's pretty fun to play with the Automatic System Construction, but don't use it on large datasets. 
    MIT recently published a paper on using deep learning to achieve much the same system too, although theirs adds in ETL as well, which is impressive. 

    I agree with you though, I still wouldn't 100% trust it.  There's a nice quote from Alvin Toffler that goes "You can use all the quantitative data you can get, but you still have to distrust it and use your own intelligence & judgement". 
    earmijo Member Posts: 271 Unicorn
    Very interesting discussion. I'm afraid it is a trend that will continue. I use many software packages and in the recent past all of them have offered some operator to perform analysis blindly. For instance,
    • Wolfram Mathematica has a couple of commands, Classify and Predict, that can be used without specifying the algorithm; they will choose it for you.
    • Skytree just made its really nice product available for free and it includes a command Auto-Model.
    • Microsoft is funding a contest that has been running for a while to automate the whole process: https://competitions.codalab.org/competitions/2321
    I think there is a high demand for analytics and a shortage of people who can do it right. This is an answer. But I share your skepticism.

    JEdward: Could you pass the reference to the MIT paper about Deep Learning please?

    JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Here's a link to an article on it: http://www.techtimes.com/articles/96392/20151017/data-science-machine-eliminates-human-intuition-for-big-data-analysis.htm 
    It basically plays Kaggle competitions at the moment. 

    Personally I do see a place for automation & templating to make the process easier, but I think we're quite a way from reliably automating the whole creation process without any human intervention.  The difficulties IBM Watson has had with implementation, and the amount of data it needs to consume, are a good example of this. 