Towards a Pervasive Data Mining Engine—Architecture Overview

JEdward · March 2016

I was just reading this paper and thought it an interesting concept & worthy of some discussion.
http://link.springer.com/chapter/10.1007/978-3-319-31307-8_58

They are describing a system which chooses models and runs in the background as a black box for novice users without any intervention.
It doesn't actually sound as though they are describing anything particularly new.
In fact, doesn't the ML Wizard extension already do this? So can the accelerators in 6.5, and so can RapidMiner Server (as can many other platforms).

My worry, and this is not addressed anywhere in the 10 pages of this paper, is the potential danger of such a system, particularly as their use case is hospital data. What is the risk from spurious correlations and bad data inputs, and could their 'Black Box' decision support system for novices put lives at risk?
How can we be sure that the system gets it right without having an expert cast his eye over the created processes?
What do you guys think of the pros & cons of such a system?

Personally I would prefer a series of processes created by experts and managed by RM Server.

MartinLiebig · March 2016

Hi John,

I had this idea a year ago. As usual with such ideas, Sebastian told me that RapidMiner had tried this long ago. The project was EU-funded and was called e-lico; see http://www.e-lico.eu/. The result, I think, was that it is not easy to do. Maybe it gets better with newer systems.

I am personally very sceptical when it comes to black boxes. Do you know BigML? You send a dataset and get a model. I think a lot about validation. It is far too easy to misuse such a system by putting example selection, normalization, etc. in front of it. People tend to do this without knowing that it can create a serious bias / validation problem.
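
To make the validation point concrete, here is a minimal sketch in plain Python (no RapidMiner or BigML API involved). It assumes a z-score normalization and a simple contiguous k-fold split; the function names and the data values are made up for illustration. It only shows that a scaler fitted before the split has already seen the test rows.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Estimate z-score parameters (mean, std) from the given values only."""
    return mean(values), pstdev(values)

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold = n // k
    for i in range(k):
        start = i * fold
        stop = n if i == k - 1 else start + fold
        test_idx = list(range(start, stop))
        train_idx = [j for j in range(n) if j < start or j >= stop]
        yield train_idx, test_idx

def compare_scalers(data, k):
    """For each fold, return (honest_params, leaky_params)."""
    leaky = fit_scaler(data)  # fitted once on ALL rows, test rows included
    results = []
    for train_idx, _test_idx in k_fold_indices(len(data), k):
        honest = fit_scaler([data[j] for j in train_idx])  # training rows only
        results.append((honest, leaky))
    return results

# Made-up values; 100.0 is an outlier that lands in the last test fold.
data = [1.0, 2.0, 3.0, 4.0, 100.0, 5.0]
for fold, (honest, leaky) in enumerate(compare_scalers(data, 3)):
    print(f"fold {fold}: honest mean={honest[0]:.2f}, leaky mean={leaky[0]:.2f}")
```

In the last fold the scaler fitted inside the fold never sees the outlier, while the one fitted up front already has. Any performance estimate built on the up-front version is computed on test rows that influenced the preprocessing.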

Oh, and I am personally convinced that you get superior results by defining a custom-designed performance measure. This is usually not reflected in those machines.
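
As a rough illustration of what a custom performance measure can change, the sketch below compares plain accuracy with a cost-weighted error. The cost values (a missed positive costing ten times a false alarm), the labels, and both "models" are assumptions invented for this example, not taken from the paper or from any tool.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def expected_cost(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    """Average cost per example; fn_cost and fp_cost are assumed values."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            total += fn_cost  # missed positive (false negative)
        elif t == 0 and p == 1:
            total += fp_cost  # false alarm (false positive)
    return total / len(y_true)

# Made-up labels: 2 positives out of 10 cases.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # always negative, misses both positives
model_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # finds both positives, 3 false alarms

for name, pred in (("A", model_a), ("B", model_b)):
    print(name, accuracy(y_true, pred), expected_cost(y_true, pred))
```

Accuracy ranks model A higher, while the cost-weighted measure ranks model B higher. An automatic system that optimizes the default measure would pick the model that misses every positive case.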

~Martin

JEdward · March 2016

Yes, it's not easy to do.

The MLWizard project had a very nice approach using the meta-data of a dataset to guide model choice. http://rapid-i.com/rapidforum/index.php/topic,2584.0.html
The paper on the extension is pretty interesting, and it still works in version 7. It's pretty fun to play with the Automatic System Construction, but don't use it on large datasets.
MIT recently published a paper on using deep learning to achieve much the same system too, although theirs adds in ETL as well, which is impressive.

I agree with you though, I still wouldn't 100% trust it. There's a nice quote from Alvin Toffler that goes "You can use all the quantitative data you can get, but you still have to distrust it and use your own intelligence & judgement".

earmijo · March 2016

Very interesting discussion. I'm afraid it is a trend that will continue. I use many software packages and in the recent past all of them have offered some operator to perform analysis blindly. For instance,

Wolfram Mathematica has a couple of commands Classify and Predict that can be used without specifying the algorithm and they will choose it for you.
Skytree just made its really nice product available for free and it includes a command Auto-Model.
Microsoft is funding a contest that has been running for a while to automate the whole process: https://competitions.codalab.org/competitions/2321

I think there is a high demand for analytics and a shortage of people who can do it right. This is an answer. But I share your skepticism.

JEdward: Could you pass the reference to the MIT paper about Deep Learning please?

JEdward · March 2016

Here's a link to an article on it: http://www.techtimes.com/articles/96392/20151017/data-science-machine-eliminates-human-intuition-for-big-data-analysis.htm
It basically plays Kaggle competitions at the moment.

Personally I do see a place for automation & templating to make the process easier, but I think we're quite a way from reliably handling the whole creation process without any human intervention. The struggles IBM Watson has had with implementation, and the amount of data it needs to consume, are a good example of this.
