By: Nithin Mahesh
My name is Nithin Mahesh, and I just finished my sophomore year at the University of Massachusetts Amherst studying Informatics Data Science. I recently took some classes on R programming and introductory statistics, so getting an internship at RapidMiner was a great way to gain some experience in my field!
I am currently interning on the marketing team for the summer, working on a variety of projects involving the product, RapidMiner Studio. One of the first tasks I was given was to download the software and sign up. Part of my job was to understand the process for new RapidMiner Studio users and suggest ways to improve how users navigate, get help, and work with the product.
I was given the KDD Cup 2009 data set. The KDD Cup is essentially a competition created by the leading professional organization of data miners. Many of the top companies participate, including Microsoft, IBM Research, and many more, using their own machines and data mining techniques. The large data set consisted of 50,000 rows and 15,000 attributes with mostly numerical and nominal data, but it also included some missing values. The small set consisted of 50,000 rows and 230 attributes and contained some missing values as well.
The data set was taken from the French telecom company Orange and comes from their large marketing database. The challenge of Orange’s data is that one must be able to deal with a very large database containing noisy data, unbalanced class distributions, and both numerical and categorical data. The competition task was to predict three things about each customer: churn (whether they will switch providers), appetency (whether they will buy new products or services), and up-selling (whether they will buy upgrades or add-ons). Results were evaluated by the Area Under the Curve (AUC). The main objective was to make good predictions for these three target variables. The predictions can then be displayed in a confusion matrix, which shows the number of examples falling into each possible outcome.
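To make the two evaluation ideas above concrete, here is a minimal sketch in plain Python. It is not how the competition or RapidMiner computes these numbers; it is only an illustration. The function names, the 0.5 threshold, the label encoding (1 for a churner, -1 otherwise), and the toy data are all assumptions made up for this example.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by comparing every
    positive example with every negative example.

    labels: 1 for the positive class (assumed encoding), anything else is negative.
    scores: model outputs, where higher means "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # A pair counts as 1 if the positive is scored higher, 0.5 on a tie, 0 otherwise.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_matrix(labels, scores, threshold=0.5):
    """Count the four possible outcomes at a given score threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1:
            counts["tp" if predicted_positive else "fn"] += 1
        else:
            counts["fp" if predicted_positive else "tn"] += 1
    return counts


# Toy data invented for illustration only (not taken from the Orange data set).
toy_labels = [1, -1, 1, -1, -1, 1]
toy_scores = [0.9, 0.4, 0.35, 0.2, 0.6, 0.8]

toy_auc = auc(toy_labels, toy_scores)
toy_cm = confusion_matrix(toy_labels, toy_scores, threshold=0.5)
print("AUC:", toy_auc)
print("Confusion matrix:", toy_cm)
```

One thing this sketch shows is that AUC does not depend on a threshold, while the confusion matrix does: moving the threshold changes the four counts but leaves the AUC unchanged. The pairwise loop is fine for a handful of rows but would be slow on 50,000 examples, where a rank-based calculation is the usual choice.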
There were two types of winners, those of the slow challenge and those of the fast challenge, since KDD released both a large and a small data set. The slow challenge was to achieve results on either the large or the small data set before the deadline, and the fast challenge was to submit within five days of the release of the training labels. In the fast challenge, IBM Research took the lead, followed by ID Analytics Inc and then Old Dogs with New Tricks. In the slow challenge, the University of Melbourne came first, followed by Financial Engineering Group Inc Japan and National Taiwan University Computer Science and Information Engineering. IBM Research’s AUC for churn ended up being 0.7611, which is what I’d be comparing my results to.
Orange Labs already has their own customer analysis platform capable of building prediction models with a very large number of input variables. Their platform implements a variety of features, such as processing methods for instance and variable selection, variable selection regularization, and model averaging. Orange’s platform can scale to very large data sets with hundreds of thousands of instances and thousands of variables, and the goal of the KDD challenge was to beat their in-house system.
In my next post, I will talk about how I began to learn RapidMiner, starting with how to prep the data.