By: Nithin Mahesh
In my last post, I talked about how I began prepping my data, some of the operators I used, and some of the issues I ran into. In this post, I will talk briefly about my results and some of the most useful features in RapidMiner Studio.
As I mentioned last week, I ended up not getting the results I wanted, even after running different validations on my data, including cross validation. I kept getting an AUC of about 0.519, which is barely better than random guessing (0.5) and far below the 0.7611 that IBM Research achieved.
There are a couple of small things I wish I had done before jumping into the data set. Signing up for the workshop earlier would have been helpful; it gave me a nice review of how to import, prep, model, and interpret my data. The instructor was good at answering any questions I had, and it helped a lot that the workshop was interactive. I also picked up a lot of simple productivity features, such as how to disable unused operators or how to organize my operators so they weren’t all over the screen. Another feature I learned about later was background process execution, which is worth a look for commercial users: it lets you keep working on other processes while some run in the background.
Viewing your data table can also be a challenge at times. If you’ve ever run any processes in RapidMiner, you have probably run into the issue of not being able to view your data table after closing the results or the application, or after running another process (shown below).
It took me a bit to realize that breakpoints let you view the table at any operator in the process, which is really useful for debugging and for viewing changes to your data set. You can set one by right-clicking on an operator, as shown below:
After a lot of data prep, I ran some models and validations, as mentioned in the last post. The problem was that my data prep was very process intensive, and despite having access to all the cores of my computer, I ran into hours of loading time before even getting to my models. I learned later that there is a way to cut down the time using the multiply and store operators: I essentially took a copy of my data prep output (multiply) and then saved it (store). I then created a new process in which I used a retrieve operator to load the prepared data. In my new process, I could run my cross validation and models without having to rerun all my data prep, which saved me hours of waiting time. One thing to note is that any time I changed a parameter in data prep, I had to run that process again so that the model process picked up the change.
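The store and retrieve idea is not specific to RapidMiner. As a rough analogy only, here is a minimal Python sketch of the same pattern: run the slow prep once, save the result to disk, and reload it in later runs. The function names, the toy prep step, and the file name are all hypothetical and are not part of RapidMiner.

```python
import pickle
import tempfile
from pathlib import Path


def expensive_prep(rows):
    """Hypothetical stand-in for a slow data prep step."""
    return [[0.0 if value == "?" else float(value) for value in row] for row in rows]


def load_or_prepare(cache_path, raw_rows):
    """Return (prepared_rows, came_from_cache)."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Analogous to Retrieve: reuse the stored result instead of recomputing.
        with cache_path.open("rb") as handle:
            return pickle.load(handle), True
    prepared = expensive_prep(raw_rows)
    # Analogous to Store: persist the result for later runs.
    with cache_path.open("wb") as handle:
        pickle.dump(prepared, handle)
    return prepared, False


raw = [["1", "?"], ["3", "4"]]
cache_file = Path(tempfile.mkdtemp()) / "prepared.pkl"

first, first_from_cache = load_or_prepare(cache_file, raw)     # computes and stores
second, second_from_cache = load_or_prepare(cache_file, raw)   # loads from disk
```

The same caveat applies here as in my process: if the prep logic changes, the stored file is stale and has to be deleted or regenerated before the modeling step sees the change.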
This brings me to another important feature to keep in mind: the logs. With a large data set, some of the validations I ran would take too long to finish. I would wait for hours just to get an error telling me the computer was out of memory. I eventually found the logs, under view then show panel, which showed me warnings and errors while the process was running, so I no longer had to wait for the process to fail on its own.
The help tab in RapidMiner Studio is another useful resource. It gives a nice overview of the parameters and their functions for any of the hundreds of operators available. The documentation includes links to hands-on tutorials right in RapidMiner under the Tutorial Process. RapidMiner’s Wisdom of Crowds feature was also useful within Studio. It was great for finding the operators best suited to a task, especially when I was unsure of what to use. The community page was the next best resource. Any specific questions I had were either covered in past posts or I could make my own post. The response time to the questions I posted was quick as well!
In my next post, I will talk about my end results and what I did to my data prep to finally get the AUC I was looking for.
By: Nithin Mahesh
In my last post, I introduced my work at RapidMiner for this summer and the churn data science project I was given to work on. I touched briefly on the data set provided by the KDD Cup 2009, which comes from the large marketing database of the French telecom company Orange. In this post, I will talk about how I began to learn RapidMiner, starting with my data prep.
The first challenge was figuring out which models and prep were needed, since the majority of the data contained numeric values. Categorizing them was hard! Here is the data I was given once I uploaded it into RapidMiner Studio:
Before attempting any data prep, I opened the RapidMiner Studio tutorials, found under the create process button and then learn, as shown below:
This was useful for understanding how to navigate the software and the basic flow necessary to run analytics on a data set. The RapidMiner website also provided some good introductory getting started videos that were very useful. After playing around with several tutorials, I had a basic knowledge of how things worked and began to plan how to prep this data set. In the same create new process section, there are many templates that can be run to see examples of the analytics you may use on your data. In my case I was looking at customer churn and found a template that analyzes a data set to predict whether a customer will churn (true) or not (false).
After going through some tutorials, I was ready to import the data and began by clicking the add data button, but ran into some errors. I found that the read csv operator was much more powerful and ended up using it instead, despite the data having an odd file type (.chunk). Initially I had some issues with how the data was being spaced and realized this was because I had not configured the right parameter in the import wizard. After getting the data in and connecting it to the results port, I started to plan how to organize the random numeric values. The first thing I noticed was that the set contained many missing values, indicated by “?”, so I needed the replace missing values operator. I first set its parameters to replace these values with zero, and later changed this to replace them with the average instead.
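To make the difference between the two replacement strategies concrete, here is a small Python sketch of what replacing “?” with zero versus the column average does to a table. It only illustrates the idea and is not how the replace missing values operator is implemented; the function name and the toy rows are made up.

```python
def impute_missing(rows, missing="?", strategy="mean"):
    """Replace missing markers with 0.0 or with the column mean of the present values."""
    n_cols = len(rows[0])
    fills = []
    for col in range(n_cols):
        present = [float(row[col]) for row in rows if row[col] != missing]
        if strategy == "zero" or not present:
            # Fall back to zero when asked to, or when a column has no values at all.
            fills.append(0.0)
        else:
            fills.append(sum(present) / len(present))
    return [
        [fills[col] if row[col] == missing else float(row[col]) for col in range(n_cols)]
        for row in rows
    ]


rows = [["1", "?"], ["3", "10"], ["?", "20"]]
by_mean = impute_missing(rows)                    # [[1.0, 15.0], [3.0, 10.0], [2.0, 20.0]]
by_zero = impute_missing(rows, strategy="zero")   # [[1.0, 0.0], [3.0, 10.0], [0.0, 20.0]]
```

Filling with zero can drag a column toward a value that never occurs in the real data, which is one reason the average is often the gentler default.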
I then downloaded and imported the label data using the add data button, then joined the data as shown below:
This gave me an error, since I needed to create a new column before the label data could be added in. I tried using the append operator, which I learned after some playing around is for merging example sets. I eventually found out that the generate ID operator is the right one to use to create the new column.
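The idea of generating an ID column and then joining on it can be sketched in a few lines of Python. This assumes the feature rows and the label rows are in the same order, so numbering both gives matching keys. The helper names and sample values are hypothetical and only mirror the concept, not the RapidMiner operators themselves.

```python
def generate_id(rows):
    """Add a 1-based 'id' column to each row, in the order given."""
    return [{"id": index, **row} for index, row in enumerate(rows, start=1)]


def inner_join(left, right, key="id"):
    """Combine rows from both tables that share the same key value."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**left_row, **right_by_key[left_row[key]]}
        for left_row in left
        if left_row[key] in right_by_key
    ]


features = [{"Var1": 1.0}, {"Var1": 2.5}]
labels = [{"churn": -1}, {"churn": 1}]

joined = inner_join(generate_id(features), generate_id(labels))
```

Each joined row now carries its features and its label side by side, which is the shape a model needs for training.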
When preparing data, it is useful to split it into train and test sets, so that we can train and tune the model with the training set. Once this is done, we can use the test set to check how well the trained model generalizes to data it has not seen before. One thing I was unaware of is that RapidMiner Studio contains many operators that combine multiple steps into a single one. The sub process panels are the most powerful part of these operators, allowing me to perform a lot of analytics all in one go as shown below:
Splitting data into test and train sets, running models, and measuring performance can all be done using a single operator. One operator can cover what would take a couple of separate steps in R. In my case, cross validation seemed like the best option, since I needed to split my data into train and test sets and then check my model’s accuracy, precision, and performance.
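For readers who have not seen cross validation spelled out, here is a minimal Python sketch of k-fold cross validation. To keep it self-contained, the "model" is just a majority-class predictor, which is an assumption for illustration and not a model I used. The function names and toy labels are likewise made up.

```python
import random
from collections import Counter


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[start::k] for start in range(k)]


def cross_validate(labels, k=5, seed=0):
    """Mean accuracy of a majority-class predictor over k folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        # "Training": remember the most common label in the training folds.
        majority = Counter(train_labels).most_common(1)[0][0]
        # "Testing": score that prediction on the held-out fold only.
        correct = sum(1 for i in test_idx if labels[i] == majority)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


labels = [0] * 8 + [1] * 2
folds = k_fold_indices(len(labels), k=5)
mean_accuracy = cross_validate(labels, k=5)
```

Every row lands in exactly one test fold, so each example is used for testing once and for training in the other rounds. Note that with unbalanced labels like these, a model that always predicts the majority class still scores high on accuracy, which is part of why the competition used AUC instead.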
At this point in my data prep, I ran into a couple of problems that come along with using new software for the first time. Since I was working with such a large data set and cross validation is a big process, I did not have enough memory to run it. When this occurs, it’s best to either narrow down the number of operators used or try to reduce the number of attributes. Using the remove useless attributes operator, I could cut down some features that were not being used. Some other useful operators were the free memory and the filter examples operators.
In my next post, I will talk about my results (or lack thereof), what issues I faced, how I went about solving them, and what I found to be the most useful features in RapidMiner Studio.
By: Nithin Mahesh
My name is Nithin Mahesh, and I just finished my sophomore year at the University of Massachusetts Amherst studying Informatics Data Science. I recently took some classes on R programming and introductory statistics, so getting an internship at RapidMiner was a great way to gain some experience in my field!
I am currently interning on the marketing team for the summer working on a variety of projects involving the product, RapidMiner Studio. One of the first tasks I was given was to download and sign up on the software. Part of my job was to understand the process for new RapidMiner Studio users and help provide suggestions on how we can improve how users navigate, get help, and work with the product.
I was given the KDD Cup 2009 data set. The KDD Cup is essentially a competition created by the leading professional organization of data miners. Many of the top companies participate, including Microsoft, IBM Research, and many more, using their own machines and data mining techniques. The large data set consisted of 10,000 rows and 15,000 attributes with mostly numerical and nominal data, but also included some missing values. The small set consisted of 50,000 rows and 230 attributes, and contained some missing values as well.
The data set was taken from the French telecom company Orange and is from their large marketing database. The challenge of Orange’s data is that one must be able to deal with a very large database containing noisy data, unbalanced class distributions, and both numerical and categorical data. The competition task was to predict customer churn, appetency, and up-selling, with the results evaluated by the Area Under the Curve (AUC). The main objective was to make good predictions of these three target variables. The predictions can then be displayed in a confusion matrix to show the number of examples falling into each possible outcome.
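Since AUC and the confusion matrix come up throughout these posts, here is a small Python sketch of both. It uses the pairwise definition of AUC: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The labels (1 for churn, 0 otherwise), the scores, and the 0.5 threshold are made-up illustration values, not competition data.

```python
def auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y != 1]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def confusion_matrix(labels, scores, threshold=0.5):
    """Count true/false positives and negatives at a given score threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            counts["tp"] += 1
        elif predicted_positive:
            counts["fp"] += 1
        elif y == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]

example_auc = auc(labels, scores)            # 3 of 4 pairs ranked correctly: 0.75
matrix = confusion_matrix(labels, scores)    # one example in each cell
```

An AUC of 0.5 means the scores rank churners no better than chance, and 1.0 means every churner is ranked above every non-churner. Unlike the confusion matrix, AUC does not depend on choosing a threshold.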
There were two types of winners, those of the slow challenge and those of the fast challenge, since KDD released both a large and a small data set. The slow challenge was to achieve results on either the large or small data before the deadline, and the fast challenge was a submission within five days of the release of the training labels. In the fast challenge, IBM Research took the lead, followed by ID Analytics Inc and then Old Dogs with New Tricks. In the slow challenge, University of Melbourne came first, followed by Financial Engineering Group Inc Japan and National Taiwan University Computer Science and Information Engineering. The AUC for churn achieved by IBM Research ended up being 0.7611, which is what I’d be comparing my results to.
Orange Labs already has their own customer analysis platform capable of building prediction models with a very large number of input variables. Their platform implements a variety of features, such as processing methods for instance and variable selection, variable selection regularization, and model averaging. Orange’s platform can scale to very large data sets with hundreds of thousands of instances and thousands of variables, and the goal of the KDD challenge was to beat their in-house system.
In my next post, I will talk about how I began to learn RapidMiner, starting with how to prep the data.