In my last post, I talked a little bit about my results and some of the most useful features of RapidMiner. In this post, I will talk about how I improved my AUC well beyond the 0.519 I was getting before.
After struggling with this data set for a couple of weeks, I was given some advice by @IngoRM, who told me to double-check my data import. Always take your time with data prep before jumping into models and validations. It took me a while to realize that I had imported my label data incorrectly, which is what gave me the bad AUC: the Retrieve operator automatically took the first row as the header, which shifted every label up a row. Once I fixed that, all I had to do was add a few Sample operators to balance the data, which was heavily skewed toward the -1 label. The data prep was minimal; I just connected it to a Cross Validation operator, which, as I mentioned before, ran my model (gradient boosted trees), split my data automatically, and tested the performance. After these simple changes, my results ended up at an AUC of 0.736, compared to the winning IBM result of 0.7611. This shows how powerful RapidMiner really is: with minimal prep I could try out different models, run them, and test them to see which gave me the best AUC. I attached some screenshots of the results below:
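The same off-by-one header mistake is easy to reproduce outside RapidMiner. Here is a minimal pandas sketch (toy data, not the competition set) showing how reading a headerless file with the default header inference silently promotes the first example to column names and misaligns everything after it:

```python
import io
import pandas as pd

# A small headerless CSV: two features plus a label column (no header row).
raw = "1.0,2.0,-1\n3.0,4.0,1\n5.0,6.0,-1\n"

# Wrong: the first DATA row is treated as the header, so one example
# disappears and the remaining rows are read against bogus column names.
wrong = pd.read_csv(io.StringIO(raw))

# Right: tell the parser there is no header and supply names explicitly.
right = pd.read_csv(io.StringIO(raw), header=None,
                    names=["f1", "f2", "label"])

print(len(wrong), len(right))  # 2 3
```

Spotting a length mismatch like this early (row counts, label distributions) is the cheap sanity check that would have saved me those two weeks.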
Another important thing to keep in mind was how imbalanced the data was; most of my data prep centered on balancing it better. I tried a number of things before settling on the Sample operator, including feature selection and reducing the number of attributes. In the end, since the gradient boosted tree was the most powerful model and handled my many missing values, I kept it and cut down on my prep.
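For readers who want the equivalent outside RapidMiner: downsampling the majority class, roughly what per-class Sample operators do, can be sketched in a few lines of NumPy. The class ratio below is made up for illustration, not the real data set's skew:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels heavily skewed toward -1, mimicking the data set's bias.
y = np.array([-1] * 900 + [1] * 100)
X = rng.normal(size=(len(y), 3))

# Downsample the majority class (-1) to the minority class size.
minority = np.flatnonzero(y == 1)
majority = rng.choice(np.flatnonzero(y == -1),
                      size=len(minority), replace=False)
keep = np.concatenate([minority, majority])

X_bal, y_bal = X[keep], y[keep]
print(len(y_bal), (y_bal == 1).sum(), (y_bal == -1).sum())  # 200 100 100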
After a final discussion with Ingo to go over my final models vs his models I learned that I didn’t necessarily need to worry about the data being balanced but focusing on optimizing the AUC since that’s what the results were based on. I was under the impression that getting better balance and precision within the confusion matrix would better my AUC. Unfortunately, being early in my data science career I am also still learning!
However, after working on this project I have come to realize that RapidMiner is great in optimizing time spent building models and running analysis. The software is very efficient in that sense, if I were to recreate this in R or python, the same tasks would have taken much longer, maybe even a couple weeks more.
Ingo had mentioned that a lot of the competitors in the KDD Cup spent months on this data set simply due to the number of models they ran. Using ensemble modeling the cup contestants would have run hundreds of models optimizing/training each one to get that 0.7611. In that way RapidMiner is impressive, I could take about a half an hour (not accounting the time I spent learning the product) and end up with a result of 0.736 AUC.
Thanks for following my last couple of blog posts, I hope I provided some useful tips that will aid in unleashing all of RapidMiner Studio’s full potential!