RapidMiner

RapidMiner Data Modeling CHALLENGE - $200 in cash and prizes

Community Manager


Hello RapidMiners -

 

I thought it would be fun (and useful) if we had some Kaggle-like challenges here on the community forum, so I am sponsoring the very first RapidMiner Data Modeling Challenge. :-)  This is a real training data set that is in need of a good model.  Unlike the classic iris data set, it has missing data, errors, etc.  Welcome to the real world.  Here's the challenge:

 

Goal: produce a model in RapidMiner 7.5 that will predict the label attribute given prior data in the series of the attached training set "RMChallengeDataSet" with the highest accuracy.  This will be verified via the SLIDING WINDOW VALIDATION operator.  As it is a series of dates over an 18+ year span and no one wants to sit and watch their computer spin forever, I suggest the following parameters:

 

   training window width: 1000 (about three years' worth)

   training window step size: 3 (to cut down on iterations)

   test window width: 1 (I only want one day at a time)

   horizon: 1 (I want the next day)

   cumulative training: yes

   average performances only: yes
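
For anyone who thinks better in code, here is a rough Python sketch of how I read these parameters. It is illustrative only (entries still have to be RapidMiner processes) and it is not the operator's actual implementation: the function name `sliding_windows` and the exact index arithmetic are my assumptions, in particular that horizon 1 means "test on the example right after the last training example".

```python
from typing import Iterator, Tuple


def sliding_windows(
    n_examples: int,
    train_width: int = 1000,
    step: int = 3,
    test_width: int = 1,
    horizon: int = 1,
    cumulative: bool = True,
) -> Iterator[Tuple[range, range]]:
    """Yield (train_indices, test_indices) pairs over a time-ordered series.

    Assumption: horizon=1 places the test window directly after the
    last training example; larger horizons skip horizon-1 examples.
    """
    start = 0
    while True:
        train_end = start + train_width        # exclusive end of training window
        test_start = train_end + horizon - 1   # first test index
        test_end = test_start + test_width     # exclusive end of test window
        if test_end > n_examples:
            break
        # Cumulative training keeps every example seen so far.
        train_start = 0 if cumulative else start
        yield range(train_start, train_end), range(test_start, test_end)
        start += step


windows = list(sliding_windows(6726))
first_train, first_test = windows[0]
print(len(windows))                        # 1909 under the assumptions above
print(len(first_train), list(first_test))  # 1000 [1000]
```

Under this reading each iteration trains only on days before the one being predicted, and the step size of 3 cuts the number of iterations to roughly a third of what a step of 1 would need.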

 

It is a SERIES - every day from 1968 to 1986 - with 6726 examples and 262 numerical attributes.  The label is an A/B/C selection.  You are welcome to do any feature selection, adding of attributes, etc. and use any model(s) as long as it's within RapidMiner and its publicly available extensions.  No scripting or APIs allowed.  The data are 1:1 hashed to protect the identity of the source - please do not try to reverse-engineer.

 

Winner: the winner of the competition is the one who can produce the highest accuracy % ≥ 60 as shown with the standard Performance operator within the Sliding Window Validation.  Why 60?  Because that's the highest I have gotten so far [honest disclaimer: I actually only got 60% accuracy with A/B labels but I know you are all smarter than I am...]

 

Submission: all submissions for this challenge must be in THIS THREAD so it is open for all to see.  All you need to do is submit your process XML as a reply to this message (please use the "insert code" item so it does not get long) AND a screenshot of your performance.  You can post as many submissions as you want (within reason).

 

Determination of winner: Hopefully the community will all agree on the winner (all submissions are public) but in case of some drama, I will be the sole judge and will verify the winner's submission.  If there is more than one identical (and highest) accuracy, the one which was submitted first will be the winner.

 

Who can enter: anyone who is a registered user on the RapidMiner Community Forum.  Yes even you, @IngoRM!

 

Due date: all entries must be posted in this forum by June 15, 2017 at 21:00 EST.

 

Notification: I will give myself three days to independently verify the winner and then post to this thread.  I will then PM the winner to get a mailing address and mail a check for $100!

 

Good luck!


Scott

 

Scott Genzer
Senior Community Manager
RapidMiner, Inc.

32 REPLIES
RMStaff

Re: RapidMiner Data Modeling CHALLENGE - $100 prize


Hey RapidMiners,

 

First of all, let's thank Scott for his initiative here!  This is really appreciated and will be a fun challenge!  And this indeed is a challenge: I tried some first models in the last 15 minutes and I am still very far from the 60% accuracy threshold - but I will get there :-P

 

RapidMiner and I personally would like to support this initiative.  Therefore, we will match the $100 prize money with a $100 Amazon Gift Card, so we now have a total of $200 in prizes in the pool.  So better fire up your RapidMiner and show us your modeling skills :-D

 

Much success to all participants, and let us know from time to time where you are and if you have questions or ideas.

 

Have fun,

Ingo

 

 


How to load processes in XML from the forum into RapidMiner: Read this!
RMStaff

Re: RapidMiner Data Modeling CHALLENGE - $100 prize

Since RapidMiner is now donating a prize as well, I can't really participate any longer.  Or maybe I could use an anonymous account ;-)  Anyway, I will still try a bit to see where I can get to...

So here is a quick update: I am now at 45% accuracy which is still far away from your 60%.  This is a good challenge indeed!  I did not really optimize the model itself but focused on feature selection first...  Let's see what else we can do :-)


Community Manager

Re: RapidMiner Data Modeling CHALLENGE - $100 prize

Well done @IngoRM! Especially since I admitted that I got 60% from an A/B label, rather than A/B/C here. And I've been working on it on-and-off for three weeks. 😊 Tell you what - if an RM employee wins, I will ship a quart of my homemade maple syrup to the office where s/he works. Fair?
Scott Genzer
Senior Community Manager
RapidMiner, Inc.
RMStaff

Re: RapidMiner Data Modeling CHALLENGE - $100 prize

Oh, I missed the A/B vs. A/B/C part.  Then I don't feel too bad with my 45% :-)

You probably don't know it but I am a huge fan of maple syrup.  So I will support the people in the Boston office (sorry London, Dortmund, Budapest :-))  You got a deal here!


Elite III

Re: RapidMiner Data Modeling CHALLENGE - $100 prize

Yes, 60% accuracy is much easier to achieve when trying to predict 2 classes versus 3 classes! ;-)

I'm wondering, since there is no holdout/test set, how much "sample tuning" is allowed here.  For instance, after some exploratory analysis it is obvious that some attributes are missing for large portions of the date ranges.  Is it acceptable then to partition the examples into date ranges and build different models on different attribute subsets based on date range availability?

 

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts

Re: RapidMiner Data Modeling CHALLENGE - $100 prize

Yay, a RapidMiner Challenge! As a former RapidMiner employee, do I qualify for maple syrup? ;-)

 

As this is time series data, one would like to be allowed to fill missing values with the latest known value or, e.g., for attributes 30-49, use the weekly value.

This is impossible to do inside the cross validation if you use shuffled sampling (which is the default of the X-Validation for classification problems if set to automatic). On the other hand, if you do it before the X-Validation, you leak information from the test data to the training data. Maybe we should change the rules to use Sliding Windows validation? Or do you want to get predictions day by day, without using information from previous data points?
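
To make the "latest known value" idea concrete, here is a small Python sketch. It is an illustration only, not something to submit, and the function name `forward_fill` is made up for this example. Each gap is filled from an earlier row, never a later one. As far as I can tell, the leak comes from shuffled sampling: a filled row can end up in training while the row it was copied from ends up in testing. With a time-ordered validation, training rows come before the test row, so that should not happen.

```python
from typing import List, Optional


def forward_fill(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing value (None) with the most recent known value.

    Leading gaps stay None because there is no earlier value to copy.
    """
    filled: List[Optional[float]] = []
    last_known: Optional[float] = None
    for value in values:
        if value is not None:
            last_known = value      # remember the latest observed value
        filled.append(last_known)   # gaps inherit it; known values pass through
    return filled


series = [None, 1.5, None, None, 2.0, None]
print(forward_fill(series))  # [None, 1.5, 1.5, 1.5, 2.0, 2.0]
```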

 

Assuming we can use values of earlier days to fill missing values etc. before the validation, 60% accuracy on A/B/C is achievable when the performance is estimated via shuffled-sampling cross validation :-)

Community Manager

Re: RapidMiner Data Modeling CHALLENGE - $100 prize

All good points, Marius.  Here are my thoughts...

 

- Anyone who wants maple syrup in lieu of the $100 prize is welcome to it.

 

- When I did the modeling, I imputed the missing series data before the validation.  I did not think about the "leaking" factor or about using Sliding Windows validation instead of X-Validation.  As the goal is to be able to predict the labeled column and use all previous data points, I would say YES, we need to change the rules to Sliding Windows validation.

 

I look forward to seeing your submission, Marius!


Scott

 

 

Community Manager

Re: RapidMiner Data Modeling CHALLENGE - $100 prize


Hi Brian -

 

All good points and yes, there are large gaps of missing examples for all sorts of reasons. :-)  But I would say no, it is not OK to partition.  As it is a series, the goal is to predict the label with dates moving forward using the historical information (e.g. predict the label for a date, given all prior data in the series).  As Marius pointed out, I believe a more valid way to show performance is with the Sliding Window validation instead of X-Validation. This is my error but I think he's right.

 

Make sense?  GAME ON!


Scott

Community Manager

Re: RapidMiner Data Modeling CHALLENGE - $100 prize

**NOTE** I have pondered this a bit and suggest the following changes to the rules: 

 

- Goal: produce a model in RapidMiner 7.5 that will predict the label attribute given prior data in the series of the attached training set "RMChallengeDataSet" with the highest accuracy.  This will be verified via the SLIDING WINDOW VALIDATION operator.  As it is a series of dates over an 18+ year span and no one wants to sit and watch their computer spin forever, I suggest the following parameters:

 

   training window width: 1000 (about three years' worth)

   training window step size: 3 (to cut down on iterations)

   test window width: 1 (I only want one day at a time)

   horizon: 1 (I want the next day)

   cumulative training: yes

   average performances only: yes

 

These are rather unusual parameters but I think they make sense (at least to me).  One thing I have found immediately is that the Sliding Window validation is not parallelized - it takes a while to go through the iterations given these parameters with most models.

 

As this is a FUN competition, and a great way to learn from one another, please give feedback if there is a better way to do this.  If people concur, I will make the edits in the initial post.

 

Ok off to bed!

 

Scott

 

 
