RapidMiner

RapidMiner Data Modeling CHALLENGE - $200 in cash and prizes

Community Manager

Re: RapidMiner Data Modeling CHALLENGE - $100 prize

Well done Dan! Can't wait to see what you did.

Martin - yes, it's a lot of steps. When I tried some simple modeling (e.g. Naive Bayes) it didn't take too long, but anything else took a while. I assume it's slow because the sliding window validation is not parallelized. I always keep a keen eye on my CPU/memory usage when I'm running big models like that, and this validation does not push my 6-core machine the way X-Validation does. Feature request? :)
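
For readers unfamiliar with the scheme, here is a minimal Python sketch of how a sliding window validation carves time-ordered rows into train/test windows. The function name, parameter names and toy sizes are assumptions for illustration only; they are not the challenge settings and not RapidMiner's implementation. Each window can be scored without reference to the others, which is why parallelizing the loop is plausible in principle.

```python
from typing import Iterator, Tuple


def sliding_windows(
    n_rows: int, train_size: int, test_size: int, step: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_rows, test_rows) index ranges over time-ordered data.

    Hypothetical helper: each test window sits directly after its
    training window, and the pair moves forward by `step` rows.
    """
    start = 0
    while start + train_size + test_size <= n_rows:
        train_rows = range(start, start + train_size)
        test_rows = range(start + train_size, start + train_size + test_size)
        yield train_rows, test_rows
        start += step


# Toy sizes, chosen only to keep the example small.
windows = list(sliding_windows(n_rows=20, train_size=10, test_size=2, step=4))
for train_rows, test_rows in windows:
    print(f"train {train_rows.start}-{train_rows.stop - 1}, "
          f"test {test_rows.start}-{test_rows.stop - 1}")
```

A smaller step means more windows and therefore more models to train, which matches the long runtimes reported in this thread.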

Scott
Scott Genzer
Senior Community Manager
RapidMiner, Inc.
Contributor


Hi Martin,

Yes, I am using the sliding window validation parameters @sgenzer proposed. It runs about 1925 iterations in about 14-17 minutes for me, or about 10-12 minutes on my 24-core, 64GB RAM desktop. I was toying with the idea of running it on my Hadoop cluster for more speed, but that's probably overkill. Not sure what you mean by 'comm training'? Perhaps I missed something. I am getting up to 67.8 on my grid search parameter tuning, but that's with only an 80/20 split, so there is some overfitting.

 

Will post my full code before I go on vacation next Saturday so others can recreate it.

Dan

Contributor


Here is my best model so far.  I've posted my code for others to use.  

 

My first big hint: use the new Gradient Boosted Trees algorithm from H2O (similar to the popular XGBoost package). It's wicked fast and frankly is my go-to algo these days.

 

[Image: rm top.png]

My code was too big for the insert code window, so I attached a docx file. I'm sure someone will have a good idea how to reduce the size of the preprocessing part (it needs 2 loops that I couldn't quite get right).
Dan

 


RM Staff


Dear Dan,

Thanks for your insights. I have not looked at your process, because that feels like cheating at the moment.

 

I am indeed also using H2O, but with GLM at the moment. A validation with a bigger step size yielded this:

 

When I tried to run it with Scott's proposed step size, it crashed after 3 hours. The reason is that I pipe a lot of attributes into the GLM. I am working on fixing this at the moment :)

[Image: new current.png]

In any case, we might create an ensemble model of both our solutions afterwards, just to give Scott the best model.
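
One simple way to combine two models like this is to average their per-row confidences before thresholding. The sketch below assumes a binary label and uses made-up confidence values; the function names, the equal weighting and the 0.5 threshold are all assumptions for illustration, not what either entry actually does.

```python
from typing import List


def average_confidences(
    model_a: List[float], model_b: List[float], weight_a: float = 0.5
) -> List[float]:
    """Blend two models' confidences for the positive class, row by row."""
    if len(model_a) != len(model_b):
        raise ValueError("both models must score the same rows")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(model_a, model_b)]


def to_labels(confidences: List[float], threshold: float = 0.5) -> List[bool]:
    """Turn blended confidences into positive/negative predictions."""
    return [c >= threshold for c in confidences]


# Made-up confidences for four rows (not results from this challenge).
gbt_conf = [0.9, 0.4, 0.3, 0.7]  # hypothetical Gradient Boosted Trees output
glm_conf = [0.7, 0.7, 0.1, 0.2]  # hypothetical GLM output

blended = average_confidences(gbt_conf, glm_conf)
labels = to_labels(blended)
print(labels)
```

Averaging tends to help most when the two models make different mistakes, so a blend like this would need to be checked with the same validation as the individual entries.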

 

~Martin

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
RM Staff


OK, got it. I don't believe my results; possibly I did something wrong somewhere:

Edit: found my issue. I am now at 61 as well.

[Image: faster GLM.png]

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Community Manager


Hello all RapidMiners -

 

After 5 hrs, 12 min and 09 seconds of runtime* (I want Dan's computer!), I have confirmed his entry at 61.62%**:

[Image: Screen Shot 2017-06-04 at 4.51.57 PM.png]

I did spend some of the past five hours looking at Dan's entry and wish to applaud him for some clever pre-processing (yes I agree there is probably a neater way to do that...) as well as some nice feature selection and tweaking of the GBT model.

 

Hence I declare that Dan is officially in 1st Place with one successfully verified entry at 61.62% accuracy.  

 

You have 11 days to get your entry in.  Game ON!

 

Scott

 

* If you also verify someone's results, could you also please post your runtime and your machine specs?  I think this is a great way for us to benchmark our machines against a common process.  A bonus as far as I'm concerned.  FWIW my machine is a Mac Pro (late 2013) running a 3.5GHz Intel Xeon E5-1650v2 6-core with 16GB memory.  And yes it was running full-out on this process.

 

** This is slightly higher than Dan's screenshot. I am assuming the discrepancy arises because the Gradient Boosted Trees operator gives slightly different results depending on the number of CPU threads? If so, I also assume that if we had both checked "reproducible", the results would have matched exactly. For fairness, I am taking my accuracy as the "official" one.
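
One common reason multithreaded training can differ slightly from run to run is that floating-point addition is not associative, so the order in which partial results are combined changes the final value. The Python sketch below illustrates only that general effect with hand-picked numbers; it is an assumption that this is what happened here, and it says nothing about how the operator is implemented.

```python
from typing import Iterable


def naive_sum(values: Iterable[float]) -> float:
    """Add values strictly left to right, as a single worker would."""
    total = 0.0
    for value in values:
        total += value
    return total


# Hand-picked values: at 1e16 the gap between adjacent doubles is 2.0,
# so adding a lone 1.0 to 1e16 is rounded away.
one_order = naive_sum([1e16, 1.0, -1e16, 1.0])
other_order = naive_sum([1.0, 1.0, 1e16, -1e16])

print(one_order, other_order)
```

The same four numbers give two different totals depending on order, which is the kind of small drift that a fixed, reproducible execution order would remove.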

 

 

Community Manager


OK folks - two more days to submit your models to this challenge!  There is cash, gift cards, and homemade maple syrup on the line.  @bigD has this in the bag right now with the only submission.  Game time.

 

Scott

Community Manager


Less than 10 hours to go... any other entries? :)

 

 

Scott Genzer
Senior Community Manager
RapidMiner, Inc.
RM Staff


I am 2% worse than him, so Dan deserves it for sure! But maybe someone else cracked this?


RM Staff


Hi,

 

Attached is my whole repository, because I use reusable subprocesses. I have a GLM-based model of similar quality to what Dan built, but it's way faster! :) Anyway, he deserves the win.

 

The repo also has all my processes in it, so you can see my progress. 07- is, I think, the final one. The tweaks afterwards did not help.

 

https://drive.google.com/file/d/0B5hecBOIu9zMeXBNZThaMXNpZmc/view?usp=sharing

 

 

Best,

Martin

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner