06-03-2017 12:49 PM
06-03-2017 02:17 PM - edited 06-03-2017 02:51 PM
Yes, I am using the sliding window validation parameters @sgenzer proposed. It runs about 1925 iterations in about 14-17 minutes for me, and in about 10-12 minutes on my 24-core, 64GB RAM desktop. I was toying with the idea of running it on my Hadoop cluster for more speed, but that is probably overkill. Not sure what you mean by 'comm training'? Perhaps I missed something. I am getting up to 67.8 with grid search parameter tuning, but that uses only an 80/20 split, so there is probably some overfitting.
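For anyone following along outside RapidMiner, here is a minimal Python sketch of how sliding window validation steps through time-ordered data. It is not the RapidMiner operator itself, and the function name and window sizes are hypothetical, chosen only for illustration. It shows why a smaller step size means more iterations and a longer runtime.

```python
from typing import Iterator, Tuple


def sliding_windows(
    n_rows: int, train_size: int, test_size: int, step: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_indices, test_indices) pairs that move forward in time."""
    start = 0
    # Stop once the training plus test window no longer fits in the data.
    while start + train_size + test_size <= n_rows:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step


# Hypothetical sizes: 20 rows, train on 10, test on the next 2, move by 2.
windows = list(sliding_windows(n_rows=20, train_size=10, test_size=2, step=2))
for train, test in windows:
    print(f"train rows {train.start}-{train.stop - 1}, "
          f"test rows {test.start}-{test.stop - 1}")
```

Each test window always comes after its training window, so the model is never scored on data older than what it was trained on.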
Will post my full code before I go on vacation next Saturday so others can recreate it.
06-03-2017 09:33 PM
Here is my best model so far. I've posted my code for others to use.
My first big hint: use the new Gradient Boosted Trees algorithm from H2O (similar to the popular XGBoost package). It's wicked fast and frankly is my go to algo these days.
My code was too big for the insert code window so I attached a docx file.
I'm sure someone will have a good idea for how to reduce the size of the preprocessing part
(it needs 2 loops that I couldn't quite get right).
06-04-2017 04:34 AM
Thanks for your insights. I have not had a look at your process, because that feels like cheating at the moment.
I am indeed also using H2O, but with GLM at the moment. A validation with a bigger step size yielded this:
When I tried to run it with Scott's proposed step size, it crashed after 3 hours. The reason is that I pipe a lot of attributes into the GLM. I am working on fixing this at the moment.
In any case, we might create an ensemble model of both our solutions afterwards, just to give Scott the best model.
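A minimal sketch of what such an ensemble could look like, assuming both models output a probability for the positive class on the same rows. The function names, the weight, and the example probabilities are hypothetical and not taken from either process.

```python
from typing import List


def average_ensemble(
    probs_a: List[float], probs_b: List[float], weight_a: float = 0.5
) -> List[float]:
    """Blend two models' positive-class probabilities row by row."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same rows")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)]


def to_labels(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Turn blended probabilities into 0/1 predictions."""
    return [1 if p >= threshold else 0 for p in probs]


# Hypothetical scores from a GBT model and a GLM model on three rows.
gbt_probs = [0.75, 0.25, 0.5]
glm_probs = [0.25, 0.5, 1.0]

blended = average_ensemble(gbt_probs, glm_probs)
print(blended)
print(to_labels(blended))
```

The weight would have to be tuned on validation data rather than on the final test window, otherwise the blend itself overfits.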
06-04-2017 04:48 AM - edited 06-04-2017 05:37 AM
Got it. I don't believe my results; possibly I did something wrong somewhere:
Edit, found my issue. I am now at 61 as well.
06-04-2017 05:11 PM
Hello all RapidMiners -
After 5 hrs, 12 min and 09 seconds of runtime* (I want Dan's computer!), I have confirmed his entry at 61.62%**:
I did spend some of the past five hours looking at Dan's entry and wish to applaud him for some clever pre-processing (yes I agree there is probably a neater way to do that...) as well as some nice feature selection and tweaking of the GBT model.
Hence I declare that Dan is officially in 1st Place with one successfully verified entry at 61.62% accuracy.
You have 11 days to get your entry in. Game ON!
* If you also verify someone's results, could you also please post your runtime and your machine specs? I think this is a great way for us to benchmark our machines against a common process. A bonus as far as I'm concerned. FWIW my machine is a Mac Pro (late 2013) running a 3.5GHz Intel Xeon E5-1650v2 6-core with 16GB memory. And yes it was running full-out on this process.
** This is actually slightly higher than Dan's screenshot. I am assuming the discrepancy arises because the Gradient Boosted Trees operator gives slightly different results depending on the number of CPU threads? If so, I also assume that if we had both checked "reproducible", the results would have matched exactly. For fairness, I am taking my accuracy as the "official" one.
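One plausible mechanism for thread-dependent results, offered only as an assumption and not as a description of what the H2O operator does internally: floating-point addition is not associative, so splitting the same numbers across a different number of workers can change the last digits of a sum. A minimal Python sketch with a hypothetical helper:

```python
from typing import List


def combine_partial_sums(chunks: List[List[float]]) -> float:
    """Sum each chunk separately (one per worker), then add the partial sums."""
    total = 0.0
    for chunk in chunks:
        partial = 0.0
        for value in chunk:
            partial += value
        total += partial
    return total


# The same three numbers, grouped as one worker vs. two workers.
one_worker = combine_partial_sums([[0.1, 0.2, 0.3]])
two_workers = combine_partial_sums([[0.1], [0.2, 0.3]])

print(one_worker)
print(two_workers)
print(one_worker == two_workers)
```

A difference this small in a split criterion can occasionally flip which split a tree chooses, which would be enough to move the final accuracy by a fraction of a percent.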
06-13-2017 08:45 AM
OK folks - two more days to submit your models to this challenge! There is cash, gift cards, and homemade maple syrup on the line. @bigD has this in the bag right now with the only submission. Game time.
06-15-2017 10:52 PM
I am 2% worse than him, so Dan deserves it for sure! But maybe something else cracked this?
06-16-2017 02:58 AM
Attached is my whole repository, because I use reusable subprocesses. I have a GLM-based model of similar quality to what Dan built. But it's way faster! - Anyway, he deserves the win.
The repo also has all my processes in it, so you can see my progress. 07- is, I think, the final one. The tweaks afterwards did not help.