Ideal ratio with respect to scoring dataset and training dataset
Like the 70-30 ratio for training and testing, is there a suggested ratio for the training and scoring datasets?
(The aim is to reduce the training data to the right proportion for the best scoring.)
Best Answers
varunm1 Member Posts: 1,207 Unicorn
Hello @Abi
70-30 is a general ratio that you find in many processes where split validation is used. I really like the validation used in Auto Model. What Auto Model does is train a model on 60% of the data and then score on the remaining 40%. To score, it splits that 40% into 7 subsets, tests on each subset, and then averages the performance across the 7 subsets. This way it also gets some of the advantages of cross-validation.
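The 60/40 scheme with a subdivided hold-out set can be sketched in plain Python. This is only an illustration of the idea as described in this post, not Auto Model's actual implementation; the function names, the toy labels, and the majority-class "model" are all hypothetical.

```python
import random
from statistics import mean

def split_train_holdout(n_rows, train_fraction=0.6, n_subsets=7, seed=42):
    """Shuffle row indices, keep train_fraction of them for training, and
    cut the remaining hold-out rows into n_subsets roughly equal parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_fraction)
    train, holdout = indices[:n_train], indices[n_train:]
    # Striding spreads any remainder rows across the subsets.
    subsets = [holdout[i::n_subsets] for i in range(n_subsets)]
    return train, subsets

def averaged_holdout_score(subsets, score_fn):
    """Score each hold-out subset separately and average the results."""
    return mean(score_fn(subset) for subset in subsets)

# Hypothetical toy data: about one third of the labels are True.
labels = [i % 3 == 0 for i in range(1000)]
train, subsets = split_train_holdout(len(labels))

# Stand-in "model": always predict the majority class of the training rows.
majority = sum(labels[i] for i in train) * 2 >= len(train)

def accuracy(rows):
    return sum(labels[i] == majority for i in rows) / len(rows)

print(len(train), [len(s) for s in subsets])
print(round(averaged_holdout_score(subsets, accuracy), 3))
```

The spread of the per-subset scores, not just their average, gives a rough sense of how stable the estimate is.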
My suggestion: go with 60% for training (cross-validated) and 40% for testing (divided into 5 or 7 subsets) for scoring. If you can cross-validate on the whole data set, that is fine as well, but test the model on at least 10% hold-out data after CV.
Regards,
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
The ideal ratio is to use cross-validation. There is a reason this is considered the "gold standard" for validation. This approach ensures that every record is used for both training and testing across the folds. Otherwise you are inviting bias from the random effects of which records end up in your training set versus your testing set.
I understand why AutoModel has chosen to implement a form of split validation: primarily to save processing time. That is probably a smart choice for an automated tool that is designed to work on pretty much any size of data set that users might choose to give it. It is also potentially doing a lot of other complicated things, like feature engineering and feature selection, so some corners have to be cut to make the best use of the overall time that users are willing to wait for the output.
However, if you are building your own process manually and can set it up any way you like, then your default should probably be to do cross-validation, deviating from that only when you have a specific need. If you have tons of data and you are also doing many other complicated things, then perhaps it is better to do split validation. But if you have smaller data sets or more time to devote to model preprocessing and processing, then cross-validation is really the way to go.
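To make the point about cross-validation using all of the data concrete, here is a minimal sketch of how k-fold index splitting works. The helper name and the 103-row data set are hypothetical; in RapidMiner itself the Cross Validation operator handles this for you.

```python
import random

def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Each fold takes every k-th shuffled row, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Count how often each row of a hypothetical 103-row data set is used.
n_rows, k = 103, 10
test_counts = [0] * n_rows
train_counts = [0] * n_rows
for train, test in k_fold_indices(n_rows, k):
    for idx in test:
        test_counts[idx] += 1
    for idx in train:
        train_counts[idx] += 1

# Every row is tested exactly once and trained on k - 1 times.
print(set(test_counts), set(train_counts))
```

Because no row is left permanently outside the test set, the performance estimate does not hinge on one lucky or unlucky split.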
Answers
Scoring is typically done in real time rather than in batch. I assume you mean the ratio of train, dev/hold-out and test sets. The rule of thumb is: if the number of rows is less than 100k, it could be 60%, 20%, 20% or 70%, 15%, 15%. But if you have 1 million or more rows, it could be 98%, 1%, 1% or even 99.5%, 0.4%, 0.1%.
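A quick calculation shows why such small percentages are enough on large data. The sketch below only converts the ratios quoted above into row counts; the function name and the example row counts (50,000 and 2,000,000) are made up for illustration.

```python
def split_sizes(n_rows, fractions):
    """Convert (train, dev, test) fractions into whole row counts."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    train_frac, dev_frac, _ = fractions
    n_train = round(n_rows * train_frac)
    n_dev = round(n_rows * dev_frac)
    # Whatever is left after rounding goes to the test set.
    return n_train, n_dev, n_rows - n_train - n_dev

print(split_sizes(50_000, (0.60, 0.20, 0.20)))     # (30000, 10000, 10000)
print(split_sizes(2_000_000, (0.98, 0.01, 0.01)))  # (1960000, 20000, 20000)
```

With 2 million rows, a 1% test set still holds 20,000 rows, which is more than the 20% test set of the 50,000-row example.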
As far as reducing the total number of rows goes, a trick is to retrain the model on the whole data set once you have validated the final model.
Harshit
Dortmund, Germany