Ideal ratio with respect to scoring dataset and training dataset

Abi Member Posts: 1 Contributor I

Like the 70/30 ratio for training and testing, is there a suggested ratio between the training dataset and the scoring dataset?

(The goal is to reduce the training data to the right proportion for the best scoring results.)

Best Answers


  • hbajpai Member Posts: 102 Unicorn
    Hey @Abi,

    Scoring is typically done in real time rather than in batch. I assume you mean the ratio between the train, dev/hold-out, and test sets. A rule of thumb: if you have fewer than 100k rows, the split could be 60%, 20%, 20% or 70%, 15%, 15%. But if you have 1 million or more rows, it could be 98%, 1%, 1% or even 99.5%, 0.4%, 0.1%.

    As for reducing the total number of rows, a common trick is to retrain the model on the whole dataset once you have validated the final model.
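The rule of thumb above can be sketched in a few lines of Python. This is only an illustration, not a RapidMiner feature: the function names `suggested_fractions` and `split_indices` are hypothetical, and the 90/5/5 split used between 100k and 1 million rows is an assumption, since the post gives no figure for that range.

```python
import random


def suggested_fractions(n_rows):
    """Return (train, dev, test) fractions following the rule of thumb."""
    if n_rows < 100_000:
        return (0.60, 0.20, 0.20)
    if n_rows >= 1_000_000:
        return (0.98, 0.01, 0.01)
    # Assumption: the post does not cover this middle range.
    return (0.90, 0.05, 0.05)


def split_indices(n_rows, seed=0):
    """Shuffle row indices and cut them into train, dev and test lists."""
    train_frac, dev_frac, _ = suggested_fractions(n_rows)
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_train = round(n_rows * train_frac)
    n_dev = round(n_rows * dev_frac)
    train = indices[:n_train]
    dev = indices[n_train:n_train + n_dev]
    test = indices[n_train + n_dev:]  # whatever remains goes to the test set
    return train, dev, test


train, dev, test = split_indices(1000)
print(len(train), len(dev), len(test))
```

Once the model has been validated on the dev and test rows, the final fit could simply use all of the indices, in line with the retraining trick mentioned above.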

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist
    "Scoring is typically done in real time rather than in batch."
    I would challenge you on this. In customer analytics it's often fine to run scoring once a day or once a week.

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    edited April 2020
    Totally agree with @Telcontar120 on cross-validation (CV). If you cannot afford to run CV because of time constraints, very large data, or specific needs, then another validation scheme, similar to the one used in Auto Model (AM), can be used instead.
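For readers unfamiliar with cross-validation, here is a minimal sketch of how k-fold index splitting works. The helper `k_fold_indices` is a hypothetical name used for illustration, not a RapidMiner operator, and the sketch assumes the rows have already been shuffled.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_rows % k) folds get one extra row so every row is used.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size


for train_idx, test_idx in k_fold_indices(10, k=3):
    print(len(train_idx), len(test_idx))
```

Each row lands in exactly one test fold, so the model is trained k times. That repeated training is the cost that can make a single hold-out split more practical on very large data.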

    Be Safe. Follow precautions and Maintain Social Distancing
