Ideal ratio with respect to scoring dataset and training dataset

Abi Member Posts: 1 Contributor I

Like the 70-30 ratio for training and testing, is there a suggested ratio between the training dataset and the scoring dataset?

(The aim is to reduce the training data to the right proportion for the best scoring results.)

Answers

  • hbajpai Member Posts: 51  Guru
    Hey @Abi,

    Scoring is typically done in real time rather than in batch. I assume you mean the ratio of the train, dev/hold-out, and test sets. A rule of thumb: if the number of rows is less than 100k, it could be 60%/20%/20% or 70%/15%/15%. But if you have 1 million or more rows, it could be 98%/1%/1% or even 99.5%/0.4%/0.1%.

    As for reducing the total number of rows, one trick is to retrain the model on the whole dataset once you have validated the final model.


  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,408  RM Data Scientist
    "Scoring is typically real time rather than batch."
    I would challenge you on this. In customer analytics it's often fine to run scoring once a day or once a week.

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • varunm1 Moderator, Member Posts: 1,185   Unicorn
    edited April 22
    Totally agree with @Telcontar120 on CV. If one cannot afford to run CV because of time constraints, very large data, or specific needs, then another validation scheme, similar to the one in AM, can be used.
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing
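
The rule-of-thumb ratios from the first answer can be sketched in a few lines of plain Python. This is only an illustration, not a RapidMiner process: the function names `split_fractions` and `train_dev_test_split` are hypothetical, and because the thread gives no ratio for datasets between 100k and 1 million rows, the sketch assumes the small-data ratio applies to everything below 1 million.

```python
import random


def split_fractions(n_rows):
    """Return (train, dev, test) fractions following the thread's rule of thumb.

    Assumption: anything below 1 million rows uses 70/15/15; the thread
    itself only covers "< 100k" and ">= 1 million".
    """
    if n_rows >= 1_000_000:
        return (0.98, 0.01, 0.01)
    return (0.70, 0.15, 0.15)


def train_dev_test_split(rows, seed=0):
    """Shuffle the rows and cut them into train, dev and test partitions."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    train_frac, dev_frac, _ = split_fractions(len(rows))
    n_train = round(len(rows) * train_frac)
    n_dev = round(len(rows) * dev_frac)
    train = rows[:n_train]
    dev = rows[n_train:n_train + n_dev]
    test = rows[n_train + n_dev:]  # whatever remains goes to the test set
    return train, dev, test


train, dev, test = train_dev_test_split(range(1000))
print(len(train), len(dev), len(test))
```

Once the final model has been validated on the dev and test partitions, the trick mentioned in the same answer is to refit it on all rows (train, dev and test together) before using it for scoring.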
