random seed

k_vishnu772 Member Posts: 34 Contributor I
edited November 2018 in Help

Hi all,

 

I use the Split Data operator. If I use the random seed, the accuracy on the test data is only 74%, and if I don't use a random seed in my split operator, my performance improves to 80%. I don't know which one to consider for my model.

 

Why does the random seed have such a huge effect on the accuracy of the model?

 

Regards,

Vishnu

Best Answers

  • Telcontar120 Posts: 1,226   Unicorn
    Solution Accepted

    In general, on such a small dataset, there is very limited benefit to doing a split validation and then a cross validation on the original training set, followed by another validation on the original testing set.  The point of cross validation is to estimate your model uncertainty by making the best use of all your data for both training and testing.  The way you are doing it now, you are still only using 80% of the data for training and 20% of the data for testing.  You might as well not be doing cross-validation at all on the training set.

    Where the split/cross validation hybrid makes sense is if you have two separate samples, say from two different time periods.  In that case, you can build using cross validation on one sample, and then check the performance on the other sample and compare it to the performance from the cross validation of the original sample to see how stable it is.

    In some other cases, if you have an extremely large original sample, then splitting it and doing a separate holdout validation (once again comparing it to the performance of the original cross-validation) can also make sense.

    But in your case, with a 20% holdout taken from only 257 examples, you are always getting validation results from your split on just about 50 examples, which introduces way too much noise to provide a consistent performance result.

     

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
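
The noise on a holdout of about 50 examples can be put into rough numbers. The following is a minimal sketch, assuming the test accuracy behaves like a binomial proportion and using the usual normal approximation for the interval. The function name `accuracy_standard_error` is made up for this illustration, and the 80% figure is simply the accuracy reported in the question.

```python
import math

def accuracy_standard_error(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy measured on n_examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)

n_total = 257                      # rows in the data set (from the thread)
n_test = round(0.2 * n_total)      # size of a 20% holdout
se = accuracy_standard_error(0.80, n_test)
half_width = 1.96 * se             # approximate 95% interval half-width

print(f"test examples: {n_test}")
print(f"standard error: {se:.3f}")
print(f"approx. 95% interval: {0.80 - half_width:.2f} to {0.80 + half_width:.2f}")
```

Under these assumptions the interval spans roughly eleven percentage points on either side of 80%, so a 74% result and an 80% result on such a holdout are statistically indistinguishable.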
  • Telcontar120 Posts: 1,226   Unicorn
    Solution Accepted

    Correct, I would only use the cross-validation approach for most ordinary modeling projects, especially with that small dataset.  In theory, it shouldn't matter which random seed you choose.  That's just to allow reproducibility of your results.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
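
The reproducibility role of a seed can be illustrated outside RapidMiner. This is only a sketch in plain Python, not how the Split Data operator is implemented; `split_indices` is a hypothetical helper, the seed 1992 is the one mentioned in the thread, and the seed 7 is arbitrary.

```python
import random

def split_indices(n_rows, test_fraction, seed):
    """Shuffle row indices with a seeded generator and cut off a test set."""
    rng = random.Random(seed)            # generator fully determined by the seed
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]   # (train, test)

train_a, test_a = split_indices(257, 0.2, seed=1992)
train_b, test_b = split_indices(257, 0.2, seed=1992)
train_c, test_c = split_indices(257, 0.2, seed=7)

print(test_a == test_b)   # same seed: the same rows land in the test set
print(test_a == test_c)   # different seed: a different set of test rows
```

The seed decides which rows end up in the test set and nothing else. No seed is better than another; a higher accuracy under one seed only means that this particular draw of test rows happened to be easier.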

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,111  RM Data Scientist

    Hi,

    Because what you measure is an estimate of the true accuracy of the model. Like all measurements, it has an uncertainty.

     

    Best,

    Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • k_vishnu772 Member Posts: 34 Contributor I

    Which one should I use in the end? The high-accuracy model obtained by removing the local seed?

    I have another model where the accuracy improved by 10% when I set the local seed, so I am totally confused about which one to use.

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,111  RM Data Scientist

    Hi,

    That's why you want to use Cross Validation and take the average there. It's proven that this is the best unbiased estimator of the true performance.

    Best,

    Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
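
The difference between a single split and an averaged cross validation can be tried out on synthetic data. This is a rough sketch using only the Python standard library, not a RapidMiner process: the data, the one-feature threshold classifier, and all function names are invented for illustration. Only the row count (257) and the 80/20 ratio come from the thread.

```python
import random
import statistics

def make_data(n_rows, seed):
    """Synthetic two-class data: one noisy numeric feature per row."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        label = rng.randint(0, 1)
        feature = rng.gauss(1.5 * label, 1.0)   # class means 0.0 and 1.5
        rows.append((feature, label))
    return rows

def fit_threshold(train):
    """Nearest-centroid rule in 1D: threshold halfway between class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2.0

def accuracy(threshold, test):
    """Fraction of rows where 'feature above threshold' matches label 1."""
    hits = sum((x > threshold) == (y == 1) for x, y in test)
    return hits / len(test)

def holdout_accuracy(rows, seed, test_fraction=0.2):
    """One shuffled 80/20 split, scored on the 20% part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    return accuracy(fit_threshold(train), test)

def cross_val_accuracy(rows, seed, k=10):
    """Mean accuracy over k folds; every row is tested exactly once."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    scores = []
    for i in range(k):
        test = shuffled[i::k]
        train = [r for j, r in enumerate(shuffled) if j % k != i]
        scores.append(accuracy(fit_threshold(train), test))
    return statistics.mean(scores)

data = make_data(257, seed=0)
seeds = range(20)
holdout = [holdout_accuracy(data, s) for s in seeds]
cv = [cross_val_accuracy(data, s) for s in seeds]

print(f"holdout: min {min(holdout):.2f}, max {max(holdout):.2f}")
print(f"10-fold: min {min(cv):.2f}, max {max(cv):.2f}")
```

The data and the model are the same in every run; only the seed changes. The single-split accuracy should swing far more from seed to seed than the cross-validated average, because the average scores every row once instead of relying on one draw of about 50 rows.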
  • k_vishnu772 Member Posts: 34 Contributor I

    Let me explain a little bit more about my data. I have 257 rows and 17 columns of data. First I split the data into two parts (80%, 20%) using the Split Data operator, then I apply cross validation on the 80% portion, and finally I apply the model to the 20% portion. I first ran Auto Model and got 86% accuracy; it has a local random seed of 1992. If I remove the local random seed, the performance drops to 76%. I am totally confused, as I am new to machine learning. Could you please share your thoughts on that?

  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 280   Unicorn

    Hi @k_vishnu772

     

    My thought is that you use either local or global random seeds, which initialize the random number generator differently. 

    This is why the split operator does the split differently in both cases, which leads to different performance. 

    If you could share your process XML, we could have a look to make sure. 
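
The local versus global distinction can be mimicked with Python's `random` module. This is an analogy only and makes no claim about RapidMiner's internals: the module-level generator stands in for a process-wide seed, a private `random.Random` instance stands in for an operator's local seed, and the seed 42 and both helper names are arbitrary choices for this sketch.

```python
import random

def shuffled_rows_global(n_rows):
    """Shuffle using the shared module-level generator."""
    rows = list(range(n_rows))
    random.shuffle(rows)
    return rows

def shuffled_rows_local(n_rows, seed):
    """Shuffle using a private generator seeded just for this step."""
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)
    return rows

# Shared generator: the result depends on what consumed random numbers before.
random.seed(42)
global_first = shuffled_rows_global(257)
random.seed(42)
random.random()                       # an earlier step draws one number
global_second = shuffled_rows_global(257)

# Private generator: the result depends only on its own seed.
local_first = shuffled_rows_local(257, seed=1992)
random.random()                       # activity on the shared generator
local_second = shuffled_rows_local(257, seed=1992)

print(global_first == global_second)  # shared state moved, so the order differs
print(local_first == local_second)    # private seed, so the order is identical
```

Switching a step between the shared and the private generator therefore changes which rows go where, even though nothing about the data or the model has changed.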

  • k_vishnu772 Member Posts: 34 Contributor I

    Hi Brian, thanks for your explanation. So in this case, where my data set is small, I should avoid Split Data and use only the Cross Validation operator?

     

    As for the random seed, can I choose the one that gives the highest accuracy?

     

    I am attaching an image of the process for your reference, as I am not authorised to share the process XML.
