random seed

k_vishnu772 · May 2018

Hi All,

I use the Split Data operator. If I set a random seed, the accuracy on the test data is only 74%; if I don't set a random seed in the Split Data operator, the performance improves to 80%. I don't know which one to consider for my model.

Why does the random seed have such a huge effect on the accuracy of the model?

Regards,

Vishnu

MartinLiebig · May 2018

Hi,

Because what you measure is only an estimate of the true accuracy of the model. Like all measurements, it has an uncertainty.

Best,

Martin

k_vishnu772 · May 2018

Which one should I use in the end? The higher-accuracy model, obtained by removing the local seed?

I have another model where setting the local seed improved the accuracy by 10%, so I am totally confused about which one to use.

MartinLiebig · May 2018

Hi,

That's why you want to use Cross Validation and take the average over the folds. It's proven that this is the best unbiased estimator of the true performance.

Best,

Martin
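
A minimal sketch of the averaging Martin describes, written in plain Python rather than as a RapidMiner process. Everything in it is an assumption made for illustration: the toy one-feature data from `make_toy_data`, the threshold "model", and the helper names are hypothetical and stand in for the real 257-row table and learner, which the thread does not show.

```python
import random
import statistics

def make_toy_data(n_rows, seed):
    """Hypothetical 1-feature, 2-class data standing in for the real table."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_rows):
        label = rng.randint(0, 1)
        x = rng.gauss(1.0 if label else -1.0, 1.2)  # overlapping classes
        data.append((x, label))
    return data

def train_threshold(rows):
    """Toy model: a threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in rows if y == 0)
    mean1 = statistics.mean(x for x, y in rows if y == 1)
    return (mean0 + mean1) / 2.0

def accuracy(threshold, rows):
    """Fraction of rows where 'x above threshold' agrees with label 1."""
    hits = sum((x > threshold) == (y == 1) for x, y in rows)
    return hits / len(rows)

def cross_validate(data, k, seed):
    """Shuffle once, cut into k folds, and test on each fold in turn."""
    rng = random.Random(seed)
    rows = data[:]
    rng.shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test_fold in enumerate(folds):
        train_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(train_threshold(train_rows), test_fold))
    return scores

data = make_toy_data(257, seed=0)
scores = cross_validate(data, k=10, seed=1992)
print("fold accuracies:", [round(s, 2) for s in scores])
print("mean:", round(statistics.mean(scores), 3),
      "stdev:", round(statistics.stdev(scores), 3))
```

Each fold on its own is as noisy as a small holdout; the reported figure is the mean over all folds, and the standard deviation gives a rough idea of how much to trust it.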

k_vishnu772 · May 2018

Let me explain a little more about my data. I have 257 rows and 17 columns. First I split the data into two parts (80%, 20%) using the Split Data operator, then I apply cross-validation on the 80% portion, and finally I apply the model to the remaining 20%. I first ran Auto Model and got 86% accuracy; it uses a local random seed of 1992. If I remove the local random seed, the performance drops to 76%. I am totally confused, as I am new to machine learning. Could you please share your thoughts on that?

kypexin · May 2018

Hi @k_vishnu772

My thought is that you are using either a local or a global random seed, and these initialize the random number generator differently.

This is why the Split Data operator splits the data differently in the two cases, which leads to different performance.

If you could share your process XML, we could have a look to make sure.
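
To illustrate the point about seeds and splits, here is a small sketch in plain Python. It is not how RapidMiner implements Split Data (its generator and shuffling differ); `split_indices` is a hypothetical helper, and the second seed, 7, is arbitrary. It only shows the general behaviour: the same seed reproduces the same split, and a different seed sends different rows to the test set.

```python
import random

def split_indices(n_rows, train_fraction, seed):
    """Shuffle row indices with a seeded generator and cut into train/test."""
    rng = random.Random(seed)        # same seed, same shuffle
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

N_ROWS = 257  # dataset size mentioned in the thread

train_a, test_a = split_indices(N_ROWS, 0.8, seed=1992)
train_b, test_b = split_indices(N_ROWS, 0.8, seed=1992)
train_c, test_c = split_indices(N_ROWS, 0.8, seed=7)  # arbitrary second seed

print("train rows:", len(train_a), "test rows:", len(test_a))
print("same seed, same test rows:", test_a == test_b)
print("test rows shared across seeds:", len(set(test_a) & set(test_c)))
```

With only about 50 test rows, swapping even a handful of easy or hard examples in or out of the test set moves the accuracy by several percentage points.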

Telcontar120 · May 2018

In general, on such a small dataset, there is very limited benefit to doing a split validation and then a cross validation on the original training set, followed by another validation on the original testing set. The point of cross validation is to estimate your model uncertainty by making the best use of all your data for both training and testing. The way you are doing it now, you are still only using 80% of the data for training and 20% of the data for testing. You might as well not be doing cross-validation at all on the training set.

Where the split/cross validation hybrid makes sense is if you have two separate samples, say from two different time periods. In that case, you can build using cross validation on one sample, and then check the performance on the other sample and compare it to the performance from the cross validation of the original sample to see how stable it is.

In some other cases, if you have an extremely large original sample, then splitting it and doing a separate holdout validation (once again comparing it to the performance of the original cross-validation) can also make sense.

But in your case, with a 20% holdout from only 257 examples, your split always gives validation results on just about 50 examples, which introduces far too much noise to provide a consistent performance result.
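
A rough way to put a number on that noise is the normal-approximation interval for a proportion. The sketch below assumes a true accuracy of about 80% and independent test examples; both are simplifying assumptions, and `accuracy_interval` is a hypothetical helper name.

```python
import math

def accuracy_interval(accuracy, n_test, z=1.96):
    """Normal-approximation interval for an accuracy measured on n_test examples."""
    std_error = math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    return accuracy - z * std_error, accuracy + z * std_error

n_test = round(0.20 * 257)  # about 51 holdout examples
low, high = accuracy_interval(0.80, n_test)
print(f"n_test={n_test}, 95% interval: {low:.2f} to {high:.2f}")
```

Under these assumptions the interval runs from roughly 69% to 91%, so results of 74%, 80% and 86% from different seeds are all compatible with the same underlying model.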

k_vishnu772 · May 2018

Hi Brian, thanks for your explanation. So in this case, where my data set is small, I should avoid Split Data and use only the Cross Validation operator?

As for the random seed, can I choose the one that gives the highest accuracy?

I am attaching an image of the process for your reference, as I am not authorised to share the process XML.

Telcontar120 · May 2018

Correct, I would only use the cross-validation approach for most ordinary modeling projects, especially with that small dataset. In theory, it shouldn't matter which random seed you choose. That's just to allow reproducibility of your results.
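
On the question of picking the seed with the highest accuracy, a small simulation can show why the winning number is not a fair estimate. This is an illustration only, under loud assumptions: the model is fixed with a true accuracy of 80%, each seed simply draws a fresh 51-row holdout, and `measured_accuracy` is a hypothetical helper. In a real process the model would change with the split as well.

```python
import random

TRUE_ACCURACY = 0.80  # assumed true accuracy of a fixed model
N_TEST = 51           # roughly 20% of 257 rows

def measured_accuracy(seed):
    """Simulate scoring the same model on a seed-dependent 51-row holdout."""
    rng = random.Random(seed)
    hits = sum(rng.random() < TRUE_ACCURACY for _ in range(N_TEST))
    return hits / N_TEST

results = {seed: measured_accuracy(seed) for seed in range(20)}
best_seed = max(results, key=results.get)
average = sum(results.values()) / len(results)

print(f"average over seeds: {average:.3f}")
print(f"best seed {best_seed}: {results[best_seed]:.3f}")
```

The best seed reflects a lucky holdout rather than a better model, so its score should not be reported as the expected performance.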
