SplitValidation Issue
@sgenzer, I believe there may be an issue with the Split Validation operator. The model output by the overall Split Validation process does not correspond to the model from which the validation performance metrics are computed.
I have attached an Excel spreadsheet showing the computations with a formula. The RMSE computed for the validation dataset (using the Performance operator) corresponds to the "ValidModel and ApplyModel" (in the Excel worksheet), which is one of the models output by the process when dissected with Remember/Recall operators and breakpoints. However, the RapidMiner process outputs a Linear Regression model that is the same as the "TrainModel" (in the Excel worksheet), whose RMSE does not match the one given by the Performance (Regression) operator. Why the discrepancy? Which is the correct model here?
I have tried this issue with multiple datasets and have documented it in a process with the sample Polynomial dataset. Any ideas on what may be going on here?
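For reference, the RMSE value compared in the worksheet is the standard root-mean-squared error. A minimal sketch of that computation in Python, with made-up label and prediction values standing in for the spreadsheet columns (the numbers here are purely illustrative):

```python
import math

# Hypothetical label and prediction columns, standing in for the Excel worksheet
y_true = [2.0, 3.5, 1.0, 4.0]
y_pred = [2.1, 3.2, 1.3, 3.8]

# RMSE = sqrt(mean((y - yhat)^2)), the same formula a spreadsheet would use
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
print(round(rmse, 4))  # -> 0.2398
```

If this hand computation matches one model's predictions but not the other's, the two models must have different coefficients, which is the crux of the question.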
Best Answer

lionelderkrikor (Moderator, RapidMiner Certified Analyst):

Hi @avd,
I will try to explain this difference (RM staff, please correct me if I'm wrong): there are indeed two models built in this process.
The first one is built with 60% of the data and then tested on the remaining 40%: you called this model "ValidModel and ApplyModel". It is used to calculate the performance on unseen data, which is the value reported by the Performance (Regression) operator.
However, the model delivered at the mod output of the Split Validation operator (which you called "TrainModel") is built with 100% of the input ExampleSet, so it is a different model from the first one described above. You can check this in the help section of the Split Validation operator:
"Output Model: The training subprocess must return a model, which is trained on the input ExampleSet. Please note that the model built on the complete input ExampleSet is delivered from this port."
To sum up, the "ValidModel and ApplyModel" is built with 60% of the input ExampleSet, while the "TrainModel" is built with 100% of it. The second model therefore has a different performance from the first because it is, quite literally, a different model.
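The behaviour described above can be sketched outside RapidMiner. Below is a minimal Python/scikit-learn analogue (an illustrative stand-in, not RapidMiner's actual implementation, and the synthetic data is made up): one model is fitted on a 60% split and scored on the held-out 40%, then a second model is refitted on 100% of the data, mirroring what the Split Validation operator delivers at its mod port.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the Polynomial sample set
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# "ValidModel": fit on 60% of the data, score on the held-out 40%.
# This is the model whose RMSE the Performance operator reports.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0
)
valid_model = LinearRegression().fit(X_train, y_train)
rmse = mean_squared_error(y_test, valid_model.predict(X_test)) ** 0.5
print("validation RMSE (60/40 split):", rmse)

# "TrainModel" / production model: refit on 100% of the data.
# Its coefficients generally differ from valid_model's, so its RMSE
# on the 40% holdout would not match the reported validation RMSE.
train_model = LinearRegression().fit(X, y)
print("coefficients differ:", not np.allclose(valid_model.coef_, train_model.coef_))
```

This is also why the discrepancy is expected rather than a bug: the validation metric estimates how a model trained this way generalizes, while the delivered model uses all available data to get the best final fit.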
Hope this helps,
Regards,
Lionel
NB: Sometimes what you called the "TrainModel" is referred to as the "production model" (built with 100% of the data).