Problems with Split Validation
Hello,
I have some problems understanding the Split Validation operator. I thought the model learned on the "training" side (left) is the same model which is applied on the testing side (right). But when I store the models on both sides and retrieve them in another process, they are different (see picture).
Is this a bug?
Micha
Answers
Hi, welcome to our RapidMiner forum. Did you connect the model output port of your Split Validation operator? In that case, the model delivered by the whole validation is produced again on the complete data set, and hence that is also what gets stored with Store (3).
Cheers,
Ingo
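To make Ingo's point concrete, here is a rough scikit-learn analogy of what Split Validation computes. The data set and learner are arbitrary placeholders; this is a conceptual sketch, not RapidMiner's internal code:

```python
# A rough analogy of Split Validation's behaviour (not RapidMiner code;
# the data set and the decision tree learner are arbitrary placeholders).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

# Training subprocess: a model is learned on the training split only ...
split_model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# ... and the testing subprocess applies exactly this model to the test split.
accuracy = split_model.score(X_test, y_test)   # the performance estimate

# The model *output port*, however, delivers a model retrained on ALL data:
delivered_model = DecisionTreeClassifier(random_state=42).fit(X, y)

print(f"estimated accuracy: {accuracy:.3f}")
# split_model and delivered_model are in general different models.
```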
Thank you for the fast reply. You are absolutely right: when I connect the model output of the Split Validation operator, it computes the model for the whole data set (good to know). But I think this is a little bit counterintuitive, because in both cases (model output of the Split Validation connected or not) the model applied to the test data is the one learned on the training data. So when I connect the Split Validation model output, I see a different model than the actually applied model. I don't know if I am the only one who thinks this is counterintuitive. Maybe as a solution you could add another model output to the Split Validation for the actually applied model.
Thanks again
Micha
The reason why we didn't do this is that the applied model was only trained on a subset of the data. It is most probable that the model trained on all available training data will perform much better on new, unseen data, because it simply saw more of "the world". So you are strongly discouraged from using this model anyway.
If you want to have it anyway, you can make use of the modular conception of RapidMiner and use a Remember / Recall operator pair to tunnel the objects out of the subprocess. Here's a small example. Please note that we have to introduce another advanced RapidMiner technique: macro handling. We have used the predefined macro a, accessed by %{a}, which gives the apply count of the operator. So we are remembering each application of the models that are generated in the learning subprocess of the Split Validation. After the Split Validation operator has been executed (take a look at the execution order to be sure: menu Process / Operator Execution Order / Show...), we can recall the remembered objects by their name. Note that we have replaced the macro here with the constant 2, since the complete model is trained in the second run. You will see this when reaching the breakpoint I set in the above process.
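In Python-like pseudocode, the idea looks roughly like this. This is a conceptual sketch only: the store, the names, and the learner are placeholders standing in for the Remember / Recall operators and the %{a} macro, not RapidMiner's actual API:

```python
# Conceptual sketch of the Remember / Recall idea: a shared store lets
# objects "tunnel" out of a subprocess, keyed by the operator's apply
# count (the role played by RapidMiner's predefined macro %{a}).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

store = {}        # plays the role of the Remember / Recall object store
apply_count = 0   # plays the role of the %{a} macro

def training_subprocess(X, y):
    """Learning side of Split Validation: learn a model and Remember it."""
    global apply_count
    apply_count += 1
    model = DecisionTreeClassifier(random_state=42).fit(X, y)
    store[f"model_{apply_count}"] = model   # Remember, name = model_%{a}
    return model

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

training_subprocess(X_train, y_train)  # 1st run: model on the training split
training_subprocess(X, y)              # 2nd run: model on the complete data

# After the validation, Recall by constant name: "model_1" is the model
# actually applied to the test split, "model_2" the delivered one.
applied_model = store["model_1"]
delivered_model = store["model_2"]
print(applied_model.score(X_test, y_test))
```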
Greetings,
Sebastian
Thank you for showing me the Remember/Recall functionality - it works fine. The reasoning why you provide the model trained on the whole data set instead of the model actually applied to the test data is clear. But I still think it's counterintuitive. In our special use case the two models (whole data set and training data set) led to completely different results (this was due to using very thinly populated data, which is of course a problem in itself), and I didn't know that the operator applies a different model than the one plugged into the output. So I couldn't explain the result with the given model, which confused me. Maybe you should provide, as mentioned earlier, two outputs for both models (they are computed anyway).
Greetings
Micha
I'm a RapidMiner newbie. I like the program very much, it's really amazing!
I just struggled with the same problem as Micha for several hours. Finally I found this thread, which made it clear to me what the model returned from Split Validation is - the training part is run a second time with the whole data set. I understand the reasoning for recalculating the model, but I find it counterintuitive as well.
I'd suggest either adding the original model used for training to the output of Split Validation, as suggested before, or at least adding one sentence describing this behaviour to the documentation. It could save some time for another newbie...
Otherwise RapidMiner rocks! :-)
Kuba
Let me make sure I understand this. The operator called 'Split Validation' splits the data, trains a model on a subset of the data on the left side, then applies this "same" model trained on a subset of data to classify the unseen data on the right? Now the confusion comes in when you use the "model output" of the 'Split Validation' operator, which produces a more general model based on all training data. I guess this makes sense from the perspective that we want to estimate model generality from our training data, but use all training data to train the best model, which we will actually deploy.
Thanks for the clarification, ;D
-Gagi
Yes, a short sentence in the operator documentation could help a lot. Unfortunately it is currently quite difficult to just add such a sentence, but a solution is near:
We are going to set up a Wiki containing all the operator documentation. Then you can just drop a sentence there if you feel it is needed, and every RapidMiner user will see it in the help window if they are online.
Greetings,
Sebastian