First of all: I am a total beginner in data science. For my university project, I need to create a process in RapidMiner that predicts customer satisfaction based on a survey. The dataset can be obtained from Kaggle by searching for "Airline Passenger Satisfaction" by TJ Klein (cannot post links yet).

I got a training set and a test set. I built my process on the training set, so currently my process looks like this:

What confuses me now is where to use my test set. I don't really know where I should use it, or whether I should use it at all. The test set is not unlabeled, by the way. As it says on Kaggle, it was simply split off from the training set and represents 20% of all data.


    ceaperez Member
    Hi @Bella0812,

    You are using the Cross-validation operator in your model. 
    This operator performs the training and validation process for you. It divides the data set into k subsets of equal size, holds out one subset for testing, and trains the model on the other k-1 subsets. The process is repeated k times, with a different test subset held out each time.
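
As a rough illustration of the splitting described above, here is a minimal Python sketch. It is not RapidMiner's implementation; the function names `k_fold_indices` and `cross_validation_splits` are made up for this example, and the sketch does not shuffle or stratify the rows, which a real tool may do.

```python
def k_fold_indices(n_examples, k):
    """Split the row indices 0..n_examples-1 into k folds of (nearly) equal size."""
    indices = list(range(n_examples))
    # The first (n_examples % k) folds get one extra row when k does not divide evenly.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    return folds


def cross_validation_splits(n_examples, k):
    """Yield one (train_indices, held_out_indices) pair per fold."""
    folds = k_fold_indices(n_examples, k)
    for i, held_out in enumerate(folds):
        # Train on every fold except the one currently held out.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


# Hypothetical example: 10 rows, k = 5, so each round trains on 8 rows and tests on 2.
splits = list(cross_validation_splits(10, 5))
for train, held_out in splits:
    print("train:", train, "held out:", held_out)
```

Every row is held out exactly once, so the averaged performance is an estimate built only from the training file; the separate test file is never touched in this loop.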


    Bella0812 Member
    Thanks for your answer @ceaperez !

    I know how the cross-validation operator works, and that's why I am confused. Do I still need to use the test set, which I got in a separate file, or can I ignore it because the cross-validation operator already did the testing?

    ceaperez Member
    Hi @Bella0812,
    The cross-validation operator performed the tests as described above, using only the data you fed into it. In this case, you can additionally validate the model on other data sets, such as your separate test file, by using the Apply Model operator.
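
The idea of applying a trained model to a separate labeled file can be sketched in Python as follows. This is only a toy illustration under stated assumptions: the "model" simply predicts the most frequent training label, the labels are made up rather than taken from the Kaggle files, and `train_majority_model`, `apply_model`, and `accuracy` are hypothetical helper names, not RapidMiner operators.

```python
from collections import Counter


def train_majority_model(train_labels):
    """Toy 'model': remember the most frequent label seen during training."""
    return Counter(train_labels).most_common(1)[0][0]


def apply_model(model, n_examples):
    """Predict the remembered label for every row of a new data set."""
    return [model] * n_examples


def accuracy(predicted, actual):
    """Fraction of rows where the prediction matches the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Made-up labels standing in for the training file and the separate test file.
train_labels = ["satisfied", "neutral or dissatisfied", "neutral or dissatisfied",
                "satisfied", "neutral or dissatisfied"]
test_labels = ["neutral or dissatisfied", "satisfied",
               "neutral or dissatisfied", "neutral or dissatisfied"]

model = train_majority_model(train_labels)           # fit on the training data only
predictions = apply_model(model, len(test_labels))   # score the unseen test rows
test_accuracy = accuracy(predictions, test_labels)   # compare with the known labels
print(test_accuracy)  # 0.75
```

Because the test file is labeled, the predictions can be compared with the true labels, which gives a performance figure on data the model never saw during training or cross-validation.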

