Performance result: Training vs Test

HeikoeWin786 (Member, Posts: 64, Contributor II)
Dear all,

I am new to RapidMiner and want to run a Naive Bayes classifier (NBC) on an airline dataset. The dataset contains labelled sentiment data (pos, neg, and neutral). I split the dataset 75/25 and performed the text processing (i.e. nominal to text, data to document, document preprocessing with tokenization and stopword filtering). However, in the word output of the document preprocessing operator, the neg, pos, and neutral columns all contain zero. Then, after applying the NBC, I get 87% accuracy on the training data but 0.00% accuracy on the test dataset.

Can you please kindly help me to understand what I am missing here?

Thanks a lot in advance!

Best Answer


  • HeikoeWin786 (Member, Posts: 64, Contributor II)

    Thanks a lot.
    I revisited the whole process: I split the data, and for the test data I used the word output from the text preprocessing of the training dataset. Now I get a result. However, the result for the training data and the test data is the same. Is this normal?
    E.g. Train data --> Text preprocessing (store the word output) --> NBC
    Test data --> Text preprocessing (input the word output from the step above) --> NBC
    The accuracy is 65% for both processes. Is that ideal?

    thanks and regards,
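
For readers outside RapidMiner, the idea of reusing the training word list on the test data can be sketched in plain Python. This is only an illustration under assumptions: the helper names (`tokenize`, `build_word_list`, `vectorize`) and the sample sentences are hypothetical and are not part of the original process. The point is that the test documents are mapped onto the columns learned from the training documents, so both sets end up with identical attributes.

```python
from collections import Counter


def tokenize(text):
    # Very rough tokenizer (assumption): lowercase, split on whitespace,
    # keep purely alphabetic tokens only.
    return [tok for tok in text.lower().split() if tok.isalpha()]


def build_word_list(train_texts):
    # The word list is built from the TRAINING documents only.
    return sorted({tok for text in train_texts for tok in tokenize(text)})


def vectorize(text, word_list):
    # Map a document onto the fixed word list; words that are not in
    # the word list are ignored.
    counts = Counter(tokenize(text))
    return [counts[word] for word in word_list]


# Hypothetical example documents, not taken from the airline dataset.
train_texts = ["great flight friendly crew", "late flight lost bag"]
test_texts = ["friendly crew but late departure"]

word_list = build_word_list(train_texts)
train_vectors = [vectorize(t, word_list) for t in train_texts]
test_vectors = [vectorize(t, word_list) for t in test_texts]
```

With this setup, train and test vectors always have the same length and column order, which is what a model trained on the training columns expects.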
  • Telcontar120 (Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 1,635, Unicorn)
    It is impossible to say without seeing the data.  It is certainly possible.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • aengler (Member, Posts: 6, Contributor II)
    edited July 2020
    Hi Heiko,
    It seems to me that the label is somehow getting lost. Can you check whether the word list still provides a label attribute (it is marked with a green column) in the training word data set? You can also check the roles. Some operators skip special attributes like the label, and it gets lost.
    Also, if you split 25/75 between test and training, it would be interesting to see this in the same process. If you always do the split in the same process, you prevent yourself from processing the training data differently than the test data.
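
The two checks above (the label is still present, and the split happens in one place) can be sketched in plain Python. This is a minimal illustration, not RapidMiner code: the row layout (a dict with `text` and `label` keys), the function names, and the sample rows are all assumptions made for the example.

```python
import random


def split_labelled(rows, test_fraction=0.25, seed=42):
    # Shuffle a copy and split once, so train and test come from the
    # same rows and go through the same later steps.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def rows_without_label(rows, allowed=("pos", "neg", "neutral")):
    # Return every row whose label is missing or not one of the
    # expected sentiment values.
    return [row for row in rows if row.get("label") not in allowed]


# Hypothetical rows, not taken from the airline dataset.
rows = [
    {"text": "great flight", "label": "pos"},
    {"text": "lost my bag", "label": "neg"},
    {"text": "flight was on time", "label": "neutral"},
    {"text": "friendly crew", "label": "pos"},
    {"text": "delayed again", "label": "neg"},
    {"text": "boarding at gate five", "label": "neutral"},
    {"text": "rude staff", "label": "neg"},
    {"text": "smooth landing", "label": "pos"},
]

train_rows, test_rows = split_labelled(rows)
bad_train = rows_without_label(train_rows)
bad_test = rows_without_label(test_rows)
```

If either `bad_train` or `bad_test` is non-empty after preprocessing, the label was dropped or altered somewhere along the way.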
  • HeikoeWin786 (Member, Posts: 64, Contributor II)
    Thanks a lot for the explanation. Yes, I followed the same process. And every time, my results for test and training (SVM or NBC) are almost the same.
    I was a bit unsure whether that is ideal, which is why I asked.

    thanks much,