
Overfitting - Sentiment Analysis

SilkeKoch Member Posts: 5 Newbie
edited May 2020 in Help
Hi everyone,
I am not very experienced. I used the sentiment template and created a model with about 83% accuracy, but the model does not predict the sentiments of my unseen data well; the average confidence is only about 50 to 60%. What can I do to get a model that generalizes better? And is there a way to compare my labeled data with the unlabeled data to see whether the low confidence is really that bad?
My training data is balanced, about 1,000 positive / 1,000 negative, and I applied the model to about 100 unlabeled examples.

Thank you very much for your help Silke


    Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    It's hard to answer a general question like this without knowing more about the details of the analysis.
    For example, how did you construct your model? Did you use cross-validation? Which ML algorithm are you using?
    Are the new 100 cases that you are validating similar to the original 2,000 that you built the model on?
    Did you ensure their text data was processed in the same way as your original model development sample? You should be using a pre-built wordlist when you apply this approach, which is something many users forget to do. There are some good tutorials and lessons on text mining in the RapidMiner Academy that you might want to check out.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
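The pre-built wordlist point above can be illustrated outside RapidMiner. Below is a minimal sketch in Python/scikit-learn (the example texts are hypothetical): the vocabulary learned on the training reviews is reused when transforming unseen reviews, instead of being re-learned, so both datasets end up with identical feature columns.

```python
# Sketch: reuse the training wordlist (vocabulary) on unseen data.
# Fitting a new vectorizer on the unseen data would produce different
# columns and make the trained model's predictions meaningless.
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["great product", "terrible service", "really great"]
unseen_docs = ["great but slow service"]

vectorizer = TfidfVectorizer(lowercase=True)
X_train = vectorizer.fit_transform(train_docs)   # learns the wordlist
X_unseen = vectorizer.transform(unseen_docs)     # reuses it; no re-fit

# Both matrices share the same columns (same vocabulary),
# so a model trained on X_train can score X_unseen directly.
assert X_train.shape[1] == X_unseen.shape[1]
```

In RapidMiner the equivalent step is storing the wordlist from Process Documents during training and feeding it back in when processing the unseen data.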
    SilkeKoch Member Posts: 5 Newbie
    I used cross-validation with an SVM (dot kernel). The 100 cases are similar to the originals, as they come from the same customer review dataset, and I used exactly the same preprocessing. I tried pruning but it doesn't help. Should I do pruning?
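For readers outside RapidMiner, the setup described here (cross-validated SVM with a dot, i.e. linear, kernel on text features) can be sketched in Python/scikit-learn; the toy documents and labels are made up for illustration.

```python
# Sketch: 5-fold cross-validation of a linear-kernel SVM on text,
# roughly mirroring the RapidMiner setup (dot kernel = linear kernel).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

docs = ["love it", "hate it", "great stuff", "awful stuff"] * 10
labels = [1, 0, 1, 0] * 10  # 1 = positive, 0 = negative

# Pipeline keeps the vectorizer inside the CV loop, so each fold's
# wordlist is learned only on that fold's training split (no leakage).
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
scores = cross_val_score(model, docs, labels, cv=5)
print(scores.mean())  # mean accuracy across the 5 folds
```

Putting the vectorizer inside the pipeline matters: preprocessing fitted on the full dataset before cross-validation leaks information and inflates the reported accuracy.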
    lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @SilkeKoch,

    Maybe SVM is not the most relevant model for your data.
    Have you tried submitting your dataset to AutoModel to see how some other models perform?
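Trying several model families on the same data, as AutoModel does automatically, can also be sketched in a few lines of Python/scikit-learn (the candidate models and toy data here are illustrative, not what AutoModel actually runs):

```python
# Sketch: compare a few classifier families with identical
# preprocessing and cross-validation, AutoModel-style.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

docs = ["love it", "hate it", "great stuff", "awful stuff"] * 10
labels = [1, 0, 1, 0] * 10

candidates = {
    "SVM (linear)": SVC(kernel="linear"),
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(),
}

results = {}
for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    results[name] = cross_val_score(pipe, docs, labels, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```

Keeping the vectorizer and cross-validation identical across candidates is what makes the accuracy numbers comparable.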



    SilkeKoch Member Posts: 5 Newbie
    Thank you, I did not try that yet, but I will.