Should One-Class SVM be trained on Positive or Negative examples?

mohammadreza Member Posts: 23 Contributor I
edited November 2018 in Help
Hi all,

I have a data set of roughly 1000 examples: 900 negative and 100 positive. I want to apply a one-class SVM and train the model on just one class label. Does anyone have an idea that would help me decide whether I should train the model on the negative examples or on the positive ones?

Cheers,

Answers

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 563   Unicorn
    Personally, this seems to be a two-class problem, so I'd choose a different learner. 

    However, since you want to use one class, I would advise training on the class with the highest ROI for your problem. 
    So, if matches to the positive class are worth more to you than matches to the negative class (for example, in a direct marketing problem), then train on the positive class. 
    If matches to the negative class are worth more to you (for example, in insurance fraud), then train on the negative class. 

    Given that you have far more negative than positive examples in your data, though, training on the negative class is the natural starting point. Bear in mind, however, that a one-class model never looks at the differences between the positive and the negative classes, so a model trained on the negative class might still accept results that should be positive. 

    You could always learn a one-class SVM on both and compare/combine the models.    ;)
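JEdward's last suggestion, training a one-class SVM on each class and comparing, could be sketched outside RapidMiner with scikit-learn's `OneClassSVM`. The two-dimensional Gaussian blobs, the 900/100 split, and the `nu=0.1` setting below are assumptions chosen purely for illustration:

```python
# Illustrative sketch (scikit-learn, not RapidMiner): train a one-class SVM
# on each class in turn and see how well each model accepts its own class
# and rejects the other. Data is synthetic, mimicking the thread's
# 900-negative / 100-positive imbalance.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in data: two well-separated Gaussian blobs (assumption).
X_neg = rng.normal(loc=0.0, scale=1.0, size=(900, 2))  # abundant "negative" class
X_pos = rng.normal(loc=4.0, scale=1.0, size=(100, 2))  # rare "positive" class

def evaluate(train_X, inlier_X, outlier_X, label):
    """Fit a one-class SVM on one class and report how it scores both classes."""
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(train_X)
    # predict() returns +1 for points accepted as the trained class, -1 otherwise.
    inlier_acc = np.mean(model.predict(inlier_X) == 1)
    outlier_rej = np.mean(model.predict(outlier_X) == -1)
    print(f"trained on {label}: accepts {inlier_acc:.0%} of its own class, "
          f"rejects {outlier_rej:.0%} of the other class")
    return model

evaluate(X_neg, X_neg, X_pos, "negative")
evaluate(X_pos, X_pos, X_neg, "positive")
```

With `nu=0.1`, roughly 10% of the training points end up outside the learned boundary by construction, so the comparison between the two directions comes down to how cleanly each model rejects the class it never saw.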
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,074  RM Data Scientist
    Hi,

    could you elaborate on why you do not use a regular SVM or another supervised learner? Going unsupervised/semi-supervised is always tricky, because it is hard to define performance values, and you definitely need them to tune the parameters of your one-class SVM.

    Cheers,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • mohammadreza Member Posts: 23 Contributor I
    Special thanks to Edward for his illuminating explanation.
    Martin, the reason I am considering a semi-supervised approach such as one-class SVM is that my negative examples are impossible to gather exhaustively, because there are far too many of them. I have gathered 1000 so far, but if I wanted I could gather even 100,000,000 negative examples. On the other hand, positive samples are very rare compared to negatives. Please let me know whether you think this is a good justification for using semi-supervised methods.

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,074  RM Data Scientist
    Hi,

    I personally would rather use 1000 examples per class and see whether it works that way. You simply lose too much predictive power if you go unsupervised.

    Cheers,
    Martin
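    Martin's suggestion of a balanced supervised model could be sketched like this with scikit-learn (again, not RapidMiner): downsample the abundant negative class to the size of the positive class and train an ordinary two-class SVM. The synthetic blobs and kernel settings are assumptions for illustration only:

```python
# Hedged sketch: balanced two-class SVM instead of one-class learning.
# Downsample the majority class to match the minority class, then train
# and cross-validate a standard SVC on the balanced set.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X_neg = rng.normal(0.0, 1.0, size=(900, 2))  # abundant negatives (synthetic)
X_pos = rng.normal(3.0, 1.0, size=(100, 2))  # rare positives (synthetic)

# Downsample negatives to the size of the positive class for a balanced set.
idx = rng.choice(len(X_neg), size=len(X_pos), replace=False)
X = np.vstack([X_neg[idx], X_pos])
y = np.array([0] * len(X_pos) + [1] * len(X_pos))  # 0 = negative, 1 = positive

scores = cross_val_score(SVC(kernel="rbf", gamma="scale"), X, y, cv=5)
print(f"balanced two-class SVM, 5-fold accuracy: {scores.mean():.2f}")
```

If discarding 800 negatives feels wasteful, an alternative in scikit-learn is to keep all the data and pass `class_weight="balanced"` to `SVC`, which reweights the classes instead of subsampling.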
  • mohammadreza Member Posts: 23 Contributor I
    Thank you all. Nice discussion.