
Figure out if there is any problem with my data set

njasaj Member Posts: 18 Contributor II
edited November 2018 in Help
Hi,
I am trying to classify a data set with three labels and 7 attributes using the libsvm operator. My data set is imbalanced: the class distribution is 882, 237, 273. Whenever I try to classify this data set, the computed model cannot discriminate between the classes and assigns all the points (except 30 of them) to the biggest class. I tried undersampling by drawing 200 points from every class with the simple sampling operator implemented in RapidMiner, but the result is not acceptable.
Is there any problem with my data set? I repeated this procedure for the iris data set and it worked.
Thanks.

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Then your data is probably not separable with the learning method (SVM) or the parameters you are using. Did you try to optimize the parameters of the SVM? You should try different kernels (linear/dot and radial/rbf are good choices to start with) and optimize the C parameter. When using the rbf kernel, the parameter gamma also needs to be optimized.
    Try an Optimize Parameters or Loop Parameters operator with a sensible Log operator inside to get an overview of the impact of the parameters. Good starting values for both C and gamma are 10e-4 to 10e+4 on a logarithmic scale.
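
A logarithmic scale here means stepping through powers of ten rather than evenly spaced values. As a rough sketch outside RapidMiner (the helper name `log_grid` and the integer exponent range are assumptions, taken from reading 10e-4 and 10e+4 literally as 1e-3 and 1e5), the candidate values could be generated like this:

```python
def log_grid(low_exp, high_exp, base=10.0):
    """Return base**k for every integer exponent k in [low_exp, high_exp]."""
    return [base ** k for k in range(low_exp, high_exp + 1)]


# Assumption: 10e-4 == 1e-3 and 10e+4 == 1e5 when the notation is read literally.
C_VALUES = log_grid(-3, 5)
GAMMA_VALUES = log_grid(-3, 5)

print(C_VALUES)
```

Each step multiplies the previous value by ten, so a handful of points covers many orders of magnitude for both C and gamma.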

    Best, Marius
  • njasaj Member Posts: 18 Contributor II
    Thank you, Marius. The poor results were obtained with parameter optimization already applied. I have tested evolutionary parameter optimization and tried to tune C and gamma of the rbf kernel. I will try the polynomial and sigmoid kernels too. Would you mind describing, or posting the XML code for, how to use cost-sensitive meta learning with parameter optimization in RapidMiner for an imbalanced data set? I guess that simply reducing the number of samples of the larger class at random is not the proper approach and that I must use a more advanced sampling technique.
    Thanks a lot.
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    Instead of the evolutionary search I would go for a systematic grid search (Optimize Parameters (Grid)), since the evolutionary one has some disadvantages (e.g. very long execution times if by chance a bad parameter combination is chosen in one of the generations). By logging the values, you then get a very nice overview of the impact of the different parameters.
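
The idea of a systematic grid search with logging can be sketched in a few lines of Python. This is not the RapidMiner operator itself, only an illustration: `grid_search` and `toy_evaluate` are hypothetical names, and `toy_evaluate` stands in for a real validation of the SVM (for example a cross-validated accuracy).

```python
import math
from itertools import product


def grid_search(c_values, gamma_values, evaluate):
    """Try every (C, gamma) pair, log each score, return the log and the best row."""
    log = []
    for c, gamma in product(c_values, gamma_values):
        score = evaluate(c, gamma)
        log.append({"C": c, "gamma": gamma, "score": score})
    best = max(log, key=lambda row: row["score"])
    return log, best


def toy_evaluate(c, gamma):
    """Hypothetical stand-in for a real SVM validation; best at C=10, gamma=0.1."""
    return -abs(math.log10(c) - 1) - abs(math.log10(gamma) + 1)


log, best = grid_search([0.1, 1, 10, 100], [0.01, 0.1, 1], toy_evaluate)
for row in log:
    print(row)
print("best:", best)
```

Because every combination is evaluated exactly once, the run time is predictable, and the full log shows how the score changes along each parameter rather than only the final winner.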

    For the balancing, I personally would optimize the balancing in a separate step/process with the same technique, i.e. try different balancing values with Loop Parameters (Grid) and log the values, then fix the best value (which will in most cases be near a balanced data set) and use it in the actual SVM parameter optimization.
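
One way to picture "different balancing values" is as a cap on how much larger any class may be than the smallest one. The following is a minimal sketch under that assumption; the function `undersample`, the parameter `max_ratio`, and the label names are hypothetical, and only the class sizes come from the original question.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, labels, max_ratio, seed=0):
    """Randomly drop examples so no class exceeds max_ratio * smallest class size."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    cap = int(max_ratio * min(len(members) for members in by_class.values()))
    rng = random.Random(seed)
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        kept = members if len(members) <= cap else rng.sample(members, cap)
        out_rows.extend(kept)
        out_labels.extend([label] * len(kept))
    return out_rows, out_labels


# Class sizes from the question (882, 237, 273); label names are placeholders.
labels = ["a"] * 882 + ["b"] * 237 + ["c"] * 273
rows = list(range(len(labels)))
for ratio in (1.0, 1.5, 2.0, 4.0):
    _, y = undersample(rows, labels, ratio)
    print(ratio, dict(Counter(y)))
```

A ratio of 1.0 gives a fully balanced set, while larger ratios keep more of the majority class; each candidate ratio would then be scored with a fixed learner to pick the value to carry into the SVM parameter optimization.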

    Best, Marius
  • njasaj Member Posts: 18 Contributor II
    Thank you for your answers and support.