# Figure out if there is any problem with the dataset

Hi,

I am trying to classify a data set with three labels and 7 attributes using the LibSVM operator. My data set is imbalanced: the class distribution is 882, 237, 273. Whenever I try to classify this data set, the computed model cannot discriminate between the classes and assigns all the points (except 30 of them) to the biggest one. I tried undersampling by drawing 200 points from every class with the simple sampling operator implemented in RapidMiner, but the result is not acceptable.

Is there any problem with my data set? I repeated this procedure for the Iris data set and it worked.

Thanks.

## Answers

Try an Optimize Parameters or Loop Parameters operator with a sensible Log operator inside to get an overview of the impact of the parameters. Good starting values for both C and gamma are 10e-4 to 10e+4 on a logarithmic scale.

Best, Marius
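For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of a logged grid search over C and gamma on a logarithmic scale. It reads the suggested range as roughly 10^-4 to 10^4, which is an assumption. The names `log_grid`, `grid_search`, and `toy_score` are hypothetical, and `toy_score` is only a stand-in for the cross-validated performance an actual SVM would give.

```python
import itertools
import math


def log_grid(low_exp, high_exp, steps):
    """Return `steps` values evenly spaced on a log10 scale."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    step = (high_exp - low_exp) / (steps - 1)
    return [10 ** (low_exp + i * step) for i in range(steps)]


def grid_search(score_fn, c_values, gamma_values):
    """Score every (C, gamma) pair; return the full log and the best row."""
    log = []
    for c, gamma in itertools.product(c_values, gamma_values):
        log.append({"C": c, "gamma": gamma, "score": score_fn(c, gamma)})
    best = max(log, key=lambda row: row["score"])
    return log, best


def toy_score(c, gamma):
    # Hypothetical placeholder: replace with cross-validated SVM performance.
    # It peaks at C = 100 and gamma = 0.01 purely for illustration.
    return -((math.log10(c) - 2) ** 2 + (math.log10(gamma) + 2) ** 2)


c_values = log_grid(-4, 4, 9)      # 1e-4, 1e-3, ..., 1e4
gamma_values = log_grid(-4, 4, 9)
log, best = grid_search(toy_score, c_values, gamma_values)
```

The `log` list plays the role of the Log operator: every parameter combination is kept with its score, so the influence of C and gamma can be inspected afterwards instead of only seeing the winner.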

Thanks a lot.

Instead of the evolutionary search, I would go for a systematic grid search (Optimize Parameters (Grid)), since the evolutionary one has some disadvantages (e.g. very long execution times if by chance a bad parameter combination is chosen in one of the generations). By logging the values, you then get a very nice overview of the impact of the different parameters.

For the balancing, I personally would optimize the balancing in a separate step/process with the same technique, i.e. try different balancing values with Loop Parameters (Grid) and log the values, then fix the best value (which will in most cases be near a balanced data set) and use it in the actual SVM parameter optimization.

Best, Marius
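The separate balancing step could look roughly like the following Python sketch, which loops over several per-class caps and logs the resulting class counts. The class sizes 882, 237, and 273 come from the question. The label names, the cap values, and the `undersample` helper are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def undersample(labels, cap, seed=0):
    """Return sorted indices keeping at most `cap` random examples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept = []
    for indices in by_class.values():
        if len(indices) <= cap:
            kept.extend(indices)
        else:
            kept.extend(rng.sample(indices, cap))
    return sorted(kept)


# Hypothetical labels reproducing the class distribution from the question.
labels = ["a"] * 882 + ["b"] * 237 + ["c"] * 273

balance_log = []
for cap in (200, 237, 400, 882):  # assumed candidate balancing values
    kept = undersample(labels, cap)
    counts = Counter(labels[i] for i in kept)
    # In a real run, also train on `kept` and record the score measured on
    # untouched hold-out data, then pick the cap with the best score.
    balance_log.append({"cap": cap, "counts": dict(counts)})
```

Once the best cap is fixed from the log, the same sampled data can feed the C and gamma grid search, so the two optimizations stay separate.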