
train set and test set proportion...

yogafire Member Posts: 43 Contributor II
Hello,

I would like to ask about your experience in partitioning a labelled data set into a train set and a test set for the modelling and validation phases of data mining.

My situation is as follows.

I have about 32,000 records with 20 attributes and one binominal label. In your experience, what proportion should I use to divide the data set into train and test sets? Which of these two options should I choose?

1. more data for training, the rest for testing (e.g. 80% of the data set for training, 20% for testing)
2. more data for testing, the rest for training (e.g. 20% of the data set for training, 80% for testing)

After choosing an option there is another issue: sampling. What type of sampling should I choose? Does RapidMiner 5 have a procedure to solve that issue?

I really appreciate your reply.
Thank you.

Regards,

Dimas Yogatama

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hello Dimas,

    Well, actually I would propose not to use a single train-set / test-set scenario but a cross validation instead. Why? For two reasons: a k-fold cross validation uses each example (k-1) times for training and exactly once for testing, so there is less selection bias from sampling an overly easy or overly hard test set. The second reason is that, thanks to the averaging, you also get a standard deviation which gives you some feeling for the robustness of your performance estimate. The average and standard deviation also allow for significance tests in order to decide whether one model was really superior to another. By the way, quite a lot of people even suggest using a 10x10-fold cross validation (possible in RapidMiner with a cross validation nested into a Loop and Average operator), but since this requires 100 learning runs it is often not an option. From my experience, a 10-fold cross validation is usually a good choice (unless you want to predict a time series, in which case a batched validation would be better).

    Runtime is, from my point of view, the only real argument for preferring a single split over a cross validation. I would use at least half of the data for training; a value of 2/3 or 70% is often recommended. However, you can still get unlucky with your 30% test sample, and you have no chance to detect this without repetition (which brings me back to my pro-cross-validation speech  ;D ).
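    In case it helps to see the idea outside of RapidMiner, here is a minimal Python/scikit-learn sketch of both approaches (the file name "data.csv" and the column name "label" are just placeholders, and the 20 attributes are assumed to be numeric): it reports the mean and standard deviation of a 10-fold cross validation and, for comparison, the single accuracy number of a stratified 70/30 split.

    ```python
    # Minimal sketch, not a RapidMiner process: 10-fold cross validation vs. a single 70/30 split.
    # "data.csv" and the column name "label" are placeholders; attributes are assumed numeric here.
    import pandas as pd
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("data.csv")                    # ~32,000 rows, 20 attributes, 1 binominal label
    X, y = df.drop(columns=["label"]), df["label"]

    clf = DecisionTreeClassifier(random_state=0)

    # 10-fold CV: every example is used 9 times for training and exactly once for testing,
    # and the spread of the 10 scores shows how robust the estimate is.
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

    # Single stratified 70/30 split for comparison: one number, no way to see how (un)lucky the split was.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
    clf.fit(X_train, y_train)
    print("70/30 split accuracy:", accuracy_score(y_test, clf.predict(X_test)))
    ```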

    Cheers,
    Ingo

  • yogafire Member Posts: 43 Contributor II
    Maybe I would implement it like this:

    1. 70% of the data set with feature selection and 10-fold cross validation to estimate the performance on the training set.
    2. 30% of the data set will be the test set.

    After that I will run the prediction on my forecast data.

    But I have some trouble choosing an algorithm because my data is highly heterogeneous: the label is binominal, but the attributes are a mix of numerical, binominal and polynominal types.

    Do you have any idea which algorithm I should choose? ??? And does bagging (or maybe another meta-modelling technique) make much difference in this binominal classification case?

    Thank you very much.

    Regards,

    Dimas Yogatama
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    Actually, I would also recommend using an outer cross validation around your feature selection cross validation subprocess. Why not? You would get the same advantages as known from the inner cross validation...
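    Roughly sketched in Python/scikit-learn (just to illustrate the structure, not your actual RapidMiner process; the data file and column name are placeholders): the feature selection, which itself uses an inner cross validation, is wrapped together with the learner in one pipeline, and that whole pipeline is evaluated by the outer cross validation.

    ```python
    # Sketch of an outer cross validation wrapped around a feature selection that uses an inner CV.
    # "data.csv" and "label" are placeholders; attributes are assumed numeric here.
    import pandas as pd
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv("data.csv")
    X, y = df.drop(columns=["label"]), df["label"]

    learner = DecisionTreeClassifier(random_state=0)

    pipe = Pipeline([
        # Inner part: forward feature selection, judged by its own 10-fold cross validation.
        ("select", SequentialFeatureSelector(learner, n_features_to_select=10, cv=10)),
        ("model", learner),
    ])

    # Outer part: the selection is re-run on every outer training fold and never sees the
    # corresponding outer test fold, so the performance estimate stays unbiased.
    outer = cross_val_score(pipe, X, y, cv=10)
    print(f"nested CV accuracy: {outer.mean():.3f} +/- {outer.std():.3f}")
    ```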

    "But I have some trouble choosing an algorithm because my data is highly heterogeneous: the label is binominal, but the attributes are a mix of numerical, binominal and polynominal types."
    This is a typical scenario and not a problem in general. Some modelling schemes directly support such data sets (for example decision trees). Others cannot work on such data, but this can be remedied by appropriate data preprocessing. The quick fixes of RapidMiner 5 should help you in most cases where the modelling scheme does not fit the given data. Just use the repository as data source and allow the meta data propagation, and this should give you more insight.
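    As a rough illustration of that preprocessing idea (Python/scikit-learn again, with placeholder file and column names): the binominal and polynominal attributes are turned into numbers via one-hot encoding before they reach a learner (here an SVM) that only accepts numerical input.

    ```python
    # Sketch: making a mixed attribute set (numerical + binominal + polynominal) usable for a
    # learner that cannot handle nominal values directly. Column types are detected from the dtypes.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.svm import SVC

    df = pd.read_csv("data.csv")                    # placeholder file with a binominal "label" column
    X, y = df.drop(columns=["label"]), df["label"]

    nominal_cols = X.select_dtypes(include="object").columns   # binominal / polynominal attributes
    numeric_cols = X.select_dtypes(exclude="object").columns   # numerical attributes

    preprocess = ColumnTransformer([
        ("nominal", OneHotEncoder(handle_unknown="ignore"), nominal_cols),   # nominal -> numerical
        ("numeric", StandardScaler(), numeric_cols),                         # scale the numbers
    ])

    # The SVM only works on numbers; the preprocessing step makes the mixed data acceptable to it.
    model = Pipeline([("prep", preprocess), ("svm", SVC())])
    model.fit(X, y)
    ```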

    "And does bagging (or maybe another meta-modelling technique) make much difference in this binominal classification case?"
    Whether bagging helps is not a question of the type of classification problem. In general, bagging makes the model a bit more robust but often does not really improve prediction accuracy. Boosting might help more here, but from my experience a good data representation and/or additionally extracted or generated features often help much more than applying meta learning schemes.
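    If you want to check this on your own data rather than speculate, a quick comparison along these lines (again a Python/scikit-learn stand-in with placeholder data) usually answers it: cross-validate the plain learner, a bagged version and a boosted version and look at the differences.

    ```python
    # Sketch: does bagging or boosting actually improve over the plain learner on this data?
    # Placeholder data as before; attributes are assumed numeric here.
    import pandas as pd
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv("data.csv")
    X, y = df.drop(columns=["label"]), df["label"]

    candidates = {
        "plain tree": DecisionTreeClassifier(random_state=0),
        "bagging":    BaggingClassifier(DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0),
        "boosting":   AdaBoostClassifier(n_estimators=50, random_state=0),
    }

    for name, clf in candidates.items():
        scores = cross_val_score(clf, X, y, cv=10)
        print(f"{name:10s} accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
    ```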

    Cheers,
    Ingo
  • yogafire Member Posts: 43 Contributor II
    Can boosting handle all kinds of attributes and all kinds of labels?

    Then I would like to ask something that is maybe off topic: which meta-modelling technique should I use to improve accuracy in the case of classification with a numerical (continuous) label?
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    "Can boosting handle all kinds of attributes and all kinds of labels?"
    As far as I remember, this depends on the inner learning scheme which should be boosted. And I am not completely sure whether anything other than nominal values is allowed for the label, but I would doubt it. Just try it out in RapidMiner. It works if it works  ;)

    Cheers,
    Ingo
  • yogafire Member Posts: 43 Contributor II
    Thank you very much.

    Then, how much accuracy is considered reasonable for classification? Is 81% good enough?

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hello,

    Whether 81% is good enough cannot be answered in general and depends on many factors. Just a few examples: if you have two classes and one of them covers 81% of your data, it probably is not, since you could simply guess instead of using a data mining model. Even if you are far better than just guessing, 81% can still be bad if somebody's life depends on the prediction. On the other hand, if you were able to predict the outcome of a flipped coin with 81% accuracy, I would start betting and become rich in a short time, and 81% would certainly be good enough (as 51% would have been). You see: no general answer  ;)

    Cheers,
    Ingo
  • yogafire Member Posts: 43 Contributor II
    And now I am having some problems with my data set...

    As you know, my data set has a binominal label ("yes" and "no").

    I used all kinds of trees and every type of validation and attribute selection you suggested, but this is all I get in my results:

    [screenshot: the model predicts only the "no" class]

    It means that the model cannot predict the other class ("yes"). The records labelled "yes" make up only about 20% of the entire data set (about 5,200 out of 27,000). So the accuracy seems good (about 80%), but applying this model would be "harakiri". :'(
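    Just to show the numbers, this little Python sketch (not my real process) reproduces the effect with roughly my class proportions: a model that always answers "no" already reaches about 80% accuracy while never finding a single "yes".

    ```python
    # Toy sketch of the problem: ~5,200 "yes" out of 27,000 records.
    import numpy as np
    from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

    y_true = np.array(["yes"] * 5200 + ["no"] * 21800)   # roughly the real class distribution
    y_pred = np.array(["no"] * 27000)                    # a model that only ever predicts "no"

    print("accuracy:       ", accuracy_score(y_true, y_pred))                       # ~0.807
    print("recall on 'yes':", recall_score(y_true, y_pred, pos_label="yes"))        # 0.0
    print(confusion_matrix(y_true, y_pred, labels=["yes", "no"]))
    ```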

    What should I do? I desperately need help... :-[

    Thank you very much for your reply...

    Regards,

    Dimas Yogatama
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Sorry, but this clearly goes beyond the topic and also the purpose of the general "data mining" board here. You might post this question on another RapidMiner board, for example in the "Data Mining Processes" subboard or in "Problems & Support".

    Cheers,
    Ingo