sampling / learning curve
Dear all,
Sampling the training set can have a major impact on classification accuracy, especially when the data is skewed.
Let's say you have a dataset of 100k negative examples and 1k positive examples,
and you wish to experiment with different pos/neg ratios in the training set.
To do this you need:
example filter: select all negative
example filter: absolute amount
example filter: select all positive
example filter: absolute amount
merge
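For readers who want to see the five steps outside of RapidMiner, here is a minimal Python sketch of the same filter / sample / filter / sample / merge workflow. The function name `sample_binary`, the `label` key, and the toy data are made up for illustration, and it assumes that "absolute amount" means a random sample without replacement.

```python
import random

def sample_binary(examples, n_neg, n_pos, label_key="label", seed=0):
    """Hypothetical helper mirroring the five manual steps for two classes."""
    rng = random.Random(seed)
    # Step 1: example filter, select all negative
    negatives = [e for e in examples if e[label_key] == "negative"]
    # Step 2: keep an absolute amount (assumed: random, without replacement)
    negatives = rng.sample(negatives, n_neg)
    # Step 3: example filter, select all positive
    positives = [e for e in examples if e[label_key] == "positive"]
    # Step 4: keep an absolute amount
    positives = rng.sample(positives, n_pos)
    # Step 5: merge
    return negatives + positives

# Toy data with the same 100:1 skew, scaled down from 100k / 1k
data = [{"id": i, "label": "negative"} for i in range(1000)]
data += [{"id": 1000 + i, "label": "positive"} for i in range(10)]

# A 2:1 neg/pos training set
train = sample_binary(data, n_neg=20, n_pos=10)
```

Every extra class adds another filter step and another sampling step, which is the overhead described here.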
When there are more than two classes, it gets even more cumbersome.
It would be cool if this could be combined into a single operator.
This might also be faster and more memory efficient.
Best regards,
Wessel
Answers
Just to make sure I understand: what would the parameters of your operator be? If I get it right, they would be
- a ratio for each class
- an absolute number of examples you want as output?
Cheers,
Simon
Input: a dataset
Parameter fields:
label = class_A [absolute amount] or [relative amount] and [sampling type]
label = class_B [absolute amount] or [relative amount] and [sampling type]
...
label = class_Z [absolute amount] or [relative amount] and [sampling type]
Defaults: absolute amount = '', relative amount = 1, sampling type = linear
Examples:
Input: a dataset with 2000 examples of class A
class_A [1000] or [] and [linear]: returns a dataset containing the first 1000 instances of class A
class_A [1000] or [] and [random]: returns a dataset containing 1000 randomly sampled instances of class A
class_A [] or [0.5] and [linear]: returns a dataset containing the first 1000 instances of class A
class_A [] or [0.5] and [random]: returns a dataset containing 1000 randomly sampled instances of class A
class_A [3000] or [] and [random]: returns an error?
class_A [] or [1.4] and [random]: returns an error?
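
To make the proposal concrete, here is a rough Python sketch of how such an operator could behave. It is not RapidMiner code: the name `sample_per_class`, the `spec` dictionary layout, and the `label` key are invented for illustration. It assumes sampling without replacement, so for the two open "error?" cases it raises an error; an operator that samples with replacement could oversample instead.

```python
import random
from collections import defaultdict

def sample_per_class(examples, spec, label_key="label", seed=0):
    """Hypothetical per-class sampler.

    spec maps a class label to a dict with optional keys
    'absolute' (int), 'relative' (float) and 'sampling' ('linear' or 'random').
    Defaults follow the post: no absolute amount, relative = 1, linear.
    """
    rng = random.Random(seed)

    # Group the examples by class, preserving their original order
    by_class = defaultdict(list)
    for e in examples:
        by_class[e[label_key]].append(e)

    result = []
    for label, rows in by_class.items():
        params = spec.get(label, {})
        absolute = params.get("absolute")
        relative = params.get("relative", 1.0)
        sampling = params.get("sampling", "linear")

        # An absolute amount, if given, takes precedence over the relative one
        n = absolute if absolute is not None else round(relative * len(rows))

        # Assumption: no replacement, so asking for more than exists is an error
        if n < 0 or n > len(rows):
            raise ValueError(f"class {label!r}: requested {n} of {len(rows)} examples")

        if sampling == "linear":
            result.extend(rows[:n])             # the first n instances
        elif sampling == "random":
            result.extend(rng.sample(rows, n))  # n instances drawn at random
        else:
            raise ValueError(f"unknown sampling type: {sampling!r}")
    return result

# The example input from above: 2000 examples of class A
data = [{"id": i, "label": "A"} for i in range(2000)]

first_half = sample_per_class(data, {"A": {"relative": 0.5, "sampling": "linear"}})
random_1000 = sample_per_class(data, {"A": {"absolute": 1000, "sampling": "random"}})
```

With this layout, a dataset with more classes only needs more entries in `spec`, and classes that are not listed are passed through unchanged because of the default relative amount of 1.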