Discretizing by frequency with highly modal data
tennenrishin
Member Posts: 177 Contributor II
The Discretize by Frequency operator fails with this message:
The selected number of ranges is not applicable for the attribute x, because it has too many equal values.
If there are too many same values, a bin might grow over specified size, because values can't be distinguished. If it grows more than twice it's size some bins would vanish completely, causing this error.
The parent process is run on a wide variety of input example sets. Is there a simple way to make RM handle this case by letting individual bins grow indefinitely and basing the frequency discretization on the remainder of the data?
For example, the data {1,1,1,1,1,2,3,8,9} with a bin count of 3 should be binned as follows:
1,1,1,1,1
2,3
8,9
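The binning requested above can be sketched outside RapidMiner in plain Python. This is only an illustration of the desired behaviour, not the operator's implementation: the function name `tie_tolerant_frequency_bins` and its greedy strategy are assumptions made for this example. Equal values are never split across bins, and after each bin is closed the target size is recomputed from the values that are still unassigned.

```python
from collections import Counter


def tie_tolerant_frequency_bins(values, n_bins):
    """Split values into at most n_bins bins of roughly equal frequency.

    Hypothetical helper, not part of RapidMiner. Ties always stay in one
    bin; a bin that overshoots because of ties simply grows, and the
    remaining bins share the remaining data.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")

    # Sorted (distinct value, frequency) pairs.
    groups = sorted(Counter(values).items())
    remaining = sum(freq for _, freq in groups)  # values not yet binned
    bins = []
    i = 0
    while i < len(groups):
        bins_left = n_bins - len(bins)
        if bins_left == 1:
            # The last bin takes everything that is left.
            bins.append([v for v, freq in groups[i:] for _ in range(freq)])
            break
        # Target size based only on the data that is still unassigned.
        target = remaining / bins_left
        current = []
        while i < len(groups) and len(current) < target:
            value, freq = groups[i]
            current.extend([value] * freq)  # keep all ties together
            i += 1
        bins.append(current)
        remaining -= len(current)
    return bins


print(tie_tolerant_frequency_bins([1, 1, 1, 1, 1, 2, 3, 8, 9], 3))
# → [[1, 1, 1, 1, 1], [2, 3], [8, 9]]
```

If the data has fewer distinct values than requested bins, this sketch returns fewer bins instead of raising an error, which is the point at which a warning could be issued.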
EDIT: What I'm basically saying is:
The Discretize by Frequency operator can fail fatally just because of a coincidence in the input data. Should this exception not be handled internally by the operator, perhaps with a warning message stating that some bins might be bigger than expected?
Answers
I created an internal bug report for that.
Best, Marius
Here is a minimal demo of the problem.