# "Bug in MinimalEntropyPartitioning?"

Member Posts: 0 Newbie
edited May 2019 in Help
Hello everybody,

I get strange results when I apply MinimalEntropyPartitioning to some datasets and wonder whether this is due to a bug in the implementation.

Let me illustrate the problem: I have a dataset with one attribute ("X") and one label with two possible values.
There are 6 possible values for X, 1 to 6. In total, I have 1116 rows, with the following target label distributions:

X-value    #negatives #positives #rows
1.0        124        62         186
2.0        124        62         186
3.0          0        186        186
4.0          0        186        186
5.0        124        62         186
6.0        124        62         186

Now of course I would expect a discretization into [-∞, 2], ]2, 4], ]4, ∞], with 372 rows in each range. Instead, I get:

range1 [-∞ - 2] (372), range2 [2 - 5] (558), range3 [5 - ∞] (186)

It seems like there is a bug in the operator: it does not appear to distinguish correctly between open and closed interval limits.
Does anybody know of a solution or a workaround?
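
As a sanity check on the expectation above, the size-weighted class entropy of each candidate cut can be computed directly from the table. The following is only a minimal sketch: the class and method names are made up for illustration, it is not the operator's code, and it ignores the MDL stopping criterion that entropy-based discretization normally applies on top of the entropy comparison.

```java
import java.util.Locale;

/** Sanity check of the expected cut points, using the class counts from the table. */
class SplitEntropyCheck {
    // Class counts per X value (1..6), copied from the table; index i holds X = i + 1.
    static final int[] NEG = {124, 124, 0, 0, 124, 124};
    static final int[] POS = {62, 62, 186, 186, 62, 62};

    /** Binary logarithm. */
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Class entropy of a (neg, pos) count pair; empty classes contribute 0. */
    static double entropy(int neg, int pos) {
        double total = neg + pos;
        double h = 0.0;
        for (int count : new int[] {neg, pos}) {
            if (count > 0) {
                double p = count / total;
                h -= p * log2(p);
            }
        }
        return h;
    }

    /** Size-weighted entropy of splitting indices [from, to) into [from, cut) and [cut, to). */
    static double weightedEntropy(int from, int cut, int to) {
        int n1 = 0, p1 = 0, n2 = 0, p2 = 0;
        for (int i = from; i < cut; i++) { n1 += NEG[i]; p1 += POS[i]; }
        for (int i = cut; i < to; i++) { n2 += NEG[i]; p2 += POS[i]; }
        double size1 = n1 + p1;
        double size2 = n2 + p2;
        return (size1 * entropy(n1, p1) + size2 * entropy(n2, p2)) / (size1 + size2);
    }

    public static void main(String[] args) {
        // First level: every cut over the full range X = 1..6.
        for (int cut = 1; cut < 6; cut++) {
            System.out.printf(Locale.ROOT, "X <= %d : %.4f%n", cut, weightedEntropy(0, cut, 6));
        }
        // Second level: cuts inside the remaining range X = 3..6.
        for (int cut = 3; cut < 6; cut++) {
            System.out.printf(Locale.ROOT, "3..6, X <= %d : %.4f%n", cut, weightedEntropy(2, cut, 6));
        }
    }
}
```

On the full range, the cuts X <= 2 and X <= 4 tie for the lowest weighted entropy (about 0.918 each), and within X = 3..6 the cut X <= 4 is lowest (about 0.459), which agrees with the expected ranges.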

Best,

Henrik
RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
Hi Henrik,
this does indeed seem to be a problem. Perhaps you could add a tiny bit of noise to your values, which would resolve the non-uniqueness that causes your problem.

But to solve it in general I will take a look at the code.

Greetings,
Sebastian
Member Posts: 0 Newbie
Hi Sebastian,

thanks for the reply. I also thought that the problem could be reduced if I had more continuous values. But of course it would be best if you could fix the problem in general.

Best,

Henrik
Member Posts: 0 Newbie
Hi,

in the meantime I found the bug and fixed it. The bug is in the function
private Double getMinEntropySplitpoint(LinkedList<double[]> truncatedExamples, Attribute label) {

in the class MinimalEntropyDiscretization. It does not consider the case where a split results in 0 examples of one class. Here is the fix:

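// Calculate entropies.
double entropy1 = 0.0d;
for (int i = 0; i < label.getMapping().size(); i++) {
    // Skip classes with no examples on this side of the split.
    if (frequencies1[i] != 0.0d) {
        entropy1 -= frequencies1[i] * MathFunctions.ld(frequencies1[i]);
    }
}
double entropy2 = 0.0d;
for (int i = 0; i < label.getMapping().size(); i++) {
    // Skip classes with no examples on this side of the split.
    if (frequencies2[i] != 0.0d) {
        entropy2 -= frequencies2[i] * MathFunctions.ld(frequencies2[i]);
    }
}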
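
To illustrate why the zero-class case matters, here is a small standalone sketch. It assumes that MathFunctions.ld is a plain base-2 logarithm built on Math.log, and that the split search keeps a candidate only if its entropy is smaller than the best one so far; both are assumptions about the surrounding code, and the class and method names below are hypothetical.

```java
/** Shows why a class with zero frequency breaks an unguarded entropy sum. */
class ZeroFrequencyEntropyDemo {
    /** Binary logarithm; a stand-in for MathFunctions.ld (assumed behaviour). */
    static double ld(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Entropy sum without any check for empty classes. */
    static double unguardedEntropy(double[] frequencies) {
        double entropy = 0.0d;
        for (double f : frequencies) {
            // For f == 0: ld(0) is -Infinity, and 0 * -Infinity is NaN.
            entropy -= f * ld(f);
        }
        return entropy;
    }

    /** Entropy sum that treats empty classes as contributing 0. */
    static double guardedEntropy(double[] frequencies) {
        double entropy = 0.0d;
        for (double f : frequencies) {
            if (f > 0.0d) {
                entropy -= f * ld(f);
            }
        }
        return entropy;
    }

    public static void main(String[] args) {
        // Relative class frequencies of a pure partition, e.g. X in {3, 4}:
        // 0 negatives and 372 positives.
        double[] pure = {0.0d, 1.0d};
        double unguarded = unguardedEntropy(pure);

        System.out.println(Double.isNaN(unguarded)); // true
        System.out.println(guardedEntropy(pure));    // 0.0
        // Every comparison with NaN is false, so a "smaller than the best
        // so far" test can never accept this split.
        System.out.println(unguarded < 1.0);         // false
    }
}
```

Under these assumptions, the cuts after X = 3 and X = 4 inside the range 3..6 both produce a pure side and therefore NaN, which leaves the cut after X = 5 as the only usable candidate. That would be consistent with the ranges [2 - 5] and [5 - ∞] I got.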

Best,

Henrik
Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
Hi Henrik,

thanks for sending this in! We will check and integrate your suggestion as soon as possible.

Cheers,
Ingo