# "Bug in MinimalEntropyPartitioning?"

Hello everybody,

I get strange results when I apply MinimumEntropyPartitioning on some datasets and wonder whether this is due to a bug in the implementation.

Let me illustrate the problem: I have a dataset with one attribute ("X") and one label with two possible values.

There are 6 possible values for X, 1 to 6. In total, I have 1116 rows, with the following target label distributions:

| X-value | #negatives | #positives | #rows |
|---|---|---|---|
| 1.0 | 124 | 62 | 186 |
| 2.0 | 124 | 62 | 186 |
| 3.0 | 0 | 186 | 186 |
| 4.0 | 0 | 186 | 186 |
| 5.0 | 124 | 62 | 186 |
| 6.0 | 124 | 62 | 186 |

Now of course I would expect a discretization into [-∞, 2], ]2, 4], ]4, ∞], each containing 372 rows. Instead, I get:

range1 [-∞ - 2] (372), range2 [2 - 5] (558), range3 [5 - ∞] (186)

It seems like there is a bug in the operator that does not correctly distinguish open and closed interval limits.
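To make the expectation concrete, here is a minimal sketch that computes the size-weighted class entropy for every candidate cut in the table above. It is not the operator's code: the class name `EntropyCutSketch` and its methods are made up for illustration, and any stopping criterion the operator applies is ignored.

```java
import java.util.Arrays;

class EntropyCutSketch {
    // X values and class counts {negatives, positives}, copied from the table above.
    static final double[] X = {1, 2, 3, 4, 5, 6};
    static final int[][] COUNTS = {
        {124, 62}, {124, 62}, {0, 186}, {0, 186}, {124, 62}, {124, 62}
    };

    // Shannon entropy (base 2) of a count vector; empty classes contribute 0.
    static double entropy(int[] counts) {
        int total = Arrays.stream(counts).sum();
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return h;
    }

    // Size-weighted entropy of the two partitions obtained by cutting the
    // index range [from, to) right after index lastLeft.
    static double weightedEntropy(int from, int to, int lastLeft) {
        int[] left = new int[2];
        int[] right = new int[2];
        for (int i = from; i < to; i++) {
            int[] side = (i <= lastLeft) ? left : right;
            side[0] += COUNTS[i][0];
            side[1] += COUNTS[i][1];
        }
        int nLeft = left[0] + left[1];
        int nRight = right[0] + right[1];
        return (nLeft * entropy(left) + nRight * entropy(right)) / (nLeft + nRight);
    }

    public static void main(String[] args) {
        // First level: all five cuts over the full range.
        for (int i = 0; i < X.length - 1; i++) {
            System.out.printf("cut after X=%.0f: %.4f%n",
                    X[i], weightedEntropy(0, X.length, i));
        }
    }
}
```

On the full range, the cuts after X=2 and after X=4 tie for the lowest weighted entropy (roughly 0.918 bits). Within the remaining range X=3..6, `weightedEntropy(2, 6, 3)` (cut after 4) comes out lower than `weightedEntropy(2, 6, 4)` (cut after 5), which matches the expected boundaries at 2 and 4.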

Does anybody know of a solution or a workaround?

Best,

Henrik

## Answers

This seems to be a problem indeed. Perhaps you could add a tiny bit of noise to your values, which would resolve the non-uniqueness that causes your problem.

But to solve it in general I will take a look at the code.
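The noise workaround could be sketched roughly like this. The helper `jitter` and its parameters are hypothetical, not part of RapidMiner, and whether jittering actually avoids the behaviour in the operator is not verified here. `eps` should be much smaller than the gap between distinct values so that their order is preserved.

```java
import java.util.Random;

class JitterSketch {
    // Returns a copy of values with uniform noise in [-eps, eps) added to each
    // entry. The input array is left untouched; the seed makes runs repeatable.
    static double[] jitter(double[] values, double eps, long seed) {
        Random rng = new Random(seed);
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i] + (rng.nextDouble() * 2.0 - 1.0) * eps;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 2.0, 3.0};
        for (double v : jitter(x, 1e-6, 42L)) {
            System.out.println(v);
        }
    }
}
```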

Greetings,

Sebastian

Thanks for the reply. I also thought that the problem could be diminished if I had more continuous values. But of course it would be best if you could fix the problem in general.

Best,

Henrik

In the meantime I found the bug and fixed it. The bug is in the function

    private Double getMinEntropySplitpoint(LinkedList<double[]> truncatedExamples, Attribute label) {

in the class MinimalEntropyDiscretization. It does not consider the case where a split results in 0 examples of one class. Here is the fix:

    // Calculate entropies.
    double entropy1 = 0.0d;
    for (int i = 0; i < label.getMapping().size(); i++) {
        entropy1 -= frequencies1[i] * MathFunctions.ld(frequencies1[i]);
    }
    double entropy2 = 0.0d;
    for (int i = 0; i < label.getMapping().size(); i++) {
        entropy2 -= frequencies2[i] * MathFunctions.ld(frequencies2[i]);
    }
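The snippet as posted does not show the zero-class check itself, so here is a self-contained sketch of why such a check matters and what it could look like. This is an illustration, not the actual RapidMiner source: the local `ld` stands in for `MathFunctions.ld` (assumed to be the base-2 logarithm), and the class and method names are made up.

```java
class ZeroClassEntropySketch {
    // Base-2 logarithm; stand-in for MathFunctions.ld (assumption).
    static double ld(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Unguarded: a zero frequency gives 0 * ld(0) = 0 * -Infinity = NaN,
    // and NaN then propagates through the whole sum.
    static double entropyUnguarded(double[] frequencies) {
        double entropy = 0.0d;
        for (int i = 0; i < frequencies.length; i++) {
            entropy -= frequencies[i] * ld(frequencies[i]);
        }
        return entropy;
    }

    // Guarded: classes with zero frequency are skipped, i.e. 0 * ld(0) is
    // treated as 0, which is the usual convention for entropy.
    static double entropyGuarded(double[] frequencies) {
        double entropy = 0.0d;
        for (int i = 0; i < frequencies.length; i++) {
            if (frequencies[i] > 0.0d) {
                entropy -= frequencies[i] * ld(frequencies[i]);
            }
        }
        return entropy;
    }

    public static void main(String[] args) {
        double[] pure = {0.0d, 1.0d}; // e.g. X in {3, 4}: no negatives at all
        System.out.println("unguarded: " + entropyUnguarded(pure));
        System.out.println("guarded:   " + entropyGuarded(pure));
    }
}
```

In Java, any `<` comparison involving NaN is false, so if the best split is chosen with a comparison like `entropy < bestEntropy`, a split that produces a pure partition would never win. That would be consistent with the cut at 4 being skipped in the example above, though the exact comparison used in the operator is an assumption here.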

Best,

Henrik

Thanks for sending this in! We will check and integrate your suggestion as soon as possible.

Cheers,

Ingo