
"Bug in MinimalEntropyPartitioning?"

Legacy User Member Posts: 0 Newbie
edited May 2019 in Help
Hello everybody,

I get strange results when I apply MinimalEntropyPartitioning to some datasets and wonder whether this is due to a bug in the implementation.

Let me illustrate the problem: I have a dataset with one attribute ("X") and one label with two possible values.
There are 6 possible values for X, 1 to 6. In total, I have 1116 rows, with the following target label distributions:

X-value    #negatives #positives #rows
1.0        124        62         186
2.0        124        62         186
3.0          0        186        186
4.0          0        186        186
5.0        124        62         186
6.0        124        62         186

Now of course I would expect a discretization into [-∞, 2], ]2, 4], ]4, ∞], with 372 rows in each range. Instead, I get:

range1 [-∞ - 2] (372), range2 [2 - 5] (558), range3 [5 - ∞] (186)
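
As a rough check of that expectation, here is a minimal sketch that scores every candidate cut point on the table above by the size-weighted class entropy of the two resulting partitions. It assumes that this weighted entropy is the criterion the operator minimizes; the class and method names (`CutPointEntropy`, `weightedEntropy`) are hypothetical and not taken from the RapidMiner source.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not RapidMiner code.
class CutPointEntropy {
    /** Base-2 logarithm. */
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Class entropy from per-class counts; empty classes contribute 0. */
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * log2(p);
            }
        }
        return h;
    }

    /** Size-weighted entropy when rows[from..to) are split after index cut. */
    static double weightedEntropy(int[][] rows, int from, int to, int cut) {
        int[] left = new int[2];
        int[] right = new int[2];
        for (int i = from; i < to; i++) {
            int[] side = (i <= cut) ? left : right;
            side[0] += rows[i][0];
            side[1] += rows[i][1];
        }
        int nl = left[0] + left[1];
        int nr = right[0] + right[1];
        return (nl * entropy(left) + nr * entropy(right)) / (nl + nr);
    }

    public static void main(String[] args) {
        // {negatives, positives} for X = 1..6, copied from the table above.
        int[][] rows = {{124, 62}, {124, 62}, {0, 186}, {0, 186}, {124, 62}, {124, 62}};
        for (int cut = 0; cut < rows.length - 1; cut++) {
            double score = weightedEntropy(rows, 0, rows.length, cut);
            System.out.printf(Locale.ROOT, "cut after X=%d: %.4f%n", cut + 1, score);
        }
    }
}
```

Under these assumptions, the cuts after X=2 and X=4 tie for the lowest score (about 0.918), while the cuts after X=1, X=3 and X=5 score higher. That is consistent with boundaries at 2 and 4.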

It seems like there is a bug in the operator that does not correctly distinguish open and closed interval limits.
Does anybody know of a solution or a workaround?

Best,

Henrik

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Henrik,
    this seems to be a problem indeed. Perhaps you could add a tiny bit of noise to your values, which would resolve the non-uniqueness that causes your problem.

    But to solve it in general I will take a look at the code.

    Greetings,
      Sebastian
  • Legacy User Member Posts: 0 Newbie
    Hi Sebastian,

    thanks for the reply. I also thought that the problem could be diminished if I had more continuous values. But of course it would be best if you could fix the problem in general.

    Best,

    Henrik
  • Legacy User Member Posts: 0 Newbie
    Hi,

    in the meantime I found the bug and fixed it. The bug is in the function
    private Double getMinEntropySplitpoint(LinkedList<double[]> truncatedExamples, Attribute label) {

    in the class MinimalEntropyDiscretization. It does not consider the case where a split results in 0 examples of one class. Here is the fix:


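    // Calculate entropies.
    // The zero checks are reconstructed from the description above:
    // a class with frequency 0 is skipped, because 0 * ld(0) is undefined
    // (NaN in floating-point arithmetic).
    double entropy1 = 0.0d;
    for (int i = 0; i < label.getMapping().size(); i++) {
        if (frequencies1[i] != 0.0d) {
            entropy1 -= frequencies1[i] * MathFunctions.ld(frequencies1[i]);
        }
    }
    double entropy2 = 0.0d;
    for (int i = 0; i < label.getMapping().size(); i++) {
        if (frequencies2[i] != 0.0d) {
            entropy2 -= frequencies2[i] * MathFunctions.ld(frequencies2[i]);
        }
    }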
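
To illustrate why a class with 0 examples matters, here is a minimal, self-contained sketch. It is not the RapidMiner code: `ZeroFrequencyEntropy`, `naiveEntropy`, `guardedEntropy` and `bestCut` are hypothetical names, `ld` is assumed to be a plain base-2 logarithm, and the assumption that the minimum is selected with a `<` comparison has not been checked against the actual source. The data are the rows for X = 3..6 from the original question, i.e. the part that remains after a first split at 2.

```java
import java.util.Locale;
import java.util.function.ToDoubleFunction;

// Hypothetical demo class for illustration only; not RapidMiner code.
class ZeroFrequencyEntropy {
    /** Base-2 logarithm (assumed equivalent of MathFunctions.ld). */
    static double ld(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Unguarded: a frequency of 0 gives 0 * ld(0) = 0 * -Infinity = NaN. */
    static double naiveEntropy(double[] frequencies) {
        double entropy = 0.0d;
        for (int i = 0; i < frequencies.length; i++) {
            entropy -= frequencies[i] * ld(frequencies[i]);
        }
        return entropy;
    }

    /** Guarded: classes with frequency 0 contribute nothing to the sum. */
    static double guardedEntropy(double[] frequencies) {
        double entropy = 0.0d;
        for (int i = 0; i < frequencies.length; i++) {
            if (frequencies[i] != 0.0d) {
                entropy -= frequencies[i] * ld(frequencies[i]);
            }
        }
        return entropy;
    }

    /** Size-weighted entropy of the two partitions when splitting after index cut. */
    static double weighted(int[][] rows, int cut, ToDoubleFunction<double[]> entropy) {
        double[] left = new double[2];
        double[] right = new double[2];
        for (int i = 0; i < rows.length; i++) {
            double[] side = (i <= cut) ? left : right;
            side[0] += rows[i][0];
            side[1] += rows[i][1];
        }
        double nl = left[0] + left[1];
        double nr = right[0] + right[1];
        double hl = entropy.applyAsDouble(new double[] {left[0] / nl, left[1] / nl});
        double hr = entropy.applyAsDouble(new double[] {right[0] / nr, right[1] / nr});
        return (nl * hl + nr * hr) / (nl + nr);
    }

    /** Index of the cut with the smallest score; a NaN score never wins because NaN < x is false. */
    static int bestCut(int[][] rows, ToDoubleFunction<double[]> entropy) {
        int best = -1;
        double min = Double.POSITIVE_INFINITY;
        for (int cut = 0; cut < rows.length - 1; cut++) {
            double score = weighted(rows, cut, entropy);
            if (score < min) {
                min = score;
                best = cut;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // {negatives, positives} for X = 3, 4, 5, 6 from the original question.
        int[][] subset = {{0, 186}, {0, 186}, {124, 62}, {124, 62}};
        int naive = bestCut(subset, ZeroFrequencyEntropy::naiveEntropy);
        int guarded = bestCut(subset, ZeroFrequencyEntropy::guardedEntropy);
        System.out.printf(Locale.ROOT, "naive picks cut after X=%d%n", naive + 3);
        System.out.printf(Locale.ROOT, "guarded picks cut after X=%d%n", guarded + 3);
    }
}
```

In this sketch the unguarded version scores the cuts after X=3 and X=4 as NaN, because their left partitions contain no negatives, so the cut after X=5 is the only candidate left. That matches the reported ranges [2 - 5] (558 rows) and [5 - ∞] (186 rows). With the guard, the cut after X=4 gets the lowest score.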


    Best,

    Henrik
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi Henrik,

    thanks for sending this in! We will check and integrate your suggestion as soon as possible.

    Cheers,
    Ingo