Short question on FrequencyDiscretization

calvinus Member Posts: 5 Contributor II
edited November 2018 in Help
Hi there,

I have a quick question. Take for example the following output of FrequencyDiscretization:
q58_B -Infinity <= range3 [4.500 - 0] <= 0.0 <= range1 [-8 - 2.500] <= 2.5 <= range2 [2.500 - 4.500] <= 4.5 <= range5 [0 - 8] <= Infinity  
Apart from the ranges not being sorted (which is a bit confusing), range3 looks odd to me. Why does it go from 4.5 to 0? And why is it at the front instead of in order?
And where is range4? Why does range5 start at 0 again? Are the ranges overlapping?
The values in field q58_B only range from 1 to 5, plus some missing values.
Could you please give me some hints on how to use this output?

Thanks in advance,
best regards
Jörg

Answers

  • TobiasMalbrecht Moderator, Employee, Member Posts: 294 RM Product Management
    Hi Jörg,

    As far as I can see, the operator seems to contain a bug; we will have to check that. Maybe one of our developers will have time to look into the problem next week. Thanks for pointing it out.

    Regards,
    Tobias
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Jörg,
    This is not really a bug. It happens because your data contains too many identical values. Identical values cannot be distinguished, so they all end up in the same bin, which then grows beyond its targeted size. If a single value occurs more than twice as often as the bin size, its bin steals the examples from the following bin(s). Those bins are then left without any examples to determine their limits.
    The developer version now throws an error if that happens, because it is probably not the intended behavior.
    There are two possibilities: reduce the number of bins, or add some noise, which would make the values distinguishable.

    Please keep in mind that missing values are not treated at all.

    Greetings,
      Sebastian
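
The effect described in the last answer can be reproduced with a minimal sketch. It assumes a naive equal-frequency binning that puts each cut point midway between two neighbouring sorted values; this is not RapidMiner's actual implementation. The function names `equal_frequency_cuts` and `bin_counts` and the sample data are made up for illustration. It shows how one dominant value leaves later bins empty, and how fewer bins or a little noise avoids that.

```python
import random
from bisect import bisect_left


def equal_frequency_cuts(values, n_bins):
    """Return n_bins - 1 cut points of a naive equal-frequency binning."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for k in range(1, n_bins):
        idx = k * n // n_bins  # first index that should belong to bin k
        # Cut midway between the last value of bin k-1 and the first of bin k.
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts


def bin_counts(values, n_bins):
    """Count how many values land in each bin (bin = number of cuts below v)."""
    cuts = equal_frequency_cuts(values, n_bins)
    counts = [0] * n_bins
    for v in values:
        counts[bisect_left(cuts, v)] += 1
    return counts


# Hypothetical data: values 1..5, where the value 5 makes up half of the set.
data = [1] * 10 + [2] * 10 + [3] * 15 + [4] * 15 + [5] * 50

# Five bins of target size 20: the value 5 occurs 50 times (more than twice
# the bin size), so two cut points coincide and the last two bins stay empty.
counts_raw = bin_counts(data, 5)
print(counts_raw)  # [20, 30, 50, 0, 0]

# Option 1: reduce the number of bins.
counts_fewer = bin_counts(data, 2)
print(counts_fewer)  # [50, 50]

# Option 2: add a little noise so that identical values become distinguishable.
rng = random.Random(0)
jittered = [v + rng.uniform(-0.25, 0.25) for v in data]
counts_jittered = bin_counts(jittered, 5)
print(counts_jittered)  # every bin should now contain examples
```

Missing values are left out of this sketch; they would have to be filtered or replaced before binning.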