Discretize by Entropy not working properly?

miguelbiron Member Posts: 1 Contributor I
edited November 2018 in Help
Hello,

I'm running some experiments to test the capabilities of this software, which seems really great for data mining tasks, and I've run into a problem with the "Discretize by Entropy" operator. When I apply it to the Iris data set, the two most powerful features, namely "Petal Length" and "Petal Width" (called "a3" and "a4" in the sample data that comes with RapidMiner), get removed by this operator as "useless attributes". This is nonsense (or I'm really missing something), since those attributes get selected by any attribute selection method, and when I use the "Decision Tree" operator, as I did here, they are the only ones used in the resulting tree.

I looked all over the forum and googled, but couldn't find the answer. Interestingly, Weka has a similar procedure called "Discretize", and it works great, but sadly it is not part of the Weka implementation that comes with RapidMiner.

Thanks, and sorry for the poor English...

P.S.: this is the XML code of the process I'm experimenting with:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="386" width="614">
      <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="5.2.008" expanded="true" height="94" name="Multiply" width="90" x="179" y="120"/>
      <operator activated="true" class="decision_tree" compatibility="5.2.008" expanded="true" height="76" name="Decision Tree" width="90" x="375" y="155"/>
      <operator activated="true" class="discretize_by_entropy" compatibility="5.2.008" expanded="true" height="94" name="Discretize" width="90" x="380" y="30">
        <parameter key="attributes" value="lapiz|peo|"/>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Discretize" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Decision Tree" to_port="training set"/>
      <connect from_op="Decision Tree" from_port="model" to_port="result 3"/>
      <connect from_op="Discretize" from_port="example set output" to_port="result 2"/>
      <connect from_op="Discretize" from_port="original" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

Answers

  • SebastianBerlin Member Posts: 1 Contributor I
    Hello,

    I just started to use RapidMiner after several years of working with Weka. I am experiencing the same problem with the entropy-based discretization.

    Since the entropy-based discretization of Fayyad and Irani is extremely helpful for learners such as NB or J48, it would be nice if this problem were fixed or, at least, the Weka discretization were included.

    Cheers,
    Sebastian
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    You are right, something seems to be wrong. I created an internal issue for this operator. Thanks for reporting!

    Best regards,
    Marius
  • Mario_Hofmann Member Posts: 9 Contributor II
    Today I compared the results of this operator to a similar operator in SPSS. Most attributes were handled very similarly, but RapidMiner (or SPSS) was very often off by one. There was no real pattern to it, but there might be differences in how values are rounded or in the granularity at which they are handled.

    Regards,

    Mario
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    The problem seems to occur on very pure split points (e.g. in Iris on a3). In that case the calculation would include the logarithm of 0, which is undefined and needs some special handling.
  • fischer Member Posts: 439 Maven
    The problem has been fixed and will be part of the next release.
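
For readers wondering what the "logarithm of 0" case looks like in practice, here is a minimal Python sketch. It is not RapidMiner's code and does not claim to reproduce the actual bug or its fix; the function names (entropy_naive, entropy_safe, information_gain) and the class counts are made up for illustration. It only shows the usual convention of treating 0 * log(0) as 0 when computing the entropy of a pure partition, such as the one produced by a clean split that isolates one Iris class.

```python
import math


def entropy_naive(counts):
    """Entropy in bits from per-class counts, with no special handling.

    A class count of 0 leads to math.log2(0.0), which raises ValueError
    in Python. (With plain IEEE arithmetic, 0 * -inf would yield NaN.)
    """
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts)


def entropy_safe(counts):
    """Entropy in bits, using the convention 0 * log(0) = 0."""
    n = sum(counts)
    if n == 0:
        return 0.0
    # Skip empty classes so the logarithm is never taken of 0.
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)


def information_gain(left, right):
    """Gain of a binary split, given per-class counts on each side."""
    parent = [l + r for l, r in zip(left, right)]
    n = sum(parent)
    weighted = (sum(left) / n) * entropy_safe(left) \
        + (sum(right) / n) * entropy_safe(right)
    return entropy_safe(parent) - weighted


# Hypothetical "very pure" split on a three-class data set of 150 rows:
# one side holds only the first class, the other holds the remaining two.
left_counts = [50, 0, 0]
right_counts = [0, 50, 50]

gain = information_gain(left_counts, right_counts)
print(f"information gain of the pure split: {gain:.4f}")
```

Calling entropy_naive(left_counts) fails on the empty classes, while entropy_safe returns 0 for the pure side, so the split is scored as highly informative instead of being discarded.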