[SOLVED] discretize by entropy evaluation

makak Member Posts: 13 Contributor I
edited November 2018 in Help

I would like to use the Discretize by Entropy operator with a Naive Bayes classifier. As far as I understand, Discretize by Entropy depends on the class value, so it would not be correct to discretize the whole dataset first and then perform cross-validation. I would like to set up an experiment where, in every fold of the cross-validation, the training data is discretized by entropy and the classifier is then evaluated on the test fold, discretized with the bin intervals learned from the training fold. Is this possible? In case that was not clear: I simply want to classify new data with a classifier built on discretized data. How should I apply the same discretization intervals to the new data?
Any help or comment would be much appreciated.
Thank you.



  • fras Member Posts: 93 Contributor II
    The Discretize operators provide an additional output port that delivers a preprocessing model. Inside the X-Validation,
    this model can be grouped with the classifier so that the discretization learned on the training fold is also applied to the test fold:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.003">
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="6.0.003" expanded="true" height="60" name="Sonar" width="90" x="112" y="120">
            <parameter key="repository_entry" value="//Samples/data/Sonar"/>
          </operator>
          <operator activated="true" class="x_validation" compatibility="6.0.003" expanded="true" height="112" name="Validation" width="90" x="246" y="120">
            <!-- Training subprocess: discretize the training fold, train Naive Bayes, group both models -->
            <process expanded="true">
              <operator activated="true" class="discretize_by_entropy" compatibility="6.0.003" expanded="true" height="94" name="Discretize" width="90" x="45" y="30">
                <parameter key="attribute" value="Temperature"/>
              </operator>
              <operator activated="true" class="naive_bayes" compatibility="6.0.003" expanded="true" height="76" name="Naive Bayes" width="90" x="112" y="165"/>
              <operator activated="true" class="group_models" compatibility="6.0.003" expanded="true" height="94" name="Group Models" width="90" x="246" y="30"/>
              <connect from_port="training" to_op="Discretize" to_port="example set input"/>
              <connect from_op="Discretize" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
              <connect from_op="Discretize" from_port="preprocessing model" to_op="Group Models" to_port="models in 1"/>
              <connect from_op="Naive Bayes" from_port="model" to_op="Group Models" to_port="models in 2"/>
              <connect from_op="Group Models" from_port="model out" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <!-- Testing subprocess: apply the grouped model (discretization, then classifier) to the test fold -->
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="6.0.003" expanded="true" height="76" name="Apply Model" width="90" x="112" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_binominal_classification" compatibility="6.0.003" expanded="true" height="76" name="Performance" width="90" x="246" y="30"/>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Sonar" from_port="output" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="90"/>
          <portSpacing port="sink_result 2" spacing="54"/>
        </process>
      </operator>
    </process>

  • makak Member Posts: 13 Contributor I
    Thank you very much. You saved my *. This is exactly what I was looking for.
  • halimprabowo Member Posts: 1 Contributor I

    So this means that the model, i.e. the bins created by the discretization step, is applied to the new data, right?

    In other words, a new "Discretize by Entropy" preprocessing step is not run on the new data. I'm sorry if this is confusing; I only want to make sure.

    Thank you.

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,475  RM Data Scientist

    Yes. You should apply the preprocessing model to the new data set.

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
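
For readers who want to score new data outside of a validation, the answer above can be sketched as a small process. This is an untested sketch, not a verified export from RapidMiner: it reuses only the operator classes and port names that appear in the process posted earlier in this thread, it omits the layout attributes (height, width, x, y), and the repository entry for the new data is a hypothetical placeholder that has to be replaced with a real one.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.003">
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <!-- Labelled training data; Sonar is the sample set used earlier in the thread -->
      <operator activated="true" class="retrieve" compatibility="6.0.003" expanded="true" name="Training Data">
        <parameter key="repository_entry" value="//Samples/data/Sonar"/>
      </operator>
      <!-- ASSUMPTION: hypothetical repository entry holding the new examples to classify -->
      <operator activated="true" class="retrieve" compatibility="6.0.003" expanded="true" name="New Data">
        <parameter key="repository_entry" value="//Local Repository/data/new_data"/>
      </operator>
      <!-- Learn the bin boundaries on the training data only -->
      <operator activated="true" class="discretize_by_entropy" compatibility="6.0.003" expanded="true" name="Discretize"/>
      <operator activated="true" class="naive_bayes" compatibility="6.0.003" expanded="true" name="Naive Bayes"/>
      <!-- Group the models so the discretization is applied first, then the classifier -->
      <operator activated="true" class="group_models" compatibility="6.0.003" expanded="true" name="Group Models"/>
      <operator activated="true" class="apply_model" compatibility="6.0.003" expanded="true" name="Apply Model">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Training Data" from_port="output" to_op="Discretize" to_port="example set input"/>
      <connect from_op="Discretize" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
      <connect from_op="Discretize" from_port="preprocessing model" to_op="Group Models" to_port="models in 1"/>
      <connect from_op="Naive Bayes" from_port="model" to_op="Group Models" to_port="models in 2"/>
      <connect from_op="Group Models" from_port="model out" to_op="Apply Model" to_port="model"/>
      <connect from_op="New Data" from_port="output" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
    </process>
  </operator>
</process>
```

The point of the design is that no second Discretize operator touches the new data. The bin intervals come only from the preprocessing model fitted on the training data, and Apply Model passes the new examples through that model before the Naive Bayes model scores them.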