
[SOLVED] discretize by entropy evaluation

makak Member Posts: 13 Contributor II
edited November 2018 in Help

I would like to use the Discretize by Entropy operator with a Naive Bayes classifier. As far as I understand, Discretize by Entropy depends on the class value, so it would not be correct to first discretize the whole dataset and then perform cross-validation. I would like to set up an experiment where, in every fold of the cross-validation, the training data is discretized by entropy and the classifier is evaluated on a test set discretized with the bin intervals from the training fold. Is this possible? In case that was not clear: I simply want to classify new data using a classifier built on discretized data. How should I apply the same discretization intervals to the new data?
Any help or comments would be much appreciated.
Thank you.



  • fras Member Posts: 93 Contributor II
    Discretize operators provide an additional output port that delivers a preprocessing model. This model can be used
    in the X-Validation to ensure that the preprocessing model built on the training fold is also applied to the test fold:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.003">
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="6.0.003" expanded="true" height="60" name="Sonar" width="90" x="112" y="120">
            <parameter key="repository_entry" value="//Samples/data/Sonar"/>
          </operator>
          <operator activated="true" class="x_validation" compatibility="6.0.003" expanded="true" height="112" name="Validation" width="90" x="246" y="120">
            <process expanded="true">
              <operator activated="true" class="discretize_by_entropy" compatibility="6.0.003" expanded="true" height="94" name="Discretize" width="90" x="45" y="30">
                <parameter key="attribute" value="Temperature"/>
              </operator>
              <operator activated="true" class="naive_bayes" compatibility="6.0.003" expanded="true" height="76" name="Naive Bayes" width="90" x="112" y="165"/>
              <operator activated="true" class="group_models" compatibility="6.0.003" expanded="true" height="94" name="Group Models" width="90" x="246" y="30"/>
              <connect from_port="training" to_op="Discretize" to_port="example set input"/>
              <connect from_op="Discretize" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
              <connect from_op="Discretize" from_port="preprocessing model" to_op="Group Models" to_port="models in 1"/>
              <connect from_op="Naive Bayes" from_port="model" to_op="Group Models" to_port="models in 2"/>
              <connect from_op="Group Models" from_port="model out" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="6.0.003" expanded="true" height="76" name="Apply Model" width="90" x="112" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_binominal_classification" compatibility="6.0.003" expanded="true" height="76" name="Performance" width="90" x="246" y="30"/>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Sonar" from_port="output" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="90"/>
          <portSpacing port="sink_result 2" spacing="54"/>
        </process>
      </operator>
    </process>

  • makak Member Posts: 13 Contributor II
    Thank you very much. You saved my *, exactly what I was looking for.
  • halimprabowo Member Posts: 1 Learner II

    So this means that the model (the bins) created by the discretization step is applied to the new data, right?

    That is, a new "Discretize by Entropy" preprocessing step is not fitted on the new data? I'm sorry if this is confusing, I only want to make sure.

    Thank you

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,531 RM Data Scientist


    Yes. You should apply the preprocessing model to the new data set.
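
    A minimal, untested sketch of what this could look like outside the cross-validation. It assumes the new data is stored under a hypothetical repository entry `//Local Repository/data/new_data` and has the same attributes as the training data. The operator classes and port names are the ones used in the process posted earlier in this thread; layout attributes and port spacing are left out for brevity.

    ```xml
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.003">
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
        <process expanded="true">
          <!-- Training data, as in the validation example -->
          <operator activated="true" class="retrieve" compatibility="6.0.003" expanded="true" name="Sonar">
            <parameter key="repository_entry" value="//Samples/data/Sonar"/>
          </operator>
          <!-- Assumption: hypothetical repository entry holding the new, unseen data -->
          <operator activated="true" class="retrieve" compatibility="6.0.003" expanded="true" name="New Data">
            <parameter key="repository_entry" value="//Local Repository/data/new_data"/>
          </operator>
          <!-- Bin intervals are learned from the training data only -->
          <operator activated="true" class="discretize_by_entropy" compatibility="6.0.003" expanded="true" name="Discretize"/>
          <operator activated="true" class="naive_bayes" compatibility="6.0.003" expanded="true" name="Naive Bayes"/>
          <!-- Preprocessing model first, classifier second -->
          <operator activated="true" class="group_models" compatibility="6.0.003" expanded="true" name="Group Models"/>
          <operator activated="true" class="apply_model" compatibility="6.0.003" expanded="true" name="Apply Model">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Sonar" from_port="output" to_op="Discretize" to_port="example set input"/>
          <connect from_op="Discretize" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
          <connect from_op="Discretize" from_port="preprocessing model" to_op="Group Models" to_port="models in 1"/>
          <connect from_op="Naive Bayes" from_port="model" to_op="Group Models" to_port="models in 2"/>
          <connect from_op="Group Models" from_port="model out" to_op="Apply Model" to_port="model"/>
          <connect from_op="New Data" from_port="output" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
        </process>
      </operator>
    </process>
    ```

    Because the grouped model contains the preprocessing model followed by the Naive Bayes model, Apply Model first discretizes the new data with the intervals learned on the training data and then classifies it. No second Discretize operator is run on the new data.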




    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany