classification models

bookitsa Member Posts: 15 Contributor I
edited May 2020 in Help

I have a data set with several variables about a disease, plus a yes/no variable indicating whether the patient has the disease. I have to create two classification models (decision tree and k-NN) and produce a diagram with one combined chart of the performance of the individual methods. How do I do this? What process do I have to follow? I watched the videos, but as a beginner in RapidMiner I got confused.

Answers

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @bookitsa,

    Which performance metric do you want to calculate?
    For a binominal problem (like yours), you can use the Compare ROCs operator.
    Here is a sample process using this operator:
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000-SNAPSHOT">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.000-SNAPSHOT" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" breakpoints="after" class="retrieve" compatibility="9.2.000-SNAPSHOT" expanded="true" height="68" name="Retrieve Titanic" width="90" x="112" y="136">
            <parameter key="repository_entry" value="//Samples/data/Titanic"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Set Role" width="90" x="313" y="136">
            <parameter key="attribute_name" value="Survived"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="compare_rocs" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="Compare ROCs" width="90" x="581" y="136">
            <parameter key="number_of_folds" value="10"/>
            <parameter key="split_ratio" value="0.7"/>
            <parameter key="sampling_type" value="stratified sampling"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="use_example_weights" value="true"/>
            <parameter key="roc_bias" value="optimistic"/>
            <process expanded="true">
              <operator activated="true" class="k_nn" compatibility="9.2.000-SNAPSHOT" expanded="true" height="82" name="k-NN" width="90" x="246" y="85">
                <parameter key="k" value="5"/>
                <parameter key="weighted_vote" value="true"/>
                <parameter key="measure_types" value="MixedMeasures"/>
                <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
                <parameter key="nominal_measure" value="NominalDistance"/>
                <parameter key="numerical_measure" value="EuclideanDistance"/>
                <parameter key="divergence" value="GeneralizedIDivergence"/>
                <parameter key="kernel_type" value="radial"/>
                <parameter key="kernel_gamma" value="1.0"/>
                <parameter key="kernel_sigma1" value="1.0"/>
                <parameter key="kernel_sigma2" value="0.0"/>
                <parameter key="kernel_sigma3" value="2.0"/>
                <parameter key="kernel_degree" value="3.0"/>
                <parameter key="kernel_shift" value="1.0"/>
                <parameter key="kernel_a" value="1.0"/>
                <parameter key="kernel_b" value="0.0"/>
              </operator>
              <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.2.000-SNAPSHOT" expanded="true" height="103" name="Decision Tree" width="90" x="246" y="238">
                <parameter key="criterion" value="gain_ratio"/>
                <parameter key="maximal_depth" value="10"/>
                <parameter key="apply_pruning" value="true"/>
                <parameter key="confidence" value="0.1"/>
                <parameter key="apply_prepruning" value="true"/>
                <parameter key="minimal_gain" value="0.01"/>
                <parameter key="minimal_leaf_size" value="2"/>
                <parameter key="minimal_size_for_split" value="4"/>
                <parameter key="number_of_prepruning_alternatives" value="3"/>
              </operator>
              <connect from_port="train 1" to_op="k-NN" to_port="training set"/>
              <connect from_port="train 2" to_op="Decision Tree" to_port="training set"/>
              <connect from_op="k-NN" from_port="model" to_port="model 1"/>
              <connect from_op="Decision Tree" from_port="model" to_port="model 2"/>
              <portSpacing port="source_train 1" spacing="0"/>
              <portSpacing port="source_train 2" spacing="0"/>
              <portSpacing port="source_train 3" spacing="0"/>
              <portSpacing port="sink_model 1" spacing="0"/>
              <portSpacing port="sink_model 2" spacing="0"/>
              <portSpacing port="sink_model 3" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve Titanic" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Compare ROCs" to_port="example set"/>
          <connect from_op="Compare ROCs" from_port="exampleSet" to_port="result 2"/>
          <connect from_op="Compare ROCs" from_port="rocComparison" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
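
    For readers who want to see what an ROC comparison measures, here is a minimal pure-Python sketch. It is not RapidMiner's implementation: the labels and confidence scores are made-up illustration values, and `roc_points` and `auc` are hypothetical helper names. The sketch lowers the decision threshold step by step, records a (false positive rate, true positive rate) point at each step, and integrates the resulting curve with the trapezoid rule.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR), one per distinct score threshold.

    labels: 1 for the positive class, 0 for the negative class.
    scores: model confidence for the positive class (higher = more positive).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Sort examples by descending score, so that lowering the threshold
    # admits them one at a time.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example of a tied score group.
        is_last_of_group = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_group:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up example: two models scored on the same six test examples.
labels = [1, 1, 1, 0, 0, 0]
model_a_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]  # one positive ranked below a negative
model_b_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]  # every positive ranked above every negative

auc_a = auc(roc_points(labels, model_a_scores))
auc_b = auc(roc_points(labels, model_b_scores))
print(f"model A AUC = {auc_a:.3f}, model B AUC = {auc_b:.3f}")
```

    In this toy data, model B orders all positives ahead of all negatives, so its AUC is 1; model A misorders one of the nine positive/negative pairs, so its AUC is 8/9. A curve that lies closer to the top-left corner therefore belongs to the better-ranking model.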
    
    Hope it helps,

    Regards,

    Lionel

  • bookitsa Member Posts: 15 Contributor I
    I cannot understand the code you wrote. I suppose I have to load the data, then add k-NN and then the decision tree, and measure the accuracy of the results, but I don't know the process. In other words, which operators do I have to use, and in what order?
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hi @bookitsa

    @lionelderkrikor gave a perfect example for comparing two models. If you want performance indicators such as AUC, kappa, etc. using cross-validation (recommended), you can check the process below. Note the performance values of each model and compare them to see which one worked better.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="9.1.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
            <parameter key="target_function" value="sum classification"/>
            <parameter key="number_examples" value="100"/>
            <parameter key="number_of_attributes" value="5"/>
            <parameter key="attributes_lower_bound" value="-10.0"/>
            <parameter key="attributes_upper_bound" value="10.0"/>
            <parameter key="gaussian_standard_deviation" value="10.0"/>
            <parameter key="largest_radius" value="10.0"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="9.1.000" expanded="true" height="103" name="Multiply" width="90" x="179" y="187"/>
          <operator activated="true" class="concurrency:cross_validation" compatibility="9.1.000" expanded="true" height="145" name="Cross Validation (2)" width="90" x="514" y="136">
            <parameter key="split_on_batch_attribute" value="false"/>
            <parameter key="leave_one_out" value="false"/>
            <parameter key="number_of_folds" value="5"/>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="k_nn" compatibility="9.1.000" expanded="true" height="82" name="k-NN" width="90" x="112" y="85">
                <parameter key="k" value="5"/>
                <parameter key="weighted_vote" value="true"/>
                <parameter key="measure_types" value="MixedMeasures"/>
                <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
                <parameter key="nominal_measure" value="NominalDistance"/>
                <parameter key="numerical_measure" value="EuclideanDistance"/>
                <parameter key="divergence" value="GeneralizedIDivergence"/>
                <parameter key="kernel_type" value="radial"/>
                <parameter key="kernel_gamma" value="1.0"/>
                <parameter key="kernel_sigma1" value="1.0"/>
                <parameter key="kernel_sigma2" value="0.0"/>
                <parameter key="kernel_sigma3" value="2.0"/>
                <parameter key="kernel_degree" value="3.0"/>
                <parameter key="kernel_shift" value="1.0"/>
                <parameter key="kernel_a" value="1.0"/>
                <parameter key="kernel_b" value="0.0"/>
              </operator>
              <connect from_port="training set" to_op="k-NN" to_port="training set"/>
              <connect from_op="k-NN" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.1.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="34">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance_binominal_classification" compatibility="9.1.000" expanded="true" height="82" name="Performance (2)" width="90" x="246" y="136">
                <parameter key="main_criterion" value="first"/>
                <parameter key="accuracy" value="true"/>
                <parameter key="classification_error" value="false"/>
                <parameter key="kappa" value="true"/>
                <parameter key="AUC (optimistic)" value="false"/>
                <parameter key="AUC" value="true"/>
                <parameter key="AUC (pessimistic)" value="false"/>
                <parameter key="precision" value="false"/>
                <parameter key="recall" value="false"/>
                <parameter key="lift" value="false"/>
                <parameter key="fallout" value="false"/>
                <parameter key="f_measure" value="true"/>
                <parameter key="false_positive" value="false"/>
                <parameter key="false_negative" value="false"/>
                <parameter key="true_positive" value="false"/>
                <parameter key="true_negative" value="false"/>
                <parameter key="sensitivity" value="false"/>
                <parameter key="specificity" value="false"/>
                <parameter key="youden" value="false"/>
                <parameter key="positive_predictive_value" value="false"/>
                <parameter key="negative_predictive_value" value="false"/>
                <parameter key="psep" value="false"/>
                <parameter key="skip_undefined_labels" value="true"/>
                <parameter key="use_example_weights" value="true"/>
              </operator>
              <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
              <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
              <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="concurrency:cross_validation" compatibility="9.1.000" expanded="true" height="145" name="Cross Validation" width="90" x="313" y="34">
            <parameter key="split_on_batch_attribute" value="false"/>
            <parameter key="leave_one_out" value="false"/>
            <parameter key="number_of_folds" value="5"/>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.1.000" expanded="true" height="103" name="Decision Tree" width="90" x="112" y="34">
                <parameter key="criterion" value="gain_ratio"/>
                <parameter key="maximal_depth" value="10"/>
                <parameter key="apply_pruning" value="true"/>
                <parameter key="confidence" value="0.1"/>
                <parameter key="apply_prepruning" value="true"/>
                <parameter key="minimal_gain" value="0.01"/>
                <parameter key="minimal_leaf_size" value="2"/>
                <parameter key="minimal_size_for_split" value="4"/>
                <parameter key="number_of_prepruning_alternatives" value="3"/>
              </operator>
              <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
              <connect from_op="Decision Tree" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.1.000" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance_binominal_classification" compatibility="9.1.000" expanded="true" height="82" name="Performance" width="90" x="246" y="136">
                <parameter key="main_criterion" value="first"/>
                <parameter key="accuracy" value="true"/>
                <parameter key="classification_error" value="false"/>
                <parameter key="kappa" value="true"/>
                <parameter key="AUC (optimistic)" value="false"/>
                <parameter key="AUC" value="true"/>
                <parameter key="AUC (pessimistic)" value="false"/>
                <parameter key="precision" value="false"/>
                <parameter key="recall" value="false"/>
                <parameter key="lift" value="false"/>
                <parameter key="fallout" value="false"/>
                <parameter key="f_measure" value="true"/>
                <parameter key="false_positive" value="false"/>
                <parameter key="false_negative" value="false"/>
                <parameter key="true_positive" value="false"/>
                <parameter key="true_negative" value="false"/>
                <parameter key="sensitivity" value="false"/>
                <parameter key="specificity" value="false"/>
                <parameter key="youden" value="false"/>
                <parameter key="positive_predictive_value" value="false"/>
                <parameter key="negative_predictive_value" value="false"/>
                <parameter key="psep" value="false"/>
                <parameter key="skip_undefined_labels" value="true"/>
                <parameter key="use_example_weights" value="true"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Cross Validation" to_port="example set"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Cross Validation (2)" to_port="example set"/>
          <connect from_op="Cross Validation (2)" from_port="performance 1" to_port="result 2"/>
          <connect from_op="Cross Validation" from_port="performance 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
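
    To make these metrics concrete, here is a minimal pure-Python sketch of how accuracy and Cohen's kappa are derived from predictions, and how k-fold cross-validation splits a data set. It is not RapidMiner's implementation; the helper names, the unshuffled fold assignment, and the toy labels are assumptions for illustration only.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of examples whose predicted label matches the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def cohen_kappa(actual, predicted):
    """Agreement between actual and predicted labels, corrected for chance."""
    n = len(actual)
    observed = accuracy(actual, predicted)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement: probability that both sides pick the same label at
    # random, given how often each label occurs on each side.
    expected = sum(
        (actual_counts[label] / n) * (predicted_counts[label] / n)
        for label in actual_counts
    )
    if expected == 1.0:
        return 1.0  # only one label on both sides; agreement is trivially perfect
    return (observed - expected) / (1.0 - expected)


def k_fold_indices(n_examples, n_folds):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    indices = list(range(n_examples))
    for fold in range(n_folds):
        test = indices[fold::n_folds]  # every n_folds-th example, offset by fold
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


# Made-up labels: 10 patients, 2 of them misclassified.
actual = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"]
predicted = ["yes", "yes", "yes", "no", "no", "no", "no", "no", "yes", "no"]

print("accuracy:", accuracy(actual, predicted))
print("kappa:", round(cohen_kappa(actual, predicted), 3))
for train, test in k_fold_indices(len(actual), 5):
    print("test fold:", test, "train size:", len(train))
```

    Each example lands in exactly one test fold, so every record is used for testing once and for training in the remaining folds. Kappa is lower than accuracy here because part of the agreement would be expected by chance alone.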
    
    Thanks,
    Varun

    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi again @bookitsa,

    The "code" you see is the XML of the process. You have to import it into RapidMiner.

    To import an XML description of a process, e.g. one that someone else has posted here in the forum, follow these steps:

    1. Create a new process and go to the XML panel.
    2. Clear the view and copy the XML code you got into that panel.
    3. Then press the green checkmark icon at the top of the panel.
    4. Switch back to the Process panel.
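
    If the XML still looks opaque, it can help to list the operators it contains and how they are wired. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`, assuming the process XML uses the `<operator name=... class=...>` and `<connect .../>` structure seen in the processes posted in this thread. The hand-trimmed `PROCESS_XML` excerpt and the helpers `list_operators` and `list_connections` are illustrative names, not part of RapidMiner.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed excerpt in the shape of the processes posted in this thread.
PROCESS_XML = """<process version="9.2.000">
  <operator class="process" name="Process">
    <process expanded="true">
      <operator class="retrieve" name="Retrieve Titanic"/>
      <operator class="set_role" name="Set Role"/>
      <operator class="compare_rocs" name="Compare ROCs">
        <process expanded="true">
          <operator class="k_nn" name="k-NN"/>
          <operator class="concurrency:parallel_decision_tree" name="Decision Tree"/>
        </process>
      </operator>
      <connect from_op="Retrieve Titanic" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Compare ROCs" to_port="example set"/>
    </process>
  </operator>
</process>
"""


def list_operators(xml_text):
    """Return (depth, name, class) for every operator, in document order."""
    root = ET.fromstring(xml_text)
    found = []

    def walk(element, depth):
        for child in element:
            if child.tag == "operator":
                found.append((depth, child.get("name"), child.get("class")))
                walk(child, depth + 1)
            else:
                # <process> wrappers and other tags do not add a nesting level.
                walk(child, depth)

    walk(root, 0)
    return found


def list_connections(xml_text):
    """Return (from_op, to_op) for every connection between two operators."""
    root = ET.fromstring(xml_text)
    return [
        (c.get("from_op"), c.get("to_op"))
        for c in root.iter("connect")
        if c.get("from_op") and c.get("to_op")
    ]


for depth, name, op_class in list_operators(PROCESS_XML):
    print("  " * depth + f"{name} [{op_class}]")
for source, target in list_connections(PROCESS_XML):
    print(f"{source} -> {target}")
```

    The indented listing mirrors what you see in the Process panel: the data is retrieved, the label role is set, and the two learners sit inside the Compare ROCs operator.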
    Regards,

    Lionel
  • bookitsa Member Posts: 15 Contributor I
    edited January 2019
    Thank you @lionelderkrikor for explaining how to insert XML code! I didn't know this!

    @varunm1 your process contains a "Generate Data" operator. What data does it work on? How do I make it work with my own data? Where do I insert it?

    They didn't tell us which kind of diagram to use, because we are beginners and obviously they want us to find one by searching. Which diagram would you recommend for a beginner?
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    edited January 2019
    Hi @bookitsa

    As we don't have your data set, I just generated random data. In your case, delete the Generate Data operator, import your data into RapidMiner (specifying the label column during import), and connect your data set to the Multiply operator. The Multiply operator just creates copies of the data set so that it can be fed to the two algorithms, each of which sits inside its own Cross Validation operator. You can double-click a Cross Validation operator to see which model is placed inside. See the RapidMiner tutorial below on how to import data.

    https://www.youtube.com/watch?v=eLR0IiBT76w
    Regards,
    Varun

  • bookitsa Member Posts: 15 Contributor I
    The accuracy is 63.95% for k-NN and 94.28% for the decision tree. The k for k-NN is set to 5. The data set has 700 records in total. What is an appropriate value for k? I also tried values larger than 5, but the accuracy changed very little. In general, is a decision tree faster than k-NN?
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    @bookitsa Based on your data set, the decision tree does a better job. The appropriate value for k is found by trying different values and checking whether the accuracy goes up or down. I cannot generalize the results, because they always depend on the data. But the decision tree has pruning, which removes branches that add little predictive value and so helps it make better predictions.
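
    The "try different values of k and compare the accuracy" approach can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not RapidMiner's k-NN: it uses plain Euclidean distance with an unweighted majority vote, leave-one-out validation instead of the 5-fold cross-validation used in the process above, and made-up 2-D points. `knn_predict` and `leave_one_out_accuracy` are hypothetical helper names.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict the majority label among the k training points nearest to query."""
    by_distance = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    nearest_labels = [train_labels[i] for i in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def leave_one_out_accuracy(points, labels, k):
    """Hold out each example once, predict it from the rest, and average."""
    correct = 0
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(rest_points, rest_labels, points[i], k) == labels[i]:
            correct += 1
    return correct / len(points)


# Made-up data: two loose clusters, plus one "no" point inside the "yes" cluster.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.3), (1.1, 1.4), (1.0, 1.15),
          (4.0, 4.0), (4.2, 3.9), (3.8, 4.3), (4.1, 4.4)]
labels = ["yes", "yes", "yes", "yes", "no",
          "no", "no", "no", "no"]

# Odd values of k avoid tied votes when there are only two classes.
results = {k: leave_one_out_accuracy(points, labels, k) for k in (1, 3, 5)}
for k, acc in results.items():
    print(f"k={k}: accuracy={acc:.2f}")
best_k = max(results, key=results.get)
print("best k:", best_k)
```

    If the accuracy barely moves across the values of k you try, as reported above, k is probably not what limits the model on that data.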
    Regards,
    Varun

  • bookitsa Member Posts: 15 Contributor I
    In this picture we see that the decision tree is faster, right? How can I describe it?

  • bookitsa Member Posts: 15 Contributor I
    Thank you very much @varunm1 for your help!!
    I am looking at the two XML processes, and I want to see which one best shows the difference between the two models, decision tree and k-NN. I will read up on the curves by googling!
  • bookitsa Member Posts: 15 Contributor I
    edited January 2019
    In the ROC curve diagram, how can we put labels (in other words, names) on the axes?