Is this k-NN process "legitimate"?

Fred12 · August 2016

hi,

I designed a k-NN learning process and would like to know whether it is "legitimate", i.e. whether the model is trained correctly and can be used to predict future test samples.

The learning problem is about chemical structures in materials: looking at grain-like mineral structures under a microscope and determining their chemical components from the shape and size of the grain structure, where each example is one grain.

I'm not sure whether I made it too easy for myself. Here is the process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="false" class="retrieve" compatibility="7.2.000" expanded="true" height="68" name="Retrieve" width="90" x="45" y="238">
        <parameter key="repository_entry" value="//RapidMiner_Nils/Nils/Master/Data/Master Excelliste_Gefügebezeichnung_3 klassen"/>
      </operator>
      <operator activated="false" class="split_data" compatibility="7.2.000" expanded="true" height="103" name="Split Data" width="90" x="179" y="238">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.5"/>
          <parameter key="ratio" value="0.5"/>
        </enumeration>
        <parameter key="sampling_type" value="stratified sampling"/>
        <parameter key="use_local_random_seed" value="true"/>
      </operator>
      <operator activated="false" class="write_excel" compatibility="7.2.000" expanded="true" height="82" name="Write Excel (2)" width="90" x="45" y="442">
        <parameter key="excel_file" value="C:\Users\Admin\Desktop\testData.xlsx"/>
      </operator>
      <operator activated="false" class="write_excel" compatibility="7.2.000" expanded="true" height="82" name="Write Excel" width="90" x="45" y="136">
        <parameter key="excel_file" value="C:\Users\Admin\Desktop\trainData.xlsx"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="7.2.000" expanded="true" height="68" name="Retrieve testData" width="90" x="179" y="391">
        <parameter key="repository_entry" value="../../data/testData"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.2.000" expanded="true" height="82" name="Select Attributes (3)" width="90" x="313" y="442">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Durchmesser|Euler Zahl REM|Flächegefüllt LIMI|Grauwert normiert|Fläche REM/LIMI|FlächezuGesamtfläche LIMI"/>
      </operator>
      <operator activated="true" class="sample_bootstrapping" compatibility="7.2.000" expanded="true" height="82" name="Sample (2)" width="90" x="447" y="442">
        <parameter key="sample_ratio" value="0.5"/>
        <parameter key="local_random_seed" value="1"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="7.2.000" expanded="true" height="68" name="Retrieve Master3Klassen_nominal" width="90" x="45" y="34">
        <parameter key="repository_entry" value="../../data/Master3Klassen_nominal"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Durchmesser|Euler Zahl REM|Flächegefüllt LIMI|Grauwert normiert|Fläche REM/LIMI|FlächezuGesamtfläche LIMI"/>
      </operator>
      <operator activated="true" class="sample_bootstrapping" compatibility="7.2.000" expanded="true" height="82" name="Sample (Bootstrapping)" width="90" x="313" y="34">
        <parameter key="use_weights" value="false"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.2.000" expanded="true" height="124" name="Multiply Trainings Data" width="90" x="447" y="34"/>
      <operator activated="true" class="optimize_parameters_grid" compatibility="7.2.000" expanded="true" height="103" name="Optimize Parameters (Grid)" width="90" x="648" y="34">
        <list key="parameters">
          <parameter key="k-NN.k" value="[1.0;7;3;linear]"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="x_validation" compatibility="7.2.000" expanded="true" height="124" name="Validation" width="90" x="313" y="34">
            <parameter key="number_of_validations" value="5"/>
            <parameter key="sampling_type" value="stratified sampling"/>
            <process expanded="true">
              <operator activated="true" class="bagging" compatibility="7.2.000" expanded="true" height="82" name="Bagging" width="90" x="179" y="34">
                <process expanded="true">
                  <operator activated="true" class="metacost" compatibility="7.2.000" expanded="true" height="82" name="MetaCost (2)" width="90" x="246" y="34">
                    <parameter key="cost_matrix" value="[0.0 5.0 2.0;1.0 0.0 2.0;1.0 5.0 0.0]"/>
                    <parameter key="sampling_with_replacement" value="false"/>
                    <process expanded="true">
                      <operator activated="true" class="k_nn" compatibility="7.2.000" expanded="true" height="82" name="k-NN" width="90" x="313" y="34">
                        <parameter key="k" value="7"/>
                        <parameter key="weighted_vote" value="true"/>
                        <parameter key="measure_types" value="NumericalMeasures"/>
                        <parameter key="numerical_measure" value="CamberraDistance"/>
                      </operator>
                      <connect from_port="training set" to_op="k-NN" to_port="training set"/>
                      <connect from_op="k-NN" from_port="model" to_port="model"/>
                      <portSpacing port="source_training set" spacing="0"/>
                      <portSpacing port="sink_model" spacing="0"/>
                    </process>
                  </operator>
                  <connect from_port="training set" to_op="MetaCost (2)" to_port="training set"/>
                  <connect from_op="MetaCost (2)" from_port="model" to_port="model"/>
                  <portSpacing port="source_training set" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                </process>
              </operator>
              <connect from_port="training" to_op="Bagging" to_port="training set"/>
              <connect from_op="Bagging" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="7.2.000" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="7.2.000" expanded="true" height="82" name="Performance" width="90" x="313" y="34">
                <parameter key="kappa" value="true"/>
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="log" compatibility="7.2.000" expanded="true" height="82" name="Log" width="90" x="648" y="85">
            <list key="log">
              <parameter key="k" value="operator.k-NN.parameter.k"/>
              <parameter key="num_measures" value="operator.k-NN.parameter.numerical_measure"/>
              <parameter key="Performance_perf" value="operator.Performance.value.performance"/>
              <parameter key="opt_par_perf" value="operator.Optimize Parameters (Grid).value.performance"/>
              <parameter key="xval_perf" value="operator.Validation.value.performance"/>
              <parameter key="perf2_perf" value="operator.Performance (2).value.performance"/>
              <parameter key="perf2_kappa" value="operator.Performance (2).value.kappa"/>
              <parameter key="perf3_perf" value="operator.Performance (3).value.performance"/>
              <parameter key="perf3_kappa" value="operator.Performance (3).value.kappa"/>
            </list>
          </operator>
          <connect from_port="input 1" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
          <connect from_op="Log" from_port="through 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_parameters" compatibility="7.2.000" expanded="true" height="82" name="Set Parameters" width="90" x="849" y="85">
        <list key="name_map">
          <parameter key="k-NN" value="k-NN2"/>
        </list>
      </operator>
      <operator activated="true" class="k_nn" compatibility="7.2.000" expanded="true" height="82" name="k-NN2" width="90" x="581" y="187">
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="CamberraDistance"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.2.000" expanded="true" height="124" name="Multiply Model" width="90" x="782" y="187"/>
      <operator activated="true" class="legacy:write_model" compatibility="7.2.000" expanded="true" height="68" name="Write Model" width="90" x="916" y="238">
        <parameter key="model_file" value="C:\Users\Marc\Desktop\knnmodel3.mod"/>
        <parameter key="output_type" value="Binary"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.2.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="447" y="289">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="7.2.000" expanded="true" height="82" name="Performance (2)" width="90" x="715" y="340">
        <parameter key="classification_error" value="true"/>
        <parameter key="kappa" value="true"/>
        <list key="class_weights"/>
      </operator>
      <operator activated="true" class="log" compatibility="7.2.000" expanded="true" height="82" name="Log Train Perfromance" width="90" x="849" y="340">
        <list key="log">
          <parameter key="accuracy" value="operator.Performance.value.accuracy"/>
          <parameter key="classification error" value="operator.Performance.value.classification_error"/>
        </list>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.2.000" expanded="true" height="82" name="Apply Model (3)" width="90" x="581" y="442">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="7.2.000" expanded="true" height="82" name="Performance (3)" width="90" x="715" y="442">
        <parameter key="classification_error" value="true"/>
        <parameter key="kappa" value="true"/>
        <list key="class_weights"/>
      </operator>
      <operator activated="true" class="log" compatibility="7.2.000" expanded="true" height="82" name="Log Test Performance" width="90" x="849" y="442">
        <list key="log">
          <parameter key="accuracy" value="operator.Performance (3).value.accuracy"/>
          <parameter key="classification error" value="operator.Performance (3).value.classification_error"/>
        </list>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Write Excel" to_port="input"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="Write Excel (2)" to_port="input"/>
      <connect from_op="Retrieve testData" from_port="output" to_op="Select Attributes (3)" to_port="example set input"/>
      <connect from_op="Select Attributes (3)" from_port="example set output" to_op="Sample (2)" to_port="example set input"/>
      <connect from_op="Sample (2)" from_port="example set output" to_op="Apply Model (3)" to_port="unlabelled data"/>
      <connect from_op="Retrieve Master3Klassen_nominal" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Sample (Bootstrapping)" to_port="example set input"/>
      <connect from_op="Sample (Bootstrapping)" from_port="example set output" to_op="Multiply Trainings Data" to_port="input"/>
      <connect from_op="Multiply Trainings Data" from_port="output 1" to_op="k-NN2" to_port="training set"/>
      <connect from_op="Multiply Trainings Data" from_port="output 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Multiply Trainings Data" from_port="output 3" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_op="Set Parameters" to_port="parameter set"/>
      <connect from_op="Set Parameters" from_port="parameter set" to_port="result 4"/>
      <connect from_op="k-NN2" from_port="model" to_op="Multiply Model" to_port="input"/>
      <connect from_op="Multiply Model" from_port="output 1" to_op="Apply Model (3)" to_port="model"/>
      <connect from_op="Multiply Model" from_port="output 2" to_op="Write Model" to_port="input"/>
      <connect from_op="Multiply Model" from_port="output 3" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
      <connect from_op="Performance (2)" from_port="performance" to_op="Log Train Perfromance" to_port="through 1"/>
      <connect from_op="Log Train Perfromance" from_port="through 1" to_port="result 2"/>
      <connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
      <connect from_op="Performance (3)" from_port="performance" to_op="Log Test Performance" to_port="through 1"/>
      <connect from_op="Log Test Performance" from_port="through 1" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

Basically, I split the data into training and test sets, select the same attributes for each, and apply Sample (Bootstrapping) to both. Then I train the model on either 70% or 100% of the data. I know that with 100% it simply memorizes the data, but my thinking was that the more data it sees, the more general and useful it can be, so when new data comes out of the production environment, it can be tested against the full model.

This configuration performs best with k = 1 and the Canberra distance, and it is the most stable; for example, if I remove the Sample (Bootstrapping) operator, performance drops by around 5-10%.

With 70% of the original data, I get around 85-90% accuracy on the test data and 100% training accuracy. With 100%, I get around 94% on the test data.
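
A note on the 100% training accuracy: with k = 1, every training example is its own nearest neighbour, so a perfect training score is expected and says little about generalization. The following is a minimal Python sketch of that effect, assuming the standard definition of the Canberra distance. The toy data and the function names are hypothetical and are not taken from the RapidMiner process; RapidMiner's own implementation may handle edge cases differently.

```python
def canberra(a, b):
    """Canberra distance: sum of |x - y| / (|x| + |y|), skipping 0/0 terms."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:
            total += abs(x - y) / denom
    return total


def predict_1nn(train_x, train_y, query):
    """Return the label of the training example closest to `query`."""
    best = min(range(len(train_x)), key=lambda i: canberra(train_x[i], query))
    return train_y[best]


def accuracy(train_x, train_y, test_x, test_y):
    """Share of test examples whose 1-NN prediction matches the true label."""
    hits = sum(predict_1nn(train_x, train_y, q) == label
               for q, label in zip(test_x, test_y))
    return hits / len(test_y)


# Hypothetical toy data: two features per grain, three classes.
X = [(1.0, 0.2), (1.1, 0.3), (5.0, 2.0), (5.2, 2.1), (9.0, 0.1), (9.5, 0.2)]
y = ["A", "A", "B", "B", "C", "C"]

# Scoring the model on the very data it was built from.
train_acc = accuracy(X, y, X, y)
```

Here `train_acc` comes out as 1.0 by construction, because each point is at distance 0 from itself. Only the figures measured on held-out data tell you anything about future grains.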

IngoRM · August 2016

Hi,

After a quick look I would say: yes, I did not see any massive issues in the process, although some things do not make a lot of sense to me ;-). Here is a short list of questions and hints to think about:

- You used the bootstrapped sampling operator with a sampling ratio of 1.0. This leads to a data set of exactly the same size as before, but with roughly 30% of the examples not used at all. Why are you doing this?
- Before you apply the model to the test data, you take a bootstrapped sample again, but this time with a ratio of only 50%. Why are you doing this? The whole point of training a model is to apply it later to ALL unpredicted data points, so why sample beforehand?
- You write the model into a file. Why not into the repository?
- You know my opinion on calculating training performances, so I would skip this part ;-)

I have cleaned up the process a little bit (fewer crossings, moved output ports, grouped operators) and also added some notes to document what is happening. The new process is below.

<?xml version="1.0" encoding="UTF-8"?><process version="7.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.2.000" expanded="true" height="68" name="Retrieve testData" width="90" x="179" y="595">
        <parameter key="repository_entry" value="../../data/testData"/>
        <description align="center" color="transparent" colored="false" width="126">Testing Data</description>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.2.000" expanded="true" height="82" name="Select Attributes (3)" width="90" x="313" y="595">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Durchmesser|Euler Zahl REM|Flächegefüllt LIMI|Grauwert normiert|Fläche REM/LIMI|FlächezuGesamtfläche LIMI"/>
      </operator>
      <operator activated="true" class="sample_bootstrapping" compatibility="7.2.000" expanded="true" height="82" name="Sample (2)" width="90" x="447" y="595">
        <parameter key="sample_ratio" value="0.5"/>
        <parameter key="local_random_seed" value="1"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="7.2.000" expanded="true" height="68" name="Retrieve Master3Klassen_nominal" width="90" x="45" y="85">
        <parameter key="repository_entry" value="../../data/Master3Klassen_nominal"/>
        <description align="center" color="transparent" colored="false" width="126">Training Data</description>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="85">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Durchmesser|Euler Zahl REM|Flächegefüllt LIMI|Grauwert normiert|Fläche REM/LIMI|FlächezuGesamtfläche LIMI"/>
      </operator>
      <operator activated="true" class="sample_bootstrapping" compatibility="7.2.000" expanded="true" height="82" name="Sample (Bootstrapping)" width="90" x="313" y="85">
        <parameter key="use_weights" value="false"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.2.000" expanded="true" height="124" name="Multiply Trainings Data" width="90" x="447" y="85"/>
      <operator activated="true" class="optimize_parameters_grid" compatibility="7.2.000" expanded="true" height="103" name="Optimize Parameters (Grid)" width="90" x="782" y="34">
        <list key="parameters">
          <parameter key="k-NN.k" value="[1.0;7;3;linear]"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="x_validation" compatibility="7.2.000" expanded="true" height="124" name="Validation" width="90" x="313" y="34">
            <parameter key="number_of_validations" value="5"/>
            <parameter key="sampling_type" value="stratified sampling"/>
            <process expanded="true">
              <operator activated="true" class="bagging" compatibility="7.2.000" expanded="true" height="82" name="Bagging" width="90" x="179" y="34">
                <process expanded="true">
                  <operator activated="true" class="metacost" compatibility="7.2.000" expanded="true" height="82" name="MetaCost (2)" width="90" x="246" y="34">
                    <parameter key="cost_matrix" value="[0.0 5.0 2.0;1.0 0.0 2.0;1.0 5.0 0.0]"/>
                    <parameter key="sampling_with_replacement" value="false"/>
                    <process expanded="true">
                      <operator activated="true" class="k_nn" compatibility="7.2.000" expanded="true" height="82" name="k-NN" width="90" x="313" y="34">
                        <parameter key="k" value="7"/>
                        <parameter key="weighted_vote" value="true"/>
                        <parameter key="measure_types" value="NumericalMeasures"/>
                        <parameter key="numerical_measure" value="CamberraDistance"/>
                      </operator>
                      <connect from_port="training set" to_op="k-NN" to_port="training set"/>
                      <connect from_op="k-NN" from_port="model" to_port="model"/>
                      <portSpacing port="source_training set" spacing="0"/>
                      <portSpacing port="sink_model" spacing="0"/>
                    </process>
                  </operator>
                  <connect from_port="training set" to_op="MetaCost (2)" to_port="training set"/>
                  <connect from_op="MetaCost (2)" from_port="model" to_port="model"/>
                  <portSpacing port="source_training set" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                </process>
              </operator>
              <connect from_port="training" to_op="Bagging" to_port="training set"/>
              <connect from_op="Bagging" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="7.2.000" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="7.2.000" expanded="true" height="82" name="Performance" width="90" x="313" y="34">
                <parameter key="kappa" value="true"/>
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="log" compatibility="7.2.000" expanded="true" height="82" name="Log" width="90" x="514" y="34">
            <list key="log">
              <parameter key="k" value="operator.k-NN.parameter.k"/>
              <parameter key="num_measures" value="operator.k-NN.parameter.numerical_measure"/>
              <parameter key="Performance_perf" value="operator.Performance.value.performance"/>
              <parameter key="opt_par_perf" value="operator.Optimize Parameters (Grid).value.performance"/>
              <parameter key="xval_perf" value="operator.Validation.value.performance"/>
              <parameter key="perf2_perf" value="operator.Performance (2).value.performance"/>
              <parameter key="perf2_kappa" value="operator.Performance (2).value.kappa"/>
              <parameter key="perf3_perf" value="operator.Performance (3).value.performance"/>
              <parameter key="perf3_kappa" value="operator.Performance (3).value.kappa"/>
            </list>
          </operator>
          <connect from_port="input 1" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
          <connect from_op="Log" from_port="through 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_parameters" compatibility="7.2.000" expanded="true" height="82" name="Set Parameters" width="90" x="916" y="85">
        <list key="name_map">
          <parameter key="k-NN" value="k-NN2"/>
        </list>
      </operator>
      <operator activated="true" class="k_nn" compatibility="7.2.000" expanded="true" height="82" name="k-NN2" width="90" x="581" y="391">
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="CamberraDistance"/>
        <description align="center" color="transparent" colored="false" width="126">Final Model</description>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.2.000" expanded="true" height="124" name="Multiply Model" width="90" x="715" y="391"/>
      <operator activated="true" class="legacy:write_model" compatibility="7.2.000" expanded="true" height="68" name="Write Model" width="90" x="849" y="391">
        <parameter key="model_file" value="C:\Users\Marc\Desktop\knnmodel3.mod"/>
        <parameter key="output_type" value="Binary"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.2.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="849" y="238">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="7.2.000" expanded="true" height="82" name="Performance (2)" width="90" x="983" y="238">
        <parameter key="classification_error" value="true"/>
        <parameter key="kappa" value="true"/>
        <list key="class_weights"/>
      </operator>
      <operator activated="true" class="log" compatibility="7.2.000" expanded="true" height="82" name="Log Train Perfromance" width="90" x="1117" y="238">
        <list key="log">
          <parameter key="accuracy" value="operator.Performance.value.accuracy"/>
          <parameter key="classification error" value="operator.Performance.value.classification_error"/>
        </list>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.2.000" expanded="true" height="82" name="Apply Model (3)" width="90" x="983" y="595">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="7.2.000" expanded="true" height="82" name="Performance (3)" width="90" x="1117" y="595">
        <parameter key="classification_error" value="true"/>
        <parameter key="kappa" value="true"/>
        <list key="class_weights"/>
      </operator>
      <operator activated="true" class="log" compatibility="7.2.000" expanded="true" height="82" name="Log Test Performance" width="90" x="1251" y="595">
        <list key="log">
          <parameter key="accuracy" value="operator.Performance (3).value.accuracy"/>
          <parameter key="classification error" value="operator.Performance (3).value.classification_error"/>
        </list>
      </operator>
      <connect from_op="Retrieve testData" from_port="output" to_op="Select Attributes (3)" to_port="example set input"/>
      <connect from_op="Select Attributes (3)" from_port="example set output" to_op="Sample (2)" to_port="example set input"/>
      <connect from_op="Sample (2)" from_port="example set output" to_op="Apply Model (3)" to_port="unlabelled data"/>
      <connect from_op="Retrieve Master3Klassen_nominal" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Sample (Bootstrapping)" to_port="example set input"/>
      <connect from_op="Sample (Bootstrapping)" from_port="example set output" to_op="Multiply Trainings Data" to_port="input"/>
      <connect from_op="Multiply Trainings Data" from_port="output 1" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Multiply Trainings Data" from_port="output 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Multiply Trainings Data" from_port="output 3" to_op="k-NN2" to_port="training set"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_op="Set Parameters" to_port="parameter set"/>
      <connect from_op="Set Parameters" from_port="parameter set" to_port="result 2"/>
      <connect from_op="k-NN2" from_port="model" to_op="Multiply Model" to_port="input"/>
      <connect from_op="Multiply Model" from_port="output 1" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Multiply Model" from_port="output 2" to_op="Write Model" to_port="input"/>
      <connect from_op="Multiply Model" from_port="output 3" to_op="Apply Model (3)" to_port="model"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
      <connect from_op="Performance (2)" from_port="performance" to_op="Log Train Perfromance" to_port="through 1"/>
      <connect from_op="Log Train Perfromance" from_port="through 1" to_port="result 3"/>
      <connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
      <connect from_op="Performance (3)" from_port="performance" to_op="Log Test Performance" to_port="through 1"/>
      <connect from_op="Log Test Performance" from_port="through 1" to_port="result 4"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="21"/>
      <portSpacing port="sink_result 3" spacing="147"/>
      <portSpacing port="sink_result 4" spacing="336"/>
      <portSpacing port="sink_result 5" spacing="0"/>
      <description align="center" color="blue" colored="true" height="201" resized="true" width="563" x="19" y="44">Load and Prep Training Data</description>
      <description align="center" color="blue" colored="true" height="169" resized="true" width="461" x="130" y="552">Load and Prep Testing Data</description>
      <description align="center" color="purple" colored="true" height="179" resized="true" width="301" x="745" y="10">Find Optimal Parameters</description>
      <description align="center" color="purple" colored="true" height="183" resized="true" width="467" x="525" y="353">Apply Parameters and Train Model</description>
      <description align="center" color="gray" colored="true" height="136" resized="true" width="417" x="815" y="204">Training Error</description>
      <description align="center" color="gray" colored="true" height="136" resized="false" width="417" x="954" y="563">Testing Error</description>
      <description align="center" color="yellow" colored="false" height="120" resized="false" width="180" x="1086" y="387">Optimized parameters are k for k-NN between 1 and 7.&lt;br/&gt;&lt;br/&gt;Best parameter applied to the operator on the left with the note &amp;quot;Final Model&amp;quot;</description>
    </process>
  </operator>
</process>

Hope this helps,

Ingo
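
One possible reason, not confirmed anywhere in this thread, why bootstrapping before the cross-validation could raise the measured performance: a bootstrap sample contains duplicate rows, and a k-fold split can place copies of the same original row in both the training and the test folds. With k = 1, such a test row finds its twin at distance 0. The Python sketch below only counts how often that happens; it uses plain shuffled folds instead of stratified sampling, and all names and sizes are hypothetical.

```python
import random


def bootstrap_indices(n, seed):
    """Draw n row indices with replacement (sample ratio 1.0)."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]


def leaked_share(sample, folds=5, seed=0):
    """Share of test rows whose original row also occurs in the training folds."""
    rng = random.Random(seed)
    order = sample[:]
    rng.shuffle(order)
    leaked = 0
    for f in range(folds):
        test = order[f::folds]
        train = {order[i] for i in range(len(order)) if i % folds != f}
        leaked += sum(1 for idx in test if idx in train)
    return leaked / len(order)


# Bootstrapped data: many test rows have an identical copy in training.
share_bootstrapped = leaked_share(bootstrap_indices(1000, seed=1))

# Untouched data: every row is unique, so nothing can leak.
share_original = leaked_share(list(range(1000)))
```

In this sketch roughly half of the test rows in the bootstrapped case have a twin in the training folds, while the untouched data has none. If the same happens in the real process, the cross-validated estimate would look better than it is.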

Fred12 · August 2016

hi,

Thanks, the cleaned-up process looks much better. Regarding your questions:

- I thought the better performance could be a result of "skipping" those 30% of noisy examples (or some of them) that lie near class boundaries and cannot be classified well, although that does not sound very plausible even to me.
- I don't know; that was my mistake. I was just trying out different configurations. I guess I can leave out this operator for the training samples; they should stay untouched.
- Good point, I will do so in the future. Besides, is it possible to also store the log data from experiments in the repository to retrieve it later? I somehow cannot store the log results from the context menu (as I can with, e.g., a performance result).

IngoRM · August 2016

You are welcome :-)

I am introducing numbers here just in case we want to refer to those points later:

1. The thing with the bootstrapping is that there is no guarantee that next time you won't throw away 30% of good data. So even if this works in this particular process execution, it does not seem to be very robust. So I am glad that you do not find this very convincing yourself ;-)
2. Yep, I would remove it.
3. You can use "Log to Data" and then store the result like any other data set with the "Store" operator.

Cheers,

Ingo
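
The robustness point about bootstrapping can be checked with a few lines of Python. This is only a sketch under stated assumptions: sampling with replacement at ratio 1.0 is imitated with the standard library, and the function names and the data size are hypothetical rather than taken from RapidMiner. For large data sets the expected share of rows that are never drawn is close to 1/e, about 37%, a little above the rough figure quoted in the thread, and which rows are dropped changes with every seed.

```python
import math
import random


def bootstrap_indices(n, ratio=1.0, seed=None):
    """Draw int(n * ratio) row indices with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(int(n * ratio))]


def left_out_fraction(n, ratio=1.0, seed=None):
    """Share of the n original rows that never appear in the sample."""
    drawn = set(bootstrap_indices(n, ratio, seed))
    return 1 - len(drawn) / n


n = 10_000

# Left-out share for five different random seeds.
fractions = [left_out_fraction(n, seed=s) for s in range(5)]

# Theoretical expectation for ratio 1.0: (1 - 1/n)^n, close to 1/e.
expected = (1 - 1 / n) ** n

# The rows that get dropped differ from run to run.
dropped_a = set(range(n)) - set(bootstrap_indices(n, seed=0))
dropped_b = set(range(n)) - set(bootstrap_indices(n, seed=1))
```

Each entry of `fractions` lands near `expected`, but `dropped_a` and `dropped_b` are different sets of rows, so a gain seen with one seed may simply disappear with another.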
