Is this k-NN process "legitimate"?
Hi,
I designed a k-NN learning process and would like to know whether it is "legitimate", in the sense that the model is trained correctly and can then be used to predict future samples.
The learning problem concerns chemical structures in materials: looking at mineral, grain-like structures under a microscope and determining the chemical components from the shape and size of the grain structure. Each example is one grain.
I'm not sure whether I made it too easy for myself. Here is the process:
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="false" class="retrieve" compatibility="7.2.000" expanded="true" height="68" name="Retrieve" width="90" x="45" y="238">
<parameter key="repository_entry" value="//RapidMiner_Nils/Nils/Master/Data/Master Excelliste_Gefügebezeichnung_3 klassen"/>
</operator>
<operator activated="false" class="split_data" compatibility="7.2.000" expanded="true" height="103" name="Split Data" width="90" x="179" y="238">
<enumeration key="partitions">
<parameter key="ratio" value="0.5"/>
<parameter key="ratio" value="0.5"/>
</enumeration>
<parameter key="sampling_type" value="stratified sampling"/>
<parameter key="use_local_random_seed" value="true"/>
</operator>
<operator activated="false" class="write_excel" compatibility="7.2.000" expanded="true" height="82" name="Write Excel (2)" width="90" x="45" y="442">
<parameter key="excel_file" value="C:\Users\Admin\Desktop\testData.xlsx"/>
</operator>
<operator activated="false" class="write_excel" compatibility="7.2.000" expanded="true" height="82" name="Write Excel" width="90" x="45" y="136">
<parameter key="excel_file" value="C:\Users\Admin\Desktop\trainData.xlsx"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.2.000" expanded="true" height="68" name="Retrieve testData" width="90" x="179" y="391">
<parameter key="repository_entry" value="../../data/testData"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.2.000" expanded="true" height="82" name="Select Attributes (3)" width="90" x="313" y="442">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Durchmesser|Euler Zahl REM|Flächegefüllt LIMI|Grauwert normiert|Fläche REM/LIMI|FlächezuGesamtfläche LIMI"/>
</operator>
<operator activated="true" class="sample_bootstrapping" compatibility="7.2.000" expanded="true" height="82" name="Sample (2)" width="90" x="447" y="442">
<parameter key="sample_ratio" value="0.5"/>
<parameter key="local_random_seed" value="1"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.2.000" expanded="true" height="68" name="Retrieve Master3Klassen_nominal" width="90" x="45" y="34">
<parameter key="repository_entry" value="../../data/Master3Klassen_nominal"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Durchmesser|Euler Zahl REM|Flächegefüllt LIMI|Grauwert normiert|Fläche REM/LIMI|FlächezuGesamtfläche LIMI"/>
</operator>
<operator activated="true" class="sample_bootstrapping" compatibility="7.2.000" expanded="true" height="82" name="Sample (Bootstrapping)" width="90" x="313" y="34">
<parameter key="use_weights" value="false"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.2.000" expanded="true" height="124" name="Multiply Trainings Data" width="90" x="447" y="34"/>
<operator activated="true" class="optimize_parameters_grid" compatibility="7.2.000" expanded="true" height="103" name="Optimize Parameters (Grid)" width="90" x="648" y="34">
<list key="parameters">
<parameter key="k-NN.k" value="[1.0;7;3;linear]"/>
</list>
<process expanded="true">
<operator activated="true" class="x_validation" compatibility="7.2.000" expanded="true" height="124" name="Validation" width="90" x="313" y="34">
<parameter key="number_of_validations" value="5"/>
<parameter key="sampling_type" value="stratified sampling"/>
<process expanded="true">
<operator activated="true" class="bagging" compatibility="7.2.000" expanded="true" height="82" name="Bagging" width="90" x="179" y="34">
<process expanded="true">
<operator activated="true" class="metacost" compatibility="7.2.000" expanded="true" height="82" name="MetaCost (2)" width="90" x="246" y="34">
<parameter key="cost_matrix" value="[0.0 5.0 2.0;1.0 0.0 2.0;1.0 5.0 0.0]"/>
<parameter key="sampling_with_replacement" value="false"/>
<process expanded="true">
<operator activated="true" class="k_nn" compatibility="7.2.000" expanded="true" height="82" name="k-NN" width="90" x="313" y="34">
<parameter key="k" value="7"/>
<parameter key="weighted_vote" value="true"/>
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CamberraDistance"/>
</operator>
<connect from_port="training set" to_op="k-NN" to_port="training set"/>
<connect from_op="k-NN" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
</process>
</operator>
<connect from_port="training set" to_op="MetaCost (2)" to_port="training set"/>
<connect from_op="MetaCost (2)" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
</process>
</operator>
<connect from_port="training" to_op="Bagging" to_port="training set"/>
<connect from_op="Bagging" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="7.2.000" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.2.000" expanded="true" height="82" name="Performance" width="90" x="313" y="34">
<parameter key="kappa" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="log" compatibility="7.2.000" expanded="true" height="82" name="Log" width="90" x="648" y="85">
<list key="log">
<parameter key="k" value="operator.k-NN.parameter.k"/>
<parameter key="num_measures" value="operator.k-NN.parameter.numerical_measure"/>
<parameter key="Performance_perf" value="operator.Performance.value.performance"/>
<parameter key="opt_par_perf" value="operator.Optimize Parameters (Grid).value.performance"/>
<parameter key="xval_perf" value="operator.Validation.value.performance"/>
<parameter key="perf2_perf" value="operator.Performance (2).value.performance"/>
<parameter key="perf2_kappa" value="operator.Performance (2).value.kappa"/>
<parameter key="perf3_perf" value="operator.Performance (3).value.performance"/>
<parameter key="perf3_kappa" value="operator.Performance (3).value.kappa"/>
</list>
</operator>
<connect from_port="input 1" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
<connect from_op="Log" from_port="through 1" to_port="performance"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
<operator activated="true" class="set_parameters" compatibility="7.2.000" expanded="true" height="82" name="Set Parameters" width="90" x="849" y="85">
<list key="name_map">
<parameter key="k-NN" value="k-NN2"/>
</list>
</operator>
<operator activated="true" class="k_nn" compatibility="7.2.000" expanded="true" height="82" name="k-NN2" width="90" x="581" y="187">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CamberraDistance"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.2.000" expanded="true" height="124" name="Multiply Model" width="90" x="782" y="187"/>
<operator activated="true" class="legacy:write_model" compatibility="7.2.000" expanded="true" height="68" name="Write Model" width="90" x="916" y="238">
<parameter key="model_file" value="C:\Users\Marc\Desktop\knnmodel3.mod"/>
<parameter key="output_type" value="Binary"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.2.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="447" y="289">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.2.000" expanded="true" height="82" name="Performance (2)" width="90" x="715" y="340">
<parameter key="classification_error" value="true"/>
<parameter key="kappa" value="true"/>
<list key="class_weights"/>
</operator>
<operator activated="true" class="log" compatibility="7.2.000" expanded="true" height="82" name="Log Train Perfromance" width="90" x="849" y="340">
<list key="log">
<parameter key="accuracy" value="operator.Performance (2).value.accuracy"/>
<parameter key="classification error" value="operator.Performance (2).value.classification_error"/>
</list>
</operator>
<operator activated="true" class="apply_model" compatibility="7.2.000" expanded="true" height="82" name="Apply Model (3)" width="90" x="581" y="442">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.2.000" expanded="true" height="82" name="Performance (3)" width="90" x="715" y="442">
<parameter key="classification_error" value="true"/>
<parameter key="kappa" value="true"/>
<list key="class_weights"/>
</operator>
<operator activated="true" class="log" compatibility="7.2.000" expanded="true" height="82" name="Log Test Performance" width="90" x="849" y="442">
<list key="log">
<parameter key="accuracy" value="operator.Performance (3).value.accuracy"/>
<parameter key="classification error" value="operator.Performance (3).value.classification_error"/>
</list>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Split Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 1" to_op="Write Excel" to_port="input"/>
<connect from_op="Split Data" from_port="partition 2" to_op="Write Excel (2)" to_port="input"/>
<connect from_op="Retrieve testData" from_port="output" to_op="Select Attributes (3)" to_port="example set input"/>
<connect from_op="Select Attributes (3)" from_port="example set output" to_op="Sample (2)" to_port="example set input"/>
<connect from_op="Sample (2)" from_port="example set output" to_op="Apply Model (3)" to_port="unlabelled data"/>
<connect from_op="Retrieve Master3Klassen_nominal" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Sample (Bootstrapping)" to_port="example set input"/>
<connect from_op="Sample (Bootstrapping)" from_port="example set output" to_op="Multiply Trainings Data" to_port="input"/>
<connect from_op="Multiply Trainings Data" from_port="output 1" to_op="k-NN2" to_port="training set"/>
<connect from_op="Multiply Trainings Data" from_port="output 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Multiply Trainings Data" from_port="output 3" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
<connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
<connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_op="Set Parameters" to_port="parameter set"/>
<connect from_op="Set Parameters" from_port="parameter set" to_port="result 4"/>
<connect from_op="k-NN2" from_port="model" to_op="Multiply Model" to_port="input"/>
<connect from_op="Multiply Model" from_port="output 1" to_op="Apply Model (3)" to_port="model"/>
<connect from_op="Multiply Model" from_port="output 2" to_op="Write Model" to_port="input"/>
<connect from_op="Multiply Model" from_port="output 3" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Performance (2)" from_port="performance" to_op="Log Train Perfromance" to_port="through 1"/>
<connect from_op="Log Train Perfromance" from_port="through 1" to_port="result 2"/>
<connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
<connect from_op="Performance (3)" from_port="performance" to_op="Log Test Performance" to_port="through 1"/>
<connect from_op="Log Test Performance" from_port="through 1" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>
Basically, I split the data into training and test sets, select the same attributes for each, and apply Sample (Bootstrapping) to both. Then I train the model on either 70% or 100% of the data. I know that with 100% the model simply memorizes the data, but my thinking was that the more data it sees, the more general and useful it becomes, so that new data coming out of the production environment can be tested against the full model.
This configuration performs best with k = 1 and the Canberra distance (CamberraDistance in the process), and it is also the most stable: if I remove the Sample (Bootstrapping) operator, performance drops by around 5-10%.
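One thing that can be checked outside RapidMiner is what bootstrapping before cross-validation does to a 1-NN score. Sampling with replacement creates exact copies of examples, and when a copy lands in the training folds while its twin is held out, 1-NN finds a neighbour at distance 0. The sketch below is plain Python on made-up noise data (the data, the function names and the interleaved fold scheme are all assumptions, not taken from the process), so it only illustrates the mechanism and does not prove that this is what happens with the grain data.

```python
import random

def canberra(a, b):
    """Canberra distance: sum of |x - y| / (|x| + |y|), skipping 0/0 terms."""
    return sum(abs(x - y) / (abs(x) + abs(y))
               for x, y in zip(a, b) if abs(x) + abs(y) > 0)

def one_nn_cv_accuracy(rows, folds=5):
    """Interleaved k-fold cross-validation accuracy of a 1-NN classifier."""
    hits = 0
    for i in range(folds):
        held_out = rows[i::folds]
        fit = [r for j, r in enumerate(rows) if j % folds != i]
        for feats, label in held_out:
            nearest = min(fit, key=lambda r: canberra(r[0], feats))
            hits += nearest[1] == label
    return hits / len(rows)

def duplicate_share(rows, folds=5):
    """Share of held-out rows whose exact copy also sits in the training folds."""
    shared = 0
    for i in range(folds):
        fit_ids = {id(r) for j, r in enumerate(rows) if j % folds != i}
        shared += sum(id(r) in fit_ids for r in rows[i::folds])
    return shared / len(rows)

rng = random.Random(1)

# Hypothetical noise data: 6 features, 3 labels drawn independently of the
# features, so chance level (about 1/3) is the only honest score.
rows = [([rng.uniform(1.0, 10.0) for _ in range(6)], rng.choice("ABC"))
        for _ in range(300)]

clean_acc = one_nn_cv_accuracy(rows)

# Bootstrap BEFORE cross-validation, as Sample (Bootstrapping) does upstream
# of the validation operator. random.choices samples with replacement.
boot = rng.choices(rows, k=len(rows))
leaky_acc = one_nn_cv_accuracy(boot)

print("clean CV accuracy:", clean_acc)
print("CV accuracy after bootstrapping:", leaky_acc)
print("held-out rows with a copy in training:", duplicate_share(boot))
```

If the score on pure noise rises after bootstrapping, the rise comes from the copies and not from anything the model learned. Whether the 5-10% difference reported above has the same cause would need to be checked on the real data, for example by bootstrapping only inside the training side of each fold.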
With 70% of the original data, I get around 85-90% accuracy on the test data and 100% training accuracy. With 100%, I get around 94% on the test data.
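For comparison, here is a minimal sketch of the same idea in plain Python: hold out a test set first, tune k by cross-validation on the training part only, and score the held-out part once at the end. It is not a translation of the RapidMiner process. The synthetic data, the 70/30 split, the k grid of 1, 3, 5, 7 (my reading of the `[1.0;7;3;linear]` setting) and all function names are assumptions, and Bagging and MetaCost are left out.

```python
import random
from collections import defaultdict

def canberra(a, b):
    """Canberra distance: sum of |x - y| / (|x| + |y|), skipping 0/0 terms."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:
            total += abs(x - y) / denom
    return total

def knn_predict(train, query, k):
    """Distance-weighted vote among the k nearest (features, label) rows."""
    scored = sorted(((canberra(f, query), lbl) for f, lbl in train),
                    key=lambda t: t[0])[:k]
    votes = defaultdict(float)
    for dist, lbl in scored:
        votes[lbl] += 1.0 / (dist + 1e-9)  # small constant avoids division by zero
    return max(votes, key=votes.get)

def accuracy(train, test, k):
    return sum(knn_predict(train, f, k) == lbl for f, lbl in test) / len(test)

def cv_accuracy(rows, k, folds=5):
    """Interleaved k-fold CV (not stratified, to keep the sketch short)."""
    scores = []
    for i in range(folds):
        held_out = rows[i::folds]
        fit = [r for j, r in enumerate(rows) if j % folds != i]
        scores.append(accuracy(fit, held_out, k))
    return sum(scores) / folds

# Hypothetical stand-in data: 3 classes, 6 numeric features per grain.
rng = random.Random(1)
centres = {"A": 4.0, "B": 6.0, "C": 8.0}
rows = [([rng.gauss(c, 1.5) for _ in range(6)], lbl)
        for lbl, c in centres.items() for _ in range(60)]
rng.shuffle(rows)

# Hold the test set out BEFORE any tuning or resampling.
cut = int(0.7 * len(rows))
train_rows, test_rows = rows[:cut], rows[cut:]

# Grid search over k using cross-validation on the training part only.
cv_scores = {k: cv_accuracy(train_rows, k) for k in (1, 3, 5, 7)}
best_k = max(cv_scores, key=cv_scores.get)

# 1-NN scored on its own training data finds each point as its own nearest
# neighbour, so this is 100% unless identical points carry different labels.
train_acc = accuracy(train_rows, train_rows, 1)

# The held-out score is the one that says something about future samples.
test_acc = accuracy(train_rows, test_rows, best_k)
print(best_k, cv_scores, train_acc, test_acc)
```

The 100% training accuracy is therefore a property of 1-NN and carries no information about generalization. Once k is fixed, retraining on all available data for production use is common practice, but the test score reported for that final model should still come from data it never saw during training or tuning.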
Answers
Hi,
After a quick look I would say yes: I did not see any massive issues in the process, although some things do not make a lot of sense to me. Here is a small list of questions and hints to think about:
Hope this helps,
Ingo
Hi,
Thanks, the cleaned-up process looks much better. Regarding your questions:
You are welcome.
I am introducing numbers here in case we want to refer to these points later:
Cheers,
Ingo