How can I obtain the accuracy list of my process?

fiddinyusfida · August 2019

Hi everyone,

I'm very new in Rapidminer and just found difficulty here. I am conducting a loop process for a model, says 10 iterations and calculate the accuracy performance. However, the result shows only the averaged accuracy or final accuracy. I need the list of accuracy (which is contains 10 accuracies) in order to further check using statistical software like SPSS.

Is it possible to obtain accuracy list of my process using rapidminer?

Below is the averaged accuracy sample. Thanks for your kind response

varunm1 · August 2019

Hello @fiddinyusfida

Did you add a performance operator inside the loop operator? Here is a sample XML code (Click SHow) on Titanic dataset. To use this XML code, you need to copy this and open a new process in your rapidminer, the paste this in XML window of rapidminer and click on green tick mark. You will see the process in the process window.

<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="112" y="85">
<parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
</operator>
<operator activated="true" class="concurrency:loop_values" compatibility="9.3.001" expanded="true" height="82" name="Loop Values" width="90" x="313" y="85">
<parameter key="attribute" value="Sex"/>
<parameter key="iteration_macro" value="loop_value"/>
<parameter key="reuse_results" value="false"/>
<parameter key="enable_parallel_execution" value="true"/>
<process expanded="true">
<operator activated="true" class="split_data" compatibility="9.3.001" expanded="true" height="103" name="Split Data" width="90" x="45" y="34">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
<parameter key="sampling_type" value="automatic"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.3.001" expanded="true" height="103" name="Decision Tree" width="90" x="179" y="34">
<parameter key="criterion" value="gain_ratio"/>
<parameter key="maximal_depth" value="10"/>
<parameter key="apply_pruning" value="true"/>
<parameter key="confidence" value="0.1"/>
<parameter key="apply_prepruning" value="true"/>
<parameter key="minimal_gain" value="0.01"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="number_of_prepruning_alternatives" value="3"/>
</operator>
<operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model" width="90" x="380" y="187">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="9.3.001" expanded="true" height="82" name="Performance" width="90" x="648" y="136">
<parameter key="main_criterion" value="first"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="false"/>
<parameter key="kappa" value="false"/>
<parameter key="weighted_mean_recall" value="false"/>
<parameter key="weighted_mean_precision" value="false"/>
<parameter key="spearman_rho" value="false"/>
<parameter key="kendall_tau" value="false"/>
<parameter key="absolute_error" value="false"/>
<parameter key="relative_error" value="false"/>
<parameter key="relative_error_lenient" value="false"/>
<parameter key="relative_error_strict" value="false"/>
<parameter key="normalized_absolute_error" value="false"/>
<parameter key="root_mean_squared_error" value="false"/>
<parameter key="root_relative_squared_error" value="false"/>
<parameter key="squared_error" value="false"/>
<parameter key="correlation" value="false"/>
<parameter key="squared_correlation" value="false"/>
<parameter key="cross-entropy" value="false"/>
<parameter key="margin" value="false"/>
<parameter key="soft_margin_loss" value="false"/>
<parameter key="logistic_loss" value="false"/>
<parameter key="skip_undefined_labels" value="true"/>
<parameter key="use_example_weights" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="input 1" to_op="Split Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 1" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve Titanic Training" from_port="output" to_op="Loop Values" to_port="input 1"/>
<connect from_op="Loop Values" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

Please inform if you need more information.

fiddinyusfida · August 2019

@varunm1 Thank you for your response,

In your previous solution, I cannot define how many iterations. Here I attached the loop with average function.

After I calculated manually, Why does this process produce a different averaged result?

<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">

</context>

</operator>

</enumeration>

</operator>

</operator>

</operator>

</operator>

</process>

</operator>

</enumeration>

</operator>

</operator>

</operator>

</operator>

</process>

</operator>

</process>

</operator>

</process>

varunm1 · August 2019

Hello @fiddinyusfida

Thanks for the process, I did check the process. My understanding is the change in accuracy is based on splitting of data. As you are splitting it some times the test set changes and train set changes t changes accuracy. I fixed it by using a "local random seed" option in Split data operator, can you check now the below-modified process and see it is ok for you.

<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="136">
<parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.3.001" expanded="true" height="103" name="Multiply" width="90" x="179" y="136"/>
<operator activated="true" class="loop_and_average" compatibility="9.3.001" expanded="true" height="82" name="Loop and Average" width="90" x="380" y="187">
<parameter key="iterations" value="2"/>
<parameter key="average_performances_only" value="false"/>
<process expanded="true">
<operator activated="true" class="split_data" compatibility="9.3.001" expanded="true" height="103" name="Split Data" width="90" x="45" y="85">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
<parameter key="sampling_type" value="automatic"/>
<parameter key="use_local_random_seed" value="true"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.3.001" expanded="true" height="103" name="Decision Tree" width="90" x="246" y="85">
<parameter key="criterion" value="gain_ratio"/>
<parameter key="maximal_depth" value="10"/>
<parameter key="apply_pruning" value="true"/>
<parameter key="confidence" value="0.1"/>
<parameter key="apply_prepruning" value="true"/>
<parameter key="minimal_gain" value="0.01"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="number_of_prepruning_alternatives" value="3"/>
</operator>
<operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model" width="90" x="514" y="85">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="9.3.001" expanded="true" height="82" name="Performance" width="90" x="715" y="85">
<parameter key="main_criterion" value="first"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="false"/>
<parameter key="kappa" value="false"/>
<parameter key="weighted_mean_recall" value="false"/>
<parameter key="weighted_mean_precision" value="false"/>
<parameter key="spearman_rho" value="false"/>
<parameter key="kendall_tau" value="false"/>
<parameter key="absolute_error" value="false"/>
<parameter key="relative_error" value="false"/>
<parameter key="relative_error_lenient" value="false"/>
<parameter key="relative_error_strict" value="false"/>
<parameter key="normalized_absolute_error" value="false"/>
<parameter key="root_mean_squared_error" value="false"/>
<parameter key="root_relative_squared_error" value="false"/>
<parameter key="squared_error" value="false"/>
<parameter key="correlation" value="false"/>
<parameter key="squared_correlation" value="false"/>
<parameter key="cross-entropy" value="false"/>
<parameter key="margin" value="false"/>
<parameter key="soft_margin_loss" value="false"/>
<parameter key="logistic_loss" value="false"/>
<parameter key="skip_undefined_labels" value="true"/>
<parameter key="use_example_weights" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="in 1" to_op="Split Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 1" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="concurrency:loop" compatibility="9.3.001" expanded="true" height="82" name="Loop" width="90" x="380" y="34">
<parameter key="number_of_iterations" value="2"/>
<parameter key="iteration_macro" value="iteration"/>
<parameter key="reuse_results" value="false"/>
<parameter key="enable_parallel_execution" value="false"/>
<process expanded="true">
<operator activated="true" class="split_data" compatibility="9.3.001" expanded="true" height="103" name="Split Data (2)" width="90" x="45" y="34">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
<parameter key="sampling_type" value="automatic"/>
<parameter key="use_local_random_seed" value="true"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.3.001" expanded="true" height="103" name="Decision Tree (2)" width="90" x="179" y="34">
<parameter key="criterion" value="gain_ratio"/>
<parameter key="maximal_depth" value="10"/>
<parameter key="apply_pruning" value="true"/>
<parameter key="confidence" value="0.1"/>
<parameter key="apply_prepruning" value="true"/>
<parameter key="minimal_gain" value="0.01"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="number_of_prepruning_alternatives" value="3"/>
</operator>
<operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="514" y="187">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="9.3.001" expanded="true" height="82" name="Performance (2)" width="90" x="648" y="136">
<parameter key="main_criterion" value="first"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="false"/>
<parameter key="kappa" value="false"/>
<parameter key="weighted_mean_recall" value="false"/>
<parameter key="weighted_mean_precision" value="false"/>
<parameter key="spearman_rho" value="false"/>
<parameter key="kendall_tau" value="false"/>
<parameter key="absolute_error" value="false"/>
<parameter key="relative_error" value="false"/>
<parameter key="relative_error_lenient" value="false"/>
<parameter key="relative_error_strict" value="false"/>
<parameter key="normalized_absolute_error" value="false"/>
<parameter key="root_mean_squared_error" value="false"/>
<parameter key="root_relative_squared_error" value="false"/>
<parameter key="squared_error" value="false"/>
<parameter key="correlation" value="false"/>
<parameter key="squared_correlation" value="false"/>
<parameter key="cross-entropy" value="false"/>
<parameter key="margin" value="false"/>
<parameter key="soft_margin_loss" value="false"/>
<parameter key="logistic_loss" value="false"/>
<parameter key="skip_undefined_labels" value="true"/>
<parameter key="use_example_weights" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="input 1" to_op="Split Data (2)" to_port="example set"/>
<connect from_op="Split Data (2)" from_port="partition 1" to_op="Decision Tree (2)" to_port="training set"/>
<connect from_op="Split Data (2)" from_port="partition 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Decision Tree (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Performance (2)" from_port="performance" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve Titanic Training" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Loop" to_port="input 1"/>
<connect from_op="Multiply" from_port="output 2" to_op="Loop and Average" to_port="in 1"/>
<connect from_op="Loop and Average" from_port="averagable 1" to_port="result 2"/>
<connect from_op="Loop" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>

fiddinyusfida · September 2019

@varunm1 Many thanks, this helps me a lot

I just curious,
Are there any ways to make this local random seed increases as the iteration process?

Such as this pseudocode

For i in 5:<br>&nbsp; &nbsp;Random seed (i)

varunm1 · September 2019

You can use macros for that like %{execution_count}

hughesfleming68 · September 2019

Just a quick comment here.... I wouldn't try and increment your random seed this way. If you chose your best accuracy based on changing your random seed then any improvement won't translate to out of sample data. There are many ways to trick yourself into thinking your model is better than it really is and this is one of them.

fiddinyusfida · September 2019

@varunm1 Thanks for the suggestion, I am still learning how to implement macros.

@hughesfleming68 Thanks for the response. What actually I want to do is repeating the process 30 times (based on the Central Limit Theorem) by using a random seed.

After I obtain the 30 accuracies (comes from random seed 1 to 30), I want to do statistical hypothesis testing to know whether my proposed method is significant or not (compare to another).

Or is there any suggestion about this?

I quoted central limit theorem from this link
(https://www.investopedia.com/terms/c/central_limit_theorem.asp)

Sample sizes equal to or greater than 30 are considered sufficient for the CLT to hold.

hughesfleming68 · September 2019

Hi, @fiddinyusfida. Sometimes that approach is unavoidable. It is something I have to deal with when I use Tensorflow for time series forecasting as opposed to other frameworks like DL4J or PyTorch which make it much easier to get repeatable results.

In your case, the data splitting is the weak link and whether you change the spit ratio,sampling type or random seed, you still could get wildly different different results. It is something I would do as a last resort. It is better to use as much data as you can and then use cross validation or sliding window validation in the case of a time series to get a result you can start to trust. In the end only testing on out of sample data will tell you if your testing was valid. If your data is very random....sometimes we can't control this part then even averaging 30 times might not be helpful. It all depends how stable your data is.

fiddinyusfida · September 2019

Hi @hughesfleming68,

I have only 100 records and seems hard to add the data since I obtained it from public dataset repository.

So, based on your tips, It will be better if I use Cross-validation (for instance K=10) and just averaging the accuracy instead of doing data-split with ratio?

hughesfleming68 · September 2019

I would start with five fold cross validation and switch between linear and shuffled sampling to see what effect that has on your result. Either way, you are going to be data limited but that depends on how regular your data is. I would still chose that over split. Luck plays a large role when it comes to splitting small data sets. I always seem to get over optimistic in sample results and poorer out of sample results so I am very cautious.

fiddinyusfida · September 2019

@hughesfleming68

I really appreciate your advice, thank you....

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

How can I obtain the accuracy list of my process?

Best Answer

Be Safe. Follow precautions and Maintain Social Distancing

Answers

Be Safe. Follow precautions and Maintain Social Distancing

Be Safe. Follow precautions and Maintain Social Distancing