How can I obtain the accuracy list of my process?

fiddinyusfida Member Posts: 12 Contributor II
edited August 2019 in Help
Hi everyone,

I'm very new to RapidMiner and have run into a difficulty. I am running a model in a loop of, say, 10 iterations and calculating the accuracy. However, the result shows only the averaged or final accuracy. I need the list of accuracies (all 10 values) so that I can check them further in statistical software such as SPSS.

Is it possible to obtain the accuracy list of my process in RapidMiner?

Below is a sample of the averaged accuracy. Thanks for your kind response.


Answers

  • fiddinyusfida Member Posts: 12 Contributor II
    @varunm1 Thank you for your response.

    In your previous solution, I cannot define the number of iterations. Here I have attached a process with the loop and the average function.

    After calculating the average manually, I get a different value. Why does this process produce a different averaged result?

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="136">
            <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="9.3.001" expanded="true" height="103" name="Multiply" width="90" x="179" y="136"/>
          <operator activated="true" class="loop_and_average" compatibility="9.3.001" expanded="true" height="82" name="Loop and Average" width="90" x="380" y="187">
            <parameter key="iterations" value="2"/>
            <parameter key="average_performances_only" value="false"/>
            <process expanded="true">
              <operator activated="true" class="split_data" compatibility="9.3.001" expanded="true" height="103" name="Split Data" width="90" x="45" y="85">
                <enumeration key="partitions">
                  <parameter key="ratio" value="0.7"/>
                  <parameter key="ratio" value="0.3"/>
                </enumeration>
                <parameter key="sampling_type" value="automatic"/>
                <parameter key="use_local_random_seed" value="false"/>
                <parameter key="local_random_seed" value="1992"/>
              </operator>
              <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.3.001" expanded="true" height="103" name="Decision Tree" width="90" x="246" y="85">
                <parameter key="criterion" value="gain_ratio"/>
                <parameter key="maximal_depth" value="10"/>
                <parameter key="apply_pruning" value="true"/>
                <parameter key="confidence" value="0.1"/>
                <parameter key="apply_prepruning" value="true"/>
                <parameter key="minimal_gain" value="0.01"/>
                <parameter key="minimal_leaf_size" value="2"/>
                <parameter key="minimal_size_for_split" value="4"/>
                <parameter key="number_of_prepruning_alternatives" value="3"/>
              </operator>
              <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model" width="90" x="514" y="85">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="9.3.001" expanded="true" height="82" name="Performance" width="90" x="715" y="85">
                <parameter key="main_criterion" value="first"/>
                <parameter key="accuracy" value="true"/>
                <parameter key="classification_error" value="false"/>
                <parameter key="kappa" value="false"/>
                <parameter key="weighted_mean_recall" value="false"/>
                <parameter key="weighted_mean_precision" value="false"/>
                <parameter key="spearman_rho" value="false"/>
                <parameter key="kendall_tau" value="false"/>
                <parameter key="absolute_error" value="false"/>
                <parameter key="relative_error" value="false"/>
                <parameter key="relative_error_lenient" value="false"/>
                <parameter key="relative_error_strict" value="false"/>
                <parameter key="normalized_absolute_error" value="false"/>
                <parameter key="root_mean_squared_error" value="false"/>
                <parameter key="root_relative_squared_error" value="false"/>
                <parameter key="squared_error" value="false"/>
                <parameter key="correlation" value="false"/>
                <parameter key="squared_correlation" value="false"/>
                <parameter key="cross-entropy" value="false"/>
                <parameter key="margin" value="false"/>
                <parameter key="soft_margin_loss" value="false"/>
                <parameter key="logistic_loss" value="false"/>
                <parameter key="skip_undefined_labels" value="true"/>
                <parameter key="use_example_weights" value="true"/>
                <list key="class_weights"/>
              </operator>
              <connect from_port="in 1" to_op="Split Data" to_port="example set"/>
              <connect from_op="Split Data" from_port="partition 1" to_op="Decision Tree" to_port="training set"/>
              <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="concurrency:loop" compatibility="9.3.001" expanded="true" height="82" name="Loop" width="90" x="380" y="34">
            <parameter key="number_of_iterations" value="2"/>
            <parameter key="iteration_macro" value="iteration"/>
            <parameter key="reuse_results" value="false"/>
            <parameter key="enable_parallel_execution" value="false"/>
            <process expanded="true">
              <operator activated="true" class="split_data" compatibility="9.3.001" expanded="true" height="103" name="Split Data (2)" width="90" x="45" y="34">
                <enumeration key="partitions">
                  <parameter key="ratio" value="0.7"/>
                  <parameter key="ratio" value="0.3"/>
                </enumeration>
                <parameter key="sampling_type" value="automatic"/>
                <parameter key="use_local_random_seed" value="false"/>
                <parameter key="local_random_seed" value="1992"/>
              </operator>
              <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.3.001" expanded="true" height="103" name="Decision Tree (2)" width="90" x="179" y="34">
                <parameter key="criterion" value="gain_ratio"/>
                <parameter key="maximal_depth" value="10"/>
                <parameter key="apply_pruning" value="true"/>
                <parameter key="confidence" value="0.1"/>
                <parameter key="apply_prepruning" value="true"/>
                <parameter key="minimal_gain" value="0.01"/>
                <parameter key="minimal_leaf_size" value="2"/>
                <parameter key="minimal_size_for_split" value="4"/>
                <parameter key="number_of_prepruning_alternatives" value="3"/>
              </operator>
              <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="514" y="187">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="9.3.001" expanded="true" height="82" name="Performance (2)" width="90" x="648" y="136">
                <parameter key="main_criterion" value="first"/>
                <parameter key="accuracy" value="true"/>
                <parameter key="classification_error" value="false"/>
                <parameter key="kappa" value="false"/>
                <parameter key="weighted_mean_recall" value="false"/>
                <parameter key="weighted_mean_precision" value="false"/>
                <parameter key="spearman_rho" value="false"/>
                <parameter key="kendall_tau" value="false"/>
                <parameter key="absolute_error" value="false"/>
                <parameter key="relative_error" value="false"/>
                <parameter key="relative_error_lenient" value="false"/>
                <parameter key="relative_error_strict" value="false"/>
                <parameter key="normalized_absolute_error" value="false"/>
                <parameter key="root_mean_squared_error" value="false"/>
                <parameter key="root_relative_squared_error" value="false"/>
                <parameter key="squared_error" value="false"/>
                <parameter key="correlation" value="false"/>
                <parameter key="squared_correlation" value="false"/>
                <parameter key="cross-entropy" value="false"/>
                <parameter key="margin" value="false"/>
                <parameter key="soft_margin_loss" value="false"/>
                <parameter key="logistic_loss" value="false"/>
                <parameter key="skip_undefined_labels" value="true"/>
                <parameter key="use_example_weights" value="true"/>
                <list key="class_weights"/>
              </operator>
              <connect from_port="input 1" to_op="Split Data (2)" to_port="example set"/>
              <connect from_op="Split Data (2)" from_port="partition 1" to_op="Decision Tree (2)" to_port="training set"/>
              <connect from_op="Split Data (2)" from_port="partition 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
              <connect from_op="Decision Tree (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
              <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
              <connect from_op="Performance (2)" from_port="performance" to_port="output 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Loop" to_port="input 1"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Loop and Average" to_port="in 1"/>
          <connect from_op="Loop and Average" from_port="averagable 1" to_port="result 2"/>
          <connect from_op="Loop" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hello @fiddinyusfida,

    Thanks for the process; I checked it. My understanding is that the change in accuracy comes from the way the data is split. Each time the data is split, the training and test sets can change, and that changes the accuracy. I fixed this by enabling the "local random seed" option in the Split Data operator. Can you check the modified process below and see whether it works for you?

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="136">
    <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="9.3.001" expanded="true" height="103" name="Multiply" width="90" x="179" y="136"/>
    <operator activated="true" class="loop_and_average" compatibility="9.3.001" expanded="true" height="82" name="Loop and Average" width="90" x="380" y="187">
    <parameter key="iterations" value="2"/>
    <parameter key="average_performances_only" value="false"/>
    <process expanded="true">
    <operator activated="true" class="split_data" compatibility="9.3.001" expanded="true" height="103" name="Split Data" width="90" x="45" y="85">
    <enumeration key="partitions">
    <parameter key="ratio" value="0.7"/>
    <parameter key="ratio" value="0.3"/>
    </enumeration>
    <parameter key="sampling_type" value="automatic"/>
    <parameter key="use_local_random_seed" value="true"/>
    <parameter key="local_random_seed" value="1992"/>
    </operator>
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.3.001" expanded="true" height="103" name="Decision Tree" width="90" x="246" y="85">
    <parameter key="criterion" value="gain_ratio"/>
    <parameter key="maximal_depth" value="10"/>
    <parameter key="apply_pruning" value="true"/>
    <parameter key="confidence" value="0.1"/>
    <parameter key="apply_prepruning" value="true"/>
    <parameter key="minimal_gain" value="0.01"/>
    <parameter key="minimal_leaf_size" value="2"/>
    <parameter key="minimal_size_for_split" value="4"/>
    <parameter key="number_of_prepruning_alternatives" value="3"/>
    </operator>
    <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model" width="90" x="514" y="85">
    <list key="application_parameters"/>
    <parameter key="create_view" value="false"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="9.3.001" expanded="true" height="82" name="Performance" width="90" x="715" y="85">
    <parameter key="main_criterion" value="first"/>
    <parameter key="accuracy" value="true"/>
    <parameter key="classification_error" value="false"/>
    <parameter key="kappa" value="false"/>
    <parameter key="weighted_mean_recall" value="false"/>
    <parameter key="weighted_mean_precision" value="false"/>
    <parameter key="spearman_rho" value="false"/>
    <parameter key="kendall_tau" value="false"/>
    <parameter key="absolute_error" value="false"/>
    <parameter key="relative_error" value="false"/>
    <parameter key="relative_error_lenient" value="false"/>
    <parameter key="relative_error_strict" value="false"/>
    <parameter key="normalized_absolute_error" value="false"/>
    <parameter key="root_mean_squared_error" value="false"/>
    <parameter key="root_relative_squared_error" value="false"/>
    <parameter key="squared_error" value="false"/>
    <parameter key="correlation" value="false"/>
    <parameter key="squared_correlation" value="false"/>
    <parameter key="cross-entropy" value="false"/>
    <parameter key="margin" value="false"/>
    <parameter key="soft_margin_loss" value="false"/>
    <parameter key="logistic_loss" value="false"/>
    <parameter key="skip_undefined_labels" value="true"/>
    <parameter key="use_example_weights" value="true"/>
    <list key="class_weights"/>
    </operator>
    <connect from_port="in 1" to_op="Split Data" to_port="example set"/>
    <connect from_op="Split Data" from_port="partition 1" to_op="Decision Tree" to_port="training set"/>
    <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_averagable 1" spacing="0"/>
    <portSpacing port="sink_averagable 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="concurrency:loop" compatibility="9.3.001" expanded="true" height="82" name="Loop" width="90" x="380" y="34">
    <parameter key="number_of_iterations" value="2"/>
    <parameter key="iteration_macro" value="iteration"/>
    <parameter key="reuse_results" value="false"/>
    <parameter key="enable_parallel_execution" value="false"/>
    <process expanded="true">
    <operator activated="true" class="split_data" compatibility="9.3.001" expanded="true" height="103" name="Split Data (2)" width="90" x="45" y="34">
    <enumeration key="partitions">
    <parameter key="ratio" value="0.7"/>
    <parameter key="ratio" value="0.3"/>
    </enumeration>
    <parameter key="sampling_type" value="automatic"/>
    <parameter key="use_local_random_seed" value="true"/>
    <parameter key="local_random_seed" value="1992"/>
    </operator>
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.3.001" expanded="true" height="103" name="Decision Tree (2)" width="90" x="179" y="34">
    <parameter key="criterion" value="gain_ratio"/>
    <parameter key="maximal_depth" value="10"/>
    <parameter key="apply_pruning" value="true"/>
    <parameter key="confidence" value="0.1"/>
    <parameter key="apply_prepruning" value="true"/>
    <parameter key="minimal_gain" value="0.01"/>
    <parameter key="minimal_leaf_size" value="2"/>
    <parameter key="minimal_size_for_split" value="4"/>
    <parameter key="number_of_prepruning_alternatives" value="3"/>
    </operator>
    <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="514" y="187">
    <list key="application_parameters"/>
    <parameter key="create_view" value="false"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="9.3.001" expanded="true" height="82" name="Performance (2)" width="90" x="648" y="136">
    <parameter key="main_criterion" value="first"/>
    <parameter key="accuracy" value="true"/>
    <parameter key="classification_error" value="false"/>
    <parameter key="kappa" value="false"/>
    <parameter key="weighted_mean_recall" value="false"/>
    <parameter key="weighted_mean_precision" value="false"/>
    <parameter key="spearman_rho" value="false"/>
    <parameter key="kendall_tau" value="false"/>
    <parameter key="absolute_error" value="false"/>
    <parameter key="relative_error" value="false"/>
    <parameter key="relative_error_lenient" value="false"/>
    <parameter key="relative_error_strict" value="false"/>
    <parameter key="normalized_absolute_error" value="false"/>
    <parameter key="root_mean_squared_error" value="false"/>
    <parameter key="root_relative_squared_error" value="false"/>
    <parameter key="squared_error" value="false"/>
    <parameter key="correlation" value="false"/>
    <parameter key="squared_correlation" value="false"/>
    <parameter key="cross-entropy" value="false"/>
    <parameter key="margin" value="false"/>
    <parameter key="soft_margin_loss" value="false"/>
    <parameter key="logistic_loss" value="false"/>
    <parameter key="skip_undefined_labels" value="true"/>
    <parameter key="use_example_weights" value="true"/>
    <list key="class_weights"/>
    </operator>
    <connect from_port="input 1" to_op="Split Data (2)" to_port="example set"/>
    <connect from_op="Split Data (2)" from_port="partition 1" to_op="Decision Tree (2)" to_port="training set"/>
    <connect from_op="Split Data (2)" from_port="partition 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Decision Tree (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
    <connect from_op="Performance (2)" from_port="performance" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Loop" to_port="input 1"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Loop and Average" to_port="in 1"/>
    <connect from_op="Loop and Average" from_port="averagable 1" to_port="result 2"/>
    <connect from_op="Loop" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>


    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • fiddinyusfida Member Posts: 12 Contributor II
    @varunm1 Many thanks, this helps me a lot.

    I'm just curious:
    is there any way to make the local random seed increase with each iteration?

    Something like this pseudocode:
    For i in 5:
       Random seed (i)


  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    You can use macros for that, such as %{execution_count}.
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing
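
The seed-per-iteration idea asked about above can be pictured outside RapidMiner with a short Python sketch. It is an illustration only and does not reproduce the Loop or Split Data operators: `split_indices`, `majority_accuracy`, `accuracies_per_seed` and the toy labels are hypothetical names and data introduced here. Each iteration uses its own seed for a 70/30 split, and the accuracies are kept in a list instead of being averaged.

```python
import random

def split_indices(n, train_ratio, seed):
    """Shuffle positions 0..n-1 with the given seed and cut into train/test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(n * train_ratio))
    return indices[:cut], indices[cut:]

def majority_accuracy(labels, train_idx, test_idx):
    """Toy stand-in for a model: always predict the most frequent training label."""
    train_labels = [labels[i] for i in train_idx]
    # sorted() makes the choice deterministic if two labels are equally frequent
    majority = max(sorted(set(train_labels)), key=train_labels.count)
    hits = sum(1 for i in test_idx if labels[i] == majority)
    return hits / len(test_idx)

def accuracies_per_seed(labels, seeds, train_ratio=0.7):
    """Run one split per seed and return the list of accuracies (not the mean)."""
    results = []
    for seed in seeds:
        train_idx, test_idx = split_indices(len(labels), train_ratio, seed)
        results.append(majority_accuracy(labels, train_idx, test_idx))
    return results

# Hypothetical toy data: 100 labels, 60 of class "no" and 40 of class "yes".
labels = ["no"] * 60 + ["yes"] * 40
accuracy_list = accuracies_per_seed(labels, seeds=range(1, 6))
print(accuracy_list)
```

Passing the same seed twice gives the same split and therefore the same accuracy, which is why a single fixed local seed makes every iteration identical.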

  • hughesfleming68 Member Posts: 323 Unicorn
    edited September 2019
    Just a quick comment here: I wouldn't try to increment your random seed this way. If you choose your best accuracy based on changing the random seed, any improvement won't translate to out-of-sample data. There are many ways to trick yourself into thinking your model is better than it really is, and this is one of them.
  • fiddinyusfida Member Posts: 12 Contributor II
    @varunm1 Thanks for the suggestion, I am still learning how to use macros.

    @hughesfleming68 Thanks for the response. What I actually want to do is repeat the process 30 times (based on the Central Limit Theorem), each time with a different random seed.

    After I obtain the 30 accuracies (from random seeds 1 to 30), I want to run a statistical hypothesis test to find out whether my proposed method differs significantly from another one.

    Or do you have any other suggestion?

    I took the Central Limit Theorem figure from this link
    (https://www.investopedia.com/terms/c/central_limit_theorem.asp):
    "Sample sizes equal to or greater than 30 are considered sufficient for the CLT to hold."
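
For the comparison described above, a paired test on the two accuracy lists is one common choice. The sketch below computes the paired t statistic with the Python standard library only. `paired_t_statistic` and the example numbers are hypothetical and are not taken from this thread. Keep in mind that accuracies from repeated splits of the same records are not independent, so a test like this tends to be optimistic.

```python
import math
import statistics

def paired_t_statistic(acc_a, acc_b):
    """Return (t, degrees_of_freedom) for paired samples acc_a and acc_b."""
    if len(acc_a) != len(acc_b):
        raise ValueError("both lists must have the same length")
    if len(acc_a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Hypothetical accuracies of two methods evaluated on the same five seeds.
method_a = [0.80, 0.78, 0.83, 0.79, 0.81]
method_b = [0.76, 0.77, 0.80, 0.75, 0.78]
t, df = paired_t_statistic(method_a, method_b)
print(f"t = {t:.3f}, df = {df}")
```

With 30 paired runs the test has 29 degrees of freedom, for which the two-sided 5% critical value of Student's t is about 2.045. The same two lists could equally be exported and tested in SPSS.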
  • hughesfleming68 Member Posts: 323 Unicorn
    edited September 2019
    Hi @fiddinyusfida. Sometimes that approach is unavoidable. It is something I have to deal with when I use TensorFlow for time series forecasting, as opposed to other frameworks like DL4J or PyTorch, which make it much easier to get repeatable results.

    In your case, the data splitting is the weak link: whether you change the split ratio, sampling type, or random seed, you could still get wildly different results. It is something I would do only as a last resort. It is better to use as much data as you can and then use cross-validation (or sliding window validation in the case of a time series) to get a result you can start to trust. In the end, only testing on out-of-sample data will tell you whether your testing was valid. If your data is very random (sometimes we can't control this part), then even averaging 30 runs might not be helpful. It all depends on how stable your data is.
  • fiddinyusfida Member Posts: 12 Contributor II
    Hi @hughesfleming68,

    I have only 100 records, and it seems hard to add more data since I obtained them from a public dataset repository.

    So, based on your tips, would it be better to use cross-validation (for instance, k = 10) and simply average the accuracy instead of splitting the data by ratio?
  • hughesfleming68 Member Posts: 323 Unicorn
    I would start with five-fold cross-validation and switch between linear and shuffled sampling to see what effect that has on your result. Either way, you are going to be data limited, though how much that matters depends on how regular your data is. I would still choose that over a split. Luck plays a large role when it comes to splitting small data sets. I always seem to get over-optimistic in-sample results and poorer out-of-sample results, so I am very cautious.
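
The difference between linear and shuffled sampling in five-fold cross-validation can be sketched in plain Python. This is an illustration under assumptions, not RapidMiner's Cross Validation operator: `k_fold_indices` and `train_test_pairs` are hypothetical helpers that split record positions into k folds, either in their original order or after a seeded shuffle.

```python
import random

def k_fold_indices(n, k, shuffled=False, seed=None):
    """Split positions 0..n-1 into k folds of near-equal size.

    shuffled=False keeps the original order (consecutive blocks);
    shuffled=True permutes the positions first, using the given seed.
    """
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    indices = list(range(n))
    if shuffled:
        random.Random(seed).shuffle(indices)
    base, extra = divmod(n, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # first folds absorb the remainder
        folds.append(indices[start:start + size])
        start += size
    return folds

def train_test_pairs(folds):
    """Yield (train, test) index lists, using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 100 records and five folds, as in the discussion above.
linear = k_fold_indices(100, 5)
shuffled = k_fold_indices(100, 5, shuffled=True, seed=1992)
print(linear[0][:5], shuffled[0][:5])
```

Each (train, test) pair yields one accuracy, so collecting them gives a per-fold list in addition to the average. If the records are sorted in some way, for example by class, the linear folds inherit that ordering, which is one reason the two sampling types can give different results.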
  • fiddinyusfida Member Posts: 12 Contributor II
    @hughesfleming68

    I really appreciate your advice, thank you.
