Why is the performance with Backward Elimination different from the performance without it?

aphongme Member Posts: 3 Contributor I
edited November 2018 in Help

Hi,

I used the Backward Elimination operator to optimize the AUC of my logistic regression by eliminating some attributes. However, when I stop using the Backward Elimination operator and eliminate the same attributes myself using the Select Attributes operator (based on the Backward Elimination operator's results), the resulting AUC/performance is not the same (it is lower). The same happens with other optimization operators (Optimize Parameters (Grid), Forward Selection).

How do these optimization operators work, and how are they different from doing it manually (without an optimization operator)?

My data has 2030 instances with 33 features and 1 binary dependent variable.

 

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<operator activated="true" class="retrieve" compatibility="8.1.003" expanded="true" height="68" name="Retrieve Data Screen without EV to EBITDA and EV to EBIT" width="90" x="45" y="85">
<parameter key="repository_entry" value="//NewLocalRepository/Data/3 Year (No Outlier)"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<operator activated="true" class="set_role" compatibility="8.1.003" expanded="true" height="82" name="Set Role" width="90" x="179" y="187">
<parameter key="attribute_name" value="Outperform/Underperform"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<operator activated="true" class="select_attributes" compatibility="8.1.003" expanded="true" height="82" name="Select Attributes (3)" width="90" x="313" y="187">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value="ASSET TURNOVER_YEAR 1|ASSET TURNOVER_YEAR 2|ASSET TURNOVER_YEAR 3|DIV YIELD_YEAR 1|DIV YIELD_YEAR 2|DIV YIELD_YEAR 3|INCOME GROWTH_YEAR 1|INCOME GROWTH_YEAR 2|INCOME GROWTH_YEAR 3|NET DEBT TO EQUITY_YEAR 1|NET DEBT TO EQUITY_YEAR 2|NET DEBT TO EQUITY_YEAR 3|Outperform/Underperform|PB_YEAR 1|PB_YEAR 2|PB_YEAR 3|PE_YEAR 1|PE_YEAR 2|PE_YEAR 3|PROFIT MARGIN_YEAR 1|PROFIT MARGIN_YEAR 2|PROFIT MARGIN_YEAR 3|REVENUE GROWTH_YEAR 1|REVENUE GROWTH_YEAR 2|REVENUE GROWTH_YEAR 3|ROA_YEAR 1|ROA_YEAR 2|ROA_YEAR 3|ROE_YEAR 1|ROE_YEAR 2|ROE_YEAR 3|ROIC_YEAR 1|ROIC_YEAR 2|ROIC_YEAR 3"/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<operator activated="true" class="normalize" compatibility="8.1.003" expanded="true" height="103" name="Normalize" width="90" x="447" y="187">
<parameter key="return_preprocessing_model" value="false"/>
<parameter key="create_view" value="false"/>
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="numeric"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="real"/>
<parameter key="block_type" value="value_series"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_series_end"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="method" value="Z-transformation"/>
<parameter key="min" value="0.0"/>
<parameter key="max" value="0.5"/>
<parameter key="allow_negative_values" value="false"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<operator activated="true" class="principal_component_analysis" compatibility="8.1.003" expanded="true" height="103" name="PCA" width="90" x="581" y="187">
<parameter key="dimensionality_reduction" value="keep variance"/>
<parameter key="variance_threshold" value="1.0"/>
<parameter key="number_of_components" value="1"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
<operator activated="true" class="optimize_selection_backward" compatibility="8.1.003" expanded="true" height="103" name="Backward Elimination" width="90" x="715" y="187">
<parameter key="maximal_number_of_eliminations" value="10"/>
<parameter key="speculative_rounds" value="50"/>
<parameter key="stopping_behavior" value="with decrease"/>
<parameter key="use_relative_decrease" value="true"/>
<parameter key="alpha" value="0.05"/>
<process expanded="true">
<operator activated="true" class="split_data" compatibility="8.1.003" expanded="true" height="103" name="Split Data" width="90" x="112" y="85">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
<parameter key="sampling_type" value="automatic"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<operator activated="true" class="h2o:logistic_regression" compatibility="7.6.001" expanded="true" height="124" name="Logistic Regression (3)" width="90" x="313" y="34">
<parameter key="solver" value="AUTO"/>
<parameter key="reproducible" value="false"/>
<parameter key="maximum_number_of_threads" value="4"/>
<parameter key="use_regularization" value="false"/>
<parameter key="lambda_search" value="false"/>
<parameter key="number_of_lambdas" value="0"/>
<parameter key="lambda_min_ratio" value="0.0"/>
<parameter key="early_stopping" value="true"/>
<parameter key="stopping_rounds" value="3"/>
<parameter key="stopping_tolerance" value="0.001"/>
<parameter key="standardize" value="false"/>
<parameter key="non-negative_coefficients" value="false"/>
<parameter key="add_intercept" value="true"/>
<parameter key="compute_p-values" value="true"/>
<parameter key="remove_collinear_columns" value="true"/>
<parameter key="missing_values_handling" value="MeanImputation"/>
<parameter key="max_iterations" value="0"/>
<parameter key="max_runtime_seconds" value="0"/>
</operator>
<operator activated="true" class="apply_model" compatibility="8.1.003" expanded="true" height="82" name="Apply Model (3)" width="90" x="447" y="187">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="performance_binominal_classification" compatibility="8.1.003" expanded="true" height="82" name="Performance (3)" width="90" x="581" y="187">
<parameter key="main_criterion" value="AUC"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="false"/>
<parameter key="kappa" value="false"/>
<parameter key="AUC (optimistic)" value="true"/>
<parameter key="AUC" value="true"/>
<parameter key="AUC (pessimistic)" value="true"/>
<parameter key="precision" value="false"/>
<parameter key="recall" value="false"/>
<parameter key="lift" value="false"/>
<parameter key="fallout" value="false"/>
<parameter key="f_measure" value="false"/>
<parameter key="false_positive" value="false"/>
<parameter key="false_negative" value="false"/>
<parameter key="true_positive" value="false"/>
<parameter key="true_negative" value="false"/>
<parameter key="sensitivity" value="false"/>
<parameter key="specificity" value="false"/>
<parameter key="youden" value="false"/>
<parameter key="positive_predictive_value" value="false"/>
<parameter key="negative_predictive_value" value="false"/>
<parameter key="psep" value="false"/>
<parameter key="skip_undefined_labels" value="true"/>
<parameter key="use_example_weights" value="true"/>
</operator>
<connect from_port="example set" to_op="Split Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 1" to_op="Logistic Regression (3)" to_port="training set"/>
<connect from_op="Split Data" from_port="partition 2" to_op="Apply Model (3)" to_port="unlabelled data"/>
<connect from_op="Logistic Regression (3)" from_port="model" to_op="Apply Model (3)" to_port="model"/>
<connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
<connect from_op="Performance (3)" from_port="performance" to_port="performance"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
</process>
</operator>
</process>

Help please

 

 

Answers

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist

    Interesting. Thanks for sharing your data set. 

     

    But your XML code is somehow broken. I copied and pasted it, but it did not work.

     

    Could you open the XML view in RapidMiner Studio, copy all of the code from there, and share it again?

     

    Thanks.

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @aphongme,

     

    Interesting topic, indeed.

    Without seeing your process, here is one possible explanation:

     - Did you enable the use local random seed parameter of the Cross Validation operator (for example, with the default seed value) in both cases (manually / with Optimize Parameters)?

    If not, the dataset can be split differently in the two cases during cross validation, which affects the results and therefore the performance.
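
    The same idea can be sketched for the Split Data operator from the original post, where use_local_random_seed is set to false. This is only a minimal illustration, assuming the manual process (with Select Attributes) uses a Split Data operator configured identically. The only change is use_local_random_seed; the seed value 1992 is the one already present in the posted XML.

    ```xml
    <operator activated="true" class="split_data" compatibility="8.1.003" expanded="true" height="103" name="Split Data" width="90" x="112" y="85">
    <enumeration key="partitions">
    <parameter key="ratio" value="0.7"/>
    <parameter key="ratio" value="0.3"/>
    </enumeration>
    <parameter key="sampling_type" value="automatic"/>
    <!-- Changed from "false" in the original post: a fixed local seed should make the 70/30 partition repeatable across runs. -->
    <parameter key="use_local_random_seed" value="true"/>
    <parameter key="local_random_seed" value="1992"/>
    </operator>
    ```

    With the same seed in both processes, the training and test partitions should match, so any remaining difference in AUC would have to come from something other than the split.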

     

    Here is a process using your data and the k-NN model:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="8.1.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="238">
    <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Compare_ROC_AUC\3 Years Data.csv"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="PE_YEAR 1.true.real.attribute"/>
    <parameter key="1" value="PE_YEAR 2.true.real.attribute"/>
    <parameter key="2" value="PE_YEAR 3.true.real.attribute"/>
    <parameter key="3" value="PB_YEAR 1.true.real.attribute"/>
    <parameter key="4" value="PB_YEAR 2.true.real.attribute"/>
    <parameter key="5" value="PB_YEAR 3.true.real.attribute"/>
    <parameter key="6" value="ROE_YEAR 1.true.real.attribute"/>
    <parameter key="7" value="ROE_YEAR 2.true.real.attribute"/>
    <parameter key="8" value="ROE_YEAR 3.true.real.attribute"/>
    <parameter key="9" value="ROIC_YEAR 1.true.real.attribute"/>
    <parameter key="10" value="ROIC_YEAR 2.true.real.attribute"/>
    <parameter key="11" value="ROIC_YEAR 3.true.real.attribute"/>
    <parameter key="12" value="ROA_YEAR 1.true.real.attribute"/>
    <parameter key="13" value="ROA_YEAR 2.true.real.attribute"/>
    <parameter key="14" value="ROA_YEAR 3.true.real.attribute"/>
    <parameter key="15" value="ASSET TURNOVER_YEAR 1.true.real.attribute"/>
    <parameter key="16" value="ASSET TURNOVER_YEAR 2.true.real.attribute"/>
    <parameter key="17" value="ASSET TURNOVER_YEAR 3.true.real.attribute"/>
    <parameter key="18" value="REVENUE GROWTH_YEAR 1.true.real.attribute"/>
    <parameter key="19" value="REVENUE GROWTH_YEAR 2.true.real.attribute"/>
    <parameter key="20" value="REVENUE GROWTH_YEAR 3.true.real.attribute"/>
    <parameter key="21" value="INCOME GROWTH_YEAR 1.true.real.attribute"/>
    <parameter key="22" value="INCOME GROWTH_YEAR 2.true.real.attribute"/>
    <parameter key="23" value="INCOME GROWTH_YEAR 3.true.real.attribute"/>
    <parameter key="24" value="NET DEBT TO EQUITY_YEAR 1.true.real.attribute"/>
    <parameter key="25" value="NET DEBT TO EQUITY_YEAR 2.true.real.attribute"/>
    <parameter key="26" value="NET DEBT TO EQUITY_YEAR 3.true.real.attribute"/>
    <parameter key="27" value="PROFIT MARGIN_YEAR 1.true.real.attribute"/>
    <parameter key="28" value="PROFIT MARGIN_YEAR 2.true.real.attribute"/>
    <parameter key="29" value="PROFIT MARGIN_YEAR 3.true.real.attribute"/>
    <parameter key="30" value="DIV YIELD_YEAR 1.true.real.attribute"/>
    <parameter key="31" value="DIV YIELD_YEAR 2.true.real.attribute"/>
    <parameter key="32" value="DIV YIELD_YEAR 3.true.real.attribute"/>
    <parameter key="33" value="Outperform/Underperform.true.polynominal.attribute"/>
    <parameter key="34" value="Price Movement.true.polynominal.attribute"/>
    <parameter key="35" value="outlier.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.1.003" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="238">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Price Movement|outlier"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="8.1.003" expanded="true" height="82" name="Set Role" width="90" x="380" y="238">
    <parameter key="attribute_name" value="Outperform/Underperform"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.1.003" expanded="true" height="103" name="Multiply" width="90" x="514" y="289"/>
    <operator activated="true" class="concurrency:cross_validation" compatibility="8.1.003" expanded="true" height="145" name="Cross Validation (2)" width="90" x="648" y="391">
    <parameter key="use_local_random_seed" value="true"/>
    <process expanded="true">
    <operator activated="true" class="k_nn" compatibility="8.1.003" expanded="true" height="82" name="k-NN (2)" width="90" x="179" y="34">
    <parameter key="k" value="7"/>
    </operator>
    <connect from_port="training set" to_op="k-NN (2)" to_port="training set"/>
    <connect from_op="k-NN (2)" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="8.1.003" expanded="true" height="82" name="Apply Model (2)" width="90" x="45" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_binominal_classification" compatibility="8.1.003" expanded="true" height="82" name="Performance (2)" width="90" x="179" y="34">
    <parameter key="main_criterion" value="AUC"/>
    <parameter key="AUC" value="true"/>
    </operator>
    <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
    <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="8.1.003" expanded="true" height="124" name="Optimize Parameters (Grid)" width="90" x="648" y="136">
    <list key="parameters">
    <parameter key="k-NN (3).k" value="[1.0;10;10;linear]"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="concurrency:cross_validation" compatibility="8.1.003" expanded="true" height="145" name="Cross Validation (3)" width="90" x="380" y="34">
    <parameter key="use_local_random_seed" value="true"/>
    <process expanded="true">
    <operator activated="true" class="k_nn" compatibility="8.1.003" expanded="true" height="82" name="k-NN (3)" width="90" x="179" y="34">
    <parameter key="k" value="9"/>
    </operator>
    <connect from_port="training set" to_op="k-NN (3)" to_port="training set"/>
    <connect from_op="k-NN (3)" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="8.1.003" expanded="true" height="82" name="Apply Model (3)" width="90" x="45" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_binominal_classification" compatibility="8.1.003" expanded="true" height="82" name="Performance (3)" width="90" x="179" y="34">
    <parameter key="main_criterion" value="AUC"/>
    <parameter key="AUC" value="true"/>
    </operator>
    <connect from_port="model" to_op="Apply Model (3)" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model (3)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
    <connect from_op="Performance (3)" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="input 1" to_op="Cross Validation (3)" to_port="example set"/>
    <connect from_op="Cross Validation (3)" from_port="model" to_port="model"/>
    <connect from_op="Cross Validation (3)" from_port="performance 1" to_port="performance"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Cross Validation (2)" to_port="example set"/>
    <connect from_op="Cross Validation (2)" from_port="performance 1" to_port="result 4"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="model" to_port="result 2"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="parameter set" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    </process>
    </operator>
    </process>

    Regards,

     

    Lionel

  • hughesfleming68 Member Posts: 323 Unicorn

    Perhaps off topic, but maybe not. When I was transitioning to version 8, I ran into a situation where a process using backward elimination produced different results from the same process running on version 7.1 with the same dataset. This was repeatable with no variation. I created a loop, ran perhaps 300 tests, and checked the out-of-sample predictions externally. I had consistently better results with version 7.1 than with version 8 on live data outside of RapidMiner. I was never able to figure out why or where the difference was coming from. I will try to post an example when I can. Of course, many things could have changed between versions, and it might not have anything to do with backward elimination at all.

     

    Alex

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    The explanation about the random seed makes sense to me.

     

    You can try to use Optimize Selection (Evolutionary) for better results.

     

    Also, to me, AUC is a bad criterion for feature selection performance. It evaluates a model at several thresholds, when in reality you end up using only one. If no natural performance measure comes up, I would stick with accuracy.

     

    Kind regards,

    Sebastian
