Naive Bayes - Execute Python vs RM : different AUC

lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
edited November 2019 in Help

Hi,

 

I continue my experiments on RM/Execute Python with the NB model.

Sorry, but I feel obliged to appeal to you once again, @mschmitz, this time with numerical examples for both models, RM and Execute Python.

Indeed, I obtain strictly the same scoring results with both models (accuracy, weighted mean recall, weighted mean precision, recall (positive class no/yes), precision (positive class no/yes))... except for the AUC:

AUC(RM) = 0.942

AUC(Python) = 0.883

 

I suppose that the AUC is calculated from the ROC curve.

But how exactly is it calculated? And how can this difference be explained?
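For reference, my current understanding (to be confirmed) is that scikit-learn builds the ROC curve by sweeping the decision threshold over the confidence scores and then integrates the resulting (FPR, TPR) points with the trapezoidal rule, something like this sketch on toy data:

import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy labels and confidence scores, only to illustrate the computation
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.70])

# roc_curve returns one (FPR, TPR) point per threshold; auc() then
# integrates these points with the trapezoidal rule
# (equivalent to np.trapz(tpr, fpr))
fpr, tpr, thresholds = roc_curve(y_true, scores, pos_label=1)
print(auc(fpr, tpr))

But is RapidMiner doing the same thing?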

Here is the process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Deals" width="90" x="45" y="136">
<parameter key="repository_entry" value="//Samples/data/Deals"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical (3)" width="90" x="179" y="85">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Future Customer"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply" width="90" x="380" y="85"/>
<operator activated="true" class="naive_bayes" compatibility="8.0.001" expanded="true" height="82" name="Naive Bayes" width="90" x="514" y="85"/>
<operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Deals (2)" width="90" x="45" y="340">
<parameter key="repository_entry" value="//Samples/data/Deals"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="179" y="340">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Future Customer"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="380" y="340">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Future Customer"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="coding_type" value="unique integers"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="166" name="Build / Apply model" width="90" x="514" y="289">
<parameter key="script" value="import pandas as pd&#10;import numpy as np&#10;from sklearn.naive_bayes import GaussianNB&#10;from sklearn.calibration import CalibratedClassifierCV&#10;from sklearn.metrics import confusion_matrix&#10;from sklearn.metrics import accuracy_score&#10;from sklearn.metrics import recall_score&#10;from sklearn.metrics import precision_score&#10;from sklearn.metrics import roc_auc_score&#10;from sklearn import metrics&#10;&#10;&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10; # Build the model&#10; X = data.iloc[:,1:]&#10; y = data.iloc[:,0]&#10; NB = GaussianNB()&#10; NB.fit(X,y)&#10;&#10; NB_Calib = CalibratedClassifierCV(base_estimator = NB,method = 'sigmoid') &#10;&#10; NB_Calib.fit(X,y)&#10;&#10; #Calculate probability of each class.&#10;&#10; pr = NB.class_prior_ &#10; &#10; #Calculate mean of each feature per class&#10; th= NB.theta_&#10;&#10; #Apply the model&#10; y_pred = NB.predict(X)&#10; y_prob = NB_Calib.predict_proba(X) &#10; &#10; &#10; # Calculate the scoring&#10; &#10; #confusion matrix&#10; conf_matrix = confusion_matrix(y,y_pred)&#10; &#10; #accuracy&#10; acc_score = 100*accuracy_score(y,y_pred) &#10; &#10; #weighted recall &#10; reca_score = 100*recall_score(y,y_pred,average = 'weighted')&#10; &#10; #weighted precision&#10; precisionscore = 100*precision_score(y,y_pred,average='weighted') &#10;&#10; #recall (positive class : yes / positive class : no ) &#10; reca_no = 100*recall_score(y,y_pred,average =None)&#10;&#10; #precision (positive class : yes / positive class : no ) &#10; precision_no = 100*precision_score(y,y_pred,average=None) &#10; &#10; #AUC (positive class : no) &#10; AUCscore = roc_auc_score(y,y_pred,average=None) &#10;&#10; #AUC (positive class : no) méthode n°2&#10; fpr, tpr, thresholds = metrics.roc_curve(y, y_pred, pos_label=1)&#10; AUC_2 = metrics.auc(fpr, tpr)&#10; &#10; &#10; #Write the y_pred and scores in dataframe&#10; &#10; y_prediction = pd.DataFrame(data = y_pred,columns = ['prediction(Future Customer)'])&#10; y_probability = pd.DataFrame(data = y_prob,columns = ['confidence(yes)','confidence(no)'])&#10; data = data.join(y_prediction)&#10; data = data.join(y_probability)&#10;&#10; &#10; accu_score = pd.DataFrame(data = [acc_score],columns = ['accuracy'])&#10; recall_weighted = pd.DataFrame(data = [reca_score],columns = ['weighted_mean_recall']) &#10; precision_weighted = pd.DataFrame(data = [precisionscore],columns = ['weighted_mean_precision']) &#10; recall_no = pd.DataFrame(data = [reca_no],columns = ['recall (positive class : yes)','recall (positive class : no)'])&#10; precision_no = pd.DataFrame(data = [precision_no],columns = ['precision (positive class : yes)','precision (positive class : no)'])&#10; AUC = pd.DataFrame(data = [AUCscore],columns = ['AUC'])&#10; AUC2 = pd.DataFrame(data = [AUC_2],columns = ['AUC_method2'])&#10; score = accu_score.join(recall_weighted)&#10; score = score.join(precision_weighted)&#10; score = score.join(recall_no)&#10; score = score.join(precision_no)&#10; score = score.join(AUC)&#10; score = score.join(AUC2)&#10; &#10; theta = pd.DataFrame(data = th,columns = ['Gender = Male','Gender = Female','PM = Credit card','PM = cheque','PM = cash','Age'])&#10; proba = pd.DataFrame(data = pr, columns = ['probability'])&#10; &#10; confus_matrix = pd.DataFrame(data = conf_matrix,columns = ['true yes','true no']) &#9;&#10; &#10; #data.rm_metadata['prediction(Future 
Customer)']=(None,'prediction(Future Customer)')&#10;&#10; &#10; # connect 4 output ports to see the results&#10; return score,theta, confus_matrix,proba,data"/>
</operator>
<operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="648" y="85">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply (2)" width="90" x="782" y="85"/>
<operator activated="true" class="performance" compatibility="8.0.001" expanded="true" height="82" name="Performance (2)" width="90" x="916" y="136"/>
<operator activated="true" class="performance_classification" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="916" y="34">
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="weighted_mean_precision" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_op="Retrieve Deals" from_port="output" to_op="Nominal to Numerical (3)" to_port="example set input"/>
<connect from_op="Nominal to Numerical (3)" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Multiply" from_port="output 2" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Naive Bayes" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Retrieve Deals (2)" from_port="output" to_op="Nominal to Numerical (2)" to_port="example set input"/>
<connect from_op="Nominal to Numerical (2)" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Build / Apply model" to_port="input 1"/>
<connect from_op="Build / Apply model" from_port="output 1" to_port="result 1"/>
<connect from_op="Build / Apply model" from_port="output 2" to_port="result 2"/>
<connect from_op="Build / Apply model" from_port="output 3" to_port="result 3"/>
<connect from_op="Build / Apply model" from_port="output 4" to_port="result 4"/>
<connect from_op="Build / Apply model" from_port="output 5" to_port="result 7"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Multiply (2)" to_port="input"/>
<connect from_op="Multiply (2)" from_port="output 1" to_op="Performance" to_port="labelled data"/>
<connect from_op="Multiply (2)" from_port="output 2" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Performance (2)" from_port="performance" to_port="result 8"/>
<connect from_op="Performance" from_port="performance" to_port="result 5"/>
<connect from_op="Performance" from_port="example set" to_port="result 6"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
<portSpacing port="sink_result 7" spacing="0"/>
<portSpacing port="sink_result 8" spacing="0"/>
<portSpacing port="sink_result 9" spacing="0"/>
</process>
</operator>
</process>

Thank you,

 

Best regards,

 

Lionel


Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    It likely has to do with the way that ties are handled, because there are multiple options for that when calculating ROC/AUC and not all software uses the same method. You'll either have to dive into the details of the ROC/AUC calculations in Python vs RapidMiner (via the Java code on GitHub), or maybe one of the developers will chime in because they already know the answer :-)
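    To make that concrete, here is a small sketch (my own illustration, not RapidMiner's actual Java implementation) of three common tie-handling conventions; with the same tied scores they already give very different areas:

    import numpy as np

    def auc_variants(y_true, scores):
        # Walk the ROC curve over groups of tied scores and accumulate the
        # area with three different conventions for crossing a tie block.
        y_true = np.asarray(y_true)
        scores = np.asarray(scores)
        P = float((y_true == 1).sum())  # number of positives
        N = float((y_true == 0).sum())  # number of negatives
        auc_opt = auc_trap = auc_pess = 0.0
        tpr = 0.0
        for s in np.unique(scores)[::-1]:        # thresholds from high to low
            grp = y_true[scores == s]            # all examples tied at this score
            dtpr = (grp == 1).sum() / P          # vertical move on the ROC
            dfpr = (grp == 0).sum() / N          # horizontal move on the ROC
            auc_opt  += dfpr * (tpr + dtpr)      # optimistic: up first, then right
            auc_trap += dfpr * (tpr + dtpr / 2)  # trapezoid: diagonal through ties
            auc_pess += dfpr * tpr               # pessimistic: right first, then up
            tpr += dtpr
        return auc_opt, auc_trap, auc_pess

    # Two positives and two negatives all tied at the same score:
    # optimistic = 1.0, trapezoid = 0.5, pessimistic = 0.0
    print(auc_variants([1, 1, 0, 0], [0.7, 0.7, 0.7, 0.7]))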

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @mschmitz

     

    Here are two elements:

     

    1. Probability calibration :

    Recently, during my Python/RM comparison experiments, I also became interested in

    NB_Calib = CalibratedClassifierCV(base_estimator = NB,method = 'sigmoid') 

    Indeed, at first the confidences calculated by the model (an SVM) in Python were aberrant (for the predicted class, the confidence was below 0.5 in a binary problem!). After some investigation, I discovered this Python class, which seems to improve the relevance of a classifier's confidences. So I built an SVM model (strictly the same in both Python and RM) and used this class to calculate the new confidences in Python: they were different from RM's.

    To go further : 

    http://scikit-learn.org/stable/modules/calibration.html
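    To illustrate what I mean, here is a minimal, self-contained sketch on hypothetical toy data (not the process below): the calibrated model keeps roughly the same predictions, but rescales the probabilities, so they no longer match the raw Naive Bayes confidences that RM reports:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB
    from sklearn.calibration import CalibratedClassifierCV

    # Hypothetical toy data, only to show how calibration changes the confidences
    X, y = make_classification(n_samples=200, n_features=4, random_state=0)

    raw_nb = GaussianNB().fit(X, y)
    calibrated_nb = CalibratedClassifierCV(GaussianNB(), method='sigmoid', cv=3).fit(X, y)

    # Same data, same base learner: the hard predictions barely change, but the
    # predicted probabilities are rescaled by the sigmoid (Platt) calibration
    print(raw_nb.predict_proba(X[:5]).round(3))
    print(calibrated_nb.predict_proba(X[:5]).round(3))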

     

    To confirm this with the NB model, I also applied the class above with Execute Python in the following process: the confidences from "Execute Python" are indeed different from the confidences of RM (the training example set Chapter09DataSet_Training.csv is in the attached file).

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Chapter09DataSet_Training" width="90" x="45" y="85">
    <parameter key="repository_entry" value="//Rapidminer_Tests/data/Chapter09DataSet_Training"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="85">
    <parameter key="attribute_name" value="2nd_Heart_Attack"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply" width="90" x="313" y="187"/>
    <operator activated="true" class="naive_bayes" compatibility="8.0.001" expanded="true" height="82" name="Naive Bayes" width="90" x="447" y="85"/>
    <operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="581" y="136">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply (2)" width="90" x="648" y="34"/>
    <operator activated="true" class="performance" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="782" y="136"/>
    <operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Chapter09DataSet_Training (2)" width="90" x="45" y="442">
    <parameter key="repository_entry" value="//Rapidminer_Tests/data/Chapter09DataSet_Training"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role (2)" width="90" x="179" y="442">
    <parameter key="attribute_name" value="2nd_Heart_Attack"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply (3)" width="90" x="313" y="544"/>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="103" name="Naive Bayes Python" width="90" x="447" y="442">
    <parameter key="script" value="import pandas as pd&#10;from sklearn.naive_bayes import GaussianNB&#10;from sklearn.calibration import CalibratedClassifierCV&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10; y = data['2nd_Heart_Attack']&#10; X = data.drop('2nd_Heart_Attack',axis = 1)&#10;&#10; #List of attributes&#10; features = list(X)&#10;&#10; #Build the model&#10; model_NB = GaussianNB()&#10; model_NB.fit(X,y)&#10;&#10; #Create de calibrated model&#10; model_NB_calib =CalibratedClassifierCV(model_NB,method = 'sigmoid')&#10; model_NB_calib.fit(X,y)&#10;&#10; #Calculation of distribution table (mean) &#10; th = model_NB.theta_&#10; th_1 = th[0,:]&#10; th_2 = th[1,:]&#10;&#10; #Calculation of distribution table (stv) &#10; std = model_NB.sigma_&#10; std_1 = (std[0,:])**0.5&#10; std_2 = (std[1,:])**0.5&#10;&#10;&#10; #Write the results&#10; theta_2 = pd.DataFrame(data = th_2,columns = ['Yes (main)'])&#10; theta_1 = pd.DataFrame(data = th_1,columns = ['No (main)']) &#10; sigma_2 = pd.DataFrame(data = std_2,columns = ['Yes (std)'])&#10; sigma_1 = pd.DataFrame(data = std_1,columns = ['No (std)']) &#10; &#10; theta = pd.DataFrame(data = features,columns = ['Attribute'])&#10; theta = theta.join(theta_2)&#10; theta = theta.join(sigma_2)&#10; theta = theta.join(theta_1) &#10; theta = theta.join(sigma_1)&#10; &#10;&#10; # connect 1 output port to see the results&#10; return model_NB_calib,theta"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="103" name="Apply model Python (2)" width="90" x="581" y="544">
    <parameter key="script" value="import pandas as pd&#10;from sklearn.metrics import accuracy_score&#10;from sklearn.metrics import roc_auc_score&#10;from sklearn.preprocessing import LabelEncoder&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(model,data):&#10;&#10; y = data['2nd_Heart_Attack']&#10; X = data.drop('2nd_Heart_Attack',axis = 1)&#10;&#10; #Prediction : Applying of the model&#10; y_pred = model.predict(X)&#10; y_prob = model.predict_proba(X)&#10;&#10; #Transform (Yes/No) ==&gt; (0/1) mandatory for python&#10; le = LabelEncoder()&#10;&#10; y_bin = le.fit_transform(y)&#10; y_pred_bin = le.fit_transform(y_pred)&#10;&#9;&#10; #Calculation of accuracy&#10; acc = 100*accuracy_score(y,y_pred)&#10; #Calculation of AUC&#10; auc_ = roc_auc_score(y_bin,y_pred_bin,average = 'weighted')&#10;&#10; #Write the results&#10; data['prediction(2nd_Heart_Attack)'] = y_pred&#10; data['confidence(Yes)'] = y_prob[:,1]&#10; data['confidence(No)'] = y_prob[:,0]&#10;&#10; performance = pd.DataFrame(data = [acc],columns = ['accuracy'])&#10; AUC = pd.DataFrame(data = [auc_],columns = ['AUC'])&#10; performance = performance.join(AUC)&#10; &#10; data.rm_metadata['prediction(2nd_Heart_Attack)']=(None,'prediction(2nd_Heart_Attack)')&#10; data.rm_metadata['confidence(Yes)'] = (None,'confidence(Yes)')&#10; data.rm_metadata['confidence(No)'] = (None,'confidence(No)')&#10; &#10; &#10; # connect 2 output ports to see the results&#10; return data, performance"/>
    </operator>
    <connect from_op="Retrieve Chapter09DataSet_Training" from_port="output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Naive Bayes" to_port="training set"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Apply Model" from_port="model" to_port="result 5"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Multiply (2)" from_port="output 2" to_port="result 2"/>
    <connect from_op="Performance" from_port="performance" to_port="result 1"/>
    <connect from_op="Retrieve Chapter09DataSet_Training (2)" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
    <connect from_op="Set Role (2)" from_port="example set output" to_op="Multiply (3)" to_port="input"/>
    <connect from_op="Multiply (3)" from_port="output 1" to_op="Naive Bayes Python" to_port="input 1"/>
    <connect from_op="Multiply (3)" from_port="output 2" to_op="Apply model Python (2)" to_port="input 2"/>
    <connect from_op="Naive Bayes Python" from_port="output 1" to_op="Apply model Python (2)" to_port="input 1"/>
    <connect from_op="Naive Bayes Python" from_port="output 2" to_port="result 6"/>
    <connect from_op="Apply model Python (2)" from_port="output 1" to_port="result 3"/>
    <connect from_op="Apply model Python (2)" from_port="output 2" to_port="result 4"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    <portSpacing port="sink_result 6" spacing="0"/>
    <portSpacing port="sink_result 7" spacing="0"/>
    </process>
    </operator>
    </process>

    2. The ROC curve :

     

    In parallel, I built the ROC curve with Python, and it's weird:

    Python uses only one point to create the ROC. Here is a screenshot of this ROC:

    [Screenshot: NB_ROC_Python.png, the NB ROC curve from Python]

    While RM uses many more points:

    [Screenshot: NB_ROC_RM.png, the NB ROC curve from RapidMiner]

    The number of points taken into account is not the same in the two cases: RM is more precise than Python, so the two curves do not have the same "shape", and therefore the Area Under the Curve is different. For me, there is a "bug", or at least a simplification/lack of precision, in Python.
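    For reference, here is a minimal, self-contained sketch (toy data, not the process above) of what I observe in scikit-learn: fed with the hard 0/1 predictions, roc_curve can only place one intermediate threshold, whereas fed with the confidence scores it places one threshold per distinct value, which gives a curve with many more points, closer to what RM plots:

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    # Toy labels and confidences, only to illustrate the difference
    y_true = np.array([0, 0, 0, 1, 1, 1])
    y_prob = np.array([0.2, 0.4, 0.6, 0.5, 0.7, 0.9])   # P(class = 1)
    y_pred = (y_prob >= 0.5).astype(int)                 # hard 0/1 predictions

    # With the hard predictions, only one intermediate threshold exists,
    # so the "curve" is a single corner point
    fpr_hard, tpr_hard, thr_hard = roc_curve(y_true, y_pred)
    print(len(thr_hard), auc(fpr_hard, tpr_hard))

    # With the probabilities, there is one threshold per distinct confidence,
    # so the curve has many points
    fpr_prob, tpr_prob, thr_prob = roc_curve(y_true, y_prob)
    print(len(thr_prob), auc(fpr_prob, tpr_prob))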

     

    Best regards,

     

    Lionel

     

     

     
