Problem with Naive Bayes

domi_wiese Member Posts: 28 Contributor II
edited December 2018 in Help

Hello,

I'm from Germany and studying Financial Management. Right now I have to give a presentation about Naive Bayes in RapidMiner. My problem is that I don't understand how the results "prediction(no)" / "prediction(yes)" are computed.

Here is my XML Process: 

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Samples/data/Golf"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="naive_bayes" compatibility="8.0.001" expanded="true" height="82" name="Naive Bayes" width="90" x="246" y="34">
<parameter key="laplace_correction" value="true"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Golf-Testset" width="90" x="45" y="187">
<parameter key="repository_entry" value="//Samples/data/Golf-Testset"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="313" y="340">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="performance" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="380" y="85">
<parameter key="use_example_weights" value="true"/>
</operator>
</process>

I've used the Golf data set. Take the first row, for example: Outlook = sunny, Temperature = 85, Humidity = 85 and Wind = false.

For Temperature and Humidity I've used the probability density function and obtained the following results: no = 0.0003074677... and yes = 0.000059000924.

What should I do with those results to get the predictions no = 0.711 and yes = 0.289?
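
(For reference, the combination step asked about here is the plain Bayes normalization that gets worked out later in this thread:

prediction(no) = P(no) * P(sunny | no) * P(false | no) * f(85 | Temp, no) * f(85 | Hum, no) / evidence

where evidence is the sum of this numerator and the corresponding numerator for yes, and f is the Gaussian density used for the numeric attributes.)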

 

Thank you in advance!

rapid1.png


Best Answer

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Solution Accepted

    Hi again @domi_wiese,

     

    I think you have essentially reached your goal:

    Don't forget that you performed the calculation without Laplace correction. Without Laplace correction, the results from RapidMiner are:

    NB_probabilities_9.png

    To come back to the calculations:

    5/14 => OK

    2/5 => OK

    3/5 => OK

    0.04125 (Humidity, can you confirm?) => OK

    0.02121 (Temperature, can you confirm?) => NOK (I find 0.1204, so I made an error somewhere. Can you give the details of your calculation for this case? I don't know where my error is in this calculation.)

    Moral of the story: is the solution hiding in the Windows calculator...??

     

    Best regards,

     

    Lionel

     

     

     

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    If you want to understand the calculations behind NB (also using the Golf dataset), check out Ingo's short video here: 

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • domi_wiese Member Posts: 28 Contributor II

    Thank you for this video. I have already watched it, but it only explains the basics, which is not my problem. I'm talking about the next steps: how to combine the continuous numeric values with those for Outlook and Wind to get the predictions (yes and no). In other words: what is the equation that leads to those predictions, for example for the first row?

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

     

    For numeric variables we use a Gaussian assumption: the probability is given by the usual Gaussian pdf with the mean and variance calculated from the training data. For nominal variables we get the probabilities from simple counting.
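
    A minimal Python sketch of both cases (this is not RapidMiner's internal code; the mean/std dev and the count below are illustrative values for the Golf sample and should be checked against the model's Distribution Table):

    import math

    def gaussian_pdf(x, mean, std):
        # density of a normal distribution with the given mean and standard deviation
        return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

    # numeric attribute: Temperature = 85 given Play = no
    # (assumed mean 74.6 and std dev 7.893 for Temperature | no; this gives roughly 0.021)
    p_temp_given_no = gaussian_pdf(85, 74.6, 7.893)

    # nominal attribute: Wind = false given Play = no, by simple counting
    # (2 of the 5 "no" training examples have Wind = false)
    p_false_given_no = 2 / 5

    print(p_temp_given_no, p_false_given_no)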

     

    Best

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • domi_wiese Member Posts: 28 Contributor II

    Thank you very much. 

    Let's stick to the first row, with the prediction for no (71.1%).

    I've used the probability density function for Temperature (85) and Humidity (85) and obtained the following results:

                   Temperature            Humidity

    yes        0.00097307096          0.00319056274

    no         0.0464961233           0.0412564316

     

     

    Next, I computed the following results for Outlook = sunny and Wind = false:

                   Sunny              False

    yes            3/9                4/9

    no             2/5                2/5

     

    In order to get the prediction no (71.1%) and yes (28.9%), I thought it would work like this:

    Multiply all the results for yes by 9/14 and all the results for no by 5/14. Then add those two results to get the basis (evidence). Finally, divide the result for yes by the evidence and the result for no by the evidence to get the predictions.
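
    In code, that procedure looks like this (a small sketch of the plain Bayes normalization, not RapidMiner's internals; the class-conditional values are the ones quoted above, which, as it turns out later in the thread, still contain an error, so the output will not be exactly 0.711 / 0.289):

    prior_yes, prior_no = 9 / 14, 5 / 14

    # class-conditional values for the first row: Outlook, Wind, Temperature, Humidity
    lik_yes = [3 / 9, 4 / 9, 0.00097307096, 0.00319056274]
    lik_no = [2 / 5, 2 / 5, 0.0464961233, 0.0412564316]

    score_yes, score_no = prior_yes, prior_no
    for p in lik_yes:
        score_yes *= p
    for p in lik_no:
        score_no *= p

    evidence = score_yes + score_no              # the "basis" (evidence) mentioned above
    print("prediction(yes):", score_yes / evidence)
    print("prediction(no): ", score_no / evidence)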

    What am I doing wrong? 

    Thank you in advance!

     

     

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi,

     

    This topic interests me a lot.

    Indeed, in my opinion it is essential to understand the theory behind the algorithms.

    I hope you can give me a few minutes of your attention:

    1. Here are the confidence results for the Golf test set (after training on the Golf data set) given by RapidMiner, without Laplace correction:

    NB_probabilities_1.png

    2. I tried to reproduce these results manually, but I get these illogical results for the first row of the Golf data set:

    NB_probabilities_2.png

     You can find the whole Excel calculation file by following this link : 

    https://drive.google.com/open?id=18T153eElmtsjOzihGwLENVh8cwHdaHMT

     

    3. I also used Python, and the results are different from RapidMiner's:

    NB_probabilities_3.png

    You can find the process here:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_excel" compatibility="8.1.000" expanded="true" height="68" name="Training Golf" width="90" x="45" y="85">
    <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Naive_Bayes_Probabilities\NB_Proba_1.xlsx"/>
    <parameter key="imported_cell_range" value="A1:E15"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <list key="data_set_meta_data_information"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.1.000" expanded="true" height="103" name="Multiply" width="90" x="179" y="85"/>
    <operator activated="true" class="set_role" compatibility="8.1.000" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
    <parameter key="attribute_name" value="Play"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="naive_bayes" compatibility="8.1.000" expanded="true" height="82" name="Naive Bayes" width="90" x="447" y="85">
    <parameter key="laplace_correction" value="false"/>
    </operator>
    <operator activated="true" class="read_excel" compatibility="8.1.000" expanded="true" height="68" name="Test Golf" width="90" x="45" y="238">
    <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Naive_Bayes_Probabilities\NB_Proba_1.xlsx"/>
    <parameter key="sheet_number" value="2"/>
    <parameter key="imported_cell_range" value="A1:D15"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Outlook.true.polynominal.attribute"/>
    <parameter key="1" value="Temperature.true.real.attribute"/>
    <parameter key="2" value="Humidity.true.real.attribute"/>
    <parameter key="3" value="Wind.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.1.000" expanded="true" height="103" name="Multiply (2)" width="90" x="179" y="238"/>
    <operator activated="true" class="apply_model" compatibility="8.1.000" expanded="true" height="82" name="Apply Model" width="90" x="581" y="136">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="LabelEncoder" width="90" x="313" y="136">
    <parameter key="script" value="import pandas&#10;from sklearn.preprocessing import LabelEncoder&#10;&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; le = LabelEncoder()&#10; data.iloc[:,0] = le.fit_transform(data.iloc[:,0])&#10; data.iloc[:,3] = le.fit_transform(data.iloc[:,3])&#10; data.iloc[:,4] = le.fit_transform(data.iloc[:,4])&#10;&#10; # connect 2 output ports to see the results&#10; return data"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Naives Bayes Python" width="90" x="447" y="187">
    <parameter key="script" value="&#10;from sklearn.naive_bayes import GaussianNB&#10;from sklearn.preprocessing import LabelEncoder&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10; X= data.iloc[:,0:4]&#10; y=data.iloc[:,4]&#10;&#10; clf = GaussianNB()&#10; clf.fit(X,y)&#10; &#10; # connect 2 output ports to see the results&#10; return clf"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="LabelEncoder (2)" width="90" x="313" y="289">
    <parameter key="script" value="import pandas&#10;from sklearn.preprocessing import LabelEncoder&#10;&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; le = LabelEncoder()&#10; data.iloc[:,0] = le.fit_transform(data.iloc[:,0])&#10; data.iloc[:,3] = le.fit_transform(data.iloc[:,3])&#10;&#10; # connect 2 output ports to see the results&#10; return data"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="103" name="Apply Model Python" width="90" x="581" y="238">
    <parameter key="script" value="&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(model, data):&#10;&#10; base =data[['Outlook', 'Temperature', 'Humidity','Wind']]&#10; data['prediction (Play)'] = model.predict(base)&#10; data['confidence(no)'] = model.predict_proba(base)[:,0]&#10; data['confidence(yes)'] = model.predict_proba(base)[:,1]&#10;&#10; #set role of prediction attribute to prediction&#10; data.rm_metadata['prediction (Play)']=(None,'prediction(Play)')&#10; data.rm_metadata['confidence(no)']=(None,'confidence(no)')&#10; data.rm_metadata['confidence(yes)']=(None,'confidence(yes)')&#10; return data&#10; "/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="238">
    <list key="function_descriptions">
    <parameter key="prediction (Play)" value="if([prediction (Play)]==0,&quot;no&quot;,&quot;yes&quot;)"/>
    </list>
    </operator>
    <connect from_op="Training Golf" from_port="output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_op="LabelEncoder" to_port="input 1"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
    <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_op="Test Golf" from_port="output" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Multiply (2)" from_port="output 2" to_op="LabelEncoder (2)" to_port="input 1"/>
    <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
    <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
    <connect from_op="LabelEncoder" from_port="output 1" to_op="Naives Bayes Python" to_port="input 1"/>
    <connect from_op="Naives Bayes Python" from_port="output 1" to_op="Apply Model Python" to_port="input 1"/>
    <connect from_op="LabelEncoder (2)" from_port="output 1" to_op="Apply Model Python" to_port="input 2"/>
    <connect from_op="Apply Model Python" from_port="output 1" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>

    and the training and test Golf data sets via the link here (Excel file):

    https://drive.google.com/open?id=18Dht5-aTuJVehZvbU3LZLAvzQTBixCLB

     

    4. I think I have understood the methodology for calculating the confidences, and my calculations are almost there.

    Can you help me find my error, if there is one?

    Why are the Python results different from RapidMiner's?

    Is there some postprocessing of the probabilities in RapidMiner?

     

    Thank you for your help.

     

    Best regards, 

     

    Lionel

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    dang, @IngoRM - @lionelderkrikor also uses Excel to check calculations!! I thought I was the only luddite lingering around. Now if I could only get my hands on my old HP 42S RPN calculator....  :smileylol:

     

    Hp42s_face

     

    (sorry @lionelderkrikor - I was just showing Ingo some calcs on Excel today and could not resist.  Believe me, I am sometimes very proud of my luddite skills...)

     

    Scott

     

  • domi_wiese Member Posts: 28 Contributor II

    Hi,

     

    thanks for sharing your point of view Lionel! I really appreciate it.

    In my opinion, you did not use the "correct" equation for the probability density function. I think the video at the following link has it right at 02:20:

    https://www.youtube.com/watch?v=k2diLn5Nqbs&t=125s&list=PL7r4RQYRQRfgw3-ccVUzdlYh5HK-tQHFs&index=3
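
    For reference, the density in question is presumably the standard Gaussian pdf, evaluated with the class-conditional mean mu and standard deviation sigma from the Distribution Table:

    f(x) = exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))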

     

    Using that equation and proceeding as I described in my last post, I get prediction no = 78.756%, which still isn't 71.1%.

     

    Could someone please help me find the solution?

    Thank you.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

     

    I think @lionelderkrikor forgot the priors. So you need to multiply by 4/14 and 10/14, respectively.

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi, 

     

    Thank you for your feedback @domi_wiese, I admit that you are much closer to the expected results...

     

    A few things:

     

    1. A priori, the equation I use and the equation in your video are equivalent:

    NB_probabilities_5.png

    2. In my intermediate results I get exactly the same values that RapidMiner shows in the Distribution Table (mean/std dev of Temperature and Humidity, counts of the nominal attributes), without Laplace correction:

    NB_probabilities_6.png

    That's why I don't understand why I obtain these illogical results.

     

    3. @mschmitz, a priori I have not forgotten the priors in the calculations; it is just not explicit and detailed in the Excel calculation file.

    NB_probabilities_7.png

    Although there is no change in the results, here is the link to the second release of my Excel file:

    https://drive.google.com/open?id=12mELZ_SW8fv-VfeRkY-mUjqEUb42ODx6

     

    4. @domi_wiese, maybe you can share your calculation file and/or your intermediate results - P(Xi | Y = yes/no) and P(Y = yes/no) - so that we can find the solution to this mysterious Naive Bayes problem...

     

    5. Do not give up: I'm sure we will find the solution to this problem, and if we cannot do it with Excel, @sgenzer will lend us his HP 42S RPN calculator... or I will retrieve my old TI-86 calculator from college:

    NB_probabilities_8.png

     

    I hope I have moved the discussion on this topic forward a little bit.

     

    Best regards, 

     

    Lionel 

     

     

     

      

     

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    @lionelderkrikor @sgenzer

     

    Ok boys, I'm dropping my beast on the table too...

     

    2018-02-07 10.03.01.jpg

    And on the 7th day, God created the HP 48GX.

  • domi_wiese Member Posts: 28 Contributor II

    Hi,

     

    @lionelderkrikor

    I'm really sorry. I made a mistake while using the probability density function. 

    But I've corrected it. Now I compute as in the first picture below to get the intermediate result for no, and do the same for yes. After that I get the predictions, which come out to 71.7% for no. This is still around 0.5% too much, but I think it could be correct. What do you think?

    Naive5 (2).png (picture 1)

    Naive 6_LI.jpg (picture 2)
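
    One plausible source of the small remaining gap is the Laplace correction, which was enabled (laplace_correction = true) in the original process but not in the hand calculation. Here is a hedged sketch of the classic "add-one" form of that correction for the nominal counts (RapidMiner's exact implementation may differ):

    def laplace_smoothed(count, class_total, n_values):
        # classic add-one smoothing: (count + 1) / (class_total + number of distinct values)
        # NOTE: this is the textbook form; RapidMiner's laplace_correction may use a different constant
        return (count + 1) / (class_total + n_values)

    # e.g. Outlook = sunny given Play = no: raw 2/5 = 0.4 vs. smoothed (2 + 1) / (5 + 3) = 0.375
    print(2 / 5, laplace_smoothed(2, 5, 3))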

  • domi_wiese Member Posts: 28 Contributor II

    Hi @lionelderkrikor,

     

    thank you for bringing the Laplace correction to my attention. I'll look into it by tomorrow.
    Of course I will send you my calculation.

    Naive 7.png

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi again @domi_wiese,

     

    Thanks to you, I found my error: a problem with a bracket and an exponent in Excel...

    Out of curiosity: what calculator software do you use?

     

    And good luck with your presentation.

     

    Best regards, 

     

    Lionel

  • domi_wiese Member Posts: 28 Contributor II

    Hi @lionelderkrikor,

     

    I'm glad we found and fixed our mistakes, and thank you for wishing me luck.

    To be honest, at first I used my own calculator, but then I switched to an online calculator. I can share the link, of course:

    https://web2.0rechner.de/

     

    Have a nice day!

     

     

  • domi_wiese Member Posts: 28 Contributor II

    Hi @lionelderkrikor,

     

    Just one thing: could you please send me a picture of your design view with the process? And where is the Laplace correction option? I know what it is, but I can't find where it is.

    Thank you in advance!

  • domi_wiese Member Posts: 28 Contributor II

    Hi @lionelderkrikor,

     

    I've already found out how it works. So thanks again, and have a nice day!
