Neural Network for football match prediction

Condwras Member Posts: 15 Contributor II
Hello guys! My name is Jim and I am new to the forum, and new to RapidMiner!

I have a question about how to predict a football match with the help of RapidMiner. So let's start!

I have an Excel file with about 3000 football matches. It has 20 columns of data: 19 columns with numbers like goals, wins, losses etc., and 1 column with the final result of the match (1, X, 2), which I use as the label.

I also have a second Excel file with 10 matches whose final result (1, X, 2) I want to predict. It has the same 19 attribute columns.

I use 4 operators: 2 "Read Excel" operators, 1 "Neural Net" operator, and 1 "Apply Model" operator.

My problem is that when I try to predict these 10 matches as a start, I get final result 1 for all of them!

Why is that happening? When I replace the "Neural Net" operator with "k-NN", I get better results.

So my question is: what is better to use to predict something like this? Can you suggest another operator, or generally some advice that will help me with my work?

Thanks a lot in advance, Jim!

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi Jim!

    Just for clarification: soccer or American football? The first one would be more interesting for me :)

    A general thing: you cannot say in advance which algorithm will work best on your data. You simply need to define a performance measure (accuracy?) and then try different things out. My shortlist usually starts with a Random Forest, but a Neural Net is of course also fine.
    Then you might simply try Read Excel followed by X-Validation with a Random Forest and have a look at what happens.
    Have you checked out our new tutorials? http://docs.rapidminer.com/studio/getting-started/

    Might be worth a look. There is also a beginner's book by Matthew North which is available for free: https://rapidminer.com/resource/data-mining-masses/

    If you have any further questions, feel free to ask :)

    Cheers!
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
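
A minimal sketch of what cross-validation does, written in plain Python rather than as a RapidMiner process. The toy labels, the `majority_learner` placeholder and the function names are assumptions made for illustration; in practice the learner would be a Random Forest, Neural Net or similar.

```python
import random
import statistics
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(labels, train_and_predict, k=10, seed=0):
    """Return (mean accuracy, standard deviation) over k held-out folds.

    train_and_predict(train_idx, test_idx) must return one prediction
    per test index; it is expected not to look at the test labels.
    """
    accuracies = []
    folds = k_fold_indices(len(labels), k, seed)
    for i, test_idx in enumerate(folds):
        # Everything outside the current fold is training data.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        predictions = train_and_predict(train_idx, test_idx)
        hits = sum(p == labels[j] for p, j in zip(predictions, test_idx))
        accuracies.append(hits / len(test_idx))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Toy labels (made up): 1 = home win, X = draw, 2 = away win.
labels = ["1"] * 40 + ["X"] * 30 + ["2"] * 30


def majority_learner(train_idx, test_idx):
    """Placeholder 'model': always predict the most frequent training label."""
    most_common = Counter(labels[j] for j in train_idx).most_common(1)[0][0]
    return [most_common] * len(test_idx)


mean_acc, spread = cross_validate(labels, majority_learner, k=10)
print(f"accuracy: {mean_acc:.2%} +/- {spread:.2%}")
```

The point of the design is that every match is predicted exactly once by a model that was trained without it. The second number printed is the standard deviation of the per-fold accuracies, which is presumably what an "accuracy +/- value" report of a cross-validation stands for.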
  • Condwras Member Posts: 15 Contributor II
    Thanks for the response, Martin!

    I will try to predict soccer (mostly European leagues like England, Germany and Spain). I tried the "Random Forest" operator that you suggested, but I get the same results for my matches again: the predicted final result is "1" for all 10 matches.

    Actually my problem is not which operator to use, but whether I use them the right way. As I mentioned in my first message, I use 4 operators: 2 "Read Excel" operators, 1 "Neural Net" operator, and 1 "Apply Model" operator. I connect them and get my results for the label. When I replace "Neural Net" with "k-NN" or "Random Forest" I get different results. But I really don't understand how to use the "Validation" operator in this project!

    Thanks a lot again!
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi Jim,

    the tutorial videos linked above should explain X-Validation. Videos 4 and 5 would be the ones.

    ~Martin
  • Condwras Member Posts: 15 Contributor II
    Thanks again for your help, Martin!

    I have already studied your links, especially the videos. I am watching them from the beginning (not only videos 4 and 5 that you mentioned), and I try to follow the instructions step by step. So far I have a small problem with video 3, but I will try to find out what happened. I will be back with the results tonight, so I can tell you whether I could solve the problem myself!

    Thanks for your time!
  • Condwras Member Posts: 15 Contributor II
    Wow, that's really good stuff, Martin! Great videos, they really helped me! I followed the steps but I have a problem: I use the Decision Tree operator and do exactly what video 4 shows, with "missing labels" on one "Filter Examples" operator and "no missing labels" on the other. But for some reason, when the process finishes I still have question marks in my predictions. Any idea what I am doing wrong?
  • Condwras Member Posts: 15 Contributor II
    Really, really thank you :) ! I tried all night and finally found out what I was doing wrong! I understand how all the processes work, and I think I can now continue my project with more confidence! I will try to change some things so I can get better results for my predictions!

    I have just 2 questions, and I will be glad to hear your opinion! Any help brings me a huge step closer to my goal!

    1) My accuracy on about 3000 football matches is about 38%. Do you think this is a good percentage? And if I understand the +/- next to the accuracy correctly, it should be as low as possible, because that makes the system more stable, right? My +/- is 2.45!

    2) This is the most important one for me! When I use the "k-NN" operator, I get different predictions for my football matches (1, X, 2). The accuracy I mentioned above is with the "k-NN" operator. But when I use a "Decision Tree" operator, ALL my match predictions are 1. I really can't understand what I am doing wrong. The same thing happens when I use the "Neural Net" operator: ALL match predictions are 1. Did I miss something? I really need some guidance here!

    My greetings, Jim!
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    For 1)
    Two things on that. What are you trying to predict? Home win, away win, draw? If so, you can do a simple check. The dominant class will be home win. I would expect that something like 40% of your games were won by the home team. So the naive approach (the home team wins all the time) gives you 40% accuracy. Check this number. If you are better than this, then you did something good.
    The second thing: I know that ~55% accuracy is possible on the German Bundesliga.

    For 2)
    Have you checked the decision tree? It can be that it does not find any split which fulfills the pruning options. Then you only have a stump. Try to reduce the pruning (e.g. by changing the minimal gain to 0.001). Then there should be something. My personal tip: try a Random Forest and deactivate pruning and prepruning.

    Cheers,
    Martin
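
A rough sketch of why a tree can end up as a stump that always predicts the majority class. It assumes plain information gain as the split criterion and uses made-up class counts; RapidMiner offers several criteria, so the numbers are only illustrative. A candidate split is accepted only if its gain reaches the minimal gain threshold.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical counts of [home win, draw, away win] before and after one split.
parent = [400, 300, 300]
children = [[230, 140, 130], [170, 160, 170]]

gain = information_gain(parent, children)
for minimal_gain in (0.1, 0.001):  # illustrative thresholds
    verdict = "split accepted" if gain >= minimal_gain else "no split, tree stays a stump"
    print(f"gain={gain:.4f}  minimal_gain={minimal_gain}: {verdict}")
```

With three roughly balanced and noisy classes, even the best split often changes the class mix only slightly, so its gain is small. Lowering the threshold lets such weak splits through.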
  • Condwras Member Posts: 15 Contributor II
    I always try to predict the final result of a match! My testing gives me these results:

    Accuracy: 37.38% +/- 2.45
    My true 1 is 43.85%, my true X is 30.41% and my true 2 is 34.47%. These are all class recall values. I will try to change some things, like deleting the matches where the odds are lower than 1.90. I can try different things to improve my accuracy!

    As for question 2, I deactivated pruning and prepruning on the "Random Forest" operator as you said, but the results aren't very good: 85% of my predictions for the final result are 1. Pretty much the same thing happens when I use a Decision Tree and change the minimal gain to 0.001!

    To make it clearer, here are the real final results in my Excel data:

    1 = 1028, X = 693, 2 = 768. If my final results were different, for example 1 = 2000, X = 50, 2 = 100, I would understand why I get 1 (home win) as the prediction almost every time. But with 1 = 1028, X = 693, 2 = 768 I can't get it. I mean, OK, there are a few more 1s, but not so many that I should get this prediction almost every time!

    Thanks again, I appreciate your help, I have already learned a lot!
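
The naive "home team always wins" baseline suggested earlier in the thread can be checked directly from the class counts quoted in this post. A short sketch in plain Python (the variable names are illustrative):

```python
# Class counts quoted in the post above: 1 = home win, X = draw, 2 = away win.
counts = {"1": 1028, "X": 693, "2": 768}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"matches: {total}")
print(f"always predicting '{majority_class}' scores {baseline_accuracy:.2%}")
```

On these counts the baseline comes out at roughly 41%, so the cross-validated 37.38% reported above is still below what always predicting a home win would achieve.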
  • Condwras Member Posts: 15 Contributor II
    Hello! I am back!

    I finished all the tutorial videos, really helpful! I understand how to predict and how to test the accuracy of my models! Really excited!

    But I want to ask one last question about my project.

    As I mentioned in my first post, I have Excel data with about 3000 matches. These matches are from 18 different European soccer leagues!

    My question is pretty simple! When I use ALL 3000 matches with either a "Decision Tree", a "Random Forest" or a "Neural Net" operator, my final prediction is almost always 1 (home win). Rarely do I get an X (draw) or a 2 (away win).

    But now I tried to separate all these matches by league!

    When I tested, for example, only the matches from the English Premier League (I have 196 matches for that league), I got predictions for all the possible labels (1, X, 2).

    Why is this happening? I am trying to understand but I cannot find an answer!



  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,

    hard to say without a look at the actual process.

    A general tip: it is usually not about the algorithm itself, but about the attributes used.
    Think about it: you describe your teams by values like "BettingOddForDraw" or something. The better you describe them, the better the result will be. So adding attributes, or even ratios etc., might be good.

    Best,
    Martin
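
One possible reading of the "ratios" advice, sketched in plain Python: turn the three betting odds into implied probabilities and a home/away ratio. The function name, the decimal-odds format and the simple rescaling are assumptions for illustration, not something the thread prescribes.

```python
def implied_probabilities(odds_home, odds_draw, odds_away):
    """Convert decimal odds to probabilities that sum to 1.

    1/odds gives the raw implied probability; the raw values add up to
    more than 1 because of the bookmaker's margin, so they are rescaled.
    """
    raw = [1.0 / odds_home, 1.0 / odds_draw, 1.0 / odds_away]
    overround = sum(raw)
    return [p / overround for p in raw]


# Example decimal odds for home win, draw and away win.
p_home, p_draw, p_away = implied_probabilities(2.00, 3.20, 2.90)
home_vs_away = p_home / p_away  # one possible ratio attribute
print(round(p_home, 3), round(p_draw, 3), round(p_away, 3), round(home_vs_away, 3))
```

Derived attributes like these put matches with different absolute odds on a common scale, which may be easier for a learner to use than the raw odds.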
  • Condwras Member Posts: 15 Contributor II
    Hmm, all night I tried to improve my accuracy. I tried different operators, I removed attributes, I removed football matches from my Excel file, but nothing helped.

    So today I decided that it is better to add some new attributes and replace the old ones!

    I don't know if this will help, but I made a list of the attributes I use and a list of the new ones, to make it clearer for anyone who reads the topic!

    These are all the attributes I have used until now:

    * Home team wins, draws, defeats in total. Example: 3-3-2
    * Home team wins, draws, defeats ONLY in the home stadium. Example: 1-2-1
    * Home team form over the last 6 matches. Example: 3-2-1
    * Home team goals scored/conceded. Example: 10-4
    * Home team goals scored/conceded ONLY in the home stadium. Example: 5-1
    * The same attributes for the AWAY team.
    * Finally, the odds from a betting company for 1-X-2. Example: 2.00-3.20-2.90

    So in total I use 29 columns, and with my label attribute (the final result of the match) that makes 30!

    Now I am thinking of replacing the goal totals, which are all integers, with the average goals per match for both the home and the away team. I am also thinking of turning a team's wins, draws and losses into percentages, and finally deleting the team form attributes!
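
The planned conversion from raw totals to per-match rates could look like the following sketch. The function and attribute names are hypothetical; the input values are the example record 3-3-2 with goals 10-4 from the list above.

```python
def rate_features(wins, draws, losses, goals_for, goals_against):
    """Turn raw season totals into per-match rates.

    Returns None if the team has not played yet, so the caller can decide
    how to treat the missing values.
    """
    played = wins + draws + losses
    if played == 0:
        return None
    return {
        "win_pct": wins / played,
        "draw_pct": draws / played,
        "loss_pct": losses / played,
        "goals_for_avg": goals_for / played,
        "goals_against_avg": goals_against / played,
    }


# Example record 3-3-2 with goals 10-4.
features = rate_features(3, 3, 2, 10, 4)
print(features)
```

Rates make teams comparable even when they have played a different number of matches, which raw totals do not.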
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    All of that makes perfect sense :)
  • Condwras Member Posts: 15 Contributor II
    Hello again! I am in the final stage of my project! In the end I use Neural Net for my prediction!

    I deleted from my data all matches where the odds for the home or away win were too low (for example, home win odds of 1.20). I think this is a way to balance my system somewhat. I also changed some of my attributes, and my new X-Validation results are:

    accuracy 39% +/- 3.50

    For every 1 the system predicts, the hit rate is 43%.
    For every X the system predicts, the hit rate is 32%.
    For every 2 the system predicts, the hit rate is 39%.

    That's somewhat reasonable, because now I only try to predict the matches that are a bit more open!

    Thanks a lot for your help, Martin, it was a very nice trip into RapidMiner :)
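
The percentages in this post are given per predicted class (class precision), while an earlier post quoted class recall. A small sketch of how the two differ, using made-up results for ten matches; the function name is illustrative.

```python
from collections import Counter


def per_class_metrics(actual, predicted, classes=("1", "X", "2")):
    """Return {class: (precision, recall)} from two equal-length label lists."""
    pairs = Counter(zip(actual, predicted))
    metrics = {}
    for c in classes:
        true_pos = pairs[(c, c)]
        n_predicted = sum(n for (a, p), n in pairs.items() if p == c)
        n_actual = sum(n for (a, p), n in pairs.items() if a == c)
        # Precision: of all matches predicted as c, how many really were c.
        precision = true_pos / n_predicted if n_predicted else 0.0
        # Recall: of all matches that really were c, how many were predicted as c.
        recall = true_pos / n_actual if n_actual else 0.0
        metrics[c] = (precision, recall)
    return metrics


# Made-up results for ten matches, only to show the calculation.
actual    = ["1", "1", "1", "1", "X", "X", "X", "2", "2", "2"]
predicted = ["1", "1", "1", "2", "1", "X", "2", "1", "2", "2"]

for label, (precision, recall) in per_class_metrics(actual, predicted).items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

A model that predicts "1" for almost every match gets a high recall for "1" but close to zero recall for "X" and "2", so looking at both numbers per class says more than accuracy alone.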
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    You are welcome.

    Is there any significant difference in how well different leagues can be predicted? Is the Premier League easier than the Bundesliga?

    Best,
    Martin
  • Condwras Member Posts: 15 Contributor II
    Well Martin, I tried a lot of different combinations.

    First of all, I saw that the best results (in my case) come with the "Neural Net" operator. I tried k-NN, Random Forest and Decision Tree, but I finally chose Neural Net for my predictions.

    Secondly, as I mentioned, I have an Excel file with about 2500 football matches. That was a good amount of data, I guess. But in the end I deleted all games where the odds for the home or away win were 1.90 or lower. That way I avoid predicting matches with a clear underdog. For example, I don't want to predict what will happen in a match like "Bayern Munich" vs. "Kaiserslautern". That would be an obvious "1" prediction.

    Now I have a smaller database (about 1600 matches), and my final accuracy is 38% for "1", 35% for "2" and 30% for "X".

    So my last step is to understand which attributes are best for my predictions.

    How can that be done? I mean, I have my data saved in RapidMiner and I connect it to a "Neural Net" operator. I don't want to predict anything at this stage. I train the model, and in its description I read some things that make me feel a little lost.

    What are node, bias and threshold, and how are the numbers for each output calculated? Is there a video on that, so I can understand all this useful information?

    Thanks! Jim
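
For orientation on the node and bias question, here is a sketch of how a single node in a standard feed-forward network computes its output. It assumes a sigmoid activation and that the "bias" or "threshold" shown in a model description is the constant added to the weighted sum; the values are hypothetical and this is not taken from RapidMiner's documentation.

```python
from math import exp


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def node_output(inputs, weights, bias):
    """One node: weighted sum of its inputs plus a bias, passed through a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


# Hypothetical values: two scaled input attributes, two learned weights, one bias.
inputs = [0.5, -1.0]
weights = [0.8, 0.3]
bias = 0.1

print(node_output(inputs, weights, bias))
```

Each hidden node does this with the input attributes, and each output node does the same with the hidden nodes' outputs. The weights in a multi-layer network are hard to read as attribute importance, which is one reason to use a separate feature selection or weighting step instead.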




  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi Jim,

    this is about feature selection. Try the "Weight by ..." operators. This video might also help you: https://www.youtube.com/watch?v=JlhoTAk1ow8

    Best,
    Martin
  • Condwras Member Posts: 15 Contributor II
    This video is very helpful, Martin, but now I face some other problems. When I try to use this combination of operators, the RapidMiner process freezes and my whole PC stops responding. Any idea why that happens? Also, a message on the tab says something about "binary" that cannot work well.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,

    sorry, that can't be answered out of the box. Do you do something like Nominal to Numerical, Pivot or similar? Something where you generate a lot of attributes?


    ~Martin
  • Condwras Member Posts: 15 Contributor II
    Well no, I use exactly the same Excel data file. No nominal attributes. All attributes are numerical!

    But there are again about 1800 matches, with 20 columns each! So this is a big file, as you can understand!

    Can I try something different for testing the weight of each attribute?
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,

    which operator did you try? Weight by Gini Index?
    And 1800x20 is small data for standard use cases. Common use cases may have 50,000 examples and 200 attributes or more.

    ~Martin
  • Condwras Member Posts: 15 Contributor II
    I use the example exactly as the video shows! I use brute force for the weights. Does that play a role? Maybe I should try something different?

    The combination of operators, in order, is:

    Retrieve -> Optimize Selection (Brute Force) -> X-Validation -> Linear Regression -> Apply Model and Performance

    But again, when I use this combination, RapidMiner freezes. Should I replace Optimize Selection (Brute Force) with something else? I don't know why, but when I use my Neural Net prediction process, RapidMiner works great.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,

    brute force is really compute-intensive. Try a Forward Selection. Another algorithm I really like for feature selection is called MRMR. It is included in the Feature Selection extension. You need to use MRMR + Select by Weights.
    Video for Forward Selection: https://www.youtube.com/watch?v=o-9gyWrQ00w
    Another piece of advice: put a fast learner in. My usual recommendations are Naive Bayes and Decision Tree. If you use a Random Forest for classification, a Decision Tree for selection is a good idea.

    ~Martin

    Edit: It just came to my mind why it might freeze. The reason might be Linear Regression. Just replace it with a Decision Tree. I would suspect that your design matrix for linear regression is nearly singular.
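
To see why brute force is so much more expensive than forward selection, it helps to count how many attribute subsets each one evaluates. The sketch below is plain Python with a toy scoring function (an assumption for illustration); in a real process the score would be something like the cross-validated accuracy of a fast learner.

```python
def forward_selection(attributes, score):
    """Greedy forward selection.

    Starts with no attributes and repeatedly adds the single attribute that
    improves score(subset) the most; stops when nothing improves it.
    Returns (selected attributes, number of subsets evaluated).
    """
    selected, best, evaluations = [], float("-inf"), 0
    remaining = list(attributes)
    while remaining:
        scored = []
        for a in remaining:
            scored.append((score(selected + [a]), a))
            evaluations += 1
        top_score, top_attr = max(scored)
        if top_score <= best:
            break
        best = top_score
        selected.append(top_attr)
        remaining.remove(top_attr)
    return selected, evaluations


n = 19  # number of attributes mentioned in the thread
brute_force_subsets = 2 ** n - 1       # every non-empty subset
forward_worst_case = n * (n + 1) // 2  # 19 + 18 + ... + 1

# Toy score (an assumption): only "a" and "c" help, every other attribute costs a bit.
useful = {"a": 0.05, "c": 0.03}
toy_score = lambda subset: sum(useful.get(x, -0.01) for x in subset)

print(brute_force_subsets, forward_worst_case)
print(forward_selection(["a", "b", "c", "d"], toy_score))
```

With 19 attributes, brute force has to score 524,287 subsets, each with a full cross-validation, while forward selection needs at most 190. The price is that the greedy search can miss combinations of attributes that only help together.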
  • Condwras Member Posts: 15 Contributor II
    Actually Martin, I got confused at this point.

    Let me explain again from the beginning, to see if I missed something!

    My first step is to store my data. All good at that point! My data has 19 columns, all numerical attributes. The 20th column is the one I want to predict, and it contains either 1, X or 2. So when I choose the attribute types, I use "attribute" and "label" (I do that because I want to predict with "Neural Net", so I can't use polynominal)!

    Secondly, I use this combination of operators: Retrieve -> Multiply -> Filter Examples 1 -> Neural Net, and Filter Examples 2 -> Apply Model. And I get my predictions! This is exactly the same as in the video tutorial you suggested. If I replace the "Neural Net" operator with a "Decision Tree", for example, I get very bad results. So I chose "Neural Net".

    All good until now, I guess. Then I check whether my system works well. I followed your videos and use these operators:

    Retrieve -> X-Validation (you know the rest). And I get my results: 40% for 1, 35% for 2, 30% for X.

    Now I just want to see which of these 20 columns of my Excel data play a more important role in my prediction with the Neural Net.

    And that's the point where I get stuck. If I use Linear Regression, it says that it cannot handle polynominal attributes. But why? When I select the attribute types for my Excel data, I choose "attribute" and "label". Does this happen because the Excel cells contain 1, X or 2?

    What is the right combination at this point?

    Retrieve -> ? -> X-Validation -> ? -> Apply Model -> Performance

    For the first question mark I use Optimize Selection, and for the second Linear Regression.


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,

    Could you maybe post an example process for this?

    ~Martin