Neural Network for football match prediction

Condwras · September 2015

Hello guys! My name is Jim, and I am new to the forum and new to RapidMiner.

I have a question about how to predict football matches with RapidMiner. So let's start!

I have an Excel file with about 3000 football matches. It has 20 columns in total: 19 columns with numbers (goals, wins, losses, etc.) and 1 column with the final result of the match (1, X, 2), which I use as the label.

I also have a second Excel file with 10 matches whose final result I want to predict. It has the same 19 data columns, and I just want to predict the result of each match (1, X, 2).

I use 4 operators: 2 "Read Excel" operators, 1 "Neural Net" operator, and 1 "Apply Model" operator.

My problem is that when I try to predict these 10 matches as a start, I get a final result of 1 for all of them!

Why is that happening? When I replace the "Neural Net" operator with "k-NN", I get better results.

So my question is: what is better to use for predicting something like this? Is there another operator, or do you have any general advice that would help me with my work?

Thanks a lot in advance, Jim!

MartinLiebig · September 2015

Hi Jim!

Just for clarification: soccer or American football? The first one would be more interesting for me.

A general thing: you cannot say in general which algorithm will work best on your data. You simply need to define a performance measure (accuracy?) and then try different things out. My shortlist usually starts with a Random Forest, but a Neural Net is of course also fine.
Then you might simply try Read Excel followed by X-Validation with a Random Forest and have a look at what happens.
Have you checked out our new tutorials? http://docs.rapidminer.com/studio/getting-started/

Might be worth a look. There is also a beginner's book by Matthew North, which is available for free: https://rapidminer.com/resource/data-mining-masses/

If you have any further questions, feel free to ask.
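
The "define a performance measure, then try things out" idea can be sketched in plain Python instead of RapidMiner. This is only an illustration: `cross_validate`, `majority_learner` and the data are made up, and `majority_learner` is a placeholder for whichever learner you want to compare.

```python
import random
from collections import Counter
from statistics import mean

def majority_learner(train_rows):
    """Hypothetical baseline learner: always predicts the most common training label."""
    most_common = Counter(label for _, label in train_rows).most_common(1)[0][0]
    return lambda features: most_common

def cross_validate(rows, make_model, k=10, seed=0):
    """Return the accuracy of each of the k folds (assumes len(rows) >= k)."""
    rows = rows[:]                       # copy so the caller's list is not shuffled
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    accuracies = []
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = make_model(train)
        hits = sum(model(x) == y for x, y in test)
        accuracies.append(hits / len(test))
    return accuracies

# Made-up data: (features, label) pairs with labels "1", "X", "2".
rows = [((i % 7, i % 3), ("1", "1", "X", "2", "2")[i % 5]) for i in range(100)]
fold_acc = cross_validate(rows, majority_learner, k=10)
print(f"accuracy: {mean(fold_acc):.2%}")
```

Swapping `majority_learner` for another function with the same shape lets you compare learners on the same folds and the same measure.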

Cheers!
Martin

Condwras · September 2015

Thanks for the response, Martin!

I am trying to predict soccer (mostly European leagues such as England, Germany and Spain). I tried the "Random Forest" operator you suggested, but I get the same result again: the predicted final result is "1" for all 10 matches.

Actually, my problem is not which operator to use, but whether I am using the operators the right way. As I mentioned in my first message, I use 4 operators: 2 "Read Excel" operators, 1 "Neural Net" operator, and 1 "Apply Model" operator. I connect them and read my results from the predicted label. When I replace "Neural Net" with "k-NN" or "Random Forest", I get different results. But I really don't understand how to use the "Validation" operator in this project.

Thanks a lot again!

MartinLiebig · September 2015

Hi Jim,

The videos linked above should explain X-Validation. Videos 4 and 5 are the relevant ones.

~Martin

Condwras · September 2015

Thanks again for your help Martin!

I have already studied your links, especially the videos. I am watching them from the beginning (not only videos 4 and 5, as you suggested) and trying to follow the instructions step by step. So far I have a small problem with video 3, but I will try to find out what went wrong. I will be back with the results tonight, so I can tell you whether I managed to solve the problem myself.

Thanks for your time!

Condwras · September 2015

Wow, that's really good stuff, Martin! Great videos, they really helped me. I followed the steps, but I have a problem: when I use the Decision Tree operator and do exactly what video 4 shows, I set one "Filter Examples" operator to missing labels and the other to no missing labels. But for some reason, when the process finishes I still have question marks in my predictions. Any idea what I am doing wrong?

Condwras · September 2015

Really, really, thank you! I tried all night and finally found out what I was doing wrong. I now understand how all the processes work, and I think I can continue my project with more confidence. I will try to change some things so that I get better results for my predictions.

I have just 2 questions, and I would be glad to hear your opinion. Any help brings me a big step closer to my goal.

1) My accuracy on about 3000 football matches is about 38%. Do you think this is a good percentage? And if I understand the +/- next to the accuracy correctly, it should be as low as possible, because that makes the system more stable, right? My +/- is 2.45.

2) This is the most important one for me. When I use the "k-NN" operator, I get different predictions for my football matches (1, X, 2). The 35% I mentioned earlier is with the "k-NN" operator. But when I use a "Decision Tree" operator, ALL my match predictions are 1. I really can't understand what I am doing wrong. The same thing happens when I use the "Neural Net" operator: ALL match predictions are 1. Did I miss something? I really need some guidance here.
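
On the +/- in question 1: a short sketch of how such a figure can be derived, assuming it is the standard deviation of the per-fold accuracies from cross-validation. The fold values below are made up, and whether the tool uses the population or the sample formula is not stated in this thread.

```python
from statistics import mean, pstdev

# Hypothetical per-fold accuracies from a 10-fold cross-validation (made-up numbers).
fold_accuracies = [0.36, 0.40, 0.35, 0.41, 0.37, 0.34, 0.39, 0.38, 0.36, 0.38]

avg = mean(fold_accuracies)       # the headline accuracy
spread = pstdev(fold_accuracies)  # how much the folds disagree with each other

print(f"accuracy: {avg:.2%} +/- {spread:.2%}")
```

A small spread means the folds agree with each other. It says nothing about whether the average itself is good, so both numbers are worth reading together.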

My greetings,Jim!

MartinLiebig · September 2015

For 1)
Two things on that. What are you trying to predict? Home win, away win, draw? If so, you can do a simple check. The dominant class will be home win. I would expect that something like 40% of your games were won by the home team. So the naive approach (the home team always wins) gives you 40% accuracy. Check this number. If you are better than this, then you did something good.
The second thing: I know that ~55% accuracy is possible on the German Bundesliga.

For 2)
Have you checked the decision tree? It may be that it does not find any split which fulfills the pruning options. Then you only have a stump. Try to reduce the pruning (e.g. by changing the minimal gain to 0.001). Then there should be something. My personal tip: try a Random Forest and deactivate pruning and prepruning.
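
A rough sketch of why a minimal-gain threshold can leave only a stump. It assumes plain information gain as the split criterion, which may differ from the criterion configured in the Decision Tree operator; the function names and the ten example rows are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_split_gain(values, labels):
    """Best information gain over all thresholds 'value <= t' for one numeric attribute."""
    base = entropy(labels)
    best = 0.0
    for t in sorted(set(values))[:-1]:   # the largest value cannot separate anything
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        best = max(best, base - weighted)
    return best

def root_decision(values, labels, minimal_gain):
    """Return 'split' if the best gain reaches the threshold, else the majority label."""
    if best_split_gain(values, labels) >= minimal_gain:
        return "split"
    return Counter(labels).most_common(1)[0][0]

# Made-up, weakly informative attribute: "1" is only slightly more common at value 2.
values = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
labels = ["1", "1", "X", "2", "2", "1", "1", "1", "X", "2"]

print(root_decision(values, labels, minimal_gain=0.1))    # threshold too strict: no split
print(root_decision(values, labels, minimal_gain=0.001))  # relaxed threshold: split allowed
```

With weak attributes every candidate split has a small gain, so a strict threshold rejects all of them and the tree answers with the majority class for every match.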

Cheers,
Martin

Condwras · September 2015

I always try to predict the final result of a match. My testing gives me these results:

Accuracy: 37.38% +/- 2.45
My true 1 is 43.85%, my true X is 30.41% and my true 2 is 34.47%. These are all class recall values. I will try to change some things, like deleting the matches where the odds are lower than 1.90. I can try different things to improve my accuracy.

As for question 2, I deactivated pruning and prepruning on the "Random Forest" operator as you said, but the results are not very good: 85% of my predictions are 1. Pretty much the same thing happens when I use a Decision Tree and change the minimal gain to 0.001.

To make it clearer, here are the real final-result counts from my Excel data:

1 = 1028, X = 693, 2 = 768. If my final results were distributed differently, for example 1 = 2000, X = 50, 2 = 100, I would understand why I get 1 (home win) as the prediction almost every time. But with 1 = 1028, X = 693, 2 = 768 I can't see why. I mean, OK, there are a few more 1s, but not so many that I should get this prediction almost every time.
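
The naive "home team always wins" check can be worked out directly from these counts. The sketch below uses only numbers quoted in this post. Note that the counts sum to 2489 matches, so the comparison with the reported cross-validation accuracy is approximate.

```python
# Result counts quoted in the post above.
counts = {"1": 1028, "X": 693, "2": 768}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

# Cross-validation accuracy reported in the same post.
reported_accuracy = 0.3738

print(f"matches: {total}")
print(f"always predicting {majority_label}: {baseline_accuracy:.2%}")
print(f"reported accuracy: {reported_accuracy:.2%}")
```

On these numbers, always predicting 1 gives about 41.3%, which is higher than the reported 37.38%. A learner that finds little signal in the attributes therefore loses nothing on accuracy by predicting 1 almost every time.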

Thanks again, I appreciate your help. I have already learned a lot!

Condwras · September 2015

Hello! I am back!

I finished all the tutorial videos, really helpful. I understand how to predict and how to test the accuracy of my models. Really excited!

But I want to ask one last question about my project.

As I mentioned in my first post, I have Excel data with about 3000 matches. These matches are from 18 different European soccer leagues.

My question is pretty simple. When I use ALL 3000 matches with a "Decision Tree", a "Random Forest" or a "Neural Net" operator, my final prediction is almost always 1 (home win). I rarely get an X (draw) or a 2 (away win).

But now I have tried separating these matches by league.

When I tested, for example, only the matches from the English Premier League (I have 196 matches for that league), I got predictions for all the possible labels (1, X, 2).

Why is this happening? I am trying to understand, but I cannot find an answer.

MartinLiebig · September 2015

Hi,

Hard to say without a look at the actual process.

A general tip: it is usually not about the algorithm itself, but about the attributes used.
Think about it: you describe your teams by values like "BettingOddForDraw" or something similar. The better you describe them, the better the result will be. So adding more attributes, or even ratios etc., might be good.

Best,
Martin

Condwras · September 2015

Hmm, I tried all night to improve my accuracy, and I came to this conclusion: I tried different operators, I removed attributes, I removed football matches from my Excel file, but nothing helped.

So today I decided that it is better to add some new attributes and replace the old ones.

I don't know if this will help, but I made a list of the attributes I use and a list of the new ones, to make it clearer for anyone who reads the topic.

These are all the attributes I have used until now:

*Home team wins, draws, defeats overall------>Example 3-3-2

*Home team wins, draws, defeats ONLY in the home team's stadium------>Example 1-2-1

*Home team form over the last 6 matches------>Example 3-2-1

*Home team goals scored/conceded------>Example 10-4

*Home team goals scored/conceded ONLY in the home team's stadium------>Example 5-1

*I use the same attributes for the AWAY team.

*Finally, I use the odds from a betting company for 1-X-2------>Example 2.00-3.20-2.90

So in total I use 29 columns, and with my label attribute, which is the final result of the match, that makes 30.

Now I am thinking of replacing the goals, which are all integers, with the average goals per match for both the home and the away team. I am also thinking of turning each team's wins, draws and losses into percentages and, finally, deleting the team form attributes.
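
The two conversions described above could look roughly like this in Python. The function names are hypothetical, and the string formats follow the examples in the attribute list ("3-3-2" for a record, "10-4" for goals).

```python
def record_to_rates(record):
    """Convert a 'wins-draws-defeats' string such as '3-3-2' into fractions of games played."""
    wins, draws, defeats = (int(part) for part in record.split("-"))
    played = wins + draws + defeats
    if played == 0:                      # no games yet: avoid dividing by zero
        return 0.0, 0.0, 0.0
    return wins / played, draws / played, defeats / played

def goals_per_match(goals, played):
    """Convert a 'scored-conceded' string such as '10-4' into per-match averages."""
    scored, conceded = (int(part) for part in goals.split("-"))
    if played == 0:
        return 0.0, 0.0
    return scored / played, conceded / played

win_rate, draw_rate, defeat_rate = record_to_rates("3-3-2")
scored_avg, conceded_avg = goals_per_match("10-4", played=8)
print(win_rate, draw_rate, defeat_rate)   # fractions of the 8 games played
print(scored_avg, conceded_avg)           # goals per match
```

Rates and averages make teams comparable even when they have played a different number of games, which raw counts do not.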

MartinLiebig · September 2015

All of that makes perfect sense.

Condwras · September 2015

Hello again! I am in the final stage of my project. I finally use a Neural Net for my predictions.

I deleted from my data all matches where the odds for the home or away win were too low (for example, when the odds for a home win were 1.20). I think this is a way to balance my system somewhat. I also changed some of my attributes, and my new X-Validation results are:

accuracy 39% +/- 3.50

For every 1 the system predicts, my win percentage is 43%.
For every X the system predicts, my win percentage is 32%.
For every 2 the system predicts, my win percentage is 39%.
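
The per-prediction percentages above read like class precision, while the "true 1 / true X / true 2" figures quoted earlier in the thread were class recall. A small sketch of the difference, using a made-up confusion matrix (none of these counts come from the thread):

```python
LABELS = ["1", "X", "2"]

# Hypothetical counts, indexed as confusion[actual][predicted].
confusion = {
    "1": {"1": 50, "X": 25, "2": 25},
    "X": {"1": 30, "X": 20, "2": 20},
    "2": {"1": 30, "X": 20, "2": 30},
}

def recall(label):
    """Of all matches that really ended with `label`, the share predicted correctly."""
    row = confusion[label]
    return row[label] / sum(row.values())

def precision(label):
    """Of all matches predicted as `label`, the share that really ended that way."""
    predicted = sum(confusion[actual][label] for actual in LABELS)
    return confusion[label][label] / predicted

for label in LABELS:
    print(f"{label}: recall {recall(label):.1%}, precision {precision(label):.1%}")
```

If the predictions are used to pick individual matches, precision is the figure that answers "how often is a predicted 1 actually a 1".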

That is somewhat reasonable, because now I only try to predict the matches that are a bit more uncertain.

Thanks a lot for your help, Martin. It was a very nice trip into RapidMiner.

MartinLiebig · September 2015

You are welcome.

Is there any significant difference in how well different leagues can be predicted? Is the Premier League easier than the Bundesliga?

Best,
Martin

Condwras · September 2015

Well, Martin, I tried a lot of different combinations.

First of all, I saw that the best results (in my case) came with the "Neural Net" operator. I tried k-NN, Random Forest and Decision Tree, but I finally chose Neural Net for my predictions.

Secondly, as I mentioned, I have an Excel file with about 2500 football matches. That was a good amount of data, I guess. But in the end I deleted all games where the odds were 1.90 or lower (for a home or away win). That way I avoid predicting matches with a clear favourite. For example, I don't want to predict what will happen in a match like "Bayern Munich" vs. "Kaiserslautern". That would be an obvious "1" prediction.
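
A minimal sketch of that filter in Python. The 1.90 threshold is the one from this post; the field names and the three example rows are made up.

```python
MIN_ODDS = 1.90  # threshold from the post: drop games with odds of 1.90 or lower

# Made-up rows; the field names are hypothetical.
matches = [
    {"home": "A", "away": "B", "odds_1": 1.20, "odds_x": 6.00, "odds_2": 11.00},
    {"home": "C", "away": "D", "odds_1": 2.00, "odds_x": 3.20, "odds_2": 2.90},
    {"home": "E", "away": "F", "odds_1": 3.40, "odds_x": 3.30, "odds_2": 1.85},
]

def is_balanced(match, min_odds=MIN_ODDS):
    """Keep a match only if both the home-win and away-win odds are above min_odds."""
    return match["odds_1"] > min_odds and match["odds_2"] > min_odds

balanced = [m for m in matches if is_balanced(m)]
print([f'{m["home"]}-{m["away"]}' for m in balanced])
```

Removing clear favourites also changes how often each result occurs, so the "always predict 1" baseline should be recomputed on the filtered data before comparing accuracies.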

Now I have a smaller database (about 1600 matches), and my final accuracy is 38% for "1", 35% for "2" and 30% for "X".

So my last step is to understand which attributes are best for my predictions.

How can that be done? I have my data saved in RapidMiner and I connect it to a "Neural Net" operator. I don't want to predict anything at this stage. I train the model, and in the model description I read things that make me feel a little lost.

What are node, bias and threshold, and how are the numbers for each output calculated? Is there a video on that, so I can understand all this useful information?
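
As a rough orientation, here is how a single node is usually computed in a standard feed-forward network. This assumes the common formulation in which "bias" and "threshold" both mean the constant added to the weighted sum before a sigmoid is applied. How RapidMiner labels these values in its model description is not confirmed here, and all numbers are made up.

```python
from math import exp

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def node_output(inputs, weights, bias):
    """One node: weighted sum of its inputs plus a constant, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Hypothetical example: two normalized attributes feeding one hidden node.
inputs = [0.8, -0.3]
weights = [1.5, 0.7]
bias = -0.2

print(node_output(inputs, weights, bias))
```

Each hidden node does this with the attributes as inputs, and each output node does the same with the hidden node outputs as inputs. In this formulation the class whose output node gives the largest value becomes the prediction.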

Thanks! Jim

MartinLiebig · September 2015

Hi Jim,

It is about feature selection. Try the "Weight by ..." operators. This video might also help you: https://www.youtube.com/watch?v=JlhoTAk1ow8

Best,
Martin

Condwras · September 2015

This video is very helpful, Martin, but now I face some other problems. When I try to use this combination of operators, the RapidMiner process freezes and my whole PC stops responding. Any idea why that happens? Also, one of the tabs shows a message about "binary" not working well.

MartinLiebig · October 2015

Hi,

Sorry, that can't be answered out of the box. Do you do something like Nominal to Numerical, Pivot or similar? Something where you generate a lot of attributes?

~Martin

Condwras · October 2015

Well, no, I use exactly the same Excel data file. There are no nominal attributes; all attributes are numerical.

But there are again about 1800 matches, with 20 columns each. So this is a big file, as you can understand.

Can I try something different to test the weight of each attribute?

MartinLiebig · October 2015

Hi,

Which operator did you try? Weight by Gini Index?
And 1800x20 is small data for standard use cases. Common use cases may have 50,000 examples and 200 attributes or more.

~Martin

Condwras · October 2015

I use the example exactly as the video shows. I use brute force for the weights. Does that play a role? Maybe I should try something different?

The combination of operators, in order, is:

1)retrieve----->optimize selection(brute force)----->x-validation----->linear regression----->apply model and performance...

But when I use this combination, RapidMiner freezes. Should I replace Optimize Selection (Brute Force) with something else? I don't know why, but when I use my Neural Net prediction process, RapidMiner works great.

MartinLiebig · October 2015

Hi,

Brute force is really compute-intensive. Try a Forward Selection. Another algorithm I really like for feature selection is called MRMR. It is included in the Feature Selection extension. You need to use MRMR + Select by Weights.
Video for Forward Selection: https://www.youtube.com/watch?v=o-9gyWrQ00w
Another piece of advice: put a fast learner in. My usual recommendations are Naive Bayes and Decision Tree. If you use a Random Forest for classification, a Decision Tree for selection is a good idea.
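
To give a feel for the difference in cost, here is a sketch that counts candidate attribute subsets for brute force and for a greedy forward selection, followed by a generic forward selection loop. It assumes the 19 attribute columns mentioned in this thread. The `score` function and the attribute names are placeholders for a real cross-validated accuracy, and this is not the implementation used by any RapidMiner operator.

```python
def brute_force_subsets(n_attributes):
    """Number of non-empty attribute subsets an exhaustive search has to evaluate."""
    return 2 ** n_attributes - 1

def forward_selection_max_evaluations(n_attributes):
    """Upper bound for greedy forward selection: n + (n - 1) + ... + 1 evaluations."""
    return n_attributes * (n_attributes + 1) // 2

def forward_selection(attributes, score):
    """Greedily add the attribute that improves the score most; stop when nothing helps."""
    selected, best_score = [], float("-inf")
    remaining = list(attributes)
    while remaining:
        candidate, candidate_score = max(
            ((a, score(selected + [a])) for a in remaining),
            key=lambda pair: pair[1],
        )
        if candidate_score <= best_score:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score

# Hypothetical stand-in for "cross-validated accuracy of a learner on this subset".
usefulness = {"odds_1": 0.05, "home_form": 0.02, "noise": -0.01}

def score(subset):
    return 0.35 + sum(usefulness[a] for a in subset)

print(brute_force_subsets(19))                # subsets for 19 attributes
print(forward_selection_max_evaluations(19))  # worst case for forward selection
print(forward_selection(list(usefulness), score))
```

Every evaluation is a full cross-validation of the inner learner, which is why both the search strategy and the speed of the learner matter.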

~Martin

Edit: It just came to my mind why it might freeze. The reason might be Linear Regression. Just replace it with a Decision Tree. I suspect that your design matrix for Linear Regression is nearly singular.

Condwras · October 2015

Actually, Martin, I am confused at this point.

Let me explain again from the beginning, to see if I missed something.

My first step is to store my data. All good at that point. My data has 19 columns, all numerical attributes. The 20th column is the one I want to predict, and it contains either 1, X or 2. So when I choose the attribute types, I use "attribute" and "label" (I do that because I want to predict with "Neural Net", so I can't use polynominal).

Secondly, I use this combination of operators: ------>retrieve----->multiply----->filter example 1----->neural net,filter example 2----->apply model. And I get my predictions. This is exactly the same as in the video tutorial you suggested. If I replace the "Neural Net" operator with a "Decision Tree" at that point, for example, I get very bad results. So I chose "Neural Net".

All good so far, I guess. Then I use a second process to see whether my system works well. I followed your videos and use these operators:

retrieve----->x-validation......(you know the rest). And I get my results: 40% for 1, 35% for 2, 30% for X.

Now I just want to see which of the 20 columns of my Excel data play a more important role in my Neural Net predictions.

And that's the point where I get stuck. If I use Linear Regression, it says that it cannot handle polynominal attributes. But why? When I select the attribute types for my Excel data, I choose "attribute" and "label". Does this happen because the Excel cells contain 1, X or 2?

What is the right combination at this point?

retrieve----->?----->x-validation------>?----->apply model----->performance

For the question marks, I use Optimize Selection for the first and Linear Regression for the second.

MartinLiebig · October 2015

Hi,

Could you maybe post an example process for this?

~Martin
