I'm currently working on my master's thesis about the application of data mining in soccer. I'm trying to predict soccer matches based on stats of the two teams involved.
My use case is the German Bundesliga: I want to predict the last season (15/16) based on the three seasons before it (12/13 to 14/15). The target to predict is the match result (home win, draw, away win).
The stats I am using are, for example, the market values of the teams (transfermarkt.de), the position in the table right before the match, the position at the end of the previous season, and some data to capture the 'form' of the teams, such as the percentage of wins, goals scored, goals conceded, goal difference etc. in the last X matches. All in all I have about 20 attributes.
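To make the 'form' attributes concrete, here is a minimal Python sketch of how they could be computed (I built mine in RapidMiner, so this is only an illustration). The function name `add_form_features`, the dictionary keys and the default window of 5 are assumptions for the example, not part of my data set.

```python
from collections import defaultdict, deque


def add_form_features(matches, window=5):
    """Attach form stats over each team's previous `window` matches.

    `matches` must be sorted by date. Each item is a dict with the
    hypothetical keys: home, away, home_goals, away_goals.
    """
    # team -> its most recent (goals scored, goals conceded) pairs
    history = defaultdict(lambda: deque(maxlen=window))
    rows = []
    for m in matches:
        row = dict(m)
        for side in ("home", "away"):
            past = history[m[side]]
            n = len(past)
            scored = sum(s for s, _ in past)
            conceded = sum(c for _, c in past)
            wins = sum(1 for s, c in past if s > c)
            # No previous matches: the win share is undefined
            row[side + "_win_share"] = wins / n if n else None
            row[side + "_goals_scored"] = scored
            row[side + "_goals_conceded"] = conceded
            row[side + "_goal_diff"] = scored - conceded
        rows.append(row)
        # Update the history only after the features were computed,
        # so a match never sees its own result.
        history[m["home"]].append((m["home_goals"], m["away_goals"]))
        history[m["away"]].append((m["away_goals"], m["home_goals"]))
    return rows
```

The important detail is the order: the features for a match are computed from earlier matches only, and the result of the match is added to the history afterwards.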
I have already applied several algorithms to the data set, e.g. decision trees, neural nets, Bayesian methods and so on, but I don't know how to optimize my results.
My first approach was to build the best possible model (using Optimize Selection and Optimize Parameters) for the seasons 2012-2014 and apply it to 2015. In this case the model has a performance of about 70% on the training data (probably some overfitting), but only about 49% on the test data.
My second approach was to change the optimization process to get an optimal model for 2015 (of course the performance was better here), but this is not realistic: in reality you don't know the results of the season you want to predict, so this is not a valid way.
The only reasonable way is to create a model with the training data of 2012-2014 and then apply it to 2015, so that the model is not directly optimized for 2015.
I thought about splitting up the training data, e.g. creating a model with 2012+2013 and applying it to 2014 (or using X-Validation); here I am able to do the optimization for 2014, since we know the results. This gives me a performance of about 48% to 52%, but it depends on my optimization parameters and the selection of my attributes. For example, sometimes the algorithm finds a model that has a performance of about 60% on the training data and 48% on the test data; sometimes the result is a model with 55% on the training data and 52% on the test data.
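The split by season described above could look roughly like this in Python. This is only a sketch of the idea, not my RapidMiner process: the two candidate "models" (`fit_majority`, `fit_market_value`) and the keys `season`, `result`, `home_value` and `away_value` are hypothetical stand-ins for real learners and attributes.

```python
from collections import Counter


def fit_majority(train):
    """Hypothetical baseline: always predict the most frequent training result."""
    label = Counter(r["result"] for r in train).most_common(1)[0][0]
    return lambda row: label


def fit_market_value(train):
    """Hypothetical rule that needs no fitting: the more valuable team wins."""
    return lambda row: "H" if row["home_value"] >= row["away_value"] else "A"


def accuracy(model, rows):
    """Share of rows whose 'result' label the model predicts correctly."""
    return sum(model(r) == r["result"] for r in rows) / len(rows)


def select_by_validation(candidates, rows, train_seasons, valid_season):
    """Fit every candidate on the earlier seasons and rank by the later one."""
    train = [r for r in rows if r["season"] in train_seasons]
    valid = [r for r in rows if r["season"] == valid_season]
    report = []
    for name, fit in candidates.items():
        model = fit(train)
        report.append((name, accuracy(model, train), accuracy(model, valid)))
    # Choose by validation accuracy, never by training accuracy.
    best = max(report, key=lambda item: item[2])
    return best, report
```

Each entry of `report` holds the training and the validation performance side by side, so a gap like 60% versus 48% is visible directly. The season to predict (2015) is never touched during this selection.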
Any ideas on this topic?
In the end I would like to answer my own question, since this was the main point of my master's thesis. I took the seasons 2010 to 2014 and the first half of the last season (2015/2016) as training data, and the second half of the last season as test data.
Furthermore, I split the training data once more using cross-validation: the training data is split into 10 parts, and in each iteration one part is used as evaluation data while the other 9 parts are used as training data.
Because the results for the evaluation data are known, this process can be optimized, i.e. used to find the best algorithm and its parameters.
The algorithm found this way can then be applied to the test data, i.e. the second half of the last season.
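As a rough illustration of this procedure outside of RapidMiner, here is a small Python sketch: cross-validate each candidate on the training data, pick the best one, refit it on all training data and evaluate it once on the test data. The candidate learners, the dictionary keys and the shuffled fold assignment with a fixed seed are assumptions for the example, not a copy of my processes.

```python
import random
from collections import Counter


def fit_majority(train):
    """Hypothetical baseline: always predict the most frequent training result."""
    label = Counter(r["result"] for r in train).most_common(1)[0][0]
    return lambda row: label


def fit_higher_value(train):
    """Hypothetical rule that needs no fitting: the more valuable team wins."""
    return lambda row: "H" if row["home_value"] >= row["away_value"] else "A"


def accuracy(model, rows):
    return sum(model(r) == r["result"] for r in rows) / len(rows)


def cross_val_accuracy(fit, rows, k=10, seed=0):
    """Mean accuracy over k folds; requires len(rows) >= k."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)           # assumption: shuffled folds
    folds = [idx[i::k] for i in range(k)]      # k disjoint parts
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [r for i, r in enumerate(rows) if i not in held_out]
        evaluation = [rows[i] for i in fold]
        scores.append(accuracy(fit(train), evaluation))
    return sum(scores) / len(scores)


def select_and_test(candidates, train_rows, test_rows, k=10):
    """Pick the candidate with the best CV score, then test it exactly once."""
    cv = {name: cross_val_accuracy(fit, train_rows, k)
          for name, fit in candidates.items()}
    best = max(cv, key=cv.get)
    final_model = candidates[best](train_rows)  # refit on all training data
    return best, cv, accuracy(final_model, test_rows)
```

The test data only appears in the last line of `select_and_test`, so the reported test performance is not influenced by the optimization.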
You can find my master's thesis here (the thesis itself is in German: "Anwendung von Data-Mining-Technologien zu statistischen Auswertungen und Vorhersagen im Fußball"), including the RapidMiner processes and the data used: