Predict Sports Outcome

PyNoobPyNoob Member Posts: 15 Contributor I
I am a total newbie in RapidMiner. I was trying in Python, but the coding part is a bit intense. I want to predict the outcome "Classification" for teams that I can specify as "Home Team" and "Away Team", e.g. Liverpool playing against Burnley: big win, goals scored; FC Schalke playing against Borussia Dortmund: draw, goals scored. So far I have got distinct responses in the Simulator tab, and I would like to use and test those responses. Ultimately I would like to run machine learning to predict these outputs. How can I cluster my data? Where do I start? Please help. I am really excited about this!
Attachment: MyDream. (for some reason I am unable to remove attachment)

Comments

  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,497 RM Data Scientist
    Hi There,

    there is one very important thought you may need to have first: even when you are able to do this, you need to know what to do with it. Just because you know who will likely win does not mean you know what to bet on.
    Let me give you an example:
    Munich won the last German championships and is likely the strongest team. Naturally, your algorithm will predict a high likelihood for Munich in a game of Munich vs Schalke last season or so. STILL, you may want to bet on Frankfurt if the odds are better than the chances to win.

    You quickly run into interesting optimization problems.
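To make the point about odds versus chances concrete, here is a minimal sketch of the expected profit of a single bet. It assumes decimal odds (a winning unit stake returns `odds` units in total), and every number in it is hypothetical and chosen only for illustration.

```python
def implied_probability(decimal_odds: float) -> float:
    """Win probability at which a bet at these decimal odds breaks even."""
    return 1.0 / decimal_odds


def expected_profit(win_probability: float, decimal_odds: float, stake: float = 1.0) -> float:
    """Win (odds - 1) * stake with probability p, lose the stake otherwise."""
    return win_probability * (decimal_odds - 1.0) * stake - (1.0 - win_probability) * stake


# Hypothetical numbers: the model gives the favourite an 80% chance,
# but the bookmaker pays only 1.20; the underdog gets 10% and pays 12.0.
favourite = expected_profit(win_probability=0.80, decimal_odds=1.20)
underdog = expected_profit(win_probability=0.10, decimal_odds=12.0)

print(f"favourite: {favourite:+.2f} per unit staked")
print(f"underdog:  {underdog:+.2f} per unit staked")
```

With these made-up inputs the likely winner is the worse bet, because its predicted probability (0.80) is below the break-even probability implied by the odds (1 / 1.20, roughly 0.83).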

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • PyNoobPyNoob Member Posts: 15 Contributor I
    @mschmitz Absolutely Yes!
    What's the point of having data and the ability to manipulate it if you cannot have fun, right?
    Well, it's all fun and games until you start doing the work. (Here I mean building the correct model, getting the correct dependencies and the dataset, and then getting and testing the results.)
    Then, when it comes to betting, I am not the most interested in the outcome bit, i.e. there can be a big win, a big draw or a big loss.
    Just these headers are enough for me to feed into other algorithms. They can write an article, have an animated interaction on a video... so many possibilities. But to get there, I feel that getting the bit involving the "Classification" column right is important. That's why I am so invested in getting the model correct.
    Would you be ready to help me?
    I would do all the work. All I need is proper guidance.
    So far I have got it down to this:

    Now I understand that the regression model is binary and that I have used it incorrectly here. However, how do I make it non-binary? What would be the correct format for the input file, and the correct parameters for the Auto Model or a designed (non-Auto) model?


    Thanks,
    Harshad
  • PyNoobPyNoob Member Posts: 15 Contributor I
    @varunm1, @Telcontar120, @lionelderkrikor, @rfuentealba, @kayman, @SGolbert, @hughesfleming68, @kypexin, @Thomas_Ott, @JEdward, @newmint, @vinod_nageshwar, @b031710452, @carolyap2611, @Jedi88, @Pavithra_Rao, @Niharika Can you all helpful people help me in this project? I am sure it would be worth our time!
  • rfuentealbarfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Let's see.

    I have done *some* of this before, but football (for American people: "soccer") is the least predictable of team sports.

    Some questions come to mind off the top of my head:
    • How long was the pre-season each year?
    • Is this the first, second, third... game of the season?
    • What playing strategies are being used? 1-4-2-4, 1-4-3-3, 1-5-3-2
    • What position are the players taking in this strategy?
    • How many players have suffered injuries? What was the recovery time?
    • How many players are newcomers to the team?
    • What are the physical details of the players?
    • Man-to-man ability (passes, skills, saves, receptions, losses, fouls, red cards, yellow cards).
    • How many games have the team played with this coach?
    • Is the playing style wide open and offensive? or closed and defensive? aerial game? tiki-taka like Barcelona?
    Even with that information in mind, nobody could have imagined Universidad de Chile dropping to the "B" division here (but well, the team doesn't have a stadium*).

    Tomorrow I need to train someone in RapidMiner. We will sit down to check what we can do with this chart. I promise I'll get back to you with some results, ok?

    All the best,

    Rodrigo.


    (*) Only 4 football teams own their stadiums in Chile, all the others are public and administered by the municipalities. Two of the three biggest teams own their stadiums, and we usually mock the third one for not having it despite multiple failed attempts to build one.
  • PyNoobPyNoob Member Posts: 15 Contributor I
    @rfuentealba Yes! Firstly, THANK YOU!
    Thank you for helping me in my quest.
    Please find the preliminary conditions for the exercise, broken down into phases:
    Phase 1:

    We should be able to predict based on past performance data alone.
    • Event Date "Date Time"
    • Competition "Competition of the event" e.g. Premier League/Champions League/etc "Polynomial"
    • Team Playing at home "Home Team" "Polynomial"
    • Home Team Score "Team Score" "Integer"
    • Team Playing away "Away Team" "Polynomial"
    • Away Team Score "Team Score" "Integer"
    • Score of the event "Score" e.g. 1-0/0-6 "Polynomial"
    • Classification describes the event in a more descriptive manner, i.e. Big Win, Big Loss, Small Win, Small Loss, Big Draw, etc. It's defined in the 'Classification No Dupes' sheet "Classification" "Polynomial"
    I want to predict this data for now. (For an event, for the home team vs the away team, the result is expected to be given as "Classification".)

    Phase 2: We can add these data points for every event and make the algorithm better.
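As a rough illustration of the Phase 1 record, here is a minimal sketch of one match row and a labelling rule. The real label definitions live in the 'Classification No Dupes' sheet, which is not shown in this thread, so the thresholds below (`big_margin`, `big_draw_goals`) and the "Small Draw" label are assumptions, and the example match is made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Match:
    event_date: datetime
    competition: str
    home_team: str
    home_score: int
    away_team: str
    away_score: int

    @property
    def score(self) -> str:
        # Concatenated score such as "1-0" or "0-6".
        return f"{self.home_score}-{self.away_score}"


def classify(match: Match, big_margin: int = 3, big_draw_goals: int = 2) -> str:
    """Label from the home team's point of view (hypothetical thresholds)."""
    margin = match.home_score - match.away_score
    if margin == 0:
        return "Big Draw" if match.home_score >= big_draw_goals else "Small Draw"
    size = "Big" if abs(margin) >= big_margin else "Small"
    return f"{size} {'Win' if margin > 0 else 'Loss'}"


# Made-up example row, not a real result.
example = Match(datetime(2019, 8, 10), "Premier League", "Liverpool", 4, "Burnley", 0)
print(example.score, classify(example))
```

Note that the two scores, the concatenated score and the Classification are all known only after the match. For prediction they are targets, so only date, competition and the two team names from this list can serve as inputs.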

    • How long was the pre-season each year?
    • Is this the first, second, third... game of the season?
    • What playing strategies are being used? 1-4-2-4, 1-4-3-3, 1-5-3-2
    • What position are the players taking in this strategy?
    • How many players have suffered injuries? What was the recovery time?
    • How many players are newcomers to the team?
    • What are the physical details of the players?
    • Man-to-man ability (passes, skills, saves, receptions, losses, fouls, red cards, yellow cards).
    • How many games have the team played with this coach?
    • Is the playing style wide open and offensive? or closed and defensive? aerial game? tiki-taka like Barcelona?
    I am really excited to do this. Thank you again for helping me in this project. I am excited to learn a truckload and also have the knowledge resource along the way!
    FYI, I am pretty comfortable with Alteryx, so the learning curve should not be so steep with RapidMiner.

    Thank You!
    Harshad Barge
  • PyNoobPyNoob Member Posts: 15 Contributor I
    What is the best model or classifier for this task? With logistic regression, I have to select the teams and the score to get an outcome. I want to just select the teams and get an output. Possible outputs: Home Score, Away Score, Score (concatenated home and away score). I have a descriptive classifier for every score. If the model picks it up, clusters it, and that is the response, that would be the 🍒 cherry on the cake.. haha.
    I am ready to put in the work.. need worthy team.. 
  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,497 RM Data Scientist
    Hi @PyNoob ,
    as much as I, as a German, would love to do Bundesliga analysis, I need to pass. I simply don't have the bandwidth. I am sorry.

    My go-to algorithm would be, as usual, GBTs (gradient boosted trees). They are super strong on tabular data. You need to do very careful analysis to beat them with neural networks, SVMs or others.
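For readers working in Python rather than RapidMiner, here is a minimal sketch of a gradient boosted tree classifier that takes only the two team names and returns a probability for each "Classification" label. It assumes scikit-learn and pandas are installed. The column names and the six toy rows are made up to show the shape of the data, so the printed probabilities mean nothing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy rows, only to show the shape of the data.
matches = pd.DataFrame({
    "HomeTeam": ["Liverpool", "Burnley", "Liverpool", "Everton", "Burnley", "Everton"],
    "AwayTeam": ["Burnley", "Everton", "Everton", "Liverpool", "Liverpool", "Burnley"],
    "Classification": ["Big Win", "Small Loss", "Small Win", "Big Loss", "Big Loss", "Draw"],
})

features = ["HomeTeam", "AwayTeam"]
model = make_pipeline(
    # One-hot encode the team names; teams unseen in training become all zeros.
    ColumnTransformer([("teams", OneHotEncoder(handle_unknown="ignore"), features)]),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
)
model.fit(matches[features], matches["Classification"])

# Pick the two teams and get one probability per label (multi-class, not binary).
fixture = pd.DataFrame({"HomeTeam": ["Liverpool"], "AwayTeam": ["Burnley"]})
probabilities = dict(zip(model.classes_, model.predict_proba(fixture)[0]))
print(probabilities)
```

The scores are deliberately left out of `features`: they are not known before kick-off, so a model that uses them as inputs cannot be used to predict a future fixture.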

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • varunm1varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hello @PyNoob

    Some pointers from my work on a sport analytics project.

    1. A lot of literature review needs to be done on the event statistics to be used in model building.
    2. You don't need to consider all of them, but you need to focus on features that are important from the soccer perspective.
    3. Adopt some feature selection techniques and see if the features you are trying to use are useful in model building or not.
    4. Start with simple models like Decision Tree and GLM.
    5. Try to build different models for different leagues and see how it goes.
    6. Validate your model well. This is not just machine learning validation but also domain validation, which is very important. If your model predicts based on an attribute that has statistical relevance but no domain relevance, it is not a good, generalizable model even if it is highly accurate.
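Points 3, 4 and 6 can be sketched together in Python: a simple decision tree, a most-frequent-class baseline to compare against, a date-ordered holdout, and a look at which features the tree relied on. This assumes scikit-learn, pandas and NumPy are installed. The data is synthetic, and the feature names `home_form` and `noise` are hypothetical stand-ins rather than real soccer statistics.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400
# Synthetic stand-in data: one informative feature and one pure-noise feature.
data = pd.DataFrame({
    "home_form": rng.normal(size=n),  # hypothetical "recent form" feature
    "noise": rng.normal(size=n),      # unrelated to the result by construction
})
data["home_win"] = (data["home_form"] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Rows are assumed to be in date order: train on the past, test on the latest 25%.
split = int(n * 0.75)
train, test = data.iloc[:split], data.iloc[split:]
features = ["home_form", "noise"]

baseline = DummyClassifier(strategy="most_frequent").fit(train[features], train["home_win"])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(train[features], train["home_win"])

print("baseline accuracy:", baseline.score(test[features], test["home_win"]))
print("tree accuracy:    ", tree.score(test[features], test["home_win"]))
print("importances:      ", dict(zip(features, tree.feature_importances_.round(3))))
```

If a feature with no plausible link to the game shows high importance on real data, that is the cue for the domain check described in point 6.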

    Just my 2c
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • TheweneThewene Member Posts: 1 Newbie
    Great sport analytics project. I like everything.
  • PyNoobPyNoob Member Posts: 15 Contributor I
    edited December 2019
    @Thewene Yes! It is an awesome earning project! I just wish it was much better streamlined. I ran K-means algo for EPL data since 2009. Python code:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    # In a Jupyter notebook, run "%matplotlib inline" here; a plain script does not need it.

    pd.set_option('display.max_rows', 5000)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)

    data = pd.read_excel("C:/Users/harsh/Documents/My Dream/EPL_Consolidated Edit Version.xlsx",
                         sheet_name='EPL_Consolidated Edit Version')

    data = data.drop(
        ['HomeTeam', 'AwayTeam', 'Full Time Result', 'Half Time Result', 'Points', 'Winning Team', 'Full Time BTTS',
         'Half Time BTTS', 'Home Team DNB (AH 0)', 'Away Team DNB (AH 0)', 'Home AH 0.5 (DD)', 'Home AH 1', 'GO 2.5',
         'GU 2.5', 'Home DD', 'Away DD', 'HT FT', 'Home AH - .5 (Win by 1 Goal)', 'Away AH .5 (Win by 1 Goal)',
         'Full Time Result (For Dataframe)'], axis=1)
    # print(data)
    # data = data[(data['Season'] == 2019)]
    data = data.drop(['Date', 'Season', 'Half Time Home Score',
                      'Half Time Away Score', 'Total Goals'], axis=1)
    # data = data.set_index('Date')
    # data.index = pd.to_datetime(data.index)
    # data.plot()
    # plt.show()
    data = data.rename(columns={'Home Team Score': 'x', 'Away Team Score': 'y'})
    df = data
    # print(data)
    np.random.seed(20)
    k = 3
    # centroids[i] = [x, y]
    centroids = {
        i + 1: [np.random.randint(0, 10), np.random.randint(0, 10)]
        for i in range(k)
    }

    fig = plt.figure(figsize=(5, 5))
    plt.scatter(df['x'], df['y'], color='k')
    colmap = {1: 'r', 2: 'g', 3: 'b'}
    for i in centroids.keys():
        plt.scatter(*centroids[i], color=colmap[i])
    plt.xlim(0, 10)
    plt.ylim(0, 10)
    plt.show()


    def assignment(df, centroids):
        # Euclidean distance from every match to every centroid.
        for i in centroids.keys():
            df['distance_from_{}'.format(i)] = (
                np.sqrt(
                    (df['x'] - centroids[i][0]) ** 2
                    + (df['y'] - centroids[i][1]) ** 2
                )
            )
        centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
        df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)
        # Drop the column-name prefix to recover the centroid id.
        # (str.lstrip strips a set of characters, not a prefix, so it is avoided here.)
        df['closest'] = df['closest'].map(lambda x: int(x.replace('distance_from_', '')))
        df['color'] = df['closest'].map(lambda x: colmap[x])
        return df


    df = assignment(df, centroids)
    print(df.head())

    fig = plt.figure(figsize=(5, 5))
    plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
    for i in centroids.keys():
        plt.scatter(*centroids[i], color=colmap[i])
    plt.xlim(0, 10)
    plt.ylim(0, 10)
    plt.show()

    import copy

    old_centroids = copy.deepcopy(centroids)


    def update(centroids):
        # Move each centroid to the mean of the points currently assigned to it.
        for i in centroids.keys():
            centroids[i][0] = np.mean(df[df['closest'] == i]['x'])
            centroids[i][1] = np.mean(df[df['closest'] == i]['y'])
        return centroids


    centroids = update(centroids)

    fig = plt.figure(figsize=(5, 5))
    ax = plt.axes()
    plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
    for i in centroids.keys():
        # Draw an arrow from each old centroid towards its new position.
        old_x = old_centroids[i][0]
        old_y = old_centroids[i][1]
        dx = (centroids[i][0] - old_centroids[i][0]) * 0.75
        dy = (centroids[i][1] - old_centroids[i][1]) * 0.75
        ax.arrow(old_x, old_y, dx, dy, head_width=2, head_length=3, fc=colmap[i], ec=colmap[i])
    plt.show()

    df = assignment(df, centroids)

    fig = plt.figure(figsize=(5, 5))
    plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
    for i in centroids.keys():
        plt.scatter(*centroids[i], color=colmap[i])
    plt.xlim(0, 10)
    plt.ylim(0, 10)
    plt.show()

    # Repeat update and assignment until no match changes cluster.
    while True:
        closest_centroids = df['closest'].copy(deep=True)
        centroids = update(centroids)
        df = assignment(df, centroids)
        if closest_centroids.equals(df['closest']):
            break

    fig = plt.figure(figsize=(5, 5))
    plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
    for i in centroids.keys():
        plt.scatter(*centroids[i], color=colmap[i])
    plt.xlim(0, 10)
    plt.ylim(0, 10)
    plt.show()


    RapidMiner could help, but I am a bit rusty here. You can check it out!
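For comparison, the same k-means idea fits in a few lines of vectorized NumPy. This is only a sketch: it starts from randomly chosen data points rather than random grid positions (a swapped-in initialization, so that no cluster starts empty), and the score pairs are made up rather than taken from the EPL spreadsheet.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points so no cluster begins empty.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.full(len(points), -1)
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, labels


# Made-up (home goals, away goals) pairs, not real EPL results.
scores = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [4, 0],
                   [5, 1], [4, 1], [0, 4], [1, 5], [0, 3]])
centroids, labels = kmeans(scores, k=3)
print(centroids)
print(labels)
```

Clustering final scores groups results that already happened. It does not by itself predict the score of a future match, so it is better suited to defining labels than to forecasting.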