
Linear Regression Modeling for Soccer Player Performance Prediction in the EPL

Linear regression is commonly used in machine learning to solve prediction problems. The aim of this project is to predict EPL football player scores based on various factors. Furthermore, this method helps us understand how to model soccer player performance based on different factors. We use Python to build the model, making it easy for beginners to learn about linear regression. In addition, this project uses real-world data to improve learning and practice regression analysis.

Project Overview

This project focuses on building a multiple linear regression model to predict EPL soccer player scores. We use a dataset that includes attributes like player costs, goals, and shots per game. Moreover, the goal is to establish meaningful relationships between these factors and a player's score. This analysis helps team managers and scouts make better recruitment decisions.


This project covers key machine learning ideas: data cleaning, regression analysis, and how to check whether a model works well. Beginners get hands-on practice with linear regression and learn how to measure a model's performance.


Prerequisites

We suggest having a basic understanding of Python, statistics, and machine learning before starting this project. It's helpful to know about model evaluation, visualization, and data preparation methods. You will need libraries like Matplotlib, NumPy, Pandas, and Scikit-learn for this project. Understanding ordinary least squares (OLS) regression and regression analysis is also helpful.


You can write and execute the Python code in Google Colab or a Jupyter Notebook. You will also learn about important statistics such as R-squared, adjusted R-squared, and p-values, which help you better understand the model's results.


Approach

In this project, we use multiple linear regression to predict EPL football player scores. We chose this method because it is simple to use. It shows how factors like player costs, shots per game, and goals impact the player's score.


You can also use other methods to predict player performance. These methods include decision trees, random forests, or neural networks. However, linear regression provides a simple and clear model. It helps you easily understand the connections between features and results. This makes it an excellent choice for beginners.


Workflow and Methodology

The overall workflow of this project includes:

  • Problem definition: Predicting EPL soccer player scores.
  • Data collection and preprocessing: First, we collect and preprocess the data, ensuring it is clean and ready for modeling.
  • Data splitting: Next, we split the dataset into training and testing sets.
  • Model building: We build a multiple linear regression model using ordinary least squares (OLS) regression.
  • Model evaluation: Next, we check how the model performs using R-squared and mean-squared error (MSE).

The methodology involves:

  • Data handling: Cleaning, transforming, and splitting the data.
  • Model selection: Choosing the linear regression model due to its interpretability.
  • Training and evaluation: Training the model and validating its performance on the test set.

Additionally, other methods, such as random forest regression or neural networks, could be used to solve the problem. However, we chose this algorithm because it is simple and explains how different features relate to the target variable.


Data Collection

Data Preparation

We analyzed a sample of players from EPL teams to create a dataset with specific features. The features we included in our dataset are:

  • Player's Name
  • Club
  • Distance Covered (in Kms)
  • Goals per Minute Ratio
  • Shots per Game
  • Agent Fee
  • BMI
  • Cost
  • Previous Club Cost
  • Height (Squared)

We analyzed these features and added the values of all players' characteristics to the dataset. The final dataset is now ready for use in the model.


Data Preparation Workflow

The data preparation workflow involves several steps to ensure the dataset is properly structured for the model:

Code Explanation

STEP 1:

This block of code mounts your Google Drive in a Google Colab notebook. It lets you access files saved in Google Drive from within Colab, so you can read, analyze, and use them to train models.

from google.colab import drive
drive.mount('/content/drive')
import warnings
warnings.filterwarnings('ignore')

Install required packages

These commands install the necessary Python libraries. They include numpy, seaborn, matplotlib, statsmodels, pandas, scipy, and scikit-learn. We use 'pip' to install them. This sets up the environment for data analysis, modeling, and visualization.

!pip install numpy
!pip install seaborn
!pip install matplotlib
!pip install statsmodels
!pip install pandas
!pip install scipy
!pip install scikit_learn

Import required packages

This code imports libraries for handling data, modeling, and creating visuals. It prepares the environment for data analysis and plotting tasks.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import seaborn as sns
from scipy import stats
import scipy
from matplotlib.pyplot import figure

STEP 2:

Data Reading from Different Sources

  • Files: In many cases, the data is stored in the local system. To read the data from the local system, specify the correct path and filename.

CSV format

  • Comma-separated values (CSV) is a common way to store data in a tabular format. We use CSV-formatted data in this project.
  • Use the following code to read data from a CSV file using pandas.
  • Given a valid path, the pd.read_csv() function reads the data and stores it in the data variable.
  • If you get a FileNotFoundError or a "No such file or directory" error, check the path passed to the function; Python may not be able to find the file or directory at the given location.

Load the data

data = pd.read_csv('/content/drive/MyDrive/Aionlinecourse/EPL_Soccer_Dataset.csv')
data.head(10)
data.columns

Data Dictionary

  • PlayerName: Player Name
  • Club: Club of the player
  • DistanceCovered(InKms): Average distance covered by the player per game, in km
  • Goals: Average goals per match
  • MinutestoGoalRatio: Average minutes played per goal scored
  • ShotsPerGame: Average shots taken per game
  • AgentCharges: Agent fees
  • BMI: Body-Mass index
  • Cost: Cost of each player in hundred thousand dollars
  • PreviousClubCost: Previous club cost in hundred thousand dollars
  • Height: Height of player in cm
  • Weight: Weight of player in kg
  • Score: Average score per match

Data analysis and visualization

Exploratory Data Analysis

Exploratory Data Analysis, commonly known as EDA, is a technique for analyzing data with statistics and visuals. It involves using statistical and visual techniques to identify trends in the data.

EDA helps you understand data patterns, spot unusual values, and check assumptions. The main goal is to analyze the data before making any decisions about it.


Dataframe Information

The dataframe.info() method shows details about the DataFrame. This includes index type, columns, non-null values, and memory usage.

It can be used to get basic info, look for missing values, and get a sense of each variable's format.

data.info()

There are a total of 254 rows and 13 columns in the EPL Soccer Dataset. Interestingly, there are no null values in the dataset. Out of 13 columns, 10 are float type and 1 is integer type; the remaining 2 are object type.


Dataframe Description

  • To generate descriptive statistics, the pandas.DataFrame.describe() function is used.
  • Descriptive statistics summarize the central tendency, spread, and shape of a dataset. They ignore any NaN values.
  • It gives a simple overview of the data, including how each variable is spread out and any large jumps between the minimum, 25th, 50th, 75th percentile, and maximum values.
  • The quartiles provide excellent insight into the range of a set of data. By knowing the 25th, 50th, and 75th percentiles, you can see which quartile a given value falls into.
    • The 25th percentile is also referred to as the first, or lower, quartile. It is the value below which 25% of the data falls and above which 75% of the data falls.
    • The median is also known as the 50th percentile. The median divides the data in half: half of the data points are below it, and the other half are above it.
    • The 75th percentile is often referred to as the third, or upper, quartile. It is the value above which 25% of the data falls and below which 75% of the data falls.

Descriptive statistics for quantitative variables

  • DataFrame.count: Count the number of non-NA/null observations
  • DataFrame.max: Maximum of the values in the object
  • DataFrame.min: Minimum of the values in the object
  • DataFrame.mean: Mean of the values
  • DataFrame.std: Standard deviation of the observations
  • DataFrame.select_dtypes: Subset of a DataFrame including/excluding columns based on their type
data.describe()
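
As a quick illustration of the quartiles described above, you can also pull them for a single column with pandas (a small sketch):

# 25th, 50th (median), and 75th percentiles of the Cost column
data['Cost'].quantile([0.25, 0.50, 0.75])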

Relationship between Cost and Score

Score and Cost have a correlation of about 0.96, making Cost a strong candidate predictor for a simple linear regression. The scatter plot below shows a clear linear relationship between them.
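
To confirm this figure numerically before plotting, you can compute the correlation directly (a quick check):

# Pearson correlation between Cost and Score
data['Cost'].corr(data['Score'])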


To see this relationship visually, let's plot the scatter plot for Cost and Score.

figure(figsize=(8, 6), dpi=80)
plt.scatter(data['Cost'], data['Score'])
# define the label
plt.xlabel("Cost")
plt.ylabel("Score")
plt.title("Scatter plot: Cost vs. Score")

STEP 3:

Splitting the dataset into training data and test data




After the data points are collected, they are split into two sets, called train and test. The model is trained on the training data and then tested on the unseen test data to see how well it performs and whether it overfits or underfits.


Underfitting and Overfitting

Underfitting: Underfitting happens when a model is too simple to learn the data's patterns. This leads to poor results on both training and testing data. It often happens when the model is too basic for complex data or lacks sufficient features. To fix underfitting, you can make the model more complex or add relevant features.


Overfitting: Overfitting happens when a model is too complex, learning from noise and errors in the training data. This causes it to perform well on training data but poorly on testing data, as it fails to generalize. To fix overfitting, try reducing the model's complexity. You can also use techniques like regularization or more training data.
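
To make these ideas concrete, here is a small, self-contained sketch on synthetic data (separate from the EPL pipeline) that contrasts an underfit straight line with an overfit high-degree polynomial using scikit-learn:

# Illustrative sketch: compare an underfit and an overfit polynomial model
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_demo = np.sort(rng.uniform(0, 3, 60)).reshape(-1, 1)
y_demo = np.sin(2 * X_demo).ravel() + rng.normal(0, 0.2, 60)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

for degree in (1, 15):  # degree 1 tends to underfit, degree 15 tends to overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(f"degree {degree}: "
          f"train MSE = {mean_squared_error(y_tr, model.predict(X_tr)):.3f}, "
          f"test MSE = {mean_squared_error(y_te, model.predict(X_te)):.3f}")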

x=data['Cost']
y=data['Score']
#The dataset is split into 80% training data and 20% testing data
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.80, test_size = 0.20, random_state = 100)

Choose an AI Model

For this project, we chose multiple linear regression as the AI model. We chose this model because it's simple and easy to understand. It helps explain the relationship between features and the target. It shows how factors like player cost, goals, and distance affect the player's score. This makes it an ideal choice for beginners in machine learning.


Stats models approach to regression

Let's get to our case. We will use Ordinary Least Squares (OLS) from the statsmodels library to model the relationship between Cost and Score.

#Fit the linear regression model without intercept
lr = sm.OLS(y_train, x_train).fit()
# Retrieve and print the parameters of the fitted model
print("Parameters without intercept:")
print(lr.params)
# Print summary statistics of the fitted model without intercept
print("Summary without intercept:")
print(lr.summary())
# Add a constant term (intercept) to the independent variable
x_train_with_intercept = sm.add_constant(x_train)
# Fit the linear regression model with intercept
lr = sm.OLS(y_train, x_train_with_intercept).fit()
# Print summary statistics of the fitted model with intercept
print("Summary with intercept:")
print(lr.summary())
# Print the parameters of the fitted model
print(lr.params)
# Extract the intercept (b0) and slope (b1) from the fitted parameters
b0 = lr.params.iloc[0]
b1 = lr.params.iloc[1]
# Plot the fitted line on training data
plt.figure(figsize=(8, 6), dpi=80)
plt.scatter(x_train, y_train)
# Plotting the regression line
plt.plot(x_train, b0 + b1 * x_train, 'r')
# Labeling the axes and adding a title
plt.xlabel("Cost")
plt.ylabel("Score")
plt.title("Fitted Regression Line on Trining Data")
plt.show()

Prediction on test data

# Plot the fitted line on test data
x_test_with_intercept = sm.add_constant(x_test)
y_test_fitted = lr.predict(x_test_with_intercept)
# Scatter plot on test data
plt.figure(figsize=(8, 6), dpi=80)
plt.scatter(x_test, y_test)
# Plotting the regression line
plt.plot(x_test, y_test_fitted, 'r')
# Labeling the axes and adding a title
plt.xlabel("Cost")
plt.ylabel("Score")
plt.title("Fitted Regression Line on Test Data")
plt.show()

STEP 4:

Diagnostics checklist:

  • Non-linearity
  • Non-constant variance
  • Deviations from normality
  • Errors not iid
  • Outliers
  • Missing predictors
# Build predictions on training data
predictions_y = lr.predict(x_train_with_intercept)
# Find residuals
r_i = (y_train - predictions_y)
# Plot residuals vs. Cost
plt.figure(figsize=(8, 6), dpi=80)
plt.title('Residuals vs. Cost')
plt.xlabel('Cost', fontsize=15)
plt.scatter(x_train, r_i)
plt.show()
# Plot absolute residuals vs. Cost
abs_r_i = np.abs(y_train - predictions_y)
plt.figure(figsize=(8, 6), dpi=80)
plt.title('Absolute Residuals vs. Cost')
plt.xlabel('Cost', fontsize=15)
plt.scatter(x_train, abs_r_i)
plt.show()
# Probability plot of residuals
plt.figure(figsize=(8, 6), dpi=80)
scipy.stats.probplot(r_i, plot=plt)
# Distribution plot of residuals
plt.figure(figsize=(8, 6), dpi=80)
sns.histplot(r_i, bins=15, kde=True)
plt.title('Error Terms', fontsize=15)
plt.xlabel('y_train - y_train_pred', fontsize=15)
plt.show()
# Boxplot for outliers
plt.figure(figsize=(8, 6), dpi=80)
plt.boxplot(r_i, boxprops=dict(color='red'))
plt.title('Residual Boxplot')
plt.show()

Transformations to avoid non-constant variance

Non-constant variance (heteroscedasticity) in linear regression leads to inefficient coefficient estimates and unreliable standard errors. To address this, various data transformations can be applied:

  • Log Transformation: It reduces variance by changing data into logarithmic values. This is helpful when variance increases with the average.
  • Square Root Transformation: Reduces variance by converting data to square root values.
  • Box-Cox Transformation: It's a flexible method that reduces variance by transforming data. This brings the data closer to a normal distribution.
  • Yeo-Johnson Transformation: A modern alternative to Box-Cox that works with both positive and negative values.
# Calculate residuals for the test set
test_residuals = (y_test - y_test_fitted)
# Plot residuals vs. predictor in the test set
plt.figure(figsize=(8, 6), dpi=80)
plt.title('Test Residuals vs. Cost')
plt.xlabel('Cost', fontsize=15)
plt.ylabel('Residuals', fontsize=15)
plt.scatter(x_test, test_residuals)
plt.show()

We can see that the scatter of the data points increases as Cost increases. This is evidence of heteroscedasticity.
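
Besides the visual check, heteroscedasticity can also be tested formally. A minimal sketch using the Breusch-Pagan test from statsmodels, applied to the intercept model fitted above, is shown below; a small p-value indicates heteroscedasticity.

# Breusch-Pagan test on the training residuals of the fitted model
from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(lr.resid, lr.model.exog)
print("LM statistic:", lm_stat, " p-value:", lm_pvalue)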

We'll try transformations like square root, log, and Box-Cox. These help us check whether we can stabilize the variance.

# Transformations
sqrt_y = np.sqrt(y)
ln_y = np.log(y)
bc_y, _ = stats.boxcox(y)
# Plot original and transformed data
plt.figure(figsize=(8, 6), dpi=80)
plt.scatter(x, sqrt_y, color='red', label='Square Root')
plt.scatter(x, ln_y, color='blue', label='Natural Logarithm')
plt.scatter(x, bc_y, color='orange', label='Box-Cox')
# Adding labels and legend
plt.xlabel('x')
plt.ylabel('Transformed y')
plt.title('Transformations of y')
plt.legend()
plt.show()
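
The Yeo-Johnson transformation mentioned in the list above is not plotted here, but a minimal sketch using scipy.stats.yeojohnson (available in SciPy 1.2+) could look like this:

# Sketch: Yeo-Johnson transformation of the target
# Unlike Box-Cox, it also handles zero and negative values.
yj_y, yj_lambda = stats.yeojohnson(y)
plt.figure(figsize=(8, 6), dpi=80)
plt.scatter(x, yj_y, color='green', label='Yeo-Johnson')
plt.xlabel('x')
plt.ylabel('Transformed y')
plt.title('Yeo-Johnson transformation of y')
plt.legend()
plt.show()
print("Estimated lambda:", yj_lambda)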

The square root transformation (shown by the red dots) produces a clear linear pattern in the data points. It would be interesting to see what happens when we run the linear regression model on the transformed variable.

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, sqrt_y, train_size=0.75,
                                                    test_size=0.25, random_state=100)
# Add a constant term (intercept) to the training data
x_train_with_intercept = sm.add_constant(x_train)
# Fit the linear regression model on the training data
lr = sm.OLS(y_train, x_train_with_intercept).fit()
# Print the summary statistics of the fitted model
print(lr.summary())

This code plots the linear regression line on the training data. It first retrieves the model parameters ('b0' for the intercept and 'b1' for the slope). Then, it creates a scatter plot of the training data ('x_train', 'y_train') and overlays the regression line ('b0 + b1 * x_train'). Finally, it labels the axes ("Cost" and "Score") and displays the plot to visualize the model's fit.

# Print the parameters of the fitted model
print(lr.params)
b0 = lr.params.iloc[0]
b1 = lr.params.iloc[1]
# Plot the training data points
plt.figure(figsize=(8, 6), dpi=80)
plt.scatter(x_train, y_train)
# Plot the regression line
plt.plot(x_train, b0 + b1 * x_train, 'r')
# Labeling the axes and adding a title
plt.xlabel("Cost")
plt.ylabel("Score")
plt.title("Fitted Regression Line on Training Data")
# Display the plot
plt.show()

This code uses the model to predict test data. Then, it creates a plot to compare the actual vs. predicted values. Next, it looks at the difference between predicted and actual values, called residuals. It shows these residuals against the predictor. This helps check the model's accuracy and spot any potential problems.

# Add a constant term (intercept) to the test data
x_test_with_intercept = sm.add_constant(x_test)
# Make predictions on the test data
y_test_fitted = lr.predict(x_test_with_intercept)
# Plot actual vs. predicted values on the test data
plt.figure(figsize=(8, 6), dpi=80)
plt.scatter(x_test, y_test)  # Actual values
plt.plot(x_test, y_test_fitted, 'r')  # Predicted values
plt.xlabel("Cost")
plt.ylabel("Score")
plt.title("Actual vs. Predicted Values on Test Data")
plt.show()
# Evaluate variance and residuals
# Calculate residuals
test_residuals = y_test - y_test_fitted
# Plot residuals vs. predictor
plt.figure(figsize=(8, 6), dpi=80)
plt.title('Residuals vs. Cost')
plt.xlabel('Cost', fontsize=15)
plt.scatter(x_test, test_residuals)
plt.show()

STEP 5:

Summary Statistics

df = data.describe()
df
# Compute the correlation matrix on the numeric columns of the original data
corr = data.corr(numeric_only=True)
corr

DistanceCovered(InKms) has a negative correlation of −0.49 with the target variable Score, while Cost has a strong positive correlation of 0.96 with it.

# Set the figure size
plt.figure(figsize=(12, 6), dpi=80)
# Create a heatmap of the correlation matrix
ax = sns.heatmap(
    corr,  # The correlation matrix
    vmin=-1, vmax=1, center=0,  # Set the color scale limits
    cmap=sns.diverging_palette(20, 220, n=200),  # Set the color palette
    square=True,  # Make the heatmap square
    annot=True  # Annotate each cell with the numeric value
)
# Rotate the x-axis labels for better readability
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
)
# Show the plot
plt.show()

Let's look at the correlations of some factors with "Score."

  • Some factors, such as Height and Weight, should be removed because they are only weakly correlated with Score (−0.190 and 0.00016, respectively).
  • Some factors, like MinutestoGoalRatio and ShotsPerGame, are also strongly correlated with each other, so keeping only one of them (ShotsPerGame) is enough. Including several such correlated factors in the model leads to a problem known as multicollinearity, which we will talk more about later; a quick check is sketched below.
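
One quick way to quantify multicollinearity is the variance inflation factor (VIF). A small sketch with statsmodels (assuming the column names match the data dictionary above) is:

# Variance inflation factors for a few correlated predictors
# VIF values well above ~5-10 usually indicate problematic multicollinearity.
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_features = data[['MinutestoGoalRatio', 'ShotsPerGame', 'Cost']].astype(float)
vif_design = sm.add_constant(vif_features)
for i, col in enumerate(vif_design.columns):
    print(col, variance_inflation_factor(vif_design.values, i))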

STEP 6:

Multiple Linear Regression Analysis Results

This code extracts the predictor variables from the dataset while excluding categorical variables. It then splits the data into 80% training and 20% testing sets. After that, it adds a constant term to the training data. Subsequently, it fits a linear regression model using the Ordinary Least Squares (OLS) method. Finally, it prints the summary statistics of the fitted model to evaluate its performance.

# Extract predictor variables (excluding categorical variables like Club)
X = data[['DistanceCovered(InKms)', 'Goals', 'ShotsPerGame', 'AgentCharges', 'BMI', 'Cost', 'PreviousClubCost']]
y = data['Score']
# The dataset is split into 80% training data and 20% testing data
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size = 0.80,
                                                    test_size = 0.20, random_state = 100)
# Add a constant term (intercept) to the training data
x_train_with_intercept = sm.add_constant(x_train)
# Fit the linear regression model using OLS
lr = sm.OLS(y_train, x_train_with_intercept).fit()
# Print the summary statistics of the fitted model
print(lr.summary())

The value we got for 𝑅2 was 0.959, which is pretty good. Also, the difference between 𝑅2 and Adjusted 𝑅2 is not very big, which is also a good sign. We should try getting rid of some factors to see if that helps.
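
If you prefer to read these statistics programmatically rather than from the summary table, the fitted statsmodels results object exposes them directly (a small sketch):

# Goodness-of-fit statistics from the fitted OLS results object
print("R-squared:", lr.rsquared)
print("Adjusted R-squared:", lr.rsquared_adj)
print("AIC:", lr.aic, " BIC:", lr.bic)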

# Can we trim some variables and see how it performs?
X = data[['DistanceCovered(InKms)', 'Cost', 'PreviousClubCost']]
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size = 0.80,
                                                    test_size = 0.2, random_state = 100)
x_train_with_intercept = sm.add_constant(x_train)
lr = sm.OLS(y_train, x_train_with_intercept).fit()
print(lr.summary())

This code extracts unique clubs from the 'Club' column. After that, it converts the 'Club' categorical variable into dummy variables using one-hot encoding. Then, it displays the first few rows of these dummy variables. This allows us to examine the transformed data.

# Extract unique clubs from the 'Club' column
clubs = set(data['Club'])
# Print the set of unique clubs
clubs
# Convert categorical variable 'Club' into dummy variables
nominal_features = pd.get_dummies(data['Club'])
# Display the first few rows of the dummy variables
nominal_features.head()

This block of code joins the original DataFrame ('data') with the one-hot encoded 'nominal_features' to add the dummy variables. After that, it shows the first few rows of the combined DataFrame to check the updated dataset.

# Concatenate original DataFrame 'data' with one-hot encoded 'nominal_features'
data_encoded = pd.concat([data, nominal_features], axis=1)
# Display the first few rows of the concatenated DataFrame
data_encoded.head()

This code defines the feature matrix 'X', which includes selected numerical and one-hot encoded categorical variables. It also defines the target variable 'y' as the Score. After that, it splits the data into 80% training and 20% testing sets. Finally, it displays the 'x_train' and 'y_train' data to review the training features and target values.

# Define the feature matrix X and target variable y
X = data_encoded[['DistanceCovered(InKms)', 'BMI', 'Cost', 'PreviousClubCost', 'ARS', 'MC', 'CHE', 'MUN', 'LIV']]
y = data_encoded['Score']
# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.80,
                                                    test_size=0.20, random_state=100)
x_train
y_train

This code converts the 'x_train' and 'y_train' data to floats. It then adds a constant term (intercept) to the training data. After that, it fits a multiple linear regression model using the Ordinary Least Squares (OLS) method. Finally, it prints the summary statistics of the fitted model to evaluate its performance and coefficients.

x_train = x_train.astype(float)
y_train = y_train.astype(float)
# Add a constant term (intercept) to the training data
x_train_with_intercept = sm.add_constant(x_train)
# Fit the multiple linear regression model using OLS(Ordinary Least Squares regression)
lr = sm.OLS(y_train, x_train_with_intercept).fit()
# Print the summary statistics of the fitted model
print(lr.summary())

Including the club features significantly improved the model: 𝑅2 rose to 0.966, and both AIC and BIC dropped significantly.

# Set the figure size
plt.figure(figsize=(8, 6), dpi=80)
# Convert the test features to floats (as with training) and add a constant term (intercept)
x_test = x_test.astype(float)
x_test_with_intercept = sm.add_constant(x_test)
# Predict the target variable on the test set
y_test_fitted = lr.predict(x_test_with_intercept)
# Create a scatter plot to compare the fitted values with the actual values
plt.scatter(y_test_fitted, y_test)
plt.xlabel("Y Predicted")
plt.ylabel("Y Actual")
plt.title("Scatter plot between Fitted and Actual Test Values")
plt.show()
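
The workflow above also lists mean squared error as an evaluation metric. A minimal sketch for computing the test-set MSE and R-squared with scikit-learn, using the predictions from the cell above, could be:

# Quantitative evaluation of the final model on the test set
from sklearn.metrics import mean_squared_error, r2_score

y_test = y_test.astype(float)  # keep dtypes consistent with the training step
print("Test MSE:", mean_squared_error(y_test, y_test_fitted))
print("Test R-squared:", r2_score(y_test, y_test_fitted))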

Project Conclusion

This project, Predicting EPL Soccer Player Scores Using Multiple Linear Regression, demonstrates how multiple linear regression can be applied to predict EPL soccer player scores. Moreover, by understanding the relationships between factors like player cost and goals, we can make more informed decisions in sports analytics.


Challenges and Troubleshooting

In this project, we faced two key challenges: multicollinearity and heteroscedasticity. Multicollinearity happens when predictor variables are highly correlated, which can cause unstable model coefficients. Therefore, to fix this, we used feature selection methods that remove redundant predictors and improve accuracy. You can learn more about feature selection in our tutorial.

  • Heteroscedasticity is when the residuals have unequal variance. This breaks an assumption in linear regression. As a result, the model's estimates may become biased. To address this, we applied data transformations. We used techniques like logarithmic scaling and square root transformations. Additionally, these transformations stabilize variance and make the data more normal. Stabilizing variance ensures better model performance and predictions.
  • We also focused on feature engineering. This involves creating or modifying variables to enhance the model. Good feature engineering improves the model's ability to capture patterns. Moreover, it is a critical step in any data science project. You can explore feature engineering in our detailed tutorial. These techniques help solve issues common in regression models. They improve both accuracy and reliability.

Interested In Deep Learning?

Deep Learning is a powerful tool that helps machines learn from large amounts of data. Learn the basics of Deep Learning and its applications in this tutorial. Understand how machines learn and improve with data. Build your first AI project on Deep Learning and gain hands-on experience. Follow simple steps to train models easily.


FAQ

  1. What are the Assumptions of Linear Regression?

    • Answer: Linear regression assumes a straight-line relationship between input and output variables. Errors should also be normally distributed with constant variance.

  2. What is linear regression, and how does it work?

    • Answer: Linear regression is a method used to predict a relationship between two variables. It works by fitting a straight line to the data points.

  3. What are the common techniques used to improve the accuracy of a linear regression model?

    • Answer: A common way to improve the model is by removing outliers. Outliers can negatively affect the model's accuracy. You can also use feature scaling or transform the variables if needed.

  4. What are the overfitting and underfitting of the regression model?

    • Answer: Overfitting occurs when the model fits too closely to the training data. It captures noise instead of patterns. Underfitting occurs when the model fails to capture the main trend in the data.

  5. What are the possible ways of improving the accuracy of a linear regression model?

    • Answer: To improve the model, add useful features or remove unnecessary ones. Another way is to regularize the model to avoid overfitting.
