## Linear Regression Modeling for Soccer Player Performance Prediction in the EPL

# Explanation of the Code

### Step 1:

You can mount your Google Drive in a Google Colab notebook with this piece of code. This makes it simple to access files saved in Google Drive from the Colab environment, so that data can be modified and analyzed and models can be trained.
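A minimal sketch of the mounting step, assuming the standard `google.colab` helper and the conventional `/content/drive` mount point. The `mount_drive` wrapper is a hypothetical name added here so the snippet also runs harmlessly outside Colab.

```python
def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return True on success."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import drive
    except ImportError:
        print("Not running in Colab; skipping Drive mount.")
        return False
    drive.mount(mount_point)  # asks for authorization on first use
    return True


mounted = mount_drive()
```

Inside Colab, files then typically appear under the mount point, and the notebook can read them with ordinary file paths.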

**Install required packages**

**Import required packages**

### Step 2:

**Load the data**

**Information of Dataframe**

`data.info()` shows details about the DataFrame, such as the index dtype, the column names and dtypes, the non-null counts, and the memory usage.

It's useful for getting basic information, checking for missing values, and understanding how each variable is stored.
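As an illustration, the sketch below calls `info()` on a tiny made-up DataFrame. The rows and the `PlayerName` column are invented for this example and do not come from the project dataset; only the `Cost` and `Score` column names appear in this write-up.

```python
import io

import pandas as pd

# Made-up rows for illustration only; not the real EPL data
data = pd.DataFrame({
    "PlayerName": ["A", "B", "C", "D"],  # hypothetical column
    "Cost": [50.0, 75.5, None, 120.0],   # one missing value on purpose
    "Score": [10.2, 14.8, 9.1, 22.4],
})

buffer = io.StringIO()
data.info(buf=buffer)        # info() prints by default; buf captures the text instead
report = buffer.getvalue()
print(report)

# Per-column count of missing values, a useful companion to info()
missing = data.isna().sum()
print(missing)
```

In the report, a non-null count lower than the number of rows marks a column with missing values, which is the case for `Cost` here.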

**Diagram showing the relationship between cost and score**

Score and Cost have a correlation coefficient of 0.96, which makes Cost a strong candidate predictor. A correlation this high suggests that the scatter plot of the two will show a roughly linear relationship, so Cost can be selected as the predictor variable for simple linear regression. To see this relationship visually, let's plot the scatter plot for Cost and Score.
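A sketch of how the correlation and the scatter plot could be produced with pandas and matplotlib. The data is synthetic (generated to be roughly linear), so the printed correlation will not match the project's value, and the output file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data
rng = np.random.default_rng(0)
cost = rng.uniform(20, 200, size=100)
score = 0.1 * cost + rng.normal(0, 1.5, size=100)
data = pd.DataFrame({"Cost": cost, "Score": score})

# Series.corr computes the Pearson correlation by default
r = data["Cost"].corr(data["Score"])
print(f"Pearson r between Cost and Score: {r:.2f}")

fig, ax = plt.subplots()
ax.scatter(data["Cost"], data["Score"], alpha=0.6)
ax.set_xlabel("Cost")
ax.set_ylabel("Score")
ax.set_title("Scatter plot: Cost vs. Score")
fig.savefig("cost_vs_score.png")  # arbitrary file name
```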

Scatter plot: Cost vs. Score

### Step 3:

**Splitting the dataset into training data and test data**

Once the data is collected, it is split into two sets: train and test. The model is fitted on the training data and then used to make predictions on the test data. This shows how well it performs on data it hasn't seen before and whether it is overfitting or underfitting.
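The split can be sketched with pandas alone. The 75/25 ratio and the seed are assumptions, and the project may have used a different helper (for example scikit-learn's `train_test_split`). The data is synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data
rng = np.random.default_rng(42)
data = pd.DataFrame({"Cost": rng.uniform(20, 200, size=200)})
data["Score"] = 0.1 * data["Cost"] + rng.normal(0, 1.5, size=200)

# Assumed 75/25 split; random_state makes the shuffle reproducible
train = data.sample(frac=0.75, random_state=100)
test = data.drop(train.index)  # every row that was not sampled into train

print(len(train), len(test))
```

Dropping the sampled index guarantees that no row appears in both sets, which is what keeps the test score an honest estimate.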

**Statsmodels approach to regression**

Now to our case. We will use Ordinary Least Squares (OLS) from the statsmodels library to model the relationship between Cost and Score.

**Making predictions with test data**
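One way to sketch the prediction step is with the statsmodels formula interface, which adds the intercept automatically. The data is synthetic, the split is a simple positional one for brevity, and RMSE is used here as an assumed evaluation metric.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the real data
rng = np.random.default_rng(2)
data = pd.DataFrame({"Cost": rng.uniform(20, 200, size=200)})
data["Score"] = 2.0 + 0.1 * data["Cost"] + rng.normal(0, 1.0, size=200)
train = data.iloc[:150]
test = data.iloc[150:]

model = smf.ols("Score ~ Cost", data=train).fit()
predictions = model.predict(test)  # only the Cost column of test is used

# Root mean squared error on the held-out rows
rmse = float(np.sqrt(np.mean((test["Score"] - predictions) ** 2)))
print(f"Test RMSE: {rmse:.3f}")
```

A test RMSE far above the training error would point to overfitting.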

### Step 4:

**Diagnostics checklist:**

- Non-linearity
- Non-constant variance
- Deviations from normality
- Errors not iid
- Outliers
- Missing predictors

With the square root transformation applied (shown by the red dots), there is a clear pattern in the data points. It would be interesting to see what happens when we run the linear regression model on the transformed variable.
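The text does not say which variable was transformed. The sketch below assumes the square root is applied to the response and compares the fit before and after on synthetic data that was deliberately built to be linear on the square root scale. `r_squared` is a helper defined only for this example.

```python
import numpy as np


def r_squared(x, y):
    """R^2 of a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (intercept + slope * x)
    return 1.0 - residuals.var() / y.var()


# Synthetic data: sqrt(y) is linear in x, so y itself is curved
rng = np.random.default_rng(4)
x = rng.uniform(1, 100, size=200)
y = (1.0 + 0.2 * x + rng.normal(0, 0.3, size=200)) ** 2

r2_raw = r_squared(x, y)            # straight line through the curved data
r2_sqrt = r_squared(x, np.sqrt(y))  # straight line after the transformation

print(f"R^2 raw: {r2_raw:.3f}, R^2 after sqrt: {r2_sqrt:.3f}")
```

On data like this the transformed fit explains more variance, but whether that holds for the real dataset has to be checked against its own residual plots.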

### Step 5:

**Summary Statistics**

DistanceCovered(InKms) and the target variable Score have a negative correlation of −0.49. Cost has a strong positive relationship with the target, with a correlation coefficient of 0.96.

Let's look at the correlations of some features with Score.

- Some features should be dropped because they are only weakly correlated with Score, such as Height and Weight, with correlations of -0.190 and 0.00016, respectively.
- Some features, such as MinutestoGoalRatio and ShotsPerGame, are also correlated with each other, so keeping only one of them, ShotsPerGame, is enough. What happens if we add more than one of these features to the model? This is known as multicollinearity, and we will discuss it in more detail later.
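Both checks can be sketched with pandas and NumPy: correlations with Score for screening, and variance inflation factors (VIF) for multicollinearity. The data is synthetic, with MinutestoGoalRatio deliberately generated from ShotsPerGame, so the numbers say nothing about the real dataset. The VIF line relies on the identity that the VIFs equal the diagonal of the inverse of the predictors' correlation matrix.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in; column names follow the text, values are made up
rng = np.random.default_rng(5)
n = 300
shots = rng.normal(3.0, 1.0, size=n)
data = pd.DataFrame({
    "ShotsPerGame": shots,
    # Generated from ShotsPerGame on purpose, to mimic two collinear features
    "MinutestoGoalRatio": 60.0 - 10.0 * shots + rng.normal(0, 2.0, size=n),
    "Height": rng.normal(180.0, 7.0, size=n),  # unrelated to Score here
})
data["Score"] = 5.0 * shots + rng.normal(0, 2.0, size=n)

# Correlation of each feature with the target
corr_with_score = data.corr()["Score"].drop("Score")
print(corr_with_score)

# VIF_j = 1 / (1 - R_j^2), read off the inverse correlation matrix
predictors = data.drop(columns="Score")
inverse_corr = np.linalg.inv(predictors.corr().to_numpy())
vif = pd.Series(np.diag(inverse_corr), index=predictors.columns)
print(vif)
```

A common rule of thumb treats a VIF above 5 or 10 as a sign that a feature is largely explained by the others. In this synthetic example the two related features both get large values while Height stays near 1.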

### Step 6:

**Multiple Linear Regression Analysis Results**

Including the club feature significantly improved the model: we obtained an R² of 0.966, and both AIC and BIC dropped significantly.
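A sketch of how such a comparison could be run with the statsmodels formula interface, where `C(Club)` expands the categorical feature into dummy variables. The column name `Club`, the club codes, and the data are all hypothetical, so the resulting numbers are unrelated to the 0.966 reported above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data in which the club genuinely shifts the score
rng = np.random.default_rng(6)
n = 200
clubs = rng.choice(["CHE", "LIV", "MUN", "TOT"], size=n)  # hypothetical codes
club_effect = pd.Series(clubs).map({"CHE": 0.0, "LIV": 3.0, "MUN": -2.0, "TOT": 5.0})
data = pd.DataFrame({"Cost": rng.uniform(20, 200, size=n), "Club": clubs})
data["Score"] = 2.0 + 0.1 * data["Cost"] + club_effect + rng.normal(0, 1.0, size=n)

base = smf.ols("Score ~ Cost", data=data).fit()
with_club = smf.ols("Score ~ Cost + C(Club)", data=data).fit()

for name, fit in [("Cost only", base), ("Cost + Club", with_club)]:
    print(f"{name}: R2={fit.rsquared:.3f} AIC={fit.aic:.1f} BIC={fit.bic:.1f}")
```

Lower AIC and BIC indicate a better trade-off between fit and model size. R² alone never decreases when predictors are added, which is why the information criteria are worth reporting next to it.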

**Conclusion**

This project gives a full introduction to multiple linear regression using real-world data. It uses regression techniques in practice with Python. It helps beginners to get hands-on experience and a good grasp of basic ideas. These ideas include model evaluation, assumptions, and statistical measures. This project is a great way to learn about machine learning and data science because it offers a hands-on approach and in-depth study of regression. It gives students the skills they need to use these methods in a variety of real-life situations.