Time Series Forecasting Using Multiple Linear Regression Model
Welcome to the world of data analytics and predictive modeling. This project centers on understanding trends, relationships, and anomalies within a time series dataset. Techniques such as linear regression, ARIMA, and Z-score anomaly detection allow us to reach insightful and actionable conclusions.
Project Overview
This project analyzes time series across several sectors, with special attention to banking, telecom, and healthcare. It aims to forecast trends, map relationships, and flag unusual patterns in the data. To prepare the dataset, we handle missing values, scale the features, and create engineered variables such as lags, rolling statistics, and interaction terms. We use three forecasting models: simple linear regression, multiple linear regression, and ARIMA. Each offers a different insight, and the models are compared through RMSE along with visualizations of actual versus predicted values. Anomalies are also highlighted using Z-scores so that important data points are not missed.
The result is a complete project covering data engineering, feature creation, and modeling, knitted together with clear visualizations that tell a story rather than simply analyze the data.
Prerequisites
- Python Basics: Knowledge of Python programming and libraries like Pandas, NumPy, and Matplotlib.
- Time Series Concepts: Know what time series data, lagged features, and rolling statistics are (a short sketch follows this list).
- Machine Learning Basics: Familiarity with regression models and evaluation metrics such as RMSE.
- Visualization Skills: Create and decipher data visualizations using libraries like Seaborn and Matplotlib.
- Anomaly Detection: Basic understanding of Z-scores for finding unusual data.
- ARIMA Models: Familiarity with ARIMA-based time series analysis and forecasting.
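If lagged features and rolling statistics are new to you, the minimal sketch below builds both on a toy monthly series (the values are hypothetical, not project data):
import pandas as pd
# Toy monthly series (hypothetical values)
s = pd.Series([100, 110, 105, 120, 130, 125],
              index=pd.date_range("2020-01-31", periods=6, freq="M"))
toy = s.to_frame("value")
toy["lag_1"] = toy["value"].shift(1)                           # value one month earlier
toy["rolling_mean_3"] = toy["value"].rolling(window=3).mean()  # 3-month average
print(toy)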
Approach
The dataset will first be thoroughly explored for trends, patterns, and missing values. As pre-processing, the time column will be set as the index with a consistent monthly frequency, and gaps will be filled using forward filling to make the data suitable for robust modeling. Predictive features will then be developed by engineering lagged values, rolling statistics, and interactions between key variables.
Then we apply and compare three models: Simple Linear Regression, Multiple Linear Regression, and ARIMA. Each model is trained on appropriate features to predict the target variable, and performance is assessed using RMSE. We also use Z-scores for anomaly detection, which identifies outliers that deviate from normal trends. Finally, visualization plays a major role in giving a clear comparison between actual and predicted values while highlighting anomalies for actionable insights.
Workflow and Methodology
Workflow
- Data Preparation: Import and clean the dataset through index settings, missing value filling, and frequency handling.
- Feature Engineering: Create lagged, rolling, interaction, and polynomial features for improving prediction capabilities.
- Scaling and Encoding: Scale numerical data and apply one-hot encoding for categorical seasonal indicators.
- Model Training: Train the Simple Linear Regression, Multiple Linear Regression, and ARIMA models on the prepared data.
- Evaluation: Compare results with the help of RMSE and visualize the predictions against actual values.
- Anomaly Detection: Identify and analyze anomalies using Z-scores and mark them for further insights.
Methodology
- Analyze trends and correlations, along with missing values, with the help of statistical metrics and visualization.
- Find relevant features to improve prediction accuracy without incurring unnecessary complexity.
- Build and train regression models, complemented by ARIMA, to forecast the time series.
- Use RMSE to evaluate the dependability and accuracy of each model (see the sketch after this list).
- Use visual tools to effectively present predictions, trends, and anomalies.
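For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The minimal sketch below computes it by hand and with scikit-learn on toy numbers (hypothetical, not project results):
import numpy as np
from sklearn.metrics import mean_squared_error
# Toy actual and predicted values (hypothetical numbers)
actual = np.array([100.0, 110.0, 105.0])
predicted = np.array([98.0, 113.0, 104.0])
# RMSE = sqrt(mean((actual - predicted)^2))
rmse_manual = np.sqrt(np.mean((actual - predicted) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))
print(rmse_manual, rmse_sklearn)  # both ≈ 2.16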
Data Collection and Preparation
Data Collection:
In this project, we collected the dataset from a public repository. If you are looking to work on a real-world problem, you can get these kinds of datasets from publicly available repositories such as Kaggle, UCI Machine Learning Repository, or company-specific data. We will provide the dataset in this project so that you can work on the same dataset.
Data Preparation Workflow:
- Set the time column as the index and ensure a consistent monthly frequency.
- Fill missing values through forward filling to preserve data continuity.
- Create lagged, rolling, and interaction features to improve analysis and modeling.
- Standardize the numerical features to a common scale using StandardScaler.
- One-hot encode categorical data, such as the month, to represent seasonality.
- Introduce a time index feature to capture temporal trends in the dataset.
Code Explanation
STEP 1:
Mounting Google Drive
First, mount Google Drive to access the dataset that is stored in the cloud.
from google.colab import drive
drive.mount('/content/drive')
Import Libraries for Evaluation and Modeling
This code block imports the libraries required for data analysis, visualization, statistical modeling, and machine learning. It covers data manipulation (pandas, numpy), visualization (matplotlib, seaborn), time series analysis (seasonal_decompose, ARIMA), and preprocessing and evaluation (StandardScaler, LinearRegression, mean_squared_error). Warnings are also suppressed to keep the console output clean.
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from math import sqrt
from statsmodels.tsa.arima.model import ARIMA
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
Loading Data and Checking Shape
This code loads the Excel file with pd.read_excel. After loading the dataset, it displays the dataset's shape to check the number of rows and columns.
# Load the data
file_path = '/content/drive/MyDrive/New 90 Projects/Project_13/Data/CallCenterData.xlsx'
df = pd.read_excel(file_path)
df.shape
Previewing Data
This code displays the dataset's first few rows for a quick overview.
df.head()
Setting the Index and Visualizing Time Series
This code sets the 'month' column as the index with monthly frequency and handles missing values by forward filling. As exploratory data analysis (EDA), it then plots a time series line graph for each sector to show trends over time.
# Set the 'month' column as the index and set frequency to monthly
df.set_index('month', inplace=True)
df = df.asfreq('M')
# Check for missing values and handle them by forward filling
df.fillna(method='ffill', inplace=True)
# Exploratory Data Analysis (EDA) and Visualization
fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(18, 20))
fig.suptitle('Time Series Plots for Different Sectors')
for i, column in enumerate(df.columns):
    ax = axes[i//2, i%2]
    sns.lineplot(data=df, x=df.index, y=column, ax=ax, marker='o')
    ax.set_title(f"{column} Over Time")
    ax.set_xlabel('')
    ax.set_ylabel(column)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
Correlation Analysis
The heatmap produced in this section visualizes how features in the dataset correlate with one another, presenting the relationships as a color-coded matrix that highlights relevant connections and trends.
# Correlation matrix to understand relationships between features
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix of Features")
plt.show()
Decomposition of Time Series in the Banking Sector
This code applies an additive seasonal decomposition to the 'Banking' column to obtain the trend, seasonal, and residual components of the series, and plots them for easy interpretation of the level and its fluctuations.
# Assuming 'df' is your DataFrame with the 'Banking' column
decomposition = seasonal_decompose(df['Banking'], model='additive')
fig = decomposition.plot()
# Set the title and adjust the layout to prevent text overlap
fig.suptitle('Seasonal Decomposition of Banking Sector', fontsize=16)
fig.tight_layout(rect=[0, 0.03, 1, 0.95]) # Adjusts the layout for better spacing
plt.show()
Scaling Features for Uniformity
The following code uses StandardScaler to standardize the selected variables so that each has a mean of 0 and a standard deviation of 1. The scaled data is placed in a new DataFrame that retains the original index and column names.
# Additional Data Processing: Scaling Features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[["Healthcare", "Telecom", "Technology", "Insurance", "#noofchannels", "#ofphonelines"]])
df_scaled = pd.DataFrame(scaled_features, index=df.index, columns=["Healthcare", "Telecom", "Technology", "Insurance", "#noofchannels", "#ofphonelines"])
Creating Lagged Features
This code creates 1-month, 3-month, and 12-month lagged versions of the "Banking" column. These features incorporate previous values to improve prediction accuracy.
# Feature Engineering
# Lagged Features
df['Banking_lag1'] = df['Banking'].shift(1) # 1-month lag
df['Banking_lag3'] = df['Banking'].shift(3) # 3-month lag
df['Banking_lag12'] = df['Banking'].shift(12) # 12-month lag
Calculation of Rolling Statistics
This code computes the 3-month rolling mean and standard deviation of the 'Banking' column. These features help identify short-term trends and variability.
# Rolling Statistics
df['Banking_rolling_mean_3'] = df['Banking'].rolling(window=3).mean()
df['Banking_rolling_std_3'] = df['Banking'].rolling(window=3).std()
Add Seasonal Indicators
This code snippet extracts the month from the index and applies one-hot encoding to create seasonal indicators to capture the monthly variation in the data.
# Seasonal Indicators (One-hot encoding for month)
df['month'] = df.index.month
df = pd.get_dummies(df, columns=['month'], drop_first=True) # One-hot encode month
Creating a Feature for Time Index
The code adds a column called time_index, in which each row is assigned a number corresponding to its position in the sequence. This helps capture time-dependent trends in the data.
# Time-based Trends
df['time_index'] = np.arange(len(df))
Creating Interaction Features
This code creates an interaction feature by taking the product of the Telecom and Technology columns. This captures the combined effect of the two variables in the predictive modeling exercise.
# Interaction Features
df['Telecom_Technology_interaction'] = df['Telecom'] * df['Technology']
Generating Exponential and Polynomial Features
This code generates a squared feature for Banking and an exponential feature for Healthcare, which incorporates non-linear relationships into the data.
# Exponential or Polynomial Features
df['Banking_sq'] = df['Banking'] ** 2
df['Healthcare_exp'] = np.exp(df['Healthcare'])
Creating Difference Features
This code calculates the month-over-month difference of the Banking column so that changes can be analyzed over time.
# Difference or Change Features
df['Banking_diff'] = df['Banking'].diff(1) # Monthly difference
Splitting Data into Train and Test Sets
This code first drops the rows with missing values introduced by the lagged and rolling features, then splits the dataset chronologically, keeping the last 22 months as the testing set.
# Train-test split (last 22 months for testing)
test_size = 22
df_train = df.dropna()[:-test_size] # Drop NA values due to lagged and rolling features
df_test = df.dropna()[-test_size:]
Data Preparation for Simple Linear Regression
This code takes #noofchannels and #ofphonelines as predictors for the Banking target variable and prepares the training and testing data for regression analysis.
# Simple Linear Regression with #noofchannels and #ofphonelines
X_train_slr = df_train[["#noofchannels", "#ofphonelines"]]
y_train_slr = df_train["Banking"]
X_test_slr = df_test[["#noofchannels", "#ofphonelines"]]
y_test_slr = df_test["Banking"]
Building the Linear Regression Model
This code initializes and trains a simple linear regression model on the training data (#noofchannels, #ofphonelines) to predict the Banking values.
slr_model = LinearRegression()
slr_model.fit(X_train_slr, y_train_slr)
Making Predictions with Linear Regression
This code generates predictions for both training and testing datasets using the trained linear regression model.
# Predictions for SLR
y_train_slr_pred = slr_model.predict(X_train_slr)
y_test_slr_pred = slr_model.predict(X_test_slr)
Assessing Performance of Linear Regression Model: RMSE
This code calculates the RMSE (Root Mean Squared Error) between actual and predicted values on both the training and testing sets to assess the model's performance.
# Calculate RMSE for Simple Linear Regression
train_rmse_slr = sqrt(mean_squared_error(y_train_slr, y_train_slr_pred))
test_rmse_slr = sqrt(mean_squared_error(y_test_slr, y_test_slr_pred))
Preparing Data for Multiple Linear Regression
This code selects multiple features, including lagged values, rolling statistics, interactions, and other engineered variables, to model the Banking target variable.
# Multiple Linear Regression with additional features
features = ["Healthcare", "Telecom", "Technology", "Insurance", "#noofchannels", "#ofphonelines",
"Banking_lag1", "Banking_lag3", "Banking_rolling_mean_3", "Telecom_Technology_interaction",
"time_index", "Banking_diff", "Banking_sq"]
Splitting Dataset for Multiple Linear Regression
The selected features are assigned to X_train_mlr and X_test_mlr, and the target variable Banking is assigned to y_train_mlr and y_test_mlr, respectively.
X_train_mlr = df_train[features]
y_train_mlr = df_train["Banking"]
X_test_mlr = df_test[features]
y_test_mlr = df_test["Banking"]
Training Multiple Linear Regression Model
This code trains a multiple linear regression model on the training dataset using all the selected features and the Banking target variable.
mlr_model = LinearRegression()
mlr_model.fit(X_train_mlr, y_train_mlr)
Making Predictions with Multiple Linear Regression
This code generates predictions for both the training and testing datasets using the trained multiple linear regression model.
# Predictions for MLR
y_train_mlr_pred = mlr_model.predict(X_train_mlr)
y_test_mlr_pred = mlr_model.predict(X_test_mlr)
Assessing the Multiple Linear Regression Model with RMSE
This code calculates the root mean squared error (RMSE) on the training and testing sets to evaluate the multiple linear regression model.
# Calculate RMSE for Multiple Linear Regression
train_rmse_mlr = sqrt(mean_squared_error(y_train_mlr, y_train_mlr_pred))
test_rmse_mlr = sqrt(mean_squared_error(y_test_mlr, y_test_mlr_pred))
An ARIMA Model for Time Series Forecasting
This trains an ARIMA(1,1,1) model on the training data to forecast Banking values over the test period and evaluates the model with RMSE on the test data.
# ARIMA Model for comparison
arima_model = ARIMA(df_train['Banking'], order=(1, 1, 1))
arima_model_fit = arima_model.fit()
arima_pred = arima_model_fit.forecast(steps=test_size)
arima_rmse = sqrt(mean_squared_error(y_test_slr, arima_pred))
Detection of Anomalies Using Z-score
The following code computes the Z-scores for the Banking column to detect outliers with absolute Z-scores above 3. The anomalous rows are captured in a separate DataFrame.
# Anomaly Detection using Z-scores for 'Banking'
df['zscore_Banking'] = np.abs(stats.zscore(df['Banking']))
anomalies = df[df['zscore_Banking'] > 3] # Assuming anomalies have z-score > 3
Visualizing Actual vs Predicted Plots with Anomaly Detection
This code plots the actual Banking values against the predictions from the Simple Linear Regression, Multiple Linear Regression, and ARIMA models. Anomalies are marked in red for easier interpretation.
# Comparative Plot for Actual vs Predicted Values (SLR vs MLR vs ARIMA)
plt.figure(figsize=(20, 6))
plt.plot(df_test.index, y_test_slr, label='Actual Banking')
plt.plot(df_test.index, y_test_slr_pred, label='Predicted Banking (Simple LR)', linestyle='--')
plt.plot(df_test.index, y_test_mlr_pred, label='Predicted Banking (Multiple LR)', linestyle='--')
plt.plot(df_test.index, arima_pred, label='Predicted Banking (ARIMA)', linestyle='--')
plt.scatter(anomalies.index, anomalies['Banking'], color='red', label='Anomalies')
plt.title('Actual vs Predicted Banking Sector (SLR vs MLR vs ARIMA) with Anomalies')
plt.xlabel('Date')
plt.ylabel('Banking')
plt.legend()
plt.show()
Comparison of Model Performance by RMSE
The given code prints the RMSE for each model: Simple Linear Regression and Multiple Linear Regression (train and test sets) and ARIMA (test set). This makes it easy to compare the models' accuracy directly.
# Output RMSE comparison
print("Simple Linear Regression - Train RMSE:", train_rmse_slr)
print("Simple Linear Regression - Test RMSE:", test_rmse_slr)
print("Multiple Linear Regression - Train RMSE:", train_rmse_mlr)
print("Multiple Linear Regression - Test RMSE:", test_rmse_mlr)
print("ARIMA Model - Test RMSE:", arima_rmse)
Conclusion
This project shows how data-driven insights can explain time series trends and support accurate predictions. We present meaningful results for the Linear Regression and ARIMA models, along with anomaly detection, which adds assurance against missing significant patterns. Through visualizations and model comparisons, a structured workflow transforms raw data into actionable insights, supporting better decision-making across the various sectors.
Challenges New Coders Might Face
**Challenge**: Handling missing data
**Solution**: Forward filling or interpolation preserves smooth transitions in the data without losing trends.
**Challenge**: Time series seasonality
**Solution**: Decompose the time series into trend, seasonal, and residual components using an additive model.
**Challenge**: Overfitting models
**Solution**: Select features carefully and evaluate the model with RMSE to keep the bias-variance balance in check.
**Challenge**: Scaling and encoding
**Solution**: Standardize numerical features and apply one-hot encoding to categorical variables such as seasonality indicators.
**Challenge**: Anomaly detection complexity
**Solution**: Use Z-scores for statistical anomaly detection and visually verify the results by plotting the anomalies.
**Challenge**: Interpreting model output
**Solution**: RMSE is a commonly used metric, and plotting actual versus predicted values makes the results easier to understand.
FAQ
Question 1: What is time series analysis in data science?
Answer: Time series analysis examines data points collected at regular intervals to identify temporal patterns, trends, and seasonality, which supports understanding and forecasting in sectors like banking and healthcare.
Question 2: What are lagged features in time series forecasting?
Answer: Lagged features are past values of a variable, supplied as predictors so that a model can learn from historical patterns to forecast the variable's future values.
Question 3: How do you treat missing values in time series datasets?
Answer: In most cases, missing values in time series data are handled by forward filling, backward filling, or interpolation to maintain continuity between periods.
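For illustration, the pandas one-liners below show all three options on a toy series with two gaps (hypothetical values):
import numpy as np
import pandas as pd
# Toy monthly series with two missing values (hypothetical data)
s = pd.Series([1.0, np.nan, np.nan, 4.0],
              index=pd.date_range("2020-01-31", periods=4, freq="M"))
print(s.ffill())        # forward fill: 1.0, 1.0, 1.0, 4.0
print(s.bfill())        # backward fill: 1.0, 4.0, 4.0, 4.0
print(s.interpolate())  # linear interpolation: 1.0, 2.0, 3.0, 4.0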
Question 4: What is the need for seasonal decomposition in this project?
Answer: In seasonal decomposition, the time series is decomposed into trend, seasonal, and residual components and helps to analyze more effectively in identifying patterns from those components.
Question 5: Which is the best model for forecasting in time series: Linear Regression or ARIMA?
Answer: Linear regression works best with engineered features, while ARIMA is better suited to capturing time-dependent patterns. The best choice depends on the dataset.
Question 6: How is Z-score anomaly detection applicable to time series?
Answer: A Z-score measures how many standard deviations a value lies above or below the mean (Z = (x − mean) / std); values with absolute Z-scores above 3 are typically flagged as anomalies.
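As a quick illustration with hypothetical numbers (note that on a very small sample a single outlier inflates the standard deviation itself, so a lower threshold is used here than the project's threshold of 3):
import numpy as np
from scipy import stats
# Toy data with one obvious outlier (hypothetical values)
values = np.array([10.0, 11.0, 9.0, 10.0, 12.0, 9.0, 11.0, 10.0, 10.0, 48.0])
z = np.abs(stats.zscore(values))
print(values[z > 2])  # flags the outlier 48.0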