Time Series Forecasting with ARIMA and SARIMAX Models in Python
In this project, we work on time series forecasting, a powerful way to understand trends over time and predict future values. We use monthly data covering industries such as Healthcare, Banking, and Telecom. With the help of ARIMA, ARIMAX, and SARIMAX models, we identify patterns in the data, test candidate models, and produce forecasts.
Are you intimidated by terms like ACF plots, stationarity, or residuals? Don't worry, we'll keep it fun and simple for you. By the end, it will all come together in one neat package showing how such models can be built to forecast the future.
Project Overview
This project starts by importing and preparing the data. We clean the data, handle missing values, and set the date column as the index, establishing it as a time series. We also set the frequency to monthly using .asfreq('M'). Next, we explore the data visually, plotting trends for features like Healthcare, Banking, and Telecom. We also generate random white noise to understand randomness and how it compares with actual data patterns.
The real fun starts when we test the stationarity of the series with the ADF test. This lets us decide whether the series needs to be transformed before modeling. We then take three different approaches: ARIMA, ARIMAX, and SARIMAX. Finally, we compile all the models into a table to find out which performs best based on metrics like AIC and Log Likelihood. By the end, we will have built a solid forecasting model for predicting the future of industries like Healthcare.
Prerequisites
- Python programming and knowledge of Pandas, NumPy, and Matplotlib libraries.
- Prior exposure to time series data and key concepts such as trends, seasonality, and stationarity.
- Understanding of trend and seasonal patterns and of forecasting with ARIMA, ARIMAX, and SARIMAX.
- Some background in statistical computing and model assessment criteria such as AIC and Log Likelihood.
- Familiarity with using Jupyter Notebooks for code execution and result visualization.
Approach
In this project, a systematic procedure is used for forecasting the time series. We import and clean the dataset, creating a date-indexed structure with monthly frequency. After handling missing values, we visualize the data to understand its trends and patterns. Next, an ADF test is performed to check for stationarity, and transformations such as differencing are applied if needed. Afterward, we fit several models: starting with ARIMA, then incorporating an external variable such as Banking to create an ARIMAX model, and finally exploring seasonal components with SARIMAX. We evaluate all the models using metrics such as AIC and Log Likelihood to compare them and find the best model for forecasting the Healthcare sector and others.
Workflow and Methodology
Workflow
- Data Preparation: Import and cleanse the data, establishing the date column as an index while checking for missing values.
- Data Visualization: Visualize the data to observe trends and patterns in features such as Healthcare and Banking.
- Stationarity Check: ADF test performed for stationarity with data transformation, if needed.
- Model Fitting: Fit and evaluate ARIMA, ARIMAX, and SARIMAX.
- Model Comparison: Compare the models using AIC and Log Likelihood.
Methodology
- We use ARIMA to model univariate time series data and test various configurations.
- Use ARIMAX to incorporate external variables to forecast better.
- Seasonality and trend should be accounted for in the data using SARIMAX.
- Use ACF plots to analyze model residuals to see if they are random, and check model fit.
- Identify the best model using AIC and Log Likelihood parameters for better predictions.
Data Collection and Preparation
Data Collection:
In this project, we collected the dataset from a public repository. If you are looking to work on a real-world problem, you can get these kinds of datasets from publicly available repositories such as Kaggle, UCI Machine Learning Repository, or company-specific data. We will provide the dataset in this project so that you can work on the same dataset.
Data Preparation Workflow:
- Using pandas load the dataset and inspect the first few rows.
- Set the date column (e.g., "month") as the index for time series analysis.
- Use isna().sum() to check for missing values in the data.
- Handle missing values based on the situation, either by filling or dropping them.
- Use .asfreq('M') to set the frequency of data to monthly.
- Ensure that the data is in the right format for time series analysis. A compact sketch of these steps follows this list.
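Below is a compact, illustrative sketch of these preparation steps. The file name and the "month" column are assumptions borrowed from this project; adapt them to your dataset.
# Minimal data-preparation sketch (file path and column names are assumptions)
import pandas as pd
df = pd.read_excel("Data.xlsx")            # load the raw file
df["month"] = pd.to_datetime(df["month"])  # parse the date column
df = df.set_index("month").asfreq("M")     # date index with monthly frequency
print(df.isna().sum())                     # inspect missing values per column
df = df.ffill()                            # one option: forward-fill any gaps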
Code Explanation
STEP 1:
Mounting Google Drive
First, mount Google Drive to access the data stored in the cloud.
from google.colab import drive
drive.mount('/content/drive')
Install Required Python Libraries
This code installs the essential Python libraries for data analysis, visualization, and statistical modeling: Pandas, NumPy, Seaborn, Matplotlib, SciPy, and Statsmodels. For automated order selection you can also install pmdarima, which provides the auto_arima function; its install line is left commented out below. Together these libraries provide the foundation for data handling, visualization, and time series modeling.
!pip install pandas
!pip install numpy
!pip install seaborn
!pip install matplotlib
!pip install scipy
!pip install statsmodels
# random ships with Python and pylab with Matplotlib; neither needs a separate install
# !pip install pmdarima  # uncomment for auto_arima support
Importing the Required Libraries and Setting Configurations
This code imports the libraries needed for data manipulation, visualization, and time series analysis: Pandas, NumPy, Seaborn, Matplotlib, SciPy, and Statsmodels. It also configures warning filters to suppress unnecessary warnings, keeping the output clean during execution.
# import the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot
sns.set_theme(style="darkgrid")
import scipy.stats
from random import seed
from random import random
from matplotlib import pyplot
import pylab
import statsmodels.graphics.tsaplots as sgt
import statsmodels.tsa.stattools as sts
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from scipy.stats.distributions import chi2
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.statespace.sarimax import SARIMAX
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="Covariance matrix calculated using the outer product of gradients")
warnings.filterwarnings("ignore", message="Covariance matrix is singular or near-singular")
Importing the Dataset
This code loads the dataset from an Excel file using Pandas, making it ready for data analysis and processing.
# importing the data
raw_csv_data = pd.read_excel("/content/drive/MyDrive/New 90 Projects/Project_14/Dataset/Data.xlsx")
Creating a Copy Dataset
This code creates a copy of the original dataset as df_comp. This ensures the raw data remains unaltered for future reference.
# check point of data
df_comp = raw_csv_data.copy()
STEP 2:
Previewing Data
This code displays the dataset's first few rows for a quick overview.
df_comp.head()
Generating Descriptive Statistics
This code computes and displays descriptive statistics for the dataset.
# check for data description
df_comp.describe()
Checking Missing Values
This code calculates the total number of null values in every column of the DataFrame, helping identify missing values for further processing of the data.
# check for null values
df_comp.isna().sum()
Describing the month Field
This code provides a statistical summary of the month column in the dataset, which is treated as a datetime field.
# taken as a date time field
df_comp.month.describe()
Setting Month as Index
This code converts the month column to timestamps, sets it as the DataFrame index, and prepares the data for monthly time series analysis.
# ensure month is parsed as datetime, then set it as the index
df_comp.month = pd.to_datetime(df_comp.month)
df_comp.set_index("month", inplace=True)
Previewing Data
This code displays the dataset's first few rows for a quick overview.
# check for the first 5 rows
df_comp.head()
STEP 3:
Setting Monthly Frequency
This code sets the dataset's frequency to monthly. This ensures time-series consistency for analysis.
# setting the frequency as monthly
df_comp.asfreq('M')
Assigning Monthly Frequency
This code updates the dataset to have a monthly frequency by assigning it directly to df_comp.
# setting the frequency as monthly
df_comp = df_comp.asfreq('M')
Checking Missing Values
This code calculates the overall number of null values present in every column of the DataFrame. This helps in identifying null values for further processing of the data.
# checking for the null values
df_comp.isna().sum()
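Note that .asfreq('M') inserts NaN rows for any month missing from the original index. If the check above reports gaps, a forward fill is one common choice for slowly varying series; this line is illustrative rather than part of the original notebook.
# If asfreq('M') introduced gaps, forward-fill them (one possible strategy)
df_comp = df_comp.ffill()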
Getting an Insight into the Feature of Healthcare
The following code creates a purple line plot that shows the Healthcare feature over time, along with the trend that can be observed from the data.
# Plotting the Healthcare feature with a different color for visualization
plt.figure(figsize=(20, 5))
df_comp['Healthcare'].plot(color='purple', title="Healthcare Feature Visualization")
plt.xlabel("Date")
plt.ylabel("Healthcare")
plt.show()
Getting an Insight into the Features of Telecom
The following code creates a blue line plot that shows the Telecom feature over time, along with the trend that can be observed from the data.
# Plotting the Telecom feature with a specified color
plt.figure(figsize=(20, 5))
df_comp['Telecom'].plot(color='blue', title="Telecom Feature Visualization")
plt.xlabel("Date")
plt.ylabel("Telecom")
plt.show()
Getting an Insight into the Feature of Banking
The following code creates a green line plot that shows the Banking feature over time, along with the trend that can be observed from the data.
# Plotting the Banking feature with a specific color
plt.figure(figsize=(20, 5))
df_comp['Banking'].plot(color='green', title="Banking Feature Visualization")
plt.xlabel("Date")
plt.ylabel("Banking")
plt.show()
Getting an Insight into the Feature of Technology
The following code creates an orange line plot that shows the Technology feature over time, along with the trend that can be observed from the data.
# Plotting the Technology feature with a specific color
plt.figure(figsize=(20, 5))
df_comp['Technology'].plot(color='orange', title="Technology Feature Visualization")
plt.xlabel("Date")
plt.ylabel("Technology")
plt.show()
Getting an Insight into the Feature of Insurance
The following code creates a red line plot that shows the Insurance feature over time, along with the trend that can be observed from the data.
# Plotting the Insurance feature with a specific color
plt.figure(figsize=(20, 5))
df_comp['Insurance'].plot(color='red', title="Insurance Feature Visualization")
plt.xlabel("Date")
plt.ylabel("Insurance")
plt.show()
STEP 4:
Healthcare White Noise Generation
The code generates white noise data for the Healthcare attribute based on its mean and standard deviation parameters.
# generating a white noise data for the Healthcare attribute
wn = np.random.normal(loc=df_comp.Healthcare.mean(), scale=df_comp.Healthcare.std(), size=len(df_comp))
Adding White Noise to the Dataset
This code adds the generated white noise as a new column, wn, in the dataset for further analysis.
df_comp["wn"] = wn
Describing the wn Field
This code provides a statistical summary of the wn column in the dataset.
df_comp.wn.describe()
Visualizing the White Noise Time-Series
This code creates a purple line plot of the white noise data over time, illustrating its randomness.
# Plotting the white noise time-series with a specific color
plt.figure(figsize=(20, 5))
df_comp['wn'].plot(color='purple', title="White Noise Time-Series")
plt.xlabel("Date")
plt.ylabel("White Noise")
plt.show()
Autocorrelation Plot for White Noise:
White noise is purely random, with no correlation between observations; the autocorrelation plot below confirms this by showing no significant correlations at any lag.
# Creating an autocorrelation plot for white noise series
plt.figure(figsize=(10, 5))
autocorrelation_plot(df_comp['wn'], color='#FA4032')
plt.title("Autocorrelation Plot for White Noise Series")
plt.show()
Autocorrelation Plot for the Healthcare Feature
It generates an autocorrelation plot for the Healthcare feature, showing how the series correlates with itself across time lags.
# Plotting the autocorrelation for the 'Healthcare' feature
plt.figure(figsize=(10, 5))
autocorrelation_plot(df_comp['Healthcare'], color='#FF77B7')
plt.title("Autocorrelation Plot for Healthcare Feature")
plt.show()
ACF Plot for White Noise
This code generates an autocorrelation function (ACF) plot for the white noise series, showing minimal correlations at all lags.
sgt.plot_acf(df_comp.wn, zero = False, lags = 40, color='purple')
plt.title("ACF of White Noise (WN)",size=20)
plt.show()
ACF Plot for Healthcare
This code generates an autocorrelation function (ACF) plot for the Healthcare column, revealing how strongly the series correlates with its own past values.
sgt.plot_acf(df_comp.Healthcare, zero = False, lags = 40, color='Red')
plt.title("ACF Of Healthcare",size=20)
plt.show()
Random Walk Series Visualization
This code creates a random walk series and plots the first 132 steps, showing stepwise fluctuations in teal.
# Initialize the random walk
walk = [99]
noise1 = []
# Generate random noise and apply it to create the random walk
for i in range(1900):
    noise = -1 if np.random.random() < 0.5 else 1
    noise1.append(noise)
    walk.append(walk[-1] + noise)
# Plotting the first 132 steps of the random walk with a specified color
plt.figure(figsize=(20, 5))
plt.plot(walk[:132], color='teal')
plt.title("Random Walk Series", size=20)
plt.xlabel("Steps")
plt.ylabel("Value")
plt.show()
Visualization of Random Noise Series
This code plots the first 150 steps of the random noise series in purple, showing its stepwise randomness.
# Plotting the first 150 steps of the noise series with a specified color
plt.figure(figsize=(20, 5))
plt.plot(noise1[:150], color='purple')
plt.title("Random Noise Series", size=20)
plt.xlabel("Steps")
plt.ylabel("Noise Value")
plt.show()
Autocorrelation Plot for a Random Walk
The following code draws the autocorrelation plot for the random walk series, revealing very strong correlations at all lags.
# Plotting the autocorrelation plot for the random walk series with a specified color
plt.figure(figsize=(10, 5))
autocorrelation_plot(walk, color='blue')
plt.title("Autocorrelation Plot for Random Walk Series")
plt.xlabel("Lag")
plt.ylabel("Autocorrelation")
plt.show()
Augmented Dickey-Fuller Test for Stationarity
This code performs the ADF test on the white noise series to check for stationarity in the data.
# ADF test for stationarity
sts.adfuller(df_comp.wn)
Augmented Dickey-Fuller Test for Healthcare
The ADF test is applied to the Healthcare feature to show whether it is stationary.
# ADF test for stationarity
sts.adfuller(df_comp.Healthcare)
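Because sts.adfuller returns a raw tuple, the output above can be hard to read. The helper below is a convenience wrapper, not part of the original notebook, that unpacks the test statistic, p-value, and critical values.
# Readable wrapper around the ADF test output (illustrative helper)
def report_adf(series, name):
    stat, pvalue, usedlag, nobs, crit, icbest = sts.adfuller(series)
    print(f"{name}: ADF statistic = {stat:.3f}, p-value = {pvalue:.3f}")
    for level, value in crit.items():
        print(f"  critical value ({level}): {value:.3f}")
    print("  -> stationary at the 5% level" if pvalue < 0.05 else "  -> non-stationary at the 5% level")
report_adf(df_comp.wn, "White noise")
report_adf(df_comp.Healthcare, "Healthcare")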
QQ Plot for White Noise
This code creates a QQ plot comparing the distribution of the white noise series to the normal distribution, with the reference line drawn in red.
# Creating the QQ plot with a specified color for the line
plt.figure(figsize=(10, 5))
scipy.stats.probplot(df_comp['wn'], plot=pylab)
pylab.gca().get_lines()[1].set_color('red') # Set QQ line color to red
plt.title("QQ plot for White Noise")
pylab.show()
QQ Plot for Healthcare
This creates a QQ plot for the Healthcare feature comparing its distribution against a normal distribution, with the line drawn in blue.
# Creating the QQ plot with a specified color for the line
plt.figure(figsize=(10, 5))
scipy.stats.probplot(df_comp['Healthcare'], plot=pylab)
pylab.gca().get_lines()[1].set_color('blue') # Set QQ line color to blue
plt.title("QQ plot for Healthcare")
pylab.show()
Additive Decomposition of Healthcare
This code performs an additive decomposition of the Healthcare feature into four components: observed, trend, seasonal, and residual, each drawn in a different color.
# Performing additive decomposition
additive = seasonal_decompose(df_comp['Healthcare'], model="additive")
# Plotting decomposition components with specific colors
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(15, 10), sharex=True)
# Observed
additive.observed.plot(ax=ax1, color='blue')
ax1.set_ylabel("Observed")
ax1.set_title("Additive Decomposition of Healthcare")
# Trend
additive.trend.plot(ax=ax2, color='green')
ax2.set_ylabel("Trend")
# Seasonal
additive.seasonal.plot(ax=ax3, color='orange')
ax3.set_ylabel("Seasonal")
# Residual
additive.resid.plot(ax=ax4, color='purple')
ax4.set_ylabel("Residual")
plt.xlabel("Date")
plt.show()
Multiplicative Decomposition of Healthcare
This code performs a multiplicative decomposition of the Healthcare feature into four components: observed, trend, seasonal, and residual, each drawn in a different color.
# Multiplicative decomposition for Healthcare feature
multiplicative = seasonal_decompose(df_comp['Healthcare'], model="multiplicative")
# Plotting decomposition components with specific colors
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(15, 10), sharex=True)
# Observed
multiplicative.observed.plot(ax=ax1, color='blue')
ax1.set_ylabel("Observed")
ax1.set_title("Multiplicative Decomposition of Healthcare")
# Trend
multiplicative.trend.plot(ax=ax2, color='green')
ax2.set_ylabel("Trend")
# Seasonal
multiplicative.seasonal.plot(ax=ax3, color='orange')
ax3.set_ylabel("Seasonal")
# Residual
multiplicative.resid.plot(ax=ax4, color='purple')
ax4.set_ylabel("Residual")
plt.xlabel("Date")
plt.show()
STEP 5:
Fitting and Predicting using Exponential Smoothing
This code will fit the Exponential Smoothing model to the Healthcare data and then give predictions for the entire dataset.
hw_model = ExponentialSmoothing(df_comp.Healthcare.tolist())
model_fit = hw_model.fit()
# make prediction
yhat = model_fit.predict(1, len(df_comp))
Actual vs Predicted in Holt-Winters Model
It plots the actual and predicted values of the Healthcare feature from the Holt-Winters model, using different markers and colors to distinguish them.
# Plotting actual vs predicted for Holt-Winters model on Healthcare
plt.figure(figsize=(20, 5))
plt.plot(df_comp['Healthcare'].tolist(), color='blue', linestyle='-', marker='o', markersize=4, label="Actual")
plt.plot(yhat.tolist(), color='red', linestyle='--', marker='x', markersize=5, label="Predicted")
plt.title("Holt-Winters Model Prediction vs Actual for Healthcare", size=18)
plt.xlabel("Time")
plt.ylabel("Healthcare")
plt.legend(["Actual", "Predicted"], loc="upper left")
plt.grid(True)
plt.show()
STEP 6:
Specifying the LLR Test Function
This function will perform the Log-Likelihood Ratio (LLR) test that compares the fit of two models and will return a p-value for evaluating their statistical significance.
# define a function for the LLR test
def LLR_test(mod_1, mod_2, DF=1):
    L1 = mod_1.fit().llf
    L2 = mod_2.fit().llf
    LR = 2 * (L2 - L1)
    p = chi2.sf(LR, DF).round(3)
    return p
First-order differencing for Integration
It calculates the first-order difference of the Healthcare feature to make it stationary and stores the result in a new column delta_1_Healthcare.
# integrating by differencing once
df_comp["delta_1_Healthcare"] = df_comp.Healthcare.diff(1)
df_comp.delta_1_Healthcare
Stationarity Check After Differencing
Applying the ADF test to the first-order differenced Healthcare series shows whether it has become stationary.
# Check for stationarity
sts.adfuller(df_comp.delta_1_Healthcare[1:])
STEP 7:
An ARIMA Model Fitted on Healthcare
The following code fits an ARIMA(1,1,1) model to the Healthcare dataset and outputs a detailed summary of model parameters and performance.
# Fitting an ARIMA model on the Healthcare series with order (1,1,1)
model_ar_1_i_1_ma_1 = ARIMA(df_comp['Healthcare'], order=(1,1,1))
results_ar_1_i_1_ma_1 = model_ar_1_i_1_ma_1.fit()
# Displaying the summary in a table format
print(results_ar_1_i_1_ma_1.summary())
ACF Plot for ARIMA Residuals
This code calculates and plots the autocorrelation function (ACF) of the residuals from the ARIMA(1,1,1) model to check for any remaining patterns.
df_comp['res_ar_1_i_1_ma_1'] = results_ar_1_i_1_ma_1.resid
sgt.plot_acf(df_comp.res_ar_1_i_1_ma_1, zero = False, lags = 40, color='red')
plt.title("ACF Of Residuals for ARIMA(1,1,1)",size=20)
plt.show()
Fitting Additional ARIMA Models
This code fits three ARIMA models with different orders: ARIMA(1,1,2), ARIMA(2,1,1), and ARIMA(2,1,2), to evaluate and compare their performance on the Healthcare data.
model_ar_1_i_1_ma_2 = ARIMA(df_comp.Healthcare, order=(1,1,2))
results_ar_1_i_1_ma_2 = model_ar_1_i_1_ma_2.fit()
model_ar_2_i_1_ma_1 = ARIMA(df_comp.Healthcare, order=(2,1,1))
results_ar_2_i_1_ma_1 = model_ar_2_i_1_ma_1.fit()
model_ar_2_i_1_ma_2 = ARIMA(df_comp.Healthcare, order=(2,1,2))
results_ar_2_i_1_ma_2 = model_ar_2_i_1_ma_2.fit()
Comparing ARIMA Models
This script outputs log-likelihood (LL) and Akaike Information Criterion (AIC) for ARIMA(1,1,2), ARIMA(2,1,1), and ARIMA(2,1,2) models and finds the model which best fits the data.
print("ARIMA(1,1,2): \t LL = ", results_ar_1_i_1_ma_2.llf, "\t AIC = ", results_ar_1_i_1_ma_2.aic)
print("ARIMA(2,1,1): \t LL = ", results_ar_2_i_1_ma_1.llf, "\t AIC = ", results_ar_2_i_1_ma_1.aic)
print("ARIMA(2,1,2): \t LL = ", results_ar_2_i_1_ma_2.llf, "\t AIC = ", results_ar_2_i_1_ma_2.aic)
The LLR Test for the Comparison of Models
This code runs the LLR test to compare the ARIMA(1,1,1) and ARIMA(2,1,2) models with DF=2, returning a p-value that tells us whether the added complexity improves the model fit.
# Check with LLR test
print("\nLLR test p-value = " + str(LLR_test(model_ar_1_i_1_ma_1, model_ar_2_i_1_ma_2, DF=2)))
The LLR Test for the Comparison of Models
This code runs the LLR test to compare the ARIMA(1,1,1) and ARIMA(2,1,1) models with DF=1, returning a p-value that tells us whether the added complexity improves the model fit.
# Check with LLR test
print("\nLLR test p-value = " + str(LLR_test(model_ar_1_i_1_ma_1, model_ar_2_i_1_ma_1, DF=1)))
The LLR Test for the Comparison of Models
This code runs the LLR test to compare the ARIMA(2,1,1) and ARIMA(2,1,2) models with DF=1, returning a p-value that tells us whether the added complexity improves the model fit.
# Check with LLR test
print("\nLLR test p-value = " + str(LLR_test(model_ar_2_i_1_ma_1, model_ar_2_i_1_ma_2, DF=1)))
ACF Plot for Residuals of ARIMA(2,1,1)
This script computes and generates the ACF for residuals from an ARIMA(2,1,1) model to check whether there exist any patterns or correlations in the residuals.
# Residual ARIMA(2,1,1)
df_comp['res_ar_2_i_1_ma_1'] = results_ar_2_i_1_ma_1.resid
sgt.plot_acf(df_comp.res_ar_2_i_1_ma_1, zero = False, lags = 40, color='#FF77B7')
plt.title("ACF Of Residuals for ARIMA(2,1,1)",size=20)
plt.show()
Residuals Plot for ARIMA(2,1,1) Model
This code plots the residuals obtained from the ARIMA(2,1,1) model against time with an orange line to visualize any patterns or randomness in them.
# Plotting residuals from the ARIMA(2,1,1) model
plt.figure(figsize=(20, 5))
df_comp['res_ar_2_i_1_ma_1'].plot(color='#FF8000', title="Residuals of ARIMA(2,1,1) Model")
plt.xlabel("Time")
plt.ylabel("Residuals")
plt.show()
Comparison of Predictions on ARIMA Models with Real Healthcare Data
This code generates in-sample predictions from the fitted ARIMA(2,1,1) model and plots them against the actual Healthcare values, with the actual series in blue and the predictions in red, so their alignment can be assessed.
# In-sample predictions from the fitted ARIMA(2,1,1) model
yhat = results_ar_2_i_1_ma_1.predict(start=1, end=len(df_comp))
plt.figure(figsize=(20, 5))
plt.plot(df_comp['Healthcare'].tolist(), color='blue', linestyle='-', marker='o', markersize=4, label="Actual")
plt.plot(yhat.tolist(), color='red', linestyle='--', marker='x', markersize=5, label="Predicted")
plt.title("ARIMA Model Prediction vs Actual Healthcare", size=18)
plt.xlabel("Time")
plt.ylabel("Healthcare")
plt.legend(["Actual", "Predicted"], loc="upper left")
plt.grid(True)
plt.show()
Second-Order Differencing for Healthcare
This code computes a further difference of the Healthcare feature to help stabilize the series, storing the result in a new column, delta_2_Healthcare. Strictly speaking, diff(2) computes the lag-2 difference (each value minus the value two months earlier) rather than applying first differencing twice; the sketch after the code shows the textbook second-order difference for comparison. The ARIMA(1,2,1) model fitted below performs true second-order differencing internally via d=2.
df_comp["delta_2_Healthcare"] = df_comp.Healthcare.diff(2)
Viewing Second-Order Differenced Data
This code displays the first few values of the delta_2_Healthcare column.
df_comp["delta_2_Healthcare"].head()
ADF Test on Second-Order Differenced Data
This code performs the ADF test on the delta_2_Healthcare data to verify its stationarity after second-order differencing.
# check for adfuller
sts.adfuller(df_comp.delta_2_Healthcare[2:])
Fitting ARIMA(1,2,1) Model in Healthcare
The following code fits an ARIMA(1,2,1) model to the Healthcare series and provides a detailed summary of the model's parameters and performance metrics.
# Fitting the ARIMA model on the Healthcare series with order (1,2,1)
model_ar_1_i_2_ma_1 = ARIMA(df_comp['Healthcare'], order=(1,2,1))
results_ar_1_i_2_ma_1 = model_ar_1_i_2_ma_1.fit()
# Displaying the summary in table format
print(results_ar_1_i_2_ma_1.summary())
ACF Plot for Residuals of ARIMA(1,2,1)
This code calculates and plots the ACF of the residuals of the ARIMA(1,2,1) model in blue to reveal any remaining patterns.
df_comp['res_ar_1_i_2_ma_1'] = results_ar_1_i_2_ma_1.resid.iloc[:]
sgt.plot_acf(df_comp.res_ar_1_i_2_ma_1[2:], zero = False, lags = 40, color='blue')
plt.title("ACF Of Residuals for ARIMA(1,2,1)",size=20)
plt.show()
Fitting ARIMAX Models on Healthcare with Banking
This code applies an ARIMAX(1,1,1) model to the Healthcare series using Banking as an external variable and provides a descriptive summary of the performance and parameters of the model.
# Fitting the ARIMAX model on Healthcare with Banking as an exogenous variable
model_ar_1_i_1_ma_1_X = ARIMA(df_comp['Healthcare'], exog=df_comp['Banking'], order=(1,1,1))
results_ar_1_i_1_ma_1_X = model_ar_1_i_1_ma_1_X.fit()
# Displaying the summary in table format
print(results_ar_1_i_1_ma_1_X.summary())
ACF Plot for Residuals of ARIMAX(1,1,1)
This code plots the Autocorrelation Function (ACF) of residuals from the ARIMAX(1,1,1) model, using a red line to detect remaining patterns or correlations.
df_comp['resX_ar_1_i_1_ma_1'] = results_ar_1_i_1_ma_1_X.resid.iloc[:]
sgt.plot_acf(df_comp.resX_ar_1_i_1_ma_1, zero = False, lags = 40, color='red')
plt.title("ACF Of Residuals for ARIMAX(1,1,1)",size=20)
plt.show()
Fitting SARIMAX Model
This code builds and fits a SARIMAX model that predicts the Healthcare series using Banking as an exogenous variable, then prints a summary of the model's parameters and performance metrics.
# Fitting the SARIMAX model on Healthcare with Banking as an exogenous variable
model_sarimax = SARIMAX(df_comp['Healthcare'], exog=df_comp['Banking'], order=(1,1,1), seasonal_order=(2,0,1,5))
results_sarimax = model_sarimax.fit()
# Displaying the summary in table format
print(results_sarimax.summary())
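Although the walkthrough focuses on in-sample diagnostics, the fitted SARIMAX model can also produce out-of-sample forecasts. Doing so requires future values of the exogenous Banking series; the sketch below simply reuses the last 12 observed Banking values as a placeholder, an assumption for illustration only.
# Illustrative 12-month forecast; future Banking values are a placeholder assumption
future_exog = df_comp['Banking'].iloc[-12:].values.reshape(-1, 1)
forecast = results_sarimax.get_forecast(steps=12, exog=future_exog)
print(forecast.predicted_mean)
print(forecast.conf_int())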
STEP 8:
Analysis of Residuals
The following code stores the residuals from the SARIMAX model in the DataFrame and plots their autocorrelation function (ACF), so we can check whether the residuals are randomly distributed and thereby gauge the model's fit.
df_comp['results_sarimax'] = results_sarimax.resid.iloc[:]
sgt.plot_acf(df_comp.results_sarimax, zero = False, lags = 40, color='green')
plt.title("ACF Of Residuals for SARIMAX(1,1,1)",size=20)
plt.show()
Comparing the Performance of Models
This code compares the ARIMA, ARIMAX, and SARIMAX models on Log Likelihood and AIC, identifies the best model by the lowest AIC (which penalizes model complexity), and displays all the results in tabular form.
data = {
    'Model': ['ARIMA(1,1,1)', 'ARIMA(2,1,1)', 'ARIMA(1,2,1)', 'ARIMAX(1,1,1)', 'SARIMAX(1,1,1)'],
    'Log Likelihood': [results_ar_1_i_1_ma_1.llf, results_ar_2_i_1_ma_1.llf, results_ar_1_i_2_ma_1.llf, results_ar_1_i_1_ma_1_X.llf, results_sarimax.llf],
    'AIC': [results_ar_1_i_1_ma_1.aic, results_ar_2_i_1_ma_1.aic, results_ar_1_i_2_ma_1.aic, results_ar_1_i_1_ma_1_X.aic, results_sarimax.aic],
    # Add other metrics as needed
}
df_comparison = pd.DataFrame(data)
# Select the model with the lowest AIC (penalizes complexity, unlike raw log likelihood)
best_model = df_comparison.loc[df_comparison['AIC'].idxmin()]
print("The best model is:", best_model['Model'])
df_comparison
Conclusion
This project explores and applies time series forecasting techniques using ARIMA, ARIMAX, and SARIMAX models. We prepared the data, tested it for stationarity, and fitted the various models to the Healthcare series, with Banking as an exogenous variable. The best model was then selected based on AIC and Log Likelihood for accurate forecasting. The project demonstrates that including exogenous and seasonal components in a statistical model can substantially improve forecasting accuracy, giving a clear picture of how such models can be used for forecasting in specific industries such as Healthcare.
Challenges New Coders Might Face
Challenge: Handling missing data
Solution: Forward filling or interpolation can fill gaps smoothly without distorting the underlying trend (see the sketch after this list).
Challenge: Stationarity Issues
Solution: Apply transformations such as differencing or a log transform to make the series stationary, or use an ARIMA model whose d term handles the differencing for you.
Challenge: Overfitting Model
Solution: Regularly check model residuals with ACF plots, and choose well-suited p, d, and q values using the AIC and BIC criteria to avoid overly complex models.
Challenge: Model Selection
Solution: Compare candidate models on AIC, Log Likelihood, and residual analysis to select the best-fitting model according to these performance metrics.
Challenge: Include seasonality and exogenous variables
Solution: Use a SARIMAX model for seasonality and an ARIMAX model for adding external variables, but make sure all variables are aligned on the same index and in the right format for time series analysis.
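As mentioned in the first challenge, forward filling and interpolation are both one-liners in pandas. The two lines below are alternatives, not sequential steps; choose based on how the series behaves between observations.
# Two common gap-filling strategies (alternatives, not sequential steps)
filled = df_comp['Healthcare'].ffill()          # carry the last observation forward
smoothed = df_comp['Healthcare'].interpolate()  # linear interpolation between points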
FAQ
Question 1: What is time series forecasting?
Answer: Time series forecasting is the task of predicting future values based on a sequence of past observations. It is widely used in areas such as Healthcare, Finance, and Retail for making decisions from historical data.
Question 2: How can I manage missing data in time series forecasting?
Answer: Some techniques for dealing with incomplete datasets include forward or backward filling, interpolation, and row removal. The method should be chosen based on the amount and pattern of missing data.
Question 3: Why is stationarity necessary in a time series analysis?
Answer: Most forecasting models, including ARIMA, require a time series to be stationary (its statistical properties remain constant over time). Transformations such as differencing or log transformation may be applied when the data is found to be non-stationary.
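For example, a log transform followed by first differencing often stabilizes both the variance and the mean of a trending series. This is a generic recipe, assuming strictly positive values, not a step from the original notebook.
# Log transform (stabilizes variance) then first difference (removes trend)
log_diff = np.log(df_comp['Healthcare']).diff().dropna()
print(sts.adfuller(log_diff)[1])  # p-value after the transformation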
Question 4: How to find the best time series model?
Answer: To identify the best model, evaluate the candidates on common criteria such as AIC, BIC, and Log Likelihood. In addition, check the model's residuals for any remaining patterns.
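The pmdarima package automates this search. A minimal sketch, assuming pmdarima is installed (its install line is commented out in the setup cell above):
# Automated order selection by AIC using pmdarima
from pmdarima import auto_arima
best = auto_arima(df_comp['Healthcare'], seasonal=False, stepwise=True)
print(best.summary())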
Question 5: What is the difference between ARIMA and SARIMAX?
Answer: ARIMA is a basic model for forecasting a univariate time series, while SARIMAX extends it with seasonal components and allows the inclusion of exogenous variables that influence the dependent variable.
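The difference shows up directly in the constructors. These calls are illustrative, reusing the names from this project:
# ARIMA: univariate and non-seasonal
plain = ARIMA(df_comp['Healthcare'], order=(1, 1, 1))
# SARIMAX: adds a seasonal order and optional exogenous regressors
seasonal = SARIMAX(df_comp['Healthcare'], exog=df_comp['Banking'],
                   order=(1, 1, 1), seasonal_order=(2, 0, 1, 5))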
Question 6: What is residual analysis, and why is it important?
Answer: Residual analysis is used to examine the fit of the model. It involves checking the residuals (the differences between observed and predicted values) for randomness. If the residuals show a pattern, the model has room for improvement.
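Beyond inspecting ACF plots, the Ljung-Box test provides a formal check that residuals are uncorrelated. This is a minimal sketch using statsmodels, not part of the original notebook; large p-values suggest the residuals behave like white noise.
# Ljung-Box test on the SARIMAX residuals at lags 10 and 20
from statsmodels.stats.diagnostic import acorr_ljungbox
print(acorr_ljungbox(results_sarimax.resid, lags=[10, 20]))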