
Time Series Forecasting with ARIMA and SARIMAX Models in Python
In this project, we work on time series forecasting, a powerful way to understand and predict trends over time. We deal with real-world data from industries such as Healthcare, Banking, and Telecom. With the help of ARIMA, ARIMAX, and SARIMAX models, we identify patterns in the data, test candidate models, and produce forecasts based on them.
Are you intimidated by terms like ACF plots, stationarity, or residuals? Don't worry, we'll keep it fun and simple. By the end, everything will come together in one neat package that shows how these models can be used to forecast future values.
Project Overview
This project starts by importing and preparing the data. We clean it, handle missing values, and set the date column as the index, which establishes the data as a time series. We also set the frequency to monthly using .asfreq('M'). Next, we explore the data visually, plotting the trends for features such as Healthcare, Banking, and Telecom. We also generate random white noise to understand what pure randomness looks like and how it compares with the actual data patterns.
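To make that comparison concrete, here is a minimal, self-contained sketch of the white noise idea using a hypothetical monthly series (the series and its values are made up for illustration and are not the project dataset):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical trending series on a monthly index (for illustration only)
idx = pd.date_range("2010-01-31", periods=120, freq="M")
series = pd.Series(np.cumsum(np.random.randn(120)) + 50, index=idx, name="Actual")

# White noise: independent draws with the same mean and standard deviation as the series
white_noise = pd.Series(
    np.random.normal(series.mean(), series.std(), len(series)),
    index=idx, name="White noise")

# Plot both: the actual series shows structure, the white noise does not
fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
series.plot(ax=axes[0], title="Actual series (trending)")
white_noise.plot(ax=axes[1], title="White noise (no structure)")
plt.tight_layout()
plt.show()
White noise has constant mean and variance and no autocorrelation, which is why it makes a useful baseline for spotting structure in real data.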
The real fun starts when we test the stationarity of the series with the ADF test. This tells us whether the series needs to be transformed into a stationary one before modeling. Then we take three different modeling approaches: ARIMA, ARIMAX, and SARIMAX. Finally, we compile all the models into a table to find out which performs best based on metrics such as AIC and Log Likelihood. By the end, we will have a solid forecasting model that can be used to predict the future of industries like Healthcare.
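As a preview of that stationarity check, here is a minimal sketch of the ADF test. It assumes df_comp is the prepared monthly DataFrame (created later in the project) and that it contains a Healthcare column:
from statsmodels.tsa.stattools import adfuller

# Augmented Dickey-Fuller test: the null hypothesis is that the series is non-stationary
adf_result = adfuller(df_comp["Healthcare"].dropna())
print(f"ADF statistic: {adf_result[0]:.4f}")
print(f"p-value:       {adf_result[1]:.4f}")
# A p-value below 0.05 suggests rejecting the null, i.e. treating the series as stationary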
Prerequisites
- Python programming and knowledge of Pandas, NumPy, and Matplotlib libraries.
- Prior knowledge of time series data and key concepts such as trends, seasonality, and stationarity.
- Understanding of trend and seasonal patterns and of forecasting procedures using ARIMA, ARIMAX, and SARIMAX.
- Some background in statistical modeling and in model assessment criteria such as AIC and Log Likelihood.
- Familiarity with using Jupyter Notebooks for code execution and result visualization.
Approach
In this project, a systematic procedure is used for forecasting the time series. We import and clean the dataset, creating a date-indexed structure with monthly frequency. After handling missing values, we visualize the data to better understand its trends and patterns. Next, an ADF test is performed to check for stationarity, and if needed, transformations such as differencing are applied. Afterward, we fit several models: we start with ARIMA, then incorporate an exogenous variable such as Banking to create an ARIMAX model, and finally capture seasonal components with SARIMAX. We evaluate all the models using metrics such as AIC and Log Likelihood, compare them, and select the best one for forecasting the Healthcare sector and others.
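If the ADF test indicates non-stationarity, first-order differencing is one common remedy. A minimal sketch, assuming the same df_comp DataFrame and Healthcare column as above:
from statsmodels.tsa.stattools import adfuller

# First-order differencing: subtract the previous observation from the current one
healthcare_diff = df_comp["Healthcare"].diff().dropna()

# Re-run the ADF test on the differenced series and check the p-value
p_value = adfuller(healthcare_diff)[1]
print(f"p-value after differencing: {p_value:.4f}")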
Workflow and Methodology
Workflow
- Data Preparation: Import and cleanse the data, establishing the date column as an index while checking for missing values.
- Data Visualization: Visualize the data to observe trends and patterns in features such as Healthcare and Banking.
- Stationarity Check: Perform the ADF test for stationarity and apply data transformations, if needed.
- Model Fitting: Fit and evaluate ARIMA, ARIMAX, and SARIMAX.
- Model Comparison: Compare the models using AIC and Log Likelihood.
Methodology
- We use ARIMA to model univariate time series data and test various configurations.
- Use ARIMAX to incorporate exogenous variables for better forecasts.
- Use SARIMAX to account for seasonality and trend in the data.
- Use ACF plots of the model residuals to check whether they are random and whether the model fits well.
- Identify the best model using AIC and Log Likelihood, as shown in the sketch after this list.
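The following sketch ties these points together. It assumes df_comp is a monthly DataFrame with Healthcare as the target and Banking as an exogenous regressor, and the (p, d, q) orders shown are placeholders rather than tuned values:
import matplotlib.pyplot as plt
import statsmodels.graphics.tsaplots as sgt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX

y = df_comp["Healthcare"]
exog = df_comp[["Banking"]]

# ARIMA: univariate model with (p, d, q) orders
arima_fit = ARIMA(y, order=(1, 1, 1)).fit()

# ARIMAX: the same model with an exogenous regressor
arimax_fit = ARIMA(y, exog=exog, order=(1, 1, 1)).fit()

# SARIMAX: adds a seasonal (P, D, Q, s) component, here with a 12-month period
sarimax_fit = SARIMAX(y, exog=exog, order=(1, 1, 1),
                      seasonal_order=(1, 0, 1, 12)).fit(disp=False)

# Inspect residual autocorrelation: ideally no significant spikes remain
sgt.plot_acf(arima_fit.resid, lags=24, zero=False)
plt.show()

# Compare the candidates: lower AIC and higher Log Likelihood are preferred
for name, fit in [("ARIMA", arima_fit), ("ARIMAX", arimax_fit), ("SARIMAX", sarimax_fit)]:
    print(f"{name:8s}  AIC={fit.aic:10.2f}  LogLik={fit.llf:10.2f}")
In practice, several order combinations are tried, and the residual ACF plot together with AIC and Log Likelihood guides the final choice.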
Data Collection and Preparation
Data Collection:
In this project, we collected the dataset from a public repository. If you want to work on a real-world problem, you can get similar datasets from publicly available repositories such as Kaggle or the UCI Machine Learning Repository, or from company-specific sources. We provide the dataset with this project so that you can work on the same data.
Data Preparation Workflow:
- Using Pandas, load the dataset and inspect the first few rows.
- Set the date column (e.g., "month") as the index for time series analysis.
- Use isna().sum() to check for missing values in the data.
- Handle missing values based on the situation, either by filling or dropping them.
- Use .asfreq('M') to set the frequency of the data to monthly.
- Ensure that the data is in the right format for time series analysis (a minimal sketch of these steps follows below).
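The sketch below walks through this preparation workflow; the file path is a placeholder and the date column name "month" follows the example above:
import pandas as pd

# Load the dataset and inspect the first few rows
df = pd.read_excel("Data.xlsx")  # placeholder path; adjust to your environment
print(df.head())

# Set the date column as the index for time series analysis
df["month"] = pd.to_datetime(df["month"])
df = df.set_index("month")

# Check for missing values per column
print(df.isna().sum())

# One possible strategy: carry the last observation forward
df = df.ffill()

# Set the index frequency to month-end
df = df.asfreq('M')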
Code Explanation
STEP 1:
Mounting Google Drive
First, mount Google Drive to access the data stored in the cloud.
from google.colab import drive
drive.mount('/content/drive')
Install Required Python Libraries
This code installs the essential Python libraries for data analysis, visualization, and statistical modeling: Pandas, NumPy, Seaborn, Matplotlib, SciPy, and Statsmodels, plus pmdarima for advanced time series tools such as auto_arima. Together they provide the foundation for handling, visualizing, and modeling the data.
!pip install pandas
!pip install numpy
!pip install seaborn
!pip install matplotlib
!pip install scipy
!pip install statsmodels
!pip install pmdarima
# random and pylab do not need separate installs: random is part of the Python
# standard library, and pylab ships with matplotlib. Submodules such as
# scipy.stats or statsmodels.tsa.* are installed with their parent packages,
# and auto_arima is a function provided by pmdarima, not a separate package.
Importing the Required Libraries and Setting Configurations
This code imports the libraries needed for data manipulation, visualization, and time series analysis: Pandas, NumPy, Seaborn, Matplotlib, SciPy, and Statsmodels. It also configures warning filters to suppress unnecessary warnings, keeping the output clear during execution.
# import the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot
sns.set_theme(style="darkgrid")
import scipy.stats
from random import seed
from random import random
from matplotlib import pyplot
import pylab
import statsmodels.graphics.tsaplots as sgt
import statsmodels.tsa.stattools as sts
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from scipy.stats.distributions import chi2
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.statespace.sarimax import SARIMAX
import warnings
# Suppress noisy warnings so the notebook output stays readable
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message="Covariance matrix calculated using the outer product of gradients")
warnings.filterwarnings("ignore", message="Covariance matrix is singular or near-singular")
Importing the Dataset
This code loads the dataset from an Excel file using Pandas, making it ready for data analysis and processing.
# importing the data
raw_csv_data = pd.read_excel("/content/drive/MyDrive/New 90 Projects/Project_14/Dataset/Data.xlsx")
Creating a Copy Dataset
This code creates a copy of the original dataset as df_comp. This ensures the raw data remains unaltered for future reference.
# check point of data
df_comp = raw_csv_data.copy()
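As a quick sanity check (not part of the original steps), you can preview the copy to confirm it loaded as expected:
# Inspect the checkpoint copy; raw_csv_data stays untouched
print(df_comp.head())
print(df_comp.shape)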