
Build ARCH and GARCH Models in Time Series using Python
In this project, we dive into forecasting stock market volatility using ARCH and GARCH models. By analyzing past stock data, we predict future market fluctuations, giving investors and traders a handy tool to make smarter decisions and manage risks with confidence.
Project Overview
This project focuses on analyzing and modeling stock market data using time series methods. The goal is to explore volatility patterns and forecast future market behavior using ARCH and GARCH models. First, the project loads and preprocesses stock price data, including cleaning, handling missing values, and capping outliers. Visualizations like rolling volatility, daily returns distribution, and correlation heatmaps are used to analyze the data. Stationarity tests and seasonal decomposition help in understanding trends and cycles.
Next, the ARCH and GARCH models are fitted to the returns data, and their performance is evaluated based on forecasted volatility. Error metrics like MSE and MAE are computed to compare model accuracy. Finally, the forecasted volatility for both models is plotted to visualize their predictions and assess their performance in forecasting future stock price fluctuations.
Prerequisites
- Python programming knowledge, especially in data analysis and modeling.
- Familiarity with libraries like pandas, numpy, matplotlib, and statsmodels for data manipulation and visualization.
- Understanding of time series analysis, particularly ARCH and GARCH models.
- Basic knowledge of volatility and financial markets.
- Experience with rolling statistics and outlier detection methods.
- Ability to interpret statistical tests like the Augmented Dickey-Fuller (ADF) test for stationarity.
Approach
The project begins by loading stock price data from CSV files and setting up the necessary file paths. The data is then preprocessed by handling missing values, removing irrelevant columns, and ensuring the dataset is clean. Exploratory Data Analysis (EDA) follows, where stock price trends and daily returns are visualized, and rolling volatility is calculated to understand market behavior. The Augmented Dickey-Fuller (ADF) test is applied to check the stationarity of the returns series. Outliers are detected using z-scores and capped based on quantiles to ensure clean data for modeling. Both ARCH and GARCH models are then fitted to the returns data to model volatility, and forecasts for a 5-day horizon are generated. The models' accuracy is evaluated using error metrics like MSE and MAE. Finally, the forecasted volatility from both models is plotted, and their performance is compared to assess which model better captures the market's volatility.
Workflow
- Load the stock price data from CSV files and set the file path.
- Preprocess the data by handling missing values and dropping irrelevant columns.
- Perform exploratory data analysis (EDA) to visualize stock price trends and returns.
- Conduct the Augmented Dickey-Fuller (ADF) test to check for stationarity in the returns.
- Detect and cap outliers using z-scores to clean the dataset.
- Fit ARCH and GARCH models to the returns data.
- Generate volatility forecasts for a 5-day horizon using both models.
- Evaluate model performance using Mean Squared Error (MSE) and Mean Absolute Error (MAE).
- Visualize and compare the forecasted volatility from both models.
Methodology
- Use data preprocessing techniques to ensure the dataset is clean and ready for analysis.
- Apply exploratory data analysis (EDA) to understand key trends and patterns in the stock data.
- Check for stationarity of the returns using the Augmented Dickey-Fuller (ADF) test to ensure valid time series modeling.
- Use z-scores to identify and cap outliers to prevent their influence on model performance.
- Fit both ARCH and GARCH models to the returns series to estimate volatility.
- Forecast future volatility using both models and evaluate the accuracy with MSE and MAE metrics.
- Compare the models' performance and visualize the results to determine which model best forecasts volatility.
Data Collection and Preparation
Data collection
The time series dataset is available on Kaggle. A Kaggle dataset can be accessed conveniently and securely from within Google Colab once your Kaggle credentials are configured, so that sensitive information is not exposed. The notebook collects the Kaggle API key and username securely from the user and assigns them as environment variables. This enables Kaggle's CLI command, which authenticates the user and downloads the dataset straight into Colab.
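A minimal sketch of that setup is shown below. It assumes the dataset's Kaggle slug is rohanrao/nifty50-stock-market-data and uses getpass so the credentials are not echoed on screen; adjust the slug and download path for your own account.
import os
from getpass import getpass

# Collect the Kaggle username and API key without displaying them on screen
os.environ['KAGGLE_USERNAME'] = getpass('Kaggle username: ')
os.environ['KAGGLE_KEY'] = getpass('Kaggle API key: ')

# Authenticate via the Kaggle CLI and download the dataset into the Colab session
!kaggle datasets download -d rohanrao/nifty50-stock-market-data --unzip -p nifty50-stock-market-data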
Data preparation workflow
- Load the dataset from the specified file path.
- Check the dataset structure to ensure it’s properly loaded.
- Handle missing values by dropping rows with missing values in key columns like 'Close'.
- Drop irrelevant columns such as 'Trades', 'Deliverable Volume' and '%Deliverble'.
- Ensure the 'Date' column is parsed correctly and set as the index.
- Perform exploratory data analysis (EDA) to understand the data, including visualizations of stock prices and returns.
- Test for stationarity using the Augmented Dickey-Fuller (ADF) test and apply differencing if necessary.
- Detect outliers using z-scores and cap extreme values to maintain data integrity.
- Calculate additional metrics like daily returns and rolling volatility for further analysis.
Understanding the Code
Here’s what is happening under the hood. Let’s go through it step by step:
Step 1:
Mount your Google Drive to access and save datasets, models, and other resources.
from google.colab import drive
drive.mount('/content/drive')
This command installs the required Python packages for data analysis and modeling: pandas, numpy, matplotlib, statsmodels, arch, and seaborn.
!pip install pandas numpy matplotlib statsmodels arch seaborn
This code imports libraries for data manipulation, statistical analysis, modeling, and evaluation, including pandas, numpy, statsmodels, and arch.
# Import libraries
import os
import numpy as np
import pandas as pd
import seaborn as sns
from arch import arch_model
from scipy.stats import zscore
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.metrics import mean_squared_error, mean_absolute_error
This code sets the dataset folder path and checks the contents of the specified directory to verify its structure.
# Dataset folder path
data_folder = '/content/drive/MyDrive/Aionlinecourse_badhon/Project/Build ARCH and GARCH Models in Time Series using Python /nifty50-stock-market-data'
# Check the structure
os.listdir(data_folder)
Step 2:
This code loads the 'ASIANPAINT.csv' dataset into a pandas DataFrame, parsing the 'Date' column as dates and setting it as the index.
data = pd.read_csv(os.path.join(data_folder, 'ASIANPAINT.csv'), parse_dates=['Date'], index_col='Date')
This code displays the first few rows of the dataset to give an overview of its structure and contents.
# Check the first few rows
data.head()
The code defines a function to plot the adjusted close price, the high and low prices, and the daily returns of the ASIANPAINT stock in a three-panel chart.
def plot_data():
    fig, axes = plt.subplots(3, 1, figsize=(10, 18))
    data['Close'].plot(ax=axes[0], title='Adjusted Close Price - ASIANPAINT', color='blue', ylabel='Price')

    # High and Low Prices Plot
    axes[1].plot(data['High'], label='High', color='blue')
    axes[1].plot(data['Low'], label='Low', color='orange')
    axes[1].set_title('High and Low Prices - ASIANPAINT')
    axes[1].set_ylabel('Price')
    axes[1].legend()

    # Daily Returns Plot
    data['returns'] = data['Close'].pct_change()
    data['returns'].dropna().plot(ax=axes[2], title='Daily Returns - ASIANPAINT', color='green', ylabel='Returns')

    plt.tight_layout()
    plt.show()

plot_data()
This code removes rows with missing values in the 'Close' column and then drops the irrelevant columns 'Trades', 'Deliverable Volume', and '%Deliverble'.
# Check for missing data and drop irrelevant columns
data = data.dropna(subset=['Close'])
data = data.drop(['Trades', 'Deliverable Volume', '%Deliverble'], axis=1, errors='ignore')
The following code applies the Augmented Dickey-Fuller (ADF) test to the returns series to check for stationarity. If the resulting p-value is greater than 0.05, the returns are differenced to make the series stationary.
def adf_test(series):
    result = adfuller(series.dropna())
    print(f"ADF Statistic: {result[0]}")
    print(f"p-value: {result[1]}")
    return result[1]

adf_pvalue = adf_test(data['returns'])
if adf_pvalue > 0.05:
    data['returns'] = data['returns'].diff().dropna()
This code calculates and plots the 30-day rolling volatility of the returns, showing how the volatility changes over time.
# Plot rolling volatility (30-day window)
data['rolling_volatility'] = data['returns'].rolling(window=30).std()
data['rolling_volatility'].dropna().plot(figsize=(10, 6), title='Rolling Volatility (30 days)', color='red')
plt.xlabel('Date')
plt.ylabel('Volatility')
plt.show()
This code computes and displays a heatmap of the correlation matrix for selected columns, including 'Close', 'Prev Close', 'High', 'Low' and 'returns'.
# Correlation matrix heatmap
corr_matrix = data[['Close', 'Prev Close', 'High', 'Low', 'returns']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()
This code plots the distribution of daily returns using a histogram with a kernel density estimate (KDE) to visualize the data's frequency and spread.
# Distribution of daily returns
plt.figure(figsize=(10, 6))
sns.histplot(data['returns'], bins=50, kde=True)
plt.title('Distribution of Daily Returns')
plt.xlabel('Returns')
plt.ylabel('Frequency')
plt.show()
This code calculates the z-score of the returns and identifies outliers with an absolute z-score greater than 3, printing the number of detected outliers.
from scipy.stats import zscore

# Compute z-scores while ignoring the NaN produced by pct_change, so the result matches the DataFrame length
data['z_score'] = zscore(data['returns'], nan_policy='omit')
outliers = data[data['z_score'].abs() > 3]
print(f"Number of outliers detected: {len(outliers)}")
This code plots the daily returns time series and highlights the detected outliers in red for comparison.
# Plot outliers on top of returns
plt.figure(figsize=(10, 6))
plt.plot(data['returns'], label='Daily Returns')
plt.scatter(outliers.index, outliers['returns'], color='red', label='Outliers')
plt.title('Daily Returns with Outliers')
plt.xlabel('Date')
plt.ylabel('Returns')
plt.legend()
plt.show()
This function caps outliers by clipping the returns at the 1st and 99th percentiles, ensuring the data stays within these bounds.
# Detect and cap outliers
def cap_outliers(series):
    lower_bound = series.quantile(0.01)
    upper_bound = series.quantile(0.99)
    return series.clip(lower=lower_bound, upper=upper_bound)

data['returns'] = cap_outliers(data['returns'])
This code plots the daily returns after capping the outliers, showing the adjusted returns time series.
# Plot the daily returns after capping the outliers
plt.figure(figsize=(10, 6))
plt.plot(data['returns'], label='Capped Daily Returns')
plt.title('Daily Returns with Outliers Capped')
plt.xlabel('Date')
plt.ylabel('Returns')
plt.legend()
plt.show()
This code uses a multiplicative model to decompose the 'Close' prices into trend, seasonal, and residual components and then visualizes the results.
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose the 'Close' prices
decomposed = seasonal_decompose(data['Close'].dropna(), model='multiplicative', period=365)
decomposed.plot()
plt.show()
This code calculates the 30-day rolling mean and rolling standard deviation of the close price and plots them alongside the original close price.
# Rolling Mean and Standard Deviation
rolling_mean = data['Close'].rolling(window=30).mean()
rolling_std = data['Close'].rolling(window=30).std()
plt.figure(figsize=(10, 6))
plt.plot(data['Close'], label='Close Price', color='blue')
plt.plot(rolling_mean, label='Rolling Mean (30 days)', color='red')
plt.plot(rolling_std, label='Rolling Std Dev (30 days)', color='green')
plt.title('Rolling Mean and Standard Deviation - Close Price')
plt.legend()
plt.show()
Step 3:
This code fits both an ARCH model and a GARCH model to the returns series, outputs summaries, and returns the fitted models for subsequent analysis.
def fit_models(series):
    arch_model_fit = arch_model(series.dropna(), vol='ARCH', p=1).fit()
    print("ARCH Model Summary")
    print(arch_model_fit.summary())

    garch_model_fit = arch_model(series.dropna(), vol='Garch', p=1, q=1).fit()
    print("GARCH Model Summary")
    print(garch_model_fit.summary())

    return arch_model_fit, garch_model_fit

arch_fit, garch_fit = fit_models(data['returns'])
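For reference, the GARCH(1,1) model fitted here specifies the conditional variance as σ_t² = ω + α·ε_{t-1}² + β·σ_{t-1}², where ε_{t-1} is the previous period's return shock; the ARCH(1) model is the special case with β = 0, so the variance depends only on the most recent squared shock. Note that the arch package may warn about poorly scaled data when fitting raw decimal returns; rescaling the returns (for example, multiplying them by 100) before fitting is a common way to address this without changing the structure of the model.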
This function forecasts volatility over a specified horizon (5 days by default) using a fitted model and returns the square root of the forecast variance; it is applied to both the fitted ARCH and GARCH models.
def forecast_volatility(model, horizon=5):
    forecast = model.forecast(horizon=horizon)
    return np.sqrt(forecast.variance.iloc[-1])

arch_forecast_volatility = forecast_volatility(arch_fit)
garch_forecast_volatility = forecast_volatility(garch_fit)
This code plots the forecasted volatility from both the ARCH and GARCH models, allowing their predictions over the forecast horizon to be compared.
# Plot forecasted volatility for both models
plt.figure(figsize=(10, 6))
plt.plot(arch_forecast_volatility, label='Forecasted Volatility (ARCH)', color='blue')
plt.plot(garch_forecast_volatility, label='Forecasted Volatility (GARCH)', color='purple')
plt.title('Forecasted Volatility Comparison')
plt.xlabel('Days')
plt.ylabel('Volatility')
plt.legend()
plt.show()
This code prints the AIC and BIC values for both the ARCH and GARCH models to evaluate their fit and compare model performance.
print(f"ARCH Model AIC: {arch_fit.aic}")
print(f"ARCH Model BIC: {arch_fit.bic}")
print(f"GARCH Model AIC: {garch_fit.aic}")
print(f"GARCH Model BIC: {garch_fit.bic}")
This code calculates the actual (realized) volatility as a 5-day rolling standard deviation of the returns, keeping only observations with no missing values.
actual_volatility = data['returns'].rolling(window=5).std().dropna()
Step 4:
This code defines a function that calculates the MSE and MAE between actual and forecasted values.
def calculate_errors(actual, forecasted):
    mse = mean_squared_error(actual, forecasted)
    mae = mean_absolute_error(actual, forecasted)
    return mse, mae
It computes and prints the mean squared error (MSE) and mean absolute error (MAE) of the volatility forecasts from the ARCH model against the actual volatility.
# ARCH model MSE and MAE
arch_mse, arch_mae = calculate_errors(actual_volatility[-5:], arch_forecast_volatility)
print(f"ARCH Model MSE: {arch_mse}, MAE: {arch_mae}")
It computes and prints the mean squared error (MSE) and mean absolute error (MAE) of the volatility forecasts from the GARCH model against the actual volatility.
# GARCH model MSE and MAE
garch_mse, garch_mae = calculate_errors(actual_volatility[-5:], garch_forecast_volatility)
print(f"GARCH Model MSE: {garch_mse}, MAE: {garch_mae}")
This code compares the performance of the ARCH and GARCH models by plotting their MSE and MAE values, visually showing the error for each model.
plt.figure(figsize=(10, 6))
plt.plot(['ARCH', 'GARCH'], [arch_mse, garch_mse], marker='o', label='MSE')
plt.plot(['ARCH', 'GARCH'], [arch_mae, garch_mae], marker='o', label='MAE')
plt.title('Model Performance Comparison')
plt.xlabel('Model')
plt.ylabel('Error')
plt.legend()
plt.show()
Conclusion
In this project, we explored time series analysis to model and forecast stock price volatility. By preprocessing the data, handling missing values, removing irrelevant columns, and capping outliers, we ensured that the dataset was clean and ready for modeling. We performed exploratory data analysis (EDA) to visualize stock price trends and returns and tested for stationarity with the Augmented Dickey-Fuller (ADF) test. Using ARCH and GARCH models, we forecasted volatility and evaluated model performance with error metrics such as MSE and MAE. The findings provide a deeper understanding of volatility patterns and help in choosing the best model for predicting future market behavior.
Challenges New Coders Might Face
Challenge: Handling missing data
Solution: Forward filling or interpolation preserves smooth transitions in the data without losing trends (see the short sketch after this list).
Challenge: Time series seasonality
Solution: Decompose the time series into trend, seasonal, and residual components using an additive model.
Challenge: Overfitting models
Solution: Select features carefully and evaluate the model with an error metric such as RMSE to balance bias and variance.
Challenge: Scaling and encoding
Solution: Standardize numerical features and apply one-hot encoding to categorical features such as seasonality indicators.
Challenge: Anomaly detection complexity
Solution: Use Z-scores for statistical anomaly detection and visually check the results by plotting the anomalies.
Challenge: Interpreting model output
Solution: RMSE is a commonly used metric, and plotting actual versus predicted values makes the results easier to understand.
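As a quick illustration of the missing-data fix and the RMSE metric mentioned above, here is a small self-contained sketch. It is not part of the project code; the toy price series and its values are made up purely for demonstration.
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

# Toy daily price series with gaps (illustrative values only)
prices = pd.Series([100.0, np.nan, 102.0, 103.5, np.nan, 105.0],
                   index=pd.date_range('2021-01-01', periods=6, freq='D'))

filled = prices.ffill()              # forward fill: repeat the last known value
interpolated = prices.interpolate()  # linear interpolation between known values

# RMSE between two series (here the two imputations; in practice, actual vs. predicted)
rmse = np.sqrt(mean_squared_error(filled, interpolated))
print(filled.values, interpolated.values, rmse)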
FAQ
Question 1: Define time series analysis in data science.
Answer: Time series analysis studies data points collected at regular intervals to identify temporal patterns, trends, and seasonality, which supports understanding and forecasting in sectors such as banking and healthcare.
Question 2: What are lagged features in time series forecasting?
Answer: Lagged features are past values of a variable. Models use these historical values to learn patterns and estimate the variable's likely future behavior.
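For instance (illustrative only, not part of the project code), lagged return features can be created with pandas' shift on the DataFrame used earlier:
# Previous-day and five-day-old returns as lagged features (rows without enough history become NaN)
data['returns_lag1'] = data['returns'].shift(1)
data['returns_lag5'] = data['returns'].shift(5)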
Question 3: How do you treat missing values in time series datasets?
Answer: Missing values in time series data are usually handled by forward filling, backward filling, or interpolation so that the series remains continuous from one period to the next.
Question 4: How is Z-score anomaly detection applicable to time series?
Answer: A Z-score measures how far a value lies above or below the mean in units of standard deviation; values with an absolute Z-score above 3 are typically flagged as anomalies.
Question 5: Why should rolling windows be used for measuring volatility?
Answer: A rolling window, such as 30 days or 5 days, shows how volatility evolves over time instead of averaging it out across the whole sample, so it better reflects current market conditions.