
Time Series Analysis with Facebook Prophet Python and Cesium

This project demonstrates time series forecasting with Prophet, enriched with features extracted by Cesium. We use the historical trends and seasonal patterns in healthcare call data, together with additional time series features, to prepare forecasts. The goal is to predict future call volumes more accurately by combining feature extraction with Prophet's built-in forecasting capabilities.

Project Overview

The project builds a predictive model for healthcare call data using Prophet, enhanced by features extracted with Cesium. The first step is cleaning the data. Next, Cesium extracts time series features (mean, standard deviation, and others) from the historical data, and these are passed to Prophet as additional inputs. After training, the model predicts future call volumes, which are then visualized for interpretation. The trends, seasonality, and uncertainty intervals shown in the plots give a comprehensive view of the forecast.

Prerequisites

  • Basic knowledge of time series analysis and forecasting.
  • Hands-on experience with Python and with data manipulation in pandas.
  • Familiarity with the Prophet model for time series forecasting.
  • Experience with Cesium for feature extraction from time series.
  • Knowledge of basic statistical features such as mean, standard deviation, and skewness.
  • Familiarity with data visualization using matplotlib and seaborn.
  • Python packages: pandas, prophet, cesium, matplotlib, seaborn, numpy, scipy, statsmodels.

Approach

We started by cleaning the healthcare call data and renaming the columns to ds (date) and y (value) so that they work with the Prophet library. With this, we trained a baseline Prophet model. We then used Cesium to extract statistical features, such as the mean and standard deviation, from windows of the historical data. Adding these features produces an enriched dataset that contains the original call counts plus the new features, which Prophet treats as regressors. A second Prophet model is fitted on this enriched dataset and used to forecast call volumes for the next 12 months. Finally, the forecast is visualized so that the trend, seasonality, and other components estimated by the model can be inspected to understand what drives the prediction. In this way, the approach combines historical data with custom features with the aim of improving accuracy.

Workflow and Methodology

  • Data Collection and Manipulation: Collect and clean the healthcare call data so that it can be analyzed as a time series.
  • Data Transformation: Rename the columns to ds and y to match the format Prophet expects.
  • Feature Extraction: Extract statistical features (mean, standard deviation, mean of squares, and sum of absolute differences) from the time series using Cesium.
  • Data Enrichment: Combine the extracted features with the original data to get an enriched dataset.
  • Model Training: Retrain the Prophet model on the enriched dataset, with the additional features as regressors.
  • Forecasting: Use the trained model to forecast call volumes for the next 12 months.
  • Visualization: Display the forecast results and visualize trends, seasonality, and uncertainty intervals.

Data Collection and Preparation

Data Collection:

In this project, we collected the dataset from a public repository. If you are looking to work on a real-world problem, you can get these kinds of datasets from publicly available repositories such as Kaggle, UCI Machine Learning Repository, or company-specific data. We will provide the dataset in this project so that you can work on the same dataset.

Data Preparation Workflow:

  • Import the healthcare call data and load it into a pandas dataframe for analysis.
  • Convert the month column to datetime so that it is time series compatible.
  • Check the dataset for missing or null values.
  • Clean up the dataset and keep only the relevant columns (month and Healthcare).
  • Rename the columns to ds (date) and y (value) so that Prophet can use them.
  • Verify that the dataset is ready for time series forecasting.
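The preparation steps above can be sketched as a single helper. This is a minimal illustration on made-up data: the function name prepare_for_prophet and the sample values are hypothetical, and dropping rows with missing values is an assumption (the walkthrough only checks for them).

```python
import pandas as pd

def prepare_for_prophet(df, date_col="month", value_col="Healthcare"):
    """Return a two-column frame (ds, y) sorted by date."""
    out = df[[date_col, value_col]].copy()
    out[date_col] = pd.to_datetime(out[date_col])
    # Assumption: rows with a missing date or value are simply dropped
    out = out.dropna(subset=[date_col, value_col])
    out = out.rename(columns={date_col: "ds", value_col: "y"})
    return out.sort_values("ds").reset_index(drop=True)

# Tiny made-up sample, only for illustration
sample = pd.DataFrame({
    "month": ["2020-02-01", "2020-01-01", "2020-03-01"],
    "Healthcare": [120, 100, None],
    "Telecom": [80, 90, 85],
})
prepared = prepare_for_prophet(sample)
print(prepared)
```

Sorting by date is not strictly part of the list above, but it keeps later window-based steps in chronological order.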

Understanding the Code:

Here’s what is happening under the hood. Let’s go through it step by step:

Step 1:

Mounting Google Drive

Mount your Google Drive to access and save datasets, models, and other resources.

from google.colab import drive
drive.mount('/content/drive')

Management of Packages

In this code, we uninstall numpy, scipy, cesium, prophet, and seaborn and then reinstall the latest available versions. This guarantees that you are working with the latest changes made to these packages.

!pip uninstall -y numpy scipy cesium prophet seaborn
!pip install numpy
!pip install scipy
!pip install cesium
!pip install prophet
!pip install seaborn

Import Library

This code imports the essential libraries: numpy for numerical operations, pandas for data manipulation, seaborn and matplotlib for visualization, Prophet for time series forecasting, cesium for feature extraction, and statsmodels for seasonal decomposition. It also imports the diagnostic functions from Prophet for validating models.

import numpy as np
import pandas as pd
import seaborn as sns
from prophet import Prophet
from cesium import featurize
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from prophet.diagnostics import cross_validation, performance_metrics

Step 2:

Loading Excel File

This code loads an Excel file from a predefined path. If it succeeds, it prints the message "Successfully loaded". If the file does not exist or another error occurs, the exception is caught and an error message is displayed.

excel_file_path = '/content/drive/MyDrive/Aionlinecourse_badhon/Project/Time Series Analysis with Facebook Prophet Python and Cesium/CallCenterData.xlsx'
try:
    # Read the Excel file into a DataFrame
    df = pd.read_excel(excel_file_path)
    print("Successfully loaded")
except FileNotFoundError:
    print(f"Error: File not found at {excel_file_path}")
except Exception as e:
    print(f"An error occurred: {e}")

Previewing Data

This block of code displays the first few rows of the dataset to give a quick overview of its structure.

df.head()

Checking Null Values

This line of code checks if there are any null values present in each feature.

df.isnull().sum()

Descriptive Statistics

This code displays a summary of the numerical variables in the DataFrame, including the count, mean, standard deviation, minimum, maximum, and quartiles.

df.describe()

The next line summarizes the DataFrame itself: the number of records, the column names and types, the count of non-null values, and the memory usage.

df.info()

Step 3:

Trend Analysis on Call Patterns

This code creates one subplot per sector to visualize its call trend over time. Each subplot shows only that sector's data, with a title, axis labels, gridlines, and a legend for clarity. Finally, tight_layout() adjusts the spacing so the subplots do not overlap.

# Plot each feature separately
features = ['Healthcare', 'Telecom', 'Banking', 'Technology', 'Insurance']
plt.figure(figsize=(20, 12))
for i, feature in enumerate(features, 1):
    plt.subplot(3, 2, i)
    plt.plot(df['month'], df[feature], linewidth=2, label=feature, color='red')
    plt.title(f"{feature} Call Trends Over Time", fontsize=14)
    plt.xlabel("Month", fontsize=12)
    plt.ylabel("Call Count", fontsize=12)
    plt.grid(True)
    plt.legend()
plt.tight_layout()
plt.show()

Correlation Heatmap

This code generates a heatmap of the pairwise correlations between the sectors (Healthcare, Telecom, Banking, Technology, and Insurance). Each cell is annotated with its correlation value, and the coolwarm color map distinguishes negative from positive relationships.

# Correlation heatmap for the numerical columns
plt.figure(figsize=(10, 6))
correlation = df[['Healthcare', 'Telecom', 'Banking', 'Technology', 'Insurance']].corr()
sns.heatmap(correlation, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title('Correlation Between Sectors', fontsize=16)
plt.show()

Annual Trends Graph

This code extracts the year from the month column and sums the call counts per year for all sectors (Healthcare, Telecom, Banking, Technology, Insurance). It then plots the total calls per year, with markers for each data point, a legend, and gridlines for easy reading.

df['year'] = df['month'].dt.year
# Aggregate by year
yearly_trends = df.groupby('year')[['Healthcare', 'Telecom', 'Banking', 'Technology', 'Insurance']].sum()
# Plot yearly trends
yearly_trends.plot(figsize=(12, 6), marker='o')
plt.title('Yearly Trends of Call Counts', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Total Call Counts', fontsize=12)
plt.legend(title='Sector')
plt.grid(True)
plt.show()

Seasonal Decomposition

Using a period of 12 months, this code performs an additive seasonal decomposition on the Healthcare column. It separates the data into trend, seasonal, and residual components and plots them.

# Perform seasonal decomposition
result = seasonal_decompose(df['Healthcare'], period=12, model='additive')
# Plot decomposition
result.plot()
plt.show()
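To make the idea of an additive decomposition concrete, here is a minimal NumPy sketch of the classical moving-average approach: estimate the trend with a centered moving average, average the detrended values at each position in the cycle to get the seasonal part, and treat what is left as the residual. It is an illustration under simplifying assumptions, not the statsmodels implementation; the function name additive_decompose and the synthetic series are made up for this example.

```python
import numpy as np

def additive_decompose(y, period=12):
    """Split y into trend + seasonal + residual using a centered moving average."""
    y = np.asarray(y, dtype=float)
    # For an even period, the two end points of the window get half weight
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(len(y), np.nan)  # the edges have no full window, so they stay NaN
    trend[half:len(y) - half] = np.convolve(y, weights, mode="valid")
    detrended = y - trend
    # Average the detrended values at each position in the cycle
    seasonal_means = np.array(
        [np.nanmean(detrended[i::period]) for i in range(period)]
    )
    seasonal_means -= seasonal_means.mean()  # center the seasonal effect on zero
    seasonal = np.tile(seasonal_means, len(y) // period + 1)[:len(y)]
    resid = y - trend - seasonal
    return trend, seasonal, resid

# Synthetic series: linear trend plus a repeating zero-mean 12-month pattern
t = np.arange(48)
pattern = np.array([3, -1, -2, 0, 1, 4, -3, -2, 0, 1, -1, 0], dtype=float)
y = 2 * t + np.tile(pattern, 4)

trend, seasonal, resid = additive_decompose(y, period=12)
print(np.round(seasonal[:12], 2))
```

On this noise-free series the recovered seasonal component matches the pattern that was put in; on real call data the residual component would hold whatever the trend and seasonal parts do not explain.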

Prophet Data Preparation

This code converts the month column to datetime values and prepares the input data for the Healthcare sector. It renames the columns to ds (date) and y (value) so that they can be used as input to a Prophet model.

# Step 1: Ensure datetime format for the 'month' column
df['month'] = pd.to_datetime(df['month'])
# Step 2: Prepare data for a single sector (e.g., Healthcare)
healthcare_df = df[['month', 'Healthcare']].rename(columns={'month': 'ds', 'Healthcare': 'y'})

Previewing Data

This block of code displays the first few rows of the new healthcare_df dataset to give a quick overview of its structure.

healthcare_df.head()

Data Visualization of Healthcare Time Series Data

This code plots the Healthcare call counts over time as a blue line with markers. It adds axis labels, a legend, and gridlines for better readability.

# Plot the data
plt.figure(figsize=(12, 6))
plt.plot(healthcare_df['ds'], healthcare_df['y'], marker='o', color='blue', label='Healthcare Calls')
plt.title('Time Series Data for Healthcare Sector', fontsize=16)
plt.xlabel('Date', fontsize=12)
plt.ylabel('Call Counts', fontsize=12)
plt.legend()
plt.grid(True)
plt.show()

Step 4:

Fitting the Prophet Model

This code initializes the Prophet model and fits it to the Healthcare time series dataset. The model learns the trend, seasonality, and other patterns in the data.

# Initialize and fit the model
model = Prophet()
model.fit(healthcare_df)

Building a Future Dataframe and Forecasting

This code creates a future dataframe that extends 12 months beyond the historical data and uses the fitted Prophet model to forecast over it. The forecast includes predicted values for the Healthcare sector for the coming months.

# Create future dataframe
future = model.make_future_dataframe(periods=12, freq='M')  # Forecast for 12 months
forecast = model.predict(future)
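The frequency alias decides which calendar dates the future rows fall on. The pandas sketch below roughly mirrors how such future dates can be generated; the helper future_dates is hypothetical and is not Prophet's own code. It assumes the month column holds first-of-month dates, in which case the month-start alias 'MS' keeps future dates aligned with the history, whereas a month-end alias would produce dates such as 2024-01-31.

```python
import pandas as pd

def future_dates(last_date, periods, freq="MS"):
    """Dates strictly after last_date, one per period, at the given frequency."""
    dates = pd.date_range(start=last_date, periods=periods + 1, freq=freq)
    return dates[dates > last_date][:periods]

last_observed = pd.Timestamp("2023-12-01")  # hypothetical last month in the data
upcoming = future_dates(last_observed, periods=3)
print(upcoming)
```

Whether month-start or month-end is appropriate depends on how the dates are stored in the dataset, so it is worth checking df['month'] before choosing.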

Plot forecast

This code visualizes the forecast generated by the Prophet model together with the historical data. The plot shows the predicted values, the uncertainty intervals, and the actual observations.

# Plot forecast
model.plot(forecast)
plt.show()

Forecast Component Plotting

This code plots the individual components of the Prophet forecast: the trend, the seasonality, and holiday effects if any were configured. This shows how the model breaks the data down into separate patterns that add up to the final prediction.

# Plot forecast components
model.plot_components(forecast)
plt.show()

Visualizing Change Points in Predictions

This code retrieves and prints the changepoints identified by the Prophet model, which are the dates where the trend is allowed to change. It then plots the actual call counts against the forecast and marks each changepoint with a red dashed vertical line.

changepoints = model.changepoints
print("Detected Changepoints:\n", changepoints)
plt.figure(figsize=(12, 6))
plt.plot(healthcare_df['ds'], healthcare_df['y'], label='Actual Calls')
plt.plot(forecast['ds'], forecast['yhat'], label='Forecast', color='orange')
for cp in changepoints:
    # Red dashed vertical line at each changepoint
    plt.axvline(cp, color='red', linestyle='--', alpha=0.5)
plt.title('Changepoints in Healthcare Calls Trend', fontsize=16)
plt.legend()
plt.show()

Step 5:

Converting Dates into Timestamps

This code transforms the ds (date) column in healthcare_df into Unix timestamps using the .timestamp() method. It then converts the values to integers and saves them in a new column named ts.

healthcare_df['ts'] = healthcare_df['ds'].apply(lambda x: x.timestamp()).astype(int)

Data Preparation for Time-Series Forecasting

This code reshapes healthcare_df into a windowed format in which the previous n_past periods serve as predictors. It creates a new dataframe, cesium_df, where each row contains a window of historical values (y), the matching timestamps (ts), and the target, which is the value at the time step right after the window. The first rows of the resulting dataframe are then printed for review.

# Number of past periods to use for prediction
n_past = 4
target_data = []
for i in range(len(healthcare_df) - n_past - 1):
    temp = healthcare_df['y'][i:i + n_past].values  # Predictor values (y)
    time = healthcare_df['ts'][i:i + n_past].values  # Corresponding timestamps (ts)
    target = healthcare_df['y'][i + n_past]  # Value that follows the window
    if len(temp) == n_past and len(time) == n_past:
        target_data.append([temp, time, target])
cesium_df = pd.DataFrame(target_data, columns=['y', 'ts', 'target'])
print(cesium_df.head())
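The windowing idea is easier to see on a short list. The sketch below is a plain-Python illustration with made-up numbers; make_windows is a hypothetical helper, not part of the project code. Note one difference: the loop above subtracts an extra 1 in its range, so it stops one step earlier and leaves out the last window/target pair, while this sketch keeps it.

```python
def make_windows(values, n_past):
    """Pair each run of n_past consecutive values with the value that follows it."""
    rows = []
    for i in range(len(values) - n_past):
        window = values[i:i + n_past]
        target = values[i + n_past]
        rows.append((window, target))
    return rows

calls = [10, 12, 15, 11, 14, 18, 20]  # made-up monthly counts
pairs = make_windows(calls, n_past=4)
for window, target in pairs:
    print(window, "->", target)
```

With seven values and n_past = 4 there are three pairs, which also shows why the windowed dataframe always has fewer rows than the original series.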

Converting Data to Dictionary

This code converts the ts and y columns of the cesium_df dataframe into a dictionary in which each column name becomes a key and the column's values become a list. The result is stored in cs_df, which is used as input for feature extraction, and is then displayed.

cs_df = cesium_df[['ts','y']].to_dict('list')
cs_df

Step 6:

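Feature Generation Functions

This code defines four functions for extracting features from the 4-period predictor windows (n_past = 4). Each function receives the times (t), the measured values (m), and the errors (e), and computes its feature from the values alone.

# NumPy is needed by the feature functions
import numpy as np

# Featurizing the data
# Generating new features from the 4-period predictor windows
def mean_signal(t, m, e):
    return np.mean(m)  # average value in the window

def std_signal(t, m, e):
    return np.std(m)  # standard deviation of the window

def mean_square_signal(t, m, e):
    return np.mean(m ** 2)  # mean of the squared values

def abs_diffs_signal(t, m, e):
    return np.sum(np.abs(np.diff(m)))  # total absolute period-to-period change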
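As a quick check of what these four features measure, the sketch below evaluates the same NumPy expressions on a single made-up window. The numbers are illustrative only.

```python
import numpy as np

window = np.array([1.0, 2.0, 4.0, 7.0])  # made-up 4-period window

mean = np.mean(window)
std = np.std(window)                         # population standard deviation (ddof=0)
mean_square = np.mean(window ** 2)
abs_diffs = np.sum(np.abs(np.diff(window)))  # |2-1| + |4-2| + |7-4|

print(mean, mean_square, abs_diffs)          # 3.5 17.5 6.0
print(np.isclose(mean_square, std ** 2 + mean ** 2))  # True
```

The last line shows that the mean of squares equals the variance plus the squared mean, so the mean, std, and mean2 features carry overlapping information.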

Naming and Organizing Features

This code creates a dictionary, guo_features, where keys are descriptive names ("mean", "std", "mean2", "abs_diffs") for the new features, and values are the corresponding functions (mean_signal, std_signal, mean_square_signal, abs_diffs_signal).

# Giving names to new features
guo_features = {
    "mean": mean_signal,
    "std": std_signal,
    "mean2": mean_square_signal,
    "abs_diffs": abs_diffs_signal,
}

Applying Custom Features

This code applies the custom features to the time series data (ts and y from cs_df). Cesium's featurize_time_series function is called with the feature names from guo_features in features_to_use and the functions themselves in custom_functions. The extracted features are stored in fset_cesium, to be analyzed or used as model inputs later.

# Extracting the custom features with Cesium
fset_cesium = featurize.featurize_time_series(
    times=cs_df["ts"],
    values=cs_df["y"],
    errors=None,
    features_to_use=list(guo_features.keys()),
    custom_functions=guo_features,
)
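Conceptually, featurization applies every feature function to every window and collects the results into a table with one row per window. The sketch below reproduces that idea with NumPy and pandas only, as an illustration; it is not how Cesium is implemented, and featurize_windows, feature_functions, and the sample windows are hypothetical.

```python
import numpy as np
import pandas as pd

# Same four features as above, written as functions of the values only
feature_functions = {
    "mean": lambda m: np.mean(m),
    "std": lambda m: np.std(m),
    "mean2": lambda m: np.mean(m ** 2),
    "abs_diffs": lambda m: np.sum(np.abs(np.diff(m))),
}

def featurize_windows(windows, functions):
    """Apply every feature function to every window; one row per window."""
    rows = [
        {name: fn(np.asarray(w, dtype=float)) for name, fn in functions.items()}
        for w in windows
    ]
    return pd.DataFrame(rows)

windows = [[10, 12, 15, 11], [12, 15, 11, 14]]  # made-up windows
table = featurize_windows(windows, feature_functions)
print(table.round(2))
```

A plain table like this can be handy for sanity-checking the values that the Cesium call returns for the same windows.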

Adding Target Column

This code adds the target column from cesium_df to the fset_cesium dataframe.

fset_cesium["target"] = cesium_df["target"]

Previewing Feature Set

This code displays the first few rows of the fset_cesium dataframe, which now contains the extracted features and the target values.

fset_cesium.head()

Extracting Feature Names

This code stores the column names of the fset_cesium dataframe (which represent the feature names) in the feature_names variable.

feature_names= fset_cesium.columns

Visualizing Extracted Features

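This bar plot displays the time series features extracted for the Healthcare sector. The feature set fset_cesium is transposed so that each feature appears as a group on the x-axis.

fset_cesium.T.plot(kind='bar', figsize=(15, 8), legend=False)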
plt.title('Extracted Time Series Features for Healthcare', fontsize=16)
plt.xlabel('Feature', fontsize=12)
plt.ylabel('Value', fontsize=12)
plt.xticks(ticks=range(len(feature_names)), labels=feature_names, rotation=45)
plt.show()

Step 7:

Adding extracted features as regressors

This code adds the time series features extracted with Cesium (mean, mean2, std, abs_diffs) and the target column to healthcare_df as additional regressors, giving the model more information to use for forecasting.

# Add Cesium-extracted features as additional regressors
healthcare_df['mean'] = fset_cesium['mean']
healthcare_df['mean2'] = fset_cesium['mean2']
healthcare_df['std'] = fset_cesium['std']
healthcare_df['abs_diffs'] = fset_cesium['abs_diffs']
healthcare_df['target'] = fset_cesium['target']

Checking Null Values

The code mentioned here is used for checking the missing values in the healthcare_df dataframe by summing the NaN values in each column.

healthcare_df.isna().sum()

Handling Missing Values

The following code fills the gaps in healthcare_df with a forward fill (ffill), which propagates the last valid observation forward into the missing values that follow it. Gaps arise because the windowed feature set has fewer rows than healthcare_df, so some rows have no extracted feature values. Filling them avoids missing regressor values before fitting the model.

healthcare_df.ffill(inplace=True)
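The effect of a forward fill on trailing gaps can be seen in a tiny pandas example. The frame and its numbers are made up for illustration; the column name mean simply mirrors one of the feature columns above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "y": [100, 110, 120, 130, 140],
    "mean": [105.0, 115.0, np.nan, np.nan, np.nan],  # feature values end early
})
filled = frame.ffill()
print(filled["mean"].tolist())  # [105.0, 115.0, 115.0, 115.0, 115.0]
```

Every trailing row receives the last computed feature value, which is worth keeping in mind when interpreting the regressors for the most recent months.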

Adding Regressors To Prophet's Model

This code initializes a fresh Prophet model and registers the extracted features (mean, mean2, std, abs_diffs) and the target column as additional regressors. The model uses these regressors alongside its own trend and seasonality terms when making predictions.

model = Prophet()
# Add the extracted features as additional regressors
model.add_regressor('mean')
model.add_regressor('mean2')
model.add_regressor('std')
model.add_regressor('abs_diffs')
model.add_regressor('target')

Fitting the Model with Additional Regressors

The following code fits the Prophet model to the healthcare_df dataset, now enhanced with the additional regressors (mean, mean2, std, abs_diffs, target). These give the model more information with which to capture patterns in the data.

model.fit(healthcare_df)

Creating Future Dataframe

This code generates a dataframe of dates that extends 12 months past the historical data at a monthly frequency. The model uses this future dataframe, together with values for the additional regressors, to make predictions.

future = model.make_future_dataframe(periods=12, freq='M')

Merging DataFrames

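The following code merges the future dataframe with healthcare_df on the ds (date) column, so that each retained date carries the y value and the regressor columns that Prophet needs at prediction time. Because pd.merge performs an inner join by default, merged_df keeps only the dates that appear in both dataframes. Dates that exist only in future (the forecast months) have no regressor values in healthcare_df and are dropped unless regressor values are supplied for them.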

merged_df = pd.merge(future, healthcare_df, on='ds')
merged_df
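The difference between the default inner join and a left join is easiest to see on a few rows. The sketch below uses made-up dates and values; the frames history and future only imitate the shapes of healthcare_df and the future dataframe.

```python
import pandas as pd

history = pd.DataFrame({
    "ds": pd.to_datetime(["2023-10-01", "2023-11-01", "2023-12-01"]),
    "y": [100, 110, 120],
    "mean": [95.0, 102.0, 108.0],
})
future = pd.DataFrame({
    "ds": pd.to_datetime(["2023-10-01", "2023-11-01", "2023-12-01",
                          "2024-01-01", "2024-02-01"]),
})

inner = pd.merge(future, history, on="ds")             # default how="inner"
left = pd.merge(future, history, on="ds", how="left")  # keeps every future date

print(len(inner), len(left))      # 3 5
print(left["mean"].isna().sum())  # 2
```

A left join keeps the two future months, but their regressor columns are empty, so values for those columns would have to be provided (for example by forecasting or carrying forward the features) before Prophet can predict for them.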

Predicting with Merged Data

This code uses merged_df for forecasting with the Prophet model. The model.predict() method applies the trained model to the merged data and returns a forecast dataframe with predicted values for the Healthcare sector for every date in merged_df.

forecast = model.predict(merged_df)

Forecast Visualization

This code plots the forecast created by the Prophet model so that the predictions for Healthcare calls can be compared with the historical data. The plot shows the predicted values with their uncertainty intervals alongside the observed data points.

# Step 8: Plot the forecast
model.plot(forecast)
plt.title('Healthcare Call Forecast with Prophet and Cesium Features')
plt.show()

Plotting Forecast Components

This code visualizes the individual components of the forecast, such as the trend, seasonality, and any holidays (if used).

# Optionally: Plot the components of the forecast (trend, seasonality, etc.)
model.plot_components(forecast)
plt.show()
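The walkthrough imports cross_validation and performance_metrics but does not use them, so the two models are never compared numerically. One simple way to check whether the extra regressors help is to hold out the last few months and compare error metrics. The sketch below shows two common metrics in plain Python; the function names and all numbers are made up for illustration and are not results from this project.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_percentage_error(actual, predicted):
    """Average absolute error as a percentage of the actual value."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

actual = [200, 220, 250]    # made-up held-out call counts
baseline = [190, 230, 240]  # made-up forecasts from one model
enriched = [195, 224, 246]  # made-up forecasts from another model

for name, predicted in [("baseline", baseline), ("enriched", enriched)]:
    mae = mean_absolute_error(actual, predicted)
    mape = mean_absolute_percentage_error(actual, predicted)
    print(name, round(mae, 2), round(mape, 2))
```

Lower values indicate a closer fit to the held-out months; the comparison is only meaningful when both models are scored on the same dates.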

Conclusion

This project demonstrates how Prophet and Cesium can be combined for time series forecasting. By extracting features such as the mean, standard deviation, and absolute differences from historical healthcare call data, we enriched the dataset with additional regressors intended to improve the model's accuracy. The Prophet model was fitted on both the original and the enriched data to forecast call volumes. Plotting the forecasts and their components helps us understand the trend and seasonality of healthcare calls. The same approach can be applied to other real-world forecasting tasks that involve time series data.

Challenges New Coders Might Face

  • Challenge: Non-stationary Time Series Data
    Solution: Stabilize the series and make it stationary by differencing; once that is done, modeling can proceed.

  • Challenge: Missing Data
    Solution: Forward filling or interpolation fills the gaps smoothly without losing the underlying trends.

  • Challenge: Selection of Features
    Solution: We used Cesium to capture a variety of statistical features (mean, standard deviation, absolute difference) to enrich the model, with the aim of improving its predictive power.

  • Challenge: Model Overfitting
    Solution: Add external features as regressors sparingly, so that the model does not become too complicated to generalize well to future data.
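The differencing mentioned in the first challenge replaces each value with its change from an earlier value, which removes a steady trend. The sketch below is a minimal plain-Python illustration on made-up numbers; difference is a hypothetical helper. Prophet fits its own trend component, so this step matters mostly for models that assume a stationary series.

```python
def difference(values, lag=1):
    """Return values[t] - values[t - lag] for every t where both exist."""
    return [values[t] - values[t - lag] for t in range(lag, len(values))]

trend_series = [100, 105, 110, 115, 120]  # made-up series with a steady upward trend
print(difference(trend_series))           # [5, 5, 5, 5]
```

With monthly data, a lag of 12 compares each month with the same month a year earlier, which removes a repeating yearly pattern in the same way.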

FAQ

Question 1: How do I forecast healthcare call data using Prophet?
Answer: To forecast healthcare call volumes with Prophet, first format the time series as a dataframe with a ds (date) column and a y (call count) column. Then fit the Prophet model and forecast the coming months; the forecast includes the trend and seasonal components.

Question 2: How can I manage missing data in time series forecasting?
Answer: Some techniques for dealing with incomplete datasets include forward or backward filling, interpolation, and row removal. The method should be chosen based on the amount and pattern of missing data.

Question 3: Why should I use Cesium for feature Extraction in time series forecasting?
Answer: Cesium is a library for extracting features from time series data. It can compute statistical features such as mean, standard deviation, and skewness, and it accepts custom feature functions. These features can be supplied to models such as Prophet as extra inputs, which may improve prediction accuracy.

Question 4: What are some of the significant attributes that can be used in forecasting telephone calls to healthcare facilities?
Answer: Important features include the historical call counts themselves, plus time series features derived from them, such as mean, standard deviation, skewness, and trend. These capture how call behavior changes over time and can improve model performance.

Question 5: What is Prophet and how does it work for time series forecasting?
Answer: Prophet is a time series forecasting tool designed to handle multiple seasonalities, holiday effects, and missing data. It fits an additive model in which a trend component is combined with seasonal and holiday components. This makes it well suited to forecasting call volumes in healthcare and other business data.
