Predictive Analytics on Business License Data Using Deep Learning
Let us introduce an interesting deep learning project on predictive analytics with business license data. This project offers easy step-by-step guidance through the modeling process, which involves predicting whether a business license application will be approved, renewed, or revoked using TensorFlow and H2O.
This project is aimed both at those who have little or no experience with machine learning and at those who want to take their skills several notches higher. It is a practical project that covers every stage, from data preprocessing to building the model with deep neural networks (DNNs).
Let's explore the topic more!
Project Overview:
The purpose of this project is to predict whether a business license will be active or not using deep learning and machine learning techniques. We'll explore, clean, and prepare a dataset of more than 86,000 businesses for modeling. We'll first build a baseline model using H2O's Random Forest and then a more complex Deep Neural Network (DNN) using TensorFlow.
By the end, you will have built a system that can predict the most likely outcome of a business license application.
Key Features:
- Tools: TensorFlow, H2O, and Python libraries (pandas, NumPy, matplotlib, seaborn, scikit-learn).
- Outcome: Predict business license statuses such as Approved, Renewed, or Revoked.
- Use case: Predictive analytics for businesses, government regulations, or consultancy services.
Prerequisites
Before starting, please ensure that you have the following:
- Google Colab or a local working Python environment
- Familiarity with libraries such as TensorFlow, H2O, pandas, seaborn, NumPy, and scikit-learn
- A basic understanding of machine learning concepts and deep learning techniques
- Business License dataset
Approach
We follow a structured approach:
- Data Collection: Collect a dataset that contains data on 86,000 different businesses and their licensing details.
- Data Preparation: Clean and preprocess the available data for model training.
- Model Building: We build two models: a Random Forest baseline using H2O and a deep neural network using TensorFlow.
- Evaluation: We run and evaluate both models using essential metrics such as accuracy.
Workflow and Methodology
The overall workflow of this project includes:
- Data Preparation: Load and clean the dataset of business licenses. Then handle missing values and normalize data features.
- Exploratory Data Analysis: Analyze data distribution and relationships between features to understand patterns.
- Baseline Model with H2O: Build a Random Forest baseline model using the H2O framework to predict license statuses.
- DNN Setup: Train a DNN model using TensorFlow including dropout regularization.
- Evaluation: Test the trained model with test data. Then calculate accuracy and loss metrics for performance evaluation.
- Prediction: Use the trained DNN model to make predictions on unseen data.
The methodology involves:
- Supervised learning: We train the model with labeled data to predict license statuses.
- Feature engineering: Important features like license type, business type, and ZIP code are used for predictions.
- Model training: We use categorical cross-entropy loss for the DNN and Gini impurity for the Random Forest, as sketched just below.
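For intuition, here is a tiny NumPy sketch of the two training criteria just mentioned. It is illustrative only: TensorFlow and H2O compute these internally, and the example counts are made up.
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot labels; y_pred: predicted class probabilities
    return -np.mean(np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)), axis=1))

def gini_impurity(class_counts):
    # class_counts: number of samples of each class at a tree node
    p = np.asarray(class_counts, dtype=float) / np.sum(class_counts)
    return 1.0 - np.sum(p ** 2)

# Example: a tree node holding 70 AAI, 25 AAC, and 5 REV licenses
print(gini_impurity([70, 25, 5]))  # about 0.45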
Data Collection
First, we load a business license dataset from Kaggle containing detailed information about each business, including license number, license description, license status, and application type. This data lets us predict whether a business license will be issued, renewed, or revoked.
Data Preparation
After collecting the dataset, we prepare and clean the data before modeling. This involves handling missing values, encoding categorical variables, and splitting the dataset into training and testing sets.
Data Preparation Workflow:
- Handle missing values: To ensure the model's reliability, we either fill in or remove the data points that are missing.
- Categorical encoding: Convert categorical features into numerical values using one-hot encoding.
- Train-test split: Split the dataset 80/20 (80% for training, 20% for testing) so we can check that the model generalizes to new data. A compact preview of the whole workflow follows below.
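Here is that preview: the whole preparation workflow boils down to a few pandas and scikit-learn calls. This sketch uses real column names from this dataset but a placeholder file path; the full version appears step by step in the walkthrough below.
import pandas as pd
from sklearn.model_selection import train_test_split

pdf = pd.read_csv("Data.csv")                 # placeholder path
pdf['ZIP CODE'] = pdf['ZIP CODE'].fillna(-1)  # handle missing values
encoded = pd.get_dummies(pdf, columns=['APPLICATION TYPE', 'LICENSE DESCRIPTION'])  # one-hot encoding
train, test = train_test_split(encoded, test_size=0.2, random_state=123)            # 80/20 split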
Code Explanation
To make the project easy to follow, let's walk through the code step by step:
STEP 1:
Mounting Google Drive
This code mounts Google Drive, which allows us to access datasets stored there.
from google.colab import drive
drive.mount('/content/drive')
Install Required Packages
This code installs three packages: tensorflow, numpy, and h2o. The setup enables classical machine learning through H2O as well as deep learning through TensorFlow, while NumPy handles the numerical computation.
!pip install tensorflow
!pip install numpy
!pip install h2o
Importing Required Libraries
We import the essential Python libraries: pandas for data manipulation, seaborn and matplotlib for visualization, NumPy for arrays, matrices, and mathematical operations, and scikit-learn's train_test_split for splitting the data. H2O is popular for working with and analyzing large datasets; calling h2o.init() starts the H2O engine, which lets us build and train machine learning models through the library.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator, H2ORandomForestEstimator
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
h2o.init()
This code configures pandas to display up to 500 columns and 500 rows of a DataFrame.
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
Loading the Dataset
This code loads the business license dataset and prints its shape, i.e., the total number of rows and columns.
pdf = pd.read_csv("/content/drive/MyDrive/Aionlinecourse/Data.csv")
print(pdf.shape)
STEP 2:
Exploratory Data Analysis
This code shows all column names of the dataframe.
pdf.columns
This code shows the count of each unique value in the “LICENSE STATUS” column.
pdf["LICENSE STATUS"].value_counts()
It returns True if 'LICENSE STATUS' is present in the DataFrame's columns and False otherwise.
'LICENSE STATUS' in pdf.columns
This line keeps only the rows where the "LICENSE STATUS" is AAI, AAC, or REV and discards the rest.
pdf = pdf[pdf['LICENSE STATUS'].isin(['AAI', 'AAC', 'REV'])]
It returns the total count of missing values for each column. This makes it easy to see where data might be incomplete.
pdf.isna().sum()
This code shows the concise summary of the DataFrame.
pdf.info()
This code shows the number of unique values for each column in the DataFrame pdf.
pdf.nunique()
The command pdf.head() displays the first five rows of the DataFrame pdf. This is useful for quickly inspecting the initial entries, verifying the structure, and reviewing the data types of each column, which aids in understanding the dataset before further analysis or manipulation.
pdf.head()
This code shows a count plot of the categorical variable “LICENSE STATUS” in the DataFrame pdf. The count plot visually represents the distribution of license statuses, making it easier to spot trends or imbalances in the data.
sns.countplot(data=pdf, x='LICENSE STATUS')
plt.show()
This provides a summary of unique values present in the "CONDITIONAL APPROVAL" column.
pdf["CONDITIONAL APPROVAL"].value_counts()
This line renames the column 'DOING BUSINESS AS NAME' to itself, which effectively makes no change; the inplace=True parameter applies the rename directly to pdf rather than returning a new DataFrame. As written the call is redundant, and it is likely a leftover from renaming a differently named source column.
pdf.rename(columns={'DOING BUSINESS AS NAME': 'DOING BUSINESS AS NAME'}, inplace=True)
The code adds a new column called LEGAL BUSINESS NAME MATCH to the DataFrame. It checks whether the LEGAL NAME and the DOING BUSINESS AS NAME match (one contained in the other), ignoring case, and stores 1 for a match and 0 otherwise.
pdf['LEGAL BUSINESS NAME MATCH'] = pdf.apply(
    lambda x: 1 if str(x['LEGAL NAME']).upper() in str(x['DOING BUSINESS AS NAME']).upper()
    or str(x['DOING BUSINESS AS NAME']).upper() in str(x['LEGAL NAME']).upper() else 0,
    axis=1)
This code shows the count of each unique value in the “LICENSE DESCRIPTION” column.
pdf['LICENSE DESCRIPTION'].value_counts()
The code below replaces a series of strings in the DataFrame pdf, collapsing class-specific license variants into a single label. This standardizes the descriptions of the various business licenses so the data is cleaner and easier to analyze (the dataset's own spellings, such as 'Tire Facilty' and 'vegtables', are matched exactly as they appear).
pdf['LICENSE DESCRIPTION'] = pdf['LICENSE DESCRIPTION'].replace({
    'Motor Vehicle Repair : Engine Only (Class II)': 'Motor Vehicle Repair',
    'Motor Vehicle Repair: Engine/Body(Class III)': 'Motor Vehicle Repair',
    'Motor Vehicle Repair; Specialty(Class I)': 'Motor Vehicle Repair',
    'Day Care Center Under 2 Years': 'Day Care Center',
    'Day Care Center 2 - 6 Years': 'Day Care Center',
    'Day Care Center Under 2 and 2 - 6 Years': 'Day Care Center',
    'Peddler, non-food': 'Peddler',
    'Peddler, non-food, special': 'Peddler',
    'Peddler, food (fruits and vegtables only)': 'Peddler',
    'Peddler,food - (fruits and vegetables only) - special': 'Peddler',
    'Tire Facilty Class I (100 - 1,000 Tires)': 'Tire Facilty',
    'Tire Facility Class II (1,001 - 5,000 Tires)': 'Tire Facilty',
    'Tire Facility Class III (5,001 - More Tires)': 'Tire Facilty',
    'Repossessor Class A': 'Repossessor',
    'Repossessor Class B': 'Repossessor',
    'Repossessor Class B Employee': 'Repossessor',
    'Expediter - Class B': 'Expediter',
    'Expediter - Class A': 'Expediter',
    'Expediter - Class B Employee': 'Expediter',
    'Itinerant Merchant, Class II': 'Itinerant Merchant',
    'Itinerant Merchant, Class I': 'Itinerant Merchant',
})
The command pdf['LICENSE DESCRIPTION'].nunique() calculates the number of unique values in the LICENSE DESCRIPTION column of the DataFrame pdf. This is useful for understanding the variety of license types in the dataset and can help in identifying distinct categories or classes within this column.
pdf['LICENSE DESCRIPTION'].nunique()
The code below first removes the periods from the business names, then derives a BUSINESS TYPE column from keywords appearing in the LEGAL NAME and DOING BUSINESS AS NAME columns. Every business starts as 'PVT' (private) and is reclassified as 'INC', 'LLC', 'CORP', or 'LTD' when a matching keyword is found; later checks override earlier ones, so order matters (for example, 'CO' also matches names already labeled by earlier rules).
pdf['LEGAL NAME'] = pdf['LEGAL NAME'].str.replace('.', '', regex=False)
pdf['DOING BUSINESS AS NAME'] = pdf['DOING BUSINESS AS NAME'].str.replace('.', '', regex=False)
pdf['BUSINESS TYPE'] = 'PVT'  # default: private business
# Keyword-to-type rules, applied in order so later labels override earlier ones
type_rules = [
    ('INC', ['INC', 'INCORPORATED']),
    ('LLC', ['LLC']),
    ('CORP', ['CO', 'CORP', 'CORPORATION']),
    ('LTD', ['LTD', 'LIMITED']),
]
for label, keywords in type_rules:
    for col in ['LEGAL NAME', 'DOING BUSINESS AS NAME']:
        for kw in keywords:
            # na=False treats missing names as non-matches
            pdf['BUSINESS TYPE'] = np.where(pdf[col].str.contains(kw, na=False), label, pdf['BUSINESS TYPE'])
This code shows the count of each unique value in the “BUSINESS TYPE” column.
pdf['BUSINESS TYPE'].value_counts()
This code shows a count plot of the categorical variable “BUSINESS TYPE” in the DataFrame pdf. The count plot visually represents the distribution of business types, making it easier to spot trends or imbalances in the data.
sns.countplot(data=pdf, x='BUSINESS TYPE')
plt.show()
This code shows the count of each unique value in the “ZIP CODE” column.
pdf['ZIP CODE'].value_counts()
STEP 3:
Data Preprocessing
The first line fills any missing ZIP codes with -1. The second line creates a new column named ZIP CODE MISSING that holds 1 where the ZIP code was missing and 0 otherwise.
pdf['ZIP CODE'] = pdf['ZIP CODE'].fillna(-1)
pdf['ZIP CODE MISSING'] = pdf.apply(lambda x: 1 if x['ZIP CODE'] == -1 else 0, axis=1)
This code draws a histogram of the SSA column in the DataFrame pdf.
pdf[['SSA']].plot.hist(bins=12, alpha=0.8)
This code replaces all null values in the SSA column with -1.
pdf['SSA'] = pdf['SSA'].fillna(-1)
This code turns the APPLICATION REQUIREMENTS COMPLETE column into a binary indicator: null values are first filled with -1, then -1 is mapped to 0 and every other value to 1.
pdf['APPLICATION REQUIREMENTS COMPLETE'] = pdf['APPLICATION REQUIREMENTS COMPLETE'].fillna(-1)
pdf['APPLICATION REQUIREMENTS COMPLETE'] = pdf.apply(lambda x: 0 if x['APPLICATION REQUIREMENTS COMPLETE'] == -1 else 1, axis=1)
Train-Test Split
This code divides the DataFrame pdf into training and testing sets. The test set holds 20% of the data, and a fixed random state makes the split reproducible.
train, test = train_test_split(pdf, test_size=0.2, random_state=123)
The pandas DataFrames "train" and "test" are turned into H2OFrame objects by this code. The H2O machine learning platform uses a data structure called H2OFrame to make it easier to work with and handle large datasets.
train = h2o.H2OFrame(train)
test = h2o.H2OFrame(test)
STEP 4:
Model Building
This code initializes a Random Forest model and trains it on the selected features to predict the LICENSE STATUS variable. The model grows 100 trees, each limited to a maximum depth of 10, to keep performance strong without overfitting.
h2o_rf = H2ORandomForestEstimator(ntrees=100, seed=123, max_depth=10)
h2o_rf.train(x=['APPLICATION TYPE', 'CONDITIONAL APPROVAL', 'LICENSE CODE', 'SSA', 'LEGAL BUSINESS NAME MATCH',
                'ZIP CODE MISSING', 'APPLICATION REQUIREMENTS COMPLETE', 'LICENSE DESCRIPTION', 'BUSINESS TYPE'],
y='LICENSE STATUS', training_frame=train)
This code uses the trained Random Forest model (h2o_rf) to make predictions on the test dataset, attaches the actual LICENSE STATUS values from the test data as an 'actual' column, and converts the results to a pandas DataFrame for further analysis.
predictions = h2o_rf.predict(test)
predictions['actual'] = test['LICENSE STATUS']
predictions = predictions.as_data_frame()
The command predictions.head() displays the first five rows of the DataFrame predictions. This allows for a quick inspection of the initial predictions, helping to verify data structure, review column contents, and ensure that the prediction process is working as expected before further analysis or evaluation.
predictions.head()
It computes the accuracy of the model's predictions as a percentage. It shows how many of the predictions made by the model were correct compared to the total number of predictions.
accuracy = predictions[predictions.actual == predictions.predict].shape[0] * 100.0 / predictions.shape[0]
accuracy
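A single accuracy figure can hide class-level mistakes, so it is often worth inspecting a confusion matrix as well. Here is a minimal sketch using pandas, with the column names taken from the predictions DataFrame above:
# Rows are actual license statuses, columns are predicted ones
confusion = pd.crosstab(predictions['actual'], predictions['predict'],
                        rownames=['actual'], colnames=['predicted'])
print(confusion)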
Building the DNN Model
This code defines two lists. predictors holds the features (independent variables) used to train the model, and target holds the one-hot encoded columns of the dependent variable LICENSE STATUS.
predictors = ['APPLICATION TYPE', 'CONDITIONAL APPROVAL', 'LICENSE CODE', 'SSA', 'LEGAL BUSINESS NAME MATCH',
              'ZIP CODE MISSING', 'APPLICATION REQUIREMENTS COMPLETE', 'LICENSE DESCRIPTION', 'BUSINESS TYPE']
target = ["LICENSE STATUS_AAC", "LICENSE STATUS_AAI", "LICENSE STATUS_REV"]
pdf[predictors].info()
This snippet prepares the dataset for modeling: it selects the desired features, one-hot encodes the categorical variables (including the target LICENSE STATUS), and then lists all of final_df's column names.
final_df = pdf[predictors + ["LICENSE STATUS"]]
final_df = pd.get_dummies(final_df, columns=['APPLICATION TYPE', 'CONDITIONAL APPROVAL', 'LICENSE CODE', 'LICENSE DESCRIPTION', 'BUSINESS TYPE', 'LICENSE STATUS'])
final_df.columns
The final_df DataFrame is split into training and testing sets so the model can be trained on one portion and evaluated on unseen data. The code then builds the feature matrices (X_train, X_test) and target arrays (y_train, y_test) in the format the model expects, and finally prints their shapes so you can verify the data is properly prepared for model training.
train, test = train_test_split(final_df, test_size=0.2, random_state=123)
X_train = train.drop(target, axis=1).values
y_train = train[target].values
X_test = test.drop(target, axis=1).values
y_test = test[target].values
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
The command train.head() displays the first five rows of the DataFrame train. This is typically used to inspect the initial entries in the training dataset, allowing you to verify the structure, examine column values, and check for any data preprocessing needs before proceeding with model training.
train.head()
The following code builds the Keras DNN model. It creates a Sequential model whose input layer matches the number of features in X_train, followed by three dense layers: the first has 128 ReLU-activated neurons, and the next two have 64 tanh-activated neurons each. A dropout layer with rate 0.2 follows each dense layer to prevent overfitting, and the final layer has three softmax-activated neurons for multi-class classification.
The model is compiled with an Adam optimizer at a 0.01 learning rate, categorical cross-entropy as the loss function, and accuracy as the performance metric. The script ends with a model summary showing the model's layers.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential(
[
layers.InputLayer(input_shape=(X_train.shape[1],)),
layers.Dense(128, activation="relu"),
layers.Dropout(0.2),
layers.Dense(64, activation="tanh"),
layers.Dropout(0.2),
layers.Dense(64, activation="tanh"),
layers.Dropout(0.2),
layers.Dense(3, activation="softmax"),
]
)
optimizer = keras.optimizers.Adam(learning_rate=0.01)
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])
print(model.summary())
This code converts the training data, X_train and y_train, into NumPy arrays of 32-bit floating-point numbers for compatibility with TensorFlow. Both X_train and y_train are then explicitly cast to tf.float32 using TensorFlow’s tf.cast function, ensuring consistent data types for model training. Finally, the code initiates the training process by fitting the model on X_train and y_train over 100 epochs, allowing the model to learn from the training data through iterative adjustments.
X_train = np.array(X_train, dtype=np.float32)
y_train = np.array(y_train, dtype=np.float32)
X_train = tf.cast(X_train, tf.float32)
y_train = tf.cast(y_train, tf.float32)
model.fit(X_train, y_train, epochs=100)
This code prepares the test data for model evaluation. It first converts X_test and y_test into NumPy arrays with a 32-bit floating-point data type, ensuring consistency with TensorFlow operations. Both X_test and y_test are then cast to tf.float32 using TensorFlow's tf.cast function for compatibility with the model. Finally, model.evaluate(X_test, y_test) is called to assess the model’s performance on the test data, providing metrics such as loss and accuracy, which indicate how well the model generalizes to new data.
X_test = np.array(X_test, dtype=np.float32)
y_test = np.array(y_test, dtype=np.float32)
X_test = tf.cast(X_test, tf.float32)
y_test = tf.cast(y_test, tf.float32)
model.evaluate(X_test, y_test)
This code makes predictions on the test data, returning one softmax probability vector per row.
model.predict(X_test)
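To turn each probability vector into a readable label, take the argmax of each row and map it back to the column order of the one-hot target defined earlier; a small sketch:
probs = model.predict(X_test)
labels = ['AAC', 'AAI', 'REV']  # order matches the LICENSE STATUS_* target columns
predicted_status = [labels[i] for i in np.argmax(probs, axis=1)]
print(predicted_status[:10])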
Project Conclusion
During this project, we took an exploratory journey through predictive analytics and deep learning to solve a real business problem. After analyzing a dataset of over 86,000 businesses, we built a system that predicts whether a business license will be approved, renewed, or revoked. By combining TensorFlow for the deep neural network (DNN), H2O for machine learning, and thoughtful feature engineering, we developed an intelligent, data-driven solution.
Through this project, we have demonstrated how predictive analytics can transform business organizations. Whether you're a data enthusiast, a tech professional, or a business decision-maker, it shows that a working grasp of deep learning can change how regulatory approvals and business processes are handled.
Challenges and Troubleshooting
Data Imbalance: Some license statuses were far more common than others, which could bias the model's performance.
Solution: Use SMOTE or oversampling to balance the dataset (a sketch follows below).
Missing Values: Many records had missing data, especially in ZIP codes.
Solution: Fill missing values with a placeholder (-1) and add a binary "missing" indicator as an extra feature.
Model Overfitting: The DNN model risked overfitting the training data.
Solution: Use Dropout layers after each dense layer to prevent overfitting.
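If the class imbalance proves stubborn, one option is SMOTE from the imbalanced-learn package. Here is a minimal sketch, assuming NumPy feature and one-hot target arrays like X_train and y_train from the split above (SMOTE works on integer class labels, so the one-hot targets are collapsed first):
# pip install imbalanced-learn
import numpy as np
from imblearn.over_sampling import SMOTE

y_labels = np.argmax(y_train, axis=1)  # collapse one-hot targets to class indices
X_res, y_res = SMOTE(random_state=123).fit_resample(X_train, y_labels)
print(np.bincount(y_res))  # each class now has the same count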
FAQ
Question 1: Which dataset is used in this predictive analytics project?
Answer: We use a Kaggle dataset of roughly 86,000 business license records.
Question 2: Why is the random forest and DNN model used?
Answer: DNN captures more complex patterns in the data, and random forest provides a strong baseline.
Question 3: How accurate is the Deep learning model in predicting business licenses?
Answer: While performance varies depending on further tuning, accuracy is approximately 78%.
Question 4: What tools are used to forecast the business license approval?
Answer: This project uses Python, TensorFlow, H2O, and scikit-learn to build predictive models on business license data.
Question 5: How does deep learning assist in predicting business license outcomes?
Answer: Deep Neural Networks (DNNs) analyze historical data to accurately predict the status of business licenses, improving overall efficiency.
It's your turn:
Now that you’ve learned how to build a predictive analytics system for business licenses, it’s time to take the next step! Whether you're interested in improving the model’s accuracy or applying these techniques to other datasets, this project has provided the perfect foundation.
If you want to explore more, you can check out this predictive analytics content.