Machine Learning Project: Airline Tickets Price Prediction

Written by Aionlinecourse

This project is all about predicting airline ticket prices. The dataset contains columns such as Departure time, Arrival time, Date of journey, Duration, and so on. Through data preprocessing, data analysis, feature selection, and other techniques, we built a machine learning model, and at the end we applied several ML algorithms and compared their accuracy.


Understand The Data

Dealing with missing values
  • For data manipulation, numerical computation, and visualization, we imported the pandas, NumPy, seaborn, and matplotlib libraries.
  • Reading the data and saving it into the train_data variable.
  • Previewing the data by calling the head function.
import pandas as pd                #importing pandas for data manipulation and analysis
import numpy as np                 #importing numpy for numerical computation
import seaborn as sns              #importing seaborn for visualization
import matplotlib.pyplot as plt    #importing matplotlib for other visualization tasks
train_data=pd.read_excel('Data_Train.xlsx')   #reading data
train_data.head()                  #calling head to get a preview of the data


The shape of the entire data frame.

train_data.shape


Getting the number of missing values per column by chaining the isna() and sum() functions.

train_data.isna().sum()


Dropping rows with missing values using the dropna() function and updating the data frame in place by setting inplace=True.

train_data.dropna(inplace=True)

Here, we cross-check whether the missing values were removed or not. In the output, we can see there are no missing values left.

train_data.isna().sum()


Cleaning data for analysis and modeling purposes

Checking the datatype of every column available in the data.

train_data.dtypes


This function converts a column to datetime format. It takes col as a parameter; pd.to_datetime() converts the column to datetime, and the result is assigned back to train_data[col].

def change_into_datetime(col):
    train_data[col]=pd.to_datetime(train_data[col])

All columns of train_data in the form of a list.

train_data.columns


Looping through the ‘Date_of_Journey’, ‘Dep_Time’, and ‘Arrival_Time’ columns and passing each one into the change_into_datetime function, to convert it from object to datetime.

for i in ['Date_of_Journey','Dep_Time','Arrival_Time']:
    change_into_datetime(i)

Checking train_data’s column data types to confirm the conversion to datetime.

train_data.dtypes


Here, we extract the day and month from the ‘Date_of_Journey’ column of train_data, because if we pass the raw date into our machine learning model, it will not understand which part is the day and which is the month. We assign the results to two new columns named journey_day and journey_month.

train_data['journey_day']=train_data['Date_of_Journey'].dt.day
train_data['journey_month']=train_data['Date_of_Journey'].dt.month

Calling the head function to get a rough idea of our train_data data frame.

train_data.head()


Now we drop the ‘Date_of_Journey’ column, as we have extracted everything we need from it. Setting inplace=True updates the data frame, and axis=1 means we are dropping a column rather than a row.

train_data.drop('Date_of_Journey',axis=1,inplace=True)

You can see there is no such column named ‘Date_of_Journey’.

train_data.head()


How to extract Derived features from data

We created two functions to extract hours and minutes, plus a drop function to remove a column once the hours and minutes have been extracted from it. The extraction functions take two parameters, a data frame and a column name, and save the result into a new column whose name is formed by concatenating the column name with a suffix. drop_column updates the data frame in place with inplace=True.

def extract_hour(df,col):
    df[col+'_hour']=df[col].dt.hour
def extract_min(df,col):
    df[col+'_minute']=df[col].dt.minute
def drop_column(df,col):
    df.drop(col,axis=1,inplace=True)

Extracting the hour and minute from the Dep_Time column by calling the extract_hour and extract_min functions, then dropping Dep_Time from the table by calling the drop_column function.

extract_hour(train_data,'Dep_Time')
extract_min(train_data,'Dep_Time')
drop_column(train_data,'Dep_Time')

You can see there is no Dep_Time column, as we dropped it, and two columns named ‘Dep_Time_hour’ and ‘Dep_Time_minute’ have been added.

train_data.head()


Now, extract the hour and minute from the Arrival_Time column and drop this column by calling the same functions.

extract_hour(train_data,'Arrival_Time')
extract_min(train_data,'Arrival_Time')
drop_column(train_data,'Arrival_Time')

The Arrival_Time column is dropped and two new columns are added, Arrival_Time_hour and Arrival_Time_minute.

train_data.head()


In the Duration column, some values are missing the minute part (or the hour part). We fix this by appending ' 0m' to hour-only values and prepending '0h ' to minute-only values; the space matters, because we later split on it to separate hours from minutes.

duration = list(train_data['Duration'])
for i in range(len(duration)):
    if len(duration[i].split(' '))==2:
        pass                                 #both hours and minutes present
    else:
        if 'h' in duration[i]:
            duration[i]=duration[i]+' 0m'    #hour only: append zero minutes
        else:
            duration[i]='0h '+duration[i]    #minute only: prepend zero hours
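
As a quick sanity check (this snippet is illustrative and not part of the original notebook), the padding logic can be verified on a few sample strings:

samples=['2h 50m','19h','50m']
for i in range(len(samples)):
    if len(samples[i].split(' '))==2:
        pass                              #both parts already present
    elif 'h' in samples[i]:
        samples[i]=samples[i]+' 0m'       #hour only: append zero minutes
    else:
        samples[i]='0h '+samples[i]       #minute only: prepend zero hours
print(samples)                            #['2h 50m', '19h 0m', '0h 50m']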

Saving the fixed values back into the Duration column.

train_data['Duration']=duration

We can see the Duration column is updated.

train_data.head()


Perform Data Pre-processing

We split our Duration column into Duration_hours and Duration_mins so that our machine learning model can understand the features; after adding the two new columns, we drop the original Duration column.

def hour(x):
    return x.split(' ')[0][0:-1]    #take '2h' from '2h 50m' and strip the trailing 'h'
def minute(x):
    return x.split(' ')[1][0:-1]    #take '50m' from '2h 50m' and strip the trailing 'm'

Here, we apply the hour and minute functions to our Duration column and save the results into new columns named ‘Duration_hours’ and ‘Duration_mins’.

train_data['Duration_hours']=train_data['Duration'].apply(hour)
train_data['Duration_mins']=train_data['Duration'].apply(minute)

Calling head to see the changes.

train_data.head()


Removing the ‘Duration’ column by calling the drop_column function, as it is no longer needed.

drop_column(train_data,'Duration')

Checking the data frame after the operation.

train_data.head()


The data types of the train_data set.

train_data.dtypes


Converting ‘Duration_hours’ and ‘Duration_mins’ from object type to int type, and updating the data frame.

train_data['Duration_hours']=train_data['Duration_hours'].astype(int)
train_data['Duration_mins']=train_data['Duration_mins'].astype(int)

Cross-checking the data types.

train_data.dtypes


To separate the categorical features from the continuous ones, we loop through every column of the dataset; if a column's data type is object ('O'), we treat it as a categorical column.

cat_col=[col for col in train_data.columns if train_data[col].dtype=='O']
cat_col


Fetching continuous features.

cont_col=[col for col in train_data.columns if train_data[col].dtype!='O']
cont_col


Handle Categorical Data & Feature Encoding

Categorical data comes in two types: nominal and ordinal. Nominal data has no hierarchy (for example, the name of a country), while ordinal data has some kind of hierarchy. So, for nominal features we perform one-hot encoding, and for ordinal features we apply label encoding.
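
As a small illustration on toy data (not from this dataset), one-hot encoding suits nominal features, while an explicit ordered mapping suits ordinal ones:

toy=pd.DataFrame({'city':['Delhi','Kolkata','Delhi'],'stops':['non-stop','1 stop','2 stops']})
pd.get_dummies(toy['city'],drop_first=True)               #nominal: no order implied between cities
toy['stops'].map({'non-stop':0,'1 stop':1,'2 stops':2})   #ordinal: the order is preserved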

Indexing train_data with cat_col to get all the categorical data, and saving it into a new data frame.

categorical=train_data[cat_col]
categorical


Now, access the Airline column and count how many times each category occurs.

categorical['Airline'].value_counts()


Boxplot of Price per Airline, with train_data sorted by the Price column in descending order.

plt.figure(figsize=(15,5))
sns.boxplot(x='Airline',y='Price',data=train_data.sort_values('Price',ascending=False))


Plotting Total_Stops against Price the same way as before.

plt.figure(figsize=(15,5))
sns.boxplot(x='Total_Stops',y='Price',data=train_data.sort_values('Price',ascending=False))


Performing one-hot encoding on the Airline column, as our ML model doesn’t understand string values.

Airline=pd.get_dummies(categorical['Airline'],drop_first=True)

Overview of the Airline columns after dummification.

Airline.head()


Source column value counts.

categorical['Source'].value_counts()


Plotting the distribution of Source with respect to Price.

plt.figure(figsize=(15,5))
sns.boxplot(x='Source',y='Price',data=train_data.sort_values('Price',ascending=False))


Dummifying the Source column.

Source=pd.get_dummies(categorical['Source'],drop_first=True)
Source.head()


Value counts of Destination.

categorical['Destination'].value_counts()


Plotting the distribution of Destination with respect to Price.

plt.figure(figsize=(15,5))
sns.boxplot(x='Destination',y='Price',data=train_data.sort_values('Price',ascending=False))


Dummifying the Destination column.

Destination=pd.get_dummies(categorical['Destination'],drop_first=True)
Destination.head()


How to Perform Label Encoding on the dataset

For label encoding, we access Route and split it into one column per leg of the journey.

categorical['Route_1']=categorical['Route'].str.split('→').str[0]
categorical['Route_2']=categorical['Route'].str.split('→').str[1]
categorical['Route_3']=categorical['Route'].str.split('→').str[2]
categorical['Route_4']=categorical['Route'].str.split('→').str[3]
categorical['Route_5']=categorical['Route'].str.split('→').str[4]

Here you can see all 5 routes have been added.

categorical.head()


Dropping Route and checking null values in the categorical data frame.

drop_column(categorical,'Route')
categorical.isnull().sum()


Filling the null values in Route_3, Route_4, and Route_5 with the string 'None' and updating the data frame.

for i in ['Route_3','Route_4','Route_5']:
    categorical[i].fillna('None',inplace=True)

Cross-checking that no missing values remain.

categorical.isnull().sum()


Printing the number of categories in each column.

for feature in categorical.columns:
    print('{} has total {} categories \n'.format(feature,len(categorical[feature].value_counts())))


As we can see, Route has lots of categories, so one-hot encoding would not be a good option; that’s why we apply a label encoder to the Route columns. For that, we import the LabelEncoder class.

from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
categorical.columns


for i in ['Route_1', 'Route_2', 'Route_3', 'Route_4','Route_5']:
    categorical[i]=encoder.fit_transform(categorical[i])

Overview of the dataset.

categorical.head()


Additional_Info contains almost 80% 'No info' values, so we can drop this column.
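
We can verify that claim before dropping (a quick check, not in the original notebook):

categorical['Additional_Info'].value_counts(normalize=True)   #share of each category; 'No info' dominates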

drop_column(categorical,'Additional_Info')

Checking value counts of Total_Stops column.

categorical['Total_Stops'].value_counts()

As Total_Stops is an ordinal categorical feature, we encode it by mapping each value to its corresponding number of stops.

stops_mapping={'non-stop':0, '1 stop':1, '2 stops':2, '3 stops':3, '4 stops':4}   #avoid shadowing the built-in dict
categorical['Total_Stops']=categorical['Total_Stops'].map(stops_mapping)
categorical.head()


Concatenating the categorical, Airline, Source, and Destination data frames with the continuous columns of train_data.

data_train=pd.concat([categorical,Airline,Source,Destination,train_data[cont_col]],axis=1)
data_train.head()


Dropping the original Airline, Source, and Destination columns, as they are now one-hot encoded.

drop_column(data_train,'Airline')
drop_column(data_train,'Source')
drop_column(data_train,'Destination')
data_train.head()


Raising the display limit so all columns are shown.

pd.set_option('display.max_columns',35)
data_train.head()


Outliers Detection in Data

This function takes a data frame and a column as input and draws a distribution plot and a boxplot of that column.

def plot(df,col):
    fig,(ax1,ax2)=plt.subplots(2,1)
    sns.distplot(df[col],ax=ax1)
    sns.boxplot(df[col],ax=ax2)
plt.figure(figsize=(30,20))
plot(data_train,'Price')

Here we deal with the outliers by replacing every Price of 40000 or more with the median price.

data_train['Price']=np.where(data_train['Price']>=40000,data_train['Price'].median(),data_train['Price'])
plt.figure(figsize=(30,20))
plot(data_train,'Price')
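
The 40000 cutoff above was chosen by inspecting the plot; as a sketch (not part of the original notebook), the fence could instead be derived from the interquartile range:

q1=data_train['Price'].quantile(0.25)
q3=data_train['Price'].quantile(0.75)
upper_fence=q3+1.5*(q3-q1)     #classic boxplot whisker rule
data_train['Price']=np.where(data_train['Price']>upper_fence,data_train['Price'].median(),data_train['Price'])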

We separate the independent and dependent features into the X and y variables.

X=data_train.drop('Price',axis=1)
y=data_train['Price']

Select best Features using Feature Selection Technique

Finding the features that contribute most and have a strong relationship with the target variable. Why apply feature selection? To keep only the important features and escape the curse of dimensionality, i.e., to get rid of redundant features.

We compute mutual information scores to understand the relationship between each feature and the target.

Feature Selection using Information Gain.

from sklearn.feature_selection import mutual_info_regression   #Price is continuous, so we use the regression variant
X.dtypes
mutual_info_regression(X,y)


Wrapping the scores in a data frame, naming the column 'importance', and sorting by it in descending order.

imp=pd.DataFrame(mutual_info_regression(X,y),index=X.columns)
imp.columns=['importance']
imp.sort_values(by='importance',ascending=False)

Apply Random Forest on Data & Automate your predictions

We split our data into train and test sets so that we can train our model and evaluate it on unseen data.

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
from sklearn import metrics
##dump your model using pickle so that we can re-use it
import pickle
def predict(ml_model,dump):
    model=ml_model.fit(X_train,y_train)
    print('Training score : {}'.format(model.score(X_train,y_train)))
    y_prediction=model.predict(X_test)
    print('predictions are: \n {}'.format(y_prediction))
    print('\n')
    r2_score=metrics.r2_score(y_test,y_prediction)
    print('r2 score: {}'.format(r2_score))
    print('MAE:',metrics.mean_absolute_error(y_test,y_prediction))
    print('MSE:',metrics.mean_squared_error(y_test,y_prediction))
    print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test,y_prediction)))
    sns.distplot(y_test-y_prediction)
    if dump==1:
        with open('rf_model.pkl','wb') as f:   #illustrative file name
            pickle.dump(model,f)               #dump the fitted model for re-use

Importing the random forest regressor and running the pipeline; passing dump=1 saves the fitted model.

from sklearn.ensemble import RandomForestRegressor
predict(RandomForestRegressor(),1)

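Because predict dumps the fitted model when dump=1 (using the illustrative file name 'rf_model.pkl' above), we can reload it later and reuse it without retraining:

with open('rf_model.pkl','rb') as f:
    loaded_model=pickle.load(f)        #reload the dumped model
loaded_model.score(X_test,y_test)      #R^2 of the reloaded model on the test set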

Play with multiple Algorithms & dump your model

Here, we apply several supervised algorithms to compare their accuracy.

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
predict(DecisionTreeRegressor(),0)
predict(LinearRegression(),0)
predict(KNeighborsRegressor(),0)

How to Cross Validate your model

We hyper-tuned our model. For this, we took the following steps:

1. Choose a method for hyperparameter tuning:

     a. RandomizedSearchCV --> fast way to hypertune the model

     b. GridSearchCV --> slow way to hypertune the model

2. Assign hyperparameters in the form of a dictionary

3. Fit the model

4. Check the best parameters and the best score

from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators=[int(x) for x in np.linspace(start=100,stop=1200,num=6)]

# Number of features to consider at every split
max_features=[1.0,'sqrt']   #1.0 means all features; replaces the deprecated 'auto'

# Maximum number of levels in tree
max_depth=[int(x) for x in np.linspace(5,30,num=4)]

# Minimum number of samples required to split a node
min_samples_split=[5,10,15,100]

Creating the random grid.

random_grid={
    'n_estimators':n_estimators,
    'max_features':max_features,
    'max_depth':max_depth,
    'min_samples_split':min_samples_split
}
random_grid

A random search over the parameters, using 3-fold cross-validation.

reg_rf=RandomForestRegressor()    #base estimator for the search
rf_random=RandomizedSearchCV(estimator=reg_rf,param_distributions=random_grid,cv=3,verbose=2,n_jobs=-1)

Fitting on the training data, then checking the best parameters found.

rf_random.fit(X_train,y_train)
rf_random.best_params_

Predicting on the X_test dataset.

prediction=rf_random.predict(X_test)

Visualizing the residuals (actual minus predicted prices).

sns.distplot(y_test-prediction)


Evaluation metrics of the tuned model.

metrics.r2_score(y_test,prediction)
metrics.mean_absolute_error(y_test,prediction)
metrics.mean_squared_error(y_test,prediction)
np.sqrt(metrics.mean_squared_error(y_test,prediction))
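
The tuned model can also be persisted and reused the same way (a sketch with an illustrative file name):

with open('rf_tuned.pkl','wb') as f:
    pickle.dump(rf_random.best_estimator_,f)   #persist the best estimator found by the search
with open('rf_tuned.pkl','rb') as f:
    tuned_model=pickle.load(f)
tuned_model.predict(X_test[:5])                #predictions for the first five test rows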


So, this is our model accuracy. Thank you for reading this article.