Machine Learning Project: Airline Tickets Price Prediction
This project is all about predicting airline ticket prices. The dataset contains columns such as Departure time, Arrival time, Date of journey, and Duration. Through data preprocessing, exploratory analysis, feature selection, and related techniques we build a machine learning model, and at the end we apply several ML algorithms to find the one with the best accuracy.
Understand The Data
Dealing with missing values
- For data manipulation, numerical computation, and visualization we import the pandas, NumPy, seaborn, and matplotlib libraries.
- Reading the data and saving it into the train_data variable.
- Previewing the data by calling the head function.
```python
import pandas as pd               # pandas for data manipulation and analysis
import numpy as np                # NumPy for numerical computation
import seaborn as sns             # seaborn for visualization
import matplotlib.pyplot as plt   # matplotlib for additional visualization tasks

train_data = pd.read_excel('Data_Train.xlsx')  # reading the data
train_data.head()                              # previewing the data
```
The shape of the entire data frame.
train_data.shape
Getting the number of missing values per column by chaining the isna() and sum() functions.
train_data.isna().sum()
Dropping rows with missing values using the dropna() function and updating the data frame in place by setting inplace=True.
train_data.dropna(inplace=True)
Here we cross-check whether the missing values were actually removed. The output shows there are no missing values left.
train_data.isna().sum()
Cleaning data for analysis and modeling purposes
Checking the data type of every column in the data.
train_data.dtypes
This helper converts a column to datetime format. It takes the column name col as a parameter; pd.to_datetime() converts the column, and the result is assigned back to train_data[col].
```python
def change_into_datetime(col):
    train_data[col] = pd.to_datetime(train_data[col])
```
All columns of train_data in the form of a list.
train_data.columns
Looping over the 'Date_of_Journey', 'Dep_Time', and 'Arrival_Time' columns and passing each one into the change_into_datetime function to convert it from object to datetime.
```python
for i in ['Date_of_Journey', 'Dep_Time', 'Arrival_Time']:
    change_into_datetime(i)
```
Checking train_data's column data types to confirm the conversion to datetime.
train_data.dtypes
Here we split the 'Date_of_Journey' column into day and month, because the machine learning model cannot tell which part of a date string is the day and which is the month. The extracted values are assigned to new columns named journey_day and journey_month.
```python
train_data['journey_day'] = train_data['Date_of_Journey'].dt.day
train_data['journey_month'] = train_data['Date_of_Journey'].dt.month
```
Calling the head function to get a rough idea of our 'train_data' data frame.
train_data.head()
Now we drop the 'Date_of_Journey' column, since we have fetched everything we need from it. Setting inplace=True updates the data frame, and axis=1 tells pandas to drop a column rather than a row.
train_data.drop('Date_of_Journey',axis=1,inplace=True)
You can see there is no such column named ‘Date_of_Journey’.
train_data.head()
How to extract Derived features from data
We create helper functions to extract the hour and the minute from a datetime column, plus a drop function to remove a column once we have fetched what we need from it. Each extractor takes a data frame and a column name and saves the result into a new column named after the original; drop_column updates the data frame in place with inplace=True.
```python
def extract_hour(df, col):
    df[col + '_hour'] = df[col].dt.hour

def extract_min(df, col):
    df[col + '_minute'] = df[col].dt.minute

def drop_column(df, col):
    df.drop(col, axis=1, inplace=True)
```
Extracting the hour and minute from the Dep_Time column by calling the extract_hour and extract_min functions, then dropping Dep_Time from the table by calling the drop_column function.
```python
extract_hour(train_data, 'Dep_Time')
extract_min(train_data, 'Dep_Time')
drop_column(train_data, 'Dep_Time')
```
You can see the Dep_Time column is gone, as we dropped it, and two new columns named 'Dep_Time_hour' and 'Dep_Time_minute' have been added.
train_data.head()
Now we extract the hour and minute from the Arrival_Time column and drop it by calling the same functions.
```python
extract_hour(train_data, 'Arrival_Time')
extract_min(train_data, 'Arrival_Time')
drop_column(train_data, 'Arrival_Time')
```
The Arrival_Time column is dropped and two new columns are added, Arrival_Time_hour and Arrival_Time_minute.
train_data.head()
In the Duration column, some values are missing the minutes part or the hours part. We fix this by appending ' 0m' after an hour-only value and prepending '0h ' before a minute-only value.
```python
duration = list(train_data['Duration'])
for i in range(len(duration)):
    if len(duration[i].split(' ')) == 2:
        pass  # both hours and minutes are present
    elif 'h' in duration[i]:
        duration[i] = duration[i] + ' 0m'   # only hours given, append zero minutes
    else:
        duration[i] = '0h ' + duration[i]   # only minutes given, prepend zero hours
```
Saving the cleaned values back into the Duration column.
train_data['Duration']=duration
We can see the Duration column has been updated.
train_data.head()
Perform Data Pre-processing
We split the Duration column into Duration_hours and Duration_mins so the machine learning model can understand the feature; the original Duration column is dropped afterwards.
```python
def hour(x):
    return x.split(' ')[0][0:-1]    # '2h 50m' -> '2'

def minute(x):
    return x.split(' ')[1][0:-1]    # '2h 50m' -> '50'
```
Here we apply the hour and minute functions to the Duration column and save the results into new columns named 'Duration_hours' and 'Duration_mins'.
```python
train_data['Duration_hours'] = train_data['Duration'].apply(hour)
train_data['Duration_mins'] = train_data['Duration'].apply(minute)
```
Calling head to see the changes
train_data.head()
Removing the 'Duration' column by calling the drop_column function, as it is no longer needed.
drop_column(train_data,'Duration')
Checking all the data columns after the operation.
train_data.head()
The data types of the train_data set.
train_data.dtypes
Converting 'Duration_hours' and 'Duration_mins' from object type to int type, and updating the data frame.
```python
train_data['Duration_hours'] = train_data['Duration_hours'].astype(int)
train_data['Duration_mins'] = train_data['Duration_mins'].astype(int)
```
Cross-checked dtype.
train_data.dtypes
To separate categorical from continuous features, we loop through every column of the dataset; if a column's data type is object, we treat it as a categorical column.
```python
cat_col = [col for col in train_data.columns if train_data[col].dtype == 'O']
cat_col
```
Fetching continuous features.
```python
cont_col = [col for col in train_data.columns if train_data[col].dtype != 'O']
cont_col
```
Handle Categorical Data & Feature Encoding
Categorical data comes in two types: nominal and ordinal. Nominal data has no inherent order (for example, country names), while ordinal data carries a natural ranking (for example, number of stops). We handle nominal features with one-hot encoding and ordinal features with label/ordinal encoding, as the toy example below illustrates.
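To make the distinction concrete, here is a minimal sketch on toy data (the values below are purely illustrative):

```python
import pandas as pd

# Nominal: no inherent order, so one-hot encode into separate 0/1 columns
cities = pd.Series(['Delhi', 'Kolkata', 'Delhi'])
print(pd.get_dummies(cities, drop_first=True))

# Ordinal: a natural ranking, so map categories to ordered integers
stops = pd.Series(['non-stop', '1 stop', '2 stops'])
print(stops.map({'non-stop': 0, '1 stop': 1, '2 stops': 2}))
```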
Selecting the cat_col columns from train_data to collect all categorical data, and saving them into a new data frame.
```python
categorical = train_data[cat_col]
categorical
```
Now we access the Airline column and count the occurrences of each airline.
categorical['Airline'].value_counts()
Plotting Price against Airline as a boxplot, with train_data sorted by Price in descending order.
```python
plt.figure(figsize=(15, 5))
sns.boxplot(x='Airline', y='Price', data=train_data.sort_values('Price', ascending=False))
```
Plotting Total_Stops against Price in the same way.
```python
plt.figure(figsize=(15, 5))
sns.boxplot(x='Total_Stops', y='Price', data=train_data.sort_values('Price', ascending=False))
```
Performing one-hot encoding on the Airline column, since our ML model cannot understand string values.
Airline=pd.get_dummies(categorical['Airline'],drop_first=True)
Overview of the Airline columns after dummification.
Airline.head()
Source column value count.
categorical['Source'].value_counts()
Plotting the distribution of Source with respect to Price.
```python
plt.figure(figsize=(15, 5))
sns.boxplot(x='Source', y='Price', data=train_data.sort_values('Price', ascending=False))
```
Dummifying Source column.
```python
Source = pd.get_dummies(categorical['Source'], drop_first=True)
Source.head()
```
Value counts of Destination.
categorical['Destination'].value_counts()
Extracting distribution of Destination with respect to Price.
```python
plt.figure(figsize=(15, 5))
sns.boxplot(x='Destination', y='Price', data=train_data.sort_values('Price', ascending=False))
```
Dummifying Destination column.
```python
Destination = pd.get_dummies(categorical['Destination'], drop_first=True)
Destination.head()
```
How to Perform Label Encoding on the dataset
For label encoding we work on the Route column, splitting each route string on the arrow separator into its individual stops, Route_1 through Route_5.
```python
categorical['Route_1'] = categorical['Route'].str.split('→').str[0]
categorical['Route_2'] = categorical['Route'].str.split('→').str[1]
categorical['Route_3'] = categorical['Route'].str.split('→').str[2]
categorical['Route_4'] = categorical['Route'].str.split('→').str[3]
categorical['Route_5'] = categorical['Route'].str.split('→').str[4]
```
Here you can see all 5 routes have been added.
categorical.head()
Dropping Route and checking null values in the categorical data frame.
```python
drop_column(categorical, 'Route')
categorical.isnull().sum()
```
Looping over Route_3, Route_4, and Route_5 and filling their null values with the placeholder 'None', updating the data frame in place.
```python
for i in ['Route_3', 'Route_4', 'Route_5']:
    categorical[i].fillna('None', inplace=True)
```
Cross-checking: no nulls remain in the data frame.
categorical.isnull().sum()
Printing the number of categories in each column.
```python
for feature in categorical.columns:
    print('{} has total {} categories \n'.format(feature, len(categorical[feature].value_counts())))
```
As we can see, Route has a large number of categories, so one-hot encoding would not be a good option here; instead we apply a label encoder to the Route columns. For that, we import the LabelEncoder class.
```python
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
categorical.columns
```
```python
for i in ['Route_1', 'Route_2', 'Route_3', 'Route_4', 'Route_5']:
    categorical[i] = encoder.fit_transform(categorical[i])
```
Overview of the dataset.
categorical.head()
Additional_Info contains almost 80% no_info values, so we can drop this column.
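A quick sanity check before dropping, to confirm the claim (the exact label spelling, e.g. 'No info', comes from the dataset itself):

```python
# Share of each Additional_Info category; one 'no info' label should dominate
categorical['Additional_Info'].value_counts(normalize=True)
```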
drop_column(categorical,'Additional_Info')
Checking value counts of Total_Stops column.
categorical['Total_Stops'].value_counts()
As Total_Stops is an ordinal categorical feature, we encode it by mapping each value to a corresponding integer that preserves the order.
```python
stops_map = {'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4}  # named to avoid shadowing the built-in dict
categorical['Total_Stops'] = categorical['Total_Stops'].map(stops_map)
categorical.head()
```
Concatenating the categorical, Airline, Source, and Destination data frames together with the continuous columns of train_data.
```python
data_train = pd.concat([categorical, Airline, Source, Destination, train_data[cont_col]], axis=1)
data_train.head()
```
Dropping the original Airline, Source, and Destination columns, as their dummies are already in place.
```python
drop_column(data_train, 'Airline')
drop_column(data_train, 'Source')
drop_column(data_train, 'Destination')
data_train.head()
```
Raising the display limit so that all columns are shown.
```python
pd.set_option('display.max_columns', 35)
data_train.head()
```
Outlier Detection in Data
This function takes a data frame and a column name as input and draws a distribution plot and a boxplot of that column, which makes outliers in Price easy to spot.
```python
def plot(df, col):
    fig, (ax1, ax2) = plt.subplots(2, 1)
    sns.distplot(df[col], ax=ax1)
    sns.boxplot(df[col], ax=ax2)

plt.figure(figsize=(30, 20))
plot(data_train, 'Price')
```
Here we deal with the outliers by replacing every Price value of 40,000 or more with the median price.
```python
data_train['Price'] = np.where(data_train['Price'] >= 40000, data_train['Price'].median(), data_train['Price'])
plt.figure(figsize=(30, 20))
plot(data_train, 'Price')
```
We separate the independent features and the dependent target into the X and y variables.
```python
X = data_train.drop('Price', axis=1)
y = data_train['Price']
```
Select the Best Features using a Feature Selection Technique
Finding the features that contribute most and have a strong relationship with the target variable. Why apply feature selection? To keep only the important features and escape the curse of dimensionality, i.e. to get rid of redundant features.
I wanted to compute mutual information scores to understand the relationship between each feature and the target.
Feature selection using information gain (mutual information). Since Price is a continuous target, we use mutual_info_regression rather than mutual_info_classif.
```python
from sklearn.feature_selection import mutual_info_regression  # Price is continuous, so the regression variant applies
X.dtypes
mutual_info_regression(X, y)
```
Wrapping the scores in a data frame and sorting by the importance column.
```python
imp = pd.DataFrame(mutual_info_regression(X, y), index=X.columns)
imp.columns = ['importance']
imp.sort_values(by='importance', ascending=False)
```
Apply Random Forest on Data & Automate your predictions
We split the data into train and test sets so that we can train and evaluate our model. The predict helper below fits any model passed to it, prints the training score, the predictions, and the evaluation metrics, plots the residual distribution, and dumps the fitted model with pickle when dump is set to 1.
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

from sklearn import metrics
import pickle  # dump the model with pickle so we can re-use it later

def predict(ml_model, dump):
    model = ml_model.fit(X_train, y_train)
    print('Training score : {}'.format(model.score(X_train, y_train)))
    y_prediction = model.predict(X_test)
    print('predictions are: \n {}'.format(y_prediction))
    print('\n')
    r2_score = metrics.r2_score(y_test, y_prediction)
    print('r2 score: {}'.format(r2_score))
    print('MAE:', metrics.mean_absolute_error(y_test, y_prediction))
    print('MSE:', metrics.mean_squared_error(y_test, y_prediction))
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_prediction)))
    sns.distplot(y_test - y_prediction)
    if dump == 1:
        # persist the fitted model for later re-use (the file name is our choice)
        with open('model.pkl', 'wb') as f:
            pickle.dump(model, f)
```
Importing the random forest class and running it through the predict helper, with dump=1 so the fitted model is saved.
```python
from sklearn.ensemble import RandomForestRegressor
predict(RandomForestRegressor(), 1)
```
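Once dumped, the model can be re-loaded without retraining. A minimal sketch, assuming the model.pkl file name used in the predict helper above:

```python
import pickle

# Load the previously dumped model and predict on a few held-out rows
with open('model.pkl', 'rb') as f:
    saved_model = pickle.load(f)
print(saved_model.predict(X_test[:5]))
```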
Play with multiple Algorithms & dump your model
Here we apply several supervised algorithms to see which gives the best accuracy.
```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

predict(DecisionTreeRegressor(), 0)
predict(LinearRegression(), 0)
predict(KNeighborsRegressor(), 0)
```
How to Cross Validate your model
We hyper-tune our model. For this, we take the following steps:
1. Choose a method for hyperparameter tuning:
   a. RandomizedSearchCV --> fast way to hypertune the model
   b. GridSearchCV --> slow, exhaustive way to hypertune the model (see the sketch after the randomized search below)
2. Assign the hyperparameters in the form of a dictionary
3. Fit the model
4. Check the best parameters and the best score
```python
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in the random forest
n_estimators = [int(x) for x in np.linspace(start=100, stop=1200, num=6)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in a tree
max_depth = [int(x) for x in np.linspace(5, 30, num=4)]
# Minimum number of samples required to split a node
min_samples_split = [5, 10, 15, 100]
```
Create the random grid
```python
random_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split
}
random_grid
```
A random search over the parameter grid, using 3-fold cross-validation. Note that the base estimator reg_rf must be defined first.
```python
reg_rf = RandomForestRegressor()  # base estimator to tune
rf_random = RandomizedSearchCV(estimator=reg_rf, param_distributions=random_grid, cv=3, verbose=2, n_jobs=-1)
```
Fitting our X_train, y_train dataset
```python
rf_random.fit(X_train, y_train)
rf_random.best_params_
```
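For comparison, here is a minimal sketch of the GridSearchCV route from step 1b. The reduced grid below is our own choice, since an exhaustive search over the full random grid would be slow:

```python
from sklearn.model_selection import GridSearchCV

# GridSearchCV tries every combination, so keep the grid small
param_grid = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [5, 15, 30],
}
rf_grid = GridSearchCV(estimator=RandomForestRegressor(), param_grid=param_grid, cv=3, n_jobs=-1)
rf_grid.fit(X_train, y_train)
print(rf_grid.best_params_, rf_grid.best_score_)
```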
Predicting our X_test dataset.
prediction=rf_random.predict(X_test)
Visualizing the residuals (y_test minus prediction).
sns.distplot(y_test-prediction)
Final prediction metrics of the tuned model.
```python
print('r2 score:', metrics.r2_score(y_test, prediction))
print('MAE:', metrics.mean_absolute_error(y_test, prediction))
print('MSE:', metrics.mean_squared_error(y_test, prediction))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
```
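If you also want to keep the tuned model for later use, here is a sketch along the lines of the earlier pickle dump (the file name is again our choice):

```python
import pickle

# Persist the best estimator found by the randomized search
with open('rf_random_best.pkl', 'wb') as f:
    pickle.dump(rf_random.best_estimator_, f)
```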
So, this is our final model's accuracy. Thank you for reading this article.