Machine Learning Project: Hotel Booking Prediction [Part 2]
This is the second part of our Hotel Booking Prediction project. Throughout this tutorial, we will discuss handling outliers, feature selection, applying supervised algorithms, and cross-validation.
How to Handle Outliers
Outliers are data points that lie far away from the rest of the data. For example, if you have records of 100 people whose ages range from 1 to 100, but one record lists an age of 700 years, that record is an outlier. Many models are badly impacted by outliers, so we handle them before training.
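Before plotting, a quick numeric check can reveal whether a column has outliers. A common rule of thumb is the IQR rule; here is a minimal sketch, assuming dataframe is the hotel-booking data loaded in Part 1 (the 1.5 multiplier is just the conventional choice):
q1=dataframe['lead_time'].quantile(0.25) #first quartile
q3=dataframe['lead_time'].quantile(0.75) #third quartile
iqr=q3-q1 #interquartile range
upper=q3+1.5*iqr #rule-of-thumb cutoff for high outliers
print((dataframe['lead_time']>upper).sum()) #how many rows lie above the cutoff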
import seaborn as sns #seaborn, used for plotting as in Part 1
sns.distplot(dataframe['lead_time']) #making a distribution plot of lead_time
import numpy as np
def handle_outlier(col): #log-transforming a column to reduce its skewness
    dataframe[col]=np.log1p(dataframe[col])
handle_outlier('lead_time') #calling the function on lead_time
sns.distplot(dataframe['lead_time']) #distribution plot of the log-transformed lead_time
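We use np.log1p, which computes log(1 + x), because lead_time can be 0 and a plain log of 0 is undefined. If you ever need values back on the original scale, np.expm1 is the inverse:
np.expm1(dataframe['lead_time']).head() #undoing the log1p transform, just as an illustrative check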
We handle the outliers in our price feature, 'adr', the same way.
sns.distplot(dataframe['adr']) #distribution plot of adr
handle_outlier('adr') #handling outliers in adr
sns.distplot(dataframe['adr'].dropna()) #distribution plot of adr; dropna() only skips the missing value for plotting
Applying Techniques of Feature Importance
Here we apply feature-importance techniques to select the most useful features, since the dataset has a large number of columns. We will use Lasso regression, whose L1 penalty shrinks the coefficients of unimportant features to exactly zero, so training only on the surviving features gives us a simpler and often better model.
First, we check for null values. The result shows there is only one missing value, in 'adr'.
dataframe.isnull().sum() #counting the null values in each column
dataframe.dropna(inplace=True) #dropping the row with the missing value and updating dataframe
y=dataframe['is_canceled'] #the dependent feature we want to predict -> is_canceled
x=dataframe.drop('is_canceled',axis=1) #independent features: everything except is_canceled
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel #for selecting important features
Alpha is the penalty parameter: the bigger the value of alpha, the fewer features get selected.
feature_sel_model=SelectFromModel(Lasso(alpha=0.005,random_state=0)) #specifying a Lasso regression model with a low alpha and random_state=0
feature_sel_model.fit(x,y) #fitting the data
feature_sel_model.get_support() #boolean mask: True for every selected feature
cols=x.columns #all the columns
selected_feat=cols[feature_sel_model.get_support()] #keeping the columns the mask marks True
print('total features {}'.format(x.shape[1])) #printing the total number of features
print('Selected features {}'.format(len(selected_feat))) #printing the number of selected features
selected_feat #printing the selected features
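Since alpha drives the whole selection, it helps to see how the number of surviving features shrinks as alpha grows. A quick illustrative sweep (the alpha values are arbitrary):
for a in [0.001,0.005,0.05,0.5]: #arbitrary alphas, from small to large
    sel=SelectFromModel(Lasso(alpha=a,random_state=0)).fit(x,y) #refitting the selector for each alpha
    print(a,sel.get_support().sum()) #a bigger alpha keeps fewer features
To actually train on only the selected features, we can reduce x to those columns before splitting; this one-liner is our own addition, assuming that is the point of the selection above:
x=x[selected_feat] #keeping only the features Lasso selected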
Applying Logistic Regression on Data and Cross-Validating it
Logistic regression is a supervised statistical model used for classification. We will apply it to our data and then cross-validate the result.
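Under the hood, logistic regression turns a weighted sum of the features into a probability between 0 and 1 with the sigmoid function. A tiny illustration with made-up numbers:
z=np.array([-2.0,0.0,2.0]) #made-up weighted sums of features
print(1/(1+np.exp(-z))) #the sigmoid squashes them to probabilities: roughly 0.12, 0.5, 0.88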
from sklearn.model_selection import train_test_split #for splitting data into train and test sets
X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=0) #keeping 25% of the data for testing
from sklearn.linear_model import LogisticRegression #importing logistic regression
logreg=LogisticRegression() #instantiating the LogisticRegression class
logreg.fit(X_train,y_train) #fitting the training data
y_pred=logreg.predict(X_test) #predicting on the test data
y_pred #printing the prediction array
from sklearn.metrics import confusion_matrix #importing confusion matrix
confusion_matrix(y_test,y_pred) #confusion matrix of the logistic regression model
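For binary labels 0/1, scikit-learn lays the confusion matrix out as [[TN, FP], [FN, TP]], with rows for actual classes and columns for predicted classes. The counts can be unpacked by name:
tn,fp,fn,tp=confusion_matrix(y_test,y_pred).ravel() #flattening the 2x2 matrix into named counts
print(tn,fp,fn,tp) #true negatives, false positives, false negatives, true positives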
from sklearn.metrics import accuracy_score #importing accuracy_score to check accuracy
accuracy_score(y_test,y_pred) #accuracy of the predictions on the test set
from sklearn.model_selection import cross_val_score #importing cross-validation
score=cross_val_score(logreg,x,y,cv=10) #10-fold cross-validation: the data is split into 10 folds, giving 10 accuracy scores
score.mean() #averaging the 10 folds gives a more reliable score
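The spread across the folds is worth a glance too: if the standard deviation is small, the mean accuracy is stable. A quick check:
print(score) #the 10 individual fold scores
print(score.std()) #a small spread means the mean is trustworthy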
Applying Multiple Algorithms on Data
Here we apply several supervised algorithms, such as naive Bayes, decision tree, random forest, KNN, and logistic regression, and compare their accuracy to see which performs best.
from sklearn.linear_model import LogisticRegression #importing LogisticRegression from linear_model
from sklearn.neighbors import KNeighborsClassifier #importing the KNN algorithm
from sklearn.ensemble import RandomForestClassifier #importing random forest from ensemble
from sklearn.tree import DecisionTreeClassifier #importing the decision tree classifier
from sklearn.naive_bayes import GaussianNB #importing GaussianNB for naive Bayes
from sklearn.metrics import confusion_matrix,accuracy_score #metrics for evaluating each model
models=[] #this blank list is for appending all the algorithms
models.append(('LogisticRegression',LogisticRegression())) #appending logistic regression and initializing it
models.append(('Naive Bayes',GaussianNB())) #appending naive Bayes and initializing it
models.append(('RandomForest',RandomForestClassifier())) #appending random forest and initializing it
models.append(('Decision Tree',DecisionTreeClassifier())) #appending decision tree and initializing it
models.append(('KNN',KNeighborsClassifier())) #appending KNN and initializing it
for name,model in models: #iterating over the models
    print(name) #printing the name of the model
    model.fit(X_train,y_train) #fitting the training set
    predictions=model.predict(X_test) #predicting on the test set
    print(confusion_matrix(y_test,predictions)) #printing the confusion matrix
    print('\n')
    print(accuracy_score(y_test,predictions)) #printing the accuracy score
    print('\n')
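To compare the models at a glance, the accuracies can be collected in a dictionary and sorted at the end. A small optional sketch (the results dictionary is our own addition):
results={} #collecting each model's accuracy by name
for name,model in models:
    model.fit(X_train,y_train)
    results[name]=accuracy_score(y_test,model.predict(X_test))
for name,acc in sorted(results.items(),key=lambda kv:kv[1],reverse=True): #best model first
    print(name,round(acc,4))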
So, these are the accuracies of our models. Thank you for reading this article.