Build A Book Recommender System With TF-IDF And Clustering (Python)

Have you ever wondered how books are grouped and recommended by similarity? This project builds a book clustering and recommendation system. It analyzes book metadata and identifies patterns using machine learning techniques such as TF-IDF and clustering. It covers the full process, from building visualizations to suggesting similar books to readers.

Project Overview

This project turns raw book data into usable, interpretable results. First, the dataset is cleaned and preprocessed to make it suitable for analysis. Next, we build a TF-IDF matrix that measures how relevant each word is to describing a book. Then K-means and hierarchical clustering organize the books into meaningful clusters.

There is more! We also build a book recommendation system. It uses cosine similarity to recommend similar books, helping readers find their next favorite. The findings are presented in tables and grids.

And we do not stop there. Interactive visuals such as treemaps and dendrograms make the structure of the dataset easier to understand. Whether you are a book lover or a data enthusiast, this project combines data science and machine learning in one workflow.

Prerequisites

You will need a few skills and tools before starting this project:

  • Python 3.7 or higher installed on your system.
  • Basic knowledge of Python for data analysis and manipulation.
  • An understanding of clustering methods such as KMeans and hierarchical clustering with dendrograms.
  • Familiarity with text-processing concepts such as TF-IDF, tokens, and stop words and their removal.
  • Experience using Seaborn, Plotly, and WordCloud for visualization.
  • A dataset in CSV format containing book information such as title, genre, and rating.

Approach

The project begins with a book dataset that provides titles, descriptions, ratings, and related data. The first step is data cleaning and preprocessing, which makes the data complete and consistent by filling in missing values and removing duplicates. Then text features such as description and category are vectorized using TF-IDF, producing a sparse matrix that indicates the importance of each word in each book.

Next, we group similar books using KMeans and hierarchical clustering so that each cluster is coherent. These clusters help reveal trends and classify books that share characteristics. For recommendations, we use cosine similarity to find the books most closely related to a given title.
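
As a rough illustration of the KMeans step, the sketch below clusters four invented descriptions. The choice of two clusters and the `random_state` value are assumptions made for this toy data; the real dataset would need its own tuning, for example by comparing silhouette scores.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy descriptions: two fantasy books, two crime books.
descriptions = [
    "wizard magic dragon kingdom",
    "magic wizard spell kingdom",
    "detective murder police london",
    "murder detective crime police",
]

tfidf_matrix = TfidfVectorizer().fit_transform(descriptions)

# n_clusters=2 is an assumption for this toy data; the real value needs tuning.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(tfidf_matrix)

print(labels)                                  # one cluster id per book
print(silhouette_score(tfidf_matrix, labels))  # closer to 1 means tighter clusters
```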

Throughout the process, interactive elements such as treemaps, dendrograms, and grid views support understanding and improve the user experience. Clustered data and recommendations are presented in visually appealing interfaces, which makes the project interactive as well as informative.
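
A dendrogram is drawn from a linkage matrix produced by hierarchical clustering. The sketch below shows that relationship on invented 2-D points; the feature values, the book labels, and the Ward linkage method are assumptions chosen for illustration only.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Hypothetical 2-D feature vectors for five books (stand-ins for TF-IDF rows).
features = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [5.1, 5.0],
    [5.0, 5.2],
])
titles = ["Book A", "Book B", "Book C", "Book D", "Book E"]

# Ward linkage merges, at each step, the pair of clusters whose merge
# least increases the within-cluster variance.
linkage_matrix = sch.linkage(features, method="ward")

# no_plot=True returns the dendrogram layout without drawing anything.
layout = sch.dendrogram(linkage_matrix, labels=titles, no_plot=True)
print(layout["ivl"])  # leaf labels in the order they would be drawn

# Cut the tree into two flat clusters.
flat_labels = sch.fcluster(linkage_matrix, t=2, criterion="maxclust")
print(flat_labels)
```

Dropping `no_plot=True` in a notebook with Matplotlib available would draw the tree itself.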

Workflow and Methodology

  • Loading Data: The book data with relevant information such as book titles, descriptions, categories, and ratings is imported.
  • Data Cleaning: This involves addressing missing data, eliminating repeated entries, and normalizing the text.
  • Feature Extraction: With the help of TF-IDF, textual data is converted into a numerical version that indicates the significance of words in the context.
  • Clustering Algorithms: KMeans and hierarchical clustering are run to group books according to their attributes.
  • Recommendation Generation: Cosine similarity is applied to propose further readings associated with a particular title of interest.
  • Data Visualization: Interactive visuals such as treemaps and grid arrangements are created for the book clusters.
  • Insights Presentation: Create visuals that show cluster summaries and the ranking of top books in an appealing manner.
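
The recommendation step in the workflow above can be sketched as follows. The catalogue, the column names, and the `recommend` helper are hypothetical and only illustrate how cosine similarity over TF-IDF vectors can rank related titles.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue; the real dataset has more columns and rows.
books = pd.DataFrame({
    "title": ["Magic School", "Dragon Kingdom", "London Murders", "Crime Files"],
    "description": [
        "young wizard learns magic spells",
        "wizard rides dragon with magic",
        "detective investigates murder in london",
        "police detective solves crime and murder",
    ],
})

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(books["description"])
similarity = cosine_similarity(tfidf_matrix)  # square matrix, one row per book


def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title` (hypothetical helper)."""
    index = books.index[books["title"] == title][0]
    scores = pd.Series(similarity[index], index=books["title"]).drop(title)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()


print(recommend("Magic School", top_n=1))
```

The queried title is dropped from its own ranking because every book has a similarity of 1 with itself.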

Data Collection and Preparation

Data Collection:
In this project, we collected the dataset from a public repository. If you want to work on a real-world problem, you can find similar datasets in public repositories such as Kaggle or the UCI Machine Learning Repository, or use company-specific data. We provide the dataset with this project so that you can follow along with the same data.

Data Preparation Workflow:

  • Import the dataset of book names, summaries, genres, and ratings.
  • Address missing entries by either filling them in or dropping invalid rows.
  • Eliminate duplicate entries so that each book is represented once.
  • Preprocess the text: lowercasing, punctuation removal, and stop word removal.
  • Normalize the text and apply TF-IDF vectorization to extract relevant features.
  • Filter terms by frequency and relevance for clustering and analysis.
  • Shuffle the dataset so that the analysis does not depend on the original row order.
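
The cleaning steps listed above might look roughly like this in pandas. The column names, the tiny stop-word set, and the `clean_text` helper are assumptions made for illustration; the project itself uses NLTK's English stop-word list.

```python
import string

import pandas as pd

# A tiny illustrative stop-word set; the project uses NLTK's English list instead.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "is", "to"}


def clean_text(text):
    """Lowercase, strip punctuation, and drop stop words (hypothetical helper)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(token for token in text.split() if token not in STOP_WORDS)


# Hypothetical raw data with one duplicate row and one missing description.
raw = pd.DataFrame({
    "title": ["Dune", "Dune", "Emma", "Ulysses"],
    "description": [
        "A desert planet and the spice!",
        "A desert planet and the spice!",
        None,
        "A day in Dublin.",
    ],
})

prepared = (
    raw.drop_duplicates()                      # keep each book once
       .fillna({"description": ""})            # fill missing descriptions
       .assign(clean_description=lambda df: df["description"].map(clean_text))
       .sample(frac=1, random_state=42)        # shuffle rows reproducibly
       .reset_index(drop=True)
)
print(prepared)
```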

Code Explanation

STEP 1:

Mounting Google Drive

First, mount Google Drive to access the dataset that is stored in the cloud.

from google.colab import drive
drive.mount('/content/drive')

Library Installation and Resource Acquisition

  • Install the emoji and contractions libraries to handle emojis and text contractions, respectively.
  • Download the stopwords resource from NLTK to remove common English stop words from the text.
  • Import the stopwords corpus from NLTK and build the set of stop words used in text preparation.

!pip install emoji
!pip install contractions
# Download required nltk resources
import nltk
nltk.download('stopwords')
# Import the stopwords corpus
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

Importing Libraries for Text Processing, Visualization, and Clustering

This block imports the libraries needed for text preprocessing, data analysis, clustering, and visualization. It covers packages for cleaning text and expanding contractions, for KMeans and hierarchical clustering, and for building visualizations with word clouds, Seaborn, Plotly, and Matplotlib.

import re
import random
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from wordcloud import WordCloud
from emoji import demojize  # Ensure the 'emoji' library is installed
from text_unidecode import unidecode
from contractions import fix as expand_contractions
from nltk.corpus import stopwords
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets
from tabulate import tabulate
from matplotlib.colors import LinearSegmentedColormap

STEP 2:

Dataset Loading

Loads a dataset from the specified file path and displays the first 5 rows for preview.

# Load the dataset
file_path = '/content/drive/MyDrive/New 90 Projects/Project_7/books.csv'
original_df = pd.read_csv(file_path)
show_df = original_df.head(5)
show_df
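
Beyond previewing the first rows, it can help to check the size of the data and how many values are missing or duplicated before cleaning. The sketch below does this on an in-memory CSV; the column names and rows are invented and may not match the real books.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for books.csv; column names are assumptions.
csv_text = """title,description,categories,average_rating
Dune,A desert planet epic,Fiction,4.2
Emma,,Fiction,3.9
Dune,A desert planet epic,Fiction,4.2
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)               # (rows, columns)
print(sample_df.isna().sum())        # missing values per column
print(sample_df.duplicated().sum())  # number of fully duplicated rows
```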

Column Elimination and Visualization Preparation

Eliminates irrelevant columns, applies color themes, and creates a word cloud visualization of the titles of books.
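
A minimal sketch of this step, under stated assumptions: the column names treated as irrelevant are hypothetical, and a plain word-frequency table stands in for the word cloud, since a word cloud sizes each word by such frequencies.

```python
from collections import Counter

import pandas as pd

# Hypothetical frame; which columns count as irrelevant is an assumption.
df = pd.DataFrame({
    "isbn13": [1, 2, 3],
    "thumbnail": ["a.jpg", "b.jpg", "c.jpg"],
    "title": ["The Magic Garden", "Garden of Secrets", "Magic and Secrets"],
})

# errors="ignore" skips any listed column that is absent from the frame.
df = df.drop(columns=["isbn13", "thumbnail"], errors="ignore")

# Count title words; a word cloud scales each word by a table like this one.
stop_words = {"the", "of", "and"}
words = [
    word
    for title in df["title"]
    for word in title.lower().split()
    if word not in stop_words
]
frequencies = Counter(words)
print(frequencies.most_common(3))
```

If the wordcloud package is installed, passing this mapping to `WordCloud().generate_from_frequencies(frequencies)` should produce the image.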
