
Build A Book Recommender System With TF-IDF And Clustering (Python)

Have you ever wondered how similar books get grouped together and recommended? This project builds a book clustering and recommendation system. It studies book metadata and identifies patterns using machine learning techniques such as TF-IDF and clustering. From creating visual tools to suggesting similar books to users, this project covers the full workflow.

Project Overview

This project focuses on turning raw book data into usable insights. To begin with, the dataset is cleaned and preprocessed to make it suitable for analysis. Next, we build a TF-IDF matrix to measure how relevant each word is in describing a book. After that, K-means and hierarchical clustering are used to organize the books into meaningful clusters.

There is more! We also design an engaging book recommendation system. It suggests similar books to readers based on cosine similarity, helping each of them find their next favorite read. The findings are presented in clean, compelling tables and grids.

And even better? We do not stop there. Interactive visuals like treemaps and dendrograms make it easier to understand the structure of the dataset. Whether you are a book lover or a data enthusiast, this project brings data science and machine learning together effortlessly.

Prerequisites

Learners must develop some skills before undertaking this project. Here’s what you should ideally know:

  • Python version 3.7 or higher should be installed on your system.
  • Basic knowledge of Python for data analysis and manipulation.
  • Understanding of clustering methods such as KMeans and hierarchical clustering (dendrograms).
  • Familiarity with concepts such as TF-IDF, tokenization, and stop word removal.
  • Experience with Seaborn, Plotly, and WordCloud for visualization.
  • A dataset in CSV format containing information such as book titles, genres, and ratings.

Approach

The first stage of the project involves obtaining a book dataset that provides titles, descriptions, ratings, and related metadata. The data is then cleaned and preprocessed: missing values are filled in and duplicates are removed so the data is complete. Then, text features such as the description and category are vectorized using TF-IDF, resulting in a sparse matrix that indicates the importance of each word across the books.

Next, we group similar books into coherent clusters using KMeans and hierarchical clustering. These clusters help reveal trends and classify books that share characteristics. For recommendations, we use cosine similarity to find books that are closely related to a given title.

Throughout the process, interactive elements such as treemaps, dendrograms, and grid views are used for better understanding and a richer user experience. Clustered data and recommendations are presented through appealing interfaces, making the project interactive and engaging as well as informative.
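
To make the core idea concrete before diving into the full workflow, here is a minimal sketch (not part of the project code; the sample descriptions are invented) that vectorizes three toy descriptions with TF-IDF and compares them with cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three invented book descriptions, purely for illustration
docs = [
    "a detective investigates a murder mystery in london",
    "a young wizard attends a school of magic",
    "a private detective solves a puzzling mystery",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_demo = vectorizer.fit_transform(docs)            # sparse 3 x vocabulary matrix
# Cosine similarity of the first book against all three
print(cosine_similarity(tfidf_demo[0], tfidf_demo).flatten())
# The third description should score closest to the first (shared detective/mystery terms)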

Workflow and Methodology

  • Loading Data: The book data with relevant information such as book titles, descriptions, categories, and ratings is imported.
  • Data Cleaning: This involves addressing missing data, eliminating repeated entries, and normalizing the text.
  • Feature Extraction: With the help of TF-IDF, textual data is converted into a numerical version that indicates the significance of words in the context.
  • Clustering Algorithms: Execute KMeans and hierarchical clustering to classify books according to their attributes.
  • Recommendation Generation: Cosine similarity is applied to propose further readings associated with a particular title of interest.
  • Data Visualization: Create interactive visuals such as treemaps and grid arrangements for the book clusters.
  • Insights Presentation: Create visuals that show cluster summaries and the ranking of top books in an appealing manner.

Data Collection and Preparation

Data Collection:
In this project, we collected the dataset from a public repository. If you are looking to work on a real-world problem, you can get these kinds of datasets from publicly available repositories such as Kaggle, UCI Machine Learning Repository, or company-specific data. We will provide the dataset in this project so that you can work on the same dataset.

Data Preparation Workflow:

  • Import the Dataset consisting of book names, summaries, genres, and their ratings.
  • Address missing entries either by filling them in or omitting invalid entries.
  • Eliminate duplicate entries so that each data point is represented once.
  • Carry out text preprocessing activities such as lowercasing, punctuation removal, and stop word removal.
  • Combine the relevant text fields and apply TF-IDF vectorization to extract meaningful features.
  • Filter out terms that are too rare or too frequent so that only useful terms remain for clustering and analysis.
  • Shuffle the dataset to avoid any ordering bias in later analysis (a minimal sketch of these steps follows this list).
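
Here is a minimal sketch of these preparation steps with pandas, assuming a CSV with title and description columns similar to the one used later (the file name is a placeholder):

import pandas as pd

df = pd.read_csv("books.csv")                                   # placeholder path
df = df.drop_duplicates(subset=["title"])                       # keep each title once
df["description"] = df["description"].fillna("")                # handle missing text
df["description"] = (df["description"]
                     .str.lower()                               # lowercase
                     .str.replace(r"[^\w\s]", "", regex=True))  # strip punctuation
df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle rows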

Code Explanation

STEP 1:

Mounting Google Drive

First, mount Google Drive to access the dataset that is stored in the cloud.

from google.colab import drive
drive.mount('/content/drive')

Library Installation and Resource Acquisition

  • Install the libraries emoji and contractions to be able to work with emojis and text contractions respectively.
  • Obtain the stopwords resource from NLTK to get rid of common English stopwords in the content.
  • Import the stopwords corpus from NLTK and build the stop word set used during text preparation.
!pip install emoji
!pip install contractions
# Download required nltk resources
import nltk
nltk.download('stopwords')
# Import the stopwords corpus
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

Importing Libraries for Text Processing, Visualization, and Clustering

This block imports the libraries needed for text preprocessing, data analysis, clustering, and visualization. It covers packages for cleaning and normalizing text, expanding contractions, KMeans and hierarchical clustering, and interactive visualization with word clouds, Seaborn, Plotly, and Matplotlib.

import re
import random
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from wordcloud import WordCloud
from emoji import demojize  # Ensure the 'emoji' library is installed
from text_unidecode import unidecode
from contractions import fix as expand_contractions
from nltk.corpus import stopwords
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets
from tabulate import tabulate
from matplotlib.colors import LinearSegmentedColormap

STEP 2:

Dataset Loading

Loads a dataset from the specified file path and displays the first 5 rows for preview.

# Load the dataset
file_path = '/content/drive/MyDrive/New 90 Projects/Project_7/books.csv'
original_df = pd.read_csv(file_path)
show_df = original_df.head(5)
show_df

Column Elimination and Visualization Preparation

Eliminates irrelevant columns, applies color themes, and creates a word cloud visualization of the titles of books.

# Remove columns 'isbn13' and 'isbn10'
books_df = original_df.drop(columns=['isbn13', 'isbn10'])
# Set color variables
primaryColor = '#3cde65'
minColor = "#ffffff"
maxColor = "#2eb04f"
red_minColor = "#ffd6d6"
red_maxColor = "#ff0000"
# Example of setting up a custom colormap
green_cmap = LinearSegmentedColormap.from_list("GreenGradient", [minColor, maxColor])
red_cmap = LinearSegmentedColormap.from_list("RedGradient", [red_minColor, red_maxColor])
# Simple WordCloud
text_data = ' '.join(books_df['title'].astype(str))
wordcloud = WordCloud(width=800, height=400, background_color='white', colormap=green_cmap).generate(text_data)
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Word Cloud of Book Titles", color=primaryColor)
plt.show()

Dataset Overview

Displays the schema of the dataset, including column names, data types, and the number of non-null entries.

books_df.info()

Row Count Display

Prints the total number of rows in the dataset.

# Display the total number of rows
print("Total rows ->", books_df.shape[0])

Null Values Count

Shows the total number of null values in the dataset.

# Display the total number of null values
print("Total null values ->", books_df.isnull().sum().sum())

Identify Columns with Null Values

Lists all columns in the dataset that contain null values.

# Get columns with null values
list_na = books_df.columns[books_df.isnull().any()].tolist()
print("Columns with null values ->", list_na)

Count of each column's null value

Represents the count of null entries present in each of the dataset's columns.

# Display null count for each column
print("Null count for each column ->")
print(books_df.isnull().sum())

Null Value Calculation by Column

Calculates and displays the null value count for each column using a lambda function.

# Calculate the null count for each column
nulls = books_df.apply(lambda x: x.isnull().sum())
nulls

Handle Missing Values with Mean

Fills null values in numeric columns with their column means.

# Print message
print("Using fillna to fill NA values with mean...")
# Select only numeric columns
numeric_books_df = books_df.select_dtypes(include='number')
# Fill NA values with the mean for numeric columns only
books_df[numeric_books_df.columns] = numeric_books_df.fillna(numeric_books_df.mean())

Verify Missing Values After Filling

Fills missing numeric values with column means and prints the total remaining missing values.

# Fill NA values with the mean for numeric columns only
books_df[numeric_books_df.columns] = numeric_books_df.fillna(numeric_books_df.mean())
# Print the total count of missing values after filling with mean
print("Missing values after filling with mean ->", books_df.isnull().sum().sum())

Suppress Warnings

Ignores FutureWarning messages to keep the output clean and focused.

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

STEP 3:

Data Cleaning and Elimination of Duplicates

Replaces empty strings with predefined placeholder values, removes rows with empty or very short descriptions, and keeps only distinct book titles and descriptions.

# Define replacements for empty strings
no_image_available = "https://d3525k1ryd2155.cloudfront.net/h/848/258/116258848.0.m.jpg"
books_df["thumbnail"].replace("", no_image_available, inplace=True)
books_df["subtitle"].replace("", "unknown", inplace=True)
books_df["authors"].replace("", "unknown", inplace=True)
# Filter out rows with empty strings in specified columns and short descriptions
books_df = books_df[
(books_df["title"] \!= "") &
(books_df["categories"] \!= "") &
(books_df["authors"] \!= "") &
(books_df["description"] \!= "") &
(books_df["description"].str.len() \> 20\)
].dropna()
# Remove duplicates based on 'title' and 'description' columns
books_df = books_df.drop_duplicates(subset=["title"], keep="last")
books_df = books_df.drop_duplicates(subset=["description"], keep="last")
# Print a message
print("Removing duplicates")

Prints a message indicating empty strings were replaced or rows were removed.

# Print message
print("Replaced empty string with something or removed the row")

Prints the total number of rows remaining in the dataset after data cleaning.

# Display the total number of rows after cleaning
print("Total rows after cleaning ->", books_df.shape[0])

Generate Summary Statistics

Creates a function that calculates summary statistics (mean, median, standard deviation, minimum, maximum, quartiles, and interquartile range) for a given column.

def get_summary(name):
    # Extract the column as a Series
    X_var = books_df[name]
    # Calculate summary statistics
    summary_dict = {
        'mean_': np.mean(X_var),
        'median_': np.median(X_var),
        'std': np.std(X_var),
        'min_': np.min(X_var),
        'max_': np.max(X_var),
        'Q1_': np.percentile(X_var, 25),
        'Q3_': np.percentile(X_var, 75),
        'iqr_range': np.percentile(X_var, 75) - np.percentile(X_var, 25)
    }
    # Create a DataFrame to store the summary statistics
    summary_df = pd.DataFrame([summary_dict], index=[name])
    summary_df = summary_df[['min_', 'Q1_', 'iqr_range', 'mean_', 'median_', 'Q3_', 'max_', 'std']]
    return summary_df

Histogram with Statistical Markers

Generates a histogram for a column, marks the mean, median, and quartiles with vertical lines, and overlays a normal distribution curve.

def draw_column_chart(name, title, bins):
    # Remove missing values
    df = books_df[name].dropna()
    # Calculate statistics for the histogram markers
    mean_ = np.mean(df)
    median_ = np.median(df)
    std_ = np.std(df)
    Q1_ = np.percentile(df, 25)
    Q3_ = np.percentile(df, 75)
    iqr = Q3_ - Q1_
    # Plot histogram
    plt.figure(figsize=(10, 6))
    sns.histplot(df, bins=bins, kde=True, color=primaryColor, stat="density")
    # Add vertical lines for statistics
    plt.axvline(mean_, color='red', linestyle='--', label='Mean')
    plt.axvline(median_, color='blue', linestyle='--', label='Median')
    plt.axvline(Q1_, color='black', linestyle=':', label='Q1')
    plt.axvline(Q3_, color='black', linestyle=':', label='Q3')
    # Add normal distribution line
    x_values = np.linspace(min(df), max(df), 100)
    plt.plot(x_values, (1 / (std_ * np.sqrt(2 * np.pi))) * np.exp(-(x_values - mean_)**2 / (2 * std_**2)), color='black')
    # Labeling
    plt.title(title)
    plt.xlabel(name)
    plt.ylabel('Density')
    plt.legend()
    plt.show()

Show basic aggregated statistics on the Average Rating

Prints a formatted summary table of statistics such as the mean, median, and IQR for the average_rating field.

average_rating_iqr = get_summary("average_rating")
# Display the summary table in a formatted way
print(tabulate(average_rating_iqr, headers='keys', tablefmt='pretty'))

Plots Average Rating Distribution

Plots a histogram of average_rating with 40 bins and marks its mean, median, and quartiles.

draw_column_chart('average_rating', 'Average Ratings', bins=40)

Show basic aggregated statistics on the Number of Pages

Prints a formatted summary table of statistics such as the mean, median, and IQR for the num_pages field.

num_pages_iqr = get_summary("num_pages")
# Print the summary statistics table in a readable format
print(tabulate(num_pages_iqr, headers='keys', tablefmt='pretty'))

Plots Number of Pages Distribution

Plots a histogram of num_pages with 60 bins and marks its mean, median, and quartiles.

draw_column_chart('num_pages', 'Number of Pages', bins=60)

Classification of Data into Groups

The categories column is used to group the dataset and summary statistics (mean, median, std) are calculated for average_rating, num_pages, and ratings_count. The first five rows of grouped data are displayed.

def group_category_data(df, column_name='categories'):
    """
    Groups data by the specified column and provides summary statistics.
    Args:
        df: Pandas DataFrame containing the data.
        column_name: Name of the column to group by (default: 'categories').
    Returns:
        A DataFrame with grouped data and summary statistics.
    """
    grouped_data = df.groupby(column_name).agg({
        'average_rating': ['mean', 'median', 'std'],
        'num_pages': ['mean', 'median', 'std'],
        'ratings_count': ['mean', 'median', 'std'],
    })
    return grouped_data
# Example usage:
grouped_categories = group_category_data(books_df)
show_df = grouped_categories.head(5)
show_df

Draw Treemap Visualization

Draws a visually appealing treemap using Plotly to display hierarchical data for categories and their average_rating.

def draw_treemap(df, path, values, title):
    """
    Draws a treemap visualization using Plotly.
    Args:
        df: Pandas DataFrame containing the data.
        path: List of column names to create the hierarchy of the treemap.
        values: Column name representing the values to be used for sizing the treemap.
        title: Title of the treemap.
    """
    fig = px.treemap(df, path=path, values=values, title=title)
    fig.show()
# You can replace 'categories' and 'average_rating' with your desired column names.
draw_treemap(books_df, path=['categories'], values='average_rating', title='Book Categories Treemap')

Show basic aggregated statistics on the Rating Count

Prints a formatted summary table of statistics such as the mean, median, and IQR for the ratings_count field.

ratings_count_iqr = get_summary("ratings_count")
# Display the summary table in a formatted way
print(tabulate(ratings_count_iqr, headers='keys', tablefmt='pretty'))

Plots Ratings Count Distribution

Plots a histogram of ratings_count with 40 bins and marks its mean, median, and quartiles.

draw_column_chart('ratings_count', 'Ratings Count', bins=40)

Treemap of Book Ratings by Categories

Draws a visually appealing treemap using Plotly to display hierarchical data for categories and their ratings_count.

draw_treemap(books_df, path=['categories'], values='ratings_count', title='Book Ratings by Categories')

Unique Categories Extraction

Retrieves and displays all unique values in the categories column as an array.

categories_array = books_df['categories'].unique()
categories_array

Interactive Book Cover Display by Category

This section creates a category selection dropdown, renders book covers and titles with HTML/CSS, and displays the highest-rated books in the selected category.

# Placeholder URL for 'no image available'
no_image_available = "https://d3525k1ryd2155.cloudfront.net/h/848/258/116258848.0.m.jpg"
# Function to filter top books by category and ratings
def arrange_top_cat(categorie):
    if categorie == "All":
        df = books_df[books_df['thumbnail'] != no_image_available].sort_values(by="ratings_count", ascending=False).head(7)
    else:
        df = (books_df[(books_df['categories'] == categorie) & (books_df['thumbnail'] != no_image_available)]
              .sort_values(by="ratings_count", ascending=False)
              .head(7))  # Adjust the number of books to display in each row
    return df
# Function to create HTML for displaying book covers with titles
def print_book_cover(data_):
    # HTML and CSS styling for the category header and book grid
    html = f"""
    <style>
    .title-header {{
        text-align: center;
        color: #ffffff;
        background-color: #3cde65;
        padding: 10px 0;
        font-size: 1.2em;
        font-weight: bold;
        width: 100%;
        display: inline-block;
        margin: 20px 0;
    }}
    .book-container {{
        display: flex;
        justify-content: center;
        flex-wrap: wrap;
        gap: 20px;
        padding: 20px;
    }}
    .book-item {{
        background-color: #333;
        border-radius: 8px;
        padding: 15px;
        box-shadow: 0px 4px 12px rgba(0,0,0,0.2);
        width: 150px;
        text-align: center;
        transition: transform 0.3s;
    }}
    .book-item:hover {{
        transform: scale(1.05);
    }}
    .book-title {{
        color: white;
        font-size: 14px;
        margin-top: 10px;
        font-weight: bold;
    }}
    .book-thumbnail {{
        width: 150px;
        height: 200px;
        border-radius: 5px;
    }}
    </style>
    <div class="title-header">{data_['categories'].iloc[0]}</div>
    <div class="book-container">
    """
    # Create book cover layout
    for _, row in data_.iterrows():
        html += f"""
        <div class="book-item">
            <img class="book-thumbnail" src="{row['thumbnail']}">
            <div class="book-title">{row['title']}</div>
        </div>
        """
    html += "</div>"
    # Display HTML
    display(HTML(html))
# Dropdown menu for categories
categories = ['All','Detective and mystery stories', 'Africa, East',
   'Hyland, Morn (Fictitious character)',
'Detective and mystery stories, English', 'Ireland',
"Children's stories, English", 'Literary Collections',
'Imaginary wars and battles', 'Fantasy fiction',
'Hallucinogenic drugs', 'Fiction', 'Authors', 'Conduct of life',
'Alienation (Social psychology)', 'History', 'Juvenile Fiction',
'Literary Criticism', 'Science', 'Biography & Autobiography',
'Family & Relationships', 'Juvenile Nonfiction',
'Business & Economics', 'Poetry', 'Self-Help',
'Sports & Recreation', 'True Crime', 'Religion', 'Psychology',
'Travel', 'Social Science', 'Health & Fitness', 'Music',
'Political science', 'Medical', 'Philosophy',
'Language Arts & Disciplines', 'Education', 'Political Science',
'Antiques & Collectibles', 'Reference', 'Humor',
'American fiction', 'American literature', 'Anger', 'Comedy',
'Gangs', 'Short stories, American', 'Cults', 'Computers', 'Art',
'Existential psychotherapy', 'Body, Mind & Spirit', 'Drama',
'BIOGRAPHY & AUTOBIOGRAPHY', 'Humorous stories, English',
'High schools', 'Dead', 'Families', 'American wit and humor',
'Novelists, American', 'French drama', 'Classical fiction',
'Authors, English', 'Design', 'Adult children', 'Pets',
'Authors, American', 'Performing Arts', 'Cancer',
'Erinyes (Greek mythology)', 'Greek drama (Tragedy)', 'Beowulf',
'Zero (The number)', 'Photography', 'Art museum curators',
'Cooking', 'Bibles', 'Nature', 'Literary Criticism & Collections',
'Young Adult Fiction', 'Diary fiction', 'British',
'Bail bond agents', 'Catholics', 'Bosnia and Hercegovina', 'India',
'Paris (France)', 'FICTION', 'Antisemitism', 'Popular culture',
'Great Britain', 'Apartheid', 'Mathematics', 'Cooking, French',
'Comics & Graphic Novels', 'Bible', 'Short stories',
'Foreign Language Study', 'Horror stories', 'Trials (Witchcraft)',
'Ghost stories', 'Law', 'Architecture', 'Gardening', 'Fairy tales',
'Cider house rules. (Motion picture)', 'Manuscripts',
'Amazon River Region', 'Latin poetry', 'Polish poetry',
'Poets, American',
'Englisch - Geschichte - Lyrik - Aufsatzsammlung',
'Black humor (Literature)', 'Discworld (Imaginary place)',
'Folklore', 'English language', 'Sexual behavior surveys',
'Espionage', "Children's stories", 'Electronic books',
'Alcestis (Greek mythology)', 'Sex customs',
'Technology & Engineering', 'Essentialism (Philosophy)',
'Music trade', 'Computer science', 'Democracy', 'Americans',
'Euthanasia', 'Reducing diets', 'Fantasy fiction, American',
'Adventure stories', 'Explorers', 'Love', 'Games & Activities',
'Games', 'Language and languages', 'Film producers and directors',
'American wit and humor, Pictorial',
'Amyotrophic lateral sclerosis', 'Cats', 'Magic', 'Transportation',
'Finance, Personal', 'JUVENILE FICTION', 'Canada',
'Insane, Criminal and dangerous', 'Study Aids',
'Detective and mystery comic books, strips, etc',
'Theology, Doctrinal', 'Illinois', 'Humorous stories, American',
'Europe', 'Baggins, Frodo (Fictitious character)', 'Black market',
'Shipwrecks', 'Cerebrovascular disease', 'Cities and towns',
'Botanique', 'American poetry', 'House & Home', 'Crafts & Hobbies',
'Building laws', 'LITERARY CRITICISM',
'Human-animal relationships', 'Church work with the poor',
'Death (Fictitious character : Gaiman)', 'Astronomers', 'Girls',
'Otherland (Imaginary place)', 'Consumer behavior',
'Authors, Arab', 'Everest, Mount (China and Nepal)',
'Boats and boating', 'Minimal brain dysfunction in children',
'Spiritual life', 'Meditation']
category_dropdown = widgets.Dropdown(
    options=categories,
    value="All",
    description="Category:",
    style={'description_width': 'initial'},
    layout=widgets.Layout(width="50%")  # Set dropdown width for better appearance
)
# Function to update book covers based on dropdown selection
def on_category_change(change):
    selected_category = change['new']
    value_df = arrange_top_cat(selected_category)
    # Clear previous output before displaying new category data
    clear_output(wait=True)
    display(category_dropdown)  # Display dropdown again after clearing
    print_book_cover(value_df)
# Display dropdown and set up callback
category_dropdown.observe(on_category_change, names='value')
display(category_dropdown)
# Initial display for "All" category
value_df = arrange_top_cat("All")
print_book_cover(value_df)

Text Cleaning and Preparation

This code aims to clean the text by carrying out the following activities: expanding contractions, substituting emojis or emoticons with words, erasing irrelevant content, and eliminating stop words. It also generates a new column clean_text and randomizes the order of the dataset.

# Function to clean and preprocess text
def tidy_text(text):
    # Helper function: replace common emoticons with words
    def replace_emoticons(text):
        emoticon_dict = {
            ":)": "smile", ":(": "sad", ":D": "laugh", ";)": "wink",
            ":-)": "smile", ":-(": "sad", ":-D": "laugh"
            # Add more emoticons as needed
        }
        for emoticon, replacement in emoticon_dict.items():
            text = re.sub(re.escape(emoticon), replacement, text)
        return text
    # Main cleaning steps
    text = unidecode(text)               # Convert non-ASCII characters
    text = text.lower()                  # Convert to lowercase
    text = expand_contractions(text)     # Expand contractions
    # Replace symbols and emoticons
    text = replace_emoticons(text)
    text = demojize(text)                # Convert emojis to text
    # Remove specific phrases or words
    text = re.sub(r"\btongue sticking out\b", "funemoji", text)
    # Remove mentions, URLs, and other unwanted elements
    text = re.sub(r"@\S+|https?://\S+|www\.\S+", "", text)
    text = re.sub(r"\d+", "", text)      # Remove numbers
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra whitespace
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    # Remove stopwords and single-character tokens
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words and len(word) > 1]
    text = " ".join(tokens)
    return text
# Apply text cleaning to the combined text column
books_df['combined_text'] = books_df['description'].fillna('') + " " + books_df['categories'].fillna('') + " " + books_df['authors'].fillna('')
books_df['clean_text'] = books_df['combined_text'].apply(tidy_text)
# Shuffle dataset
books_data = books_df.sample(frac=1, random_state=73).reset_index(drop=True)
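
As a quick sanity check (an illustrative snippet, not part of the original notebook), you can run the cleaner on a made-up string:

sample = "I'm LOVING this book!!! :) More at https://example.com 2023"
print(tidy_text(sample))
# Expected output along the lines of: "loving book smile"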

STEP 4:

Generate Term-Document Matrix (DTM)

From the cleaned text, this step creates a document-term matrix (DTM) with CountVectorizer using an ngram_range of (1, 5), then calculates term frequencies and document frequencies and stores them in a DataFrame together with the IDF values.

# Generate custom stopwords
word_nums = [str(i) for i in range(1, 1001)]  # Numbers from 1 to 1000
more_words = ["unknowns", "reprint", "printing", "set"]
stop_words = list(stopwords.words('english')) + word_nums + more_words # Convert stop_words to a list
# Step 1: Create Document-Term Matrix
count_vectorizer = CountVectorizer(
    ngram_range=(1, 5),  # N-gram window (1 to 5 grams)
    stop_words=stop_words,
    lowercase=True,
    token_pattern=r"(?u)\b\w\w+\b"  # Keeps words with 2 or more characters
)
# Fit and transform the `clean_text` column
dtm = count_vectorizer.fit_transform(books_data['clean_text'])
dtm_feature_names = count_vectorizer.get_feature_names_out()
# Display dimensions of the DTM
print("DTM shape:", dtm.shape)
# Step 2: Calculate Term Frequencies and Document Frequencies
tfidf_transformer = TfidfTransformer(use_idf=True)
tfidf_transformer.fit(dtm)  # Fit to get the IDF values
# Term frequency and document frequency
term_frequencies = np.array(dtm.sum(axis=0)).flatten()  # Sum across documents
document_frequencies = np.array((dtm > 0).sum(axis=0)).flatten()  # Non-zero counts per term
# Create a DataFrame to store term, term frequency, and document frequency
tf_mat = pd.DataFrame({
    'term': dtm_feature_names,
    'term_freq': term_frequencies,
    'doc_freq': document_frequencies,
    'idf': tfidf_transformer.idf_
})
# Display the dimensions of the term-document frequency matrix
print("TF Matrix shape:", tf_mat.shape)

Preview Term-Document Matrix

Displays the top 5 rows of the term-frequency table, including each term, its term frequency, document frequency, and IDF value.

show_df = tf_mat.head(5)
show_df

Term-Document Matrix Filtering Procedures

Applies length, frequency, and document count limits for each term, and cleans up the DTM to comprise only the retained terms.

# Filter terms based on specified conditions
filtered_terms = tf_mat[
    (tf_mat['term'].str.len() > 2) &              # Term length greater than 2 characters
    (tf_mat['doc_freq'] >= 4) &                   # Document frequency at least 4
    (tf_mat['term_freq'] >= 5) &                  # Term frequency at least 5
    (tf_mat['doc_freq'] < (dtm.shape[0] / 2))     # Document frequency less than half of all documents
]
# Get the list of terms to keep based on filtering
terms_to_keep = filtered_terms['term'].tolist()
# Filter the DTM to keep only the selected terms (columns)
filtered_dtm = dtm[:, [count_vectorizer.vocabulary_[term] for term in terms_to_keep if term in count_vectorizer.vocabulary_]]
print("Removing less & more frequent terms")
print("Filtered DTM shape:", filtered_dtm.shape)

Recalculate Term Frequencies for Filtered DTM

Recalculates term frequency and document frequency for the filtered DTM and stores the resulting statistics in a DataFrame for further examination.

# Calculate term frequencies and document frequencies for the filtered DTM
term_frequencies = np.array(filtered_dtm.sum(axis=0)).flatten()        # Sum across all documents for each term
document_frequencies = np.array((filtered_dtm > 0).sum(axis=0)).flatten()  # Count of non-zero occurrences (document frequency)
# Create a DataFrame to store term, term frequency, and document frequency
tf_mat = pd.DataFrame({
    'term': [t for t in terms_to_keep if t in count_vectorizer.vocabulary_],  # terms kept after filtering, in column order
    'term_freq': term_frequencies,
    'doc_freq': document_frequencies
})
# Display the shape of tf_mat
print("TF Matrix shape:", tf_mat.shape)

Displays the top 5 rows of the recalculated term-document matrix.

show_df = tf_mat.head(5)
show_df

Identify the Most Frequent Terms

Displays the top 10 most frequent terms in the dataset along with their frequencies.

most_frequent_terms = tf_mat.sort_values(by='term_freq', ascending=False).head(10)
print("Most frequent terms:")
print(most_frequent_terms[['term', 'term_freq']])

Top Terms by Document Frequency

Displays the top 10 terms with the highest document frequency including the term and its counts in the documents.

top_doc_freq_terms = tf_mat.sort_values(by='doc_freq', ascending=False).head(10)
# Display the result
print("Top 10 terms by document frequency:")
print(top_doc_freq_terms[['term', 'doc_freq']])

STEP 5:

TF-IDF Matrix Generation

Calculates the TF-IDF matrix, which scores how relevant each word is to a book based on how often it appears in that book and how rarely it appears across the other books.

tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(dtm)
# Convert to dense format if needed (this is usually memory-intensive for large datasets)
tfidf_dense = tfidf.toarray()
print("TF-IDF (Term Frequency - Inverse Document Frequency): a handy algorithm that uses the frequency of words to determine how relevant those words are to a given document")

Book Suggestion System

Builds a user-friendly recommendation engine that applies cosine similarity to the TF-IDF matrix to suggest books related to a title entered by the user, and presents the results in styled HTML.

# Function to recommend books based on cosine similarity
def recommend_books(book_title, tfidf_matrix, books_data, top_n=5):
    try:
        # Find the index of the book in the DataFrame
        book_index = books_data[books_data['title'] == book_title].index[0]
        # Calculate cosine similarity between the selected book and all other books
        cosine_similarities = cosine_similarity(tfidf_matrix[book_index], tfidf_matrix).flatten()
        # Get the indices of the most similar books (excluding the book itself)
        related_docs_indices = cosine_similarities.argsort()[::-1][1:top_n + 1]
        # Filter relevant columns for displaying
        recommended_books = books_data.loc[related_docs_indices, [
            'thumbnail', 'title', 'categories', 'published_year', 'average_rating', 'num_pages'
        ]].copy()
        recommended_books['similarity'] = cosine_similarities[related_docs_indices]
        return recommended_books
    except IndexError:
        print(f"Book '{book_title}' not found in the dataset.")
        return pd.DataFrame()
# Function to display recommendations with professional styling
def display_recommendations(recommendations, book_title):
    clear_output(wait=True)  # Clear previous output to avoid stacking results
    display(book_title_input, recommend_button)  # Redisplay input and button after clearing output
    if not recommendations.empty:
        # Build a styled results table
        html = f"""
        <h3 style="color: #3cde65;">Books Similar to '{book_title}'</h3>
        <table border="1" style="border-collapse: collapse; width: 100%;">
            <thead>
                <tr>
                    <th>Thumbnail</th>
                    <th>Title</th>
                    <th>Categories</th>
                    <th>Published Year</th>
                    <th>Average Rating</th>
                    <th>Number of Pages</th>
                    <th>Similarity</th>
                </tr>
            </thead>
            <tbody>
        """
        for _, row in recommendations.iterrows():
            similarity_color = f"rgba(60, 222, 101, {row['similarity']})"
            html += f"""
                <tr style="background-color: {similarity_color};">
                    <td><img src="{row['thumbnail']}" width="60"></td>
                    <td>{row['title']}</td>
                    <td>{row['categories']}</td>
                    <td>{row['published_year']}</td>
                    <td>{row['average_rating']}</td>
                    <td>{row['num_pages']}</td>
                    <td>{row['similarity']:.2f}</td>
                </tr>
            """
        html += """
            </tbody>
        </table>
        """
        display(HTML(html))
    else:
        display(HTML(f"""
            <h3>No recommendations found</h3>
            <p>We couldn't find any books similar to '{book_title}'. Please try a different title.</p>
        """))
# Interactive input widget
book_title_input = widgets.Text(
value='',
placeholder='Enter a book title...',
description='Book Title:',
style={'description_width': 'initial'},
layout=widgets.Layout(width='400px')
)
# Button to trigger recommendation
recommend_button = widgets.Button(
    description="Get Recommendations",
    button_style='success',  # assumed button style; the original value was not preserved
    layout=widgets.Layout(width='200px')
)
# Function to handle button click
def on_recommend_button_click(b):
    book_title = book_title_input.value
    if book_title:
        recommendations = recommend_books(book_title, tfidf, books_data)
        display_recommendations(recommendations, book_title)
    else:
        print("Please enter a book title.")
# Assign button click event
recommend_button.on_click(on_recommend_button_click)
# Display input widget and button
display(book_title_input, recommend_button)

KMeans Clustering of Books

Clusters books into 5 groups using KMeans on the filtered DTM and assigns cluster labels to each book for analysis.

# Choose the number of clusters (k)
n_clusters = 5  # You can experiment with different values
# Initialize and fit the KMeans model
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans.fit(filtered_dtm)
# Get cluster labels for each document
cluster_labels = kmeans.labels_
# Add cluster labels to your DataFrame
books_data['cluster'] = cluster_labels
# Analyze the clusters
for i in range(n_clusters):
    print(f"\nCluster {i}:")
    cluster_books = books_data[books_data['cluster'] == i]
    print(cluster_books[['title', 'categories']].head(10))  # Print example titles and categories
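
The choice of n_clusters = 5 above is only a starting point; a small sweep like the sketch below (using the silhouette_score imported earlier) is one way to compare a few candidate values of k before settling on one:

# Compare silhouette scores for a few candidate cluster counts (illustrative sweep)
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(filtered_dtm)
    print(f"k={k}: silhouette={silhouette_score(filtered_dtm, labels):.3f}")
# Higher silhouette values indicate better-separated clusters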

Cluster Analysis and Book Display

Runs KMeans clustering on sample data and renders a well-formatted HTML summary table for each cluster, along with the most popular books in every cluster.

# Sample Data (for illustration)
books_data = pd.DataFrame({
    'thumbnail': [no_image_available] * 100,
    'title': [f"Book {i}" for i in range(100)],
    'categories': ['Category'] * 100,
    'average_rating': np.random.rand(100) * 5,
    'ratings_count': np.random.randint(1, 1000, 100),
    'num_pages': np.random.randint(100, 1000, 100)
})
# Function to print model results with enhanced design
def print_model_results(model_build):
    clustering = model_build.labels_
    books_data["cluster"] = clustering
    n_clusters = len(set(clustering))
    # Summary of each cluster's top words (placeholder)
    cluster_summary = pd.DataFrame({
        'cluster': range(n_clusters),
        'size': [np.sum(clustering == i) for i in range(n_clusters)],
        'top_words': ['word1, word2, word3'] * n_clusters
    })
    # Display cluster summary as a styled table
    html = """
    <h3>Cluster Summary</h3>
    <table border="1" style="border-collapse: collapse;">
        <thead><tr><th>Cluster</th><th>Size</th><th>Top Words</th></tr></thead>
        <tbody>
    """
    for _, row in cluster_summary.iterrows():
        html += f"<tr><td>{row['cluster']}</td><td>{row['size']}</td><td>{row['top_words']}</td></tr>"
    html += "</tbody></table>"
    display(HTML(html))
    # Function to get top books in a cluster
    def get_top_books_on_cluster(cluster_num, n=7):
        cluster_books = books_data[books_data["cluster"] == cluster_num]
        cluster_books = cluster_books.sort_values(by="ratings_count", ascending=False)
        return cluster_books.head(n)
    # Display popular books in each cluster with images and enhanced styling
    html = "<h3>Popular Books in Each Cluster</h3>"
    for cluster_num in range(n_clusters):
        cluster_books = get_top_books_on_cluster(cluster_num)
        html += f"<h4>Cluster {cluster_num}</h4>"
        html += """
        <table border="1" style="border-collapse: collapse;">
            <thead>
                <tr>
                    <th>Thumbnail</th><th>Title</th><th>Categories</th>
                    <th>Average Rating</th><th>Ratings Count</th><th>Number of Pages</th>
                </tr>
            </thead>
            <tbody>
        """
        for _, row in cluster_books.iterrows():
            img_html = f"<img src='{row['thumbnail']}' width='60'>"
            html += f"""
                <tr>
                    <td>{img_html}</td>
                    <td>{row['title']}</td>
                    <td>{row['categories']}</td>
                    <td>{row['average_rating']:.2f}</td>
                    <td>{row['ratings_count']}</td>
                    <td>{row['num_pages']}</td>
                </tr>
            """
        html += "</tbody></table>"
    display(HTML(html))
# Example usage
vectorizer = TfidfVectorizer(max_features=20)
dtm = vectorizer.fit_transform(["sample text"] * 100)  # Example text data for illustration
model = KMeans(n_clusters=5)
model_build = model.fit(dtm)
print_model_results(model_build)

Hierarchical Clustering of Books

Carries out hierarchical clustering on a cosine distance matrix, draws a dendrogram, and assigns cluster labels limited to the number of books available.

# Assuming `tfidf` is a precomputed TF-IDF matrix
csim = cosine_similarity(tfidf)
distance_matrix = 1 - csim
np.fill_diagonal(distance_matrix, 0)  # Replace NaN with 0 (if any diagonal elements are NaN)
# Function to build hierarchical clustering model
def build_hcclust_model(distance_matrix, n_clusters, num_books):  # num_books limits the cluster assignments
    # Perform hierarchical clustering
    hc = sch.linkage(squareform(distance_matrix), method="ward")
    # Plot the dendrogram
    plt.figure(figsize=(10, 7))
    plt.title("Hierarchical Clustering Dendrogram")
    plt.xlabel("Sample Index")
    plt.ylabel("Distance")
    sch.dendrogram(hc, truncate_mode='lastp', p=n_clusters, show_leaf_counts=False, no_labels=True)
    plt.show()
    # Form flat clusters, limiting to the number of books in books_data
    hc_clustering = sch.fcluster(hc, n_clusters, criterion='maxclust')[:num_books]  # Limit the cluster assignments
    return {
        'type_': 'hc_clust',
        'model': hc,
        'clustering': hc_clustering
    }
# Number of clusters
n_clusters = 25
# Get the number of books in books_data
num_books = len(books_data)
# Build the model, passing the number of books
model_build = build_hcclust_model(distance_matrix, n_clusters, num_books)
# Display clusters
books_data["cluster"] = model_build['clustering']

Cluster Analytics and Popular Books Display

Improves the cluster analysis by providing a summary table with the total number of clusters and the top words in those clusters, and a grid comprising some popular books in those book clusters with their thumbnail images and descriptions.

print(books_data[["title", "cluster"]].head())

This code visualizes and displays clustering results by summarizing each cluster's size and top words, then showcases popular books in each cluster with images, ratings, and categories in a stylish grid layout.

# Function to print model results with enhanced design and grid view for images
def print_model_results(model_build):
    if isinstance(model_build, dict) and model_build.get('type_') == 'hc_clust':
        clustering = model_build['clustering']
    else:
        clustering = model_build.labels_
    books_data["cluster"] = clustering
    # Use the actual labels (hierarchical labels start at 1, KMeans labels at 0)
    cluster_ids = sorted(set(clustering))
    # Summary of each cluster's top words (placeholder - replace with actual top words if available)
    cluster_summary = pd.DataFrame({
        'cluster': cluster_ids,
        'size': [np.sum(clustering == i) for i in cluster_ids],
        'top_words': ['word1, word2, word3'] * len(cluster_ids)
    })
    # Display cluster summary as a styled table
    html = """
    <h3>Cluster Summary</h3>
    <table border="1" style="border-collapse: collapse;">
        <thead><tr><th>Cluster</th><th>Size</th><th>Top Words</th></tr></thead>
        <tbody>
    """
    for _, row in cluster_summary.iterrows():
        html += f"<tr><td>{row['cluster']}</td><td>{row['size']}</td><td>{row['top_words']}</td></tr>"
    html += "</tbody></table>"
    display(HTML(html))
    # Function to get top books in a cluster
    def get_top_books_on_cluster(cluster_num):
        cluster_books = books_data[books_data["cluster"] == cluster_num]
        cluster_books = cluster_books.sort_values(by="ratings_count", ascending=False)  # Sort by ratings count
        return cluster_books.head(7)
    # Display popular books in each cluster in a grid layout with images
    html = "<h3>Popular Books in Each Cluster</h3>"
    for cluster_num in cluster_ids:
        cluster_books = get_top_books_on_cluster(cluster_num)
        html += f"<h4>Cluster {cluster_num}</h4>"
        html += "<div style='display: flex; flex-wrap: wrap; gap: 20px;'>"
        for _, row in cluster_books.iterrows():
            img_html = f"<img src='{row['thumbnail']}' width='100'>"
            html += f"""
            <div style='width: 150px; text-align: center;'>
                {img_html}
                <div><b>{row['title']}</b></div>
                <div>{row['categories']}</div>
                <div>Rating: {row['average_rating']:.2f}</div>
            </div>
            """
        html += "</div>"
    display(HTML(html))
# Example usage (adjust as needed based on your data and model)
# Assuming 'model_build' is either the KMeans or hierarchical clustering model you built
print_model_results(model_build)

Conclusion

This project demonstrates the value of machine learning and NLP for analyzing book data and organizing it into meaningful groups. TF-IDF, KMeans clustering, and hierarchical clustering reveal clearer relationships between books. The interactive recommendation system shows that cosine similarity is a practical way to suggest related titles.

Visualizations such as treemaps and dendrograms make the book clusters and their attributes easy to navigate. The workflow shows how sourcing data, engineering features, and clustering can turn raw data into useful information. By combining data analysis with a working recommendation system, this project serves as a great introduction to applying data science and machine learning in the real world.

Challenges New Coders Might Face

  • Challenge: Handling noisy or unstructured text data.
    Solution: Utilize text cleaning methods, which may include the exclusion of special symbols, figures, and extra spaces.

  • Challenge: Preprocessing Large Text Data
    Solution: Enhance text cleaning processes by employing better libraries such as NLTK and adopting batch processing for the data.

  • Challenge: The curse of dimensionality in high-dimensional text data affecting clustering and classification results.
    Solution: Use TF-IDF vectorization and dimensionality reduction techniques such as PCA to control dimensionality.

  • Challenge: Choosing the Right Number of Clusters
    Solution: Employ relevant evaluation metrics such as the silhouette score or dendrograms to establish cluster count.

  • Challenge: Slow or insufficient recommendations for user queries.
    Solution: Use vectorized computations and precompute similarity scores for popular books (see the sketch below).
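
Here is a minimal, hedged sketch of that precomputation idea, assuming the tfidf matrix and books_data built earlier; linear_kernel (already imported) is equivalent to cosine similarity on L2-normalized TF-IDF rows:

# Precompute pairwise similarities once, then answer queries with a fast lookup
sim_matrix = linear_kernel(tfidf, tfidf)          # dense (n_books x n_books) array
def top_similar(book_index, top_n=5):
    scores = sim_matrix[book_index]
    best = scores.argsort()[::-1][1:top_n + 1]    # skip the book itself
    return books_data.iloc[best][['title']].assign(similarity=scores[best])
# Example lookup (index 0 is arbitrary)
print(top_similar(0))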

Frequently Asked Questions (FAQs)

Question 1: What is the aim of book clustering and recommendation systems?
Answer: The aim is to group books that share similar features and to suggest titles to users according to their interests.

Question 2: Which machine learning methods are incorporated in the Book Recommendation System?
Answer: In this project, KMeans clustering and hierarchical clustering techniques are used to identify books that are analogous.

Question 3: What impact does TF-IDF have on the study of book summaries?
Answer: TF-IDF works well for feature extraction because it balances how frequent a word is in a book's text against how rare it is across the whole collection.

Question 4: What is the meaning of cosine similarity and how does it relate to suggesting books to a user?
Answer: Cosine similarity measures how close two TF-IDF vectors are in direction; books whose vectors are most similar to a given title's vector are recommended.

Question 5: How many clusters should I consider when clustering books?
Answer: Apply methods such as silhouette scoring or inspection of the dendrogram to the data to choose a suitable number of clusters.
