Build a Collaborative Filtering Recommender System in Python

For this project, we are developing a recommendation system that suggests movies based on users' preferences. We use a mix of methods: collaborative filtering, content-based filtering, and a LightFM hybrid model. The task is to predict which movies a user is likely to enjoy and recommend them!

Project Overview

The project develops a movie recommendation system using three techniques: collaborative filtering, content-based filtering, and a hybrid method built with LightFM.

We first clean and prepare two datasets: one for user ratings and another for movie metadata. After handling missing values, we apply Singular Value Decomposition (SVD) to perform collaborative filtering and predict how a user would rate unseen movies. LightFM is then used to combine collaborative and content-based methods, improving recommendation quality by taking both user behavior and movie features into account. Item-based collaborative filtering is also implemented to find movies similar to each other using cosine similarity.

Finally, weighted ratings are computed to rank movies by their average rating while also accounting for the number of votes they have received. The end result is a personalized movie recommendation system that helps users find films they will enjoy.

Prerequisites

  • A general understanding of Python programming and data analysis tools such as pandas and NumPy.
  • Familiarity with machine learning concepts, especially collaborative filtering and content-based filtering.
  • Prior exposure to recommendation systems and their evaluation metrics, such as RMSE and MAE.
  • Knowledge of Python libraries, especially LightFM, scikit-learn, and Surprise.
  • Understanding of cosine similarity and its use in item-based collaborative filtering.
  • Experience with data preparation tasks such as handling missing values.
  • Familiarity with validation methods such as cross-validation, and an understanding of performance metrics.

Approach

We follow a staged approach in developing this recommendation system. The initial preprocessing phase cleans the user ratings and movie metadata and handles missing values. Next comes collaborative filtering, where Singular Value Decomposition (SVD) learns from user-item interactions to predict ratings. We take this a step further and combine collaborative filtering with content-based filtering using LightFM, incorporating movie features such as genres so that the hybrid model can outperform conventional collaborative filtering on its own. In addition, we implement item-based collaborative filtering by computing cosine similarity between movies, recommending titles most similar to those a user has already rated highly. Finally, we refine the rankings with weighted ratings, which combine each movie's vote count with its average rating. Together, these approaches make the movie recommendations more personalized, accurate, and relevant.

Workflow and Methodologies

Workflow

  • Data Collection: Load the user ratings and the movie metadata into dataframes for analysis.
  • Data Preprocessing: Handle missing data and convert data types as the models require.
  • Exploratory Data Analysis: Identify patterns by analyzing data distributions, ratings, and movie characteristics.
  • Model Building: Use SVD for collaborative filtering, LightFM for hybrid recommendations, and cosine similarity for item-based filtering.
  • Evaluation: Assess the quality of the recommendations using metrics such as RMSE and MAE.
  • Final Recommendations: Generate and display user-specific movie recommendations.

Methodology

  • Collaborative Filtering: Use SVD to predict the ratings a user would give to movies.
  • Hybrid Approach: Integrate collaborative and content-based filtering with LightFM for higher accuracy.
  • Item-Based Filtering: Use cosine similarity to find and recommend movies similar to those a user has rated.
  • Weighted Ratings: Calculate a weighted rating for each film from its vote count and average rating to improve ranking (see the short worked example after this list).
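
To make the weighted-rating idea concrete, here is a short worked example. The formula matches the calculate_weighted_rating function implemented later in this project, but the numbers below are made up purely for illustration.

# IMDB-style weighted rating: WR = (v / (v + m)) * R + (m / (v + m)) * C
v = 800    # votes the movie received (hypothetical)
R = 7.8    # the movie's average rating (hypothetical)
m = 160    # minimum votes required, e.g. the 90th-percentile vote count (hypothetical)
C = 6.1    # mean rating across all movies (hypothetical)
weighted_rating = (v / (v + m)) * R + (m / (v + m)) * C
print(round(weighted_rating, 3))  # 7.517 -- the 7.8 average is pulled toward the global mean

Notice how a movie with relatively few votes gets pulled toward the global mean, which prevents obscure titles with a handful of perfect scores from dominating the rankings.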

Data Collection and Preparation

Data Collection:
In this project, we use a dataset from a public repository. If you want to work on a real-world problem, you can find similar datasets in publicly available repositories such as Kaggle and the UCI Machine Learning Repository, or in company-specific data. We provide the dataset with this project so that you can work with the same data.

Data Preparation Workflow:

  • Imported user ratings and movie metadata.
  • Handled missing values in vote_count and vote_average.
  • Converted movie and user IDs to strings.
  • Filtered movies with more than 55 votes.
  • Merged ratings and movie data on movie IDs.
  • Filled missing genre data with empty strings.
  • Built an interaction matrix for LightFM.
  • Calculated weighted ratings based on votes and averages.

Code Explanation

Step 1:

Mounting Drive

This code connects your Google Drive account to a Colab workspace. It makes the files in your Google Drive accessible under a specific folder ('/content/drive').

from google.colab import drive
drive.mount('/content/drive')

Installing Required Libraries

The code installs scikit-surprise for collaborative filtering and lightfm for the hybrid recommendation system. Both libraries are essential for building the recommendation models.

!pip install scikit-surprise
!pip install lightfm

Importing Libraries

The code imports the libraries needed for building and evaluating recommendation systems: lightfm for hybrid recommendations, surprise for collaborative filtering, seaborn and matplotlib for visualization, and machine learning tools such as NearestNeighbors and SVD for model building.

import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from lightfm import LightFM
from surprise import accuracy
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise.prediction_algorithms.matrix_factorization import SVD
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

warnings.filterwarnings('ignore')

Step 2:

Loading Datasets

The code loads two datasets: ratings_small.csv for the user ratings and movies_metadata.csv for the movie details. These datasets form the basis of the recommendation system.

# Load ratings data
ratings = pd.read_csv('/content/ratings_small.csv')
# Load movies data
movies = pd.read_csv('/content/movies_metadata.csv')

Previewing Data

This block of code displays the first few rows of the ratings dataset to give a quick overview of its structure.

ratings.head()

This block of code displays the first few rows of the movies dataset to give a quick overview of its structure.

movies.head()

Checking for Missing Values

This code checks for null values in the movies and ratings dataframes, printing the number of null values per column to locate any gaps in the data.

# Check for missing values
print("Checking Movies dataframe Null values:")
print(movies.isnull().sum())
print("Checking Rating dataframe Null values:")
print(ratings.isnull().sum())

Calculating and Visualizing Average Ratings

The code computes the mean rating per movie and then visualizes the distribution. The histogram with a KDE curve shows how average ratings are spread across movies.

# Calculate average ratings per movie
movie_avg_ratings = ratings.groupby('movieId')['rating'].mean()
# Plot the distribution of average ratings per movie
plt.figure(figsize=(10, 6))
sns.histplot(movie_avg_ratings, bins=50, kde=True)
plt.title('Average Ratings Distribution per Movie')
plt.xlabel('Average Rating')
plt.ylabel('Frequency')
plt.show()

Analyzing Ratings per User

The code calculates how many ratings each user has given and visualizes the distribution.

# Number of ratings per user
user_ratings_count = ratings.groupby('userId').size()
plt.figure(figsize=(10, 6))
sns.histplot(user_ratings_count, bins=50, kde=True)
plt.title('Number of Ratings per User')
plt.xlabel('Number of Ratings')
plt.ylabel('Frequency')
plt.show()

Step 3:

Filtering Movies with High Vote Count

The code filters the movies dataframe to include only movies with more than 55 votes. It then displays the id and title of these movies.

movie_md = movies[movies['vote_count']>55][['id','title']]
movie_md.head()

Filtering Ratings for Selected Movies

This code keeps only the rows of the ratings dataframe that refer to the movies filtered above (those with more than 55 votes). It resets the index of the filtered dataframe to produce a cleaner output.

movie_ids = [int(x) for x in movie_md['id'].values]
ratings = ratings[ratings['movieId'].isin(movie_ids)]
ratings.reset_index(inplace=True, drop=True)
ratings.head()

Step 4:

Preparing Data for Surprise

The code prepares the rating data for use in the surprise library. It specifies the rating scale (1-5) and wraps the data in a Dataset object for model training.

# Prepare the data for Surprise
reader = Reader(rating_scale=(1, 5))  # Assuming ratings are between 1 and 5
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

Data Splitting and Building the SVD Model

The code splits the data into training and test sets with an 80-20 split. It then builds an SVD (Singular Value Decomposition) model and trains it on the training set.

# Split the data into training and test sets
trainset, testset = train_test_split(data, test_size=0.2)
# Build the SVD model
svd = SVD()
svd.fit(trainset)
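
Optionally, before relying on a single train/test split, you can sanity-check the model with Surprise's built-in cross-validation helper. This is a minimal sketch using the data object defined above; the fold count of 5 is an arbitrary choice.

from surprise.model_selection import cross_validate

# 5-fold cross-validation on the full Dataset; prints RMSE and MAE for each fold
cv_results = cross_validate(SVD(), data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
print(cv_results['test_rmse'].mean())  # average RMSE across the folds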

Recommendation function

The get_recommendations function generates movie recommendations for a user. It predicts ratings for movies the user has not yet interacted with and returns the top_n recommendations sorted by predicted rating in descending order.

def get_recommendations(data, movie_md, user_id, top_n, algo):
    # creating an empty list to store the recommended movies
    recommendations = []
    # creating a user-item interactions matrix
    user_movie_interactions_matrix = data.pivot(index='userId', columns='movieId', values='rating')
    # extracting those movie ids which the user_id has not interacted with yet
    non_interacted_movies = user_movie_interactions_matrix.loc[user_id][user_movie_interactions_matrix.loc[user_id].isnull()].index.tolist()
    # looping through each of the movie ids which user_id has not interacted with yet
    for item_id in non_interacted_movies:
        # predicting the rating for this non-interacted movie id for this user
        est = algo.predict(user_id, item_id).est
        # looking up the movie title and appending it with the predicted rating
        movie_name = movie_md[movie_md['id']==str(item_id)]['title'].values[0]
        recommendations.append((movie_name, est))
    # sorting the predicted ratings in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)
    return recommendations[:top_n]

Movie Recommendations

The get_recommendations function is called and returns the top ten movie recommendations for the user with ID 654, using the trained SVD model to predict ratings for movies the user has not yet rated.

get_recommendations(data=ratings, movie_md=movie_md, user_id=654, top_n=10, algo=svd)

Step 5:

Generating Predictions for Test Sets

This code generates predictions for the test set using the trained SVD model via svd.test, which lets us evaluate how the model performs on unseen data.

predictions = svd.test(testset)

The computation and display of RMSE and MAE

The code calculates the root mean squared error (RMSE) and mean absolute error (MAE) for the SVD model by comparing the predicted ratings with the actual ratings from the test set, then prints the results.

from sklearn.metrics import mean_squared_error, mean_absolute_error
# Get predictions
predictions = svd.test(testset)
# Convert predictions into a format suitable for regression metrics
predicted_ratings = [pred.est for pred in predictions]
true_ratings = [pred.r_ui for pred in predictions]
# Calculate RMSE and MAE
# Taking the square root of the MSE gives the RMSE; this works across sklearn versions
rmse = np.sqrt(mean_squared_error(true_ratings, predicted_ratings))
mae = mean_absolute_error(true_ratings, predicted_ratings)
print(f"RMSE: {rmse}")
print(f"MAE: {mae}")

Plotting RMSE and MAE

The code creates a bar plot to visualize the RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) evaluation metrics.

metrics = ['RMSE', 'MAE']
values = [rmse, mae]
plt.figure(figsize=(8, 6))
plt.bar(metrics, values, color=['skyblue', 'lightcoral'])
plt.title('Model Evaluation Metrics')
plt.ylabel('Value')
plt.ylim(0, max(values) + 0.5)  # Adjust y-axis limit for better visualization
for i, v in enumerate(values):
    plt.text(i, v + 0.05, str(round(v, 3)), ha='center')  # Label each bar with its value
plt.show()

Step 6:

Loading Datasets

The code reloads the data: ratings.csv for the full set of user ratings and movies_metadata.csv for the movie details. These fresh copies are used for the weighted-rating and hybrid steps that follow.

# Load ratings data
ratings = pd.read_csv('/content/ratings.csv')
# Load movies data
movies = pd.read_csv('/content/movies_metadata.csv')

Weighted Rating Computation

The calculate_weighted_rating function computes a weighted rating for each movie from its vote count and average rating. It uses a quantile cutoff (m_percentile) to balance the influence of the vote count against the overall average rating.

# Weighted Rating Calculation
def calculate_weighted_rating(df, m_percentile=0.9):
    m = df['vote_count'].quantile(m_percentile)
    C = df['vote_average'].mean()
    def weighted_rating(x):
        v = x['vote_count']
        R = x['vote_average']
        return (v / (v + m) * R) + (m / (m + v) * C)
    df['weighted_rating'] = df.apply(weighted_rating, axis=1)
    return df

Cleaning and Preparing Movie Metadata

This code snippet converts the vote_count and vote_average columns to numeric types, with errors='coerce' turning unparseable values into NaN. Any rows with missing values in these two columns are then dropped to keep the data clean.

# Clean and preprocess movie metadata
movies['vote_count'] = pd.to_numeric(movies['vote_count'], errors='coerce')
movies['vote_average'] = pd.to_numeric(movies['vote_average'], errors='coerce')
movies = movies.dropna(subset=['vote_count', 'vote_average'])

Applying Weighted Rating

The calculate_weighted_rating function is applied to the movies dataframe to create a new column, weighted_rating, which contains ratings adjusted for vote count and average rating.

# Apply weighted rating
movies = calculate_weighted_rating(movies)

Combine Reviews and Movies Metadata

This code merges the ratings dataframe with the movies dataframe on the movie ID, first converting both the movieId and id columns to strings so that the merge keys have the same type in both dataframes.

# Merge ratings and metadata
movies['id'] = movies['id'].astype(str)
ratings['movieId'] = ratings['movieId'].astype(str)
merged_data = pd.merge(ratings, movies, left_on='movieId', right_on='id')

Content-Based Similarity

This code calculates content-based similarity between movies from their genres. It first fills any missing genre values, then applies a TF-IDF vectorizer to turn the genre text into numerical features, and finally computes the cosine similarity between those feature vectors.

# Content-Based Similarity
tfidf = TfidfVectorizer(stop_words='english')
movies['genres'] = movies['genres'].fillna('')
tfidf_matrix = tfidf.fit_transform(movies['genres'])
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
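
One caveat: in many copies of movies_metadata.csv, the genres column stores JSON-like strings such as [{'id': 28, 'name': 'Action'}] rather than plain genre names. If that is true for your file, a small parsing step before the TF-IDF transform gives cleaner features. This is an optional sketch, not part of the original pipeline; extract_genre_names is a helper introduced here for illustration.

import ast

def extract_genre_names(genre_str):
    # Parse the JSON-like string and join the genre names, e.g. "Action Adventure"
    try:
        return ' '.join(g['name'] for g in ast.literal_eval(genre_str))
    except (ValueError, SyntaxError):
        return ''  # leave unparseable or empty rows as empty strings

movies['genres'] = movies['genres'].fillna('').apply(extract_genre_names)
tfidf_matrix = tfidf.fit_transform(movies['genres'])
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)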

Step 7:

Integrating Hybrid Recommendation in LightFM

This code sets up a LightFM dataset for the hybrid recommender. It creates a LightFMDataset object and fits it with the unique user and movie identifiers from the merged data, preparing the mappings needed for model training.

from lightfm.data import Dataset as LightFMDataset

lightfm_dataset = LightFMDataset()  # Create a LightFM Dataset object
lightfm_dataset.fit(
    (x for x in merged_data['userId'].unique()),   # Fit user IDs
    (x for x in merged_data['movieId'].unique()),  # Fit item IDs
)

Including Ratings in Interactions

This code builds the interaction matrix for the LightFM model. Each interaction is a (user, movie, rating) triple, and the returned weights carry the rating values so that stronger ratings count for more during training.

# Include ratings in interactions:
interactions, weights = lightfm_dataset.build_interactions(
    [(row['userId'], row['movieId'], row['rating']) for index, row in merged_data.iterrows()]
)

LightFM Model Training

This code trains a LightFM model with the WARP (Weighted Approximate-Rank Pairwise) loss for 30 epochs on the interaction matrix, using the weights to reflect the importance of different ratings.

model = LightFM(loss='warp')
model.fit(interactions, sample_weight=weights, epochs=30)
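
With the model trained, you can score movies for a user. LightFM works with internal integer indices, so we translate through the dataset's mappings. This is a minimal sketch assuming the lightfm_dataset, model, and merged_data objects defined above; lightfm_top_n is a helper introduced here for illustration.

# Map raw user/movie IDs to LightFM's internal integer indices and back
user_id_map, _, item_id_map, _ = lightfm_dataset.mapping()
inv_item_map = {v: k for k, v in item_id_map.items()}

def lightfm_top_n(raw_user_id, n=10):
    user_idx = user_id_map[raw_user_id]
    scores = model.predict(user_idx, np.arange(len(item_id_map)))
    top_items = np.argsort(-scores)[:n]  # highest-scoring internal item indices
    return [inv_item_map[i] for i in top_items]  # convert back to raw movie IDs

# Example: top 10 movie IDs for the first user in the merged data
print(lightfm_top_n(merged_data['userId'].iloc[0]))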

Item-based Collaborative Filtering

The get_similar_movies function finds the ten movies most similar, by cosine similarity, to the movie ID passed in as a parameter. It returns their titles and weighted ratings, excluding the query movie itself from the recommendations.

# Item-Based Collaborative Filtering
def get_similar_movies(movie_id, cosine_sim=cosine_sim, movies=movies):
    # Use the positional index so it lines up with the rows of the TF-IDF matrix
    idx = np.where(movies['id'].values == movie_id)[0][0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]  # Top 10 similar movies, skipping the movie itself
    movie_indices = [i[0] for i in sim_scores]
    return movies.iloc[movie_indices][['title', 'weighted_rating']]

Getting Similar Movies

This code fetches the top 10 movies most similar to the movie with ID '862' based on cosine similarity and prints their titles and weighted ratings.

# Example: Get similar movies for a given movie
similar_movies = get_similar_movies('862')
print(similar_movies)

Conclusion

This movie recommendation system employs both collaborative filtering and content-based filtering to provide informative movie recommendations. Singular Value Decomposition (SVD) learns from historical user ratings and captures patterns in user behavior. The LightFM model produces hybrid recommendations that combine user-item interactions with movie features such as genre for better predictions. In addition, item-based collaborative filtering with cosine similarity recommends movies based on how similar they are to titles a user has rated. To improve the ranking, weighted ratings are calculated from each movie's total votes and average rating. Altogether, this multi-pronged approach makes the movie recommendations highly personalized and efficient, and it can be deployed at scale to serve end users relevant movie suggestions.

Challenges New Coders Might Face

  • Challenge: Handling noisy or unstructured text data.
    Solution: Utilize text cleaning methods, which may include the exclusion of special symbols, figures, and extra spaces.

  • Challenge: Data Preprocessing and merging
    Solution: Make sure that the data types for IDs are consistent (e.g., movie IDs and user IDs as strings) and clean up the data before merging.

  • Challenge: Model Evaluation
    Solution: Employ multiple evaluation metrics to give a comprehensive view of the model performance.

  • Challenge: Inaccessibility of GPU
    Solution: For debugging or initial testing, consider using smaller datasets or incorporating GPU-based cloud platforms for quick turnaround times.

  • Challenge: Model Complexity.
    Solution: Simplify the approach by testing each method individually and gradually combining them, ensuring each technique adds value to the final recommendations.

Frequently Asked Questions (FAQs)

Question 1: What is a hybrid recommendation system?
Answer: A hybrid recommendation system combines more than one technique, such as collaborative filtering and content-based filtering. Combining them improves both the accuracy and the diversity of the recommendations: user interactions and product features are considered together to give each user personalized suggestions.

Question 2: How is collaborative filtering done in a recommendation system?
Answer: Collaborative filtering analyzes user interaction data, such as purchase or rating history, finds patterns in that behavior, and produces recommendations based on the preferences of similar users.

Question 3: What is content-based filtering in recommendation systems?
Answer: Content-based filtering recommends items based on their attributes, such as features and categories. It relies on the characteristics of the items themselves and does not take user behavior into account when making recommendations.

Question 4: Why is LightFM used to build hybrid recommendation models?
Answer: LightFM is a powerful Python library designed for hybrid recommendation systems. It efficiently combines collaborative and content-based filtering, handles sparse interaction matrices, and uses optimized ranking losses to create personalized recommendations.

Question 5: How do you address data sparsity in collaborative filtering?
Answer: Data sparsity limits collaborative filtering because users interact with only a small fraction of the items. Remedies include matrix factorization techniques, hybrid models that add content-based filtering, and algorithms designed to work well with sparse data.
