Build a Collaborative Filtering Recommender System in Python

For this project, we are developing a recommendation system that suggests movies based on users' preferences. We use a mix of methods: collaborative filtering, content-based filtering, and a LightFM hybrid model. The task is to predict which movies a user is likely to enjoy and recommend them!

Project Overview

The project develops a movie recommendation system using three techniques: collaborative filtering, content-based filtering, and a hybrid method built with LightFM.

We first clean and prepare two datasets: one for user ratings and another for movie metadata. After handling missing values, we apply Singular Value Decomposition (SVD) to perform collaborative filtering and predict how a user would rate unseen movies. LightFM is then used to combine collaborative and content-based methods, improving recommendation quality by taking both user behavior and movie features into account. Item-based collaborative filtering is also implemented to find movies similar to each other using cosine similarity.

Finally, weighted ratings are computed to rank movies by their average rating while also accounting for the number of votes they have received. The end result is a personalized movie recommendation system that helps users find films they will enjoy.

Prerequisites

  • A general understanding of Python programming and data analysis tools such as pandas and NumPy.
  • Familiarity with machine learning concepts, especially collaborative filtering and content-based filtering.
  • Prior exposure to recommendation systems and their evaluation metrics, such as RMSE and MAE.
  • Knowledge of Python libraries, especially LightFM, scikit-learn, and Surprise.
  • Understanding of cosine similarity and its use in item-based collaborative filtering.
  • Experience with data preparation tasks such as handling missing values.
  • Familiarity with validation methods such as cross-validation, and an understanding of performance metrics.

Approach

We follow a staged approach in developing this recommendation system. The initial preprocessing phase cleans the user ratings and movie metadata and handles missing values. Next comes collaborative filtering, where Singular Value Decomposition (SVD) learns from user-item interactions to predict ratings. We take this a step further and combine collaborative filtering with content-based filtering using LightFM, incorporating movie features such as genres so that the hybrid model can outperform conventional collaborative filtering on its own. In addition, we implement item-based collaborative filtering by computing cosine similarity between movies, recommending titles most similar to those a user has already rated highly. Finally, we refine the rankings with weighted ratings, which combine each movie's vote count with its average rating. Together, these approaches make the movie recommendations more personalized, accurate, and relevant.

Workflow and Methodologies

Workflow

  • Data Collection: Load the user ratings and the movie metadata into dataframes for analysis.
  • Data Preprocessing: Handle missing data and convert data types as the models require.
  • Exploratory Data Analysis: Identify patterns by analyzing data distributions, ratings, and movie characteristics.
  • Model Building: Use SVD for collaborative filtering, LightFM for hybrid recommendations, and cosine similarity for item-based filtering.
  • Evaluation: Assess the quality of the recommendations using metrics such as RMSE and MAE.
  • Final Recommendations: Generate and display user-specific movie recommendations.

Methodology

  • Collaborative Filtering: Use SVD to predict the ratings a user would give to movies.
  • Hybrid Approach: Integrate collaborative and content-based filtering with LightFM for higher accuracy.
  • Item-Based Filtering: Use cosine similarity to find and recommend movies similar to those a user has rated.
  • Weighted Ratings: Calculate a weighted rating for each film from its vote count and average rating to improve ranking (see the short worked example after this list).
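
To make the weighted-rating idea concrete, here is a short worked example. The formula matches the calculate_weighted_rating function implemented later in this project, but the numbers below are made up purely for illustration.

# IMDB-style weighted rating: WR = (v / (v + m)) * R + (m / (v + m)) * C
v = 800    # votes the movie received (hypothetical)
R = 7.8    # the movie's average rating (hypothetical)
m = 160    # minimum votes required, e.g. the 90th-percentile vote count (hypothetical)
C = 6.1    # mean rating across all movies (hypothetical)
weighted_rating = (v / (v + m)) * R + (m / (v + m)) * C
print(round(weighted_rating, 3))  # 7.517 -- the 7.8 average is pulled toward the global mean

Notice how a movie with relatively few votes gets pulled toward the global mean, which prevents obscure titles with a handful of perfect scores from dominating the rankings.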

Data Collection and Preparation

Data Collection:
In this project, we use a dataset from a public repository. If you want to work on a real-world problem, you can find similar datasets in publicly available repositories such as Kaggle and the UCI Machine Learning Repository, or in company-specific data. We provide the dataset with this project so that you can work with the same data.

Data Preparation Workflow:

  • Imported user ratings and movie metadata.
  • Handled missing values in vote_count and vote_average.
  • Converted movie and user IDs to strings.
  • Filtered movies with more than 55 votes.
  • Merged ratings and movie data on movie IDs.
  • Filled missing genre data with empty strings.
  • Built an interaction matrix for LightFM.
  • Calculated weighted ratings based on votes and averages.

Code Explanation

Step 1:

Mounting Drive

This code connects your Google Drive account to a Colab workspace. It makes the files in your Google Drive accessible under a specific folder ('/content/drive').

from google.colab import drive
drive.mount('/content/drive')

Installing Required Libraries

The code installs scikit-surprise for collaborative filtering and lightfm for the hybrid recommendation system. Both libraries are essential for building the recommendation models.

!pip install scikit-surprise
!pip install lightfm

Importing Libraries

The code imports the libraries needed for building and evaluating recommendation systems: lightfm for hybrid recommendations, surprise for collaborative filtering, seaborn and matplotlib for visualization, and machine learning tools such as NearestNeighbors and SVD for model building.

import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from lightfm import LightFM
from surprise import accuracy
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise.prediction_algorithms.matrix_factorization import SVD
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

warnings.filterwarnings('ignore')

Step 2:

Loading Datasets

The code loads two datasets: ratings_small.csv for the user ratings and movies_metadata.csv for the movie details. These datasets form the basis of the recommendation system.

# Load ratings data
ratings = pd.read_csv('/content/ratings_small.csv')
# Load movies data
movies = pd.read_csv('/content/movies_metadata.csv')

Previewing Data

This block of code displays the first few rows of the ratings dataset to give a quick overview of its structure.

ratings.head()

This block of code displays the first few rows of the movies dataset to give a quick overview of its structure.

movies.head()

Checking for Missing Values

This code checks for null values in the movies and ratings dataframes, printing the number of null values per column to locate any gaps in the data.

# Check for missing values
print("Checking Movies dataframe Null values:")
print(movies.isnull().sum())
print("Checking Rating dataframe Null values:")
print(ratings.isnull().sum())

Calculating and Visualizing Average Ratings

The code computes the mean rating per movie and then visualizes the distribution. The histogram with a KDE curve shows how average ratings are spread across movies.

# Calculate average ratings per movie
movie_avg_ratings = ratings.groupby('movieId')['rating'].mean()
# Plot the distribution of average ratings per movie
plt.figure(figsize=(10, 6))
sns.histplot(movie_avg_ratings, bins=50, kde=True)
plt.title('Average Ratings Distribution per Movie')
plt.xlabel('Average Rating')
plt.ylabel('Frequency')
plt.show()

Analyzing Ratings per User

The code calculates how many ratings each user has given and visualizes the distribution.

# Number of ratings per user
user_ratings_count = ratings.groupby('userId').size()
plt.figure(figsize=(10, 6))
sns.histplot(user_ratings_count, bins=50, kde=True)
plt.title('Number of Ratings per User')
plt.xlabel('Number of Ratings')
plt.ylabel('Frequency')
plt.show()

Step 3:

Filtering Movies with High Vote Count

The code filters the movies dataframe to include only movies with more than 55 votes. It then displays the id and title of these movies.

movie_md = movies[movies['vote_count']>55][['id','title']]
movie_md.head()

Filtering Ratings for Selected Movies

This code keeps only the rows of the ratings dataframe that refer to the movies filtered above (those with more than 55 votes). It resets the index of the filtered dataframe to produce a cleaner output.

movie_ids = [int(x) for x in movie_md['id'].values]
ratings = ratings[ratings['movieId'].isin(movie_ids)]
ratings.reset_index(inplace=True, drop=True)
ratings.head()

Step 4:

Preparing Data for Surprise

The code prepares the rating data for use in the surprise library. It specifies the rating scale (1-5) and wraps the data in a Dataset object for model training.

# Prepare the data for Surprise
reader = Reader(rating_scale=(1, 5))  # Assuming ratings are between 1 and 5
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

Data Splitting and Building the SVD Model

The code splits the data into training and test sets with an 80-20 split. It then builds an SVD (Singular Value Decomposition) model and trains it on the training set.

# Split the data into training and test sets
trainset, testset = train_test_split(data, test_size=0.2)
# Build the SVD model
svd = SVD()
svd.fit(trainset)
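
Optionally, before relying on a single train/test split, you can sanity-check the model with Surprise's built-in cross-validation helper. This is a minimal sketch using the data object defined above; the fold count of 5 is an arbitrary choice.

from surprise.model_selection import cross_validate

# 5-fold cross-validation on the full Dataset; prints RMSE and MAE for each fold
cv_results = cross_validate(SVD(), data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
print(cv_results['test_rmse'].mean())  # average RMSE across the folds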

Recommendation function

The get_recommendations function generates movie recommendations for a user. It predicts ratings for movies the user has not yet interacted with and returns the top_n recommendations sorted by predicted rating in descending order.

def get_recommendations(data, movie_md, user_id, top_n, algo):
    # creating an empty list to store the recommended movies
    recommendations = []
    # creating a user-item interactions matrix
    user_movie_interactions_matrix = data.pivot(index='userId', columns='movieId', values='rating')
    # extracting those movie ids which the user_id has not interacted with yet
    non_interacted_movies = user_movie_interactions_matrix.loc[user_id][user_movie_interactions_matrix.loc[user_id].isnull()].index.tolist()
    # looping through each of the movie ids which user_id has not interacted with yet
    for item_id in non_interacted_movies:
        # predicting the rating for this non-interacted movie id for this user
        est = algo.predict(user_id, item_id).est
        # looking up the movie title and appending it with the predicted rating
        movie_name = movie_md[movie_md['id']==str(item_id)]['title'].values[0]
        recommendations.append((movie_name, est))
    # sorting the predicted ratings in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)
    return recommendations[:top_n]

Movie Recommendations

The get_recommendations function is called and returns the top ten movie recommendations for the user with ID 654, using the trained SVD model to predict ratings for movies the user has not yet rated.

get_recommendations(data=ratings, movie_md=movie_md, user_id=654, top_n=10, algo=svd)

Step 5:

Generating Predictions for Test Sets

This code generates predictions for the test set using the trained SVD model via svd.test, which lets us evaluate how the model performs on unseen data.

predictions = svd.test(testset)

The computation and display of RMSE and MAE

The code calculates the root mean squared error (RMSE) and mean absolute error (MAE) for the SVD model by comparing the predicted ratings with the actual ratings from the test set, then prints the results.

from sklearn.metrics import mean_squared_error, mean_absolute_error
# Get predictions
predictions = svd.test(testset)
# Convert predictions into a format suitable for regression metrics
predicted_ratings = [pred.est for pred in predictions]
true_ratings = [pred.r_ui for pred in predictions]
# Calculate RMSE and MAE
# Taking the square root of the MSE gives the RMSE; this works across sklearn versions
rmse = np.sqrt(mean_squared_error(true_ratings, predicted_ratings))
mae = mean_absolute_error(true_ratings, predicted_ratings)
print(f"RMSE: {rmse}")
print(f"MAE: {mae}")

Plotting RMSE and MAE

The code creates a bar plot to visualize the RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) evaluation metrics.

metrics = ['RMSE', 'MAE']
values = [rmse, mae]
plt.figure(figsize=(8, 6))
plt.bar(metrics, values, color=['skyblue', 'lightcoral'])
plt.title('Model Evaluation Metrics')
plt.ylabel('Value')
plt.ylim(0, max(values) + 0.5)  # Adjust y-axis limit for better visualization
for i, v in enumerate(values):
    plt.text(i, v + 0.05, str(round(v, 3)), ha='center')  # Label each bar with its value
plt.show()

Step 6:

Loading Datasets

The code reloads the data: ratings.csv for the full set of user ratings and movies_metadata.csv for the movie details. These fresh copies are used for the weighted-rating and hybrid steps that follow.

# Load ratings data
ratings = pd.read_csv('/content/ratings.csv')
# Load movies data
movies = pd.read_csv('/content/movies_metadata.csv')

Weighted Rating Computation

The calculate_weighted_rating function computes a weighted rating for each movie from its vote count and average rating. It uses a quantile cutoff (m_percentile) to balance the influence of the vote count against the overall average rating.

# Weighted Rating Calculation
def calculate_weighted_rating(df, m_percentile=0.9):
    m = df['vote_count'].quantile(m_percentile)
    C = df['vote_average'].mean()
    def weighted_rating(x):
        v = x['vote_count']
        R = x['vote_average']
        return (v / (v + m) * R) + (m / (m + v) * C)
    df['weighted_rating'] = df.apply(weighted_rating, axis=1)
    return df

Cleaning and Preparing Movie Metadata

This code snippet converts the vote_count and vote_average columns to numeric types, with errors='coerce' turning unparseable values into NaN. Any rows with missing values in these two columns are then dropped to keep the data clean.

# Clean and preprocess movie metadata
movies['vote_count'] = pd.to_numeric(movies['vote_count'], errors='coerce')
movies['vote_average'] = pd.to_numeric(movies['vote_average'], errors='coerce')
movies = movies.dropna(subset=['vote_count', 'vote_average'])

Applying Weighted Rating

The calculate_weighted_rating function is applied to the movies dataframe to create a new column, weighted_rating, which contains ratings adjusted for vote count and average rating.

# Apply weighted rating
movies = calculate_weighted_rating(movies)

Combine Reviews and Movies Metadata

This code merges the ratings dataframe with the movies dataframe on the movie ID, first converting both the movieId and id columns to strings so that the merge keys have the same type in both dataframes.

# Merge ratings and metadata
movies['id'] = movies['id'].astype(str)
ratings['movieId'] = ratings['movieId'].astype(str)
merged_data = pd.merge(ratings, movies, left_on='movieId', right_on='id')

Content-Based Similarity

This code calculates content-based similarity between movies from their genres. It first fills any missing genre values, then applies a TF-IDF vectorizer to turn the genre text into numerical features, and finally computes the cosine similarity between those feature vectors.

# Content-Based Similarity
tfidf = TfidfVectorizer(stop_words='english')
movies['genres'] = movies['genres'].fillna('')
tfidf_matrix = tfidf.fit_transform(movies['genres'])
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
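
One caveat: in many copies of movies_metadata.csv, the genres column stores JSON-like strings such as [{'id': 28, 'name': 'Action'}] rather than plain genre names. If that is true for your file, a small parsing step before the TF-IDF transform gives cleaner features. This is an optional sketch, not part of the original pipeline; extract_genre_names is a helper introduced here for illustration.

import ast

def extract_genre_names(genre_str):
    # Parse the JSON-like string and join the genre names, e.g. "Action Adventure"
    try:
        return ' '.join(g['name'] for g in ast.literal_eval(genre_str))
    except (ValueError, SyntaxError):
        return ''  # leave unparseable or empty rows as empty strings

movies['genres'] = movies['genres'].fillna('').apply(extract_genre_names)
tfidf_matrix = tfidf.fit_transform(movies['genres'])
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)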

Step 7:

Integrating Hybrid Recommendation in LightFM

This code sets up a LightFM dataset for the hybrid recommender. It creates a LightFMDataset object and fits it with the unique user and movie identifiers from the merged data, preparing the mappings needed for model training.

from lightfm.data import Dataset as LightFMDataset

lightfm_dataset = LightFMDataset()  # Create a LightFM Dataset object
lightfm_dataset.fit(
    (x for x in merged_data['userId'].unique()),   # Fit user IDs
    (x for x in merged_data['movieId'].unique()),  # Fit item IDs
)

Including Ratings in Interactions

This code builds the interaction matrix for the LightFM model. Each interaction is a (user, movie, rating) triple, and the returned weights carry the rating values so that stronger ratings count for more during training.

# Include ratings in interactions:
interactions, weights = lightfm_dataset.build_interactions(
    [(row['userId'], row['movieId'], row['rating']) for index, row in merged_data.iterrows()]
)

LightFM Model Training

This code trains a LightFM model with the WARP (Weighted Approximate-Rank Pairwise) loss for 30 epochs on the interaction matrix, using the weights to reflect the importance of different ratings.

model = LightFM(loss='warp')
model.fit(interactions, sample_weight=weights, epochs=30)
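
With the model trained, you can score movies for a user. LightFM works with internal integer indices, so we translate through the dataset's mappings. This is a minimal sketch assuming the lightfm_dataset, model, and merged_data objects defined above; lightfm_top_n is a helper introduced here for illustration.

# Map raw user/movie IDs to LightFM's internal integer indices and back
user_id_map, _, item_id_map, _ = lightfm_dataset.mapping()
inv_item_map = {v: k for k, v in item_id_map.items()}

def lightfm_top_n(raw_user_id, n=10):
    user_idx = user_id_map[raw_user_id]
    scores = model.predict(user_idx, np.arange(len(item_id_map)))
    top_items = np.argsort(-scores)[:n]  # highest-scoring internal item indices
    return [inv_item_map[i] for i in top_items]  # convert back to raw movie IDs

# Example: top 10 movie IDs for the first user in the merged data
print(lightfm_top_n(merged_data['userId'].iloc[0]))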

Item-based Collaborative Filtering

The get_similar_movies function finds the ten movies most similar, by cosine similarity, to the movie ID passed in as a parameter. It returns their titles and weighted ratings, excluding the query movie itself from the recommendations.

# Item-Based Collaborative Filtering
def get_similar_movies(movie_id, cosine_sim=cosine_sim, movies=movies):
    # Use the positional index so it lines up with the rows of the TF-IDF matrix
    idx = np.where(movies['id'].values == movie_id)[0][0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]  # Top 10 similar movies, skipping the movie itself
    movie_indices = [i[0] for i in sim_scores]
    return movies.iloc[movie_indices][['title', 'weighted_rating']]

Getting Similar Movies

This code fetches the top 10 movies most similar to the movie with ID '862' based on cosine similarity and prints their titles and weighted ratings.

# Example: Get similar movies for a given movie
similar_movies = get_similar_movies('862')
print(similar_movies)

Conclusion

This movie recommendation system employs both collaborative filtering and content-based filtering to provide informative movie recommendations. Singular Value Decomposition (SVD) learns from historical user ratings and captures patterns in user behavior. The LightFM model produces hybrid recommendations that combine user-item interactions with movie features such as genre for better predictions. In addition, item-based collaborative filtering with cosine similarity recommends movies based on how similar they are to titles a user has rated. To improve the ranking, weighted ratings are calculated from each movie's total votes and average rating. Altogether, this multi-pronged approach makes the movie recommendations highly personalized and efficient, and it can be deployed at scale to serve end users relevant movie suggestions.

Challenges New Coders Might Face

  • Challenge: Handling noisy or unstructured text data.
    Solution: Utilize text cleaning methods, which may include the exclusion of special symbols, figures, and extra spaces.

  • Challenge: Data Preprocessing and merging
    Solution: Make sure that the data types for IDs are consistent (e.g., movie IDs and user IDs as strings) and clean up the data before merging.

  • Challenge: Model Evaluation
    Solution: Employ multiple evaluation metrics to give a comprehensive view of the model performance.

  • Challenge: Inaccessibility of GPU
    Solution: For debugging or initial testing, consider using smaller datasets or incorporating GPU-based cloud platforms for quick turnaround times.

  • Challenge: Model Complexity.
    Solution: Simplify the approach by testing each method individually and gradually combining them, ensuring each technique adds value to the final recommendations.

Frequently Asked Questions (FAQs)

Question 1: What is a hybrid recommendation system?
Answer: A hybrid recommendation system combines more than one technique, such as collaborative filtering and content-based filtering. Combining them improves both the accuracy and the diversity of the recommendations: user interactions and product features are considered together to give each user personalized suggestions.

Question 2: How is collaborative filtering done in a recommendation system?
Answer: Collaborative filtering analyzes user interaction data, such as purchase or rating history, finds patterns in that behavior, and produces recommendations based on the preferences of similar users.

Question 3: What is content-based filtering in recommendation systems?
Answer: Content-based filtering recommends items based on their attributes, such as features and categories. It relies on the characteristics of the items themselves and does not take user behavior into account when making recommendations.

Question 4: Why is LightFM used to build hybrid recommendation models?
Answer: LightFM is a powerful Python library designed for hybrid recommendation systems. It efficiently combines collaborative and content-based filtering, handles sparse interaction matrices, and uses optimized ranking losses to create personalized recommendations.

Question 5: How do you address data sparsity in collaborative filtering?
Answer: Data sparsity limits collaborative filtering because users interact with only a small fraction of the items. Remedies include matrix factorization techniques, hybrid models that add content-based filtering, and algorithms designed to work well with sparse data.
