Recommender systems are methods that predict users' interests and make meaningful recommendations for different kinds of items, such as songs to play on Spotify, movies to watch on Netflix, news articles to read on your favourite newspaper's website, or products to purchase on Amazon.

Recommender systems can be distinguished primarily by the type of information that they use. Content-based recommenders rely on attributes of users and/or items, whereas collaborative filtering uses information on the interaction between users and items, expressed in the so-called user-item interaction matrix.

Recommender systems are generally divided into three main approaches: content-based, collaborative filtering, and hybrid recommendation systems (see Fig. 1).

Figure 1: Types of recommender systems

What are content-based recommender systems?

Content-based recommender systems generate recommendations by relying on attributes of items and/or users. User attributes can include age, sex, job type and other personal information. Item attributes, on the other hand, are descriptive information that distinguishes individual items from each other. In the case of movies, this could include title, cast, description, genre and others.

Because they rely on features of users and items, content-based recommender systems resemble a traditional machine learning problem more closely than collaborative filtering does. A content-based method uses item-based or user-based features to predict a user's action for a given item. The user's action can be a specific rating, a purchase decision, a like or dislike, a decision to view a movie, and similar.
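To make this framing concrete, here is a minimal, purely illustrative sketch of content-based recommendation as a supervised learning problem. The features and labels below are hypothetical, chosen only to show the shape of the approach:

import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical feature vectors: [user_age, user_is_male, item_is_action, item_is_comedy]
X = np.array([[25, 1, 1, 0],
              [34, 0, 0, 1],
              [19, 1, 1, 0],
              [52, 0, 0, 1]])
# hypothetical labels: 1 = user liked the item, 0 = did not
y = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
# predicted probability that a 30-year-old male user likes an action movie
print(model.predict_proba(np.array([[30, 1, 1, 0]]))[0, 1])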

One of the advantages of content-based recommendation is user independence: unlike collaborative filtering, it does not require information about other users to make recommendations to a user. This makes the content-based approach easier to scale. Another benefit is that the recommendations are more transparent, as the recommender can explain each recommendation in terms of the features used.

The content-based approach also has its drawbacks. One is over-specialization: if the user is only interested in specific categories, the recommender will have difficulty recommending items outside of this scope, so the user remains within their current circle of items and interests. Content-based approaches also often require domain knowledge to produce relevant item and user features.

We will now build an implementation of a content-based recommender in Python, using the MovieLens dataset.

Content-based recommender system for recommending movies

Our recommender system will recommend movies based on movie plots, and based on a combination of features such as the top actors, director, keywords, producer and screenplay writers of the movies.

First, we import the required libraries:

import pandas as pd
import numpy as np
import ast
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import seaborn as sns
import matplotlib.pyplot as plt

Next, we load the data, available at https://www.kaggle.com/rounakbanik/the-movies-dataset and https://grouplens.org/datasets/movielens/latest/:

df_data = pd.read_csv('movies_metadata.csv', low_memory=False)

One of the pre-processing steps for our recommender is removing movies with a low number of votes:

df_data = df_data[df_data['vote_count'].notna()]

plt.figure(figsize=(20, 5))
# distplot is deprecated in recent seaborn versions; histplot is its replacement
sns.histplot(df_data['vote_count'])
plt.title("Histogram of vote counts")

# determine the minimum number of votes a movie must have to be included
min_votes = np.percentile(df_data['vote_count'].values, 85)

# exclude movies that do not have the minimum number of votes
df = df_data.copy(deep=True).loc[df_data['vote_count'] > min_votes]

Content-based recommender that recommends movies based on similarity of movie plots

Our first content-based recommender aims to recommend movies with a plot similar to that of a selected movie.

We will use the “overview” feature from our dataset:

# remove rows with a missing overview
df = df[df['overview'].notna()]
df.reset_index(inplace=True)

# pre-processing of overviews
def process_text(text):
    # replace multiple spaces with one
    text = ' '.join(text.split())
    # lowercase
    text = text.lower()
    return text

df['overview'] = df.apply(lambda x: process_text(x.overview), axis=1)

To compare movie plots, we first need to compute a numerical representation for them. There are various approaches, from bag of words and word embeddings to TF-IDF; we will use the latter.

TF-IDF approach

The TF-IDF of a word in a document that is part of a larger corpus of documents combines two values. The first is the term frequency (TF), which measures how frequently the word occurs in the document.

However, some words, such as "the" and "is", occur frequently in all documents, and we want to downscale the importance of such words. This is accomplished by multiplying TF with the inverse document frequency (IDF).

This ensures that a word is considered important for a document only if it is frequent in that document but comparatively rare in the rest of the corpus.
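In scikit-learn's default smoothed formulation, tf-idf(t, d) = tf(t, d) x (ln((1 + n) / (1 + df(t))) + 1), where n is the number of documents and df(t) is the number of documents containing term t; the resulting vectors are normalized to unit length. A quick illustration on a made-up toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["the cat sat", "the dog ran", "the bird flew"]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# 'the' occurs in every document, so it gets a lower weight
# than the rarer content words in the same document
# (get_feature_names_out requires scikit-learn >= 1.0)
print(dict(zip(toy_vectorizer.get_feature_names_out(),
               toy_matrix.toarray()[0].round(3))))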

To build the TF-IDF representation of the movie plots, we use the TfidfVectorizer from scikit-learn. We fit it on the movie plot descriptions and transform them into their TF-IDF numerical representation:

tf_idf = TfidfVectorizer(stop_words='english')
tf_idf_matrix = tf_idf.fit_transform(df['overview'])
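The result is a sparse matrix with one row per movie plot and one column per vocabulary term, which we can quickly sanity-check:

# (number of movie plots, size of the fitted vocabulary)
print(tf_idf_matrix.shape)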

 

Now that we have a numerical vector representing each movie plot, we can compute the similarity between movies by calculating their pairwise cosine similarities and storing them in a cosine similarity matrix:

# calculate the cosine similarity between all pairs of movies
cosine_similarity_matrix = cosine_similarity(tf_idf_matrix, tf_idf_matrix)
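Cosine similarity measures the angle between two vectors: cos(u, v) = u · v / (||u|| ||v||), which for non-negative TF-IDF vectors ranges from 0 (no shared terms) to 1 (identical direction). A minimal illustration of the measure itself, on made-up vectors:

import numpy as np

def cosine(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 0.0])
v = np.array([2.0, 4.0, 0.0])
print(cosine(u, v))  # 1.0 -- parallel vectors are maximally similar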

With the cosine similarity matrix computed, we can define a function, recommendations, that returns the top recommendations for a given movie.

The function first determines the index of the input movie, retrieves the similarities of all movies with the selected movie, sorts them, and returns the titles of the movies with the highest similarity to the selected one.

# function that returns the index of a movie from its title
def index_from_title(df, title):
    return df[df['original_title'] == title].index.values[0]

# function that returns the title of a movie from its index
def title_from_index(df, index):
    return df[df.index == index].original_title.values[0]

# generate recommendations for a given title
def recommendations(original_title, df, cosine_similarity_matrix, number_of_recommendations):
    index = index_from_title(df, original_title)
    similarity_scores = list(enumerate(cosine_similarity_matrix[index]))
    similarity_scores_sorted = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # skip the first entry, which is the movie itself
    recommendations_indices = [t[0] for t in similarity_scores_sorted[1:(number_of_recommendations + 1)]]
    return df['original_title'].iloc[recommendations_indices]

 

We can now produce recommendations for a given film, e.g. ‘Batman’:

recommendations('Batman', df, cosine_similarity_matrix, 10)

3693    Batman Beyond: Return of the Joker
5962    The Dark Knight Rises
7379    Batman vs Dracula
5476    Batman: Under the Red Hood
6654    Batman: Mystery of the Batwoman
3911    Batman Begins
6334    Batman: The Dark Knight Returns, Part
1770    Batman & Robin
4725    The Dark Knight
709     Batman Returns

 

Content-based recommender based on keyword, actor, screenplay, director, producer and genre features

The recommender based on overviews is of limited quality, as it considers only the movie plot.

We will now explore a different recommender, which gives more focus to other metadata (keywords, actors, director, producer, genres and screenplay authors) when recommending movies.

To use the additional metadata, we first need to extract it from two separate files, keywords.csv and credits.csv, and merge it with the main pandas dataframe:

df_keywords = pd.read_csv('keywords.csv')
df_credits = pd.read_csv('credits.csv')

# some ids have an irregular format, so we remove them
df_cb = df_data.copy(deep=True)[df_data.id.apply(lambda x: x.isnumeric())]
df_cb['id'] = df_cb['id'].astype(int)
df_keywords['id'] = df_keywords['id'].astype(int)
df_credits['id'] = df_credits['id'].astype(int)

# merge the keywords and credits of movies with the main dataset
df_movies_data = pd.merge(df_cb, df_keywords, on='id')
df_movies_data = pd.merge(df_movies_data, df_credits, on='id')

 

Again, we keep only the movies with the highest vote counts, using similar code as before (a sketch is shown below).
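For completeness, a minimal sketch of this step, mirroring the earlier percentile filter; it produces the df_movies dataframe used in the rest of this section:

# keep only the movies above the 85th percentile of vote counts, as before
df_movies_data = df_movies_data[df_movies_data['vote_count'].notna()]
min_votes = np.percentile(df_movies_data['vote_count'].values, 85)
df_movies = df_movies_data.copy(deep=True).loc[df_movies_data['vote_count'] > min_votes]
df_movies.reset_index(inplace=True)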

We next create a new feature for each movie, consisting of the top 4 actors in the movie. We also concatenate and lowercase each actor's name and surname: we want, e.g., the Tom in Tom Hanks to be distinct from the Tom in Tom Selleck, so both names become the single tokens tomhanks and tomselleck, respectively:

max_number_of_actors = 4

def return_actors(cast):
    actors = []
    count = 0
    # the cast column stores a stringified list of dictionaries
    for row in ast.literal_eval(cast):
        if count < max_number_of_actors:
            actors.append(row['name'].lower().replace(" ", ""))
        else:
            break
        count += 1
    return ' '.join(actors)

df_movies['actors'] = df_movies.apply(lambda x: return_actors(x.cast), axis=1)
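As an illustration of why we parse the cast column with ast.literal_eval, here is a hypothetical, heavily simplified cast string in the dataset's stringified-list format:

# hypothetical cast string (the real entries contain many more keys)
sample_cast = "[{'name': 'Tom Hanks'}, {'name': 'Tim Allen'}]"
print(return_actors(sample_cast))  # tomhanks timallen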

We will now create similar features for the director, screenplay writer and producer of each movie. To simplify, we only use the first person detected per job type.

def return_producer_screenplay_director(crew, crew_type):
    persons = []
    for row in ast.literal_eval(crew):
        if row['job'].lower() == crew_type:
            persons.append(row['name'].lower().replace(" ", ""))
            # keep only the first person found for the given job type
            break
    return ' '.join(persons)

df_movies['director'] = df_movies.apply(lambda x: return_producer_screenplay_director(x.crew, 'director'), axis=1)
df_movies['screenplay'] = df_movies.apply(lambda x: return_producer_screenplay_director(x.crew, 'screenplay'), axis=1)
df_movies['producer'] = df_movies.apply(lambda x: return_producer_screenplay_director(x.crew, 'producer'), axis=1)

After generating the individual metadata features, we merge them into a single feature, with the ability to weight each feature individually. This allows us to build highly flexible recommenders, as we will see later on.

# relative importance of the different features
w_genres = 2
w_keywords = 3
w_actors = 3
w_director = 1
w_producer = 1
w_screenplay = 1

# function for merging the features into a single string
def concatenate_features(df_row):
    genres = []
    for genre in ast.literal_eval(df_row['genres']):
        genres.append(genre['name'].lower())
    genres = ' '.join(genres)

    keywords = []
    for keyword in ast.literal_eval(df_row['keywords']):
        keywords.append(keyword['name'])
    keywords = ' '.join(keywords)

    # repeat each feature according to its weight and join everything
    features = ([genres] * w_genres + [keywords] * w_keywords
                + [df_row['actors']] * w_actors + [df_row['director']] * w_director
                + [df_row['producer']] * w_producer + [df_row['screenplay']] * w_screenplay)
    return ' '.join(features)

 

df_movies['features'] = df_movies.apply(concatenate_features, axis=1)

# pre-process the features text with the process_text function defined earlier
df_movies['features'] = df_movies.apply(lambda x: process_text(x.features), axis=1)
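To verify the result, we can peek at the combined feature string of the first movie:

# inspect the first 300 characters of the combined, weighted feature string
print(df_movies['features'].iloc[0][:300])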

 

After generating the feature, we again need to vectorize it. This time we will not use TF-IDF, because it reduces the importance of words that occur in many documents; in our case these would include frequently appearing actors, directors, screenplay writers and producers, whose importance we do not want to downscale.

We will therefore use CountVectorizer instead.

vect = CountVectorizer(stop_words='english')
vect_matrix = vect.fit_transform(df_movies['features'])
cosine_similarity_matrix_count_based = cosine_similarity(vect_matrix, vect_matrix)
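To see the difference between the two vectorizers, consider a toy corpus (made up for illustration): TF-IDF downweights a director's name that appears in many documents, while raw counts treat it the same everywhere:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["johnlasseter animation", "johnlasseter cars", "stevenspielberg war"]

counts = CountVectorizer().fit_transform(toy_docs).toarray()
weights = TfidfVectorizer().fit_transform(toy_docs).toarray()

# raw counts treat 'johnlasseter' like any other term,
# while TF-IDF gives it a lower weight than the rarer terms
print(counts[0])
print(weights[0].round(3))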

Example recommendations:

recommendations('Toy Story', df_movies, cosine_similarity_matrix_count_based, 10)

4252    Toy Story 3
5823    Toy Story That Time Forgot
785     Small Soldiers
5702    Hawaiian Vacation
1358    Toy Story 2
273     Pinocchio
1680    The Transformers: The Movie
833     Child’s Play
966     Toys
4836    Ted

The use of relative weights to control the importance of different metadata allows us to quickly build new recommenders focused on other aspects of the movies.

We can, for example, increase the weight of the director to recommend movies that are most likely directed by the same director as the input movie:

w_director = 100

df_movies['features'] = df_movies.apply(concatenate_features, axis=1)
# re-apply the text pre-processing to the regenerated feature
df_movies['features'] = df_movies.apply(lambda x: process_text(x.features), axis=1)

vect = CountVectorizer(stop_words='english')
vect_matrix = vect.fit_transform(df_movies['features'])
cosine_similarity_matrix_count_based = cosine_similarity(vect_matrix, vect_matrix)

recommendations('Toy Story', df_movies, cosine_similarity_matrix_count_based, 8)

4837    Tin Toy
4860    Knick Knack
3182    Luxo Jr.
1358    Toy Story 2
1012    A Bug’s Life
4532    Cars 2
5423    Mater and the Ghostlight
3268    Cars

A quick search shows that all of the recommended movies were directed by the director of Toy Story, John Lasseter.

Conclusion

In this article, we introduced several content-based recommender systems in Python, using the MovieLens dataset.

Recommender systems utilize large amounts of data about our interactions with items and try to find patterns that show which items are most popular with users similar to us, or which items are most similar to those we have interacted with in the past.

Besides the content-based method used in this article, recommenders often rely on the collaborative filtering approach, or on a combination of both, known as hybrid methods, which combine the two main approaches in a way that minimizes the drawbacks of each. Hybrid recommenders are the most common type of recommender found in online platforms today.