Movie Recommendation System - Getting Started

Recommendation systems aim to improve the quality of search results by suggesting items that are more relevant to a user's search history. They help us understand what a user might prefer. In this tutorial, we will build an application that suggests which movie a user should watch.

Let’s get started!


In this tutorial, we will use the TMDB 5000 Movie Dataset, which can be found here. The dataset consists of two CSV files, a credits file and a movies file, which we load using the following code. We then join the two dataframes on their ‘id’ column.

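import pandas as pd
import numpy as np

# Load the credits and movies files of the TMDB 5000 dataset
df1 = pd.read_csv('tmdb_5000_credits.csv')
df2 = pd.read_csv('tmdb_5000_movies.csv')

# Rename the credits columns so that the join key is called 'id'.
# The credits title column is named 'tittle' here, which keeps it from
# clashing with the 'title' column of df2 during the merge.
df1.columns = ['id', 'tittle', 'cast', 'crew']
df2 = df2.merge(df1, on='id')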
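
To make the join step concrete, here is a minimal sketch on toy data. The rows are made up, and the column order of the credits frame (movie_id, title, cast, crew) is an assumption about the credits file. It shows that the default merge is an inner join and that renaming the credits title column avoids a name clash.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical rows, not real TMDB data).
credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "cast": ["[]", "[]", "[]"],
    "crew": ["[]", "[]", "[]"],
})
movies = pd.DataFrame({
    "id": [1, 2, 4],
    "title": ["A", "B", "D"],
    "vote_count": [10, 200, 5],
})

# Rename so the key matches and the credits title does not clash with movies["title"].
credits.columns = ["id", "tittle", "cast", "crew"]

# merge defaults to how="inner": only ids present in both frames survive.
merged = movies.merge(credits, on="id")
print(merged.columns.tolist())
print(len(merged))
```

Without the rename, pandas would add suffixes to the two clashing title columns (title_x and title_y), and a later lookup of 'title' would fail.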

Next, we need a metric to judge which movies are better than others. One option is to use the average rating of each movie given in the dataset directly. But that would not be fair, because the number of voters varies widely from movie to movie: a movie with a handful of very high ratings would outrank a movie with thousands of slightly lower ones.

Hence, we will use IMDB's weighted rating (WR), which is defined as follows:

$$WR = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

In the above formula, we have,

v – Number of votes for the movie
m – Minimum votes required to be listed
R – Average rating of the movie
C – Mean vote across the whole dataset
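
As a quick illustration of how the weighted rating behaves, the sketch below evaluates it for two hypothetical movies that share the same average rating but differ in vote count. The values of m and C are made up for the example and are not taken from the dataset.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own average R with the overall mean C, weighted by votes."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values chosen only for illustration.
m, C = 1000, 6.0

few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)
many_votes = weighted_rating(v=100_000, R=9.0, m=m, C=C)

# A movie with exactly m votes lands halfway between R and C.
at_threshold = weighted_rating(v=m, R=9.0, m=m, C=C)

print(round(few_votes, 2), round(many_votes, 2), at_threshold)
```

With few votes the result stays close to the overall mean C, while with many votes it approaches the movie's own average R.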

Let’s compute these values using the code below. First we compute C, the mean of the average votes across all movies. Then we compute m, the minimum number of votes required, as the 90th percentile of the vote counts. In other words, a movie qualifies only if it has at least as many votes as 90% of the movies in the dataset.

# C: mean of the average votes across all movies
C = df2['vote_average'].mean()
print("Mean Average Voting : ", C)

# m: 90th percentile of the vote counts
m = df2['vote_count'].quantile(0.9)
print("\nTaking the movies whose vote count is at or above the 90th percentile")
print("Minimum votes required : ", m)
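
To see what the 90th percentile cutoff does, here is a small sketch on a made-up series of ten vote counts (not real data). Note that pandas interpolates linearly between neighbouring values by default, so the threshold need not be one of the observed counts.

```python
import pandas as pd

# Ten hypothetical vote counts, one of them much larger than the rest.
votes = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 1000])

# 90th percentile, using pandas' default linear interpolation.
m_demo = votes.quantile(0.9)

# Only entries at or above the threshold qualify.
qualified = votes[votes >= m_demo]

print(m_demo)
print(qualified.tolist())
```

Here only the single entry with 1000 votes qualifies, which matches the idea of keeping roughly the top 10% of movies by vote count.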

Now, let us keep only the movies that meet this minimum vote count, using the code snippet below.

# Filter first, then copy, so that q_movies is an independent dataframe
q_movies = df2.loc[df2['vote_count'] >= m].copy()

But we still haven’t computed the metric for each movie that qualified. We will define a function, weighted_rating, that implements the formula above, and apply it to every qualified movie to create a new feature called score.

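def weighted_rating(x, m=m, C=C):
    # x is one row of the dataframe (one movie)
    v = x['vote_count']
    R = x['vote_average']
    # IMDB weighted rating formula
    return (v/(v+m) * R) + (m/(m+v) * C)

# Apply the function row by row to create the score column
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)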

Finally, let’s sort the qualified movies by the score column, in descending order, so that the most recommended movies come first.

q_movies = q_movies.sort_values('score', ascending=False)
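
The scoring and sorting steps can also be tried out on a small toy dataframe. The sketch below uses hypothetical movies and made-up values for m and C, and computes the score with a vectorized column expression instead of a row-wise apply. This is an alternative way of writing the same formula, not the method used in this tutorial.

```python
import pandas as pd

# Hypothetical toy data standing in for q_movies.
demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "vote_count": [2000, 5000, 12000],
    "vote_average": [8.5, 7.9, 8.1],
})
m_demo, C_demo = 1800, 6.1  # illustrative values, not computed from the real dataset

# Vectorized form of the weighted rating: operates on whole columns at once.
v = demo["vote_count"]
R = demo["vote_average"]
demo["score"] = (v / (v + m_demo)) * R + (m_demo / (v + m_demo)) * C_demo

# Highest score first.
top = demo.sort_values("score", ascending=False)
print(top[["title", "vote_count", "vote_average", "score"]])
```

In this toy example, movie A has the highest raw average but the fewest votes, so it ends up last once the score is applied.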

Let’s also visualize the most popular movies in the whole dataset using the code below. Note that this plot ranks the movies by the dataset's own popularity column, not by the score we computed above.

pop= df2.sort_values('popularity', ascending=False)

import matplotlib.pyplot as plt
plt.figure(figsize=(12,4),facecolor="w")

plt.barh(pop['title'].head(10),pop['popularity'].head(10), 
         align='center',color='pink')
plt.gca().invert_yaxis()

plt.xlabel("Popularity Metric")
plt.title("Name of the most Popular Movies")
plt.show()

The plot shows the top 10 movies by popularity, and we can see that Minions is the most popular of them.