Working with the IMDB Dataset
This blog post will be showing you how to use IMDB’s publicly available dataset to download and query what movies you are looking for using Python.
IMDB’s public dataset can be accessed by the below link:
Each gz file has different tsv (tab separated value) files so you can download each one and look at the attributes and rows.
Documentation of the dataset can be found here:
For this blog post, I wanted to select movies from the IMDB dataset that were made from January 1st, 2000 to December 31st, 2021 inclusive.
Data Cleaning and Extracting
The steps to clean the data are as follows:
- Extract movies that are only classified as either ‘short’, ‘movie’ or, ‘tvMovie’
- Filter results from 1. to get movies that started in 2000–2021.
- Remove adult films.
- Replace ‘\\N’ string with NA values as these are considered empty and used as a placeholder.
- Drop some columns such as ‘isAdult’ and ‘endYear’.
Below is the python code to do these steps. (I had to first download the IMDB dataset and unzip it which will have some TSV files to work on)
# read in data downloaded from https://datasets.imdbws.com/
imdb_movies = pd.read_csv("title.basics.tsv", sep = '\t', low_memory=False)imdb_titles = imdb_movies[(imdb_movies['titleType'] == 'movie') | \
(imdb_movies['titleType'] == 'tvMovie') | \
(imdb_movies['titleType'] == 'short')]imdb_titles = imdb_titles[(imdb_titles['startYear'] >= '2000') & \
(imdb_titles['startYear'] <= '2021')]imdb_titles = imdb_titles[imdb_titles['isAdult'] == '0']imdb_titles = imdb_titles.replace(to_replace = '\\N', value = 'NA')imdb_titles = imdb_titles.drop(['isAdult', 'endYear'], axis = 1)
The dataset ‘imdb_titles’ below:
So we have over 950k movies that were released from 2000–2021 according to IMDB. We can see how long these movies were, what genres they fell under, title and year released.
Getting More Data: Title Ratings
Next, we’ll examine the title ratings of these movies by reading in the ‘title.ratings.tsv’ file.
title_ratings = pd.read_csv("title.ratings.tsv", sep = '\t', low_memory=False)
Here is a good time to introduce the pandasql package in python. This package lets you write SQL based queries using dataframes as if they were relational tables. Here I’ll be using it to do a left join using the imdb_titles and title_ratings datasets.
Most people go for inner joins when joining tables but doing so in this case will give you less rows. That is because not every movie exists in the ‘title_ratings’ table and for this purpose, I wanted to do a left join which returns the rows in the first table (imdb_titles) and the matching rows in the right table (title_ratings) and get the average rating and number of votes.
from pandasql imoprt sqldf
pysqldf = lambda q: sqldf(q, globals()) # using SQL in python# query to join the title table with the ratings table which has
# average IMDB rating and number of votes on the movieq = """SELECT t.*, r.averageRating, r.numVotes
FROM imdb_titles t
LEFT JOIN title_ratings r
ON t.tconst = r.tconst
"""# Run the SQL query and store the result as a dataframe
movie_and_ratings = pysqldf(q)# Some titles in their AverageRating and numVotes column have NaN values
# let's replace that with zeros
movie_and_ratings = movie_and_ratings.replace(np.nan, 0)
We now have a dataset of all IMDB movies from 2000–2021 that contains title, year released, runtime, genre, average IMDB rating and how many votes the movie got.
Even though IMDB offers other tsv files, the other ones I felt didn’t have attributes I was looking for like budget, main cast, summary of movie, rating, and number of awards.
My next blog post will try to use python’s IMDB data to get more of these attributes. This initial data cleaning helped to get a head start and get the movie ID’s which will come in handy in my next blog post.
Since it’s quite a large dataset to work on, looks like i’ll need more than 1 computer to make this work…
Code and dataset can be found here: