Data Analysis and Visualization: A quick look at movie ratings

Isaac Grove
3 min read · Feb 7, 2020


Source data can be found here.

The Premise

IMDb maintains a large dataset of film and TV show ratings. This project runs some basic exploratory analysis on that data to see whether any fun results turn up.

Data Cleaning

The dataset contains ~6 million titles and ~37 million people involved in making them, so our first task is to slim this down to something more manageable. Let’s keep movies only, exclude adult films, and drop any title with missing information. That leaves roughly 500,000 movies.
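
For the curious, here is a minimal sketch of what that filtering step could look like. It is not the exact code behind this post: it uses only Python's standard library, and the column names (titleType, isAdult) and the \N missing-value marker are assumptions based on IMDb's public tab-separated files. The sample titles are made up.

```python
import csv
import io

MISSING = r"\N"  # assumed missing-value marker in the TSV files


def clean_movies(rows):
    """Yield only non-adult movies that have no missing fields."""
    for row in rows:
        if row["titleType"] != "movie":
            continue  # drop TV shows, shorts, and so on
        if row["isAdult"] != "0":
            continue  # drop adult films
        if any(value in ("", MISSING) for value in row.values()):
            continue  # drop titles with missing information
        yield row


# Tiny inline sample standing in for the real file (hypothetical titles).
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tisAdult\tstartYear\tgenres\n"
    "tt1\tmovie\tExample A\t0\t1995\tDrama\n"
    "tt2\ttvSeries\tExample B\t0\t2001\tComedy\n"
    "tt3\tmovie\tExample C\t1\t1988\tAdult\n"
    "tt4\tmovie\tExample D\t0\t\\N\tHorror\n"
)

reader = csv.DictReader(
    io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE
)
cleaned = list(clean_movies(reader))
print([row["tconst"] for row in cleaned])
```

Reading row by row like this keeps memory use low, which matters for a multi-gigabyte file. Swapping the inline sample for an open file handle should be the only change needed.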

First insights

If we group the movies by decade and look at everything from 1940 to 2019, far more movies were produced in recent decades than in earlier ones: roughly as many in the last 20 years as in the preceding 60.

Might this result in lower overall ratings? If anyone with an iPhone can make a “movie” these days, perhaps overall quality has dropped. Let’s look at the data:

Ratings are not suffering with the recent movie boom.

The “movie boom” hypothesis doesn’t hold up: ratings have actually improved in recent years, and (other forms of spoilage notwithstanding) it seems that Gen X grew up with the worst movies.
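
A rough sketch of the per-decade summary, assuming each movie has been reduced to a (year, rating) pair. The function name and the sample values are made up for illustration, and the plain unweighted mean is an assumption: weighting by vote count would be a reasonable alternative.

```python
from collections import defaultdict


def ratings_by_decade(movies, first=1940, last=2019):
    """Return {decade: (count, mean rating)} for movies released in [first, last]."""
    buckets = defaultdict(list)
    for year, rating in movies:
        if first <= year <= last:
            # 1978 -> 1970, 2015 -> 2010
            buckets[year // 10 * 10].append(rating)
    return {
        decade: (len(ratings), round(sum(ratings) / len(ratings), 2))
        for decade, ratings in sorted(buckets.items())
    }


# Made-up (year, rating) pairs, for illustration only.
sample = [(1975, 5.0), (1978, 6.0), (2004, 6.5), (2012, 7.0), (2015, 6.0), (1932, 8.0)]

summary = ratings_by_decade(sample)
for decade, (count, mean) in summary.items():
    print(f"{decade}s: {count} movies, mean rating {mean}")
```

The same dictionary gives both the movie count and the mean rating per decade, so it can feed both charts.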

My new career

Given that the current movie climate seems ripe for high ratings, I decided to become a full-time movie producer.

*crickets*

And since statistics is the quickest way to win the hearts of the masses, I wondered how I might engineer a high IMDb score. My first thought was that I should pick the easiest genre.

Documentaries it is! And definitely not horror movies. You might also note that Family and Animation films aren’t what they used to be, losing an average of 0.5 and 0.6 points, respectively, since the 1970s. Tsk tsk, society.
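
One way to get at a genre-by-decade comparison like this is sketched below. It assumes each movie carries a comma-separated genre string (as in IMDb's public files, to my knowledge) and counts a multi-genre movie once under each of its genres, which is a choice on my part. The function name and the sample ratings are invented and are not the real figures.

```python
from collections import defaultdict


def mean_rating_by_genre_and_decade(movies):
    """Return {(genre, decade): mean rating}; a multi-genre movie counts once per genre."""
    buckets = defaultdict(list)
    for year, genres, rating in movies:
        decade = year // 10 * 10
        for genre in genres.split(","):
            buckets[(genre, decade)].append(rating)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Made-up (year, genres, rating) rows, for illustration only.
sample = [
    (1974, "Family,Animation", 7.0),
    (1977, "Family", 6.0),
    (2015, "Family", 6.0),
    (2018, "Animation,Family", 5.0),
    (2016, "Horror", 4.5),
]

means = mean_rating_by_genre_and_decade(sample)
family_change = means[("Family", 2010)] - means[("Family", 1970)]
print(f"Family, 1970s to 2010s: {family_change:+.1f} points")
```

Sorting the result by mean rating within a single decade gives the "easiest genre" ranking. Subtracting one decade from another gives the drift over time.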

My second (and last) consideration was to see whether hiring a prolific director would increase my chances of doing well in the ratings. I compiled a list of the 20 most prolific directors in the dataset, assuming that practice would make perfect and that their performance would exceed everyone else’s. Disappointingly…

We conclude that directing a ton of movies (the Top 20 averaged over 100 films each) does not necessarily improve their quality. I’ll need to rethink my hiring strategy for my upcoming documentary.
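
Here is a hedged sketch of how such a comparison might be set up. It assumes the data has been flattened into (director, rating) pairs, one per film, and it compares pooled film ratings for the most prolific directors against everyone else. Pooling rather than averaging per director is my assumption, and the names and ratings are placeholders.

```python
from collections import defaultdict


def _mean(values):
    """Mean of a list, or NaN if the list is empty."""
    return sum(values) / len(values) if values else float("nan")


def compare_prolific_directors(credits, top_n=20):
    """Return (mean rating of films by the top_n most prolific directors,
    mean rating of films by all other directors)."""
    films = defaultdict(list)
    for director, rating in credits:
        films[director].append(rating)

    # Rank directors by number of films, most prolific first.
    ranked = sorted(films, key=lambda d: len(films[d]), reverse=True)
    top = set(ranked[:top_n])

    top_ratings = [r for d in top for r in films[d]]
    other_ratings = [r for d in films if d not in top for r in films[d]]
    return _mean(top_ratings), _mean(other_ratings)


# Placeholder (director, rating) pairs, for illustration only.
sample = [
    ("Director A", 5.0), ("Director A", 6.0), ("Director A", 7.0),
    ("Director B", 8.0),
    ("Director C", 6.0),
]

top_mean, other_mean = compare_prolific_directors(sample, top_n=1)
print(f"Most prolific: {top_mean:.1f}, everyone else: {other_mean:.1f}")
```

With the real data, top_n=20 would reproduce the "Top 20" cut used here.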

CUT!

I hope you’ve enjoyed this quick exercise in data analysis and visualization. I certainly did. It’s empowering to know that we can take a large (3 GB) set of data and extract useful, easy-to-digest insights from it. This project also happens to be my first foray into Medium, or blogging of any kind. Hello World! 😄
