This post is part of a series of posts to analyse the digital me.

# Visualize and analyse IMDB ratings with R

In my previous post in this series I shared how I collect and store my personal ratings data from IMDB and enrich it with a few important variables about the movie.

In this post I will explore the data to learn about my movie preferences. I will be using some R packages to structure the data properly and other packages to build plots.

I’m sure my code can be optimised to use fewer packages or run faster in general. Feel free to reach out to me if you have suggestions on how to do this. I’m learning as I go.

## Load data.frame with IMDB rating data

I have loaded my data.frame with my ratings, IMDB ratings and general movie info to analyse. You can see how I collected this data in my previous post.

head(data_imdb)
##           id                       title year
## 1  tt0082138                 Carbon Copy 1981
## 2  tt0258068          The Quiet American 2002
## 3  tt0301976 The United States of Leland 2003
## 4  tt0977214                 Hero Wanted 2008
## 31 tt0498380       Letters from Iwo Jima 2006
##                                                                                                                                                                                                                     description
## 1                                               \n    When a rich white corporate executive finds out that he has an illegitimate black son, things start falling apart for him at home, at work and in his social circles.
## 2                                                                                                         \n    An older British reporter vies with a young U.S. doctor for the affections of a beautiful Vietnamese woman.
## 3                                                          \n    A young man's experience in a juvenile detention center that touches on the tumultuous changes that befall his family and the community in which he lives.
## 4  \n    After he awakens in a hospital, a man tracks down and murders the man that left him and a bank teller for dead during a robbery, only to end up having the slain thief's associates come after him in retaliation.
## 5                                                            \n    When a freak hurricane swamps Los Angeles, nature's deadliest killer rules sea, land, and air as thousands of sharks terrorize the waterlogged populace.
## 31                                                  \n    The story of the battle of Iwo Jima between the United States and Imperial Japan during World War II, as told from the perspective of the Japanese who fought it.
##    myRating                                                url  type
## 1         6 http://www.imdb.com/title/tt0082138/?ref_=rt_li_tt Movie
## 2         7 http://www.imdb.com/title/tt0258068/?ref_=rt_li_tt Movie
## 3         7 http://www.imdb.com/title/tt0301976/?ref_=rt_li_tt Movie
## 4         6 http://www.imdb.com/title/tt0977214/?ref_=rt_li_tt Movie
## 5         4 http://www.imdb.com/title/tt2724064/?ref_=rt_li_tt Movie
## 31        8 http://www.imdb.com/title/tt0498380/?ref_=rt_li_tt Movie
##    imdb_rating imdb_ratings imdb_director imdb_cast
## 1          5.6         2192          <NA>
## 2          7.1        25553          <NA>
## 3          7.1        21513          <NA>
## 4          5.7         6704          <NA>
## 5          3.3        41809          <NA>
## 31         7.9       143364          <NA>
##                              imdb_genres       date
## 1                           Comedy|Drama 2018-09-21
## 2     Drama|Mystery|Romance|Thriller|War 2018-09-21
## 3                                  Drama 2018-09-21
## 4            Action|Crime|Drama|Thriller 2018-09-21
## 31                     Drama|History|War 2018-09-23

Let’s first get some summary data to see what is in my data.frame.

#Number of movies in dataframe
nrow(data_imdb)
## [1] 14
# Average IMDB from all users on movies
round(mean(data_imdb$imdb_rating),2)
## [1] 6.35
#Average rating from me on the same movies
round(mean(data_imdb$myRating),2)
## [1] 6.21

What we can immediately see is that I’m not easy to please. My average rating (6.21) is lower than the average from other IMDB raters (6.35).

#arrange() and desc() come from the dplyr package
library(dplyr)
#My personal top 10 of movies I've rated
head(arrange(data_imdb[,c("title", "myRating")],desc(myRating)),10)
##                          title myRating
## 1        Letters from Iwo Jima        8
## 2           The Quiet American        7
## 3  The United States of Leland        7
## 4           Olympus Has Fallen        7
## 5                  Wild Things        7
## 6              Lethal Weapon 4        7
## 7            NCIS: New Orleans        7
## 8                  Carbon Copy        6
## 9                  Hero Wanted        6
## 10                Last Knights        6
#IMDB users top 10 of movies I've rated
head(arrange(data_imdb[,c("title", "imdb_rating")],desc(imdb_rating)),10)
##                          title imdb_rating
## 1        Letters from Iwo Jima         7.9
## 2                        Drive         7.8
## 3           The Quiet American         7.1
## 4  The United States of Leland         7.1
## 5            NCIS: New Orleans         6.8
## 6              Lethal Weapon 4         6.6
## 7           Olympus Has Fallen         6.5
## 8                  Wild Things         6.5
## 9                 Last Knights         6.2
## 10           London Has Fallen         5.9

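From the looks of the two top 10s, my taste isn’t totally aligned with the IMDB crowd either. Not all of my favorites are in the IMDB list: Carbon Copy and Hero Wanted make my top 10 but not theirs, while Drive and London Has Fallen only appear in the IMDB list.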
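The overlap between the two lists can also be checked in code. Below is a minimal sketch using base R set operations. The two title vectors are copied by hand from the outputs above instead of being taken from `data_imdb`, and the variable names are only illustrative.

```r
# Titles copied from the two top 10 outputs above (illustrative vectors)
my_top10 <- c("Letters from Iwo Jima", "The Quiet American",
              "The United States of Leland", "Olympus Has Fallen",
              "Wild Things", "Lethal Weapon 4", "NCIS: New Orleans",
              "Carbon Copy", "Hero Wanted", "Last Knights")
imdb_top10 <- c("Letters from Iwo Jima", "Drive", "The Quiet American",
                "The United States of Leland", "NCIS: New Orleans",
                "Lethal Weapon 4", "Olympus Has Fallen", "Wild Things",
                "Last Knights", "London Has Fallen")

# Titles that appear in both lists
shared <- intersect(my_top10, imdb_top10)

# Titles only in my list, and titles only in the IMDB list
only_mine <- setdiff(my_top10, imdb_top10)
only_imdb <- setdiff(imdb_top10, my_top10)

length(shared)
only_mine
only_imdb
```

With the real data the same comparison should work on the `title` column of the two sorted data frames.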

## Comparing my ratings to those of all IMDB users

Let’s dive into those differences a bit more. In order to do this I will create an additional column in my dataframe to show the differences in ratings per movie. I will use the data to show a top10 of biggest positive difference in ratings (I like them a lot more than IMDB users) and the other way around (movies I don’t really like that IMDB users do).

data_imdb$dif_rating <- data_imdb$myRating - data_imdb$imdb_rating
#Biggest differences I like a lot more than IMDB users
head(arrange(data_imdb[,c("title", "dif_rating")],desc(dif_rating)),10)
##                          title dif_rating
## 1                    Sharknado        0.7
## 2           Olympus Has Fallen        0.5
## 3                  Wild Things        0.5
## 4                  Carbon Copy        0.4
## 5              Lethal Weapon 4        0.4
## 6                  Hero Wanted        0.3
## 7            NCIS: New Orleans        0.2
## 8        Letters from Iwo Jima        0.1
## 9           The Quiet American       -0.1
## 10 The United States of Leland       -0.1
#Biggest differences IMDB users like a lot more than I do
tail(arrange(data_imdb[,c("title", "dif_rating")],desc(dif_rating)),10)
##                          title dif_rating
## 5              Lethal Weapon 4        0.4
## 6                  Hero Wanted        0.3
## 7            NCIS: New Orleans        0.2
## 8        Letters from Iwo Jima        0.1
## 9           The Quiet American       -0.1
## 10 The United States of Leland       -0.1
## 11                Last Knights       -0.2
## 12                  Skyscraper       -0.9
## 13                       Drive       -1.8
## 14           London Has Fallen       -1.9

Clearly I don’t get the appeal of the Thor movies, The Expendables 2 and 13, but I somehow thought Hobo with a Shotgun was a masterpiece when nobody else seemed to get that movie.

## Ratings per decade of movies

I want to learn more about my general preferences, not on an individual movie level, but grouped into segments. Am I a big fan of movies from a particular decade?

### Number of movies I watched from each decade

I’m visualising this data with the ggplot2 package, a simple way to make graphs and plots based on dataframes. There are lots of resources about this package to learn how it works.

library(ggplot2)
theme_data <- theme_light(base_size = 13) + theme(plot.margin=unit(c(0.5,1,1.5,1.2),"cm"))
color_bar <- "#999999"
color_fill <- "#5c85d6"
#spread of movie release decades
data_imdb$decade <- (data_imdb$year %/% 10) * 10
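ggplot(data = data_imdb, aes(x = decade))+
geom_bar(stat = 'count', fill=color_fill, color = color_bar) +
#decade is numeric, so use a continuous scale with one break per decade
scale_x_continuous(breaks = seq(1930, 2010, 10)) +
scale_y_continuous(expand = c(0, 0),breaks = seq(0, 325, 25)) +
#after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
geom_text(stat='count',aes(label=after_stat(count)),vjust=-0.5) +
#xlim and ylim each take a lower and an upper limit, not a full sequence
coord_cartesian(xlim = c(1925, 2015), ylim = c(0, 325)) +
ylab("Number of movies watched") +
theme_data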

Most of the movies I have reviewed are from the past 20 years. I still have to rate my first forties or fifties movie on IMDB…
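The bar chart is essentially a count of movies per decade. As a rough sketch of the underlying steps, integer division (`%/%`) turns a release year into its decade and `table()` counts the movies in each one. The years below are the five visible in the `head()` output earlier, used here as a stand-in for the full data.frame.

```r
# Release years taken from the head() output shown earlier (sample only)
years <- c(1981, 2002, 2003, 2008, 2006)

# Integer division drops the last digit, multiplying by 10 restores the decade
decades <- (years %/% 10) * 10

# Count how many movies fall into each decade
decade_counts <- table(decades)
decade_counts
```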

### My ratings of movies per decade

ggplot(data = data_imdb, aes(x = decade, y = myRating))+
#fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
geom_bar(stat = "summary", fun = "mean", fill=color_fill, color = color_bar) +
scale_x_continuous(breaks = seq(1930, 2010, 10)) +
scale_y_continuous(expand = c(0, 0), breaks = 0:10) +
stat_summary(aes(label=round(after_stat(y),2)), fun=mean, geom="text", vjust = -0.75) +
coord_cartesian(xlim = c(1925, 2015), ylim = c(0, 10)) +
ylab("Average rating") +
theme_data

### Difference in ratings between me and all IMDB users

ggplot(data = data_imdb, aes(x = decade, y = dif_rating))+
#fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
geom_bar(stat = "summary", fun = "mean", fill=color_fill, color = color_bar) +
scale_x_continuous(breaks = seq(1930, 2010, 10)) +
scale_y_continuous(expand = c(0, 0), breaks = -2:2) +
stat_summary(aes(label=round(after_stat(y),2)), fun=mean, geom="text", vjust = -0.75) +
coord_cartesian(xlim = c(1925, 2015), ylim = c(-2, 1)) +
ylab("Difference in average rating") +
theme_data

It turns out I’m not giving very high ratings in comparison to all IMDB users, regardless of the decade the movie is from. Based on my ratings per decade and the difference between me and IMDB users, it would make sense to watch more movies from the nineties and earlier. I’m really tough on movies from the past 20 years in comparison to other IMDB raters (-0.37 and -0.43).
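The averages behind these bars can also be computed as a plain table. Here is a small sketch with base R's `aggregate()`, using only the five movies whose year and ratings are visible in the `head()` output earlier. The data frame name `sample_imdb` is made up for this example, so the numbers will not match the figures quoted above for the full data.

```r
# Five movies from the head() output earlier (sample, not the full data)
sample_imdb <- data.frame(
  title = c("Carbon Copy", "The Quiet American",
            "The United States of Leland", "Hero Wanted",
            "Letters from Iwo Jima"),
  year = c(1981, 2002, 2003, 2008, 2006),
  myRating = c(6, 7, 7, 6, 8),
  imdb_rating = c(5.6, 7.1, 7.1, 5.7, 7.9)
)

# Same derived columns as in the post
sample_imdb$dif_rating <- sample_imdb$myRating - sample_imdb$imdb_rating
sample_imdb$decade <- (sample_imdb$year %/% 10) * 10

# Mean of my rating and of the rating difference, per decade
per_decade <- aggregate(cbind(myRating, dif_rating) ~ decade,
                        data = sample_imdb, FUN = mean)
per_decade
```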

On the other hand, I’m watching a lot of movies that were released in the past 20 years. Maybe I should just pick them more carefully. Let’s find out if there is a bigger range of ratings I give to movies from the past 20 years than other years.

### Distribution of my ratings per decade

I will be using a boxplot to visualise the distribution of my ratings per decade. If you are not sure how to read a boxplot, this is a great explanation about reading boxplots for you to check out.

ggplot(data_imdb, aes(x = decade, y = myRating, group = decade)) +
geom_boxplot() +
#the boxes show the spread of individual ratings, not an average
ylab("My rating") +
scale_x_continuous(breaks = seq(1930, 2010, 10)) +
coord_cartesian(xlim = c(1925, 2015)) +
theme_data

The boxplots can be a little oddly shaped, because I can only rate with whole values (only 7 or 8, not 7.5). A full quartile can consist of one value (i.e. 6 or 7). The median can also be the same value as the upper quartile or lower quartile.
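A quick way to see this effect is to compute the quartiles of a handful of whole-number ratings. The vector below is made up for illustration and is not taken from my data.

```r
# Hypothetical whole-star ratings for one decade (not real data)
ratings <- c(6, 6, 7, 7, 7)

# Lower quartile, median and upper quartile (base R default quantile type)
q <- quantile(ratings, probs = c(0.25, 0.5, 0.75))
q

# Check whether the median and the upper quartile coincide,
# which would leave the box without a visible upper half
median_equals_upper <- q[["50%"]] == q[["75%"]]
median_equals_upper
```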

In this boxplot there are not a lot of outliers for the past 20 years. This means I’m consistently watching movies that I rate 6 or 7 stars. It is a lot more positive for the eighties and nineties. I also give some movies from those decades fewer stars, but that isn’t as common. I guess I should be watching more movies older than 20 years.

In my next post I will analyse this dataframe further. I still have to learn more about my favorite actors, directors and genres. I hope to get some good recommendations out of this for movie night.