Visualize and analyse IMDB ratings with R (part 2)
This post is part of a series of posts to analyse the digital me.
In my previous post about analysing IMDB ratings in this series I explored some of the data I collected about my movie preferences and compared my ratings to those of other IMDB raters.
In this post I will dig a little deeper to learn more about my own personal movie preferences.
Does popularity of a movie impact my rating?
Some movies are very mainstream, others are more niche or an “acquired” taste. The popularity of a movie can be determined in IMDB by the number of ratings a movie has received. Let’s first see how the number of IMDB ratings relates to the average IMDB rating. We can do that by building a scatterplot.
library('ggplot2')
library('dplyr')
theme_data <- theme_light(base_size = 13) + theme(plot.margin=unit(c(0.5,1,1.5,1.2),"cm"))
color_bar <- "#999999"
color_fill <- "#5c85d6"
# number of IMDB ratings vs. average IMDB rating
ggplot(data_imdb, aes(x = imdb_ratings, y = imdb_rating)) +
geom_point() +
# the average rating is continuous, so use a continuous scale with breaks at 3 to 10
scale_y_continuous(breaks = 3:10) +
ylab("IMDB average rating") +
xlab("Number of IMDB ratings") +
theme_data
In this scatterplot you can see that most of the movies are clustered around a low number of ratings and receive average ratings between 6 and 7. It’s interesting to see that the movies with the most IMDB ratings are all highly rated. Let’s check out which movies these are.
head(arrange(data_imdb[,c("title", "imdb_ratings", "imdb_rating")],desc(imdb_ratings)),10)
## title imdb_ratings imdb_rating
## 1 The Shawshank Redemption 2000322 9.3
## 2 The Dark Knight 1969523 9.0
## 3 Inception 1750193 8.8
## 4 Fight Club 1600985 8.8
## 5 Pulp Fiction 1565763 8.9
## 6 Forrest Gump 1522234 8.8
## 7 The Lord of the Rings: The Fellowship of the Ring 1441273 8.8
## 8 The Matrix 1435239 8.7
## 9 The Lord of the Rings: The Return of the King 1424377 8.9
## 10 The Godfather 1370383 9.2
So The Shawshank Redemption and The Dark Knight are the movies with the most IMDB ratings among those I have rated myself. And they are both highly rated.
Understanding how the number of ratings relates to my rating
The strength of this relationship can be measured with a correlation coefficient.
cor(as.numeric(data_imdb$imdb_ratings), as.numeric(data_imdb$myRating))
## [1] 0.4478298
As you can see, the correlation coefficient is 0.4478298. This is not a probability or a p-value: it measures the strength and direction of the linear relationship on a scale from -1 to 1. A value of about 0.45 indicates a moderate positive correlation, so movies with more IMDB ratings tend to get a higher rating from me. It does not show that the number of ratings itself causes my higher ratings.
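To get an actual p-value for this relationship, base R offers cor.test(). The sketch below is only an illustration on made-up numbers: n_ratings and my_rating are hypothetical stand-ins for data_imdb$imdb_ratings and data_imdb$myRating, so its output says nothing about my real data.

```r
# Hypothetical toy data standing in for data_imdb$imdb_ratings and data_imdb$myRating
n_ratings <- c(1200, 5400, 23000, 87000, 150000, 420000, 910000, 1500000)
my_rating <- c(5, 6, 6, 7, 6, 8, 8, 9)

# cor.test() reports the Pearson correlation together with a p-value
result <- cor.test(n_ratings, my_rating, method = "pearson")

# Correlation coefficient, the same value cor(n_ratings, my_rating) returns
result$estimate

# p-value: how likely a correlation at least this strong would be
# if the true correlation were zero
result$p.value
```

A small p-value would suggest the correlation is unlikely to be a coincidence, while the coefficient itself still describes how strong the relationship is.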
My favorite actors
I always think I know which actors I like and put their movies on the top of my watchlist. Let’s check out if I actually give the movies of my favorite actors the highest scores too.
For each movie I rate, there are a maximum of 15 actors added to the dataset. I have added them all to one column, pipe separated. I will use the data.table and stringr packages to split the data in the cast column and learn more about the actors in the movies I reviewed. I will create a new dataframe with a count of the actors (unique_actor_count) and find out how many actors are in this dataframe (unique_actors). After that, I will create a new dataframe with my top 10 most watched actors and visualise it in a barchart for easy interpretation.
library('data.table')
library('stringr')
as.character(data_imdb$imdb_cast) -> data_imdb$imdb_cast
data.frame(table(unlist(strsplit(data_imdb$imdb_cast, "[|]")))) -> unique_actor_count
nrow(unique_actor_count) -> unique_actors
head(arrange(unique_actor_count,desc(Freq)),10) -> unique_actor_count_top10
ggplot(data = unique_actor_count_top10, aes(x = Var1, y = Freq)) +
geom_bar(stat = "identity", fill=color_fill, color = color_bar)+
coord_flip() +
geom_text(aes(label = Freq), position=position_dodge(width=1), hjust = -0.5) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of movies watched featuring actor") +
xlab("") +
theme_data
It turns out more than 8000 actors have made it to my dataframe and the one I have watched most often is Samuel L. Jackson. I’m surprised to see only male actors made it to my top 10. Not sure how that happened…
Average rating of my most popular actors
Next, I want to know if my most popular actors are in movies that I rate highly. I’ll stick to my top 100 most popular actors and visualise these in a scatterplot. This means I will select the top 100 actors, calculate my average rating for the movies each of them appears in, merge that with the actor counts and create a scatterplot where I label the points with the actor names using the ggrepel package.
library('ggrepel')
head(arrange(unique_actor_count,desc(Freq)),100) -> unique_actor_count_top100
unique_actor_count_top100$Var1 -> actor_group
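df_mean_actor <- data.frame()
for(v in actor_group){
# match whole names only; str_detect(imdb_cast, v) would read the name as a
# regular expression and would also match partial names
data_imdb %>%
filter(vapply(strsplit(imdb_cast, "[|]"), function(cast) v %in% cast, logical(1))) %>%
select(myRating) -> df9
# my average rating for all movies featuring this actor
round(mean(df9$myRating),2) ->mean_actor
df10 <- data.table("actor" = v, "mean" = mean_actor)
df_mean_actor<-rbind(df_mean_actor,df10)
}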
actor_overview = merge(unique_actor_count_top100, df_mean_actor, by.x=c("Var1"), by.y=c("actor"))
ggplot(data = actor_overview, aes(x = mean, y = Freq)) +
geom_point() +
geom_text_repel(aes(label=ifelse(mean>7.5,as.character(Var1),'')), color = "forestgreen", box.padding = unit(0.8, "lines")) +
geom_text_repel(aes(label=ifelse(mean<5.8,as.character(Var1),'')), color = "red", box.padding = unit(2.2, "lines")) +
ylab("Number of movies watched featuring actor") +
xlab("Average rating") +
theme_data
This visualisation gives me some clear insights:
- Nicolas Cage and Sylvester Stallone movies should be blacklisted for now. I watched several of them, but I rate them poorly.
- I give Al Pacino movies high ratings and I have seen a lot of his movies. Safe to watch his other movies too.
- Michael Caine is the hidden gem. I have hardly watched his movies, but I loved them. Time to look for some more of his movies!
- I haven’t watched many movies with Jason Biggs or Ray Stevenson and I shouldn’t start any time soon :)
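The count and the average rating per actor can also be computed in one pass, without looping over the actors. The following is a minimal sketch in base R on made-up data: the toy dataframe and the actor names are hypothetical, and only the column names imdb_cast and myRating are borrowed from my dataset.

```r
# Hypothetical toy data with the same layout as data_imdb:
# one row per movie, cast stored as a pipe-separated string
toy <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  imdb_cast = c("Actor One|Actor Two", "Actor Two|Actor Three", "Actor One|Actor Two"),
  myRating = c(8, 6, 7),
  stringsAsFactors = FALSE
)

# Split the cast column: one character vector of actors per movie
cast_list <- strsplit(toy$imdb_cast, "|", fixed = TRUE)

# Long format: one row per (actor, movie) pair, repeating the movie's rating
long <- data.frame(
  actor = unlist(cast_list),
  myRating = rep(toy$myRating, lengths(cast_list)),
  stringsAsFactors = FALSE
)

# Number of movies and mean rating per actor (both ordered by actor name)
actor_n <- tapply(long$myRating, long$actor, length)
actor_mean <- tapply(long$myRating, long$actor, mean)

actor_summary <- data.frame(
  actor = names(actor_mean),
  Freq = as.vector(actor_n),
  mean = round(as.vector(actor_mean), 2),
  stringsAsFactors = FALSE
)
actor_summary
```

Because every actor name is compared as a whole string after the split, this approach avoids partial-name matches and gives counts and averages that are based on exactly the same rows.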
My favorite movie genres
Just like my favorite actors, I can also learn more about my favorite genres. The data is set up in the same way, so for now I will go straight to the visualisations.
as.character(data_imdb$imdb_genres) -> data_imdb$imdb_genres
data.frame(table(unlist(strsplit(data_imdb$imdb_genres, "[|]")))) -> unique_genres_count
c("Addcontentadvisoryforparents", "MPAA", "Seeallcertifications", "Viewcontentadvisory") -> genre_dirt
filter(unique_genres_count, !grepl(paste(genre_dirt, collapse="|"), Var1)) -> unique_genres_count
nrow(unique_genres_count)
## [1] 22
Just like actors, it is possible for a movie to have multiple genres. I have made them unique and can now see that I only have 22 genres in the dataset.
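Because several genres share one pipe-separated string, it is worth checking whether a genre is matched on its whole name or only on a substring: a name such as "Music" is also contained in "Musical". A small sketch on made-up strings (the genres vector is hypothetical) shows the difference:

```r
# Hypothetical genre strings in the same pipe-separated layout as data_imdb$imdb_genres
genres <- c("Drama|Music", "Comedy|Musical", "Action|Sci-Fi")

# Substring match: "Music" also hits the row that only contains "Musical"
substring_hit <- grepl("Music", genres, fixed = TRUE)

# Exact match: split each string first, then compare whole genre names
exact_hit <- vapply(
  strsplit(genres, "|", fixed = TRUE),
  function(g) "Music" %in% g,
  logical(1)
)

data.frame(genres, substring_hit, exact_hit)
```

Averages per genre are only comparable with the genre counts when both are based on the same kind of match.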
Let’s see what the average rating is per genre and the number of movies I have watched by genre.
head(arrange(unique_genres_count,desc(Freq)),10) -> unique_genres_count_top10
unique_genres_count$Var1 -> genre_group
df_mean_genre <- data.frame()
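for(v in genre_group){
# match whole genre names only, so that e.g. "Music" does not also match "Musical"
data_imdb %>%
filter(vapply(strsplit(imdb_genres, "[|]"), function(genres) v %in% genres, logical(1))) %>%
select(myRating) -> df7
# my average rating for all movies in this genre
round(mean(df7$myRating),2) ->mean_genre
df8 <- data.table("genre" = v, "mean" = mean_genre)
df_mean_genre<-rbind(df_mean_genre,df8)
}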
genre_overview = merge(unique_genres_count, df_mean_genre, by.x=c("Var1"), by.y=c("genre"))
ggplot(data = genre_overview, aes(x = Var1, y = mean, fill = Freq)) +
geom_bar(stat = "identity")+
geom_text(aes(label = mean), size = 3, position=position_dodge(width=1), vjust = -1) +
scale_y_continuous(expand = c(0, 0),breaks = seq(0, 10, 1), limits = c(0,8)) +
ylab("Average rating") +
xlab("Genre") +
theme_data +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
I really enjoy a great animated movie (it brings out the child in me), but I hardly watch them. Horror is not my preferred genre and that shows in the number of horror movies I have seen and how I rate them.