# Visualize and analyse IMDB ratings with R (part 2)

This post is part of a series of posts to analyse the digital me.

In my previous post in this series about analysing IMDB ratings, I explored some of the data I collected about my movie preferences and compared my ratings with those of other IMDB raters.

In this post I will dig a little deeper to learn more about my own personal movie preferences.

## Does popularity of a movie impact my rating?

Some movies are very mainstream, others are more niche or an “acquired” taste. On IMDB, the number of ratings a movie has received is a reasonable proxy for its popularity. Let’s first see how the number of IMDB ratings relates to the average IMDB rating. We can do that by building a scatterplot.

library(ggplot2) # plotting
library(dplyr)   # arrange(), filter() and %>% used further down

theme_data <- theme_light(base_size = 13) + theme(plot.margin=unit(c(0.5,1,1.5,1.2),"cm"))
color_bar <- "#999999"
color_fill <- "#5c85d6"
ggplot(data_imdb, aes(x = imdb_ratings, y = imdb_rating)) +
geom_point() +
# assuming imdb_rating is numeric: continuous scale with one break per whole rating
scale_y_continuous(breaks = 3:10) +
ylab("IMDB average rating") +
xlab("Number of IMDB ratings") +
theme_data

In this scatterplot you can see that most of the movies are clustered around a low number of ratings and receive average ratings between 6 and 7. It’s interesting to see that the movies with the most IMDB ratings are all highly rated. Let’s check out which movies these are.

head(arrange(data_imdb[,c("title", "imdb_ratings", "imdb_rating")],desc(imdb_ratings)),10)
##                                                title imdb_ratings
## 1                           The Shawshank Redemption      2000322
## 2                                    The Dark Knight      1969523
## 3                                          Inception      1750193
## 4                                         Fight Club      1600985
## 5                                       Pulp Fiction      1565763
## 6                                       Forrest Gump      1522234
## 7  The Lord of the Rings: The Fellowship of the Ring      1441273
## 8                                         The Matrix      1435239
## 9      The Lord of the Rings: The Return of the King      1424377
## 10                                     The Godfather      1370383
##    imdb_rating
## 1          9.3
## 2          9.0
## 3          8.8
## 4          8.8
## 5          8.9
## 6          8.8
## 7          8.8
## 8          8.7
## 9          8.9
## 10         9.2

So The Shawshank Redemption and The Dark Knight are the movies with the most IMDB ratings among those I have rated myself. And they are both highly rated.

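### Understanding how the number of ratings relates to my rating

The strength of the relationship can be measured with a correlation coefficient, which `cor()` calculates (Pearson's r by default).

cor(as.numeric(data_imdb$imdb_ratings), as.numeric(data_imdb$myRating))
## [1] 0.4478298

As you can see, `cor()` returns 0.4478298. Note that this is a correlation coefficient, not a p-value: it measures the strength and direction of the linear relationship on a scale from -1 to 1. A value of about 0.45 indicates a moderate positive correlation, so movies with more IMDB ratings tend to get a higher rating from me. Squaring it gives roughly 0.20, which means the number of ratings accounts for about 20% of the variance in my ratings. A p-value would have to come from a significance test such as `cor.test()`, and even then a correlation does not show that popularity causes my higher ratings.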
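To attach a p-value to a correlation, base R's `cor.test()` reports the coefficient together with the p-value for the null hypothesis that the true correlation is zero. Below is a minimal sketch on made-up data; the `toy` data frame and its columns `n_ratings` and `my_rating` are hypothetical stand-ins for `data_imdb$imdb_ratings` and `data_imdb$myRating`, not my real data.

```r
# Hypothetical toy data: 50 movies with a rating count and my own 1-10 rating
set.seed(42)
toy <- data.frame(n_ratings = round(runif(50, min = 1e3, max = 2e6)))
# My rating loosely increases with the rating count, plus noise, clamped to 1-10
toy$my_rating <- pmin(10, pmax(1, round(5 + toy$n_ratings / 4e5 + rnorm(50))))

# Pearson correlation test: H0 is "the true correlation is 0"
res <- cor.test(toy$n_ratings, toy$my_rating)

res$estimate  # correlation coefficient, same value cor() returns
res$p.value   # probability of a correlation at least this strong if H0 were true
res$conf.int  # 95% confidence interval for the coefficient
```

On the real data the call would presumably be `cor.test(as.numeric(data_imdb$imdb_ratings), as.numeric(data_imdb$myRating))`. A small p-value only says the correlation is unlikely to be zero; the coefficient itself still tells you how strong the relationship is.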

## My favorite actors

I always think I know which actors I like, and I put their movies at the top of my watchlist. Let’s check whether I actually give the movies featuring my favorite actors the highest scores too.

For each movie I rate, a maximum of 15 actors are added to the dataset. I have added them all to one column, pipe separated. I will use the data.table and stringr packages to split the data in the cast column and learn more about the actors in the movies I reviewed. I will create a new dataframe with a count of movies per actor (unique_actor_count) and find out how many actors are in this dataframe (unique_actors). After that, I will create a new dataframe with my top 10 most watched actors and visualise it in a barchart for easy interpretation.
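To make the split-and-count step concrete, here is a small sketch on a made-up cast column. The vector `toy_cast` and the actor names in it are hypothetical; only `strsplit()`, `unlist()` and `table()` from base R are used.

```r
# Hypothetical pipe-separated cast column for three movies
toy_cast <- c("Actor A|Actor B", "Actor A|Actor C", "Actor B|Actor A")

# "[|]" is a regex character class matching a literal pipe
# (a bare "|" would be read as regex alternation)
cast_list <- strsplit(toy_cast, "[|]")

# Flatten to one long vector of names and count how often each appears
actor_count <- data.frame(table(unlist(cast_list)))

# Sort by frequency, most watched first
actor_count <- actor_count[order(-actor_count$Freq), ]
actor_count
nrow(actor_count)  # number of distinct actors
```

`data.frame(table(...))` names its columns `Var1` and `Freq`, which is why those names show up in the plotting code.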

library('data.table')
library('stringr')
data.frame(table(unlist(strsplit(data_imdb$imdb_cast, "[|]")))) -> unique_actor_count
nrow(unique_actor_count) -> unique_actors
head(arrange(unique_actor_count,desc(Freq)),10) -> unique_actor_count_top10

ggplot(data = unique_actor_count_top10, aes(x = Var1, y = Freq)) +
geom_bar(stat = "identity", fill=color_fill, color = color_bar)+
coord_flip() +
geom_text(aes(label = Freq), position=position_dodge(width=1), hjust = -0.5) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of movies watched featuring actor") +
xlab("") +
theme_data

It turns out more than 8000 actors have made it to my dataframe, and the one I have watched most often is Samuel L. Jackson. I’m surprised to see that only male actors made it to my top 10. Not sure how that happened…

## My favorite movie genres

Just like my favorite actors, I can also learn more about my favorite genres. The data is set up in the same way, so for now I will go straight to the visualisations.

data.frame(table(unlist(strsplit(data_imdb$imdb_genres, "[|]")))) -> unique_genres_count
filter(unique_genres_count, !grepl(paste(genre_dirt, collapse="|"), Var1)) -> unique_genres_count
nrow(unique_genres_count)
## [1] 22

Just like with actors, a movie can have multiple genres. After splitting the genre column and counting each genre once, I can see that there are only 22 genres in the dataset.

Let’s see what the average rating is per genre and the number of movies I have watched by genre.

head(arrange(unique_genres_count,desc(Freq)),10) -> unique_genres_count_top10

unique_genres_count$Var1 -> genre_group
df_mean_genre <- data.frame()
for(v in genre_group){
  data_imdb %>% filter(str_detect(imdb_genres, v)) %>% select(myRating) -> df7
  round(mean(df7$myRating),2) ->mean_genre
  df8 <- data.table("genre" = v, "mean" = mean_genre)
  df_mean_genre<-rbind(df_mean_genre,df8)
}

genre_overview = merge(unique_genres_count, df_mean_genre, by.x=c("Var1"), by.y=c("genre"))
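The same per-genre overview can also be built without a loop, by first reshaping the data to one row per movie and genre. This is a sketch of an alternative approach on made-up data, not the method used above; the `toy` data frame and its values are hypothetical, and only base R functions are used.

```r
# Hypothetical data: three movies with pipe-separated genres and my rating
toy <- data.frame(
  imdb_genres = c("Drama|Crime", "Drama", "Comedy|Crime"),
  myRating    = c(9, 7, 5),
  stringsAsFactors = FALSE
)

# Split the genres, then repeat each rating once per genre of that movie
genre_list <- strsplit(toy$imdb_genres, "[|]")
long <- data.frame(
  genre    = unlist(genre_list),
  myRating = rep(toy$myRating, lengths(genre_list)),
  stringsAsFactors = FALSE
)

# Mean rating per genre, rounded to two decimals like in the loop
genre_summary <- aggregate(myRating ~ genre, data = long, FUN = mean)
genre_summary$myRating <- round(genre_summary$myRating, 2)

# Number of movies per genre, looked up by genre name
genre_summary$Freq <- as.vector(table(long$genre)[genre_summary$genre])
genre_summary
```

Because each genre is matched as a whole value after splitting, this avoids the partial matches that a pattern search with `str_detect()` could in principle produce if one genre name were contained in another.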

ggplot(data = genre_overview, aes(x = Var1, y = mean, fill = Freq)) +
geom_bar(stat = "identity")+
geom_text(aes(label = mean), size = 3, position=position_dodge(width=1), vjust = -1) +
scale_y_continuous(expand = c(0, 0),breaks = seq(0, 10, 1)) +
ylab("Average rating") +
xlab("Genre") +
theme_data +
coord_cartesian(ylim = c(0, 8)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))

I really enjoy a great animated movie (it brings out the child in me), but I hardly watch them. Horror is not my preferred genre, and that shows both in the number of horror movies I have seen and in how I rate them.