This post is part of a series of posts to analyse the digital me.

Getting my IMDB ratings with R and Rvest

I’m a big fan of IMDB and have been for many years. It is a fantastic website with a lot of information about movies, documentaries and tv-series. Besides fact-checking (“who was that guy playing in that one movie?”) it is also a place for me to collect all the ratings I have done on movies in the past few years. The star rating system is pretty effective to see if I already watched a certain movie and what I thought of it.

Getting my IMDB ratings into a dataframe with R

My IMDB ratings are protected by my login, which makes it a bit more complated to get to my reviewed pages. I made them public for all to see by changing a setting. You can see my ratings on this user list page. Unfortunately there isn’t a way to pull in my ratings and movie details through an API. This means I have to scrape the data from the IMDB website and structure it myself. I’m using R packages RVest and Magrittr for this. To clean and structure the data I use data.table and dplyr.

library("rvest")
library("magrittr")
library('data.table')
library('dplyr')

I first load the packages I need before I start scraping

userId <- "ur25468061"
startQuery <- c("1")

I break down the url of the public list I want to scrape (http://www.imdb.com/user/ ur25468061 /ratings?start= 1 &view=detail&sort=ratings_date:desc&defaults=1&my_ratings=restrict). I can easily change the user ID (ur25468061) and number of page (1) to get different data from a different list or page.

df_imdb<-data.frame()
for(v in startQuery){
  paste0("http://www.imdb.com/user/",userId,"/ratings?start=", v, "&view=detail&sort=ratings_date:desc") -> download_url
  export_html <- read_html(download_url)
  html_nodes(export_html, ".year_type") %>% html_text() -> info100Year
  export_html %>% html_nodes( ".info") %>% html_nodes("b") %>% html_nodes("a") %>% html_text() -> info100Title
  export_html %>% html_nodes( ".info") %>% html_nodes("b") %>% html_nodes("a")  %>% html_attr('href')  -> info100URL
  as.numeric(export_html %>% html_nodes( ".rating-your") %>% html_nodes(".value") %>% html_text()) -> info100Rating
  export_html %>% html_nodes( ".item_description") %>% html_text() -> info100Desc
  export_html %>% html_nodes( ".list") %>% html_nodes( ".image") %>% html_nodes( ".hover-over-image") %>% html_nodes("img") %>% html_attr('src')  -> info100Images
  gsub("(|)","",info100Year, fixed=TRUE) -> info100Year
  paste0("http://www.imdb.com", info100URL) -> info100URL
  sapply(strsplit(info100URL, '/'), function(x) x[5]) -> info100id
  
  df <- data.frame("title" = info100Title, "year" = info100Year, "description" = info100Desc, "myRating" = info100Rating, "url" = info100URL, "id" = info100id, stringsAsFactors = FALSE)
  df_imdb<-rbind(df_imdb,df)
}
sapply(strsplit(df_imdb$year, ' '), function(x) x[2]) -> df_imdb$type
df_imdb$type[is.na(df_imdb$type)] <- "Movie"
as.numeric(sapply(strsplit(df_imdb$year, ' '), function(x) x[1])) -> df_imdb$year

I now run a simple for loop through my pages (in this example just page 1) for my userlist. I collect several variables about the movies I have watched:

  • Year of release
  • Title
  • URL of the movie on IMDB
  • My rating on IMDB
  • Short description of movie
  • A URL of an image of the movie
  • IMDB movie ID
  • Type of IMDB item (I also reviewed some tv-series)

As you may see webscraping requires a lot of cleaning data to make it look dataframe perfect and ready to analyse. What I end up with is a pretty cool dataframe of most recently rated movies with a bunch of variables for me to analyse.

Scraping additional movie data

The dataframe is already pretty full of data to analyse. However, I would like some more details about the movies to analyse:

  • Total number of IMDB ratings
  • Average IMDB rating of all users
  • Top cast members
  • Director of the movie
  • Movie genres

Based on the IMDB movie ID’s I collected in the previous dataframe I can get all this data from movie pages on IMDB. All movie pages include the IMDB movie ID in the url: http://www.imdb.com/title/ tt3501112 /

df_imdb$id -> imdb_movie_id

df_imdb_movie <- data.frame()
for(i in imdb_movie_id){
    read_html(paste0("http://www.imdb.com/title/",i,"/")) -> lego
    
  # verschil tussen http://www.imdb.com/title/tt2782844/ en http://www.imdb.com/title/tt2072233/
    as.numeric(gsub(",", "", lego %>% html_nodes(".imdbRating .small") %>% html_text)) -> imdb_num_ratings
    as.numeric(lego %>% html_nodes(".imdbRating strong") %>% html_text) -> imdb_rating
    #lego %>% html_nodes(".title_wrapper .subtext") %>% html_text() -> imdb_time_html
    #gsub(" ", "", sapply(strsplit(imdb_time_html, '\n'), function(x) x[4]), fixed = TRUE) -> imdb_time_string
    #strsplit(imdb_time_string, "h|min") -> imdb_time_hours
    #as.numeric(do.call( rbind, imdb_time_hours)[,1]) * 60 + as.numeric(do.call( rbind, imdb_time_hours)[,2]) -> imdb_time_minutes
    
    #create row of cast, normally 15
    list(t(unlist(lego %>%  html_nodes("#titleCast .itemprop span") %>%  html_text()))) -> imdb_cast
    paste(unlist(imdb_cast), collapse='|') -> imdb_cast
    lego %>% html_nodes(".credit_summary_item .itemprop") %>% html_text() -> imdb_director
    imdb_director[1] -> imdb_director
    list(gsub("Genres:| ", "", lego %>% html_nodes(xpath = "//*[@id=\"titleStoryLine\"]/div[4]") %>% html_nodes("a") %>% html_text())) -> imdb_genres
    paste(unlist(imdb_genres), collapse='|') -> imdb_genres
    df1 <- data.table("imdb_rating" = imdb_rating, "imdb_ratings" = imdb_num_ratings ,"imdb_director" = imdb_director, "id" = i, "imdb_cast" = imdb_cast, "imdb_genres" = imdb_genres ,stringsAsFactors = FALSE)
    df_imdb_movie<-rbind(df_imdb_movie,df1)
}

I created another loop to collect data for all the movies in my previous table. Now I merge the two dataframes to turn them into one big, beautiful dataframe.

df3 = merge(df_imdb, df_imdb_movie, by.x=c("id"), by.y=c("id"))

Next steps to start analysing my IMDB data

I run these queries daily for the most recently added movies and update my dataframe with the new additions. Afterwards, I save my dataframe to a SQL table (no cloud for me yet, I’m too old-school). This way, I will continue to update my database of reviewed movies.

You can send your data to a SQL table with the R package RMySQL. It takes a few lines of R code to read or write your SQL table

# Load the package
library('RMySQL')

#Initiate database connection
con <- dbConnect(MySQL(),
                 user = '-----',
                 password = '-----',
                 host = '-----',
                 dbname='-----')
#read table
dbReadTable(conn = con, name = 'imdb')

#write table
dbWriteTable(conn = con, name = 'imdb', value = DATAFRAMENAME)

In my next post I will explore the dataset and learn some more about my movie likes and dislikes!