A Data-Centric Approach to Watching Movies

Around a week back, I read a question on Quora that was along the lines of “How can I read more books?”. One answer that I found particularly interesting was to make a list of all the books you’ve read. That way, not only do you maintain an organized list of things you’ve read, but you also feel compelled to keep the list growing. In some way, you start on a reading spree, even if it’s just to increase the length of the list.

It took me a few days to absorb that suggestion, after which I started a Google Spreadsheet to keep track of all the books I’ve read. But I decided not to stop there, and so I made another sheet of all the movies I’ve seen too. As the list of movies I’ve seen outgrew the list of books I’ve read by a wide margin, something dawned on me: why waste this incredible data by limiting it to a silly little spreadsheet?

And there it began. I knew right away what I had to do.


A day later, I started working on a rudimentary web application to maintain a database of movies I’ve seen. I knew I had to do some data scraping and so I chose Python. I hacked out a dead-simple application with Pyramid that just collects the links to the IMDB pages of the movies.

I stored those IMDB links in a simple two-column database. Next stop, scraping.

I chose mechanize and BeautifulSoup to scrape the movie details from IMDB. The idea is simple: follow the link and collect the movie’s name, year, content rating, duration, release date, rating, director, actors, genre, country, and number of Academy Award nominations and wins.
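To give a feel for the "collect the fields" step, here is a minimal sketch. It is not the actual scraper: it swaps in Python's built-in html.parser for mechanize and BeautifulSoup so that it runs without extra installs, and the sample markup and class names (title, year, duration) are invented for illustration and are not IMDB's real page structure.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collect the text of elements whose class attribute is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.fields = {}
        self._current = None  # class name of the element being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.wanted:
            self._current = cls

    def handle_data(self, data):
        # Store the first non-blank text found inside a wanted element.
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


# Hypothetical page fragment with placeholder values, not real IMDB markup.
SAMPLE_PAGE = """
<h1 class="title">Example Movie</h1>
<span class="year">2000</span>
<span class="duration">120 min</span>
"""

parser = FieldParser(["title", "year", "duration"])
parser.feed(SAMPLE_PAGE)
print(parser.fields)
```

A real page would need its own selectors, and the values come back as strings, so fields like the year or duration would still need converting before any arithmetic.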

So I wrote a script to go through my database, scrape all that info from IMDB and put it back in another table. Click here to head over to GitHub if you’re interested in seeing my scraper.
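The links-table-in, details-table-out flow could look roughly like the sketch below. Everything specific here is an assumption: the post doesn't say which database engine was used (SQLite is picked because it ships with Python), the table and column names are made up, and fetch_details is a hypothetical stand-in that returns placeholder values instead of actually visiting a page.

```python
import sqlite3


def fetch_details(url):
    """Hypothetical stand-in for the real scraper; returns placeholder values."""
    return {"name": "Example Movie", "year": 2000, "duration": 120, "rating": 7.0}


def build_movies_table(conn, scrape=fetch_details):
    """Read every stored link, scrape it, and write the details to `movies`."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies ("
        "link_id INTEGER PRIMARY KEY, name TEXT, year INTEGER, "
        "duration INTEGER, rating REAL)"
    )
    rows = conn.execute("SELECT id, url FROM links").fetchall()
    for link_id, url in rows:
        details = scrape(url)
        # INSERT OR REPLACE keeps re-runs from duplicating a movie.
        conn.execute(
            "INSERT OR REPLACE INTO movies VALUES (?, ?, ?, ?, ?)",
            (link_id, details["name"], details["year"],
             details["duration"], details["rating"]),
        )
    conn.commit()
    return len(rows)


# Assumed two-column links table: an id and the page URL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (id INTEGER PRIMARY KEY, url TEXT)")
conn.execute("INSERT INTO links (url) VALUES ('https://example.com/movie/1')")
print(build_movies_table(conn))
```

Keeping the raw links and the scraped details in separate tables means the scrape can be re-run at any time without touching the list of links.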

Next, I needed to enter the movies I’ve seen. Because I’d already started that silly little spreadsheet, I just moved those names over. While the list is by no means exhaustive, I believe it still gives a pretty good idea, and I can always keep adding movies as the names come to mind.

Finally comes the interesting part: the results. As the whole point of all this was to see what interesting results could be derived from the movies I tend to see, I started by listing the movies. A list of movies I’ve seen, along with brief details of each movie, can be found here (if you’re thrown an error, just hit Refresh).

I added a snippet of code to sum up the duration of all the movies in the list. As of writing this, that adds up to 17 days, 17 hours and 55 minutes. See, I told you it’d be fun, and it’s already begun!
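Summing the durations and turning the total into days, hours and minutes is a couple of divmod calls. This is a minimal sketch rather than the snippet from the site; the function name format_total is made up, and it assumes each duration is a whole number of minutes.

```python
def format_total(durations_in_minutes):
    """Sum per-movie durations and express the total as days, hours, minutes."""
    total = sum(durations_in_minutes)
    days, remainder = divmod(total, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days} days, {hours} hours and {minutes} minutes"


# The four durations quoted in this post (longest two and shortest two).
print(format_total([216, 210, 91, 92]))
```

For reference, a total of 17 days, 17 hours and 55 minutes corresponds to 25,555 minutes of runtime.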

Then, I was interested in things like how long a movie I watch is on average, what the average rating of a movie I watch is, etc. So I added another page to calculate those results. Here’s a brief summary:

  • Mean Duration: 140.41 minutes (Standard Deviation: 26.8613)
  • Mean Rating: 7.11 (Standard Deviation: 1.309)
  • Mean Release Year: 2004 (Standard Deviation: 10.921)
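Numbers like these can be computed with Python's statistics module. The sketch below is an illustration, not the code behind the stats page, and it assumes the population standard deviation (pstdev); the post doesn't say whether the population or the sample formula (stdev) was used, and the two differ slightly on a list this size.

```python
from statistics import mean, pstdev


def summarize(values):
    """Return (mean, population standard deviation) for a list of numbers."""
    return mean(values), pstdev(values)


# The four durations quoted in this post, in minutes.
durations = [216, 210, 91, 92]
avg, spread = summarize(durations)
print(f"Mean Duration: {avg:.2f} minutes (Standard Deviation: {spread:.4f})")
```

The same function works unchanged for ratings or release years, since all three are just lists of numbers.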

Also, things like whose movies I watch the most are particularly interesting to me. So I compiled data like the following too:

  • Most Watched Actor: Shah Rukh Khan (24) followed by Leonardo DiCaprio (17)
  • Most Watched Director: Priyadarshan (11) followed by Martin Scorsese (8)
  • Most Watched Genre: Drama (125) followed by Comedy (49)
  • Most Watched Release Year: 2008 (20) followed by 2010 (19)
  • Oldest: Casablanca (1942) followed by Bicycle Thieves (1948)
  • Longest: Mohabbatein (216 minutes) followed by Kabhi Khushi Kabhie Gham… (210 minutes)
  • Shortest: Gravity (91 minutes) followed by Easy A (92 minutes)
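The "most watched" rankings boil down to counting how often each value appears across all movies. Here is one way that could be sketched with collections.Counter; the record layout (a dict with list-valued fields) and the placeholder names are assumptions for illustration, not the site's actual schema or data.

```python
from collections import Counter


def most_watched(movies, field, n=2):
    """Count every value listed under `field` and return the top `n`."""
    counts = Counter()
    for movie in movies:
        counts.update(movie[field])
    return counts.most_common(n)


# Placeholder records; real rows would come from the scraped table.
sample = [
    {"actors": ["Actor A", "Actor B"], "genres": ["Drama"]},
    {"actors": ["Actor A", "Actor C"], "genres": ["Drama", "Comedy"]},
    {"actors": ["Actor A", "Actor B"], "genres": ["Drama"]},
]
print(most_watched(sample, "actors"))
print(most_watched(sample, "genres"))
```

Because a movie can list several actors or genres, each one is counted separately, which is why genre counts can add up to more than the number of movies.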

Head over to this page to see a full breakdown of all such data.

Because no statistical analysis is complete without graphs, I went over to Google Charts to generate some pretty little charts too. Click here to go to the charts page. Here are some interesting ones:

Even though I’d love to believe that the graph above shows a Normal distribution, I’m going to stick with the assumption that it’s “almost Normal” and will appear more and more so as the total number of days I’ve spent watching movies approaches months.

Nothing too surprising here: the movies I watch tend to skew towards recent years. I think this will even out slightly now that I’m watching more and more old movies.

Nothing fancy here. Let’s move on to better things.

As looking for correlation in scatter plots tends to produce some really fascinating results, let’s see if we notice anything significant.

This graph does make sense. All the old movies are rated pretty well, which isn’t surprising because I’d watch an old movie only if I’d read somewhere that it’s good. With just this limited data sample, I wouldn’t go as far as saying that old movies tend to have higher IMDB ratings in general.

While I was desperately hoping that shorter movies would turn out to have higher ratings, the result is inconclusive.


Too bad, there doesn’t seem to be much correlation between ratings and Academy Awards. What a sham. Nothing I can do about that.
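Eyeballing scatter plots only goes so far; one way to put a number on "how correlated" two columns are is the Pearson correlation coefficient, which runs from -1 to 1 with values near 0 meaning little linear relationship. The sketch below is an assumption about how this could be checked, not something the charts page does, and the sample numbers are placeholders rather than values from the list.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return covariance / (spread_x * spread_y)


# Placeholder data: durations in minutes against ratings.
durations = [95, 110, 125, 140, 180]
ratings = [7.1, 6.4, 8.0, 6.9, 7.5]
print(round(pearson(durations, ratings), 3))
```

A coefficient only captures straight-line relationships, so it complements the scatter plots rather than replacing them.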

That’s pretty much it. I was thinking of doing something similar with the books I’ve read, but I’ve been hesitating because a) the list is too short, and b) although there’s Goodreads, I’m a little skeptical that it’d generate results as interesting as what I got from movies.

I’ll keep updating the list, and all the linked stats and graphs will update themselves. So if you’re interested in the movies I watch, please feel free to keep checking in on me.
