4 August 2012

IMDb Top 250 Deconstructed


The IMDb chart of the 250 best movies of all time is periodically invaded by overhyped crap, which is in itself grounds for endless discussions about the taste of the average moviegoer and the lack of objectivity where art is concerned. We are living through such a moment right now. Four years have passed since The Dark Knight topped the chart (it then sat for weeks between The Shawshank Redemption and The Godfather before slowly falling to its current 8th place). Two years ago the same happened with Inception. And now The Dark Knight Rises sits at the prestigious 13th place, which is just wrong on so many levels.

Going through the full chart, I asked myself: is it really impossible to extract some meaningful and representative information from these colossal statistics (hundreds of thousands of votes per movie)? Representative of something other than the mad nerdiness that suffocates sensible judgment? It's easy to say: to hell with it, there is no objectivity in cinema anyway, so it doesn't matter that The Avengers is rated higher than anything the Coen Brothers ever directed. If we listened to the crowd, we would be eating at McDonald's every day... The problem is that the crowd's taste is exactly why there is a McDonald's on every street corner. The IMDb Top 250 shows what is popular, and what is popular is what gets made the most. So it is in everyone's interest to find as many personal favorites as possible in this chart.

Curious about the exact role of fanboyism in the IMDb Top 250, I decided to take a deeper look at the statistics, which are available to all users.
First, a few words about the methodology. It's very easy to automatically download the HTML page for every single movie featured in the Top 250, including all the pages with the different rating breakdowns (by gender, age group, location, etc.). Using a few simple Linux commands (grep, awk and sed), I extracted all the needed information from the HTML files into one big text file, which I imported into MS Excel for easier manipulation of the data and convenient graphical representation of the results. (For those going "w00t, a few Linux commands and then the MS bullshit?" - yeah, that's right, the MS bullshit has better visual output.)

Before I start with the deconstruction of the Top 250, let me show you some interesting statistics that one can obtain from the movie pages. The first 20 movies in the chart as of 1st of August 2012 are shown below (the full chart is available if you click on the table):

So what do the statistics show?
23% of the movies in the Top 250 are international, while 77% are from the USA.
This of course is no surprise: Hollywood is the biggest movie producer in the world (I don't consider what Bollywood produces to be movies), and the geographical distribution of the voters doesn't matter. In fact, on average about 66% of the votes come from non-USA users (ranging from 95% for the French Intouchables (2011) down to 40% for The Princess Bride). As seen in the graph below, this fraction stabilizes between 60% and 70% for the movies close to the very top of the chart:
(One interesting observation related to this: only 10% of the votes for the Iranian A Separation come from USA users. Surprise, surprise...)

The distribution by MPAA rating favors R-rated movies, which take almost 40%, more than all the other rating categories put together. Of course, 1/3 of the movies have no rating at all, simply because the current MPAA classification has been in use only since 1990 (and IMDb applies it to older movies in a funny way).

The distribution by years:
shows that if the trend continues, we'll have 35 movies from 2011-2020 in the Top 250, which is a worse result than for each of the previous two decades. Alas, mathematics proves that movies are starting to suck more and more.

And now, something more interesting: the directors with the largest number of movies featured in the Top 250:
The winner is Hitchcock with 10 movies, followed by Kubrick with 8. Fair enough. Actually, fair enough until you check what percentage of each director's movies are in the Top 250 (shown in green on the figure below; the red bars indicate the total number of movies directed, excluding shorts, documentaries, TV productions and collaborations with other directors - this means Grindhouse doesn't count for Tarantino, Duel doesn't count for Spielberg, Doodlebug doesn't count for Nolan and Shine A Light doesn't count for Scorsese):

It seems Christopher Nolan is the director with the most consistently good filmography, because 6 out of his 8 movies are somewhere in the current Top 250. Only Tarantino seems to have a chance of catching up with him, but not in the next few years, because even if Django Unchained is yet another masterpiece of his (fingers crossed), the result will still be below Nolan's 75%.

So the main question is to what extent Christopher Nolan deserves this honor of being the best director who ever lived. Is he really that good? There is no question that his Memento and The Prestige are real gems that deserve to be included in the Top 250 - I would put them both among the best 20, but that's just me. However, I hope it's not just me who finds it ridiculous that The Dark Knight is in 8th place, Inception in 15th, and the shallow, predictable, pretentious conclusion of the Batman trilogy in 13th... Apparently, according to the crowd, these three movies are much better than There Will Be Blood and A Clockwork Orange. So what's the force behind this nonsense?

It's obvious - the mindless waves of fanboys coming to IMDb only to vote for their favorite movie simply destroy the chart. A large chunk of the votes for TDKR and The Avengers are 10/10, cast for the sake of pushing those movies as high as possible. At the same time, the same people vote 1/10 for the competing movies, so that they fall behind their favorites. There is not a glimpse of critical thought behind this mechanical button-pushing.

How is the score calculated? The formula is simple and is given just below the chart on the IMDb page. The rating is a simple weighted average with appropriate normalization: (v×R + m×C) / (v+m), where v is the number of votes, R is the arithmetic average vote for the movie, m is the minimum number of votes a movie needs to be considered for the chart (currently 25000), and C is the average score across all movies in IMDb (currently 7.1). The trick here is that R and v are calculated by taking into account only the votes of regular users with a certain voting history behind them. What the threshold is, no one knows. This, however, is the only mechanism IMDb employs to suppress to some extent the impact of the one-time voters, who usually give either 1/10 or 10/10.
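To get a feel for how the formula behaves, here is a minimal Python sketch. The function name weighted_rating and the vote counts in the example are made up for illustration; only the formula itself and the values m = 25000 and C = 7.1 come from the description above.

```python
def weighted_rating(v, R, m=25000, C=7.1):
    """IMDb-style weighted rating: (v*R + m*C) / (v + m).

    v: number of votes, R: arithmetic mean of those votes,
    m: minimum votes needed for the chart, C: mean score over all movies.
    """
    return (v * R + m * C) / (v + m)


# Two hypothetical movies with the same mean vote but very different vote counts.
few_votes = weighted_rating(v=30_000, R=9.0)
many_votes = weighted_rating(v=600_000, R=9.0)

print(f"30k votes:  {few_votes:.2f}")
print(f"600k votes: {many_votes:.2f}")
```

The fewer votes a movie has, the harder its rating is pulled towards the global average C; with hundreds of thousands of votes the m×C term barely matters and the rating is practically R.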

It's very easy to check the effect of this defense mechanism. We can calculate the rating for each movie using the same formula, but taking into account all the votes (this is the information that is actually available on the rating page of each movie). The 250 movies are then rearranged with the "irregular" one-time voters included. The change in a movie's position is a good indication of the amount of fanboyish attitude towards that movie.
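The re-ranking described above can be sketched roughly like this. This is not the grep/awk/Excel pipeline actually used, just an illustration of the idea; the three movies, their numbers and the dictionary keys (regular_votes, regular_mean, all_votes, all_mean) are all hypothetical.

```python
def weighted_rating(v, R, m=25000, C=7.1):
    """Same formula as IMDb shows below the chart."""
    return (v * R + m * C) / (v + m)


def rank(movies, votes_key, mean_key):
    """Return {title: position}, position 1 being the highest rating."""
    ordered = sorted(
        movies,
        key=lambda mv: weighted_rating(mv[votes_key], mv[mean_key]),
        reverse=True,
    )
    return {mv["title"]: pos for pos, mv in enumerate(ordered, start=1)}


def place_changes(movies):
    """Positive number = the movie climbs once ALL votes are counted."""
    regular = rank(movies, "regular_votes", "regular_mean")
    everyone = rank(movies, "all_votes", "all_mean")
    return {title: regular[title] - everyone[title] for title in regular}


# Made-up data: Film B gets a big boost from one-time 10/10 voters.
movies = [
    {"title": "Film A", "regular_votes": 400_000, "regular_mean": 9.0,
     "all_votes": 500_000, "all_mean": 9.0},
    {"title": "Film B", "regular_votes": 300_000, "regular_mean": 8.8,
     "all_votes": 450_000, "all_mean": 9.3},
    {"title": "Film C", "regular_votes": 200_000, "regular_mean": 8.6,
     "all_votes": 220_000, "all_mean": 8.5},
]

changes = place_changes(movies)
for title, delta in changes.items():
    print(f"{title}: {delta:+d}")
```

A large positive change marks a movie that is carried by the irregular voters; a negative one marks a movie that the irregular voters push down.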

The results speak for themselves:
Among the top 20 movies (ordered by the rating from regular votes only), 14 would drop when the fanboys'/haters' votes are considered. What I find striking is that all of Nolan's movies would be rated significantly higher. They would actually occupy 3 of the top 4 places, with The Dark Knight Rises in second position - an atrocity beyond comprehension.

Simple mathematics reveals the true name of the fanboys/haters - they are nolanites. The only other movie among the first 20 with similar behavior is Fight Club, but I doubt Fincher's name is in play here - there are only two of his movies in the Top 250. Alas, the power of the nolanites is not sufficient to bring down The Shawshank Redemption - I can imagine how sad they are because of this...

You can check what happens in the rest of the chart here. If irregular votes are considered, all of Nolan's movies rise in the Top 250 (including Batman Begins, by a stunning 58 places). In this chart The Avengers is in the top 10, for fuck's sake. On the other hand, all the movies by Kurosawa, Fellini, Hitchcock, Coppola... drop significantly. It seems the anti-fanboy protection system works to some extent. Not perfectly, but it works.

Many other conclusions can be drawn from the breakdowns by sex and age group. For example, you'll probably not believe it, but the reason The Avengers is so high in the chart is women's votes (they put it in 4th position in the Top 250, while in the men's chart it is 27th). Yeah, that's the power of vaginas over the Marvel nerds. In the female Top 250 the last Harry Potter is 8th, just after The Lion King... I will leave the other gory details for next time.

To finish, I will just show you how the chart would look if only the votes of the top 1000 IMDb users were considered. Those are moviegoers who vote systematically and are supposed to have much better reasoning behind their votes. You can find the chart here. Picking out the hot examples - all the movies directed by Christopher Nolan drop by at least 60 positions: The Dark Knight Rises is now 92nd, The Dark Knight is in 70th place... The Avengers is 149th. Much better.

Of course, all this is just a rearrangement of the movies currently in the Top 250. The real chart by the top 1000 voters would include other movies (and most probably The Avengers would not be there anymore). For example, all but two of the Coen Brothers' movies have a score better than 6.8 (the lower limit in the current chart). It would be extremely interesting to see this top 250 in its full glory, and I'm sure it's a matter of 15 minutes of code writing. But we'll never get it.

So the first conclusion is that the mechanism for suppressing the childish voting behavior of IMDb users works to some extent. In principle, one could investigate better algorithms for score calculation, more complicated than the simple weighted average - for example, taking into account the shape of the distribution of votes from 1 to 10 and fitting it with a Gaussian-like function, which would treat the overblown ends correctly.
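Just to illustrate the idea, here is one possible (and rather naive) way to do such a fit in plain Python. It is only a sketch under my own assumptions, not something IMDb does: the 1/10 and 10/10 bins are ignored completely, and a parabola is fitted by least squares to the logarithm of the remaining counts, since the logarithm of a Gaussian is a parabola. The function names and the vote histogram are invented; the histogram is a synthetic bell shape with extra fanboy and hater votes piled onto the ends.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_gaussian(counts, bins=range(2, 10)):
    """Least-squares fit of ln(count) = a*x^2 + b*x + c over the given bins.

    counts maps a vote value (1..10) to the number of such votes.
    For a Gaussian, a < 0, the peak is mu = -b / (2a) and the width is
    sigma = sqrt(-1 / (2a)). The default bins 2..9 skip the extreme votes.
    """
    xs = [x for x in bins if counts[x] > 0]
    ys = [math.log(counts[x]) for x in xs]
    s = [sum(x ** k for x in xs) for k in range(5)]                   # sums of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]   # sums of y*x^k
    # Normal equations of the quadratic fit, solved with Cramer's rule.
    A = [[s[4], s[3], s[2]],
         [s[3], s[2], s[1]],
         [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    d = det3(A)

    def solve(j):
        replaced = [[rhs[i] if k == j else A[i][k] for k in range(3)]
                    for i in range(3)]
        return det3(replaced) / d

    a, b = solve(0), solve(1)
    if a >= 0:
        raise ValueError("votes are not peaked; a Gaussian does not fit")
    return -b / (2 * a), math.sqrt(-1 / (2 * a))


# Synthetic histogram: bell shape around 7.5 (width 1.5) plus inflated ends.
counts = {x: round(1_000_000 * math.exp(-(x - 7.5) ** 2 / (2 * 1.5 ** 2)))
          for x in range(1, 11)}
counts[1] += 100_000      # hypothetical haters
counts[10] += 1_000_000   # hypothetical fanboys

total = sum(counts.values())
arithmetic_mean = sum(x * n for x, n in counts.items()) / total
mu, sigma = fit_gaussian(counts)
print(f"arithmetic mean: {arithmetic_mean:.2f}, fitted peak: {mu:.2f}")
```

On this made-up histogram the arithmetic average is dragged upwards by the pile of 10/10 votes, while the fitted peak stays close to the centre of the underlying bell. Real vote distributions are not necessarily Gaussian, so this should be taken as a toy, not as a proposal for the actual formula.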

The second conclusion is that Christopher Nolan is the most overrated and overhyped director in the history of movies. Period. When the numbers speak, the nolanites had better shut their yaps.

To be continued...

6 comments:

  1. While your math may be correct, your logic is flawed.. if chris nolan makes movies everybody likes, why is he overrated? if you apply logic he clearly is underrated...

  2. Replies
    1. Thanks, this is a great resource to check the trends in time. Unfortunately, there is no history of the votes breakdown.

    2. If you click on movie title in Movies section, you will see votes day-by-day

  3. Yes, but we have all the votes and the calculated IMDB average. What would be nice to have is the number of people voted 10,9,8,...,1 for each day, so we could compare the imdb average with the arithmetic average and see how this difference goes with time.
