Goal

Movies are an inseparable part of our lives. They are a form of art, loved and consumed by many. The movie industry is very competitive: the number of movies produced per year keeps growing, so producers need to maximise their chances of making a successful movie. However, little is known about what people's interests are and how they have changed over time. Using the CMU dataset, enriched with an IMDb rating dataset, our goal is to identify the key ingredients of a successful movie. We want to find out which aspects influence a movie's success, and how these factors have evolved across geographical regions and time periods. The results could support movie producers by capturing the essence of what people desire in a movie.


Dataset description, dataset extension and preprocessing

We were provided with the CMU dataset, which can be obtained at the following link. Despite its rich content, we sought to enrich it with additional features that we deemed relevant to our topic.

A promising external source is the IMDb dataset, which is available for download at the following link. We merged the two datasets on the name and release date columns, which gave us additional features.
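A minimal sketch of such a merge, using only the Python standard library. The field names (`name`, `release_date`), the sample rows, and the choice to match on a normalised title plus the release year are assumptions made for illustration; the actual procedure is the one in Preprocessing.ipynb.

```python
def merge_on_title_and_year(cmu_rows, imdb_rows):
    """Inner-join two lists of dicts on (normalised title, release year)."""
    def key(row):
        # Assumption: dates start with a four-digit year ("YYYY" or "YYYY-MM-DD").
        return (row["name"].strip().lower(), row["release_date"][:4])

    imdb_by_key = {key(r): r for r in imdb_rows}
    merged = []
    for row in cmu_rows:
        match = imdb_by_key.get(key(row))
        if match is not None:
            # On shared column names the CMU value is kept.
            merged.append({**match, **row})
    return merged


# Made-up rows, for illustration only.
cmu = [
    {"name": "Movie A", "release_date": "1995-11-22", "runtime": 81},
    {"name": "Movie B", "release_date": "2001-01-01", "runtime": 90},
]
imdb = [
    {"name": "movie a ", "release_date": "1995", "averageRating": 7.5, "numVotes": 1200},
]
merged = merge_on_title_and_year(cmu, imdb)
print(merged)
```

Movies without a counterpart in the other dataset are dropped here (an inner join), so the merged table can be considerably smaller than either input.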

In addition, we took advantage of the free trial of IMDb Pro to scrape further relevant data: budgets, MPAA ratings, box office revenue, directors, writers, producers, composers, cinematographers and editors.

After a thorough data cleaning procedure described and justified in detail in the Preprocessing.ipynb notebook, we obtain a dataset containing the following columns:

Wikipedia ID, Freebase ID, Name, Release date, Runtime, Languages, Countries, Genres, IMDb ID, averageRating, number of votes, Budgets, Mpaa, Box offices, Directors, Writers, Producers, Composers, Cinematographers, Editors, Weighted Rating

EDA and Spatio-temporal analysis

To gain more insight and understand trends, we look at how the features that generally lead to a successful movie evolve over time, and whether they change depending on the country.

We start by visualizing the distribution of movie releases through the years.

The oldest movie in the dataset was released in 1888, and the most recent in 2016. However, the histogram shows that a large proportion of the movies in our dataset were released after 1990. This drastic increase in the number of movies produced in the past 40 years is most likely caused by recent technological advancements, which made the movie industry ubiquitous. For this reason, when conducting the temporal analysis, we expect the results from this period to be a better representation of the real world.

What is a successful movie?

As explained previously, we think that the weighted rating is a more meaningful measure of success than the average rating, as it takes into account not only the rating but also the number of votes.
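To illustrate how a rating can be combined with a vote count, here is a sketch of one common choice, the Bayesian-style formula IMDb has published for its Top 250 chart. That this is the formula behind our Weighted Rating column is an assumption of this sketch, and the values of `C` (mean rating over all movies) and `m` (minimum-votes parameter) are placeholders.

```python
def weighted_rating(R, v, C, m):
    """Shrink a movie's average rating R towards the global mean C.

    R: average rating of the movie
    v: number of votes for the movie
    C: mean rating across all movies (placeholder value below)
    m: vote count at which R and C get equal weight (placeholder value below)
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# Same average rating, very different numbers of votes.
few_votes = weighted_rating(R=9.0, v=10, C=6.0, m=1000)
many_votes = weighted_rating(R=9.0, v=100_000, C=6.0, m=1000)
print(few_votes, many_votes)
```

With only a handful of votes the weighted rating stays close to the global mean, whereas with many votes it approaches the movie's own average rating.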

For this reason, in this analysis, we consider a movie successful if its weighted rating is above a certain threshold. The threshold was chosen so that the top 5% of movies in our dataset count as successful. This gives a threshold of 6.61, which corresponds to 3290 movies.
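The threshold selection can be sketched as follows. The function name and the toy ratings are hypothetical, and ties at the threshold are counted as successful here, which is a simplification.

```python
def success_threshold(ratings, top_fraction=0.05):
    """Return the lowest rating that still belongs to the top `top_fraction`."""
    ranked = sorted(ratings, reverse=True)
    n_success = max(1, round(len(ranked) * top_fraction))
    return ranked[n_success - 1]


# Toy data: 100 evenly spaced ratings from 0.1 to 10.0.
ratings = [i / 10 for i in range(1, 101)]
threshold = success_threshold(ratings)
successful = [r for r in ratings if r >= threshold]
print(threshold, len(successful))
```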

How are the features defining a successful movie evolving over time?

To observe how the characteristics of a successful movie change over the years, we compare the features of the overall dataset, which includes both successful and unsuccessful movies, with the features of the successful movies only.

Before starting this analysis, we observed that the average rating has decreased in recent years.

This may be due to people becoming more critical as the number of released movies increases, since there are more movies to compare with. Another reason could be that, for the early decades, only the most famous and appreciated movies are recorded, whereas today we have statistics about almost every existing movie.

Are the most successful genres changing over time?

For the most descriptive genres determined during the feature analysis part of our project, we want to analyze how their contribution to the success of movies evolves over the years. With this in mind, we calculate the proportion of successful movies over time for each of these genres. More precisely, for every decade, we divide the number of successful movies of a given genre by the total number of movies of that genre.
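The per-decade proportion described above can be sketched like this. The record layout (`year`, `genres`, `successful`) and the sample movies are hypothetical.

```python
from collections import defaultdict


def success_rate_by_decade(movies, genre):
    """For one genre, return {decade: successful movies / all movies}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for movie in movies:
        if genre not in movie["genres"]:
            continue
        decade = (movie["year"] // 10) * 10
        totals[decade] += 1
        hits[decade] += int(movie["successful"])
    return {decade: hits[decade] / totals[decade] for decade in sorted(totals)}


# Made-up records, for illustration only.
movies = [
    {"year": 1994, "genres": ["Drama"], "successful": True},
    {"year": 1999, "genres": ["Drama", "Mystery"], "successful": False},
    {"year": 2003, "genres": ["Drama"], "successful": False},
    {"year": 2008, "genres": ["Comedy"], "successful": True},
]
drama_rates = success_rate_by_decade(movies, "Drama")
print(drama_rates)
```

A movie with several genres counts once for each of its genres, so the per-genre proportions are not mutually exclusive.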

We can see that for each of these most descriptive genres, except Comedy and Action, the proportion of successful movies is above the general proportion, which is computed including all the genres. In particular, Adventure, Animation and Mystery are among the successful genres nowadays.

Can the box office be predicted from the month a movie was released in?

Next, we ask whether the month in which a movie is released can influence its success. To answer this question, we plot the average box office of the movies released in each month of the year.

On average, the box office is highest for movies released in June and July. This may be because people go to the cinema more often during the summer holidays.
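A sketch of the monthly aggregation behind this plot. It assumes release dates are ISO strings (`YYYY-MM-DD`) and skips movies whose date carries no month; the field names and numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_box_office_by_month(movies):
    """Return {month number: mean box office} for movies with a known month."""
    by_month = defaultdict(list)
    for movie in movies:
        date = movie["release_date"]
        if len(date) < 7:  # year-only dates cannot be assigned to a month
            continue
        by_month[int(date[5:7])].append(movie["box_office"])
    return {month: mean(values) for month, values in sorted(by_month.items())}


# Made-up records, for illustration only.
movies = [
    {"release_date": "2010-06-18", "box_office": 300.0},
    {"release_date": "2012-06-01", "box_office": 100.0},
    {"release_date": "2011-02-11", "box_office": 50.0},
    {"release_date": "1975", "box_office": 999.0},
]
monthly = mean_box_office_by_month(movies)
print(monthly)
```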

Does a difference in the budget influence the success of a movie?

Another production feature we are interested in is the budget of the movie. Here, we compare the typical budget computed over all movies with the typical budget of successful movies.

Since the mean and standard deviation are not robust to outliers and extreme values, we decided to use the median, with the range between the 40th and 60th percentiles as the error bar. Comparing the overall dataset with the one containing only successful movies, we observe that successful movies tend to have higher budgets, which suggests that a higher budget might contribute to success.
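The robust summary described above can be sketched with the standard library. The helper name is hypothetical, and the toy budgets include one extreme value to show why the median is preferred over the mean.

```python
from statistics import mean, median, quantiles


def median_with_band(values, lower=40, upper=60):
    """Return (median, lower percentile, upper percentile) of `values`."""
    # With n=100, cuts[k - 1] is the k-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    return median(values), cuts[lower - 1], cuts[upper - 1]


# Toy budgets: 1..100 plus a single extreme outlier.
budgets = list(range(1, 101)) + [10**9]
med, low, high = median_with_band(budgets)
print(med, low, high, mean(budgets))
```

The outlier pulls the mean up by several orders of magnitude, while the median and the 40th to 60th percentile band are unaffected.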


Comparison between the evolution over time of the genre distribution computed on the overall data and the one computed on successful movies only

As the last step of this temporal analysis, we want to check if the trend over the years for the most frequently used genres is different according to whether the movies are successful or not. We plot the distribution over time for the 10 most frequent genres appearing in our dataset, computed on the overall dataset (left) and computed on the successful movies only (right).

From these two plots, we see almost no difference between the distribution of genres in the original dataset and in the dataset containing only the successful movies. This might be because, when a movie is successful, movies released shortly afterwards may take inspiration from it, for example from its genre or its plot. Thus, movies released in the same period might be similar, whether they become successful or not.


Comparison between the evolution over time of the language distribution computed on the overall data and the one computed on successful movies only

We carry out an analysis similar to the one for genres, this time with the 10 most frequent languages in our dataset.

English seems to be a factor of success.


Does the success of a movie depend on the country it is produced in?

In this part of the project, we want to find out whether a movie's chances of success depend on the country it is produced in. To visualize this, we plot the proportion of successful movies by country. More precisely, for each country, we compute the ratio of the number of successful movies to the total number of movies produced in that country.

From the world map above, Libya and Mauritania are the countries with the highest proportion of successful movies. But the following world map, which represents the number of movies released per country, shows that very few movies are released there. Thus, the high proportion of successful movies might be due to these countries being more selective.
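A sketch of the per-country ratio that also returns the movie count, so that ratios based on very few movies can be spotted or filtered out. The `min_movies` parameter, the field names, and the sample data are hypothetical additions for illustration.

```python
from collections import Counter


def success_share_by_country(movies, min_movies=1):
    """Return {country: (successful / total, total)} for countries with enough movies."""
    totals, hits = Counter(), Counter()
    for movie in movies:
        for country in movie["countries"]:
            totals[country] += 1
            hits[country] += int(movie["successful"])
    return {c: (hits[c] / n, n) for c, n in totals.items() if n >= min_movies}


# Made-up records: country "A" has a single movie, country "B" has four.
movies = [
    {"countries": ["A"], "successful": True},
    {"countries": ["B"], "successful": True},
    {"countries": ["B"], "successful": False},
    {"countries": ["B"], "successful": False},
    {"countries": ["B"], "successful": False},
]
all_countries = success_share_by_country(movies)
filtered = success_share_by_country(movies, min_movies=3)
print(all_countries, filtered)
```

Country "A" reaches a ratio of 1.0 with a single movie, which illustrates why a high proportion in a country with very few releases should be read with caution.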


Among the other countries, the parts of the world with a high proportion of successful movies seem to be the ones with many movie releases (e.g. North America and Europe). This might be because countries that produce many movies have better-developed movie industries.

To visualize the spatial distribution of movie releases, we plot the number of movies produced in each country.

Feature analysis

Working with the plethora of features that can describe a movie is a difficult task. Identifying the features that bring the most value to our analysis is important, so that we can center the work around them. We used a Random Forest for this task. The results were intriguing, although, as expected, the most important feature is the genre. Other descriptive properties turn out to be the runtime, the budget and the release date, the latter hinting at a trend in movie ratings over time. Less impactful are features such as language, country and the people working behind the scenes.
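A minimal sketch of ranking features with a Random Forest, using scikit-learn's `RandomForestRegressor` and its impurity-based `feature_importances_`. The data below is synthetic, and both the hyperparameters and the choice of importance measure are assumptions; the notebook may use different settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: by construction, the rating depends on runtime only.
rng = np.random.default_rng(0)
n = 500
runtime = rng.normal(100, 20, n)
budget = rng.lognormal(2, 1, n)
noise_feature = rng.normal(0, 1, n)
rating = 5 + 0.03 * (runtime - 100) + rng.normal(0, 0.3, n)

X = np.column_stack([runtime, budget, noise_feature])
feature_names = ["runtime", "budget", "noise"]

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, rating)

# Importances sum to 1; higher values mean the feature was more useful for splits.
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances are known to favour features with many distinct values, so permutation importance can be a useful cross-check.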

What are the ingredients for a successful movie?

To answer this question, we ran a linear regression model on the most descriptive features we found (Runtime, Budget, Genre and Country) in order to obtain the coefficient associated with each feature. The full results can be found in our notebook.
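A sketch of how such coefficients can be obtained by ordinary least squares with NumPy. It uses only runtime and genre, encodes the genre as dummy variables relative to a baseline genre, and runs on toy data constructed to follow an exact linear rule; none of these numbers come from our results.

```python
import numpy as np


def fit_linear(runtimes, genres, ratings, baseline):
    """Least-squares fit of rating ~ intercept + runtime + genre dummies."""
    levels = sorted(set(genres) - {baseline})  # the baseline genre gets no column
    columns = [np.ones(len(runtimes)), np.asarray(runtimes, dtype=float)]
    columns += [np.array([g == lvl for g in genres], dtype=float) for lvl in levels]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, np.asarray(ratings, dtype=float), rcond=None)
    names = ["intercept", "runtime"] + [f"genre={lvl}" for lvl in levels]
    return dict(zip(names, coef))


# Toy data following: rating = 5 + 0.01 * runtime + (1.0 if Animation, 0.5 if Drama).
runtimes = [90, 120, 100, 150, 80, 110]
genres = ["Comedy", "Comedy", "Drama", "Drama", "Animation", "Animation"]
ratings = [5.9, 6.2, 6.5, 7.0, 6.8, 7.1]

coefs = fit_linear(runtimes, genres, ratings, baseline="Comedy")
for name, value in coefs.items():
    print(f"{name}: {value:.3f}")
```

Coefficients are expressed per unit of each feature, so their magnitudes are only directly comparable across features that have been put on a common scale, for example by standardising them first.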

The coefficient for runtime is rather high, indicating that people enjoy watching longer films. This could be caused by various factors, but the most prominent is probably that a story is told best when the audience is given more information, which requires more screen time. The location of our movie could be the United Kingdom, Japan or New Zealand, as many well-known movies have been shot there over the years.

For genre, interestingly the Animation genre tends to outperform the rest, but Drama is a close second and may be more appropriate, depending on the target audience.

Directors play a major role in the creation of a movie, therefore selecting the right one is vital for its success.

If we decide to go with the Animation genre, our data shows that, at present, Rich Moore's animations perform best (Zootopia, Wreck-It Ralph). Looking at the best single decade of any director, our choice would be Roger Allers (director of The Lion King and The Little Mermaid).

Looking at Drama, the duo Olivier Nakache and Éric Toledano is the clear choice when considering movies from the past 10 years. However, the best single decade goes to Frank Darabont for the period between 1990 and 2000, during which he directed The Shawshank Redemption and The Green Mile.

Directors have their ups and downs throughout their careers, so if we want to be certain of the dependability of a director, then we would choose Yoshiaki Kawajiri for the Animation genre and Éric Rohmer for the Drama genre.