Movies are an inseparable part of our lives: a form of art loved and consumed by many. The movie industry is highly competitive, and as the number of movies produced each year keeps growing, producers need to maximise their chances of making a successful movie. However, little is known about what audiences are interested in, or how those interests have changed over time. Using the CMU dataset, enriched with an IMDb rating dataset, our goal is to identify the key ingredients of a successful movie. We want to find out which aspects influence a movie’s success and how these factors have evolved across geographical regions and time periods. The results could support movie producers by capturing the essence of what people desire in a movie.

Dataset description, dataset extension and preprocessing

We were provided with the CMU dataset, which can be obtained at the following link. Despite its rich content, there were several aspects we sought to improve, enriching the data with features we deemed relevant to our topic.

One promising external source is the IMDb dataset, which was available for us to download at the following link. We merged the two datasets on the name and release date columns, which allowed us to obtain additional features.
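The merge on name and release date can be sketched roughly as follows. This is only an illustration on toy data, not the procedure from Preprocessing.ipynb: the IMDb column names (`primaryTitle`, `startYear`), the helper keys (`title_key`, `year_key`), the title normalisation, and the choice to compare release years rather than full dates are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the two sources; titles, years and ratings are made up.
cmu = pd.DataFrame({
    "Name": ["Movie A", "Movie B", "Movie C"],
    "Release date": ["1999-03-31", "2001", "1985-07-03"],
})
imdb = pd.DataFrame({
    "primaryTitle": ["movie a ", "Movie B", "Movie D"],  # assumed column name
    "startYear": [1999, 2001, 2010],                      # assumed column name
    "averageRating": [8.7, 6.1, 7.0],
})


def normalise_title(titles: pd.Series) -> pd.Series:
    """Lower-case and strip titles so trivial differences do not block a match."""
    return titles.str.strip().str.lower()


# Build comparable keys on both sides (hypothetical helper columns).
cmu["title_key"] = normalise_title(cmu["Name"])
# Assumption: only the year (first four characters) is comparable across sources.
cmu["year_key"] = pd.to_numeric(
    cmu["Release date"].str[:4], errors="coerce"
).astype("Int64")

imdb["title_key"] = normalise_title(imdb["primaryTitle"])
imdb["year_key"] = imdb["startYear"].astype("Int64")

# Inner join keeps only movies found in both sources; validate guards against
# duplicated (title, year) pairs silently multiplying rows.
merged = cmu.merge(
    imdb[["title_key", "year_key", "averageRating"]],
    on=["title_key", "year_key"],
    how="inner",
    validate="one_to_one",
)

print(merged[["Name", "Release date", "averageRating"]])
```

An inner join drops movies without a match in the other source; a left join would keep every CMU movie and leave the IMDb features empty where no match exists.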

In addition, we took advantage of the free trial of IMDb Pro, which allowed us to scrape further relevant data: budgets, MPAA ratings, box office revenue, directors, writers, producers, composers, cinematographers and editors.

After a thorough data cleaning procedure described and justified in detail in the Preprocessing.ipynb notebook, we obtain a dataset containing the following columns:

Wikipedia ID, Freebase ID, Name, Release date, Runtime, Languages, Countries, Genres, IMDb ID, averageRating, number of votes, Budgets, Mpaa, Box offices, Directors, Writers, Producers, Composers, Cinematographers, Editors, Weighted Rating

EDA and spatio-temporal analysis

To gain more insight and understand trends, we are interested in how the features that generally lead to a successful movie evolve over time, and whether they change depending on the country.

We start by visualizing the distribution of movie releases through the years.

The oldest movie in the dataset was released in 1888, and the most recent in 2016. However, the histogram shows that a large proportion of the movies in our dataset were released after 1990. This drastic increase in the number of movies produced in the past 40 years is most likely driven by technological advancements, which enabled the movie industry to become ubiquitous. We therefore expect results from this period to be a better representation of the real world when conducting the temporal analysis.
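The counts behind such a histogram can be computed along the following lines. This is a minimal sketch on made-up release dates, assuming the dates are strings that start with a four-digit year and may be missing; the helper `release_year` and the sample values are hypothetical, not taken from our notebooks.

```python
from collections import Counter

# Made-up sample; assumed formats: "YYYY", "YYYY-MM", "YYYY-MM-DD" or missing.
release_dates = ["1888", "1995-06-02", "2001-11", "2001-03-14", None, "2016-01-08"]


def release_year(date):
    """Return the year as an int, or None if the date is missing or malformed."""
    if not date or len(date) < 4 or not date[:4].isdigit():
        return None
    return int(date[:4])


years = [y for y in map(release_year, release_dates) if y is not None]

# Number of releases per year and per decade (the histogram bins).
per_year = Counter(years)
per_decade = Counter((y // 10) * 10 for y in years)

# Share of movies released after 1990, the period the temporal analysis leans on.
share_after_1990 = sum(1 for y in years if y > 1990) / len(years)

print(f"oldest: {min(years)}, most recent: {max(years)}")
for decade in sorted(per_decade):
    print(f"{decade}s: {'#' * per_decade[decade]}")
print(f"share released after 1990: {share_after_1990:.0%}")
```

Reporting the share of movies after a cut-off year alongside the plot makes it easier to judge how much weight the earlier decades carry in any temporal comparison.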