Comparing the Success of Superhero and Japanese Animated Movies

Soham Nagaokar

When the world entered the 21st century, globalization took off in almost every industry, and movies are no exception. Considering the recent success of both superhero and Japanese animated movies in global markets, a new question has dawned upon us: which set of films is more successful overall? We will measure success in a variety of ways in order to obtain a more comprehensive and holistic result, and in the process, I will guide you, the reader, through the Data Science Pipeline!

Data Collection

Before we start, we must first import all the packages we will need for the entirety of this project. These packages will allow us to import, store, clean, and analyze our data.

In [79]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import warnings

warnings.filterwarnings('ignore')

Source

I will be getting the movie data from this publicly available dataset on kaggle.com. For the uninitiated, Kaggle is a data compilation website focused on providing free, comprehensive, and transparent datasets for public use. The specific datasets that this tutorial will use provide rating data from IMDb, as well as box office information. These datasets are provided by Stefano Leone, a Master of Science student in Data Analytics at the National College of Ireland; since he is a respected member of the data science community, we can reasonably trust that the data he provides is accurate.

Loading in Datasets

Now that we have established our source, the next step is to load in the dataset into Python so that we can actually begin to manipulate it.

In [80]:
imdb_ratings = pd.read_csv("raw_data/IMDb ratings.csv") #Loads in dataset of ratings given for all movies on IMDb

imdb_names = pd.read_csv("raw_data/IMDb movies.csv") #The previous dataset identifies movies by a unique
#identification number instead of a movie title, so this dataset links those id numbers to the actual movie titles
In [81]:
#Here is a preview of both datasets, provided by the head() command
imdb_ratings.head()
Out[81]:
imdb_title_id weighted_average_vote total_votes mean_vote median_vote votes_10 votes_9 votes_8 votes_7 votes_6 ... females_30age_avg_vote females_30age_votes females_45age_avg_vote females_45age_votes top1000_voters_rating top1000_voters_votes us_voters_rating us_voters_votes non_us_voters_rating non_us_voters_votes
0 tt0000009 5.9 154 5.9 6.0 12 4 10 43 28 ... 5.7 13.0 4.5 4.0 5.7 34.0 6.4 51.0 6.0 70.0
1 tt0000574 6.1 589 6.3 6.0 57 18 58 137 139 ... 6.2 23.0 6.6 14.0 6.4 66.0 6.0 96.0 6.2 331.0
2 tt0001892 5.8 188 6.0 6.0 6 6 17 44 52 ... 5.8 4.0 6.8 7.0 5.4 32.0 6.2 31.0 5.9 123.0
3 tt0002101 5.2 446 5.3 5.0 15 8 16 62 98 ... 5.5 14.0 6.1 21.0 4.9 57.0 5.5 207.0 4.7 105.0
4 tt0002130 7.0 2237 6.9 7.0 210 225 436 641 344 ... 7.3 82.0 7.4 77.0 6.9 139.0 7.0 488.0 7.0 1166.0

5 rows × 49 columns

In [82]:
imdb_names.head()
Out[82]:
imdb_title_id title original_title year date_published genre duration country language director ... actors description avg_vote votes budget usa_gross_income worlwide_gross_income metascore reviews_from_users reviews_from_critics
0 tt0000009 Miss Jerry Miss Jerry 1894 1894-10-09 Romance 45 USA None Alexander Black ... Blanche Bayliss, William Courtenay, Chauncey D... The adventures of a female reporter in the 1890s. 5.9 154 NaN NaN NaN NaN 1.0 2.0
1 tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang 1906 12/26/06 Biography, Crime, Drama 70 Australia None Charles Tait ... Elizabeth Tait, John Tait, Norman Campbell, Be... True story of notorious Australian outlaw Ned ... 6.1 589 $2,250 NaN NaN NaN 7.0 7.0
2 tt0001892 Den sorte drøm Den sorte drøm 1911 8/19/11 Drama 53 Germany, Denmark NaN Urban Gad ... Asta Nielsen, Valdemar Psilander, Gunnar Helse... Two men of high rank are both wooing the beaut... 5.8 188 NaN NaN NaN NaN 5.0 2.0
3 tt0002101 Cleopatra Cleopatra 1912 11/13/12 Drama, History 100 USA English Charles L. Gaskill ... Helen Gardner, Pearl Sindelar, Miss Fielding, ... The fabled queen of Egypt's affair with Roman ... 5.2 446 $45,000 NaN NaN NaN 25.0 3.0
4 tt0002130 L'Inferno L'Inferno 1911 3/6/11 Adventure, Drama, Fantasy 68 Italy Italian Francesco Bertolini, Adolfo Padovan ... Salvatore Papa, Arturo Pirovano, Giuseppe de L... Loosely adapted from Dante's Divine Comedy and... 7.0 2237 NaN NaN NaN NaN 31.0 14.0

5 rows × 22 columns

Data Tidying

Although we have our data loaded into Python, we cannot immediately jump into analyzing it just yet. This is because the data is not yet in an optimally usable state. In other words, analyzing these two datasets right now, even with our advanced data science methods, would be very unwieldy and challenging, due to both the number of datasets and the amount of data in each. Thus, we will work on minimizing both of these factors.

Joining Datasets

While it's cool that we have two separate tables of data, cross referencing multiple datasets during data analysis is an annoyance at best, and could result in runtime errors, convoluted code, or incorrect conclusions at worst. Therefore, the first thing we will do is merge these two datasets into one.

In [83]:
data = imdb_names.join(imdb_ratings.set_index("imdb_title_id"), on="imdb_title_id") #The join command allows
#us to do this easily; the index references the column used to determine whether two rows are the same
#(i.e. if a row from imdb names and a row from imdb ratings both have the same title id, they are considered
#the same row and are merged)

data.head()
Out[83]:
imdb_title_id title original_title year date_published genre duration country language director ... females_30age_avg_vote females_30age_votes females_45age_avg_vote females_45age_votes top1000_voters_rating top1000_voters_votes us_voters_rating us_voters_votes non_us_voters_rating non_us_voters_votes
0 tt0000009 Miss Jerry Miss Jerry 1894 1894-10-09 Romance 45 USA None Alexander Black ... 5.7 13.0 4.5 4.0 5.7 34.0 6.4 51.0 6.0 70.0
1 tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang 1906 12/26/06 Biography, Crime, Drama 70 Australia None Charles Tait ... 6.2 23.0 6.6 14.0 6.4 66.0 6.0 96.0 6.2 331.0
2 tt0001892 Den sorte drøm Den sorte drøm 1911 8/19/11 Drama 53 Germany, Denmark NaN Urban Gad ... 5.8 4.0 6.8 7.0 5.4 32.0 6.2 31.0 5.9 123.0
3 tt0002101 Cleopatra Cleopatra 1912 11/13/12 Drama, History 100 USA English Charles L. Gaskill ... 5.5 14.0 6.1 21.0 4.9 57.0 5.5 207.0 4.7 105.0
4 tt0002130 L'Inferno L'Inferno 1911 3/6/11 Adventure, Drama, Fantasy 68 Italy Italian Francesco Bertolini, Adolfo Padovan ... 7.3 82.0 7.4 77.0 6.9 139.0 7.0 488.0 7.0 1166.0

5 rows × 70 columns

One important part of data science is always double checking your data for inaccuracies, even if your source is a trusted one. Upon closer examination of the data, I found that the formatting of some of the movie titles was not to my liking, so to make things more comprehensible, I invoked a replace command that changes some of the titles to more widely recognized ones.

In [84]:
data["original_title"].replace({"Iron Man Three" : "Iron Man 3",
                                "Birds of Prey: And the Fantabulous Emancipation of One Harley Quinn" : "Birds of Prey"},
                               inplace=True)

Column Dropping

Now that we have all our data in one table, you may have noticed that we have a lot of columns - 70 to be exact. That's a lot! It's way more than what we actually need! In order to keep our dataset manageable and reduce runtimes, we will remove any columns whose data we won't need during our analysis. First, let's get a look at what each column is for so that we have a good idea of what we need to eliminate.

In [85]:
data.columns #Returns list of all column names
Out[85]:
Index(['imdb_title_id', 'title', 'original_title', 'year', 'date_published',
       'genre', 'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics', 'weighted_average_vote',
       'total_votes', 'mean_vote', 'median_vote', 'votes_10', 'votes_9',
       'votes_8', 'votes_7', 'votes_6', 'votes_5', 'votes_4', 'votes_3',
       'votes_2', 'votes_1', 'allgenders_0age_avg_vote',
       'allgenders_0age_votes', 'allgenders_18age_avg_vote',
       'allgenders_18age_votes', 'allgenders_30age_avg_vote',
       'allgenders_30age_votes', 'allgenders_45age_avg_vote',
       'allgenders_45age_votes', 'males_allages_avg_vote',
       'males_allages_votes', 'males_0age_avg_vote', 'males_0age_votes',
       'males_18age_avg_vote', 'males_18age_votes', 'males_30age_avg_vote',
       'males_30age_votes', 'males_45age_avg_vote', 'males_45age_votes',
       'females_allages_avg_vote', 'females_allages_votes',
       'females_0age_avg_vote', 'females_0age_votes', 'females_18age_avg_vote',
       'females_18age_votes', 'females_30age_avg_vote', 'females_30age_votes',
       'females_45age_avg_vote', 'females_45age_votes',
       'top1000_voters_rating', 'top1000_voters_votes', 'us_voters_rating',
       'us_voters_votes', 'non_us_voters_rating', 'non_us_voters_votes'],
      dtype='object')

Now that we can see what columns we have, it's time to start removing the unnecessary ones. For this analysis, I will keep basic identification, box office earnings, and review score data, and remove any column that doesn't serve one of those purposes. We will drop the columns using the drop command. I will also take this time to rename some of the columns.

In [86]:
data.drop(columns=['imdb_title_id', 'title', 'date_published',
       'duration', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'metascore',
       'reviews_from_users', 'reviews_from_critics', 'weighted_average_vote',
       'total_votes', 'median_vote', 'votes_10', 'votes_9',
       'votes_8', 'votes_7', 'votes_6', 'votes_5', 'votes_4', 'votes_3',
       'votes_2', 'votes_1', 'allgenders_0age_avg_vote',
       'allgenders_0age_votes', 'allgenders_18age_avg_vote',
       'allgenders_18age_votes', 'allgenders_30age_avg_vote',
       'allgenders_30age_votes', 'allgenders_45age_avg_vote',
       'allgenders_45age_votes', 'males_allages_avg_vote',
       'males_allages_votes', 'males_0age_avg_vote', 'males_0age_votes',
       'males_18age_avg_vote', 'males_18age_votes', 'males_30age_avg_vote',
       'males_30age_votes', 'males_45age_avg_vote', 'males_45age_votes',
       'females_allages_avg_vote', 'females_allages_votes',
       'females_0age_avg_vote', 'females_0age_votes', 'females_18age_avg_vote',
       'females_18age_votes', 'females_30age_avg_vote', 'females_30age_votes',
       'females_45age_avg_vote', 'females_45age_votes',
       'top1000_voters_rating', 'top1000_voters_votes', 'us_voters_rating',
       'us_voters_votes', 'non_us_voters_rating', 'non_us_voters_votes'], inplace=True) #Drops columns

data.rename(columns=
            {"original_title" : "title",
             "usa_gross_income" : "us_income",
             "worlwide_gross_income" : "glo_income"}, inplace=True) #Some quick column renaming 

data.head()
Out[86]:
title year genre country budget us_income glo_income mean_vote
0 Miss Jerry 1894 Romance USA NaN NaN NaN 5.9
1 The Story of the Kelly Gang 1906 Biography, Crime, Drama Australia $2,250 NaN NaN 6.3
2 Den sorte drøm 1911 Drama Germany, Denmark NaN NaN NaN 6.0
3 Cleopatra 1912 Drama, History USA $45,000 NaN NaN 5.3
4 L'Inferno 1911 Adventure, Drama, Fantasy Italy NaN NaN NaN 6.9

Fixing the Years

Later in this tutorial, we will need to filter our data by year. However, to filter by year, we need the years in a numeric format, like floats, as opposed to the current format of strings, so we will work on this conversion first.

In [87]:
ind = [] #Will store indices of movies that have a nonnumeric year

#Notes indices of rows that have nonnumeric years and adds them to the ind array
for index, row in data.iterrows():
    if not str(row["year"]).isnumeric() and str(row["year"]) != "nan":
        ind.append(index)

data.drop(index=ind, inplace=True) #Drops movies that have nonnumeric year

data = data.astype({"year" : "float"}, copy=False) #Converts years to float, a numeric format, so we can properly 
#filter them later
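
For reference, pandas also has a more compact idiom for this kind of conversion: pd.to_numeric with errors="coerce" turns anything that can't be parsed into NaN in a single call. Here is a tiny sketch on a few hypothetical values (just an alternative worth knowing about, not the approach used above):

example_years = pd.Series(["1999", "2005", "TV Movie 2019", None]) # hypothetical values, not from our dataset
print(pd.to_numeric(example_years, errors="coerce")) # unparseable entries become NaN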

Dropping Rows

Now that we've minimized the number of columns, it's time to do the same for the rows. However, we can't drop the rows we don't want in one command, as we would need to list the indices of all ~85,000 movies we want to drop, which would be unreasonable. Instead, we will essentially do the opposite operation - we will specify which movies we want to keep, via the loc command and various boolean conditions. Specifically, we will only keep movies that are superhero movies (movies with superhero titles from the following franchises: Marvel Cinematic Universe, DC Extended Universe, Fox X-Men, Sony Spider-Man, Blade, The Incredibles, the Nolan Dark Knight trilogy, the Tim Burton Batman movies, and the Salkind Superman movies) or Japanese animated movies (movies made in Japan with "Animation" listed as a genre).

Additionally, a lot of our superhero movies use very common titles, and consequently there are many movies that share titles with the superhero movies we are interested in but are not actually superhero movies, so I will use this opportunity to remove those rows via their indices. The indices were found through a combination of filtering titles in the original CSV file and filtering rows in the Pandas DataFrame by title. Check out the links I provided if you would like more information on that process.
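
To illustrate what that title-based inspection might look like, here is a rough sketch that prints every row sharing a title with a few of our superhero films, so that impostors (wrong year, wrong country) stand out. The three titles below are just examples; the indices dropped in the next cell came from repeating this kind of check for every shared title.

for t in ["Batman", "Superman", "Venom"]: # example titles shared by multiple unrelated movies
    print(data[data["title"] == t][["title", "year", "country", "genre"]], "\n")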

In [88]:
data.drop(index=[63248, 46365, 20689, 15275, 6886, 1113, 30382, 2266, 15912, 6008, 16288, 40943, 42466, 4401], 
          inplace=True) #These are indices of rows whose movies share titles with the ones we
#want to study, but are not actually superhero movies.

superheroes = ["Iron Man", "The Incredible Hulk", "Iron Man 2", "Thor", "Captain America: The First Avenger",
         "The Avengers", "Iron Man 3", "Thor: The Dark World", "Captain America: The Winter Soldier",
         "Guardians of the Galaxy", "Avengers: Age of Ultron", "Ant-Man", "Captain America: Civil War", 
          "Doctor Strange", "Guardians of the Galaxy Vol. 2", "Spider-Man: Homecoming", "Thor: Ragnarok",
          "Black Panther", "Avengers: Infinity War", "Ant-Man and the Wasp", "Captain Marvel", "Avengers: Endgame",
            "Spider-Man: Far from Home", "Man of Steel", "Batman v Superman: Dawn of Justice", "Suicide Squad",
            "Wonder Woman", "Justice League", "Aquaman", "Shazam!", "Birds of Prey", "Spider-Man", "Spider-Man 2",
            "Spider-Man 3", "The Amazing Spider-Man", "The Amazing Spider-Man 2", "Spider-Man: Into the Spider-Verse",
            "Venom", "X-Men", "X2", "X-Men: The Last Stand", "X-Men Origins: Wolverine", "The Wolverine", "Logan",
            "X-Men: First Class", "X-Men: Days of Future Past", "X-Men: Apocalypse", "Dark Phoenix", "Deadpool",
            "Deadpool 2", "The New Mutants", "Batman Begins", "The Dark Knight", "The Dark Knight Rises",
            "The Incredibles", "Blade", "Superman", "Superman II", "Superman III", "Supergirl", 
            "Superman IV: The Quest for Peace", "Superman Returns", "Batman", "Batman Returns", "Batman Forever", 
            "Batman & Robin"
             ] #List of every superhero movie

data = data.loc[(data["title"].isin(superheroes)) | 
         ( (data["country"] == "Japan") & (data["genre"].str.contains("Animation")) )] #Filters movies

# A valid superhero movie is defined as a movie with a title from the list above
# A valid Japanese animated movie is defined as a movie with a country of Japan that has Animation listed as one of 
# the genres

data.head()
Out[88]:
title year genre country budget us_income glo_income mean_vote
9546 Hakuja den 1958.0 Animation, Family, Fantasy Japan NaN NaN NaN 6.6
10120 Shônen Sarutobi Sasuke 1959.0 Animation, Family, Adventure Japan NaN NaN NaN 7.1
10473 Saiyûki 1960.0 Animation, Adventure, Family Japan NaN NaN NaN 6.2
11704 Wanpaku ôji no orochi taiji 1963.0 Animation, Family, Adventure Japan NaN NaN NaN 7.0
12221 Garibaa no uchû ryokô 1965.0 Animation, Adventure, Fantasy Japan NaN NaN NaN 6.2

Money Conversion

One thing to note about the money columns (budget, US Income, and Worldwide Income) is that some of the Japanese movies have their amounts listed in Yen rather than Dollars. Since we need everything in the same currency, we will now go about converting them all into US Dollars. We will use a conversion rate of 109.36 Yen per USD (source; they list it as roughly 0.0091 USD per Yen, which is equivalent). Note that for the simplicity of analysis, we will not be taking inflation into account, but if you would like to read more about inflation, go here for Japanese inflation or here for US inflation. Also, most of the dollar amounts have the actual dollar sign right next to them, which would later prevent the program from reading monetary information as numbers, so we will also use this opportunity to fix that as well.

In [89]:
def encode_money(bud): # Converts yen amounts to dollars and turns dollar strings into numbers
    if type(bud) == str:
        if bud[:3] == "JPY":
            return int(bud[4:]) / 109.36
        else:
            return int(bud[1:].replace(",", ""))
    else:
        return bud

data["budget"] = [encode_money(row["budget"]) for index, row in data.iterrows()]
data["us_income"] = [encode_money(row["us_income"]) for index, row in data.iterrows()]
data["glo_income"] = [encode_money(row["glo_income"]) for index, row in data.iterrows()]


# Here, we iterate through the money columns: if a value is a string beginning with "JPY", it is a yen
# amount and gets converted to USD; otherwise, the dollar string is converted to a plain number

Miscellaneous Column Manipulation

Here, we will just do some last minute edits regarding the columns to prepare ourselves for data analysis.

In [90]:
data["Genre"] = [1 if row["country"] != "Japan" else 0 for index, row in data.iterrows()] # Encodes Genre column with
# 0 for superhero movies and 1 for Japanese animated films

data.drop(columns=["genre", "country"], inplace=True) # Drops old genre and country columns. With above encoding they
# are no longer necessary for film distinguishment

data.reset_index(drop=True, inplace=True) # We reset the indicies to make EDA easier later

data.head()
Out[90]:
title year budget us_income glo_income mean_vote Genre
0 Hakuja den 1958.0 NaN NaN NaN 6.6 0
1 Shônen Sarutobi Sasuke 1959.0 NaN NaN NaN 7.1 0
2 Saiyûki 1960.0 NaN NaN NaN 6.2 0
3 Wanpaku ôji no orochi taiji 1963.0 NaN NaN NaN 7.0 0
4 Garibaa no uchû ryokô 1965.0 NaN NaN NaN 6.2 0

An Aside Regarding Imputation

Those of you more well versed in the Data Science Pipeline might have noticed that some movies, especially some of the older ones, have missing data, and may be wondering how we will impute, or fill in, that data. However, given the circumstances of this data, I do not believe we can accurately impute the missing values. Unlike data on, say, athletes or health crises, similar movies made during the same year have no guarantee of having similar budgets, incomes, or ratings. In order to make our analysis as accurate as possible, we will avoid any imputation and simply exclude movies with missing data from the relevant analyses. This is called a Complete Case Analysis, and the reason it results in unbiased data is detailed in this article. (Our data is most likely MCAR, as I was able to find both niche and popular movies that lacked some data.)
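
If you'd like to see exactly how much data a complete case analysis will discard, a quick sanity check like the sketch below (not part of the pipeline proper) counts the missing values per column and the non-null values within each genre group:

print(data.isnull().sum()) # number of missing values in each column
print(data.groupby("Genre").count()) # non-null counts per column, split by genre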

Exploratory Data Analysis

After all that work, we can now proceed to the actual data analysis, where we can view the relationships between certain variables in order to draw our own conclusions about our data. We will determine which set of movies is more successful based on analyses of profit and review scores, approaching these topics with basic data analysis and hypothesis testing.

Analyzing Rating Scores via Basic Data Analysis

First, we will analyze the ratings scores of both superhero and Japanese animated films, which we will do by constructing pie charts of score breakdowns of both categories.

In [91]:
# Create bins that dictate pie slices. A movie with mean_vote v is a member of bin i if i ≤ v < i+0.5 is true
bins_anime = {}
bins_super = {}
for i in range(3, 10):
    bins_anime[i] = 0
    bins_anime[i+0.5] = 0
    bins_super[i] = 0
    bins_super[i+0.5] = 0
    
# Place movies in bins. The dictionary tells you how many movies are in each bin
for index, row in data[data["mean_vote"].notnull()][["Genre", "mean_vote"]].iterrows():
    bins = bins_anime if row[0] == 0 else bins_super
    for b in bins:
        if row[1] >= b and row[1] < b + 0.5:
            bins[b] += 1
            
bins_anime
Out[91]:
{3: 0,
 3.5: 1,
 4: 0,
 4.5: 4,
 5: 4,
 5.5: 29,
 6: 62,
 6.5: 139,
 7: 225,
 7.5: 143,
 8: 54,
 8.5: 4,
 9: 1,
 9.5: 0}
In [92]:
bins_super
Out[92]:
{3: 0,
 3.5: 2,
 4: 0,
 4.5: 1,
 5: 1,
 5.5: 3,
 6: 4,
 6.5: 9,
 7: 22,
 7.5: 10,
 8: 11,
 8.5: 3,
 9: 0,
 9.5: 0}

Looking at these dictionaries, you may notice that some of the bins have no movies in them. These empty bins will clutter our pie plot, so we will first get rid of them before plotting.

In [93]:
# Remove empty bins from the Japanese animated dictionary
remove = []
for key, value in bins_anime.items():
    if value == 0:
        remove.append(key)
for key in remove:
    bins_anime.pop(key)

# Remove empty bins from the superhero dictionary
remove = []
for key, value in bins_super.items():
    if value == 0:
        remove.append(key)
for key in remove:
    bins_super.pop(key)
        
# Now let's make the pie plots
fig, (a, b) = plt.subplots(1, 2)
axs= [a, b]
for index, bins in zip(range(0, 2), [bins_anime, bins_super]):
    #Construct graph and labels
    axs[index].pie(bins.values(), labels=bins.keys(), autopct='%1.1f%%') #Plots pie plot
    axs[index].set_title(("Japanese Animated" if index % 2 == 0 else "Superhero") + " Movie Score Breakdown")
    plt.gcf().set_size_inches(15, 15)

From the graphs, we see that Japanese animated movies have a higher percentage of movies rated from 6 to 7.9, with one outlier scoring 9+, while superhero movies have a higher percentage of movies rated from 3.5 to 5.9 and from 8 to 8.9. What this implies is that while superhero films can reach higher ratings and can be of higher quality, that is not a guarantee. Meanwhile, while Japanese animated films do not reach the highest highs of the superhero genre, they also don't frequently sink to the lowest lows, indicating that Japanese animation has greater score consistency overall.

However, the bins we used were arbitrarily decided by us; perhaps we accidentally introduced bias? Let's construct some histograms with autogenerated bins just to be safe.

In [94]:
# Select subset of data where mean_vote data exists and take mean vote and genre data 
voting = data[data["mean_vote"].notnull()][["mean_vote", "Genre"]]

#Construct histograms
hist0 = voting[voting["Genre"] == 0].hist("mean_vote")
hist1 = voting[voting["Genre"] == 1].hist("mean_vote")

for x in hist0[0]: #Label Japanese Animated Graph
    x.set_title("Japanese Animated Movie Score Breakdown")
    x.set_xlabel("Mean Rating")
    x.set_ylabel("Freq")
    
for x in hist1[0]: #Label Superhero Graph
    x.set_title("Superhero Movie Score Breakdown")
    x.set_xlabel("Mean Rating")
    x.set_ylabel("Freq")
    
#Display Graphs
hist0
hist1
Out[94]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7ff9bad0c080>]],
      dtype=object)

Upon looking at these histograms, we can reach the same conclusions as before: both genres of movies peak at about a score of 7, and while superhero movies have a higher proportion of movies that have scores > 8 as compared to Japanese animated movies, they also have a higher proportion of movies with scores < 6, indicating that Japanese animated movies have better consistency overall with their rating scores. Since we reached the same conclusions, we can assume that our bin construction method from earlier was not biased!

Since we now have the histograms here anyway, let's talk about graph shape. The Japanese animated movie histogram has a symmetric shape, with the left and right sides of the graph being approximately mirror images, indicating that the proportion of poorly rated Japanese animated movies is approximately equal to the proportion of highly rated ones. With both tails quickly approaching 0, we can reaffirm our earlier hypothesis that Japanese animated movies have high consistency. Meanwhile, the superhero movie rating histogram has a left skewed shape, given by how the left side of the graph is much lower (in terms of frequency) than the right side, indicating that a handful of low-scoring outliers pull the distribution away from symmetry. As a next step, we could test whether the distribution is truly normal, which could hypothetically be done with SciPy's normaltest function.
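
For the curious, here is a minimal sketch of what such a normality test could look like using scipy.stats.normaltest, which tests the null hypothesis that a sample was drawn from a normal distribution (a small p-value is evidence against normality):

from scipy import stats

votes = data[data["mean_vote"].notnull()][["mean_vote", "Genre"]]
for genre, label in [(0, "Japanese Animated"), (1, "Superhero")]:
    stat, p = stats.normaltest(votes[votes["Genre"] == genre]["mean_vote"])
    print(label, "normaltest p-value:", p)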

Analyzing Box Office Performance via Basic Data Analysis

Next, let's analyze the box office performance of both superhero and Japanese animated films by creating scatterplots that plot budget, USA gross income, and worldwide gross income over time, to see which set of films uses and makes the most money.

In [95]:
# First we take different subsets of our data for each scatterplot. Each data subset observes a different dependent
# variable, and alternates between Superhero and Japanese data. Each data subset also includes the associated year
# data for plotting, and datapoints that do not have the data of the particular dependent variable we're observing
# are ignored
dfs = [
    data[data["budget"].notnull()][data["Genre"] == 0][["year", "budget"]],
    data[data["budget"].notnull()][data["Genre"] == 1][["year", "budget"]],
    data[data["us_income"].notnull()][data["Genre"] == 0][["year", "us_income"]],
    data[data["us_income"].notnull()][data["Genre"] == 1][["year", "us_income"]],
    data[data["glo_income"].notnull()][data["Genre"] == 0][["year", "glo_income"]],
    data[data["glo_income"].notnull()][data["Genre"] == 1][["year", "glo_income"]],
]

#Create scatterplots
fig, ((a, b), (c, d), (e, f)) = plt.subplots(3, 2)
axs= [a, b, c, d, e, f]
for index, dep_var, df in zip(range(0, 6), 
                            ["budget", "budget", "us_income", "us_income", "glo_income", "glo_income"], dfs):
    #Constructs trendline
    x = range(int(min(df["year"])), 2020)
    y = np.poly1d(np.polyfit(df["year"], df[dep_var], 1))(x) 
    
    #Construct graph and labels
    plt.gcf().set_size_inches(15, 15)
    axs[index].scatter(x=df["year"], y=df[dep_var]) #Plots scatterplot
    axs[index].set_title(("Japanese Animated" if index % 2 == 0 else "Superhero") + " Movie " + dep_var + 
                         " Over Time " + "(n=" + str(len(df.index)) + ")"
    )
    axs[index].set_xlabel("Year")
    axs[index].set_ylabel(dep_var)
    
    #Plots trendline
    axs[index].plot(x, y, c="r")

When it comes to the financial side, it seems as though superhero movies have Japanese animation beat. Trendlines for superhero movies have steeper slopes and higher scales on the Y axis than their Japanese counterparts, indicating not only that superhero movies have higher budgets and perform better at the box office, but also that those monetary amounts have been increasing at a higher rate over time. Thus, superhero movies have seen greater monetary success over time compared to Japanese animated movies. Keep this discussion in mind; we'll come back to it later.
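
If you'd rather put rough numbers on the "steeper slope" claim than judge it by eye, a small sketch like the following (assuming the dfs list from the cell above is still in scope) prints the fitted slope of each trendline in dollars per year:

labels = ["Japanese Animated", "Superhero"]
for index, dep_var, df in zip(range(0, 6),
                              ["budget", "budget", "us_income", "us_income", "glo_income", "glo_income"], dfs):
    slope = np.polyfit(df["year"], df[dep_var], 1)[0] # leading coefficient of the degree-1 fit
    print(f"{labels[index % 2]} {dep_var} slope: {slope:,.0f} dollars per year")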

However, this isn't the end of the monetary analysis - perhaps Japanese animated movies use their limited budgets more efficiently? To find out, let's compare each movie's budget to its global income.

In [96]:
# Select subset of data where budget and Global Income data exists and select budget, Global Income, and movie 
# category data
bud_eff = data[(data["budget"].notnull() & data["glo_income"].notnull())][["budget", "glo_income", "Genre"]]

# We will create a new column called Budget Efficiency, defined as a movie's global income divided by its budget
bud_eff["Bud Eff"] = [row["glo_income"] / row["budget"] for index, row in bud_eff.iterrows()]

#Construct histograms
hist0 = bud_eff[bud_eff["Genre"] == 0].hist("Bud Eff", bins=25)
hist1 = bud_eff[bud_eff["Genre"] == 1].hist("Bud Eff")

for x in hist0[0]: #Label Japanese Animated Graph
    x.set_title("Japanese Animated Budget Efficiency")
    x.set_xlabel("Budget Efficiency")
    x.set_ylabel("Freq")
    
for x in hist1[0]: #Label Superhero Graph
    x.set_title("Superhero Budget Efficiency")
    x.set_xlabel("Budget Efficiency")
    x.set_ylabel("Freq")
    
#Display Graphs
hist0
hist1
Out[96]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7ff9bb52c240>]],
      dtype=object)

Looking at these histograms, it quickly becomes evident that Japanese animated movies have better budget efficiency than superhero movies, given that the budget efficiencies of the former lie mostly between 0 and 25, while the latter mostly vary from 1 to 5. Japanese animated movies even have some high outliers at around 100 and 400, which superhero movies do not. Clearly, while superhero movies can definitely generate a higher profit, this is, in large part, thanks to their significantly higher budgets, while Japanese animated movies can turn a more modest profit with a lot less upfront cost. As a next step, we could attempt to standardize the data to obtain more normal looking histograms, the details of which are outlined in this article.
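
As a sketch of that idea (an assumption on my part, since we stop at the raw histograms here): a plain z-score standardization rescales the budget efficiency values but preserves the histogram's shape, so a log transform is also shown, which is what would actually tame the right skew.

bud_eff["bud_eff_z"] = (bud_eff["Bud Eff"] - bud_eff["Bud Eff"].mean()) / bud_eff["Bud Eff"].std() # rescaled, same shape
bud_eff["bud_eff_log"] = np.log(bud_eff["Bud Eff"]) # changes the shape; valid because all efficiencies are positive

bud_eff[bud_eff["Genre"] == 0].hist("bud_eff_log", bins=25)
bud_eff[bud_eff["Genre"] == 1].hist("bud_eff_log")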

Verifying Box Office Performance with Hypothesis Testing

Finally, we will use hypothesis tests to verify our earlier conclusions regarding budget and income for superhero and Japanese animated films. In a hypothesis test, we want to check whether some hypothesis about our data holds, so we assume the opposite (the null hypothesis, H0) and run tests on the data. We also keep a standard alpha value in mind, which represents the highest probability of a Type I Error (falsely rejecting the null hypothesis) that we deem acceptable. At the end of the test, we obtain a P value, which represents the probability of seeing results at least as extreme as ours if H0 were true. If this P value is less than our alpha value, the risk of falsely rejecting H0 is lower than what we deem reasonable, so we can reject H0 in favor of our alternative hypothesis. The topic of our test will be formally proving our earlier conclusions regarding budget, US Income, and Global Income; specifically, whether genre (superhero vs. Japanese animated) affects these variables. Let's start by stating our hypotheses:

H_0: There is no linear relationship between year/genre and budget, US Income, or Global Income

H_A: There is a linear relationship between year/genre and budget, US Income, and Global Income

Let's take a look at some scatterplots to at least verify that a linear relationship appears plausible.

In [97]:
# Similar to earlier, we take subsets of our data based on our dependent variable, only this time, all data will be
# graphed in 1 plot, so we don't need to distinguish by genre
dfs = [
    data[data["budget"].notnull()][["year", "budget"]],
    data[data["us_income"].notnull()][["year", "us_income"]],
    data[data["glo_income"].notnull()][["year", "glo_income"]]
]

#Create scatterplots
fig, ((a, b), (c, d)) = plt.subplots(2, 2)
axs= [a, b, c]
for index, dep_var, df in zip(range(0, 3), ["budget", "us_income", "glo_income",], dfs):
    #Constructs trendline
    x = range(int(min(df["year"])), 2020)
    y = np.poly1d(np.polyfit(df["year"], df[dep_var], 1))(x)
    
    #Construct graph and labels
    plt.gcf().set_size_inches(15, 13)
    axs[index].scatter(x=df["year"], y=df[dep_var]) #Plots scatterplot
    axs[index].set_title(dep_var + " Over Time " + "(n=" + str(len(df.index)) + ")")
    axs[index].set_xlabel("Year")
    axs[index].set_ylabel(dep_var)
    
    #Plots trendline
    axs[index].plot(x, y, c="r")

As all 3 scatterplots of all data show some semblance of a linear relationship, we can assume that some linear relationship most likely exists between year and budget, US Income, and Global Income.

Now normally, we would need to add a dummy column to our dataframe. A dummy column is a binary column, set to either 0 or 1 based on the value of a certain categorical variable. In this case, that variable is genre, with 0 representing a Japanese animated movie and 1 representing a superhero movie. However, our Genre column already acts as a dummy column, so we can skip this step.
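
For readers who haven't built one before, here is a tiny, self-contained sketch of how a dummy column could be created with pandas' get_dummies on a hypothetical category column (again, we don't need this step ourselves, since Genre is already a 0/1 column):

example = pd.DataFrame({"category": ["superhero", "japanese_animated", "superhero"]}) # hypothetical data
print(pd.get_dummies(example["category"], drop_first=True, dtype=int)) # one 0/1 column, since there are only two categories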

The final thing we need to do before completing the hypothesis test is to create a new column, year_genre, which, for a given movie, will represent the year multiplied by the genre value. This will be the final independent variable for our linear model.

In [98]:
data["year_genre"] = data["year"] * data["Genre"]

Now we can conduct the actual hypothesis test for all three of our dependent variables. Do not be alarmed by the overload of output; most of the information is extraneous. Instead, focus on the coefficients and PR(>F) (the P value).

In [99]:
# Budget Hypothesis Test Results
budget_test = ols("budget ~ year + Genre + year_genre",data).fit() #Fits model
print(budget_test.summary()) #Prints OLS Regression Summary
print("\nANOVA Test\n")
print(anova_lm(budget_test), "\n") #Prints ANOVA Variance Summary for Linear Regression

# US Income Hypothesis Test Results
us_test = ols("us_income ~ year + Genre + year_genre",data).fit() #Fits model
print(us_test.summary()) #Prints OLS Regression Summary
print("\nANOVA Test\n")
print(anova_lm(us_test), "\n") #Prints ANOVA Variance Summary for Linear Regression

# Global Income Hypothesis Test Results
glo_test = ols("glo_income ~ year + Genre + year_genre",data).fit() #Fits model
print(glo_test.summary()) #Prints OLS Regression Summary
print("\nANOVA Test\n")
print(anova_lm(glo_test)) #Prints ANOVA Variance Summary for Linear Regression
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 budget   R-squared:                       0.679
Model:                            OLS   Adj. R-squared:                  0.670
Method:                 Least Squares   F-statistic:                     74.17
Date:                Mon, 17 May 2021   Prob (F-statistic):           7.90e-26
Time:                        15:04:15   Log-Likelihood:                -2088.6
No. Observations:                 109   AIC:                             4185.
Df Residuals:                     105   BIC:                             4196.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -7.029e+08   1.38e+09     -0.509      0.612   -3.44e+09    2.03e+09
year        3.592e+05   6.89e+05      0.521      0.603   -1.01e+06    1.73e+06
Genre      -6.568e+09   1.84e+09     -3.563      0.001   -1.02e+10   -2.91e+09
year_genre  3.336e+06   9.19e+05      3.629      0.000    1.51e+06    5.16e+06
==============================================================================
Omnibus:                        8.923   Durbin-Watson:                   1.866
Prob(Omnibus):                  0.012   Jarque-Bera (JB):               17.139
Skew:                           0.213   Prob(JB):                     0.000190
Kurtosis:                       4.895   Cond. No.                     1.05e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.05e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

ANOVA Test

               df        sum_sq       mean_sq           F        PR(>F)
year          1.0  2.096592e+17  2.096592e+17   78.372101  2.275405e-14
Genre         1.0  3.503278e+17  3.503278e+17  130.955007  3.579922e-20
year_genre    1.0  3.523767e+16  3.523767e+16   13.172090  4.408209e-04
Residual    105.0  2.808935e+17  2.675176e+15         NaN           NaN 

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              us_income   R-squared:                       0.683
Model:                            OLS   Adj. R-squared:                  0.678
Method:                 Least Squares   F-statistic:                     130.7
Date:                Mon, 17 May 2021   Prob (F-statistic):           3.49e-45
Time:                        15:04:15   Log-Likelihood:                -3672.4
No. Observations:                 186   AIC:                             7353.
Df Residuals:                     182   BIC:                             7366.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   1.645e+08   1.61e+09      0.102      0.919   -3.01e+09    3.34e+09
year       -8.059e+04   8.01e+05     -0.101      0.920   -1.66e+06     1.5e+06
Genre      -1.273e+10    2.7e+09     -4.714      0.000   -1.81e+10    -7.4e+09
year_genre  6.469e+06   1.34e+06      4.812      0.000    3.82e+06    9.12e+06
==============================================================================
Omnibus:                       88.830   Durbin-Watson:                   1.751
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              882.073
Skew:                           1.498   Prob(JB):                    2.89e-192
Kurtosis:                      13.239   Cond. No.                     9.32e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.32e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

ANOVA Test

               df        sum_sq       mean_sq           F        PR(>F)
year          1.0  1.524752e+17  1.524752e+17   18.069655  3.398132e-05
Genre         1.0  2.961759e+18  2.961759e+18  350.994592  2.486135e-44
year_genre    1.0  1.953989e+17  1.953989e+17   23.156490  3.124938e-06
Residual    182.0  1.535751e+18  8.438191e+15         NaN           NaN 

                            OLS Regression Results                            
==============================================================================
Dep. Variable:             glo_income   R-squared:                       0.699
Model:                            OLS   Adj. R-squared:                  0.697
Method:                 Least Squares   F-statistic:                     334.6
Date:                Mon, 17 May 2021   Prob (F-statistic):          2.96e-112
Time:                        15:04:15   Log-Likelihood:                -8867.8
No. Observations:                 436   AIC:                         1.774e+04
Df Residuals:                     432   BIC:                         1.776e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   1.161e+08   1.82e+09      0.064      0.949   -3.45e+09    3.69e+09
year        -4.88e+04   9.04e+05     -0.054      0.957   -1.83e+06    1.73e+06
Genre      -4.395e+10   4.31e+09    -10.194      0.000   -5.24e+10   -3.55e+10
year_genre   2.22e+07   2.15e+06     10.345      0.000     1.8e+07    2.64e+07
==============================================================================
Omnibus:                      444.135   Durbin-Watson:                   1.568
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            50349.683
Skew:                           4.036   Prob(JB):                         0.00
Kurtosis:                      55.023   Cond. No.                     1.13e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

ANOVA Test

               df        sum_sq       mean_sq           F         PR(>F)
year          1.0  9.063103e+17  9.063103e+17   33.072311   1.682004e-08
Genre         1.0  2.367075e+19  2.367075e+19  863.773096  4.263308e-105
year_genre    1.0  2.932570e+18  2.932570e+18  107.012874   1.479632e-22
Residual    432.0  1.183848e+19  2.740390e+16         NaN            NaN

To complete this statistical test, we will compare alpha to the P values. Alpha is the highest probability of a Type I Error (falsely rejecting H0) that we deem acceptable; since no alpha value was specified for this project, I will set it to 0.05, the standard value used in statistics research, which gives us a reasonable benchmark. The P values are the PR(>F) values in the ANOVA tables. Since every P value is well below 0.05, our probability of falsely rejecting H0 is below our benchmark, so we can safely reject H0 in favor of HA.
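
Rather than reading each PR(>F) value off the tables by eye, a short sketch like this pulls them out of the ANOVA results and checks them against alpha programmatically:

alpha = 0.05
for name, model in [("budget", budget_test), ("us_income", us_test), ("glo_income", glo_test)]:
    p_values = anova_lm(model)["PR(>F)"].dropna() # the Residual row has no p-value, so drop it
    print(name, "- all parameters significant:", (p_values < alpha).all())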

Essentially, we can safely conclude that there exist linear models based on year and genre that predict movie budget, US Income, and Global Income. They are given by the following:

$$\text{Budget} = (-7.029 * 10^8) + (3.592 * 10^5) * x_0 + (-6.568 * 10^9) * x_1 + (3.336 * 10^6) * x_2$$

$$\text{US Income} = (1.645 * 10^8) + (-8.059 * 10^4) * x_0 + (-1.273 * 10^{10}) * x_1 + (6.469 * 10^6) * x_2$$

$$\text{Global Income} = (1.161 * 10^8) + (-4.88 * 10^4) * x_0 + (-4.395 * 10^{10}) * x_1 + (2.22 * 10^7) * x_2$$

$$x_0 = \text{year} \qquad x_1 = \text{Genre} \qquad x_2 = \text{year} * \text{Genre}$$

As one final test of our linear models, let's look at violinplots of the residuals. Violinplots are a combination of boxplots and area plots, and can be applied over multiple categories (here, bins of years). A residual is defined as the actual y value minus the y value predicted by the model, and the plot of residuals can tell us many details regarding the accuracy of the model.

In [100]:
# Returns a prediction given inputs
predict = lambda y, g, yg, model : model.predict({"year" : y, "Genre" : g, "year_genre" : yg})[0]

# Creating Budget Residuals
data["budget_res"] = [row["budget"] - predict(row["year"], row["Genre"], row["year_genre"], budget_test) 
                      for index, row in data.iterrows()]

# Creating US Income Residuals
data["us_res"] = [row["us_income"] - predict(row["year"], row["Genre"], row["year_genre"], us_test) 
                      for index, row in data.iterrows()]

# Creating Worldwide Income Residuals
data["glo_res"] = [row["glo_income"] - predict(row["year"], row["Genre"], row["year_genre"], glo_test) 
                      for index, row in data.iterrows()]

# Places years into five-year bins so the violinplots display more cleanly
data["year_bin"] = np.digitize(data["year"], range(1970, 2020, 5))

In case you'd like to know how we separated the years into bins (any movie released before 1970 falls into Bin 0):

Bin 1: 1970 - 1974

Bin 2: 1975 - 1979

Bin 3: 1980 - 1984

Bin 4: 1985 - 1989

Bin 5: 1990 - 1994

Bin 6: 1995 - 1999

Bin 7: 2000 - 2004

Bin 8: 2005 - 2009

Bin 9: 2010 - 2014

Bin 10: 2015 - 2019

In [101]:
# Graphing Budget Residuals
sns.violinplot(x="year_bin", y="budget_res", data=data).set_title("Budget Residuals")
Out[101]:
Text(0.5, 1.0, 'Budget Residuals')

The suboptimal violinplot of residuals for budget shows that all assumptions about linear models are only partially met.

1.) The data mostly has a linear relationship, as the median residual in nearly every bin is about 0 (with notable exceptions for bins 2 and 10), so most data abided by the linear model, while data in those bins may deviate slightly from it.

2.) The residuals are semi-independent, as median residuals mostly don't change as time progresses.

3.) Residuals have somewhat constant variance, as most years have no extreme skewness of residual data (with notable exceptions from bins 4 and 8), which would lead to clustering. Data from those 2 bins, however, may have some clustering.

4.) Finally, the residuals are mostly normally distributed, as the residual data for most years is unimodal and symmetric (excluding bins 4 and 8). Normal distribution may be compromised in those two bins.

In general, this residual plot shows that the linear model for budget is slightly inaccurate; while it is true that, for linear regression, it is okay if one or two assumptions don't completely hold, it is concerning when all four residual plot conditions come with caveats. While I do not think these caveats are enough to completely invalidate our earlier conclusions regarding budget (i.e. that how budget changes over time depends on whether Japanese animated or superhero movies are being analyzed), it would be beneficial to do more analysis on the budgetary data. We could run machine learning to see if a different regression algorithm produces a better model, or alternatively, consider another type of regression, such as polynomial regression, to see if that is a better fit for our data.
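
As one sketch of the "different regression algorithm" idea, we could fit a scikit-learn random forest on the same three predictors and compare its fit against the linear model's R-squared of 0.679. This is only an illustration of the suggestion above, not part of the original analysis, and a fair comparison would also use a train/test split, since in-sample scores flatter flexible models.

from sklearn.ensemble import RandomForestRegressor

train = data[data["budget"].notnull()]
X = train[["year", "Genre", "year_genre"]]
y = train["budget"]

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print("Random forest in-sample R^2:", forest.score(X, y))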

In [102]:
# Graphing US Income Residuals
sns.violinplot(x="year_bin", y="us_res", data=data).set_title("US Income Residuals")
Out[102]:
Text(0.5, 1.0, 'US Income Residuals')

I will analyze US Income residuals simultaneously with Global Income residuals, as both sets of data have produced near perfect residual plots.

In [103]:
# Graphing Global Income Residuals
sns.violinplot(x="year_bin", y="glo_res", data=data).set_title("Global Income Residuals")
Out[103]:
Text(0.5, 1.0, 'Global Income Residuals')

The violinplots of residuals for both US and Global Income show that all assumptions about linear models are now met.

1.) The data has a linear relationship, as the median residual of all years is about 0, so most data abided by the linear model.

2.) The residuals are independent, as median residuals don't change in any significant way as time progresses.

3.) Residuals have constant variance, as no year has extreme skewness of residual data, which would lead to clustering. (Do not be concerned with the large lines emanating from many of the later bins, as these mark the min/max residual. For skewness, we must look at the median residual (white dot) in relation to the black box (1st and 3rd quartiles), and for all bins in both of these residual plots, the white dot is in between the black box, indicating little residual skewness)

4.) Finally, the residuals are normally distributed, as the residual data for every year is unimodal and symmetric.

This implies that for US and Global Income, the linear models we designed fit our data remarkably well. Therefore, our conclusions from earlier in this tutorial are valid, and we can safely conclude that a movie's genre, along with its year, has a direct impact on its US and Global Income.
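
If you would rather check point 3 numerically than by eye, a quick sketch like the one below computes the skewness of the income residuals within each year bin (bins containing only a handful of movies will give noisy values, so treat this as a rough check rather than a formal test):

from scipy import stats

for col in ["us_res", "glo_res"]:
    print(col)
    print(data.groupby("year_bin")[col].apply(lambda s: stats.skew(s.dropna())), "\n")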

Conclusion

Superhero Movies "Win" by Several Metrics...

After looking at several metrics across ratings, budget, US Income, and Global Income, it is clear that superhero movies are more successful, on average, than Japanese animated films. Superhero films were found to reach higher rating scores, earn more at the box office (both in the US and globally), and have the brightest financial future, as their linear models indicate that their box office performance has increased the most over time. While many, especially in the US, may consider the superhero genre oversaturated, there is no denying its success.

...but Japanese Animated Movies aren't Far Behind

However, we simply cannot toss aside Japanese animated movies. This genre was found to have higher rating consistency, with the largest share of average-to-good movies; lower budgets, meaning there is less upfront cost to produce a Japanese animated movie (although that may be due, in part, to lower animator wages and a lack of union and government regulation); and higher budget efficiency, meaning these films can achieve solid box office success with far less initial budget.

Takeaways

While superhero movies tend to be large spectacles with large budgets that hope to turn out amazing reviews and massive revenue, Japanese animated movies tend to take a subtler approach, favoring lower budgets and consistency instead. Both genres have their merits and are enjoyed by millions around the world today. If you take anything away from this tutorial, I hope you learned that conclusions from data science are not black and white but rather nuanced and complex. But mostly, I hope this tutorial was a fun and informative way for you to learn about the Data Science Pipeline!