Movie Genre Data Analysis¶

Introduction¶

This research explores patterns in the film industry by examining how genres, budgets, revenues, profits, popularity, and voting averages shape movie performance. The study is guided by key questions such as which genres dominate in production, deliver the highest financial returns, and capture the most audience approval, as well as how revenue and budget trends vary across different years. To deepen the analysis, several hypotheses are tested to investigate whether high ratings, popularity, or large budgets truly drive profit and revenue, and whether certain genres or production years offer consistent long-term efficiency. Together, these questions and hypotheses aim to highlight not only which films succeed, but also what factors make them sustainable investments in the movie market.

Research Questions (Q):¶

  1. Which genres are the most common (number of movies made)?
  2. Which genres have high avg. budget and revenue?
  3. Which genres have high avg. profit?
  4. Which genres have high avg. popularity?
  5. Which genres have highest number of movies with an voting avg. >=8?
  6. Which production years saw the highest total revenue, and how does this align with budget trends?

Research Hypotheses (H):¶

  1. The best movies according to vote avg. return high profit and revenue.
  2. The best movies according to popularity return high profit and revenue.
  3. Highly budgeted movies return high revenue and profit.
  4. Highly budgeted movies have a high popularity.
  5. Profit per genre per year
  6. Movies released in certain years achieve higher revenue/ROI than others
  7. Genres with consistently high ROI are better long-term targets than high-budget genres
In [550]:
import pandas as pd

movies = pd.read_csv('/Users/justiceoppongtuah/Library/CloudStorage/OneDrive-Personal/Pandas Project/imdb_movies.csv')
In [551]:
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
In [552]:
pd.set_option('display.max.rows', 11000)
pd.set_option('display.max.columns', 22)
In [553]:
movies.head()
Out[553]:
id imdb_id popularity budget revenue original_title cast homepage director tagline keywords overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. monster|dna|tyrannosaurus rex|velociraptor|island Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.500000 2015 137999939.280026 1392445892.523800
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. future|chase|post-apocalyptic|dystopia|australia An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.100000 2015 137999939.280026 348161292.489031
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You based on novel|revolution|dystopia|sequel|dyst... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.300000 2015 101199955.472019 271619025.407628
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. android|spaceship|jedi|space opera|3d Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.500000 2015 183999919.040035 1902723129.801820
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home car race|speed|revenge|suspense|car Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.300000 2015 174799923.088033 1385748801.470520

Data Cleaning¶

1. Checking for duplicates:¶

In [554]:
movies[movies.duplicated()]
Out[554]:
id imdb_id popularity budget revenue original_title cast homepage director tagline keywords overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
2090 42194 tt0411951 0.596430 30000000 967000 TEKKEN Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian... NaN Dwight H. Little Survival is no game martial arts|dystopia|based on video game|mart... In the year of 2039, after World Wars destroy ... 92 Crime|Drama|Action|Thriller|Science Fiction Namco|Light Song Films 3/20/10 110 5.000000 2010 30000000.000000 967000.000000

Deleting duplicates:¶

In [555]:
movies.drop_duplicates(inplace = True)
In [556]:
movies[movies.duplicated()]
Out[556]:
id imdb_id popularity budget revenue original_title cast homepage director tagline keywords overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj

2. Deleting null data¶

a. from the info 'genres' has (10866 - 10843) rows being null

In [557]:
movies.dropna(subset = ['genres'], inplace = True)
In [558]:
movies.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10842 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10842 non-null  int64  
 1   imdb_id               10834 non-null  object 
 2   popularity            10842 non-null  float64
 3   budget                10842 non-null  int64  
 4   revenue               10842 non-null  int64  
 5   original_title        10842 non-null  object 
 6   cast                  10767 non-null  object 
 7   homepage              2931 non-null   object 
 8   director              10800 non-null  object 
 9   tagline               8036 non-null   object 
 10  keywords              9367 non-null   object 
 11  overview              10839 non-null  object 
 12  runtime               10842 non-null  int64  
 13  genres                10842 non-null  object 
 14  production_companies  9826 non-null   object 
 15  release_date          10842 non-null  object 
 16  vote_count            10842 non-null  int64  
 17  vote_average          10842 non-null  float64
 18  release_year          10842 non-null  int64  
 19  budget_adj            10842 non-null  float64
 20  revenue_adj           10842 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.8+ MB

Calculating and adding 'profit' field to the data¶

In [559]:
movies['profit'] = movies['revenue'] - movies['budget']

Extracting Fields to Work with:¶

In [ ]:
movies_genre = movies[['popularity', 'budget', 'revenue', 'original_title', 'runtime', 'genres', 'release_year', 'vote_count', 'vote_average', 'profit']]
movies_genre.head()
In [561]:
from pandas import Series, DataFrame

Expanding the data by 'genres' per row¶

This is to make it more useful for our use cases:

In [562]:
movies_genre = (movies_genre.assign(genre=movies_genre['genres'].str.split('|')).explode('genre').drop(columns='genres'))
In [563]:
movies_genre
Out[563]:
popularity budget revenue original_title runtime release_year vote_count vote_average profit genre
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Action
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Adventure
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Science Fiction
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Thriller
1 28.419936 150000000 378436354 Mad Max: Fury Road 120 2015 6185 7.100000 228436354 Action
... ... ... ... ... ... ... ... ... ... ...
10863 0.065141 0 0 Beregis Avtomobilya 94 1966 11 6.500000 0 Mystery
10863 0.065141 0 0 Beregis Avtomobilya 94 1966 11 6.500000 0 Comedy
10864 0.064317 0 0 What's Up, Tiger Lily? 80 1966 22 5.400000 0 Action
10864 0.064317 0 0 What's Up, Tiger Lily? 80 1966 22 5.400000 0 Comedy
10865 0.035919 19000 0 Manos: The Hands of Fate 74 1966 15 1.500000 -19000 Horror

26955 rows × 10 columns

In [564]:
movies_genre.head(10)
Out[564]:
popularity budget revenue original_title runtime release_year vote_count vote_average profit genre
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Action
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Adventure
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Science Fiction
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Thriller
1 28.419936 150000000 378436354 Mad Max: Fury Road 120 2015 6185 7.100000 228436354 Action
1 28.419936 150000000 378436354 Mad Max: Fury Road 120 2015 6185 7.100000 228436354 Adventure
1 28.419936 150000000 378436354 Mad Max: Fury Road 120 2015 6185 7.100000 228436354 Science Fiction
1 28.419936 150000000 378436354 Mad Max: Fury Road 120 2015 6185 7.100000 228436354 Thriller
2 13.112507 110000000 295238201 Insurgent 119 2015 2480 6.300000 185238201 Adventure
2 13.112507 110000000 295238201 Insurgent 119 2015 2480 6.300000 185238201 Science Fiction

Research Questions (Q):¶

1. Which genres are the most common (number of movies made)?

In [565]:
genres_count = pd.DataFrame(movies_genre.groupby('genre').original_title.nunique()).sort_values('original_title', ascending = True)
In [566]:
genres_count
Out[566]:
original_title
genre
Western 163
TV Movie 167
Foreign 188
War 269
History 331
Music 403
Documentary 520
Animation 697
Mystery 796
Fantasy 899
Science Fiction 1207
Family 1211
Crime 1337
Adventure 1442
Horror 1580
Romance 1686
Action 2339
Thriller 2841
Comedy 3750
Drama 4672
In [567]:
genres_count['original_title'].plot.pie(title = 'Movies Per Genre in %', autopct = '%1.1f%%', figsize = (9,9))
Out[567]:
<Axes: title={'center': 'Movies Per Genre in %'}, ylabel='original_title'>
No description has been provided for this image

The analysis above shows that Drama, Comedy, and Thriller are the most common genres, with Drama leading at 4,672 movies. The pie chart highlights how these genres dominate film production compared to less common ones like Western, TV Movie, and Foreign.

In [ ]:
 

2. Which genres have high avg. budget and revenue?

In [569]:
genres_avg = genres_count.groupby('genre').mean()
genres_avg
Out[569]:
original_title
genre
Action 2339.000000
Adventure 1442.000000
Animation 697.000000
Comedy 3750.000000
Crime 1337.000000
Documentary 520.000000
Drama 4672.000000
Family 1211.000000
Fantasy 899.000000
Foreign 188.000000
History 331.000000
Horror 1580.000000
Music 403.000000
Mystery 796.000000
Romance 1686.000000
Science Fiction 1207.000000
TV Movie 167.000000
Thriller 2841.000000
War 269.000000
Western 163.000000
In [570]:
genres_avg = movies_genre.groupby('genre').mean(numeric_only = True)
pd.options.display.float_format = '{:2f}'.format

genres_avg
Out[570]:
popularity budget revenue runtime release_year vote_count vote_average profit
genre
Action 0.926274 27727820.331376 72794732.002517 104.917785 2000.056208 392.993708 5.787752 45066911.671141
Adventure 1.154259 37543694.530931 113137861.069341 106.173351 1999.388851 513.125085 5.940585 75594166.538409
Animation 0.852182 23159781.606581 75256062.223176 68.181688 2003.995708 303.000000 6.403147 52096280.616595
Comedy 0.592607 13297915.618244 37526242.072238 96.745057 2000.816240 176.436330 5.905167 24228326.453994
Crime 0.744930 17663801.124815 42368661.645495 106.917282 1999.491137 278.805022 6.124889 24704860.520679
Documentary 0.181432 577149.148077 2041106.994231 102.651923 2008.313462 35.105769 6.908462 1463957.846154
Drama 0.591495 11880717.773529 29232255.725840 110.478151 2000.929202 182.544538 6.165546 17351537.952311
Family 0.786668 23359337.420796 72433176.373680 89.603574 2000.774167 272.320877 5.997563 49073838.952884
Fantasy 0.992840 32612585.348253 96313657.081878 100.736900 2000.292576 420.741266 5.863537 63701071.733624
Foreign 0.191496 1451434.925532 1520459.835106 107.228723 2001.398936 16.627660 5.981383 69024.909574
History 0.575936 18594919.302395 32011793.215569 136.206587 1997.497006 183.772455 6.410479 13416873.913174
Horror 0.465357 6226529.210751 16822808.624313 94.424557 2001.117288 120.059866 5.337447 10596279.413561
Music 0.487321 9438627.549020 28571768.691176 105.137255 2000.196078 124.340686 6.480392 19133141.142157
Mystery 0.690012 16119270.062963 40217566.661728 105.928395 1999.850617 236.998765 5.946790 24098296.598765
Romance 0.592082 12531271.847547 35691972.327103 106.891355 2000.438668 166.070678 6.042874 23160700.479556
Science Fiction 1.001548 24972680.524003 70140558.034174 99.419854 1999.978845 437.096013 5.665582 45167877.510171
TV Movie 0.270896 267664.670659 251497.005988 91.982036 2004.634731 34.365269 5.788024 -16167.664671
Thriller 0.741563 17207693.769178 41728417.543860 103.247678 2001.693498 255.484348 5.750671 24520723.774682
War 0.727683 20891886.103704 47605183.300000 127.625926 1996.096296 270.733333 6.297778 26713297.196296
Western 0.590615 18974107.975758 28568709.284848 117.575758 1986.915152 205.739394 6.083030 9594601.309091
In [571]:
genres_avg.sort_values('budget', ascending = True, inplace = True)
In [572]:
genres_avg[['budget', 'revenue']].plot.barh(title = 'Budget and Revenue Per Genre', color = ('DarkBlue', 'DarkCyan'), figsize = (8,7))
Out[572]:
<Axes: title={'center': 'Budget and Revenue Per Genre'}, ylabel='genre'>
No description has been provided for this image

The analysis above reveals that Adventure, Fantasy, and Action genres stand out with the highest average budgets and revenues, reflecting their blockbuster nature. A horizontal bar chart was used to compare these values across genres, making it clear that large-scale productions typically dominate financial performance in the film industry.

In [ ]:
 

3. Which genres have high avg. profit?

In [629]:
genres_avg
Out[629]:
popularity budget revenue runtime release_year vote_count vote_average profit
genre
Documentary 0.181432 577149.148077 2041106.994231 102.651923 2008.313462 35.105769 6.908462 1463957.846154
Foreign 0.191496 1451434.925532 1520459.835106 107.228723 2001.398936 16.627660 5.981383 69024.909574
TV Movie 0.270896 267664.670659 251497.005988 91.982036 2004.634731 34.365269 5.788024 -16167.664671
Horror 0.465357 6226529.210751 16822808.624313 94.424557 2001.117288 120.059866 5.337447 10596279.413561
Music 0.487321 9438627.549020 28571768.691176 105.137255 2000.196078 124.340686 6.480392 19133141.142157
History 0.575936 18594919.302395 32011793.215569 136.206587 1997.497006 183.772455 6.410479 13416873.913174
Western 0.590615 18974107.975758 28568709.284848 117.575758 1986.915152 205.739394 6.083030 9594601.309091
Drama 0.591495 11880717.773529 29232255.725840 110.478151 2000.929202 182.544538 6.165546 17351537.952311
Romance 0.592082 12531271.847547 35691972.327103 106.891355 2000.438668 166.070678 6.042874 23160700.479556
Comedy 0.592607 13297915.618244 37526242.072238 96.745057 2000.816240 176.436330 5.905167 24228326.453994
Mystery 0.690012 16119270.062963 40217566.661728 105.928395 1999.850617 236.998765 5.946790 24098296.598765
War 0.727683 20891886.103704 47605183.300000 127.625926 1996.096296 270.733333 6.297778 26713297.196296
Thriller 0.741563 17207693.769178 41728417.543860 103.247678 2001.693498 255.484348 5.750671 24520723.774682
Crime 0.744930 17663801.124815 42368661.645495 106.917282 1999.491137 278.805022 6.124889 24704860.520679
Family 0.786668 23359337.420796 72433176.373680 89.603574 2000.774167 272.320877 5.997563 49073838.952884
Animation 0.852182 23159781.606581 75256062.223176 68.181688 2003.995708 303.000000 6.403147 52096280.616595
Action 0.926274 27727820.331376 72794732.002517 104.917785 2000.056208 392.993708 5.787752 45066911.671141
Fantasy 0.992840 32612585.348253 96313657.081878 100.736900 2000.292576 420.741266 5.863537 63701071.733624
Science Fiction 1.001548 24972680.524003 70140558.034174 99.419854 1999.978845 437.096013 5.665582 45167877.510171
Adventure 1.154259 37543694.530931 113137861.069341 106.173351 1999.388851 513.125085 5.940585 75594166.538409
In [574]:
genres_avg.sort_values('profit', ascending = True, inplace = True)
In [575]:
genres_avg['profit'].plot.barh(title = 'Profit Per Genre', color = 'DarkCyan', figsize = (8,7))
Out[575]:
<Axes: title={'center': 'Profit Per Genre'}, ylabel='genre'>
No description has been provided for this image

The analysis reveals that Adventure, Fantasy, Animation, and Family stand out as the most profitable genres, with Animation and Family even surpassing Science Fiction in average profit. A horizontal bar chart was used to visualize these results, showing blockbuster and family-oriented genres leading in financial returns.

In [ ]:
 

4. Which genres have high avg. popularity?

In [576]:
genres_avg.sort_values('popularity', ascending = True, inplace = True)
In [577]:
genres_avg['popularity'].plot.barh(title = 'Popularity Per Genre', color = 'DarkCyan', figsize = (8,7))
Out[577]:
<Axes: title={'center': 'Popularity Per Genre'}, ylabel='genre'>
No description has been provided for this image

The analysis shows that Adventure, Fantasy and Science Fiction are the genres with the highest average popularity, indicating strong audience appeal. A horizontal bar chart was used to visualize the comparison, highlighting how these genres consistently attract the most attention.

In [ ]:
 

5. Which genres have highest number of movies with an voting avg. >=8?

In [578]:
movies_genre.head(5)
Out[578]:
popularity budget revenue original_title runtime release_year vote_count vote_average profit genre
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Action
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Adventure
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Science Fiction
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Thriller
1 28.419936 150000000 378436354 Mad Max: Fury Road 120 2015 6185 7.100000 228436354 Action
In [579]:
vote_fifty = movies_genre[(movies_genre['vote_count'] >= 50) & (movies_genre['vote_average'] >= 8)]
vote_zero = movies_genre[movies_genre['vote_average'] >= 8]
In [580]:
genres_vote = pd.DataFrame(vote_zero.groupby('genre').vote_average.nunique()).sort_values('vote_average', ascending = True)
genres_vote
Out[580]:
vote_average
genre
Family 1
Action 2
Mystery 2
History 2
Fantasy 2
Adventure 2
War 2
Horror 3
Romance 3
Science Fiction 3
Thriller 4
Animation 4
Crime 4
Comedy 5
Drama 6
Music 8
Documentary 9
In [581]:
genres_vote['vote_average'].plot.barh(title = 'Vote Average by Genre', color = 'DarkCyan', figsize = (8,7))
Out[581]:
<Axes: title={'center': 'Vote Average by Genre'}, ylabel='genre'>
No description has been provided for this image

The analysis shows that Documetary, Music and Drama have the highest number of movies with a voting average of 8 or above. Meanwhile, Family, Action and Mystery have the lowest. A horizontal bar chart was used to display the counts, making it clear which genres consistently produce highly rated films.

In [ ]:
 

6. Which production years saw the highest total revenue, and how does this align with budget trends?

In [582]:
import numpy as np
import matplotlib.pyplot as plt
In [583]:
movies_genre
Out[583]:
popularity budget revenue original_title runtime release_year vote_count vote_average profit genre
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Action
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Adventure
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Science Fiction
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Thriller
1 28.419936 150000000 378436354 Mad Max: Fury Road 120 2015 6185 7.100000 228436354 Action
... ... ... ... ... ... ... ... ... ... ...
10863 0.065141 0 0 Beregis Avtomobilya 94 1966 11 6.500000 0 Mystery
10863 0.065141 0 0 Beregis Avtomobilya 94 1966 11 6.500000 0 Comedy
10864 0.064317 0 0 What's Up, Tiger Lily? 80 1966 22 5.400000 0 Action
10864 0.064317 0 0 What's Up, Tiger Lily? 80 1966 22 5.400000 0 Comedy
10865 0.035919 19000 0 Manos: The Hands of Fate 74 1966 15 1.500000 -19000 Horror

26955 rows × 10 columns

In [584]:
# Yearly totals
yearly_totals = (movies_genre.groupby('release_year')[['revenue', 'budget']].sum(min_count=1).dropna().sort_index())
In [585]:
# Efficiency and ranks
yearly_totals['rev_to_budget_ratio'] = np.where(yearly_totals['budget'] > 0,yearly_totals['revenue'] / yearly_totals['budget'],np.nan)
In [586]:
yearly_totals['revenue_rank'] = yearly_totals['revenue'].rank(ascending=False, method='min')
yearly_totals['budget_rank']  = yearly_totals['budget'].rank(ascending=False, method='min')
yearly_totals['rank_gap(rev-budget)'] = yearly_totals['revenue_rank'] - yearly_totals['budget_rank']
In [587]:
# Top 10 years by total revenue
q6_top10 = yearly_totals.sort_values('revenue', ascending=False).head(10)
In [588]:
display(yearly_totals.head(12))
display(q6_top10)
revenue budget rev_to_budget_ratio revenue_rank budget_rank rank_gap(rev-budget)
release_year
1960 417915000 64140844 6.515583 55.000000 56.000000 -1.000000
1961 1201673874 117537000 10.223792 47.000000 54.000000 -7.000000
1962 726409692 184194630 3.943707 53.000000 48.000000 5.000000
1963 552004390 237328000 2.325913 54.000000 45.000000 9.000000
1964 1178849290 134471644 8.766527 49.000000 52.000000 -3.000000
1965 1525459848 212590345 7.175584 46.000000 46.000000 0.000000
1966 195736689 136164000 1.437507 56.000000 51.000000 5.000000
1967 2177966116 324282200 6.716268 42.000000 42.000000 0.000000
1968 817108595 238439000 3.426908 51.000000 44.000000 7.000000
1969 743971370 132203174 5.627485 52.000000 53.000000 -1.000000
1970 1692997298 480535838 3.523145 45.000000 40.000000 5.000000
1971 1200690537 253348000 4.739294 48.000000 43.000000 5.000000
revenue budget rev_to_budget_ratio revenue_rank budget_rank rank_gap(rev-budget)
release_year
2015 82160071197 23020835159 3.568944 1.000000 8.000000 -7.000000
2014 72059584262 23255945482 3.098545 2.000000 7.000000 -5.000000
2012 72019644724 24044807311 2.995227 3.000000 5.000000 -2.000000
2009 71844890702 26207420963 2.741395 4.000000 4.000000 0.000000
2013 70798442190 26406690003 2.681080 5.000000 3.000000 2.000000
2011 70142112650 26566819277 2.640215 6.000000 2.000000 4.000000
2010 65304621240 28482377815 2.292808 7.000000 1.000000 6.000000
2007 60426073803 23263499212 2.597463 8.000000 6.000000 2.000000
2008 54764768294 22167781392 2.470467 9.000000 9.000000 0.000000
2004 50004105495 21297896216 2.347842 10.000000 12.000000 -2.000000
In [589]:
# Correlation of yearly totals (how tightly budgets track revenue)
q6_spearman = yearly_totals[['revenue','budget']].corr(method='spearman').iloc[0,1]
print(f"[Q6] Spearman(revenue vs budget) across years = {q6_spearman:.3f}")
[Q6] Spearman(revenue vs budget) across years = 0.975
In [590]:
# Plot: revenue vs budget over time
plt.figure(figsize=(11,6))
plt.plot(yearly_totals.index, yearly_totals['revenue'], label='Total Revenue')
plt.plot(yearly_totals.index, yearly_totals['budget'],  label='Total Budget')
plt.title('Total Revenue & Budget by Year')
plt.xlabel('Year'); plt.ylabel('Amount'); plt.legend(); plt.tight_layout(); plt.show()
No description has been provided for this image

The analysis shows that the highest total revenues were recorded in recent blockbuster years, with 2015, 2014, and 2012 leading the list. The trend also reveals that while budgets rose steadily during these periods, revenue growth often outpaced budget increases, resulting in strong revenue-to-budget ratios. A line chart comparing total revenue and total budget by year was used to illustrate this alignment.

In [ ]:
 

Q7) Which genres deliver the highest average ROI, and is high ROI consistent over time?

ROI (Return on Investment) • Definition: A percentage that measures the efficiency of an investment relative to its cost.

•	Formula:
ROI = [{Profit (or Gain) from Investment} \ {Cost of Investment}]  * 100%

•	Type: Measured in percentage (%).

•	Purpose: Shows how effectively your investment generates returns — it answers “Was the investment worth it?”
In [591]:
import matplotlib.pyplot as plt
import seaborn as sns
In [592]:
# ROI at movie level (guard against division by zero)
movies_genre['profit'] = movies_genre['revenue'] - movies_genre['budget']
movies_genre['roi'] = np.where(movies_genre['budget'] > 0, movies_genre['profit'] / movies_genre['budget'], np.nan)
movies_genre.head()
Out[592]:
popularity budget revenue original_title runtime release_year vote_count vote_average profit genre roi
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Action 9.090192
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Adventure 9.090192
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Science Fiction 9.090192
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Thriller 9.090192
1 28.419936 150000000 378436354 Mad Max: Fury Road 120 2015 6185 7.100000 228436354 Action 1.522909
In [593]:
# Mean ROI per (genre, year)
roi_by_year = (
    movies_genre
    .dropna(subset=['release_year', 'roi'])
    .groupby(['genre', 'release_year'], as_index=False)['roi'].mean()
    .rename(columns={'release_year': 'year', 'roi': 'mean_roi_year'})
)
roi_by_year.head(15)
Out[593]:
genre year mean_roi_year
0 Action 1960 2.726250
1 Action 1961 0.844444
2 Action 1962 18.171717
3 Action 1963 9.715914
4 Action 1964 34.685714
5 Action 1965 2.030697
6 Action 1966 -0.014614
7 Action 1967 4.678831
8 Action 1968 4.209170
9 Action 1969 1.927660
10 Action 1970 1.375660
11 Action 1971 14.932931
12 Action 1972 12.283831
13 Action 1973 21.699363
14 Action 1974 6.609024
In [594]:
# Consistency by genre: overall mean ROI, volatility across years, years observed
roi_consistency = (
    roi_by_year
    .groupby('genre', as_index=False)
    .agg(
        mean_roi=('mean_roi_year', 'mean'),
        roi_volatility=('mean_roi_year', 'std'),
        years_observed=('year', 'nunique')
    )
    .fillna({'roi_volatility': 0})
    .sort_values(['mean_roi', 'years_observed'], ascending=[False, False])
)

display(roi_consistency.head(20))
genre mean_roi roi_volatility years_observed
7 Family 5097.366486 36013.022664 50
14 Romance 1834.896107 13610.464854 56
1 Adventure 1659.294028 12374.474298 56
0 Action 1217.297322 9074.426205 56
6 Drama 822.291712 5678.375641 56
17 Thriller 77.977464 527.720212 55
3 Comedy 56.189440 329.346892 56
11 Horror 23.207555 79.766938 56
13 Mystery 22.797339 104.184279 51
5 Documentary 8.312802 24.843905 22
15 Science Fiction 4.964676 16.941324 53
12 Music 4.784395 8.749801 45
8 Fantasy 4.497297 18.673400 47
2 Animation 4.263699 9.198865 43
4 Crime 3.578558 5.809079 55
19 Western 1.896628 4.866633 39
18 War 1.868804 2.589981 48
10 History 1.344613 2.230760 48
16 TV Movie 0.200000 3.174902 7
9 Foreign -0.667293 0.621465 17
In [595]:
# Visual: mean ROI vs volatility
plt.figure(figsize=(8,6))
sns.scatterplot(data=roi_consistency, x='mean_roi', y='roi_volatility')
plt.title('Genre ROI: Mean vs Volatility'); plt.xlabel('Mean ROI'); plt.ylabel('ROI Volatility (Std Dev)')
plt.tight_layout(); 
plt.show()
No description has been provided for this image

The analysis shows that Family, Romance, and Adventure deliver the highest average ROI, although they also exhibit very high volatility across years. Genres such as Thriller, Comedy, and Horror offer more moderate but relatively stable returns. A scatter plot of mean ROI versus volatility was used to highlight the trade-off between profitability and consistency, helping to identify which genres balance high returns with lower risk.

In [ ]:
 

Research Hypotheses (H):¶

1. The best movies according to vote avg. return high profit and revenue.

In [596]:
movies_genre.head()
Out[596]:
popularity budget revenue original_title runtime release_year vote_count vote_average profit genre roi
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Action 9.090192
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Adventure 9.090192
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Science Fiction 9.090192
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Thriller 9.090192
1 28.419936 150000000 378436354 Mad Max: Fury Road 120 2015 6185 7.100000 228436354 Action 1.522909
In [597]:
movies_counted = movies_genre[movies_genre['vote_count'] >= 50]
movies_counted.corr(method='spearman', numeric_only=True)
Out[597]:
popularity budget revenue runtime release_year vote_count vote_average profit roi
popularity 1.000000 0.496554 0.606165 0.231128 0.091893 0.776718 0.213364 0.516448 0.412798
budget 0.496554 1.000000 0.723696 0.355018 0.024279 0.573096 -0.036132 0.343588 0.016027
revenue 0.606165 0.723696 1.000000 0.346976 -0.075445 0.699428 0.126505 0.841594 0.713213
runtime 0.231128 0.355018 0.346976 1.000000 -0.136733 0.269755 0.260482 0.223173 0.101995
release_year 0.091893 0.024279 -0.075445 -0.136733 1.000000 0.176190 -0.134809 -0.075017 -0.146544
vote_count 0.776718 0.573096 0.699428 0.269755 0.176190 1.000000 0.307466 0.598132 0.490226
vote_average 0.213364 -0.036132 0.126505 0.260482 -0.134809 0.307466 1.000000 0.217354 0.298911
profit 0.516448 0.343588 0.841594 0.223173 -0.075017 0.598132 0.217354 1.000000 0.837730
roi 0.412798 0.016027 0.713213 0.101995 -0.146544 0.490226 0.298911 0.837730 1.000000
In [598]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(9,7))
sns.regplot(x = 'vote_average', y = 'profit', data = movies_counted, line_kws = {'color':'cyan'})
Out[598]:
<Axes: xlabel='vote_average', ylabel='profit'>
No description has been provided for this image
In [599]:
plt.figure(figsize=(9,7))

sns.regplot(x = 'vote_average', y = 'revenue', data = movies_counted, line_kws = {'color':'cyan'})
Out[599]:
<Axes: xlabel='vote_average', ylabel='revenue'>
No description has been provided for this image

****The correlation analysis shows a positive but relatively weak relationship between vote average and both profit and revenue. This suggests that while highly rated movies can generate strong financial returns, ratings alone are not the dominant driver of profit and revenue. A regression plot was used to visualize this relationship, showing the general upward trend but also the wide spread of data points.****

In [ ]:
 

2. The best movies according to popularity return high profit and revenue.

In [600]:
movies_counted.corr(method='spearman', numeric_only=True)
Out[600]:
popularity budget revenue runtime release_year vote_count vote_average profit roi
popularity 1.000000 0.496554 0.606165 0.231128 0.091893 0.776718 0.213364 0.516448 0.412798
budget 0.496554 1.000000 0.723696 0.355018 0.024279 0.573096 -0.036132 0.343588 0.016027
revenue 0.606165 0.723696 1.000000 0.346976 -0.075445 0.699428 0.126505 0.841594 0.713213
runtime 0.231128 0.355018 0.346976 1.000000 -0.136733 0.269755 0.260482 0.223173 0.101995
release_year 0.091893 0.024279 -0.075445 -0.136733 1.000000 0.176190 -0.134809 -0.075017 -0.146544
vote_count 0.776718 0.573096 0.699428 0.269755 0.176190 1.000000 0.307466 0.598132 0.490226
vote_average 0.213364 -0.036132 0.126505 0.260482 -0.134809 0.307466 1.000000 0.217354 0.298911
profit 0.516448 0.343588 0.841594 0.223173 -0.075017 0.598132 0.217354 1.000000 0.837730
roi 0.412798 0.016027 0.713213 0.101995 -0.146544 0.490226 0.298911 0.837730 1.000000
In [601]:
import matplotlib.pyplot as plt 

plt.figure(figsize = (9,7))
sns.regplot(x = 'popularity', y = 'revenue', data = movies_counted, line_kws = {'color':'cyan'})
plt.show()
No description has been provided for this image

The correlation results show that popularity has a strong positive relationship with both revenue (0.61) and profit (0.52). This means that highly popular movies generally achieve higher commercial success, confirming the hypothesis. A regression plot was used to illustrate this trend, with a clear upward slope demonstrating how popularity drives financial performance.

In [ ]:
 

3. Highly budgeted movies return high profit.

In [602]:
plt.figure(figsize = (9,7))
sns.regplot(x = 'budget', y = 'profit', data = movies_counted, line_kws = {'color':'cyan'})
plt.show()
No description has been provided for this image

The analysis shows a moderate positive relationship between budget and profit, indicating that higher-budget movies often achieve greater profits. However, the wide spread of data points in the regression plot highlights that large budgets do not always guarantee high returns.

In [ ]:
 

4. Highly budgeted movies have a high popularity.

In [603]:
plt.figure(figsize = (9,7))
sns.regplot(x = 'budget', y = 'popularity', data = movies_counted, line_kws = {'color':'cyan'})
plt.show()
No description has been provided for this image

The analysis indicates only a weak positive relationship between budget and popularity, suggesting that higher-budget movies may attract more attention but not consistently so. The regression plot shows an upward trend, but the scatter of points reveals that popularity is influenced by other factors beyond budget.

In [ ]:
 

5. Profit per Genre per Year¶

In [604]:
movies_genre.head()
Out[604]:
popularity budget revenue original_title runtime release_year vote_count vote_average profit genre roi
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Action 9.090192
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Adventure 9.090192
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Science Fiction 9.090192
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Thriller 9.090192
1 28.419936 150000000 378436354 Mad Max: Fury Road 120 2015 6185 7.100000 228436354 Action 1.522909
In [605]:
time_genre = pd.DataFrame(movies_genre.groupby(['release_year', 'genre'])['profit'].mean())
time_genre.head(10)
Out[605]:
profit
release_year genre
1960 Action 6363125.000000
Adventure 431000.000000
Comedy 5258750.000000
Crime 0.000000
Drama 8245619.384615
Family 3333333.333333
Fantasy -375000.000000
Foreign 0.000000
History 9600000.000000
Horror 4389007.428571
In [631]:
pivot_genre = pd.pivot_table(time_genre, values = 'profit', index = ['genre'], columns = ['release_year'])
pivot_genre
Out[631]:
release_year 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 ... 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
genre
Action 6363125.000000 2457142.857143 10437500.000000 14999691.250000 24280000.000000 16104504.777778 324785.714286 21732786.428571 8868478.833333 5861449.300000 7735913.454545 ... 32379154.185714 38048623.212500 45977444.052632 47642764.404040 57555510.740741 34571030.355140 51968078.252174 71910257.424242 47431725.628099 69943650.015504 90399046.747664
Adventure 431000.000000 39130002.333333 21214285.714286 9428395.000000 24123400.000000 25124276.333333 791563.636364 49281873.857143 15122159.400000 16651809.400000 11406377.666667 ... 70756754.320755 64895200.090909 83916731.083333 55016526.809524 113364919.736111 98125646.677966 113343224.564516 162549574.300000 100638113.402985 120373306.985075 144787868.971014
Animation NaN 211880014.000000 NaN 0.000000 0.000000 0.000000 -105000.000000 100921706.000000 0.000000 0.000000 25837628.500000 ... 40307312.153846 39047734.974359 67147670.531250 51907242.575758 48862898.250000 58455658.640000 63420098.934783 69231546.875000 93718795.261905 64313463.250000 83691174.384615
Comedy 5258750.000000 22198001.400000 1599974.800000 765591.923077 7080111.375000 3264285.714286 -149550.000000 10135087.470588 -1277777.777778 -125000.000000 11901488.583333 ... 22147617.440559 24944308.174194 30674848.549669 21483012.426036 24741982.808081 23154807.875740 27241358.720930 25631999.136364 25068455.091429 26648786.864865 33703819.067901
Crime 0.000000 18828411.000000 3709948.666667 369526.750000 1723307.300000 -22500.000000 -275560.000000 12899678.714286 6762174.600000 31482963.000000 -2125000.000000 ... 15680793.688889 21551432.962264 16700048.400000 31258696.854839 11016642.307692 13490412.320000 25395271.625000 29522925.555556 23248589.507042 5903014.230769 61286524.372549
Documentary NaN NaN NaN NaN NaN NaN 0.000000 0.000000 NaN NaN 0.000000 ... 316012.705882 6772940.000000 5229344.631579 321262.230769 -1603933.560000 187417.771429 2567196.816327 1102306.591837 321880.225806 13104.136986 531061.859649
Drama 8245619.384615 4912698.375000 4373796.190476 1587307.692308 3726105.550000 11933809.800000 1014793.062500 18045095.000000 6354859.150000 10317995.538462 19753262.842105 ... 17180823.510989 12994800.142132 13953425.421320 16736384.017167 15209056.888393 13510589.176190 10830976.523364 23418944.870690 13344942.608696 13697098.665493 16315310.438462
Family 3333333.333333 42576002.800000 4000000.000000 -333333.333333 37818181.750000 51671428.666667 -341828.571429 64281204.000000 -640000.000000 0.000000 17225085.666667 ... 51714774.929825 35281751.276923 80470561.733333 44483026.142857 62009322.100000 98999767.490909 65483435.472222 69321493.023810 89321956.437500 56295182.744186 85974520.750000
Fantasy -375000.000000 0.000000 -126.000000 -500000.000000 24068181.750000 NaN 0.000000 -3000000.000000 -2000000.000000 0.000000 0.000000 ... 64388182.000000 57967661.441176 93461583.425532 51441997.767442 96235822.102041 83235328.522727 94486677.847826 122245621.970588 79363628.000000 112966482.638889 86576948.303030
Foreign 0.000000 0.000000 -126.000000 0.000000 0.000000 0.000000 0.000000 NaN NaN NaN -25000000.000000 ... -57761.300000 -250.000000 -1978098.117647 5591672.111111 -1032352.941176 -1185845.384615 -403761.428571 0.000000 NaN NaN NaN
History 9600000.000000 2333333.333333 11250000.000000 4908750.000000 0.000000 -3440000.000000 0.000000 NaN 4433333.333333 28102963.000000 18954430.333333 ... 19308745.700000 6680655.250000 10272090.153846 7683159.875000 3519452.875000 14883053.714286 12320119.375000 6216918.769231 23672850.857143 22025802.133333 4193780.266667
Horror 4389007.428571 600000.000000 231600.000000 898558.777778 -10833.333333 -60000.000000 -6333.333333 -16250.000000 7520356.500000 -175000.000000 -12000.000000 ... 17915983.446429 9281245.571429 16102357.113924 1813709.855263 7410959.733333 9365626.051282 5780349.653846 6799669.932692 14019309.127451 6162654.752381 5394573.384000
Music 0.000000 18828411.000000 8000000.000000 NaN 30354655.200000 76757143.000000 NaN 7200000.000000 2266666.666667 0.000000 0.000000 ... 16848210.461538 31348176.333333 40089857.400000 6433060.950000 36892810.647059 -700418.900000 31950506.882353 27459603.590909 78735.100000 12672735.785714 17045008.303030
Mystery NaN 0.000000 -245000.000000 1568014.666667 1216209.750000 0.000000 1550000.000000 8459992.666667 25626555.750000 -4000000.000000 2499906.000000 ... 12485711.068966 26821791.500000 56994329.941176 14934154.931034 14554527.098039 45215039.968750 33299398.657895 11465554.363636 3487391.552632 21897562.888889 10063931.666667
Romance 5225000.000000 214285.714286 -400025.200000 4388698.500000 6722222.222222 50547239.200000 1216666.666667 17960342.636364 5444314.000000 -312500.000000 19933333.333333 ... 27892262.129032 14888539.914286 16750359.121622 35814305.607143 26014869.542169 22446537.105882 25097807.626667 24848927.860759 8434191.753846 15024969.250000 22447977.175439
Science Fiction -316666.666667 0.000000 -31000.000000 -375000.000000 -300000.000000 0.000000 917866.666667 -16250.000000 18043865.750000 0.000000 5333239.333333 ... 63536368.000000 7707033.766667 26902035.439024 20509952.403846 70332541.295775 42764127.844444 52067849.625000 63413513.685185 80084403.885246 86164859.403226 83321303.523256
TV Movie NaN NaN NaN NaN NaN 0.000000 NaN NaN NaN 0.000000 0.000000 ... 0.000000 -1275000.000000 0.000000 0.000000 -1500000.000000 -312500.000000 0.000000 0.000000 -200000.000000 0.000000 -150000.000000
Thriller 5028842.000000 NaN 7902857.142857 8612335.300000 14029426.555556 11285049.363636 650000.000000 26892953.000000 6762174.600000 23658164.333333 18097830.200000 ... 22759650.239130 24651056.850877 19789641.168000 26262034.385827 13872156.605096 14726113.059701 19873485.568493 22469518.868750 22021043.091429 20129150.273743 32869106.847953
War 0.000000 11450000.000000 15000000.000000 1000000.000000 2546757.333333 11020224.375000 2000000.000000 20300000.000000 4325000.000000 -4000000.000000 33032658.200000 ... 2339931.600000 3074035.285714 54239461.666667 7426139.555556 20431563.083333 -8446301.571429 7767908.111111 23710293.600000 21060731.285714 60608265.086957 54677314.333333
Western 484166.666667 -566666.666667 13266666.666667 0.000000 0.000000 -950000.000000 181714.285714 3000000.000000 5000000.000000 14993349.000000 1924075.750000 ... 32218107.000000 -16803693.000000 -4254397.400000 -3394303.000000 NaN -39203367.000000 41056818.666667 77092059.500000 -57224233.333333 -2261668.500000 72856619.000000

20 rows × 56 columns

In [607]:
sns.set(rc = {'figure.figsize' : (13,7)})
sns.heatmap(pivot_genre, cmap = 'PuBuGn', linewidths = .2)
plt.title('Genres by Profit per Year')
Out[607]:
Text(0.5, 1.0, 'Genres by Profit per Year')
No description has been provided for this image

The analysis of profit per genre per year shows how profitability trends vary widely across both genres and time. Some genres, like Adventure, Fantasy, and Animation, display consistently strong profits across multiple decades, while others such as Documentary, TV Movie, and Foreign remain comparatively weak or inconsistent. A heatmap was used to visualize these results, making it easy to identify periods of peak profitability as well as fluctuations in performance over time.

In [ ]:
 

6. Movies released in certain years achieve higher revenue/ROI than others¶

In [608]:
movies_genre.head()
Out[608]:
popularity budget revenue original_title runtime release_year vote_count vote_average profit genre roi
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Action 9.090192
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Adventure 9.090192
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Science Fiction 9.090192
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Thriller 9.090192
1 28.419936 150000000 378436354 Mad Max: Fury Road 120 2015 6185 7.100000 228436354 Action 1.522909
In [609]:
# Using average per year to reduce the effect of volume changes

year_avgs = (
    movies_genre.loc[np.isfinite(movies_genre['roi'])]
     .groupby('release_year')[['revenue','roi']]
     .mean()
     .dropna()
     .reset_index()
     .rename(columns={'release_year':'year'})
)
In [610]:
year_avgs.head()
Out[610]:
year revenue roi
0 1960 15829791.666667 6.127400
1 1961 40055795.800000 8.848848
2 1962 19961705.485714 5.181289
3 1963 15149947.885714 2.608494
4 1964 39796993.862069 7.035513
In [ ]:
 
In [ ]:
 
In [612]:
# Trend strength (non-parametric)

spearman_revenue = year_avgs[['year','revenue']].corr(method='spearman').iloc[0,1]
spearman_roi = year_avgs[['year','roi']].corr(method='spearman').iloc[0,1]

print(f"Spearman(year vs avg revenue) = {spearman_revenue:.3f}")
print(f"Spearman(year vs avg ROI)     = {spearman_roi:.3f}")
Spearman(year vs avg revenue) = 0.910
Spearman(year vs avg ROI)     = -0.446
In [619]:
# Top 5 years by avg revenue and avg ROI

top5_rev_years = year_avgs.nlargest(5, 'revenue')[['year','revenue']]
top5_roi_years = year_avgs.nlargest(5, 'roi')[['year','roi']]

display(top5_rev_years)
display(top5_roi_years)
year revenue
55 2015 154679453.148077
54 2014 134690522.286252
52 2012 124013528.616056
53 2013 112221979.003205
41 2001 106517145.637975
year roi
26 1986 33074.737704
25 1985 1162.371159
51 2011 120.688962
39 1999 52.301377
17 1977 49.531412
In [ ]:
 
In [620]:
fig, axes = plt.subplots(1,2, figsize=(14,6))

# Top 5 revenue years
axes[0].bar(top5_rev_years['year'].astype(str), top5_rev_years['revenue'], color='darkcyan')
axes[0].set_title('Top 5 Years by Avg Revenue')
axes[0].set_ylabel('Avg Revenue')

# Top 5 ROI years
axes[1].bar(top5_roi_years['year'].astype(str), top5_roi_years['roi'], color='darkblue')
axes[1].set_title('Top 5 Years by Avg ROI')
axes[1].set_ylabel('Avg ROI')

plt.tight_layout(); plt.show()
No description has been provided for this image
In [633]:
fig, ax1 = plt.subplots(figsize=(10,6))

ax1.plot(year_avgs['year'], year_avgs['revenue'], color='darkcyan', label='Revenue')
ax1.set_xlabel('Year'); ax1.set_ylabel('Average Revenue', color='darkcyan')
ax1.tick_params(axis='y', labelcolor='darkcyan')

ax2 = ax1.twinx()
ax2.plot(year_avgs['year'], year_avgs['roi'], color='darkblue', label='ROI')
ax2.set_ylabel('Average ROI', color='darkblue')
ax2.tick_params(axis='y', labelcolor='darkblue')

plt.title('Average Revenue vs ROI per Year')
fig.tight_layout(); plt.show()
No description has been provided for this image

The analysis highlights that 2015, 2014, and 2012 were the strongest years in terms of average revenue, reflecting the era of blockbuster dominance. In contrast, the highest ROI values were achieved in earlier breakout years such as 1986, 1985, and 1977, when lower budgets produced disproportionately high returns. A dual-axis line chart was used to display these trends, clearly showing that while revenues have generally risen over time, ROI peaks appear sporadically in specific years.

In [ ]:
 

7. Genres with consistently high ROI are better long-term targets than high-budget genres¶

In [637]:
movies_genre.head()
Out[637]:
popularity budget revenue original_title runtime release_year vote_count vote_average profit genre roi
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Action 9.090192
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Adventure 9.090192
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Science Fiction 9.090192
0 32.985763 150000000 1513528810 Jurassic World 124 2015 5562 6.500000 1363528810 Thriller 9.090192
1 28.419936 150000000 378436354 Mad Max: Fury Road 120 2015 6185 7.100000 228436354 Action 1.522909
In [638]:
# Average budget vs average ROI by genre

genre_budget = movies_genre.groupby('genre')['budget'].mean().rename('avg_budget')
genre_roi    = movies_genre.groupby('genre')['roi'].mean().rename('avg_roi')
budget_roi   = pd.concat([genre_budget, genre_roi], axis=1).dropna()
In [639]:
budget_roi.head(5)
Out[639]:
avg_budget avg_roi
genre
Action 27727820.331376 714.999015
Adventure 37543694.530931 1126.819397
Animation 23159781.606581 2.268490
Comedy 13297915.618244 52.962398
Crime 17663801.124815 1.752887
In [640]:
# Rank correlation: are high budgets the same as high ROI?

kendall_tau = budget_roi['avg_budget'].rank().corr(budget_roi['avg_roi'].rank(), method='kendall')
print(f"[H6] Kendall tau(rank budget vs rank ROI) = {kendall_tau:.3f}")
[H6] Kendall tau(rank budget vs rank ROI) = 0.168
In [642]:
# “Efficient” genres: above-median ROI at or below median budget

b_med   = budget_roi['avg_budget'].median()
roi_med = budget_roi['avg_roi'].median()
efficient_genres = (
    budget_roi[(budget_roi['avg_roi'] >= roi_med) & (budget_roi['avg_budget'] <= b_med)]
    .sort_values('avg_roi', ascending=False)
    .reset_index()
)
display(efficient_genres)
genre avg_budget avg_roi
0 Romance 12531271.847547 1218.318745
1 Drama 11880717.773529 479.537545
2 Mystery 16119270.062963 53.486201
3 Comedy 13297915.618244 52.962398
4 Thriller 17207693.769178 38.658836
5 Horror 6226529.210751 35.107782
6 Documentary 577149.148077 10.820285
In [627]:
plt.figure(figsize=(10,6))
plt.scatter(budget_roi['avg_budget'], budget_roi['avg_roi'], color='darkcyan')

plt.axvline(b_med, color='red', linestyle='--', label='Median Budget')
plt.axhline(roi_med, color='blue', linestyle='--', label='Median ROI')

plt.title('Avg Budget vs Avg ROI by Genre')
plt.xlabel('Average Budget')
plt.ylabel('Average ROI')
plt.legend()
plt.tight_layout()
plt.show()
No description has been provided for this image

The analysis shows that genres such as Romance and Drama achieve exceptionally high ROI despite having only moderate budgets, while others like Horror and Documentary also perform efficiently at lower costs. The scatter plot of average budget versus average ROI highlights this pattern, with efficient genres appearing in the quadrant of low budget and high ROI, making them better long-term investment targets than high-budget categories.

Reccomendation¶

Based on the analysis, studios and investors should prioritize genres with consistently high ROI at moderate or low budgets, such as Romance, Drama, Horror, and Documentary, as they provide strong returns with less financial risk. While blockbuster genres like Action and Adventure generate high revenues, they also require large budgets and carry greater risk. Historical trends show that revenues have steadily increased in recent years, but the highest ROI spikes often occur in earlier or niche markets, suggesting opportunities in lower-cost productions. Balancing high-grossing blockbuster projects with a portfolio of efficient, ROI-driven genres can help ensure both profitability and sustainability. This mixed strategy would reduce dependence on costly hits while capitalizing on proven efficient categories.

In [ ]:
 

Conclusion¶

In conclusion, the findings demonstrate that while blockbuster genres and high-budget productions dominate in terms of revenue, they do not always guarantee the best financial efficiency. Genres with moderate or low budgets but consistently high ROI, such as Romance, Drama, and Horror, present more sustainable long-term opportunities. Revenue trends show steady growth over the years, but ROI has been more volatile, with peaks in select periods and genres. Overall, success in the film industry depends not only on scale and popularity but also on strategic investment in genres that balance profitability with financial risk.