Movie Genre Data Analysis¶
Introduction¶
This research explores patterns in the film industry by examining how genres, budgets, revenues, profits, popularity, and voting averages shape movie performance. The study is guided by key questions such as which genres dominate in production, deliver the highest financial returns, and capture the most audience approval, as well as how revenue and budget trends vary across different years. To deepen the analysis, several hypotheses are tested to investigate whether high ratings, popularity, or large budgets truly drive profit and revenue, and whether certain genres or production years offer consistent long-term efficiency. Together, these questions and hypotheses aim to highlight not only which films succeed, but also what factors make them sustainable investments in the movie market.
Research Questions (Q):¶
- Which genres are the most common (number of movies made)?
- Which genres have high avg. budget and revenue?
- Which genres have high avg. profit?
- Which genres have high avg. popularity?
- Which genres have highest number of movies with an voting avg. >=8?
- Which production years saw the highest total revenue, and how does this align with budget trends?
Research Hypotheses (H):¶
- The best movies according to vote avg. return high profit and revenue.
- The best movies according to popularity return high profit and revenue.
- Highly budgeted movies return high revenue and profit.
- Highly budgeted movies have a high popularity.
- Profit per genre per year
- Movies released in certain years achieve higher revenue/ROI than others
- Genres with consistently high ROI are better long-term targets than high-budget genres
import pandas as pd
movies = pd.read_csv('/Users/justiceoppongtuah/Library/CloudStorage/OneDrive-Personal/Pandas Project/imdb_movies.csv')
movies.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10866 entries, 0 to 10865 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 10866 non-null int64 1 imdb_id 10856 non-null object 2 popularity 10866 non-null float64 3 budget 10866 non-null int64 4 revenue 10866 non-null int64 5 original_title 10866 non-null object 6 cast 10790 non-null object 7 homepage 2936 non-null object 8 director 10822 non-null object 9 tagline 8042 non-null object 10 keywords 9373 non-null object 11 overview 10862 non-null object 12 runtime 10866 non-null int64 13 genres 10843 non-null object 14 production_companies 9836 non-null object 15 release_date 10866 non-null object 16 vote_count 10866 non-null int64 17 vote_average 10866 non-null float64 18 release_year 10866 non-null int64 19 budget_adj 10866 non-null float64 20 revenue_adj 10866 non-null float64 dtypes: float64(4), int64(6), object(11) memory usage: 1.7+ MB
pd.set_option('display.max.rows', 11000)
pd.set_option('display.max.columns', 22)
movies.head()
| id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | keywords | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | monster|dna|tyrannosaurus rex|velociraptor|island | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.500000 | 2015 | 137999939.280026 | 1392445892.523800 |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | future|chase|post-apocalyptic|dystopia|australia | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.100000 | 2015 | 137999939.280026 | 348161292.489031 |
| 2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | based on novel|revolution|dystopia|sequel|dyst... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/15 | 2480 | 6.300000 | 2015 | 101199955.472019 | 271619025.407628 |
| 3 | 140607 | tt2488496 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | http://www.starwars.com/films/star-wars-episod... | J.J. Abrams | Every generation has a story. | android|spaceship|jedi|space opera|3d | Thirty years after defeating the Galactic Empi... | 136 | Action|Adventure|Science Fiction|Fantasy | Lucasfilm|Truenorth Productions|Bad Robot | 12/15/15 | 5292 | 7.500000 | 2015 | 183999919.040035 | 1902723129.801820 |
| 4 | 168259 | tt2820852 | 9.335014 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | http://www.furious7.com/ | James Wan | Vengeance Hits Home | car race|speed|revenge|suspense|car | Deckard Shaw seeks revenge against Dominic Tor... | 137 | Action|Crime|Thriller | Universal Pictures|Original Film|Media Rights ... | 4/1/15 | 2947 | 7.300000 | 2015 | 174799923.088033 | 1385748801.470520 |
Data Cleaning¶
1. Checking for duplicates:¶
movies[movies.duplicated()]
| id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | keywords | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2090 | 42194 | tt0411951 | 0.596430 | 30000000 | 967000 | TEKKEN | Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian... | NaN | Dwight H. Little | Survival is no game | martial arts|dystopia|based on video game|mart... | In the year of 2039, after World Wars destroy ... | 92 | Crime|Drama|Action|Thriller|Science Fiction | Namco|Light Song Films | 3/20/10 | 110 | 5.000000 | 2010 | 30000000.000000 | 967000.000000 |
Deleting duplicates:¶
movies.drop_duplicates(inplace = True)
movies[movies.duplicated()]
| id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | keywords | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|
2. Deleting null data¶
a. from the info 'genres' has (10866 - 10843) rows being null
movies.dropna(subset = ['genres'], inplace = True)
movies.info()
<class 'pandas.core.frame.DataFrame'> Index: 10842 entries, 0 to 10865 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 10842 non-null int64 1 imdb_id 10834 non-null object 2 popularity 10842 non-null float64 3 budget 10842 non-null int64 4 revenue 10842 non-null int64 5 original_title 10842 non-null object 6 cast 10767 non-null object 7 homepage 2931 non-null object 8 director 10800 non-null object 9 tagline 8036 non-null object 10 keywords 9367 non-null object 11 overview 10839 non-null object 12 runtime 10842 non-null int64 13 genres 10842 non-null object 14 production_companies 9826 non-null object 15 release_date 10842 non-null object 16 vote_count 10842 non-null int64 17 vote_average 10842 non-null float64 18 release_year 10842 non-null int64 19 budget_adj 10842 non-null float64 20 revenue_adj 10842 non-null float64 dtypes: float64(4), int64(6), object(11) memory usage: 1.8+ MB
Calculating and adding 'profit' field to the data¶
movies['profit'] = movies['revenue'] - movies['budget']
Extracting Fields to Work with:¶
movies_genre = movies[['popularity', 'budget', 'revenue', 'original_title', 'runtime', 'genres', 'release_year', 'vote_count', 'vote_average', 'profit']]
movies_genre.head()
from pandas import Series, DataFrame
Expanding the data by 'genres' per row¶
This is to make it more useful for our use cases:
movies_genre = (movies_genre.assign(genre=movies_genre['genres'].str.split('|')).explode('genre').drop(columns='genres'))
movies_genre
| popularity | budget | revenue | original_title | runtime | release_year | vote_count | vote_average | profit | genre | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Action |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Adventure |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Science Fiction |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Thriller |
| 1 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | 120 | 2015 | 6185 | 7.100000 | 228436354 | Action |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10863 | 0.065141 | 0 | 0 | Beregis Avtomobilya | 94 | 1966 | 11 | 6.500000 | 0 | Mystery |
| 10863 | 0.065141 | 0 | 0 | Beregis Avtomobilya | 94 | 1966 | 11 | 6.500000 | 0 | Comedy |
| 10864 | 0.064317 | 0 | 0 | What's Up, Tiger Lily? | 80 | 1966 | 22 | 5.400000 | 0 | Action |
| 10864 | 0.064317 | 0 | 0 | What's Up, Tiger Lily? | 80 | 1966 | 22 | 5.400000 | 0 | Comedy |
| 10865 | 0.035919 | 19000 | 0 | Manos: The Hands of Fate | 74 | 1966 | 15 | 1.500000 | -19000 | Horror |
26955 rows × 10 columns
movies_genre.head(10)
| popularity | budget | revenue | original_title | runtime | release_year | vote_count | vote_average | profit | genre | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Action |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Adventure |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Science Fiction |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Thriller |
| 1 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | 120 | 2015 | 6185 | 7.100000 | 228436354 | Action |
| 1 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | 120 | 2015 | 6185 | 7.100000 | 228436354 | Adventure |
| 1 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | 120 | 2015 | 6185 | 7.100000 | 228436354 | Science Fiction |
| 1 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | 120 | 2015 | 6185 | 7.100000 | 228436354 | Thriller |
| 2 | 13.112507 | 110000000 | 295238201 | Insurgent | 119 | 2015 | 2480 | 6.300000 | 185238201 | Adventure |
| 2 | 13.112507 | 110000000 | 295238201 | Insurgent | 119 | 2015 | 2480 | 6.300000 | 185238201 | Science Fiction |
Research Questions (Q):¶
1. Which genres are the most common (number of movies made)?
genres_count = pd.DataFrame(movies_genre.groupby('genre').original_title.nunique()).sort_values('original_title', ascending = True)
genres_count
| original_title | |
|---|---|
| genre | |
| Western | 163 |
| TV Movie | 167 |
| Foreign | 188 |
| War | 269 |
| History | 331 |
| Music | 403 |
| Documentary | 520 |
| Animation | 697 |
| Mystery | 796 |
| Fantasy | 899 |
| Science Fiction | 1207 |
| Family | 1211 |
| Crime | 1337 |
| Adventure | 1442 |
| Horror | 1580 |
| Romance | 1686 |
| Action | 2339 |
| Thriller | 2841 |
| Comedy | 3750 |
| Drama | 4672 |
genres_count['original_title'].plot.pie(title = 'Movies Per Genre in %', autopct = '%1.1f%%', figsize = (9,9))
<Axes: title={'center': 'Movies Per Genre in %'}, ylabel='original_title'>
The analysis above shows that Drama, Comedy, and Thriller are the most common genres, with Drama leading at 4,672 movies. The pie chart highlights how these genres dominate film production compared to less common ones like Western, TV Movie, and Foreign.
2. Which genres have high avg. budget and revenue?
genres_avg = genres_count.groupby('genre').mean()
genres_avg
| original_title | |
|---|---|
| genre | |
| Action | 2339.000000 |
| Adventure | 1442.000000 |
| Animation | 697.000000 |
| Comedy | 3750.000000 |
| Crime | 1337.000000 |
| Documentary | 520.000000 |
| Drama | 4672.000000 |
| Family | 1211.000000 |
| Fantasy | 899.000000 |
| Foreign | 188.000000 |
| History | 331.000000 |
| Horror | 1580.000000 |
| Music | 403.000000 |
| Mystery | 796.000000 |
| Romance | 1686.000000 |
| Science Fiction | 1207.000000 |
| TV Movie | 167.000000 |
| Thriller | 2841.000000 |
| War | 269.000000 |
| Western | 163.000000 |
genres_avg = movies_genre.groupby('genre').mean(numeric_only = True)
pd.options.display.float_format = '{:2f}'.format
genres_avg
| popularity | budget | revenue | runtime | release_year | vote_count | vote_average | profit | |
|---|---|---|---|---|---|---|---|---|
| genre | ||||||||
| Action | 0.926274 | 27727820.331376 | 72794732.002517 | 104.917785 | 2000.056208 | 392.993708 | 5.787752 | 45066911.671141 |
| Adventure | 1.154259 | 37543694.530931 | 113137861.069341 | 106.173351 | 1999.388851 | 513.125085 | 5.940585 | 75594166.538409 |
| Animation | 0.852182 | 23159781.606581 | 75256062.223176 | 68.181688 | 2003.995708 | 303.000000 | 6.403147 | 52096280.616595 |
| Comedy | 0.592607 | 13297915.618244 | 37526242.072238 | 96.745057 | 2000.816240 | 176.436330 | 5.905167 | 24228326.453994 |
| Crime | 0.744930 | 17663801.124815 | 42368661.645495 | 106.917282 | 1999.491137 | 278.805022 | 6.124889 | 24704860.520679 |
| Documentary | 0.181432 | 577149.148077 | 2041106.994231 | 102.651923 | 2008.313462 | 35.105769 | 6.908462 | 1463957.846154 |
| Drama | 0.591495 | 11880717.773529 | 29232255.725840 | 110.478151 | 2000.929202 | 182.544538 | 6.165546 | 17351537.952311 |
| Family | 0.786668 | 23359337.420796 | 72433176.373680 | 89.603574 | 2000.774167 | 272.320877 | 5.997563 | 49073838.952884 |
| Fantasy | 0.992840 | 32612585.348253 | 96313657.081878 | 100.736900 | 2000.292576 | 420.741266 | 5.863537 | 63701071.733624 |
| Foreign | 0.191496 | 1451434.925532 | 1520459.835106 | 107.228723 | 2001.398936 | 16.627660 | 5.981383 | 69024.909574 |
| History | 0.575936 | 18594919.302395 | 32011793.215569 | 136.206587 | 1997.497006 | 183.772455 | 6.410479 | 13416873.913174 |
| Horror | 0.465357 | 6226529.210751 | 16822808.624313 | 94.424557 | 2001.117288 | 120.059866 | 5.337447 | 10596279.413561 |
| Music | 0.487321 | 9438627.549020 | 28571768.691176 | 105.137255 | 2000.196078 | 124.340686 | 6.480392 | 19133141.142157 |
| Mystery | 0.690012 | 16119270.062963 | 40217566.661728 | 105.928395 | 1999.850617 | 236.998765 | 5.946790 | 24098296.598765 |
| Romance | 0.592082 | 12531271.847547 | 35691972.327103 | 106.891355 | 2000.438668 | 166.070678 | 6.042874 | 23160700.479556 |
| Science Fiction | 1.001548 | 24972680.524003 | 70140558.034174 | 99.419854 | 1999.978845 | 437.096013 | 5.665582 | 45167877.510171 |
| TV Movie | 0.270896 | 267664.670659 | 251497.005988 | 91.982036 | 2004.634731 | 34.365269 | 5.788024 | -16167.664671 |
| Thriller | 0.741563 | 17207693.769178 | 41728417.543860 | 103.247678 | 2001.693498 | 255.484348 | 5.750671 | 24520723.774682 |
| War | 0.727683 | 20891886.103704 | 47605183.300000 | 127.625926 | 1996.096296 | 270.733333 | 6.297778 | 26713297.196296 |
| Western | 0.590615 | 18974107.975758 | 28568709.284848 | 117.575758 | 1986.915152 | 205.739394 | 6.083030 | 9594601.309091 |
genres_avg.sort_values('budget', ascending = True, inplace = True)
genres_avg[['budget', 'revenue']].plot.barh(title = 'Budget and Revenue Per Genre', color = ('DarkBlue', 'DarkCyan'), figsize = (8,7))
<Axes: title={'center': 'Budget and Revenue Per Genre'}, ylabel='genre'>
The analysis above reveals that Adventure, Fantasy, and Action genres stand out with the highest average budgets and revenues, reflecting their blockbuster nature. A horizontal bar chart was used to compare these values across genres, making it clear that large-scale productions typically dominate financial performance in the film industry.
3. Which genres have high avg. profit?
genres_avg
| popularity | budget | revenue | runtime | release_year | vote_count | vote_average | profit | |
|---|---|---|---|---|---|---|---|---|
| genre | ||||||||
| Documentary | 0.181432 | 577149.148077 | 2041106.994231 | 102.651923 | 2008.313462 | 35.105769 | 6.908462 | 1463957.846154 |
| Foreign | 0.191496 | 1451434.925532 | 1520459.835106 | 107.228723 | 2001.398936 | 16.627660 | 5.981383 | 69024.909574 |
| TV Movie | 0.270896 | 267664.670659 | 251497.005988 | 91.982036 | 2004.634731 | 34.365269 | 5.788024 | -16167.664671 |
| Horror | 0.465357 | 6226529.210751 | 16822808.624313 | 94.424557 | 2001.117288 | 120.059866 | 5.337447 | 10596279.413561 |
| Music | 0.487321 | 9438627.549020 | 28571768.691176 | 105.137255 | 2000.196078 | 124.340686 | 6.480392 | 19133141.142157 |
| History | 0.575936 | 18594919.302395 | 32011793.215569 | 136.206587 | 1997.497006 | 183.772455 | 6.410479 | 13416873.913174 |
| Western | 0.590615 | 18974107.975758 | 28568709.284848 | 117.575758 | 1986.915152 | 205.739394 | 6.083030 | 9594601.309091 |
| Drama | 0.591495 | 11880717.773529 | 29232255.725840 | 110.478151 | 2000.929202 | 182.544538 | 6.165546 | 17351537.952311 |
| Romance | 0.592082 | 12531271.847547 | 35691972.327103 | 106.891355 | 2000.438668 | 166.070678 | 6.042874 | 23160700.479556 |
| Comedy | 0.592607 | 13297915.618244 | 37526242.072238 | 96.745057 | 2000.816240 | 176.436330 | 5.905167 | 24228326.453994 |
| Mystery | 0.690012 | 16119270.062963 | 40217566.661728 | 105.928395 | 1999.850617 | 236.998765 | 5.946790 | 24098296.598765 |
| War | 0.727683 | 20891886.103704 | 47605183.300000 | 127.625926 | 1996.096296 | 270.733333 | 6.297778 | 26713297.196296 |
| Thriller | 0.741563 | 17207693.769178 | 41728417.543860 | 103.247678 | 2001.693498 | 255.484348 | 5.750671 | 24520723.774682 |
| Crime | 0.744930 | 17663801.124815 | 42368661.645495 | 106.917282 | 1999.491137 | 278.805022 | 6.124889 | 24704860.520679 |
| Family | 0.786668 | 23359337.420796 | 72433176.373680 | 89.603574 | 2000.774167 | 272.320877 | 5.997563 | 49073838.952884 |
| Animation | 0.852182 | 23159781.606581 | 75256062.223176 | 68.181688 | 2003.995708 | 303.000000 | 6.403147 | 52096280.616595 |
| Action | 0.926274 | 27727820.331376 | 72794732.002517 | 104.917785 | 2000.056208 | 392.993708 | 5.787752 | 45066911.671141 |
| Fantasy | 0.992840 | 32612585.348253 | 96313657.081878 | 100.736900 | 2000.292576 | 420.741266 | 5.863537 | 63701071.733624 |
| Science Fiction | 1.001548 | 24972680.524003 | 70140558.034174 | 99.419854 | 1999.978845 | 437.096013 | 5.665582 | 45167877.510171 |
| Adventure | 1.154259 | 37543694.530931 | 113137861.069341 | 106.173351 | 1999.388851 | 513.125085 | 5.940585 | 75594166.538409 |
genres_avg.sort_values('profit', ascending = True, inplace = True)
genres_avg['profit'].plot.barh(title = 'Profit Per Genre', color = 'DarkCyan', figsize = (8,7))
<Axes: title={'center': 'Profit Per Genre'}, ylabel='genre'>
The analysis reveals that Adventure, Fantasy, Animation, and Family stand out as the most profitable genres, with Animation and Family even surpassing Science Fiction in average profit. A horizontal bar chart was used to visualize these results, showing blockbuster and family-oriented genres leading in financial returns.
4. Which genres have high avg. popularity?
genres_avg.sort_values('popularity', ascending = True, inplace = True)
genres_avg['popularity'].plot.barh(title = 'Popularity Per Genre', color = 'DarkCyan', figsize = (8,7))
<Axes: title={'center': 'Popularity Per Genre'}, ylabel='genre'>
The analysis shows that Adventure, Fantasy and Science Fiction are the genres with the highest average popularity, indicating strong audience appeal. A horizontal bar chart was used to visualize the comparison, highlighting how these genres consistently attract the most attention.
5. Which genres have highest number of movies with an voting avg. >=8?
movies_genre.head(5)
| popularity | budget | revenue | original_title | runtime | release_year | vote_count | vote_average | profit | genre | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Action |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Adventure |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Science Fiction |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Thriller |
| 1 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | 120 | 2015 | 6185 | 7.100000 | 228436354 | Action |
vote_fifty = movies_genre[(movies_genre['vote_count'] >= 50) & (movies_genre['vote_average'] >= 8)]
vote_zero = movies_genre[movies_genre['vote_average'] >= 8]
genres_vote = pd.DataFrame(vote_zero.groupby('genre').vote_average.nunique()).sort_values('vote_average', ascending = True)
genres_vote
| vote_average | |
|---|---|
| genre | |
| Family | 1 |
| Action | 2 |
| Mystery | 2 |
| History | 2 |
| Fantasy | 2 |
| Adventure | 2 |
| War | 2 |
| Horror | 3 |
| Romance | 3 |
| Science Fiction | 3 |
| Thriller | 4 |
| Animation | 4 |
| Crime | 4 |
| Comedy | 5 |
| Drama | 6 |
| Music | 8 |
| Documentary | 9 |
genres_vote['vote_average'].plot.barh(title = 'Vote Average by Genre', color = 'DarkCyan', figsize = (8,7))
<Axes: title={'center': 'Vote Average by Genre'}, ylabel='genre'>
The analysis shows that Documetary, Music and Drama have the highest number of movies with a voting average of 8 or above. Meanwhile, Family, Action and Mystery have the lowest. A horizontal bar chart was used to display the counts, making it clear which genres consistently produce highly rated films.
6. Which production years saw the highest total revenue, and how does this align with budget trends?
import numpy as np
import matplotlib.pyplot as plt
movies_genre
| popularity | budget | revenue | original_title | runtime | release_year | vote_count | vote_average | profit | genre | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Action |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Adventure |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Science Fiction |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Thriller |
| 1 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | 120 | 2015 | 6185 | 7.100000 | 228436354 | Action |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10863 | 0.065141 | 0 | 0 | Beregis Avtomobilya | 94 | 1966 | 11 | 6.500000 | 0 | Mystery |
| 10863 | 0.065141 | 0 | 0 | Beregis Avtomobilya | 94 | 1966 | 11 | 6.500000 | 0 | Comedy |
| 10864 | 0.064317 | 0 | 0 | What's Up, Tiger Lily? | 80 | 1966 | 22 | 5.400000 | 0 | Action |
| 10864 | 0.064317 | 0 | 0 | What's Up, Tiger Lily? | 80 | 1966 | 22 | 5.400000 | 0 | Comedy |
| 10865 | 0.035919 | 19000 | 0 | Manos: The Hands of Fate | 74 | 1966 | 15 | 1.500000 | -19000 | Horror |
26955 rows × 10 columns
# Yearly totals
yearly_totals = (movies_genre.groupby('release_year')[['revenue', 'budget']].sum(min_count=1).dropna().sort_index())
# Efficiency and ranks
yearly_totals['rev_to_budget_ratio'] = np.where(yearly_totals['budget'] > 0,yearly_totals['revenue'] / yearly_totals['budget'],np.nan)
yearly_totals['revenue_rank'] = yearly_totals['revenue'].rank(ascending=False, method='min')
yearly_totals['budget_rank'] = yearly_totals['budget'].rank(ascending=False, method='min')
yearly_totals['rank_gap(rev-budget)'] = yearly_totals['revenue_rank'] - yearly_totals['budget_rank']
# Top 10 years by total revenue
q6_top10 = yearly_totals.sort_values('revenue', ascending=False).head(10)
display(yearly_totals.head(12))
display(q6_top10)
| revenue | budget | rev_to_budget_ratio | revenue_rank | budget_rank | rank_gap(rev-budget) | |
|---|---|---|---|---|---|---|
| release_year | ||||||
| 1960 | 417915000 | 64140844 | 6.515583 | 55.000000 | 56.000000 | -1.000000 |
| 1961 | 1201673874 | 117537000 | 10.223792 | 47.000000 | 54.000000 | -7.000000 |
| 1962 | 726409692 | 184194630 | 3.943707 | 53.000000 | 48.000000 | 5.000000 |
| 1963 | 552004390 | 237328000 | 2.325913 | 54.000000 | 45.000000 | 9.000000 |
| 1964 | 1178849290 | 134471644 | 8.766527 | 49.000000 | 52.000000 | -3.000000 |
| 1965 | 1525459848 | 212590345 | 7.175584 | 46.000000 | 46.000000 | 0.000000 |
| 1966 | 195736689 | 136164000 | 1.437507 | 56.000000 | 51.000000 | 5.000000 |
| 1967 | 2177966116 | 324282200 | 6.716268 | 42.000000 | 42.000000 | 0.000000 |
| 1968 | 817108595 | 238439000 | 3.426908 | 51.000000 | 44.000000 | 7.000000 |
| 1969 | 743971370 | 132203174 | 5.627485 | 52.000000 | 53.000000 | -1.000000 |
| 1970 | 1692997298 | 480535838 | 3.523145 | 45.000000 | 40.000000 | 5.000000 |
| 1971 | 1200690537 | 253348000 | 4.739294 | 48.000000 | 43.000000 | 5.000000 |
| revenue | budget | rev_to_budget_ratio | revenue_rank | budget_rank | rank_gap(rev-budget) | |
|---|---|---|---|---|---|---|
| release_year | ||||||
| 2015 | 82160071197 | 23020835159 | 3.568944 | 1.000000 | 8.000000 | -7.000000 |
| 2014 | 72059584262 | 23255945482 | 3.098545 | 2.000000 | 7.000000 | -5.000000 |
| 2012 | 72019644724 | 24044807311 | 2.995227 | 3.000000 | 5.000000 | -2.000000 |
| 2009 | 71844890702 | 26207420963 | 2.741395 | 4.000000 | 4.000000 | 0.000000 |
| 2013 | 70798442190 | 26406690003 | 2.681080 | 5.000000 | 3.000000 | 2.000000 |
| 2011 | 70142112650 | 26566819277 | 2.640215 | 6.000000 | 2.000000 | 4.000000 |
| 2010 | 65304621240 | 28482377815 | 2.292808 | 7.000000 | 1.000000 | 6.000000 |
| 2007 | 60426073803 | 23263499212 | 2.597463 | 8.000000 | 6.000000 | 2.000000 |
| 2008 | 54764768294 | 22167781392 | 2.470467 | 9.000000 | 9.000000 | 0.000000 |
| 2004 | 50004105495 | 21297896216 | 2.347842 | 10.000000 | 12.000000 | -2.000000 |
# Correlation of yearly totals (how tightly budgets track revenue)
q6_spearman = yearly_totals[['revenue','budget']].corr(method='spearman').iloc[0,1]
print(f"[Q6] Spearman(revenue vs budget) across years = {q6_spearman:.3f}")
[Q6] Spearman(revenue vs budget) across years = 0.975
# Plot: revenue vs budget over time
plt.figure(figsize=(11,6))
plt.plot(yearly_totals.index, yearly_totals['revenue'], label='Total Revenue')
plt.plot(yearly_totals.index, yearly_totals['budget'], label='Total Budget')
plt.title('Total Revenue & Budget by Year')
plt.xlabel('Year'); plt.ylabel('Amount'); plt.legend(); plt.tight_layout(); plt.show()
The analysis shows that the highest total revenues were recorded in recent blockbuster years, with 2015, 2014, and 2012 leading the list. The trend also reveals that while budgets rose steadily during these periods, revenue growth often outpaced budget increases, resulting in strong revenue-to-budget ratios. A line chart comparing total revenue and total budget by year was used to illustrate this alignment.
Q7) Which genres deliver the highest average ROI, and is high ROI consistent over time?
ROI (Return on Investment) • Definition: A percentage that measures the efficiency of an investment relative to its cost.
• Formula:
ROI = [{Profit (or Gain) from Investment} \ {Cost of Investment}] * 100%
• Type: Measured in percentage (%).
• Purpose: Shows how effectively your investment generates returns — it answers “Was the investment worth it?”
import matplotlib.pyplot as plt
import seaborn as sns
# ROI at movie level (guard against division by zero)
movies_genre['profit'] = movies_genre['revenue'] - movies_genre['budget']
movies_genre['roi'] = np.where(movies_genre['budget'] > 0, movies_genre['profit'] / movies_genre['budget'], np.nan)
movies_genre.head()
| popularity | budget | revenue | original_title | runtime | release_year | vote_count | vote_average | profit | genre | roi | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Action | 9.090192 |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Adventure | 9.090192 |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Science Fiction | 9.090192 |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Thriller | 9.090192 |
| 1 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | 120 | 2015 | 6185 | 7.100000 | 228436354 | Action | 1.522909 |
# Mean ROI per (genre, year)
roi_by_year = (
movies_genre
.dropna(subset=['release_year', 'roi'])
.groupby(['genre', 'release_year'], as_index=False)['roi'].mean()
.rename(columns={'release_year': 'year', 'roi': 'mean_roi_year'})
)
roi_by_year.head(15)
| genre | year | mean_roi_year | |
|---|---|---|---|
| 0 | Action | 1960 | 2.726250 |
| 1 | Action | 1961 | 0.844444 |
| 2 | Action | 1962 | 18.171717 |
| 3 | Action | 1963 | 9.715914 |
| 4 | Action | 1964 | 34.685714 |
| 5 | Action | 1965 | 2.030697 |
| 6 | Action | 1966 | -0.014614 |
| 7 | Action | 1967 | 4.678831 |
| 8 | Action | 1968 | 4.209170 |
| 9 | Action | 1969 | 1.927660 |
| 10 | Action | 1970 | 1.375660 |
| 11 | Action | 1971 | 14.932931 |
| 12 | Action | 1972 | 12.283831 |
| 13 | Action | 1973 | 21.699363 |
| 14 | Action | 1974 | 6.609024 |
# Consistency by genre: overall mean ROI, volatility across years, years observed
roi_consistency = (
roi_by_year
.groupby('genre', as_index=False)
.agg(
mean_roi=('mean_roi_year', 'mean'),
roi_volatility=('mean_roi_year', 'std'),
years_observed=('year', 'nunique')
)
.fillna({'roi_volatility': 0})
.sort_values(['mean_roi', 'years_observed'], ascending=[False, False])
)
display(roi_consistency.head(20))
| genre | mean_roi | roi_volatility | years_observed | |
|---|---|---|---|---|
| 7 | Family | 5097.366486 | 36013.022664 | 50 |
| 14 | Romance | 1834.896107 | 13610.464854 | 56 |
| 1 | Adventure | 1659.294028 | 12374.474298 | 56 |
| 0 | Action | 1217.297322 | 9074.426205 | 56 |
| 6 | Drama | 822.291712 | 5678.375641 | 56 |
| 17 | Thriller | 77.977464 | 527.720212 | 55 |
| 3 | Comedy | 56.189440 | 329.346892 | 56 |
| 11 | Horror | 23.207555 | 79.766938 | 56 |
| 13 | Mystery | 22.797339 | 104.184279 | 51 |
| 5 | Documentary | 8.312802 | 24.843905 | 22 |
| 15 | Science Fiction | 4.964676 | 16.941324 | 53 |
| 12 | Music | 4.784395 | 8.749801 | 45 |
| 8 | Fantasy | 4.497297 | 18.673400 | 47 |
| 2 | Animation | 4.263699 | 9.198865 | 43 |
| 4 | Crime | 3.578558 | 5.809079 | 55 |
| 19 | Western | 1.896628 | 4.866633 | 39 |
| 18 | War | 1.868804 | 2.589981 | 48 |
| 10 | History | 1.344613 | 2.230760 | 48 |
| 16 | TV Movie | 0.200000 | 3.174902 | 7 |
| 9 | Foreign | -0.667293 | 0.621465 | 17 |
# Visual: mean ROI vs volatility
plt.figure(figsize=(8,6))
sns.scatterplot(data=roi_consistency, x='mean_roi', y='roi_volatility')
plt.title('Genre ROI: Mean vs Volatility'); plt.xlabel('Mean ROI'); plt.ylabel('ROI Volatility (Std Dev)')
plt.tight_layout();
plt.show()
The analysis shows that Family, Romance, and Adventure deliver the highest average ROI, although they also exhibit very high volatility across years. Genres such as Thriller, Comedy, and Horror offer more moderate but relatively stable returns. A scatter plot of mean ROI versus volatility was used to highlight the trade-off between profitability and consistency, helping to identify which genres balance high returns with lower risk.
Research Hypotheses (H):¶
1. The best movies according to vote avg. return high profit and revenue.
movies_genre.head()
| popularity | budget | revenue | original_title | runtime | release_year | vote_count | vote_average | profit | genre | roi | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Action | 9.090192 |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Adventure | 9.090192 |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Science Fiction | 9.090192 |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Thriller | 9.090192 |
| 1 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | 120 | 2015 | 6185 | 7.100000 | 228436354 | Action | 1.522909 |
movies_counted = movies_genre[movies_genre['vote_count'] >= 50]
movies_counted.corr(method='spearman', numeric_only=True)
| popularity | budget | revenue | runtime | release_year | vote_count | vote_average | profit | roi | |
|---|---|---|---|---|---|---|---|---|---|
| popularity | 1.000000 | 0.496554 | 0.606165 | 0.231128 | 0.091893 | 0.776718 | 0.213364 | 0.516448 | 0.412798 |
| budget | 0.496554 | 1.000000 | 0.723696 | 0.355018 | 0.024279 | 0.573096 | -0.036132 | 0.343588 | 0.016027 |
| revenue | 0.606165 | 0.723696 | 1.000000 | 0.346976 | -0.075445 | 0.699428 | 0.126505 | 0.841594 | 0.713213 |
| runtime | 0.231128 | 0.355018 | 0.346976 | 1.000000 | -0.136733 | 0.269755 | 0.260482 | 0.223173 | 0.101995 |
| release_year | 0.091893 | 0.024279 | -0.075445 | -0.136733 | 1.000000 | 0.176190 | -0.134809 | -0.075017 | -0.146544 |
| vote_count | 0.776718 | 0.573096 | 0.699428 | 0.269755 | 0.176190 | 1.000000 | 0.307466 | 0.598132 | 0.490226 |
| vote_average | 0.213364 | -0.036132 | 0.126505 | 0.260482 | -0.134809 | 0.307466 | 1.000000 | 0.217354 | 0.298911 |
| profit | 0.516448 | 0.343588 | 0.841594 | 0.223173 | -0.075017 | 0.598132 | 0.217354 | 1.000000 | 0.837730 |
| roi | 0.412798 | 0.016027 | 0.713213 | 0.101995 | -0.146544 | 0.490226 | 0.298911 | 0.837730 | 1.000000 |
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(9,7))
sns.regplot(x = 'vote_average', y = 'profit', data = movies_counted, line_kws = {'color':'cyan'})
<Axes: xlabel='vote_average', ylabel='profit'>
plt.figure(figsize=(9,7))
sns.regplot(x = 'vote_average', y = 'revenue', data = movies_counted, line_kws = {'color':'cyan'})
<Axes: xlabel='vote_average', ylabel='revenue'>
****The correlation analysis shows a positive but relatively weak relationship between vote average and both profit and revenue. This suggests that while highly rated movies can generate strong financial returns, ratings alone are not the dominant driver of profit and revenue. A regression plot was used to visualize this relationship, showing the general upward trend but also the wide spread of data points.****
2. The best movies according to popularity return high profit and revenue.
movies_counted.corr(method='spearman', numeric_only=True)
| popularity | budget | revenue | runtime | release_year | vote_count | vote_average | profit | roi | |
|---|---|---|---|---|---|---|---|---|---|
| popularity | 1.000000 | 0.496554 | 0.606165 | 0.231128 | 0.091893 | 0.776718 | 0.213364 | 0.516448 | 0.412798 |
| budget | 0.496554 | 1.000000 | 0.723696 | 0.355018 | 0.024279 | 0.573096 | -0.036132 | 0.343588 | 0.016027 |
| revenue | 0.606165 | 0.723696 | 1.000000 | 0.346976 | -0.075445 | 0.699428 | 0.126505 | 0.841594 | 0.713213 |
| runtime | 0.231128 | 0.355018 | 0.346976 | 1.000000 | -0.136733 | 0.269755 | 0.260482 | 0.223173 | 0.101995 |
| release_year | 0.091893 | 0.024279 | -0.075445 | -0.136733 | 1.000000 | 0.176190 | -0.134809 | -0.075017 | -0.146544 |
| vote_count | 0.776718 | 0.573096 | 0.699428 | 0.269755 | 0.176190 | 1.000000 | 0.307466 | 0.598132 | 0.490226 |
| vote_average | 0.213364 | -0.036132 | 0.126505 | 0.260482 | -0.134809 | 0.307466 | 1.000000 | 0.217354 | 0.298911 |
| profit | 0.516448 | 0.343588 | 0.841594 | 0.223173 | -0.075017 | 0.598132 | 0.217354 | 1.000000 | 0.837730 |
| roi | 0.412798 | 0.016027 | 0.713213 | 0.101995 | -0.146544 | 0.490226 | 0.298911 | 0.837730 | 1.000000 |
import matplotlib.pyplot as plt
plt.figure(figsize = (9,7))
sns.regplot(x = 'popularity', y = 'revenue', data = movies_counted, line_kws = {'color':'cyan'})
plt.show()
The correlation results show that popularity has a strong positive relationship with both revenue (0.61) and profit (0.52). This means that highly popular movies generally achieve higher commercial success, confirming the hypothesis. A regression plot was used to illustrate this trend, with a clear upward slope demonstrating how popularity drives financial performance.
3. Highly budgeted movies return high profit.
plt.figure(figsize = (9,7))
sns.regplot(x = 'budget', y = 'profit', data = movies_counted, line_kws = {'color':'cyan'})
plt.show()
The analysis shows a moderate positive relationship between budget and profit, indicating that higher-budget movies often achieve greater profits. However, the wide spread of data points in the regression plot highlights that large budgets do not always guarantee high returns.
4. Highly budgeted movies have a high popularity.
plt.figure(figsize = (9,7))
sns.regplot(x = 'budget', y = 'popularity', data = movies_counted, line_kws = {'color':'cyan'})
plt.show()
The analysis indicates only a weak positive relationship between budget and popularity, suggesting that higher-budget movies may attract more attention but not consistently so. The regression plot shows an upward trend, but the scatter of points reveals that popularity is influenced by other factors beyond budget.
5. Profit per Genre per Year¶
movies_genre.head()
| popularity | budget | revenue | original_title | runtime | release_year | vote_count | vote_average | profit | genre | roi | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Action | 9.090192 |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Adventure | 9.090192 |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Science Fiction | 9.090192 |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Thriller | 9.090192 |
| 1 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | 120 | 2015 | 6185 | 7.100000 | 228436354 | Action | 1.522909 |
time_genre = pd.DataFrame(movies_genre.groupby(['release_year', 'genre'])['profit'].mean())
time_genre.head(10)
| profit | ||
|---|---|---|
| release_year | genre | |
| 1960 | Action | 6363125.000000 |
| Adventure | 431000.000000 | |
| Comedy | 5258750.000000 | |
| Crime | 0.000000 | |
| Drama | 8245619.384615 | |
| Family | 3333333.333333 | |
| Fantasy | -375000.000000 | |
| Foreign | 0.000000 | |
| History | 9600000.000000 | |
| Horror | 4389007.428571 |
pivot_genre = pd.pivot_table(time_genre, values = 'profit', index = ['genre'], columns = ['release_year'])
pivot_genre
| release_year | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | ... | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| genre | |||||||||||||||||||||||
| Action | 6363125.000000 | 2457142.857143 | 10437500.000000 | 14999691.250000 | 24280000.000000 | 16104504.777778 | 324785.714286 | 21732786.428571 | 8868478.833333 | 5861449.300000 | 7735913.454545 | ... | 32379154.185714 | 38048623.212500 | 45977444.052632 | 47642764.404040 | 57555510.740741 | 34571030.355140 | 51968078.252174 | 71910257.424242 | 47431725.628099 | 69943650.015504 | 90399046.747664 |
| Adventure | 431000.000000 | 39130002.333333 | 21214285.714286 | 9428395.000000 | 24123400.000000 | 25124276.333333 | 791563.636364 | 49281873.857143 | 15122159.400000 | 16651809.400000 | 11406377.666667 | ... | 70756754.320755 | 64895200.090909 | 83916731.083333 | 55016526.809524 | 113364919.736111 | 98125646.677966 | 113343224.564516 | 162549574.300000 | 100638113.402985 | 120373306.985075 | 144787868.971014 |
| Animation | NaN | 211880014.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | -105000.000000 | 100921706.000000 | 0.000000 | 0.000000 | 25837628.500000 | ... | 40307312.153846 | 39047734.974359 | 67147670.531250 | 51907242.575758 | 48862898.250000 | 58455658.640000 | 63420098.934783 | 69231546.875000 | 93718795.261905 | 64313463.250000 | 83691174.384615 |
| Comedy | 5258750.000000 | 22198001.400000 | 1599974.800000 | 765591.923077 | 7080111.375000 | 3264285.714286 | -149550.000000 | 10135087.470588 | -1277777.777778 | -125000.000000 | 11901488.583333 | ... | 22147617.440559 | 24944308.174194 | 30674848.549669 | 21483012.426036 | 24741982.808081 | 23154807.875740 | 27241358.720930 | 25631999.136364 | 25068455.091429 | 26648786.864865 | 33703819.067901 |
| Crime | 0.000000 | 18828411.000000 | 3709948.666667 | 369526.750000 | 1723307.300000 | -22500.000000 | -275560.000000 | 12899678.714286 | 6762174.600000 | 31482963.000000 | -2125000.000000 | ... | 15680793.688889 | 21551432.962264 | 16700048.400000 | 31258696.854839 | 11016642.307692 | 13490412.320000 | 25395271.625000 | 29522925.555556 | 23248589.507042 | 5903014.230769 | 61286524.372549 |
| Documentary | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | NaN | NaN | 0.000000 | ... | 316012.705882 | 6772940.000000 | 5229344.631579 | 321262.230769 | -1603933.560000 | 187417.771429 | 2567196.816327 | 1102306.591837 | 321880.225806 | 13104.136986 | 531061.859649 |
| Drama | 8245619.384615 | 4912698.375000 | 4373796.190476 | 1587307.692308 | 3726105.550000 | 11933809.800000 | 1014793.062500 | 18045095.000000 | 6354859.150000 | 10317995.538462 | 19753262.842105 | ... | 17180823.510989 | 12994800.142132 | 13953425.421320 | 16736384.017167 | 15209056.888393 | 13510589.176190 | 10830976.523364 | 23418944.870690 | 13344942.608696 | 13697098.665493 | 16315310.438462 |
| Family | 3333333.333333 | 42576002.800000 | 4000000.000000 | -333333.333333 | 37818181.750000 | 51671428.666667 | -341828.571429 | 64281204.000000 | -640000.000000 | 0.000000 | 17225085.666667 | ... | 51714774.929825 | 35281751.276923 | 80470561.733333 | 44483026.142857 | 62009322.100000 | 98999767.490909 | 65483435.472222 | 69321493.023810 | 89321956.437500 | 56295182.744186 | 85974520.750000 |
| Fantasy | -375000.000000 | 0.000000 | -126.000000 | -500000.000000 | 24068181.750000 | NaN | 0.000000 | -3000000.000000 | -2000000.000000 | 0.000000 | 0.000000 | ... | 64388182.000000 | 57967661.441176 | 93461583.425532 | 51441997.767442 | 96235822.102041 | 83235328.522727 | 94486677.847826 | 122245621.970588 | 79363628.000000 | 112966482.638889 | 86576948.303030 |
| Foreign | 0.000000 | 0.000000 | -126.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN | NaN | -25000000.000000 | ... | -57761.300000 | -250.000000 | -1978098.117647 | 5591672.111111 | -1032352.941176 | -1185845.384615 | -403761.428571 | 0.000000 | NaN | NaN | NaN |
| History | 9600000.000000 | 2333333.333333 | 11250000.000000 | 4908750.000000 | 0.000000 | -3440000.000000 | 0.000000 | NaN | 4433333.333333 | 28102963.000000 | 18954430.333333 | ... | 19308745.700000 | 6680655.250000 | 10272090.153846 | 7683159.875000 | 3519452.875000 | 14883053.714286 | 12320119.375000 | 6216918.769231 | 23672850.857143 | 22025802.133333 | 4193780.266667 |
| Horror | 4389007.428571 | 600000.000000 | 231600.000000 | 898558.777778 | -10833.333333 | -60000.000000 | -6333.333333 | -16250.000000 | 7520356.500000 | -175000.000000 | -12000.000000 | ... | 17915983.446429 | 9281245.571429 | 16102357.113924 | 1813709.855263 | 7410959.733333 | 9365626.051282 | 5780349.653846 | 6799669.932692 | 14019309.127451 | 6162654.752381 | 5394573.384000 |
| Music | 0.000000 | 18828411.000000 | 8000000.000000 | NaN | 30354655.200000 | 76757143.000000 | NaN | 7200000.000000 | 2266666.666667 | 0.000000 | 0.000000 | ... | 16848210.461538 | 31348176.333333 | 40089857.400000 | 6433060.950000 | 36892810.647059 | -700418.900000 | 31950506.882353 | 27459603.590909 | 78735.100000 | 12672735.785714 | 17045008.303030 |
| Mystery | NaN | 0.000000 | -245000.000000 | 1568014.666667 | 1216209.750000 | 0.000000 | 1550000.000000 | 8459992.666667 | 25626555.750000 | -4000000.000000 | 2499906.000000 | ... | 12485711.068966 | 26821791.500000 | 56994329.941176 | 14934154.931034 | 14554527.098039 | 45215039.968750 | 33299398.657895 | 11465554.363636 | 3487391.552632 | 21897562.888889 | 10063931.666667 |
| Romance | 5225000.000000 | 214285.714286 | -400025.200000 | 4388698.500000 | 6722222.222222 | 50547239.200000 | 1216666.666667 | 17960342.636364 | 5444314.000000 | -312500.000000 | 19933333.333333 | ... | 27892262.129032 | 14888539.914286 | 16750359.121622 | 35814305.607143 | 26014869.542169 | 22446537.105882 | 25097807.626667 | 24848927.860759 | 8434191.753846 | 15024969.250000 | 22447977.175439 |
| Science Fiction | -316666.666667 | 0.000000 | -31000.000000 | -375000.000000 | -300000.000000 | 0.000000 | 917866.666667 | -16250.000000 | 18043865.750000 | 0.000000 | 5333239.333333 | ... | 63536368.000000 | 7707033.766667 | 26902035.439024 | 20509952.403846 | 70332541.295775 | 42764127.844444 | 52067849.625000 | 63413513.685185 | 80084403.885246 | 86164859.403226 | 83321303.523256 |
| TV Movie | NaN | NaN | NaN | NaN | NaN | 0.000000 | NaN | NaN | NaN | 0.000000 | 0.000000 | ... | 0.000000 | -1275000.000000 | 0.000000 | 0.000000 | -1500000.000000 | -312500.000000 | 0.000000 | 0.000000 | -200000.000000 | 0.000000 | -150000.000000 |
| Thriller | 5028842.000000 | NaN | 7902857.142857 | 8612335.300000 | 14029426.555556 | 11285049.363636 | 650000.000000 | 26892953.000000 | 6762174.600000 | 23658164.333333 | 18097830.200000 | ... | 22759650.239130 | 24651056.850877 | 19789641.168000 | 26262034.385827 | 13872156.605096 | 14726113.059701 | 19873485.568493 | 22469518.868750 | 22021043.091429 | 20129150.273743 | 32869106.847953 |
| War | 0.000000 | 11450000.000000 | 15000000.000000 | 1000000.000000 | 2546757.333333 | 11020224.375000 | 2000000.000000 | 20300000.000000 | 4325000.000000 | -4000000.000000 | 33032658.200000 | ... | 2339931.600000 | 3074035.285714 | 54239461.666667 | 7426139.555556 | 20431563.083333 | -8446301.571429 | 7767908.111111 | 23710293.600000 | 21060731.285714 | 60608265.086957 | 54677314.333333 |
| Western | 484166.666667 | -566666.666667 | 13266666.666667 | 0.000000 | 0.000000 | -950000.000000 | 181714.285714 | 3000000.000000 | 5000000.000000 | 14993349.000000 | 1924075.750000 | ... | 32218107.000000 | -16803693.000000 | -4254397.400000 | -3394303.000000 | NaN | -39203367.000000 | 41056818.666667 | 77092059.500000 | -57224233.333333 | -2261668.500000 | 72856619.000000 |
20 rows × 56 columns
sns.set(rc = {'figure.figsize' : (13,7)})
sns.heatmap(pivot_genre, cmap = 'PuBuGn', linewidths = .2)
plt.title('Genres by Profit per Year')
Text(0.5, 1.0, 'Genres by Profit per Year')
The analysis of profit per genre per year shows how profitability trends vary widely across both genres and time. Some genres, like Adventure, Fantasy, and Animation, display consistently strong profits across multiple decades, while others such as Documentary, TV Movie, and Foreign remain comparatively weak or inconsistent. A heatmap was used to visualize these results, making it easy to identify periods of peak profitability as well as fluctuations in performance over time.
6. Movies released in certain years achieve higher revenue/ROI than others¶
movies_genre.head()
| popularity | budget | revenue | original_title | runtime | release_year | vote_count | vote_average | profit | genre | roi | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Action | 9.090192 |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Adventure | 9.090192 |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Science Fiction | 9.090192 |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Thriller | 9.090192 |
| 1 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | 120 | 2015 | 6185 | 7.100000 | 228436354 | Action | 1.522909 |
# Using average per year to reduce the effect of volume changes
year_avgs = (
movies_genre.loc[np.isfinite(movies_genre['roi'])]
.groupby('release_year')[['revenue','roi']]
.mean()
.dropna()
.reset_index()
.rename(columns={'release_year':'year'})
)
year_avgs.head()
| year | revenue | roi | |
|---|---|---|---|
| 0 | 1960 | 15829791.666667 | 6.127400 |
| 1 | 1961 | 40055795.800000 | 8.848848 |
| 2 | 1962 | 19961705.485714 | 5.181289 |
| 3 | 1963 | 15149947.885714 | 2.608494 |
| 4 | 1964 | 39796993.862069 | 7.035513 |
# Trend strength (non-parametric)
spearman_revenue = year_avgs[['year','revenue']].corr(method='spearman').iloc[0,1]
spearman_roi = year_avgs[['year','roi']].corr(method='spearman').iloc[0,1]
print(f"Spearman(year vs avg revenue) = {spearman_revenue:.3f}")
print(f"Spearman(year vs avg ROI) = {spearman_roi:.3f}")
Spearman(year vs avg revenue) = 0.910 Spearman(year vs avg ROI) = -0.446
# Top 5 years by avg revenue and avg ROI
top5_rev_years = year_avgs.nlargest(5, 'revenue')[['year','revenue']]
top5_roi_years = year_avgs.nlargest(5, 'roi')[['year','roi']]
display(top5_rev_years)
display(top5_roi_years)
| year | revenue | |
|---|---|---|
| 55 | 2015 | 154679453.148077 |
| 54 | 2014 | 134690522.286252 |
| 52 | 2012 | 124013528.616056 |
| 53 | 2013 | 112221979.003205 |
| 41 | 2001 | 106517145.637975 |
| year | roi | |
|---|---|---|
| 26 | 1986 | 33074.737704 |
| 25 | 1985 | 1162.371159 |
| 51 | 2011 | 120.688962 |
| 39 | 1999 | 52.301377 |
| 17 | 1977 | 49.531412 |
fig, axes = plt.subplots(1,2, figsize=(14,6))
# Top 5 revenue years
axes[0].bar(top5_rev_years['year'].astype(str), top5_rev_years['revenue'], color='darkcyan')
axes[0].set_title('Top 5 Years by Avg Revenue')
axes[0].set_ylabel('Avg Revenue')
# Top 5 ROI years
axes[1].bar(top5_roi_years['year'].astype(str), top5_roi_years['roi'], color='darkblue')
axes[1].set_title('Top 5 Years by Avg ROI')
axes[1].set_ylabel('Avg ROI')
plt.tight_layout(); plt.show()
fig, ax1 = plt.subplots(figsize=(10,6))
ax1.plot(year_avgs['year'], year_avgs['revenue'], color='darkcyan', label='Revenue')
ax1.set_xlabel('Year'); ax1.set_ylabel('Average Revenue', color='darkcyan')
ax1.tick_params(axis='y', labelcolor='darkcyan')
ax2 = ax1.twinx()
ax2.plot(year_avgs['year'], year_avgs['roi'], color='darkblue', label='ROI')
ax2.set_ylabel('Average ROI', color='darkblue')
ax2.tick_params(axis='y', labelcolor='darkblue')
plt.title('Average Revenue vs ROI per Year')
fig.tight_layout(); plt.show()
The analysis highlights that 2015, 2014, and 2012 were the strongest years in terms of average revenue, reflecting the era of blockbuster dominance. In contrast, the highest ROI values were achieved in earlier breakout years such as 1986, 1985, and 1977, when lower budgets produced disproportionately high returns. A dual-axis line chart was used to display these trends, clearly showing that while revenues have generally risen over time, ROI peaks appear sporadically in specific years.
7. Genres with consistently high ROI are better long-term targets than high-budget genres¶
movies_genre.head()
| popularity | budget | revenue | original_title | runtime | release_year | vote_count | vote_average | profit | genre | roi | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Action | 9.090192 |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Adventure | 9.090192 |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Science Fiction | 9.090192 |
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 2015 | 5562 | 6.500000 | 1363528810 | Thriller | 9.090192 |
| 1 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | 120 | 2015 | 6185 | 7.100000 | 228436354 | Action | 1.522909 |
# Average budget vs average ROI by genre
genre_budget = movies_genre.groupby('genre')['budget'].mean().rename('avg_budget')
genre_roi = movies_genre.groupby('genre')['roi'].mean().rename('avg_roi')
budget_roi = pd.concat([genre_budget, genre_roi], axis=1).dropna()
budget_roi.head(5)
| avg_budget | avg_roi | |
|---|---|---|
| genre | ||
| Action | 27727820.331376 | 714.999015 |
| Adventure | 37543694.530931 | 1126.819397 |
| Animation | 23159781.606581 | 2.268490 |
| Comedy | 13297915.618244 | 52.962398 |
| Crime | 17663801.124815 | 1.752887 |
# Rank correlation: are high budgets the same as high ROI?
kendall_tau = budget_roi['avg_budget'].rank().corr(budget_roi['avg_roi'].rank(), method='kendall')
print(f"[H6] Kendall tau(rank budget vs rank ROI) = {kendall_tau:.3f}")
[H6] Kendall tau(rank budget vs rank ROI) = 0.168
# “Efficient” genres: above-median ROI at or below median budget
b_med = budget_roi['avg_budget'].median()
roi_med = budget_roi['avg_roi'].median()
efficient_genres = (
budget_roi[(budget_roi['avg_roi'] >= roi_med) & (budget_roi['avg_budget'] <= b_med)]
.sort_values('avg_roi', ascending=False)
.reset_index()
)
display(efficient_genres)
| genre | avg_budget | avg_roi | |
|---|---|---|---|
| 0 | Romance | 12531271.847547 | 1218.318745 |
| 1 | Drama | 11880717.773529 | 479.537545 |
| 2 | Mystery | 16119270.062963 | 53.486201 |
| 3 | Comedy | 13297915.618244 | 52.962398 |
| 4 | Thriller | 17207693.769178 | 38.658836 |
| 5 | Horror | 6226529.210751 | 35.107782 |
| 6 | Documentary | 577149.148077 | 10.820285 |
plt.figure(figsize=(10,6))
plt.scatter(budget_roi['avg_budget'], budget_roi['avg_roi'], color='darkcyan')
plt.axvline(b_med, color='red', linestyle='--', label='Median Budget')
plt.axhline(roi_med, color='blue', linestyle='--', label='Median ROI')
plt.title('Avg Budget vs Avg ROI by Genre')
plt.xlabel('Average Budget')
plt.ylabel('Average ROI')
plt.legend()
plt.tight_layout()
plt.show()
The analysis shows that genres such as Romance and Drama achieve exceptionally high ROI despite having only moderate budgets, while others like Horror and Documentary also perform efficiently at lower costs. The scatter plot of average budget versus average ROI highlights this pattern, with efficient genres appearing in the quadrant of low budget and high ROI, making them better long-term investment targets than high-budget categories.
Reccomendation¶
Based on the analysis, studios and investors should prioritize genres with consistently high ROI at moderate or low budgets, such as Romance, Drama, Horror, and Documentary, as they provide strong returns with less financial risk. While blockbuster genres like Action and Adventure generate high revenues, they also require large budgets and carry greater risk. Historical trends show that revenues have steadily increased in recent years, but the highest ROI spikes often occur in earlier or niche markets, suggesting opportunities in lower-cost productions. Balancing high-grossing blockbuster projects with a portfolio of efficient, ROI-driven genres can help ensure both profitability and sustainability. This mixed strategy would reduce dependence on costly hits while capitalizing on proven efficient categories.
Conclusion¶
In conclusion, the findings demonstrate that while blockbuster genres and high-budget productions dominate in terms of revenue, they do not always guarantee the best financial efficiency. Genres with moderate or low budgets but consistently high ROI, such as Romance, Drama, and Horror, present more sustainable long-term opportunities. Revenue trends show steady growth over the years, but ROI has been more volatile, with peaks in select periods and genres. Overall, success in the film industry depends not only on scale and popularity but also on strategic investment in genres that balance profitability with financial risk.