How to Scrape Netflix Movies & TV Shows Data

Netflix, Inc. is an American media and technology service provider and production company headquartered in Los Gatos, California. It was founded in 1997 by Marc Randolph and Reed Hastings in Scotts Valley, California. The company's primary business is a subscription-based streaming service offering films and television series, including in-house productions.

Netflix is an extremely popular entertainment service used by people across the globe. This exploratory data analysis (EDA) explores a Netflix dataset using graphs and visualizations built with the Python libraries seaborn and matplotlib.

We use the Netflix Movies and TV Shows dataset from Kaggle for the EDA and visualizations. This dataset includes movies and TV shows available on Netflix as of 2019. It was gathered from Flixable, a third-party search engine for Netflix.

Importing Libraries

Let’s import the libraries needed.

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

Loading a Dataset

Using pandas, we load the CSV file into a data frame named netflix_df.

netflix_df = pd.read_csv("netflix_titles.csv")

Then check the first 5 rows.

This dataset has 6,234 titles and 12 columns. At a quick glance, the data frame looks like a typical movie and TV show table without any viewer ratings. We can also see NaN values in some columns.
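As a minimal sketch of that first look, the snippet below calls head() and shape on a tiny made-up sample. The three rows and the names sample_csv and sample_df are illustrative assumptions, not rows from the real file; on the full data the same calls would be made on netflix_df.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for netflix_titles.csv.
sample_csv = io.StringIO(
    "show_id,type,title,director\n"
    "1,Movie,Example Movie,Jane Doe\n"
    "2,TV Show,Example Show,\n"
    "3,Movie,Another Movie,John Roe\n"
)
sample_df = pd.read_csv(sample_csv)

first_rows = sample_df.head()      # up to the first 5 rows (here, all 3)
n_rows, n_cols = sample_df.shape   # (number of rows, number of columns)

print(first_rows)
print(n_rows, "rows,", n_cols, "columns")
```

The empty director field in the second row is read as NaN, which is how the missing values in the real file show up.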

Data Reporting & Cleaning

Data cleaning is the process of identifying incorrect, inaccurate, irrelevant, incomplete, or missing data and then modifying, replacing, or deleting it as required. It is considered a fundamental element of data science.

print('\nColumns with missing value:')
print(netflix_df.isnull().any())

Columns with missing value:
show_id         False
type            False
title           False
director         True
cast             True
country          True
date_added       True
release_year    False
rating           True
duration        False
listed_in       False
description     False
dtype: bool

From these details, we know that there are 6,234 entries and 12 columns to work with in the EDA. Five columns contain null values: “director,” “cast,” “country,” “date_added,” and “rating.”

show_id            0
type               0
title              0
director        1969
cast             570
country          476
date_added        11
release_year       0
rating            10
duration           0
listed_in          0
description        0
dtype: int64

There are 3,036 null values in the whole dataset: 1,969 under “director,” 570 under “cast,” 476 under “country,” 11 under “date_added,” and 10 under “rating.” We need to handle all of these null values before diving into the EDA and modeling.
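Per-column counts like these, and their total, can be obtained by summing the boolean mask that isnull() returns. The sketch below shows the idea on a small made-up frame; sample_df and its values are assumptions for illustration, and on the real data the calls would be made on netflix_df.

```python
import pandas as pd

# Hypothetical sample with a few missing entries.
sample_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Jane Doe", None, None],
    "rating": ["TV-14", "TV-MA", None],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = sample_df.isnull().sum()

# Summing the per-column counts gives the total for the whole frame.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```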

Imputation is the method of treating missing values by filling them in with a definite technique, such as the mode, the mean, or predictive modeling. Here, we use the fillna function from pandas for imputation. Alternatively, rows with missing values can be dropped with the dropna function from pandas.

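# Assign the filled columns back; in-place fillna on an attribute-accessed column may not update netflix_df in recent pandas versions
netflix_df["director"] = netflix_df["director"].fillna("No Director")
netflix_df["cast"] = netflix_df["cast"].fillna("No Cast")
netflix_df["country"] = netflix_df["country"].fillna("Country Unavailable")
# Drop the few rows without a date_added or rating value
netflix_df.dropna(subset=["date_added", "rating"], inplace=True)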

The easiest way to get rid of missing values would be to delete the rows that contain them. However, that would not help the EDA because it loses information. Since “director,” “cast,” and “country” have the most null values, we chose to label each missing value as unavailable. The other two columns, “date_added” and “rating,” have only a negligible share of missing data, so those rows are dropped from the dataset. In the end, we can confirm that no missing values remain in the data frame.
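One way to confirm that nothing is missing after cleaning is to count the remaining null cells and check that the total is zero. The following is a sketch on a made-up frame, not the article's exact code: sample_df, cleaned_df, and the sample values are assumptions, and the dictionary form of fillna is used here only for brevity.

```python
import pandas as pd

# Hypothetical sample with gaps in three columns.
sample_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Jane Doe", None, "John Roe"],
    "country": [None, "India", "Japan"],
    "rating": ["TV-14", "TV-MA", None],
})

# Label missing text fields instead of dropping the rows.
cleaned_df = sample_df.fillna(
    {"director": "No Director", "country": "Country Unavailable"}
)

# Drop the rows that still lack a rating.
cleaned_df = cleaned_df.dropna(subset=["rating"])

# Total number of null cells left in the cleaned frame.
remaining_missing = int(cleaned_df.isnull().sum().sum())
print("Remaining missing values:", remaining_missing)
```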

Exploratory Visualization and Analysis

1. Netflix Content by Type

The Netflix dataset includes both movies and TV shows. Let’s compare the total number of each to see which makes up the larger share.

plt.figure(figsize=(12,6))
plt.title("Percentage of Netflix Titles that are either Movies or TV Shows")
g = plt.pie(netflix_df.type.value_counts(),
            explode=(0.025,0.025),
            labels=netflix_df.type.value_counts().index,
            colors=['red','black'],
            autopct='%1.1f%%',
            startangle=180)
plt.show()

There are more than 4,000 movies and nearly 2,000 TV shows, so movies make up the majority: 68.5% of the titles are movies and 31.5% are TV shows.
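The counts and percentages behind a chart like this can also be computed directly with value_counts. The snippet below is a small sketch on made-up values; sample_types is an assumption for illustration, and on the real data the same calls would be made on netflix_df.type.

```python
import pandas as pd

# Hypothetical sample of the "type" column.
sample_types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

# Absolute number of titles per type.
type_counts = sample_types.value_counts()

# normalize=True returns fractions; scale to percentages with one decimal.
type_share = (sample_types.value_counts(normalize=True) * 100).round(1)

print(type_counts)
print(type_share)
```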

2. Amount of Content as a Function of Time

Next, we explore how much content has been added to Netflix over the years. Since we are interested in when Netflix added a title to its platform, we add a “year_added” column derived from the “date_added” column.
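The article does not show how the per-year frames are built, so the following is only one possible sketch. It assumes that “date_added” holds strings such as "September 9, 2019" (sometimes with a leading space) and that the “type” column uses the values "Movie" and "TV Show". The helper titles_per_year and all sample_ names are hypothetical. The helper returns the columns 'year' and 'date_added' (the count), matching the column names used in the line plot.

```python
import pandas as pd

def titles_per_year(df):
    """Count titles per year added; returns columns 'year' and 'date_added'."""
    # Assumed date format: "September 9, 2019"; strip stray spaces first.
    years = pd.to_datetime(
        df["date_added"].str.strip(), format="%B %d, %Y"
    ).dt.year
    counts = years.value_counts().sort_index()
    return pd.DataFrame({"year": counts.index, "date_added": counts.values})

# Hypothetical sample standing in for the cleaned netflix_df.
sample_df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "date_added": ["September 9, 2019", " January 1, 2018", "March 15, 2019"],
})

# Split by content type, then count per year for each subset.
sample_movies_df = sample_df[sample_df["type"] == "Movie"]
sample_shows_df = sample_df[sample_df["type"] == "TV Show"]

sample_year_df = titles_per_year(sample_df)
sample_movies_year_df = titles_per_year(sample_movies_df)
sample_shows_year_df = titles_per_year(sample_shows_df)

print(sample_year_df)
```

On the real data, the same helper could be applied to netflix_df and to its "Movie" and "TV Show" subsets to build the netflix_year_df, movies_year_df, and shows_year_df frames that the plot expects.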

fig, ax = plt.subplots(figsize=(13, 7))
sns.lineplot(data=netflix_year_df, x='year', y='date_added')
sns.lineplot(data=movies_year_df, x='year', y='date_added')
sns.lineplot(data=shows_year_df, x='year', y='date_added')
ax.set_xticks(np.arange(2008, 2020, 1))
plt.title("Total content added across all years (up to 2019)")
plt.legend(['Total','Movie','TV Show'])
plt.ylabel("Releases")
plt.xlabel("Year")
plt.show()

Based on the timeline, we can conclude that the streaming platform started gaining traction after 2013. Since then, the amount of content added has grown considerably. The growth in the number of movies on Netflix is much larger than the growth in TV shows. Around 1,300 new movies were added in both 2018 and 2019. We can also see that Netflix has focused more on movies than on TV shows in recent years.

3. Countries by Amount of Produced Content

Next, we look at countries by the amount of content they produced for Netflix. Because a title can list several countries, we need to split the countries listed for each title and remove titles with no country available before analyzing them.

# Split multi-country entries into one row per (title, country)
filtered_countries = netflix_df.set_index('title').country.str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
# Remove the placeholder used for missing countries
filtered_countries = filtered_countries[filtered_countries != 'Country Unavailable']
plt.figure(figsize=(13,7))
g = sns.countplot(y = filtered_countries, order=filtered_countries.value_counts().index[:15])
plt.title('Top 15 Contributing Countries on Netflix')
plt.xlabel('Titles')
plt.ylabel('Country')
plt.show()

The chart shows the top 15 contributing countries on Netflix. The country with the largest amount of content is the United States.
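To see what the splitting step does, here is a small sketch on made-up rows. It uses str.split followed by explode, which is an alternative to the split-and-stack chain in the article rather than the same code; sample_df and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: one title lists two countries, one has none.
sample_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["United States, India", "Country Unavailable", "India"],
})

# Split each entry into a list, then give every country its own row.
countries = sample_df["country"].str.split(", ").explode()

# Remove the placeholder used for missing countries.
countries = countries[countries != "Country Unavailable"]

# Number of titles per country, most frequent first.
top_countries = countries.value_counts()
print(top_countries)
```

The same pattern applies to the “director,” “cast,” and “listed_in” columns, which also hold comma-separated values.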

4. Top Directors on Netflix

To find the directors with the most titles, we can visualize the counts.

filtered_directors = netflix_df[netflix_df.director != 'No Director'].set_index('title').director.str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
plt.figure(figsize=(13,7))
plt.title('Top 10 Directors Based on the Number of Titles')
sns.countplot(y = filtered_directors, order=filtered_directors.value_counts().index[:10], palette='Blues')
plt.show()

The directors with the most titles on Netflix are mostly international.

5. Top Genres on Netflix

filtered_genres = netflix_df.set_index('title').listed_in.str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
plt.figure(figsize=(10,10))
g = sns.countplot(y = filtered_genres, order=filtered_genres.value_counts().index[:20])
plt.title('Top 20 Genres on Netflix')
plt.xlabel('Titles')
plt.ylabel('Genres')
plt.show()

This graph shows that International Movies take first place, followed by Dramas and Comedies.

6. Content by Ratings

# netflix_movies_df / netflix_shows_df: netflix_df filtered to type 'Movie' / 'TV Show'
# Count titles per rating for movies and for TV shows
count_movies = netflix_movies_df.groupby('rating')['title'].count()
count_shows = netflix_shows_df.groupby('rating')['title'].count()

# Align both counts on one sorted rating list; ratings absent from a group count as 0
ratings = sorted(netflix_df.rating.unique())
count_movies = count_movies.reindex(ratings, fill_value=0)
count_shows = count_shows.reindex(ratings, fill_value=0)

plt.figure(figsize=(13,7))
plt.title('Amount of Content by Rating (Movies vs TV Shows)')
plt.bar(ratings, count_movies)
# Stack the TV show counts on top of the movie counts
plt.bar(ratings, count_shows, bottom=count_movies)
plt.legend(['Movies', 'TV Shows'])
plt.show()

The largest share of Netflix content carries the “TV-14” rating. “TV-14” marks material that parents or adult guardians may find unsuitable for children under 14 years of age. However, the largest number of TV shows carry the “TV-MA” rating. “TV-MA” is the rating assigned by the TV Parental Guidelines to television programs designed for mature audiences only.

7. Top Actors on Netflix Based on Total Titles

filtered_cast_shows = netflix_shows_df[netflix_shows_df.cast != 'No Cast'].set_index('title').cast.str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
plt.figure(figsize=(13,7))
plt.title('Top 10 TV Show Actors Based on the Number of Titles')
sns.countplot(y = filtered_cast_shows, order=filtered_cast_shows.value_counts().index[:10], palette='pastel')
plt.show()

The top actor in Netflix TV shows, based on total titles, is Takahiro Sakurai.

filtered_cast_movie = netflix_movies_df[netflix_movies_df.cast != 'No Cast'].set_index('title').cast.str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
plt.figure(figsize=(13,7))
plt.title('Top 10 Movie Actors Based on the Number of Titles')
sns.countplot(y = filtered_cast_movie, order=filtered_cast_movie.value_counts().index[:10], palette='pastel')
plt.show()

The top actor in Netflix movies, based on total titles, is Anupam Kher.

Conclusion

We have drawn several interesting findings from the Netflix movies and TV shows dataset; here is a summary of some of them:

  • The country with the largest amount of content is the United States.
  • The streaming platform started gaining traction after 2014. Since then, the amount of content added has grown significantly.
  • International Movies is the most common genre on Netflix.
  • The largest share of Netflix content carries the “TV-14” rating.
  • The most common content type on Netflix is movies.
  • The actor with the most movie titles on Netflix is Anupam Kher.
  • The actor with the most TV show titles on Netflix is Takahiro Sakurai.
  • The director with the most titles on Netflix is Jan Suter.

Originally published at https://www.xbyte.io.

Founder of “X-Byte Enterprise Crawling”, a well-diversified corporation providing Enterprise grade Web Crawling service & solution, leveraging Cloud DaaS model