How to Scrape Amazon Prime Video Data Using Beautiful Soup & Selenium

Selenium is an extremely powerful tool for web scraping; however, it has some limitations, which is understandable because it was built mainly for testing web applications. BeautifulSoup, on the other hand, was developed specifically for data scraping, and it is extremely powerful too.

However, even BeautifulSoup has its faults: it is of little use when the required data sits behind a “wall”, that is, when accessing the data requires a user login or some other user action.

That’s where Selenium comes in: we can use it to automate user interactions with the website, and then use BeautifulSoup to scrape the data once we are inside the “wall”.

Integration of Selenium with BeautifulSoup makes an extremely powerful web scraping tool.

Although Selenium can both automate user interactions and extract data, BeautifulSoup is much more effective at extraction.

We will use BeautifulSoup and Selenium to extract movie information such as name, description, and rating for the comedy category on Amazon Prime Video, and we will filter the movies by their IMDb ratings.

So, let’s start.

First, let’s import the necessary modules:

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    from bs4 import BeautifulSoup as soup
    from time import sleep
    from selenium.common.exceptions import NoSuchElementException
    import pandas as pd

Then, initialize three empty lists to hold the movie information:

    movie_names = []
    movie_descriptions = []
    movie_ratings = []

For this program to work, you need the Chrome WebDriver (ChromeDriver). Make sure you download the driver version that matches the version of your Chrome browser.
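As a quick sanity check before running the scraper, the browser and driver versions can be compared by their major component. This is only a sketch: the helper names and the version strings are illustrative and not part of the original script, and comparing just the first number is an assumption about how ChromeDriver releases are paired with Chrome releases.

```python
def major_version(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_match(browser_version: str, driver_version: str) -> bool:
    """True when the browser and the driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


# Example values only; substitute the versions reported on your machine.
print(versions_match("114.0.5735.199", "114.0.5735.90"))
print(versions_match("115.0.5790.102", "114.0.5735.90"))
```

The first call compares two 114.x builds and the second compares a 115.x browser with a 114.x driver, so only the first pair counts as a match.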

Then, let's make a function open_site() that will open the sign-in page of Amazon Prime.

    def open_site():
        options = webdriver.ChromeOptions()
        options.add_argument("--disable-notifications")
        # executable_path and find_element_by_id are Selenium 3 style calls;
        # later Selenium 4 releases removed them in favor of a Service object
        # and find_element(By.ID, ...).
        driver = webdriver.Chrome(executable_path='PATH/TO/YOUR/CHROME/DRIVER', options=options)
        driver.get(r'https://www.amazon.com/ap/signin?accountStatusPolicy=P1&clientContext=261-1149697-3210253&language=en_US&openid.assoc_handle=amzn_prime_video_desktop_us&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.ns.pape=http%3A%2F%2Fspecs.openid.net%2Fextensions%2Fpape%2F1.0&openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.primevideo.com%2Fauth%2Freturn%2Fref%3Dav_auth_ap%3F_encoding%3DUTF8%26location%3D%252Fref%253Ddv_auth_ret')
        sleep(5)  # wait for the sign-in page to load
        driver.find_element_by_id('ap_email').send_keys('ENTER YOUR EMAIL ID')
        driver.find_element_by_id('ap_password').send_keys('ENTER YOUR PASSWORD', Keys.ENTER)
        sleep(2)  # wait for the login to complete
        search(driver)

Now, define the search() function, which searches for the genre we specify:

    def search(driver):
        driver.find_element_by_id('pv-search-nav').send_keys('Comedy Movies', Keys.ENTER)
        # Keep scrolling until the page height stops growing.
        last_height = driver.execute_script("return document.body.scrollHeight")
        while True:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            sleep(5)  # give the next batch of tiles time to load
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height
        # Hand the fully loaded page over to BeautifulSoup.
        html = driver.page_source
        Soup = soup(html, 'lxml')
        tiles = Soup.find_all('div', attrs={"class": "av-hover-wrapper"})
        for tile in tiles:
            movie_name = tile.find('h1', attrs={"class": "_1l3nhs tst-hover-title"})
            movie_description = tile.find('p', attrs={"class": "_36qUej _1TesgD tst-hover-synopsis"})
            movie_rating = tile.find('span', attrs={"class": "dv-grid-beard-info"})
            try:
                rating = movie_rating.span.text
                # The rating value is read from the last three characters.
                if float(rating[-3:]) > 8.0 and float(rating[-3:]) < 10.0:
                    movie_descriptions.append(movie_description.text)
                    movie_ratings.append(movie_rating.span.text)
                    movie_names.append(movie_name.text)
                    print(movie_name.text, rating)
            except (AttributeError, ValueError):
                # Skip tiles with a missing element or a non-numeric rating.
                pass
        dataFrame()

This function searches for a genre and scrolls down to the end of the page. Because Amazon Prime Video uses infinite scrolling, we keep scrolling with the JavaScript executor until the page height stops changing, then get the page source using driver.page_source and pass it to BeautifulSoup.
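The scrolling logic described above can be sketched as a standalone helper. The function name scroll_to_bottom, the max_rounds safety cap, and the FakeDriver stand-in are assumptions added for illustration; the original script runs the loop inline with a real Selenium driver and no cap.

```python
from time import sleep


def scroll_to_bottom(driver, pause=5.0, max_rounds=50):
    """Scroll until the document height stops growing or max_rounds is hit.

    Returns the number of scrolls performed.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give lazily loaded tiles time to appear
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            return rounds  # nothing new loaded, so we reached the end
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.index]
        # A scroll request "loads" more content, up to the final height.
        self.index = min(self.index + 1, len(self.heights) - 1)
        return None


print(scroll_to_bottom(FakeDriver(), pause=0))
```

With a real driver you would call scroll_to_bottom(driver) right after submitting the search. The cap is a design choice that prevents an endless loop if the page keeps growing.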

The if statement keeps only movies with a rating above 8.0; the upper limit of 10.0 is there just to be safe.
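The rating filter can be tried in isolation with a small sketch. The helper names and the sample strings are made up for illustration, and the strip() call is a small addition not present in the original script. One plausible reason for the upper bound is that the last three characters of a non-rating string, such as a year, can still parse as a number.

```python
def parse_rating(text):
    """Return the number held in the last three characters, or None."""
    try:
        return float(text.strip()[-3:])
    except ValueError:
        return None


def is_highly_rated(text, low=8.0, high=10.0):
    """True when the parsed rating lies strictly between low and high."""
    rating = parse_rating(text)
    return rating is not None and low < rating < high


# Made-up sample strings; real tiles may format the rating differently.
samples = ["IMDb 8.4", "IMDb 7.9", "2019", "IMDb 9.1", ""]
print([s for s in samples if is_highly_rated(s)])
```

Here "2019" parses as 19.0 from its last three characters, so it passes the lower bound and is rejected only by the upper one.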

Now, it’s time to make a pandas data frame to store all the movie data:

    def dataFrame():
        details = {
            'Movie Name': movie_names,
            'Description': movie_descriptions,
            'Rating': movie_ratings
        }
        data = pd.DataFrame.from_dict(details, orient='index')
        data = data.transpose()
        data.to_csv('Comedy.csv')
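To see what this reshaping does, here is a rough equivalent that uses the standard library's csv module instead of pandas. It is not the author's method, only a sketch: the function name and the sample data are invented, and unlike to_csv() it writes no index column. Building the frame with orient='index' and then transposing appears to be a way to tolerate columns of unequal length, and zip_longest plays that role here by padding shorter columns.

```python
import csv
import io
from itertools import zip_longest


def write_movies_csv(columns, handle):
    """Write a dict of column name -> list of values as CSV rows."""
    writer = csv.writer(handle)
    writer.writerow(columns.keys())
    # zip_longest pads shorter columns with "" so no row is dropped.
    for row in zip_longest(*columns.values(), fillvalue=""):
        writer.writerow(row)


# Made-up sample data; the Rating column is deliberately one item short.
details = {
    "Movie Name": ["Example A", "Example B"],
    "Description": ["First synopsis", "Second synopsis"],
    "Rating": ["IMDb 8.4"],
}
buffer = io.StringIO()
write_movies_csv(details, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open('Comedy.csv', 'w', newline='') instead of the in-memory buffer.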

Finally, let’s call open_site(), which in turn calls search() and dataFrame():

    open_site()

Your result won’t look exactly like ours, because we formatted the sheet a bit by adjusting the column widths and wrapping the text. Other than that, the contents should look the same.

While BeautifulSoup and Selenium work well together and can give good results, there are other modules that are equally powerful.

If you have any other queries related to scraping Amazon Prime Video data, you can always contact X-Byte!

Originally published at https://www.xbyte.io.

Founder of “X-Byte Enterprise Crawling”, a well-diversified corporation providing Enterprise grade Web Crawling service & solution, leveraging Cloud DaaS model