How to Scrape Hertz Car Inventory Using Python?

Hertz Global Holdings (HTZ) has received substantial media coverage because of its recent bankruptcy filing and its attempt to raise an additional $500 million in equity after the bankruptcy announcement. Hertz filed for Chapter 11 bankruptcy with the expectation of reorganizing. The equity raise has since been halted, as noted in the Hertz 8-K filing dated 18th June 2020.

This blog presents a study that uses Python to download car data from the Hertz car sales website and save it in pandas “DataFrames”, which we exported to “.csv” files and then analyzed in Excel. The code used is given at the end of this blog.

To extract the car data, we searched the Hertz car sales website for vehicles for sale within a 10,000-mile radius of St. Louis, MO. The same 10,000-mile radius download was run for New York City, NY and San Francisco, CA. Unless otherwise noted, the statistics given here relate to the “St. Louis” search. The search yielded some descriptive details, but not a comprehensive view of the size of the fleet portfolio that is for sale.
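The three searches differ only in the ZIP code passed in the query string. A minimal sketch of how such URLs could be built is shown here; the `build_search_url` helper is hypothetical, the parameter names are taken from the script at the end of this blog, and the St. Louis and San Francisco ZIP codes are assumed examples (only 10007 appears in the script).

```python
from urllib.parse import urlencode

BASE_URL = "https://www.hertzcarsales.com/used-cars-for-sale.htm"
RESULTS_PER_PAGE = 35  # page size noted in the script's comments


def build_search_url(zip_code, start=0, radius_miles=10000):
    """Return the URL of one results page (hypothetical helper)."""
    query = urlencode({"start": start, "geoRadius": radius_miles, "geoZip": zip_code})
    return BASE_URL + "?" + query


# 10007 comes from the script; the other two ZIP codes are assumptions.
SEARCH_ZIPS = {
    "New York City": "10007",
    "St. Louis": "63101",
    "San Francisco": "94102",
}

for city, zip_code in SEARCH_ZIPS.items():
    # Print the URL of the second results page for each city.
    print(city, build_search_url(zip_code, start=RESULTS_PER_PAGE))
```

Keeping the ZIP code as a parameter means one script can cover all three locations instead of editing the URL by hand for each run.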

The site initially reported a total of 26,054 vehicles for the St. Louis search. However, paging through the results yielded only 11,994 vehicles before the site started returning blank pages. In other words, the advertised result count did not match the results that were actually displayed, which raises questions about the accuracy of the search. This is most likely related to the programming or display structure of the used car sales website. Various online sources mention that Hertz originally put more than 50,000 vehicles up for sale. However, without a complete universe of distinct Vehicle Identification Numbers (VINs) collected through a systematic scraping approach, we cannot validate the size of the fleet for sale with certainty.
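One way to handle the blank pages is to stop paging as soon as a page comes back empty, rather than looping over a fixed number of pages. The sketch below is an illustration, not the approach used in the original script: `collect_until_blank` and `fake_fetch` are hypothetical names, and the fetch function is a stand-in so the example runs without network access.

```python
def collect_until_blank(fetch_page, page_size=35, max_pages=800):
    """Collect listings page by page and stop at the first blank page.

    fetch_page(start) is supplied by the caller and returns the list of
    listings found at that offset (an empty list for a blank page).
    """
    listings = []
    for page in range(max_pages):
        batch = fetch_page(page * page_size)
        if not batch:
            break  # blank page: nothing more is being displayed
        listings.extend(batch)
    return listings


def fake_fetch(start, available=100):
    # Stand-in for a real request: pretends only `available` listings exist.
    return list(range(start, min(start + 35, available)))


vehicles = collect_until_blank(fake_fetch)
print(len(vehicles), "listings collected before the first blank page")

# Share of the advertised results that was actually retrievable,
# using the two counts reported for the St. Louis search.
advertised, retrieved = 26_054, 11_994
print(f"coverage: {retrieved / advertised:.1%}")
```

In a real run, `fetch_page` would wrap the request and parsing steps from the script at the end of this blog.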

The data below briefly summarizes the St. Louis search. Some descriptions are longer than others. For instance, a couple of entries in the St. Louis data list ‘1992’ as the complete description, with a mileage of ‘1 mile’ and no other details. This is usual with larger datasets and is not a concern for us, as it is not material relative to the total size of the data from the search. In addition, 215 entries were for vehicles that require calling the location for a price quote, so these 215 items were removed from the St. Louis data. Here are sample statistics for the St. Louis data:
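The cleaning described above was done in Excel, but it could also be sketched in pandas. The rows below are made up for illustration, and the “Please call” text is an assumed placeholder for listings without a posted price; the real site may word it differently.

```python
import pandas as pd

# Made-up rows that mimic the scraper's output columns.
raw = pd.DataFrame({
    "Description": [
        "2019 CADILLAC XT5 Premium Luxury SUV",
        "1992",
        "2018 CHEVROLET Malibu LT Sedan",
    ],
    "Price": ["$30,000", "$500", "Please call"],
    "Mileage": ["25000", "1", "31000"],
})

# Strip "$" and "," and convert to numbers; non-numeric prices become NaN.
raw["PriceValue"] = pd.to_numeric(
    raw["Price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Listings without a usable price are dropped, like the 215 quote-only items.
needs_quote = raw["PriceValue"].isna()
clean = raw[~needs_quote].copy()

# Flag (but keep) descriptions that consist of a four-digit year only.
clean["YearOnlyTitle"] = clean["Description"].str.fullmatch(r"\d{4}")

print("removed for missing price:", int(needs_quote.sum()))
print(clean[["Description", "PriceValue", "YearOnlyTitle"]])
```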

Example (St. Louis Zip Code Search):

  • Vehicle Title & Description: “2019 CADILLAC XT5 Premium Luxury SUV”

Descriptive Statistics Overview:

  • Total Vehicles: 11,994

Image 1: Portfolio Sales Value by Make

Image 2: Summary Statistics for Vehicle Data Buckets by Search Location

Closing

The data shows that the average mileage, price, and age are comparable across the samples. On average, a car that Hertz is selling is priced at around $18,500. The average mileage is around 33,000–34,000 miles for the cars put up for sale, and most cars are 2018–2019 models. We also observe that Hertz heavily favors Chevrolet models. However, without a list of all unique vehicles offered for sale, we cannot draw firmer conclusions.
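Averages and a by-make breakdown like the ones summarized here can be produced with a pandas `groupby`. This is a minimal sketch on illustrative rows, not the actual scraped results, and it assumes the title starts with the model year followed by a one-word make.

```python
import pandas as pd

# Illustrative rows only, not actual scraped results.
cars = pd.DataFrame({
    "Description": [
        "2019 CADILLAC XT5 Premium Luxury SUV",
        "2018 CHEVROLET Malibu LT Sedan",
        "2019 CHEVROLET Impala LT Sedan",
        "2018 NISSAN Altima S Sedan",
    ],
    "Price": [30000.0, 17000.0, 15500.0, 19000.0],
    "Mileage": [25000, 31000, 40000, 36000],
})

# Assumption: first word is the model year, second word is the make.
words = cars["Description"].str.split()
cars["Year"] = words.str[0].astype(int)
cars["Make"] = words.str[1]

# Sample-wide averages for price, mileage, and model year.
overall = cars[["Price", "Mileage", "Year"]].mean()

# Vehicle count, total value, and average price per make.
by_make = (
    cars.groupby("Make")
    .agg(Vehicles=("Price", "size"), TotalValue=("Price", "sum"), AvgPrice=("Price", "mean"))
    .sort_values("TotalValue", ascending=False)
)

print(overall)
print(by_make)
```

Sorting by total value puts the make that contributes most to the portfolio at the top, which is the view behind a “sales value by make” chart.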

Next Steps:

The next step is to estimate how much money Hertz could raise by selling the portfolio and how that might affect the company going forward. To do that, the following steps would be tried:

  • Assemble a complete universe of the cars available for sale

  • Based on that universe, calculate an expected value for the whole sales portfolio

  • Use the financial statements to determine what effect this might have on recoveries for Hertz debt holders as Hertz continues the bankruptcy proceedings
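Until the complete universe is available, the portfolio value can only be bracketed. The sketch below multiplies the approximate $18,500 average price by three fleet sizes mentioned in this blog; it assumes the sample average holds for the whole fleet, and 50,000 is used as a floor because sources only say “more than 50,000”.

```python
AVG_PRICE = 18_500  # approximate average price from the sample, in dollars

# Fleet sizes mentioned in this blog; which one is right is unknown.
fleet_scenarios = {
    "vehicles actually downloaded": 11_994,
    "results advertised by the search": 26_054,
    "fleet size cited by online sources (at least)": 50_000,
}


def portfolio_value(vehicle_count, avg_price=AVG_PRICE):
    """Rough portfolio value: vehicle count times a flat average price."""
    return vehicle_count * avg_price


for label, count in fleet_scenarios.items():
    value_millions = portfolio_value(count) / 1e6
    print(f"{label}: {count:,} vehicles, about ${value_millions:,.1f} million")
```

A flat average ignores differences in mix by make, age, and mileage, so these figures are only a starting range, not a valuation.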

In a bankruptcy reorganization, many factors can affect recoveries, depending on where your securities sit in the company’s capital structure. Generally, debt holders receive some portion of their total principal in recoveries, or equity in the newly formed company. That estimated or perceived amount has a large impact on the market value of the existing securities. Modeling possible recoveries can provide insight into what the currently outstanding securities are worth.
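A very simplified way to model recoveries is a waterfall that pays claims in order of seniority until the distributable value runs out. The sketch below uses hypothetical tranche names and amounts, not Hertz's actual capital structure, and assumes strict priority, which real reorganizations do not always follow.

```python
def recovery_waterfall(distributable_value, claims):
    """Pay claims in order of seniority.

    claims is a list of (name, claim_amount) pairs, most senior first.
    Returns a list of (name, claim_amount, amount_recovered) tuples.
    """
    remaining = distributable_value
    results = []
    for name, claim in claims:
        paid = min(claim, remaining)
        remaining -= paid
        results.append((name, claim, paid))
    return results


# Hypothetical amounts in $ millions, NOT Hertz's actual capital structure.
claims = [
    ("Senior secured", 400.0),
    ("Senior unsecured", 500.0),
    ("Subordinated", 300.0),
]

for name, claim, paid in recovery_waterfall(925.0, claims):
    print(f"{name}: recovers {paid:,.0f} of {claim:,.0f} ({paid / claim:.0%})")
```

Rerunning the function with different distributable values shows how sensitive the junior tranches are to the estimated sale proceeds.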

Image 3: Example of a Sliced File Format

Image 4: Python Code

from requests import get
from bs4 import BeautifulSoup
import pandas as pd

url_insert = 0  # result offset, incremented by 35 results per page
count_one = 0   # number of pages requested
count_two = 0   # number of vehicles parsed

# The "Split" columns are placeholders for when we split our multi-word title
# into components, so in Excel we can analyze by make, model, etc.
split_columns = ['Split ' + str(n) for n in range(1, 15)]

# Rows are collected in a list and turned into a DataFrame at the end,
# because DataFrame.append was removed in pandas 2.0.
rows = []

# At 35 results per page, 800 pages means we can search up to 28,000 (35 * 800)
# results. The most results we got after searching was just over 26,000, even
# though we did not download all of them because blank pages started just
# before 12,000 at the time of our St. Louis 10,000 mile radius search.
# geoZip=10007 below is the New York City search; change the ZIP code and the
# output file name for the other locations.
for p in range(0, 800):
    url = ('https://www.hertzcarsales.com/used-cars-for-sale.htm?start='
           + str(url_insert) + '&geoRadius=10000&geoZip=10007')
    response = get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    li_class = html_soup.find_all(
        'li', class_='item hproduct clearfix closed certified primary')
    count_one = count_one + 1
    print(count_one)
    print(url)
    url_insert = url_insert + 35

    for i in li_class:
        title_long = i.find('a', class_='url').text.strip("\n")
        title_split = title_long.split()

        # Our pre-split placeholder list: up to 14 columns are available,
        # one for each word in the title.
        blank_list = ["none"] * 14
        for a in range(min(len(title_split), 14)):
            blank_list[a] = title_split[a]

        price = i.find('span', class_='value').text.strip("\n")

        data = i.find('div', class_='gv-description')
        mileage = data.span.span.text.strip("\n")
        # The line below removes the text "miles" from our mileage column.
        mileage_clean = ''.join([c for c in mileage if c.isdigit()])

        row = {'Description': title_long, 'Price': price, 'Mileage': mileage_clean}
        row.update(zip(split_columns, blank_list))
        rows.append(row)

        count_two = count_two + 1
        print(count_two)

df_middle = pd.DataFrame(rows, columns=['Description', 'Price', 'Mileage'] + split_columns)
df_middle.to_csv('C:\\Users\\james\\PycharmProjects\\workingfiles\\Webscraping\\Hertz\\output_NewYorkCity.csv')
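The title-splitting step in the script can be tried on its own without touching the website. The sketch below pads the title to 14 word slots as the script does, and adds a hypothetical `parse_title` helper that assumes the title starts with a four-digit year followed by a one-word make; multi-word makes would need extra handling.

```python
def split_title(title, max_words=14, filler="none"):
    """Split a title into exactly max_words slots, padding with filler."""
    words = title.split()[:max_words]
    return words + [filler] * (max_words - len(words))


def parse_title(title):
    """Guess year, make, and model/trim from a title (hypothetical helper)."""
    words = title.split()
    has_year = bool(words) and words[0].isdigit() and len(words[0]) == 4
    year = int(words[0]) if has_year else None
    make = words[1] if has_year and len(words) > 1 else None
    model_and_trim = " ".join(words[2:]) if make else None
    return {"year": year, "make": make, "model_and_trim": model_and_trim}


example = "2019 CADILLAC XT5 Premium Luxury SUV"
print(split_title(example))
print(parse_title(example))
print(parse_title("1992"))  # a title with a year and nothing else
```

Parsing the year and make in Python would replace some of the manual work otherwise done on the “Split” columns in Excel.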

If you want to know more about how to scrape car data or extract car API data using Python, you can contact X-Byte Enterprise Crawling or ask for a free quote!

Originally published at https://www.xbyte.io.

Founder of “X-Byte Enterprise Crawling”, a well-diversified corporation providing Enterprise grade Web Crawling service & solution, leveraging Cloud DaaS model