How to Scrape Real Estate Website Data with Python and Wrangle It into Shape

In every data science project, one of the most frequently asked questions is where to find the data and how to get it. Our answer is that an enormous amount of data is already available, much of it for free; you only need to extract it and make it useful for your business. Businesses of all kinds can use this freely available data on the internet to improve their operations, and web scraping is how they can collect it.

To demonstrate web scraping in this blog, we will extract data from domain.com.au, a property website. We will extract the price, number of bedrooms, bathrooms and parking spaces, the address, and the location of every house listed in Melbourne, Australia.

Before using Python, you should know some fundamentals of HTML (HyperText Markup Language).

  • All web pages are written in HTML.
  • HTML describes the structure of a web page.
  • HTML elements label pieces of content, such as “this is a heading”, “this is a paragraph”, “this is a link”, and so on.
  • HTML elements tell the browser how to display the content.
  • HTML is the standard markup language for creating web pages.

A simple HTML document looks like this:

<!DOCTYPE html>
<html>
  <head>
    <title>Page Title</title>
  </head>
  <body>
    <h1>This is a Heading</h1>
    <p>This is a paragraph.</p>
  </body>
</html>

Where:

The <!DOCTYPE html> declaration defines that this document is an HTML5 document.

The <html> element is the root element of an HTML page.

The <head> element contains meta information about the HTML page.

The <title> element specifies a title for the HTML page (shown in the browser’s title bar or in the page’s tab).

The <body> element defines the document’s body and is a container for all the visible content, such as headings, paragraphs, images, hyperlinks, tables, lists, etc.

The <h1> element defines a large heading.

The <p> element defines a paragraph.

You can view the HTML document of any website by right-clicking on the web page and choosing “View page source” (available in Google Chrome and Microsoft Edge). All the content of the web page sits in that HTML document in a well-structured format; you just need to extract the required data from it.

1. Collecting Data

Several Python libraries are available to fetch an HTML document and parse it into the format you need.

# sample code to get an HTML document and parse it into the required format
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.domain.com.au/sale/melbourne-region-vic/")
bsobj = BeautifulSoup(html, "lxml")

In the code above, urlopen fetches the HTML document from the given web page, and BeautifulSoup parses it with the lxml parser. The lxml parser is fast and easy to work with, but you can use another parser if you prefer, such as Python’s built-in html.parser.

The screenshot of the URL https://www.domain.com.au/sale/melbourne-region-vic/ shows all the properties currently listed for sale in Melbourne.

However, we need the web page of every Melbourne property listed at https://www.domain.com.au/sale/melbourne-region-vic/. We can get them by extracting every property URL on the search page and storing them in a list. One more thing to note: the Melbourne search results on domain.com.au span about 50 pages, and the page above is only the first one, so we need to visit all 50 pages and collect the URLs of every advertised house. That means running a loop of 50 iterations, one per page.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# home url of domain.com.au
home_url = "https://www.domain.com.au"

# the search result has 50 pages, so we need to visit each one
page_numbers = list(range(50))[1:50]

# list to store all the urls of properties
list_of_links = []

# for loop over all 50 search (Melbourne region) pages
for page in page_numbers:
    # extracting html document of the search page
    html = urlopen(home_url + "/sale/melbourne-region-vic/?sort=price-desc&page=" + str(page))
    # parsing html document to 'lxml' format
    bsobj = BeautifulSoup(html, "lxml")
    # finding all the links available in the 'ul' tag whose 'data-testid' is 'results'
    all_links = bsobj.find("ul", {"data-testid": "results"}).findAll(
        "a", href=re.compile("https://www.domain.com.au/*"))
    # inner loop to find links inside each property page, because a few properties
    # are projects and have more properties inside their project page
    for link1 in all_links:
        # checking if it is a project and then doing the same extraction as above
        if 'project' in link1.attrs['href']:
            inner1_html = urlopen(link1.attrs['href'])
            inner1_bsobj = BeautifulSoup(inner1_html, "lxml")
            for link2 in inner1_bsobj.find("div", {"name": "listing-details__other-listings"}).findAll(
                    "a", href=re.compile("https://www.domain.com.au/*")):
                if 'href' in link2.attrs:
                    list_of_links.append(link2.attrs['href'])
        else:
            list_of_links.append(link1.attrs['href'])

You can copy and paste the code above, adjust it to your requirements, and run it.

Here, we did two additional things:

  1. We used a search page sorted by price. We did this because it makes it easier to impute the missing house prices later; we explain this in the data wrangling section below.
  2. An inner loop is used because some listings are projects, and each project page contains URL links to more properties.

Now we have the URL of every property in Melbourne, Australia, and each URL is unique to one property. The next step is to visit each URL and extract the price, number of bedrooms, bathrooms and parking spaces, the address, and the location.

# removing duplicate links while maintaining the order of urls
abc_links = []
for i in list_of_links:
    if i not in abc_links:
        abc_links.append(i)

# defining the regular expressions required for data extraction
pattern = re.compile(r'>(.+)(.+?).*')
pattern1 = re.compile(r'>(.+)
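The rest of this extraction block (including the complete regular expressions and the loop that builds basic_feature_list, the list of dictionaries used below) is cut off in the article. As a rough, hedged sketch of the idea, one way to build such a list is to visit each property URL, parse the page, and collect the listed features into one dictionary per property. The selectors used here are assumptions for illustration and may not match domain.com.au’s current markup; the latitude/longitude extraction is omitted for brevity:

# hedged sketch: build basic_feature_list, one dictionary per property
# (the tag/attribute selectors below are assumptions, not taken from the article)
basic_feature_list = []
for url in abc_links:
    page_html = urlopen(url)
    page_bsobj = BeautifulSoup(page_html, "lxml")
    row = {}
    # the advertised price is assumed to sit in a summary-title element
    price_tag = page_bsobj.find("div", {"data-testid": "listing-details__summary-title"})
    if price_tag:
        row['price'] = price_tag.get_text()
    # the first <h1> is assumed to hold the address of the property
    address_tag = page_bsobj.find("h1")
    if address_tag:
        row['name'] = address_tag.get_text()
    # features are assumed to render as "3 Beds", "2 Baths", "1 Parking", etc.
    for feature in page_bsobj.findAll("span", {"data-testid": "property-features-text-container"}):
        parts = feature.get_text().split()
        if len(parts) == 2:
            row[parts[1]] = parts[0]
    basic_feature_list.append(row)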

The output of the code above is a list of dictionaries holding all the scraped data. We now convert it into separate lists, because we still have some cleaning and extraction to do on the data mined above, and that is easier to perform on lists.

# creating empty lists
beds_list = []
baths_list = []
parking_list = []
area_list = []
name_list = []
lat_list = []
long_list = []
price_list = []

# iterating through the list of dictionaries created above
for row in basic_feature_list:
    # checking if the row contains 'Beds', 'Bed' or nothing
    if 'Beds' in row:
        beds_list.append(row['Beds'])
    elif 'Bed' in row:
        beds_list.append(row['Bed'])
    else:
        beds_list.append(None)
    # checking if the row contains 'Baths', 'Bath' or nothing
    if 'Baths' in row:
        baths_list.append(row['Baths'])
    elif 'Bath' in row:
        baths_list.append(row['Bath'])
    else:
        baths_list.append(None)
    # checking if the row contains 'Parking', '-' or nothing
    if 'Parking' in row and row['Parking'] != '−':
        parking_list.append(row['Parking'])
    else:
        parking_list.append(None)
    # checking if the row contains ' ' or nothing; an empty-space key represents the area
    if ' ' in row:
        area_list.append(row[' '])
    else:
        area_list.append(None)
    # checking if the row contains 'name', which is the address of the property
    if 'name' in row:
        name_list.append(row['name'])
    else:
        name_list.append(None)
    # checking if the row contains 'price'
    if 'price' in row:
        price_list.append(row['price'])
    else:
        price_list.append(None)
    # checking if the row contains 'lat', the latitude of the property
    if 'lat' in row:
        lat_list.append(row['lat'])
    else:
        lat_list.append(None)
    # checking if the row contains 'long', the longitude of the property
    if 'long' in row:
        long_list.append(row['long'])
    else:
        long_list.append(None)

At this point, we have all the information in list format.

2. Data Wrangling

Some sellers do not want to reveal the property price, so they do not show it in the advertisement. Sometimes they leave the price field empty, and sometimes they put text such as ‘price on inspection’ or ‘contact agent’. In addition, some sellers do not show a single price: they give a price range, or a price with extra text before it, after it, or both. So we need to handle all of these cases, extract only the price, and use None where no price is given. Here is the code:

import random

# creating a new empty price list
actual_price_list = []

# defining some regular expressions that will be used to extract the property prices
pattern1 = re.compile(r'\$\s?([0-9,\.]+).*\s?.+\s?\$\s?([0-9,\.]+)')
pattern2 = re.compile(r'\$([0-9,\.]+)')

# iterating through price_list
for i in range(len(price_list)):
    # check whether a single price or a range of prices is given
    if str(price_list[i]).count('$') == 1:
        b_num = pattern2.findall(str(price_list[i]))
        # if the numeric string has 5 characters or fewer, the price is quoted in millions, so convert it
        if len(b_num[0].replace(',', '')) > 5:
            actual_price_list.append(float(b_num[0].replace(',', '')))
        else:
            actual_price_list.append(float(b_num[0].replace(',', ''))*1000000)
    elif str(price_list[i]).count('$') == 2:
        a_num = pattern1.findall(str(price_list[i]))
        random_error = random.randint(0, 10000)
        # if the numeric strings have 5 characters or fewer, the prices are quoted in millions, so convert them
        if len(a_num[0][0].replace(',', '')) > 5 and len(a_num[0][1].replace(',', '')) > 5:
            # take the average of the two prices in the given range
            avg_price = (float(a_num[0][0].replace(',', '')) + float(a_num[0][1].replace(',', '')))/2
        else:
            avg_price = ((float(a_num[0][0].replace(',', '')) + float(a_num[0][1].replace(',', '')))/2)*1000000
        # add a small random offset to the average price
        avg_price = avg_price + random_error
        actual_price_list.append(avg_price)
    else:
        actual_price_list.append('middle_price')

You can end up with many missing price values, because a lot of sellers do not want to publish the house price on the website. Now we have to impute the missing prices, and we used a trick for that.

The trick is that we scraped the search results sorted by price, so all the houses, with or without a displayed price, are already in price order. The website performs this sorting using the price the owners supplied to it, even though that price is not always shown to users. That is exactly why we scraped the listings with the results sorted by price.

Let’s understand this with an example. Assume there are 10 houses and some prices are missing. Because the houses are already ordered by price, if the prices of house no. 4 and house no. 5 are missing, we can take the mean of the prices of house no. 3 and house no. 6 and impute the missing prices with that mean. The code below does exactly this:

# loop to impute missing values at the start of the list, where we cannot take a mean
for i in range(len(actual_price_list)):
    if actual_price_list[i] != 'middle_price':
        for a in range(i, -1, -1):
            actual_price_list[a] = actual_price_list[i]
        break

# here we take the mean of the neighbouring known prices, add a small random offset, and impute it
for i in range(len(actual_price_list)):
    if actual_price_list[i] == 'middle_price':
        for j in range(i, len(actual_price_list)):
            if actual_price_list[j] != 'middle_price':
                mid = (actual_price_list[i-1] + actual_price_list[j])/2
                if actual_price_list[j] > 12000000:
                    for k in range(i, j):
                        random_error = random.randint(-1000000, 1000000)
                        mid = mid + random_error
                        actual_price_list[k] = mid
                    i = j
                    break
                elif actual_price_list[j] > 5000000:
                    for k in range(i, j):
                        random_error = random.randint(-100000, 100000)
                        mid = mid + random_error
                        actual_price_list[k] = mid
                    i = j
                    break
                else:
                    for k in range(i, j):
                        random_error = random.randint(-10000, 10000)
                        mid = mid + random_error
                        actual_price_list[k] = mid
                    i = j
                    break
            elif j == len(actual_price_list)-1:
                for n in range(i, len(actual_price_list)):
                    random_error = random.randint(-1000, 1000)
                    a_price = actual_price_list[i-1]
                    a_price = a_price + random_error
                    actual_price_list[n] = a_price
                break

Making the DataFrame:

import pandas as pd

house_dict = {}
house_dict['Beds'] = beds_list
house_dict['Baths'] = baths_list
house_dict['Parking'] = parking_list
house_dict['Area'] = area_list
house_dict['Address'] = name_list
house_dict['Latitude'] = lat_list
house_dict['Longitude'] = long_list
house_dict['Price'] = actual_price_list

house_df = pd.DataFrame(house_dict)
house_df.info()

The ‘Area’ column has many null values that cannot be imputed, so we drop the ‘Area’ column.

house_df.drop('Area', axis=1, inplace=True)

In addition, convert the beds, baths, and parking columns from string type to numeric type.

house_df["Beds"] = pd.to_numeric(house_df["Beds"])
house_df["Baths"] = pd.to_numeric(house_df["Baths"])
house_df["Parking"] = pd.to_numeric(house_df["Parking"])

Now, do some descriptive analysis to find problems in the data and fix them. For instance, use a scatter plot to check for outliers, or a histogram to look at the distribution of the data.

# scatter plot
house_df.plot.scatter(x='Beds', y='Baths')

# histogram
house_df["Price"].plot.hist(bins=50)

Data cleansing is an iterative process. The first step of the process is data auditing, in which we identify the kinds of anomalies that reduce the quality of the data. Data auditing means programmatically checking all the data against pre-specified validation rules and producing a report on the data quality and its problems. We also frequently apply some statistical tests in this step.
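As a minimal sketch of such validation rules (the function name audit_report and the numeric bounds are assumptions chosen for illustration, not rules given in the article), we can count how many rows violate each check:

# hedged sketch: simple validation rules producing a small data-quality report
# (the bounds below are illustrative assumptions for Melbourne housing data)
def audit_report(df):
    report = {}
    report['non_positive_price'] = int((df['Price'] <= 0).sum())
    report['implausible_beds'] = int((~df['Beds'].between(0, 20)).sum())
    report['implausible_baths'] = int((~df['Baths'].between(0, 20)).sum())
    # Melbourne listings should fall roughly inside this bounding box
    report['latitude_out_of_range'] = int((~df['Latitude'].astype(float).between(-39.0, -37.0)).sum())
    report['longitude_out_of_range'] = int((~df['Longitude'].astype(float).between(144.0, 146.0)).sum())
    report['missing_values'] = int(df.isnull().sum().sum())
    return report

print(audit_report(house_df))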

Data Anomalies could be classified at a higher level in three groups:

1. Syntactic Anomalies

These describe the characteristics of the values and formats used to represent the entities. Syntactic anomalies include lexical errors, domain format errors, syntactical errors, and irregularities.

2. Semantic Anomalies

Semantic anomalies prevent the data collection from being a comprehensive and non-redundant representation of the mini-world. These anomalies include integrity constraint violations, contradictions, invalid tuples, and duplicates.

3. Coverage Anomalies

Coverage anomalies reduce the number of entities and entity properties from the mini-world that are represented in the data collection. They appear as missing tuples and missing values.

There are many ways to deal with these anomalies. We will not go into the details of handling them here, because our scraped data does not suffer from them.

The data can also be transformed to fit your requirements. Predicting the house price is a regression problem, and if we treat it as a linear regression problem we can apply some transformations to make the data satisfy the assumptions of linear regression.
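For instance, here is a minimal sketch of one common transformation, log-transforming the right-skewed Price column (this particular transform is an illustration, not a step prescribed by the article):

import numpy as np

# house prices are usually right-skewed; taking the log makes the distribution
# closer to normal, which suits linear regression better (illustrative only)
house_df['LogPrice'] = np.log(house_df['Price'])
house_df['LogPrice'].plot.hist(bins=50)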

We can also create new features from the existing features in the data set to make it more useful. Here we create a new column holding the distance of each house from the city centre; the full computation appears with the missing-value handling code further below.
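As a quick reference for that later code, this is a standalone sketch of the same haversine calculation (the helper name distance_to_city_km is ours; the city-centre coordinates match the ones used later):

import math

def distance_to_city_km(lat, lon, city_lat=-37.818078, city_lon=144.96681, r=6378):
    # haversine formula: great-circle distance in km between two points on a sphere
    lat1, lat2 = math.radians(city_lat), math.radians(lat)
    dlat = lat2 - lat1
    dlon = math.radians(lon) - math.radians(city_lon)
    a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return round(r * c, 4)

# example: a property in Richmond, roughly 3 km from the Melbourne CBD
print(distance_to_city_km(-37.8183, 145.0010))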

Missing Values

Sources of missing values:

Data Scraping: It is quite possible that something goes wrong during the scraping process. In such cases, we should double-check the scraped data against the correct data held by the data custodians. Hashing procedures can also be used to make sure the scraping is correct. Errors made during the scraping stage are usually easy to find and easy to correct.
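As a minimal sketch of such a hashing check (purely illustrative; the article does not specify a particular procedure), you could hash each downloaded page and compare the hashes from two scraping runs to spot pages that were fetched inconsistently:

import hashlib
from urllib.request import urlopen

def page_hash(url):
    # hash the raw HTML so two fetches of the same page can be compared cheaply
    html_bytes = urlopen(url).read()
    return hashlib.sha256(html_bytes).hexdigest()

url = "https://www.domain.com.au/sale/melbourne-region-vic/"
first_run = page_hash(url)
second_run = page_hash(url)
print("content unchanged between runs:", first_run == second_run)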

Data Collection: These errors occur while the data is being collected and are harder to correct.

Further, missing values can be categorized into four kinds:

Missing Completely at Random: This is the case when the probability of a value being missing is the same for all observations. For instance, respondents in a data collection process decide whether to declare their earnings by tossing a fair coin: if it lands heads, the respondent declares their earnings, otherwise not. Here, every observation has an equal chance of having a missing value.

Missing at Random: This is the case where values are missing at random, but the missing ratio varies across the values/levels of other input variables. For instance, when collecting data on age, female respondents may have more missing values than male respondents.

Missing Depending on Unobserved Predictors: This is the case when values are not missing at random and the missingness is related to unobserved input variables. For instance, in a medical study, if a particular diagnostic causes discomfort, patients are more likely to drop out of the study. This missingness is not random unless we include “discomfort” as an input variable for the patients.

Missing Depending on the Missing Value Itself: This is the case when the probability of a value being missing is directly related to the missing value itself. For instance, people with lower or higher incomes are more likely to leave the earnings question unanswered.

With a little investigation, you can usually work out which kind of missingness is present in your data. Our scraped data is missing completely at random, and the data set is large, so we simply deleted the rows containing any None values.

import math

cleaned_house_df = house_df.dropna(how='any')
cleaned_house_df.reset_index(drop=True, inplace=True)

# radius of the earth is 6378 km
r = 6378
dis_to_city = []
for i in range(len(cleaned_house_df)):
    lat1_n = math.radians(-37.818078)
    lat2 = math.radians(float(cleaned_house_df['Latitude'][i]))
    lon1_n = math.radians(144.96681)
    lon2 = math.radians(float(cleaned_house_df['Longitude'][i]))
    lon_diff_n = lon2 - lon1_n
    lat_diff_n = lat2 - lat1_n
    a_n = math.sin(lat_diff_n / 2)**2 + math.cos(lat1_n) * math.cos(lat2) * math.sin(lon_diff_n / 2)**2
    c_n = 2 * math.atan2(math.sqrt(a_n), math.sqrt(1 - a_n))
    dis_to_city.append(round(r*c_n, 4))

cleaned_house_df['distance_to_city'] = dis_to_city

The last step is exporting the DataFrame to a tabular file format such as a CSV or an Excel file.

# exporting to csv file
cleaned_house_df.to_csv('real_estate_data_csv.csv', index=False)

# exporting to excel file
cleaned_house_df.to_excel('real_estate_data_excel.xlsx', index=False)

Originally published at https://www.xbyte.io.
