BigData
May 17, 2022

How to Scrape Rental Websites Using BeautifulSoup and Python

This post covers web scraping with BeautifulSoup, data wrangling with Pandas, and the insights generated from the resulting data.

Would renting a condo or apartment in Etobicoke, North York, or Mississauga be considerably cheaper than renting one in downtown Toronto?

  • How do suburban rents compare to rents in the city of Toronto?
  • How much can you potentially save by renting a basement unit?
  • Which suburbs have the lowest rents?

Browsing listings manually on rental websites can be very time-consuming. A better option is to scrape the rental website with Python and analyze the collected data to answer these questions.

Scraping Rental Website Data Using BeautifulSoup and Python

We decided to extract data from TorontoRentals.com with Python and BeautifulSoup. The website has listings for Toronto as well as many suburbs such as Brampton, Scarborough, Mississauga, and Vaughan. It covers several kinds of listings: apartments, houses, condos, and basements.

First, we import the necessary Python libraries.

# Import Python Libraries
# For HTML parsing
from bs4 import BeautifulSoup 
# For website connections
import requests 
# To prevent overwhelming the server between connections
from time import sleep 

# Display the progress bar
from tqdm import tqdm
# For data wrangling
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# For creating plots
import matplotlib.pyplot as plt
import plotly.graph_objects as go

Next, we wrote a function named get_page that returns a soup object for each page (iteration). The function accepts four user inputs: city, type, beds, and page. It checks the HTTP response status code to determine whether the request completed successfully. The get_page function is called from the main loop, which iterates over page_num.

def get_page(city, type, beds, page):

    url = f'https://www.torontorentals.com/{city}/{type}?beds={beds}%20&p={page}'
    # e.g. https://www.torontorentals.com/toronto/condos?beds=1%20&p=2

    result = requests.get(url)

    # soup stays None unless the request succeeds, so the return below is always defined
    soup = None

    # check the HTTP response status code to find out whether the request completed successfully
    if 100 <= result.status_code <= 199:
        print('Informational response')
    elif 200 <= result.status_code <= 299:
        print('Successful response')
        soup = BeautifulSoup(result.content, "lxml")
    elif 300 <= result.status_code <= 399:
        print('Redirect')
    elif 400 <= result.status_code <= 499:
        print('Client error')
    elif 500 <= result.status_code <= 599:
        print('Server error')

    return soup

Our plan is to scrape the following information from every listing: City, Zip, Street, Rent, Dimensions, Bed, and Bath. We assigned an empty list to hold each scraped variable, so seven empty lists are created.
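
A minimal sketch of that setup step. The list names are the ones the scraping loop and the DataFrame code in this article refer to; the snippet only shows that they must exist before the loop appends to them.

```python
# One empty list per scraped field, created before the scraping loop runs.
listingStreet = []
listingCity = []
listingZip = []
listingRent = []
listingBed = []
listingBath = []
listingDim = []
```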

The complete script grabs the City, Zip, Street, Rent, Bath, Dimensions, and Bed for every listing using nested for loops that rely on the site's consistent HTML tags. The excerpt below shows the logic for the street field.

for page_num in tqdm(range(1, 250)):
    sleep(2)

    # get soup object of the page
    soup_page = get_page('toronto', 'condos', '1', page_num)

    # skip this page if no soup object came back (e.g. the request failed)
    if soup_page is None:
        continue

    # grab listing street
    for tag in soup_page.find_all('div', class_='listing-brief'):
        for tag2 in tag.find_all('span', class_='replace street'):
            # to check if data point is missing
            if not tag2.get_text(strip=True):
                listingStreet.append("empty")
            else:
                listingStreet.append(tag2.get_text(strip=True))

After the script finishes, check the length of all seven lists to make sure they are equal. Then create a pandas DataFrame (DF) from the lists and save the DF to a CSV file.
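
The length check matters because pandas raises a ValueError when the lists passed to pd.DataFrame differ in length. The following is a rough sketch of such a check; the helper name lengths_match and the sample values are hypothetical stand-ins for the lists filled by the scraping loop.

```python
# Hypothetical stand-in data; in the real run these lists come from the scraping loop.
scraped = {
    'street': ['1 Example St', '2 Sample Ave'],
    'rent': ['$2,000', '$2,300'],
    'bed': ['1', '2'],
}

def lengths_match(columns):
    """Return True when every list in the dict has the same length."""
    return len({len(values) for values in columns.values()}) <= 1

# Print the length of each list so a mismatch is easy to spot.
lengths = {name: len(values) for name, values in scraped.items()}
print(lengths)
print('All lists equal in length:', lengths_match(scraped))
```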

# create the dataframe
df_Toronto_Condo = pd.DataFrame({'city_main':'Toronto', 'listing_type': 'Condo', 'street': listingStreet, 'city': listingCity, 'zip': listingZip, 'rent': listingRent, 'bed': listingBed,'bath': listingBath, 'dimensions': listingDim})

# saving the dataframe to csv file
df_Toronto_Condo.to_csv('df_Toronto_Condo.csv')

By re-running the page_num loop with different get_page parameters, we collected data for several housing types (apartments, houses, condos, and basements) in Toronto and the suburban cities. We created a pandas DF for every housing type and saved each one to a CSV file.
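
One way to organize those repeated runs is to list the parameter combinations up front. This is only a sketch: apart from 'toronto' and 'condos', the URL slugs below are assumptions, and the CSV naming scheme is hypothetical.

```python
from itertools import product

# Assumed URL slugs; only 'toronto' and 'condos' are confirmed by the example above.
cities = ['toronto', 'mississauga', 'brampton']
listing_types = ['condos', 'apartments', 'houses']

# One job per (city, type) pair, each with its own output file.
jobs = [
    {'city': city, 'type': listing_type, 'csv': f'df_{city}_{listing_type}.csv'}
    for city, listing_type in product(cities, listing_types)
]

for job in jobs:
    print(job['city'], job['type'], '->', job['csv'])
```

Each job would then drive one pass of the scraping loop, with fresh empty lists for every pass.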

Data Preparation & Cleaning with Pandas

Data collection, cleaning, and preparation make up a major part of most data science projects. We combined the DFs produced by web scraping into one main DF containing all listings, and then started data wrangling.
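
Combining the per-type DFs can be sketched with pd.concat. The two small frames below are hypothetical stand-ins for the scraped DFs, with made-up values.

```python
import pandas as pd

# Hypothetical stand-ins for two of the scraped DFs.
df_toronto_condo = pd.DataFrame({
    'city_main': 'Toronto', 'listing_type': 'Condo', 'rent': ['$2,100', '$2,400'],
})
df_brampton_house = pd.DataFrame({
    'city_main': 'Brampton', 'listing_type': 'House', 'rent': ['$2,800'],
})

# Stack the frames and rebuild the index so row labels are unique.
df_all = pd.concat([df_toronto_condo, df_brampton_house], ignore_index=True)
print(df_all)
```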

⭢ Search for missing values in the DF

Hidden HTML wrappers show up as empty listings.

Data on bath, bed, or dimensions is missing in a few listings.

⭢ Handle the missing data

Rows created by hidden HTML wrappers, which appear as empty listings, were dropped from the DF.

If details on bath, bed, or dimensions were missing for a listing, the value was set to zero.
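
The two steps above can be sketched as follows. The "empty" placeholder is the one the scraping loop writes for a missing street; assuming the other columns use the same placeholder, and that a wrapper row lacks both street and rent, are simplifications made for this example.

```python
import pandas as pd

# Made-up rows: a complete listing, a wrapper row, and a listing with gaps.
df = pd.DataFrame({
    'street': ['1 Example St', 'empty', '3 Sample Ave'],
    'rent': ['$2,000', 'empty', '$1,800'],
    'bed': ['1', 'empty', 'empty'],
    'bath': ['1', 'empty', '1'],
    'dimensions': ['600', 'empty', 'empty'],
})

# Assumption: a row produced by a hidden wrapper has neither street nor rent.
is_wrapper = (df['street'] == 'empty') & (df['rent'] == 'empty')
df = df[~is_wrapper].reset_index(drop=True)

# Missing bed, bath, or dimensions values are set to zero.
detail_cols = ['bed', 'bath', 'dimensions']
df[detail_cols] = df[detail_cols].replace('empty', '0')
print(df)
```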

⭢ Find data issues and fix them

For some listings, rent is specified as a range. In these listings, bath, bed, and dimensions are also given as ranges. For example, in one listing, rent ranges from $1795 to $2500, beds from 1 to 3, baths from 1 to 2, and dimensions from 622 to 955 ft². Although such a 'range' listing looks like a single listing on the website, it appears to advertise several units. However, we don't know how many units are available within each listing, or the individual rent, baths, beds, and dimensions of each unit. Making guesses or taking averages doesn't seem right in this situation, so these rows were dropped from the analysis.
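
A rough sketch of how such range rows might be detected and dropped. The regular expression and the sample values are assumptions; the real site may format ranges differently.

```python
import pandas as pd

# Made-up rent and bed strings, two of them written as ranges.
df = pd.DataFrame({
    'rent': ['$2,000', '$1795–2500', '$1,850', '$1,900 - $2,400'],
    'bed': ['1', '1-3', '2', '1 - 2'],
})

# A digit, then a hyphen or en dash, then another amount marks a range.
range_pattern = r'\d\s*[-–]\s*\$?\d'
is_range = df['rent'].str.contains(range_pattern, regex=True)

# Keep only listings that describe a single unit.
df_single = df[~is_range].reset_index(drop=True)
print(df_single)
```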

Find and examine large outliers in the DF.

⭢ Complete data transformations

Clean the City feature by deleting ', ON' from the entries.

Remove special characters such as $, -, and the comma.

Data type conversions: for the analysis, Rent and Dimensions were converted to numeric data types.
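
These transformations might look like the sketch below. The sample values are made up, and the column names follow the DataFrame created earlier in this article.

```python
import pandas as pd

# Made-up raw values in the shape the scraper might return them.
df = pd.DataFrame({
    'city': ['Toronto, ON', 'Brampton, ON'],
    'rent': ['$2,300', '$1,650'],
    'dimensions': ['650', '0'],
})

# Remove the province suffix from the city names.
df['city'] = df['city'].str.replace(', ON', '', regex=False).str.strip()

# Strip $, commas, and hyphens, then convert to numbers.
for col in ['rent', 'dimensions']:
    cleaned = df[col].str.replace(r'[$,\-]', '', regex=True)
    df[col] = pd.to_numeric(cleaned, errors='coerce')

print(df.dtypes)
```

With errors='coerce', any value that still cannot be parsed becomes NaN instead of raising an error, which makes leftover bad entries easy to find.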

After completing data preparation and cleaning, we have a clean dataset that we can analyze further to draw useful insights.

Insights Produced from Data

Here is a count of the total number of listings by Type and City.

Sample data frame showing the listings and the information collected for each feature.
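
A count of listings by Type and City could be produced with pd.crosstab, as in this sketch. The rows are made up for illustration and are not the scraped data.

```python
import pandas as pd

# Made-up listings standing in for the cleaned DF.
df = pd.DataFrame({
    'city': ['Toronto', 'Toronto', 'Toronto', 'Brampton', 'Brampton'],
    'listing_type': ['Condo', 'Condo', 'Apartment', 'House', 'Basement'],
})

# Rows are cities, columns are listing types, cells are listing counts.
counts = pd.crosstab(df['city'], df['listing_type'])
print(counts)
```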

Insights from plots produced with Plotly and Matplotlib

Condos have the largest number of listings on the website, followed by Apartments.

Renters looking for other kinds of accommodation, such as Basements or Houses, should perhaps search other rental websites.

Toronto has the most listings on the website, followed by Etobicoke and North York.

Suburban cities have fewer listings than Toronto.

For the majority of cities, Condos are the most common listing type, followed by Apartments.

For Scarborough, there appears to be a roughly equal number of listings for Apartments, Condos, and Houses.

Brampton appears to have the most Basements and Houses listed.

The majority of listings have either one or two bedrooms. This trend is observed across cities.

Listings with one bath are the most common.

Toronto appears to have a sizeable percentage of listings with two baths.

The mean is affected by outliers in a dataset. The median is the better statistic here because it is robust to outliers.

For Rent: notable difference between the Median and Mean for Etobicoke, Toronto, Mississauga, and North York.

For Dimensions: notable difference between the Median and Mean for Richmond Hill, Vaughan, Markham, and Toronto.

Further investigation into why Vaughan has a much higher % difference between Median and Mean: two listings for Vaughan have dimensions of 800,899 SQFT, with rent of ~$2300 for a 2-bed, 2-bath condo. So it looks like these listings have typos in their dimensions.
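
The mean versus median comparison can be sketched with a groupby. The article does not state how the % difference was computed, so the formula below, (mean − median) / median × 100, is an assumption, and the rows are made up to mimic one extreme value.

```python
import pandas as pd

# Made-up dimensions; one Vaughan row mimics the extreme value described above.
df = pd.DataFrame({
    'city': ['Vaughan', 'Vaughan', 'Vaughan', 'Toronto', 'Toronto', 'Toronto'],
    'dimensions': [800, 900, 800899, 600, 650, 700],
})

# Mean and median per city, plus their relative gap in percent.
stats = df.groupby('city')['dimensions'].agg(['mean', 'median'])
stats['pct_diff'] = (stats['mean'] - stats['median']) / stats['median'] * 100
print(stats)
```

A single extreme row pulls the mean far above the median, while the median barely moves, which is why a large gap is a useful flag for rows worth inspecting.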

Toronto, Mississauga, and Etobicoke have the highest Median Rent among all cities.

We expected the cost of rent in Toronto to be considerably higher than in other cities. However, based on the data analysis, Toronto, Etobicoke, and Mississauga appear to have similar median Rents.

The suburbs Scarborough and Brampton have the lowest median Rent.

While the Median Rents for Mississauga and Toronto are similar, the median dimensions of listings in Mississauga are ~100 SQFT larger than those in Toronto.

Listings in Brampton have the lowest Rent but the largest dimensions compared with other cities.

Investigation of Relationship Between Dimensions and Rent

Scatter plot A: original DF

No relationship between Dimensions and Rent is visible.

A few large outliers in Dimensions and Rent skew the plot.

Scatter plot B: DF after dropping rows with outlier Dimensions

Weak positive correlation between Dimensions and Rent.

Large Rent outliers skew the plot.

Scatter plot C: DF after dropping rows with outlier Rent

A slightly stronger positive association between Dimensions and Rent.

Note: Missing dimension data was replaced with zeros. Therefore, in all three plots, many listings appear to have zero dimensions.
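
The effect of dropping outliers on the correlation can be illustrated with a small sketch. The article does not say which outlier rule was used, so the 1.5 × IQR rule and the helper name drop_outliers are assumptions, and the data is made up.

```python
import pandas as pd

def drop_outliers(frame, column, k=1.5):
    """Keep rows whose value lies within k * IQR of the quartiles (assumed rule)."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    keep = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep]

# Made-up listings with one extreme dimensions value.
df = pd.DataFrame({
    'dimensions': [500, 550, 600, 650, 700, 750, 800, 800899],
    'rent': [1700, 1800, 1900, 2000, 2100, 2200, 2300, 2300],
})

# Pearson correlation before and after trimming the outlier row.
corr_before = df['dimensions'].corr(df['rent'])
trimmed = drop_outliers(df, 'dimensions')
corr_after = trimmed['dimensions'].corr(trimmed['rent'])
print(round(corr_before, 2), round(corr_after, 2))
```

On real data, rows where dimensions were filled with zero would also need to be excluded before computing the correlation, since they are placeholders rather than measurements.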

For more information about scraping rental websites using Python and BeautifulSoup, contact 3i Data Scraping or ask for a free quote!