Webscraping tutorial

16 January 2020


This notebook contains all the code from the blog post “Intro to webscraping”, published on 1 January 2021.

First example: Create a dataset of lonely dogs

In this example we scrape Pet Rescue (https://www.petrescue.com.au) to create a dataset of dog names and corresponding locations.

Import the relevant libraries

from lxml import html
import requests
import pandas as pd
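# Base URL of the paginated search results; the page number is appended to it
url_base = 'https://www.petrescue.com.au/listings/search/dogs?page='

# XPath expressions that select the dog names and the listing locations
name_path = '//article[@class="cards-listings-preview"]/a/header/h3/text()'
location_path = '//strong[@class="cards-listings-preview__content__section__location"]/text()'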

all_names = []
all_locations = []

# Scrape pages 1 to 49 of the search results
for n in range(1, 50):
    print(f'Scraping page: {n}')
    url = f'{url_base}{n}'
    page = requests.get(url)
    tree = html.fromstring(page.text)
    names = tree.xpath(name_path)
    locations = tree.xpath(location_path)
    # Keep every second location entry, starting with the second one
    locations = locations[1::2]
    all_names += names
    all_locations += locations

# Combine the results into a DataFrame and trim surrounding whitespace
df = pd.DataFrame(data={'name': all_names, 'location': all_locations})
df['name'] = df['name'].str.strip()
df['location'] = df['location'].str.strip()
df.head(5)
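
The DataFrame constructor raises an error if all_names and all_locations end up with different lengths, which can happen if the page layout changes or the [1::2] slice no longer matches it. The following is a minimal sketch of how the slice and a length check could be combined for a single page. The helper name pair_names_and_locations and the sample lists are made up for illustration and are not part of the original post; the sample only assumes that the XPath returns two text nodes per listing.

```python
def pair_names_and_locations(names, raw_locations):
    """Pair each name with its location (hypothetical helper, not from the post)."""
    # Keep every second entry, starting with the second one (indices 1, 3, 5, ...).
    locations = raw_locations[1::2]
    # Assumption: after slicing there should be exactly one location per name.
    if len(names) != len(locations):
        raise ValueError(
            f'{len(names)} names but {len(locations)} locations; '
            'the page layout may have changed'
        )
    # Strip surrounding whitespace, as the notebook does with .str.strip().
    return [(name.strip(), location.strip())
            for name, location in zip(names, locations)]


# Made-up data standing in for the XPath results of one page.
sample_names = ['\n  Rex\n', '\n  Bella\n']
sample_raw_locations = ['\n', ' Sydney, NSW ', '\n', ' Hobart, TAS ']

pairs = pair_names_and_locations(sample_names, sample_raw_locations)
print(pairs)
```

Checking the lengths page by page makes a mismatch show up next to the page that caused it, rather than only at the end when the DataFrame is built.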

Second example: ATM locations

In this second example we create a dataset of the locations of all National Australia Bank (NAB) ATMs in Australia.

import requests
import pandas as pd
# Bounding box roughly covering Australia: south-west and north-east corners
lat_min, lng_min = -43.834124, 114.078644
lat_max, lng_max = -10.400824, 154.508331
url = f'https://api.nab.com.au/info/nab/location/locationType/atm+brc/queryType/geo/{lat_min}/{lng_min}/{lat_max}/{lng_max}/1/4000?v=1'

headers = {
    'Host': 'api.nab.com.au',
    'Origin': 'https://www.nab.com.au',
    'Referer': 'https://www.nab.com.au/',
    'x-nab-key': 'a8469c09-22f8-45c1-a0aa-4178438481ef',
}

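# Request the locations inside the bounding box and parse the JSON response
page = requests.get(url=url, headers=headers)
data = page.json()
# Flatten the nested location records into one column per field
df = pd.json_normalize(data['locationSearchResponse']['locations'])
# Keep only the ATM address and coordinate columns; drop rows with missing values
df = df[['atm.address1', 'atm.suburb', 'atm.state', 'atm.postcode', 'atm.latitude', 'atm.longitude']].dropna()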
df.head(5)
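
Column names such as atm.suburb come from pd.json_normalize, which flattens nested dictionaries and joins the nested keys with a "." by default. The plain-Python sketch below is a stand-in that mimics this flattening for a single record. It is not the pandas implementation, and the flatten helper and the sample record are made up for illustration, with the nesting inferred from the column names above.

```python
def flatten(record, parent_key='', sep='.'):
    """Flatten nested dicts into a single level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f'{parent_key}{sep}{key}' if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Made-up record; the real API response may contain different fields.
sample_location = {
    'atm': {
        'suburb': 'Melbourne',
        'state': 'VIC',
        'latitude': -37.81,
        'longitude': 144.96,
    }
}

flat = flatten(sample_location)
print(flat)
```

Each flattened record corresponds to one row of the DataFrame, with the dotted keys as column names.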

Map of NAB ATMs