Exercise 1: Scrape Webpage using 'split and select'

Our goal is to scrape the prices of products on Amazon, for example:

https://www.amazon.com/Union-61100-Outdoor-Garden-Statue/dp/B0027YPQEC


In [6]:
s = open('class1_files/gooselamp.html').read()
s[0:1000]
Out[6]:
'<!DOCTYPE html>\n<!-- saved from url=(0070)https://www.amazon.com/Union-61100-Outdoor-Garden-Statue/dp/B0027YPQEC -->\n<html class=" a-js a-audio a-video a-canvas a-svg a-drag-drop a-geolocation a-history a-webworker a-autofocus a-input-placeholder a-textarea-placeholder a-local-storage a-gradients a-hires a-transform3d -scrolling a-text-shadow a-text-stroke a-box-shadow a-border-radius a-border-image a-opacity a-transform a-transition a-ember" data-19ax5a9jf="dingo" data-aui-build-date="3.17.19-2017-11-30" data-aui-version="a52ef144e0fc28636d8006a975ea6403de8efda9"><!-- sp:feature:head-start --><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><script async="" src="./gooselamp_files/ClientSideMetricsAUIJavascript-d7371dee33ab3a54a5d91c1bc82e1019bc556141._V2_.js" crossorigin="anonymous"></script><script>var aPageStart = (new Date()).getTime();</script>\n<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script><!-- sp:feature:cs-optimization -->\n<meta htt'

Searching through the raw text by hand would be a headache. Instead, use the Inspector in Mozilla Firefox (Web Developer -> Inspector) to visually identify the relevant HTML element.

See Firefox demo


Once identified, use 'split' to extract the price data.
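
To see how the trick works before applying it to the full page, here is a minimal sketch on a made-up HTML fragment (the string `html` and the price in it are invented for illustration). Splitting on the text that comes right before the price yields a two-item list, and the price sits at the start of the second item.

```python
# A made-up fragment standing in for the real page (illustration only).
html = '<div><span id="price">$12.34</span><span>In stock</span></div>'

# Split on the marker that comes right before the number.
parts = html.split('<span id="price">$')
print(len(parts))      # 2: the text before the marker and the text after it

# Select the second piece, then cut it off at the closing tag.
after_marker = parts[1]
price_text = after_marker.split('</span>')[0]
print(price_text)      # 12.34
```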

In [7]:
l1 = s.split('<span id="priceblock_ourprice" class="a-size-medium a-color-price">$')
len(l1)
Out[7]:
2
In [8]:
l1[0][:1000]
Out[8]:
'<!DOCTYPE html>\n<!-- saved from url=(0070)https://www.amazon.com/Union-61100-Outdoor-Garden-Statue/dp/B0027YPQEC -->\n<html class=" a-js a-audio a-video a-canvas a-svg a-drag-drop a-geolocation a-history a-webworker a-autofocus a-input-placeholder a-textarea-placeholder a-local-storage a-gradients a-hires a-transform3d -scrolling a-text-shadow a-text-stroke a-box-shadow a-border-radius a-border-image a-opacity a-transform a-transition a-ember" data-19ax5a9jf="dingo" data-aui-build-date="3.17.19-2017-11-30" data-aui-version="a52ef144e0fc28636d8006a975ea6403de8efda9"><!-- sp:feature:head-start --><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><script async="" src="./gooselamp_files/ClientSideMetricsAUIJavascript-d7371dee33ab3a54a5d91c1bc82e1019bc556141._V2_.js" crossorigin="anonymous"></script><script>var aPageStart = (new Date()).getTime();</script>\n<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script><!-- sp:feature:cs-optimization -->\n<meta htt'
In [9]:
price = l1[1].split('</span>')[0]
price
Out[9]:
'19.59'
In [10]:
price = float(l1[1].split('</span>')[0])
price
Out[10]:
19.59
In [11]:
# In summary
s = open('class1_files/gooselamp.html').read()
l1 = s.split('<span id="priceblock_ourprice" class="a-size-medium a-color-price">$')
price = float(l1[1].split('</span>')[0])
price
Out[11]:
19.59
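
The short version above assumes the marker is present: if it is missing, `l1[1]` raises an IndexError that says nothing about what went wrong. A slightly more defensive sketch follows. The helper name `extract_between` and the `sample` string are hypothetical and not part of the original notebook.

```python
def extract_between(text, start, end):
    """Return the text between the first `start` marker and the next `end` marker.

    Hypothetical helper for illustration; raises ValueError if a marker is absent.
    """
    before, found, after = text.partition(start)
    if not found:
        raise ValueError('start marker not found')
    middle, found, _ = after.partition(end)
    if not found:
        raise ValueError('end marker not found')
    return middle


# Made-up sample in the same shape as the Amazon price element.
marker = '<span id="priceblock_ourprice" class="a-size-medium a-color-price">$'
sample = 'junk' + marker + '19.59</span>more'
price = float(extract_between(sample, marker, '</span>'))
print(price)  # 19.59
```

Using `str.partition` instead of `split` makes the "marker not found" case explicit, since the middle element of the result is empty when the separator does not occur.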

Now let's access the live webpage rather than the downloaded HTML file.

In [12]:
import requests
url = 'https://www.amazon.com/Union-61100-Outdoor-Garden-Statue/dp/B0027YPQEC'
s = requests.get(url)
'19.59' in s.text

# If this returns True, you have successfully accessed the webpage and it does in fact contain the string '19.59'.
Out[12]:
True

Oh, it actually worked. Sometimes you will find that Amazon refuses to serve the page to a script (robot). In that case, we need to fake our User-Agent.

See https://www.whoishostingthis.com/tools/user-agent/ for more info.
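
The User-Agent is just an HTTP request header that tells the server what kind of client is asking. As a rough illustration that needs no network access, the sketch below uses the standard library's `urllib.request` (not `requests`) to show the default value a Python script advertises and how the header can be overridden. The URL is a placeholder. `requests` behaves analogously: its default User-Agent has the form `python-requests/<version>`.

```python
import urllib.request

# The default opener advertises itself as a Python script.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers['User-agent'])   # e.g. Python-urllib/3.x

# Build (but do not send) a request that claims to be a browser.
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'
req = urllib.request.Request('https://example.com', headers={'User-Agent': ua})
print(req.get_header('User-agent'))    # urllib stores the key as 'User-agent'
```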

In [13]:
url = 'http://math.buffalo.edu'
s = requests.get(url)
# print(s.text)
In [14]:
url = 'http://www.buffalo.edu/cas/math.html'
s = requests.get(url,headers={'User-Agent':'Fake out!'})
In [15]:
# A User-Agent string copied from a real browser (Chrome on Windows).
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'
s = requests.get(url, headers={'User-Agent': ua})
s.text[1:100]
Out[15]:
'!DOCTYPE HTML><html lang="en"><!-- cmspub01 0126-193444 -->\n<head>\n    <meta http-equiv="X-UA-Compa'

Finally, we can wrap everything up in a function that can retrieve the price of any product:

In [16]:
import requests
def getprice(pid):
    ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'
    url = 'https://www.amazon.com/dp/' + pid
    s = requests.get(url, headers={'User-Agent': ua})
    pattern = '<span id="priceblock_ourprice" class="a-size-medium a-color-price">$'
    price = float(s.text.split(pattern)[-1].split('</span>')[0])
    return price

price = getprice('B0027YPQEC')
print(price)
19.59
In [17]:
pid = 'B00BB581NQ'
url = 'https://www.amazon.com/dp/'+pid

price = getprice(pid) # another item:  a kite

print('the price of item ' + url + ' is ' + str(price) )
the price of item https://www.amazon.com/dp/B00BB581NQ is 32.9