Exercise 1: Scrape a Webpage Using 'split and select'

Our goal is to scrape the prices of products on Amazon, including:

https://www.amazon.com/Union-61100-Outdoor-Garden-Statue/dp/B0027YPQEC

s = open('class1_files/gooselamp.html').read()
s[0:1000]

'<!DOCTYPE html>\n<!-- saved from url=(0070)https://www.amazon.com/Union-61100-Outdoor-Garden-Statue/dp/B0027YPQEC -->\n<html class=" a-js a-audio a-video a-canvas a-svg a-drag-drop a-geolocation a-history a-webworker a-autofocus a-input-placeholder a-textarea-placeholder a-local-storage a-gradients a-hires a-transform3d -scrolling a-text-shadow a-text-stroke a-box-shadow a-border-radius a-border-image a-opacity a-transform a-transition a-ember" data-19ax5a9jf="dingo" data-aui-build-date="3.17.19-2017-11-30" data-aui-version="a52ef144e0fc28636d8006a975ea6403de8efda9"><!-- sp:feature:head-start --><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><script async="" src="./gooselamp_files/ClientSideMetricsAUIJavascript-d7371dee33ab3a54a5d91c1bc82e1019bc556141._V2_.js" crossorigin="anonymous"></script><script>var aPageStart = (new Date()).getTime();</script>\n<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script><!-- sp:feature:cs-optimization -->\n<meta htt'

Searching through the raw text by eye would be a headache. Instead, use the Mozilla Firefox Web Developer -> Inspector tool to visually identify the relevant HTML element.

See the Firefox demo.

Once the element is identified, use 'split' to extract the price data.

l1 = s.split('<span id="priceblock_ourprice" class="a-size-medium a-color-price">$')
len(l1)

2

l1[0][:1000]

'<!DOCTYPE html>\n<!-- saved from url=(0070)https://www.amazon.com/Union-61100-Outdoor-Garden-Statue/dp/B0027YPQEC -->\n<html class=" a-js a-audio a-video a-canvas a-svg a-drag-drop a-geolocation a-history a-webworker a-autofocus a-input-placeholder a-textarea-placeholder a-local-storage a-gradients a-hires a-transform3d -scrolling a-text-shadow a-text-stroke a-box-shadow a-border-radius a-border-image a-opacity a-transform a-transition a-ember" data-19ax5a9jf="dingo" data-aui-build-date="3.17.19-2017-11-30" data-aui-version="a52ef144e0fc28636d8006a975ea6403de8efda9"><!-- sp:feature:head-start --><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><script async="" src="./gooselamp_files/ClientSideMetricsAUIJavascript-d7371dee33ab3a54a5d91c1bc82e1019bc556141._V2_.js" crossorigin="anonymous"></script><script>var aPageStart = (new Date()).getTime();</script>\n<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script><!-- sp:feature:cs-optimization -->\n<meta htt'

price = l1[1].split('</span>')[0]
price

'19.59'

price = float(l1[1].split('</span>')[0])
price

19.59
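The call to float() above works because the scraped string is a plain decimal number. As a minimal sketch, assuming some prices carry a leading '$' or a US-style thousands separator (the '1,299.00' sample below is made up for illustration), a small hypothetical helper can clean the string first:

```python
def parse_price(text):
    """Convert a scraped price string such as '$1,299.00' to a float.

    Assumes US-style formatting: optional leading '$', ',' as the
    thousands separator and '.' as the decimal point.
    """
    cleaned = text.strip().lstrip('$').replace(',', '')
    return float(cleaned)

print(parse_price('19.59'))         # the value scraped above
print(parse_price(' $1,299.00 '))   # made-up sample with a separator
```

Without the cleanup, float('1,299.00') raises a ValueError.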

# In summary
s = open('class1_files/gooselamp.html').read()
l1 = s.split('<span id="priceblock_ourprice" class="a-size-medium a-color-price">$')
price = float(l1[1].split('</span>')[0])
price

19.59
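The summary above assumes the marker is always present in the page. The following is a hedged sketch of the same split-and-select idea using str.partition, which makes the "marker not found" case explicit. The helper name extract_between and the short sample_html string are illustrative assumptions, not part of the original exercise.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next
    end_marker, or None if either marker is missing."""
    _, found_start, after = text.partition(start_marker)
    if not found_start:
        return None
    value, found_end, _ = after.partition(end_marker)
    if not found_end:
        return None
    return value

# A tiny stand-in for the saved page, so the example runs without the file.
sample_html = ('<div><span id="priceblock_ourprice" '
               'class="a-size-medium a-color-price">$19.59</span></div>')
marker = '<span id="priceblock_ourprice" class="a-size-medium a-color-price">$'

raw = extract_between(sample_html, marker, '</span>')
price = float(raw) if raw is not None else None
print(price)
```

Returning None instead of indexing l1[1] avoids an IndexError when the page layout changes and the marker no longer appears.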

Now let's access the live webpage rather than the downloaded HTML file.

import requests
url = 'https://www.amazon.com/Union-61100-Outdoor-Garden-Statue/dp/B0027YPQEC'
s = requests.get(url)
'19.59' in s.text

# If this returns True, you have successfully accessed the webpage and it does in fact contain the string '19.59'

True

Oh, it actually worked. Sometimes, though, you will find that Amazon refuses to serve the page to a script (robot). In that case, we need to send a browser-like User-Agent header in place of the default one.

See https://www.whoishostingthis.com/tools/user-agent/ for more info.
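To see why a server can tell a script from a browser, it helps to look at the User-Agent that requests sends by default. The sketch below makes no network request: it prints the default value and then stores a browser-like string on a requests.Session, which is one possible way to send the same header on every request. The session variable is an illustrative choice; passing headers= to each requests.get call works as well.

```python
import requests

# The User-Agent sent when none is specified. It has the form
# 'python-requests/<version>', which makes scripts easy to recognise.
print(requests.utils.default_user_agent())

# A browser-like User-Agent string (the one used elsewhere in this exercise).
ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36')

# Headers stored on a Session are sent with every request made through it.
session = requests.Session()
session.headers.update({'User-Agent': ua})
print(session.headers['User-Agent'])

# A later call such as session.get(url) would send the header above.
```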

url = 'http://math.buffalo.edu'
s = requests.get(url)
# print(s.text)

url = 'http://www.buffalo.edu/cas/math.html'
s = requests.get(url,headers={'User-Agent':'Fake out!'})

ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'
s = requests.get(url, headers={'User-Agent': ua})  # send the browser-like string defined above
s.text[1:100]

'!DOCTYPE HTML><html lang="en"><!-- cmspub01 0126-193444 -->\n<head>\n    <meta http-equiv="X-UA-Compa'

Finally, we can wrap everything up in a function that retrieves the price of a product given its Amazon product ID:

import requests
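def getprice(pid):
    """Return the price of the Amazon product with ID pid as a float."""
    # Browser-like User-Agent so the request is less likely to be refused
    ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'
    url = 'https://www.amazon.com/dp/' + pid
    s = requests.get(url, headers={'User-Agent': ua})
    # The price follows this opening tag and ends at the next '</span>'
    pattern = '<span id="priceblock_ourprice" class="a-size-medium a-color-price">$'
    price = float(s.text.split(pattern)[-1].split('</span>')[0])
    return price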

price = getprice('B0027YPQEC')
print(price)

19.59

pid = 'B00BB581NQ'
url = 'https://www.amazon.com/dp/'+pid

price = getprice(pid) # another item:  a kite

print('the price of item ' + url + ' is ' + str(price) )

the price of item https://www.amazon.com/dp/B00BB581NQ is 32.9
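Because the price is stored as a float, str(price) prints 32.9 rather than 32.90. As a small sketch, an f-string with a two-decimal format specifier gives the conventional currency form. The price value is hard-coded here to the output shown above, so the example runs without a network request.

```python
pid = 'B00BB581NQ'
url = 'https://www.amazon.com/dp/' + pid
price = 32.9  # value returned by getprice(pid) in the run above

# ':.2f' always formats the float with exactly two decimal places.
message = f'the price of item {url} is ${price:.2f}'
print(message)
```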
