- Python Web Scraping (Second Edition)
- Katharine Jarmul, Richard Lawson
- 2021-07-09 19:42:45
Setting a user agent
By default, urllib will download content with the Python-urllib/3.x user agent, where 3.x is the environment's current version of Python. It would be preferable to use an identifiable user agent in case problems occur with our web crawler. Also, some websites block this default user agent, perhaps after they have experienced a poorly made Python web crawler overloading their server. For example, http://www.meetup.com/ currently returns a 403 Forbidden when requesting the page with urllib's default user agent.
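To see this default without making any request, one option is to inspect the headers that a freshly built urllib opener would send. This is only a minimal sketch; the variable names are illustrative and not taken from the book.

```python
import sys
import urllib.request

# build_opener() returns an OpenerDirector preloaded with urllib's default headers
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# the default agent is derived from the running interpreter's major.minor version
expected_agent = 'Python-urllib/%d.%d' % sys.version_info[:2]
print(default_headers['User-agent'])
```

The printed value is what a server sees unless the request overrides the header.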
To download sites reliably, we need control over the user agent. Here is an updated version of our download function with the default user agent set to 'wswp' (which stands for Web Scraping with Python):
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, user_agent='wswp', num_retries=2):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    # identify our crawler instead of sending the default Python-urllib agent
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors, keeping the same user agent
                return download(url, user_agent, num_retries - 1)
    return html
If you now try meetup.com, you will see valid HTML. Our download function can now be reused in later code to catch errors, retry the site when possible, and set the user agent.
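A quick offline check of the header can be sketched as follows. It assumes a hypothetical helper, build_request, which is not part of the book's code and simply mirrors the first lines of download; the example.com URL is a placeholder that is never fetched.

```python
import urllib.request

def build_request(url, user_agent='wswp'):
    # hypothetical helper: mirrors how download() prepares its request
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    return request

# no network traffic happens here; we only inspect the prepared requests
request = build_request('http://example.com/')
print(request.get_header('User-agent'))

custom = build_request('http://example.com/', user_agent='my-crawler')
print(custom.header_items())
```

Inspecting the request this way confirms which user agent would be sent before any site is contacted.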