- Python Web Scraping (Second Edition)
- Katharine Jarmul, Richard Lawson
- 2021-07-09 19:42:45
Setting a user agent
By default, urllib will download content with the Python-urllib/3.x user agent, where 3.x is the environment's current version of Python. It would be preferable to use an identifiable user agent in case problems occur with our web crawler. Also, some websites block this default user agent, perhaps after they have experienced a poorly made Python web crawler overloading their server. For example, http://www.meetup.com/ currently returns a 403 Forbidden when requesting the page with urllib's default user agent.
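The default value can be checked locally without contacting any server. The following is a minimal sketch, not code from the book: `default_user_agent` is a hypothetical helper name, and it relies on the `addheaders` list that urllib's opener objects carry.

```python
import urllib.request


def default_user_agent():
    """Return the User-agent string urllib sends when none is set.

    Hypothetical helper for illustration; not part of the book's code.
    """
    # build_opener() returns an OpenerDirector whose addheaders attribute
    # is a list of (name, value) tuples sent with every request
    opener = urllib.request.build_opener()
    return dict(opener.addheaders).get('User-agent')


# Prints a string of the form Python-urllib/3.x for the running interpreter
print(default_user_agent())
```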
To download sites reliably, we need control over the user agent. Here is an updated version of our download function with the default user agent set to 'wswp' (which stands for Web Scraping with Python):
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, user_agent='wswp', num_retries=2):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    # identify the crawler instead of sending urllib's default user agent
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors, keeping the same user agent
                return download(url, user_agent, num_retries - 1)
    return html
If you now try meetup.com, you will see valid HTML. Our download function can now be reused in later code to catch errors, retry the site when possible, and set the user agent.
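To confirm that the header is attached without making a network request, the Request object can be inspected directly. This is a minimal sketch under stated assumptions: `build_request` is a hypothetical helper that mirrors the first lines of the download function, and http://example.com is only a placeholder URL.

```python
import urllib.request


def build_request(url, user_agent='wswp'):
    """Create a Request carrying a custom user agent (hypothetical helper)."""
    request = urllib.request.Request(url)
    # urllib stores header names capitalized, so the key becomes 'User-agent'
    request.add_header('User-agent', user_agent)
    return request


request = build_request('http://example.com')
# No connection is opened here; we only read back the stored header
print(request.get_header('User-agent'))  # wswp
```

Because urllib normalizes the header name when it is stored, the lookup key passed to get_header should use the same 'User-agent' capitalization.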