
Setting a user agent

By default, urllib downloads content with the Python-urllib/3.x user agent, where 3.x is the environment's current version of Python. It is preferable to use an identifiable user agent in case problems occur with our web crawler. Also, some websites block this default user agent, perhaps after experiencing a poorly made Python web crawler overloading their server. For example, http://www.meetup.com/ currently returns a 403 Forbidden when the page is requested with urllib's default user agent.
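You can see the default user agent without making any network request by inspecting the headers that urllib's opener attaches; a minimal sketch:

```python
import urllib.request

# Build a default opener and inspect the headers it will send.
opener = urllib.request.build_opener()
print(opener.addheaders)
# e.g. [('User-agent', 'Python-urllib/3.11')] -- version matches your interpreter
```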

To download sites reliably, we need control over setting the user agent. Here is an updated version of our download function with the default user agent set to 'wswp' (which stands for Web Scraping with Python):

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, user_agent='wswp', num_retries=2):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, user_agent=user_agent,
                                num_retries=num_retries - 1)
    return html

If you now try meetup.com, you will see valid HTML. Our download function can now be reused in later code to catch errors, retry the site when possible, and set the user agent.
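Independent of any live request, you can confirm that the custom header is attached to a urllib Request object before it is sent; a small sketch (the URL is purely illustrative):

```python
import urllib.request

# Build a request the same way download() does and inspect its headers.
request = urllib.request.Request('http://example.com')
request.add_header('User-agent', 'wswp')

# urllib normalizes header keys via str.capitalize(), so query with 'User-agent'.
print(request.get_header('User-agent'))  # prints: wswp
```

This kind of check is handy in unit tests, where you want to verify request construction without hitting a real server.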
