
Setting a user agent

By default, urllib will download content with the Python-urllib/3.x user agent, where 3.x is the environment's current version of Python. It would be preferable to use an identifiable user agent in case problems occur with our web crawler. Also, some websites block this default user agent, perhaps after they have experienced a poorly made Python web crawler overloading their server. For example, http://www.meetup.com/ currently returns a 403 Forbidden response when requested with urllib's default user agent.
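You can check this behavior yourself. The following sketch prints the user agent header that urllib sends by default and shows the resulting error; it assumes meetup.com still rejects the default agent, which may change over time:

    import urllib.request
    from urllib.error import HTTPError

    # the default opener identifies itself as Python-urllib/3.x
    print(urllib.request.build_opener().addheaders)
    # [('User-agent', 'Python-urllib/3.x')]

    try:
        urllib.request.urlopen('http://www.meetup.com/')
    except HTTPError as e:
        # expected output (at the time of writing): 403 Forbidden
        print(e.code, e.reason)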

To download sites reliably, we will need to have control over setting the user agent. Here is an updated version of our download function with the default user agent set to 'wswp' (which stands for Web Scraping with Python):

    import urllib.request
    from urllib.error import URLError, HTTPError, ContentTooShortError

    def download(url, user_agent='wswp', num_retries=2):
        print('Downloading:', url)
        request = urllib.request.Request(url)
        # send our custom user agent instead of the Python-urllib default
        request.add_header('User-agent', user_agent)
        try:
            html = urllib.request.urlopen(request).read()
        except (URLError, HTTPError, ContentTooShortError) as e:
            print('Download error:', e.reason)
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    # recursively retry 5xx HTTP errors
                    return download(url, user_agent, num_retries - 1)
        return html

If you now try meetup.com, you will see valid HTML. Our download function can now be reused in later code to catch errors, retry the download when possible, and set the user agent.
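As a quick usage sketch, the function can be called with the default agent or with an explicit string of your own; the second URL and agent name here are only placeholders for illustration:

    html = download('http://www.meetup.com/')
    # Downloading: http://www.meetup.com/

    # identify the crawler explicitly with a custom user agent string
    html = download('http://example.com', user_agent='my-crawler/1.0')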
