
Setting a user agent

By default, urllib will download content with the Python-urllib/3.x user agent, where 3.x is the environment's current version of Python. It would be preferable to use an identifiable user agent in case problems occur with our web crawler. Also, some websites block this default user agent, perhaps after they have experienced a poorly made Python web crawler overloading their server. For example, http://www.meetup.com/ currently returns 403 Forbidden when the page is requested with urllib's default user agent.
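To see this blocking behavior for yourself, the following minimal sketch (assuming meetup.com still rejects the default agent, which may change over time) requests the page with plain urllib and prints the error code:

import urllib.request
from urllib.error import HTTPError

# Request the page with urllib's default Python-urllib/3.x user agent;
# if the site blocks that agent, urlopen raises an HTTPError we can inspect.
try:
    urllib.request.urlopen('http://www.meetup.com/').read()
except HTTPError as e:
    print('Blocked with status:', e.code)  # e.g. 403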

To download sites reliably, we will need to have control over setting the user agent. Here is an updated version of our download function with the default user agent set to 'wswp' (which stands for Web Scraping with Python):

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, user_agent='wswp', num_retries=2):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors, keeping the same user agent
                return download(url, user_agent, num_retries - 1)
    return html

If you now try meetup.com, you will see valid HTML. Our download function can now be reused in later code to catch errors, retry the site when possible, and set the user agent.
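As a quick usage sketch (the URL is just an example; any page can be passed in), calling the function looks like this:

html = download('http://www.meetup.com/')
if html is not None:
    print('Downloaded', len(html), 'bytes')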
