
Downloading a web page

To scrape web pages, we first need to download them. Here is a simple Python function that uses the urllib.request module to download a URL:

import urllib.request

def download(url):
    return urllib.request.urlopen(url).read()
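Note that urlopen(...).read() returns raw bytes, not a string, so you will usually want to decode the result before processing it as text. A quick sketch of this, using a data: URL as a stand-in for a real web address so the example runs without network access (any http or https URL behaves the same way):

```python
import urllib.request

def download(url):
    return urllib.request.urlopen(url).read()

# The data: URL below is a self-contained stand-in for a real page.
html = download('data:text/html,<html><body>Hello</body></html>')
print(html)           # raw bytes: b'<html><body>Hello</body></html>'
print(html.decode())  # decode to str for text processing
```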

When a URL is passed, this function will download the web page and return the HTML. The problem with this snippet is that, when downloading the web page, we might encounter errors that are beyond our control; for example, the requested page may no longer exist. In these cases, urllib will raise an exception and exit the script. To be safer, here is a more robust version to catch these exceptions:

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url):
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
    return html

Now, when a download or URL error is encountered, the exception is caught and the function returns None.
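To see both paths without touching the network, the sketch below exercises the robust download function with two stand-in URLs: a data: URL for the success case, and a file:// URL pointing at a path that (we assume) does not exist, which makes urlopen raise a URLError that the function catches:

```python
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url):
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
    return html

# Success path: a data: URL stands in for a real page.
page = download('data:text/html,<h1>ok</h1>')
# Failure path: a file:// URL for a nonexistent path raises URLError,
# which the function catches and converts into a None return value.
missing = download('file:///no/such/file')
print(page)     # b'<h1>ok</h1>'
print(missing)  # None
```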

Throughout this book, we will assume you are creating files with code that is presented without prompts (like the code above). When you see code that begins with a Python prompt >>> or an IPython prompt In [1]:, you will need to either enter it into the main file you have been using, or save the file and import those functions and classes into your Python interpreter. If you run into any issues, please take a look at the code in the book repository at https://github.com/kjam/wswp.