官术网_书友最值得收藏!

Sitemap crawler

For our first simple crawler, we will use the sitemap discovered in the example website's robots.txt to download all the web pages. To parse the sitemap, we will use a simple regular expression to extract URLs within the <loc> tags.

We will need to update our code to handle encoding conversions as our current download function simply returns bytes. Note that a more robust parsing approach called CSS selectors will be introduced in the next chapter. Here is our first example crawler:

import re

def download(url, user_agent='wswp', num_retries=2, charset='utf-8'):
print('Downloading:', url)
request = urllib.request.Request(url)
request.add_header('User-agent', user_agent)
try:
resp = urllib.request.urlopen(request)
cs = resp.headers.get_content_charset()
if not cs:
cs = charset
html = resp.read().decode(cs)
except (URLError, HTTPError, ContentTooShortError) as e:
print('Download error:', e.reason)
html = None
if num_retries > 0:
if hasattr(e, 'code') and 500 <= e.code < 600:
# recursively retry 5xx HTTP errors
return download(url, num_retries - 1)
return html

def crawl_sitemap(url):
# download the sitemap file
sitemap = download(url)
# extract the sitemap links
links = re.findall('<loc>(.*?)</loc>', sitemap)
# download each link
for link in links:
html = download(link)
# scrape html here
# ...

Now, we can run the sitemap crawler to download all countries from the example website:

    >>> crawl_sitemap('http://example.webscraping.com/sitemap.xml')
Downloading: http://example.webscraping.com/sitemap.xml
Downloading: http://example.webscraping.com/view/Afghanistan-1
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Albania-3
...

As shown in our download method above, we had to update the character encoding to utilize regular expressions with the website response. The Python read method on the response will return bytes, and the re module expects a string. Our code depends on the website maintainer to include the proper character encoding in the response headers. If the character encoding header is not returned, we default to UTF-8 and hope for the best. Of course, this decoding will throw an error if either the header encoding returned is incorrect or if the encoding is not set and also not UTF-8. There are some more complex ways to guess encoding (see: https://pypi.python.org/pypi/chardet), which are fairly easy to implement.

For now, the Sitemap crawler works as expected. But as discussed earlier, Sitemap files often cannot be relied on to provide links to every web page. In the next section, another simple crawler will be introduced that does not depend on the Sitemap file.

If you don't want to continue the crawl at any time you can hit Ctrl + C or c md + C to exit the Python interpreter or program execution.
主站蜘蛛池模板: 尼玛县| 万安县| 双辽市| 铜川市| 湖北省| 安达市| 米易县| 平度市| 容城县| 扬州市| 修武县| 巴中市| 长武县| 阿瓦提县| 天门市| 阿鲁科尔沁旗| 革吉县| 石首市| 吉木乃县| 乐都县| 衡南县| 都安| 武汉市| 班戈县| 阜平县| 黄梅县| 鄂温| 峨山| 绍兴市| 邵阳县| 青岛市| 景泰县| 黄大仙区| 宕昌县| 黄石市| 许昌县| 秦皇岛市| 常宁市| 长岭县| 巴彦淖尔市| 内丘县|