Sitemap crawler
For our first simple crawler, we will use the sitemap discovered in the example website's robots.txt to download all the web pages. To parse the sitemap, we will use a simple regular expression to extract URLs within the <loc> tags.
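For reference, a site advertises the location of its sitemap through the standard Sitemap directive in robots.txt; the example site's file contains a line along these lines:
Sitemap: http://example.webscraping.com/sitemap.xml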
We will need to update our code to handle encoding conversions, because our current download function simply returns bytes. Note that a more robust parsing approach, CSS selectors, will be introduced in the next chapter. Here is our first example crawler:
import re
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, user_agent='wswp', num_retries=2, charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        # prefer the charset declared in the response headers
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, user_agent, num_retries - 1, charset)
    return html
def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...
Now, we can run the sitemap crawler to download all countries from the example website:
>>> crawl_sitemap('http://example.webscraping.com/sitemap.xml')
Downloading: http://example.webscraping.com/sitemap.xml
Downloading: http://example.webscraping.com/view/Afghanistan-1
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Albania-3
...
As shown in the download function above, we had to handle the character encoding in order to use regular expressions on the website's response. The Python read method on the response returns bytes, while the re module expects a string. Our code relies on the website maintainer to include the proper character encoding in the response headers. If the character encoding header is not returned, we default to UTF-8 and hope for the best. Of course, this decoding will raise an error if the encoding declared in the header is incorrect, or if the encoding is not set and the content is also not UTF-8. There are more sophisticated ways to guess an encoding (see https://pypi.python.org/pypi/chardet), which are fairly easy to implement.
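As a minimal sketch of that approach: the optional chardet package can guess an encoding from raw bytes when no charset header is present. The download_detect function below is a hypothetical variant, not the book's code; chardet.detect returns a dict with 'encoding' and 'confidence' keys, and detection is probabilistic, so the confidence value is worth checking before trusting the guess.
import chardet
import urllib.request

def download_detect(url, user_agent='wswp', default_charset='utf-8'):
    # hypothetical variant of download that falls back to chardet
    # when the response headers declare no charset
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    resp = urllib.request.urlopen(request)
    raw = resp.read()
    cs = resp.headers.get_content_charset()
    if not cs:
        guess = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
        cs = guess['encoding'] or default_charset
    return raw.decode(cs)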
For now, the sitemap crawler works as expected. But as discussed earlier, sitemap files often cannot be relied on to provide links to every web page. In the next section, another simple crawler will be introduced that does not depend on the sitemap file.