官术网_书友最值得收藏!

ID iteration crawler

In this section, we will take advantage of a weakness in the website structure to easily access all the content. Here are the URLs of some sample countries:

We can see that the URLs only differ in the final section of the URL path, with the country name (known as a slug) and ID. It is a common practice to include a slug in the URL to help with search engine optimization. Quite often, the web server will ignore the slug and only use the ID to match relevant records in the database. Let's check whether this works with our example website by removing the slug and checking the page http://example.webscraping.com/view/1:

The web page still loads! This is useful to know because now we can ignore the slug and simply utilize database IDs to download all the countries. Here is an example code snippet that takes advantage of this trick:

import itertools 

def crawl_site(url):
for page in itertools.count(1):
pg_url = '{}{}'.format(url, page)
html = download(pg_url)
if html is None:
break
# success - can scrape the result

Now we can use the function by passing in the base URL:

>>> crawl_site('http://example.webscraping.com/view/-')
Downloading: http://example.webscraping.com/view/-1
Downloading: http://example.webscraping.com/view/-2
Downloading: http://example.webscraping.com/view/-3
Downloading: http://example.webscraping.com/view/-4
[...]

Here, we iterate the ID until we encounter a download error, which we assume means our scraper has reached the last country. A weakness in this implementation is that some records may have been deleted, leaving gaps in the database IDs. Then, when one of these gaps is reached, the crawler will immediately exit. Here is an improved version of the code that allows a number of consecutive download errors before exiting:

def crawl_site(url, max_errors=5):
for page in itertools.count(1):
pg_url = '{}{}'.format(url, page)
html = download(pg_url)
if html is None:
num_errors += 1
if num_errors == max_errors:
# max errors reached, exit loop
break
else:
num_errors = 0
# success - can scrape the result

The crawler in the preceding code now needs to encounter five consecutive download errors to stop iteration, which decreases the risk of stopping iteration prematurely when some records have been deleted or hidden.

Iterating the IDs is a convenient approach to crawling a website, but is similar to the sitemap approach in that it will not always be available. For example, some websites will check whether the slug is found in the URL and if not return a 404 Not Found error. Also, other websites use large nonsequential or nonnumeric IDs, so iterating is not practical. For example, Amazon uses ISBNs, as the ID for the available books, that have at least ten digits. Using an ID iteration for ISBNs would require testing billions of possible combinations, which is certainly not the most efficient approach to scraping the website content.

As you've been following along, you might have noticed some download errors with the message TOO MANY REQUESTS . Don't worry about them at the moment; we will cover more about handling these types of error in the Advanced Features section of this chapter.

主站蜘蛛池模板: 徐水县| 沂水县| 贵定县| 那曲县| 郯城县| 陆良县| 榆中县| 三门县| 武汉市| 万山特区| 台东市| 海门市| 荆门市| 绥德县| 延庆县| 滨海县| 六盘水市| 大石桥市| 洪雅县| 西青区| 宁化县| 收藏| 都昌县| 饶平县| 招远市| 瑞昌市| 宁安市| 东山县| 武穴市| 昆山市| 丰镇市| 房山区| 宝坻区| 垫江县| 商城县| 枞阳县| 南投市| 家居| 吉木乃县| 丽江市| 南投市|