Final version

The full source code for this advanced link crawler can be downloaded at https://github.com/kjam/wswp/blob/master/code/chp1/advanced_link_crawler.py. Each of the sections in this chapter has matching code in the repository at https://github.com/kjam/wswp. To easily follow along, feel free to fork the repository and use it to compare and test your own code.

To test the link crawler, let's try setting the user agent to BadCrawler, which, as we saw earlier in this chapter, was blocked by robots.txt. As expected, the crawl is blocked and finishes immediately:

    >>> start_url = 'http://example.webscraping.com/index'
    >>> link_regex = '/(index|view)'
    >>> link_crawler(start_url, link_regex, user_agent='BadCrawler')
    Blocked by robots.txt: http://example.webscraping.com/
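
The robots.txt check behind this result can be sketched with the standard library's `urllib.robotparser`. This is a minimal illustration and not the repository's code: the `is_allowed` helper and the inlined `ROBOTS_LINES` rules are hypothetical stand-ins, with the rules assumed to resemble the site's robots.txt so that the example runs without network access.

```python
from urllib import robotparser

# Assumed rules for illustration only; the live robots.txt may differ.
ROBOTS_LINES = [
    "User-agent: BadCrawler",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /trap",
]


def is_allowed(url, user_agent, robots_lines=ROBOTS_LINES):
    """Return True if user_agent may fetch url under the given rules."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)  # parse in-memory lines instead of fetching a file
    return rp.can_fetch(user_agent, url)


start_url = 'http://example.webscraping.com/index'
for agent in ('BadCrawler', 'wswp'):
    status = 'allowed' if is_allowed(start_url, agent) else 'blocked by robots.txt'
    print(agent, status)
```

Under these assumed rules, BadCrawler is refused for every path, while any other user agent is refused only for paths beginning with /trap.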

Now, let's try using the default user agent and setting the maximum depth to 1 so that only the links from the home page are downloaded:

    >>> link_crawler(start_url, link_regex, max_depth=1)
    Downloading: http://example.webscraping.com//index
    Downloading: http://example.webscraping.com/index/1
    Downloading: http://example.webscraping.com/view/Antigua-and-Barbuda-10
    Downloading: http://example.webscraping.com/view/Antarctica-9
    Downloading: http://example.webscraping.com/view/Anguilla-8
    Downloading: http://example.webscraping.com/view/Angola-7
    Downloading: http://example.webscraping.com/view/Andorra-6
    Downloading: http://example.webscraping.com/view/American-Samoa-5
    Downloading: http://example.webscraping.com/view/Algeria-4
    Downloading: http://example.webscraping.com/view/Albania-3
    Downloading: http://example.webscraping.com/view/Aland-Islands-2
    Downloading: http://example.webscraping.com/view/Afghanistan-1

As expected, the crawl stopped after downloading the first page of countries.
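
To see why the depth limit stops the crawl here, the bookkeeping can be sketched on a small in-memory link graph instead of a live site. This is an illustrative sketch under stated assumptions, not the repository's implementation: `crawl_order` and the toy `links` dictionary are hypothetical, and a first-in, first-out queue is assumed for simplicity, so the visiting order will not necessarily match the output above.

```python
from collections import deque


def crawl_order(start, links, max_depth):
    """Return the pages a depth-limited crawl would visit, in visiting order.

    links maps each page to the pages it links to (a hypothetical stand-in
    for downloading a page and extracting its links).
    """
    seen = {start: 0}  # page -> depth at which it was first discovered
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        depth = seen[url]
        if depth == max_depth:
            continue  # page is at the limit: visit it, but do not follow its links
        for link in links.get(url, []):
            if link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return visited


# Toy site: the index links to two country pages and a second index page.
links = {
    '/index': ['/view/A', '/view/B', '/index/1'],
    '/index/1': ['/view/C'],
    '/view/A': ['/iso/A'],
}
print(crawl_order('/index', links, max_depth=1))
```

With `max_depth=1`, only the start page and the pages it links to directly are visited; raising the limit to 2 would also reach `/iso/A` and `/view/C`.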