官术网_书友最值得收藏!

Final version

The full source code for this advanced link crawler can be downloaded at https://github.com/kjam/wswp/blob/master/code/chp1/advanced_link_crawler.py. Each of the sections in this chapter has matching code in the repository at https://github.com/kjam/wswp. To easily follow along, feel free to fork the repository and use it to compare and test your own code.

To test the link crawler, let's try setting the user agent to BadCrawler, which, as we saw earlier in this chapter, was blocked by robots.txt. As expected, the crawl is blocked and finishes immediately:

    >>> start_url = 'http://example.webscraping.com/index'
>>> link_regex = '/(index|view)'
>>> link_crawler(start_url, link_regex, user_agent='BadCrawler')
Blocked by robots.txt: http://example.webscraping.com/

Now, let's try using the default user agent and setting the maximum depth to 1 so that only the links from the home page are downloaded:

    >>> link_crawler(start_url, link_regex, max_depth=1)
Downloading: http://example.webscraping.com//index
Downloading: http://example.webscraping.com/index/1
Downloading: http://example.webscraping.com/view/Antigua-and-Barbuda-10
Downloading: http://example.webscraping.com/view/Antarctica-9
Downloading: http://example.webscraping.com/view/Anguilla-8
Downloading: http://example.webscraping.com/view/Angola-7
Downloading: http://example.webscraping.com/view/Andorra-6
Downloading: http://example.webscraping.com/view/American-Samoa-5
Downloading: http://example.webscraping.com/view/Algeria-4
Downloading: http://example.webscraping.com/view/Albania-3
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Afghanistan-1

As expected, the crawl stopped after downloading the first page of countries.

主站蜘蛛池模板: 渝北区| 交城县| 尉犁县| 新乐市| 阿尔山市| 宾川县| 同德县| 石林| 江阴市| 田林县| 安岳县| 吴堡县| SHOW| 辰溪县| 石河子市| 台中市| 瑞安市| 平江县| 乐陵市| 宣威市| 南丹县| 广丰县| 东海县| 滕州市| 蓬莱市| 海林市| 通榆县| 宝鸡市| 鹿邑县| 驻马店市| 甘孜| 延长县| 乐山市| 安宁市| 苏州市| 旬邑县| 温宿县| 婺源县| 陇西县| 沭阳县| 康定县|