- Python Web Scraping (Second Edition)
- Katharine Jarmul, Richard Lawson
- 2021-07-09 19:42:46
Final version
The full source code for this advanced link crawler can be downloaded at https://github.com/kjam/wswp/blob/master/code/chp1/advanced_link_crawler.py. Each of the sections in this chapter has matching code in the repository at https://github.com/kjam/wswp. To easily follow along, feel free to fork the repository and use it to compare and test your own code.
To test the link crawler, let's try setting the user agent to BadCrawler, which, as we saw earlier in this chapter, was blocked by robots.txt. As expected, the crawl is blocked and finishes immediately:
>>> start_url = 'http://example.webscraping.com/index'
>>> link_regex = '/(index|view)'
>>> link_crawler(start_url, link_regex, user_agent='BadCrawler')
Blocked by robots.txt: http://example.webscraping.com/
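To see why the crawl is refused, here is a minimal sketch of a robots.txt check built on the standard library's urllib.robotparser. The rules in ROBOTS_TXT are an assumption written for illustration (a BadCrawler section that disallows everything, plus a permissive default), and build_parser and is_allowed are hypothetical helper names, not necessarily what the repository's crawler uses. The rules are parsed from a string so that the example runs without network access:

```python
from urllib import robotparser

# Assumed robots.txt rules, for illustration only; the real file is not fetched.
ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded from robots.txt text (hypothetical helper)."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp


def is_allowed(rp, user_agent, url):
    """Return True if the parsed rules let user_agent fetch url (hypothetical helper)."""
    return rp.can_fetch(user_agent, url)


rp = build_parser(ROBOTS_TXT)
url = 'http://example.webscraping.com/index'
for agent in ('BadCrawler', 'GoodCrawler'):
    if is_allowed(rp, agent, url):
        print('Allowed for %s: %s' % (agent, url))
    else:
        print('Blocked by robots.txt: %s' % url)
```

A crawler that runs a check like this before each download never requests a page as BadCrawler, which is consistent with the crawl finishing immediately.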
Now, let's try using the default user agent and setting the maximum depth to 1 so that only the links from the home page are downloaded:
>>> link_crawler(start_url, link_regex, max_depth=1)
Downloading: http://example.webscraping.com//index
Downloading: http://example.webscraping.com/index/1
Downloading: http://example.webscraping.com/view/Antigua-and-Barbuda-10
Downloading: http://example.webscraping.com/view/Antarctica-9
Downloading: http://example.webscraping.com/view/Anguilla-8
Downloading: http://example.webscraping.com/view/Angola-7
Downloading: http://example.webscraping.com/view/Andorra-6
Downloading: http://example.webscraping.com/view/American-Samoa-5
Downloading: http://example.webscraping.com/view/Algeria-4
Downloading: http://example.webscraping.com/view/Albania-3
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Afghanistan-1
As expected, the crawl stopped after downloading the first page of countries.
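The effect of the depth limit can be sketched without touching the network. The example below is a simplified illustration, not the repository's implementation: SITE is a hypothetical in-memory link graph standing in for the real pages, crawl is a hypothetical function name, and the depth rule (a page's links are followed only while its depth is below max_depth) is one assumed reading of the behavior described above:

```python
from collections import deque

# Hypothetical in-memory "site": each page maps to the links found on it.
SITE = {
    '/index': ['/view/Afghanistan-1', '/view/Aland-Islands-2', '/index/1'],
    '/index/1': ['/view/Hypothetical-99', '/index/2'],
    '/view/Afghanistan-1': ['/index'],
    '/view/Aland-Islands-2': ['/index'],
}


def crawl(start, max_depth):
    """Breadth-first crawl that records the depth at which each page was found."""
    seen = {start: 0}          # page -> depth at which it was discovered
    queue = deque([start])
    downloaded = []
    while queue:
        page = queue.popleft()
        downloaded.append(page)        # stand-in for the real download
        depth = seen[page]
        if depth >= max_depth:
            continue                   # do not follow links beyond the limit
        for link in SITE.get(page, []):
            if link not in seen:       # skip pages that are already queued
                seen[link] = depth + 1
                queue.append(link)
    return downloaded


for page in crawl('/index', max_depth=1):
    print('Downloading:', page)
```

With max_depth=1, only the start page and the pages it links to are visited; /index/2 and the second page's country link are never queued. Tracking depth in the seen dictionary also keeps the crawler from looping forever on pages that link back to each other.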