- Python Web Scraping (Second Edition)
- Katharine Jarmul, Richard Lawson
- 2021-07-09 19:42:46
Final version
The full source code for this advanced link crawler can be downloaded at https://github.com/kjam/wswp/blob/master/code/chp1/advanced_link_crawler.py. Each of the sections in this chapter has matching code in the repository at https://github.com/kjam/wswp. To easily follow along, feel free to fork the repository and use it to compare and test your own code.
To test the link crawler, let's try setting the user agent to BadCrawler, which, as we saw earlier in this chapter, was blocked by robots.txt. As expected, the crawl is blocked and finishes immediately:
>>> start_url = 'http://example.webscraping.com/index'
>>> link_regex = '/(index|view)'
>>> link_crawler(start_url, link_regex, user_agent='BadCrawler')
Blocked by robots.txt: http://example.webscraping.com/
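The blocking decision itself can be reproduced offline with the standard library's `urllib.robotparser`. The following is a minimal sketch, not the crawler's actual code: the `ROBOTS_TXT` rules are an assumed stand-in for the site's real robots.txt, and `GoodCrawler` is a hypothetical agent name used only for contrast.

```python
from urllib import robotparser

# Assumed rules for illustration; the real robots.txt on the site may differ.
ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""

rp = robotparser.RobotFileParser()
# parse() accepts an iterable of lines, so no network request is needed.
rp.parse(ROBOTS_TXT.splitlines())

url = 'http://example.webscraping.com/index'

# BadCrawler is disallowed everywhere, so the crawl would stop at once.
bad_allowed = rp.can_fetch('BadCrawler', url)

# 'GoodCrawler' (hypothetical name) falls under the wildcard rules.
good_allowed = rp.can_fetch('GoodCrawler', url)
trap_allowed = rp.can_fetch('GoodCrawler', 'http://example.webscraping.com/trap')

for agent, allowed in [('BadCrawler', bad_allowed), ('GoodCrawler', good_allowed)]:
    status = 'allowed' if allowed else 'blocked by robots.txt'
    print(agent, status)
```

A crawler that consults `can_fetch` before every download, as the transcript above suggests, never requests a page when the answer is `False`.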
Now, let's try using the default user agent and setting the maximum depth to 1 so that only the links from the home page are downloaded:
>>> link_crawler(start_url, link_regex, max_depth=1)
Downloading: http://example.webscraping.com/index
Downloading: http://example.webscraping.com/index/1
Downloading: http://example.webscraping.com/view/Antigua-and-Barbuda-10
Downloading: http://example.webscraping.com/view/Antarctica-9
Downloading: http://example.webscraping.com/view/Anguilla-8
Downloading: http://example.webscraping.com/view/Angola-7
Downloading: http://example.webscraping.com/view/Andorra-6
Downloading: http://example.webscraping.com/view/American-Samoa-5
Downloading: http://example.webscraping.com/view/Algeria-4
Downloading: http://example.webscraping.com/view/Albania-3
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Afghanistan-1
As expected, the crawl stopped after downloading the first page of countries.
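The effect of a depth limit can be sketched on a small in-memory link graph. This is an illustrative model under stated assumptions rather than the repository's implementation: `SITE` and `crawl_with_depth` are hypothetical names, no pages are actually downloaded, and the sketch visits pages in first-in, first-out order, which may differ from the order shown in the transcript above.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
SITE = {
    '/index': ['/index/1', '/view/Afghanistan-1', '/view/Albania-3'],
    '/index/1': ['/index/2', '/view/Andorra-6'],
    '/view/Afghanistan-1': ['/index'],
    '/view/Albania-3': [],
    '/index/2': [],
    '/view/Andorra-6': [],
}


def crawl_with_depth(start, site, max_depth):
    """Return the pages visited, following links at most max_depth hops."""
    seen = {start: 0}        # page -> depth at which it was first discovered
    queue = deque([start])
    visited = []
    while queue:
        page = queue.popleft()
        visited.append(page)
        depth = seen[page]
        if depth == max_depth:
            # The page itself is fetched, but its links are not followed.
            continue
        for link in site.get(page, []):
            if link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return visited


pages = crawl_with_depth('/index', SITE, max_depth=1)
for page in pages:
    print('Downloading:', page)
```

With `max_depth=1`, only the start page and the pages it links to are visited; `/index/2` and `/view/Andorra-6`, which are two hops away, are never reached.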