- Python Web Scraping (Second Edition)
- Katharine Jarmul, Richard Lawson
- 2021-07-09 19:42:46
Final version
The full source code for this advanced link crawler can be downloaded at https://github.com/kjam/wswp/blob/master/code/chp1/advanced_link_crawler.py. Each of the sections in this chapter has matching code in the repository at https://github.com/kjam/wswp. To easily follow along, feel free to fork the repository and use it to compare and test your own code.
To test the link crawler, let's set the user agent to BadCrawler, which, as we saw earlier in this chapter, is blocked by the site's robots.txt. As expected, the crawl is blocked and finishes immediately:
>>> start_url = 'http://example.webscraping.com/index'
>>> link_regex = '/(index|view)'
>>> link_crawler(start_url, link_regex, user_agent='BadCrawler')
Blocked by robots.txt: http://example.webscraping.com/
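The check behind that message can be sketched with the standard library's urllib.robotparser. This is a minimal illustration, not the repository's code: the rules in SAMPLE_ROBOTS_TXT are an assumed stand-in for the site's real robots.txt, the helper names build_parser and is_allowed are hypothetical, and 'wswp' is used only as a placeholder for a non-blocked user agent. The rules are parsed from a string so the example runs without network access.

```python
from urllib import robotparser

# Assumed sample rules for illustration; the live robots.txt may differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded from in-memory robots.txt text."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp


def is_allowed(rp, user_agent, url):
    """True if the parsed rules let user_agent fetch url."""
    return rp.can_fetch(user_agent, url)


rp = build_parser(SAMPLE_ROBOTS_TXT)
url = 'http://example.webscraping.com/index'
for agent in ('BadCrawler', 'wswp'):
    if is_allowed(rp, agent, url):
        print('Allowed for %s: %s' % (agent, url))
    else:
        print('Blocked by robots.txt:', url)
```

Under these assumed rules, BadCrawler is refused for every path, while any other agent is refused only for paths starting with /trap. A crawler would typically run this check before each download and skip the URL when it fails.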
Now, let's try using the default user agent and setting the maximum depth to 1 so that only the links from the home page are downloaded:
>>> link_crawler(start_url, link_regex, max_depth=1)
Downloading: http://example.webscraping.com//index
Downloading: http://example.webscraping.com/index/1
Downloading: http://example.webscraping.com/view/Antigua-and-Barbuda-10
Downloading: http://example.webscraping.com/view/Antarctica-9
Downloading: http://example.webscraping.com/view/Anguilla-8
Downloading: http://example.webscraping.com/view/Angola-7
Downloading: http://example.webscraping.com/view/Andorra-6
Downloading: http://example.webscraping.com/view/American-Samoa-5
Downloading: http://example.webscraping.com/view/Algeria-4
Downloading: http://example.webscraping.com/view/Albania-3
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Afghanistan-1
As expected, the crawl stopped after downloading the first page of countries.
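To see why the crawl stops there, here is a small sketch of one way to enforce a depth limit. It is an assumption-laden illustration rather than the repository's implementation: LINK_GRAPH is a hypothetical in-memory site that replaces real downloads, crawl is a hypothetical function name, and the queue is processed breadth-first. Each URL's depth is recorded when it is first seen, and links found on a page that is already at max_depth are not followed.

```python
from collections import deque

# Hypothetical site map (page -> links found on it), used instead of
# real HTTP downloads so the example is self-contained.
LINK_GRAPH = {
    '/index': ['/index/1', '/view/Afghanistan-1', '/view/Albania-3'],
    '/index/1': ['/index/2', '/view/Bahamas-17'],
    '/view/Afghanistan-1': ['/index'],
    '/view/Albania-3': [],
    '/index/2': [],
    '/view/Bahamas-17': [],
}


def crawl(start, max_depth, graph=LINK_GRAPH):
    """Visit pages breadth-first, following links at most max_depth hops."""
    seen = {start: 0}        # URL -> depth at which it was first found
    queue = deque([start])
    visited = []             # pages "downloaded", in order
    while queue:
        page = queue.popleft()
        visited.append(page)
        depth = seen[page]
        if depth == max_depth:
            continue         # at the limit: do not follow this page's links
        for link in graph.get(page, []):
            if link not in seen:   # also prevents revisiting pages
                seen[link] = depth + 1
                queue.append(link)
    return visited


print(crawl('/index', max_depth=1))
```

With max_depth=1 the sketch visits /index and the three pages it links to, then stops, because those pages sit at depth 1 and their links are ignored. Raising the limit to 2 would also reach /index/2 and /view/Bahamas-17.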