- Python Web Scraping (Second Edition)
- Katharine Jarmul, Richard Lawson
- 244 words
- 2021-07-09 19:42:43
Checking robots.txt
Most websites define a robots.txt file to tell crawlers about any restrictions on crawling the site. These restrictions are only suggestions, but good web citizens follow them. Checking the robots.txt file before crawling minimizes the chance of being blocked and can reveal clues about the website's structure. More information about the robots.txt protocol is available at http://www.robotstxt.org. The following is the content of our example robots.txt, which is available at http://example.webscraping.com/robots.txt:
# section 1
User-agent: BadCrawler
Disallow: /
# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap
# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
In section 1, the robots.txt file asks a crawler with the user agent BadCrawler not to crawl the website, but this is unlikely to help because a malicious crawler would not respect robots.txt anyway. A later example in this chapter will show you how to make your crawler follow robots.txt automatically.
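As a minimal sketch of how these rules can be evaluated, the standard library's urllib.robotparser module can parse the file and answer whether a given user agent may fetch a URL. The robots.txt text is inlined here so the snippet runs without network access; the GoodCrawler user agent and the build_parser helper are hypothetical names chosen for illustration and are not taken from the book.

```python
from urllib import robotparser

# Contents of the example robots.txt, inlined so no network access is needed.
ROBOTS_TXT = """\
# section 1
User-agent: BadCrawler
Disallow: /
# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap
# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
"""


def build_parser(robots_txt):
    """Return a RobotFileParser loaded from robots.txt text (hypothetical helper)."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


rp = build_parser(ROBOTS_TXT)
base = "http://example.webscraping.com"

# BadCrawler matches section 1, which disallows the whole site.
bad_allowed = rp.can_fetch("BadCrawler", base + "/")
# "GoodCrawler" is an assumed user agent; it falls through to the "*" rules.
good_allowed = rp.can_fetch("GoodCrawler", base + "/")
trap_allowed = rp.can_fetch("GoodCrawler", base + "/trap")

print(bad_allowed, good_allowed, trap_allowed)  # False True False
```

Against the live site, the same parser could instead be pointed at the file with set_url() followed by read().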
Section 2 specifies a crawl delay of 5 seconds between download requests for all other user agents, which should be respected to avoid overloading the server(s). It also lists a /trap link, which is intended to catch malicious crawlers that follow disallowed links. If you visit this link, the server will block your IP address for one minute! A real website would block your IP for much longer, perhaps permanently, but then we could not continue with this example.
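One way to honor the crawl delay is to read it from the parsed file and pause between requests. The sketch below is an illustration under stated assumptions, not the book's implementation: the Throttle class, its clock and sleep parameters, and the GoodCrawler user agent are hypothetical, and only section 2 of the file is inlined. The crawl_delay() method of RobotFileParser requires Python 3.6 or later.

```python
import time
from urllib import robotparser

# Section 2 of the example robots.txt, inlined so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /trap
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# "GoodCrawler" is an assumed user agent name; it matches the "*" entry.
delay = rp.crawl_delay("GoodCrawler")


class Throttle:
    """Keep consecutive requests at least `delay` seconds apart (hypothetical helper)."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay or 0  # crawl_delay() returns None when no delay is set
        self.clock = clock
        self.sleep = sleep
        self.last_request = None

    def wait(self):
        """Sleep for whatever part of the delay has not yet elapsed."""
        if self.last_request is not None:
            remaining = self.delay - (self.clock() - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
        self.last_request = self.clock()


throttle = Throttle(delay)
print(delay)  # 5
```

A crawler would call throttle.wait() immediately before each download. The first call returns at once, and later calls sleep only for the remainder of the 5 seconds.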
Section 3 defines a Sitemap file, which will be examined in the next section.