- Python Web Scraping (Second Edition)
- Katharine Jarmul, Richard Lawson
- 2021-07-09 19:42:43
Checking robots.txt
Most websites define a robots.txt file to tell crawlers about any restrictions on crawling the site. These restrictions are only a suggestion, but good web citizens follow them. Checking robots.txt before crawling minimizes the chance of being blocked and can reveal clues about the website's structure. More information about the robots.txt protocol is available at http://www.robotstxt.org. The following is the content of our example robots.txt, which is available at http://example.webscraping.com/robots.txt:
# section 1
User-agent: BadCrawler
Disallow: /
# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap
# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
In section 1, the robots.txt file asks a crawler with the user agent BadCrawler not to crawl the website, but this is unlikely to help because a malicious crawler would not respect robots.txt anyway. A later example in this chapter will show you how to make your crawler follow robots.txt automatically.
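As a quick illustration of how these rules are evaluated, the sketch below feeds the example file to the standard library's urllib.robotparser and asks whether two user agents may fetch the home page. To avoid a network request, the robots.txt content is pasted in as a string; the user agent name GoodCrawler is a hypothetical one chosen for this example.

```python
import urllib.robotparser

# The example robots.txt from above, embedded so the sketch runs offline.
ROBOTS_TXT = """\
# section 1
User-agent: BadCrawler
Disallow: /
# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap
# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

url = "http://example.webscraping.com/"

# BadCrawler matches section 1, which disallows every path.
bad_allowed = rp.can_fetch("BadCrawler", url)

# GoodCrawler (hypothetical name) falls through to the "*" rules in section 2.
good_allowed = rp.can_fetch("GoodCrawler", url)

print(bad_allowed)
print(good_allowed)
```

Against a live site, calling rp.set_url(...) followed by rp.read() would download the file instead of parsing a string.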
Section 2 specifies a crawl delay of 5 seconds between download requests for all other user agents, which should be respected to avoid overloading the server(s). It also disallows /trap, a link that exists to catch and block malicious crawlers that follow disallowed links. If you visit this link, the server will block your IP for one minute! A real website would block your IP for much longer, perhaps permanently, but then we could not continue with this example.
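The section 2 rules can be read programmatically as well. The following is a minimal sketch, assuming Python 3.6 or later (where RobotFileParser.crawl_delay is available). The seconds_to_wait helper and the GoodCrawler user agent are hypothetical names introduced for this illustration, not part of any library.

```python
import urllib.robotparser

# Section 2 of the example robots.txt, embedded so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /trap
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "GoodCrawler"  # hypothetical user agent name
BASE_URL = "http://example.webscraping.com"

# crawl_delay() returns None when no Crawl-delay applies to the user agent.
delay = rp.crawl_delay(USER_AGENT)

# The trap link is disallowed, so a polite crawler should never request it.
trap_allowed = rp.can_fetch(USER_AGENT, BASE_URL + "/trap")


def seconds_to_wait(last_request_time, now, crawl_delay):
    """Return how long to pause before the next request (hypothetical helper)."""
    if crawl_delay is None or last_request_time is None:
        return 0.0
    return max(0.0, crawl_delay - (now - last_request_time))


print(delay)
print(trap_allowed)
# Two seconds after the previous request, three more seconds are still owed.
print(seconds_to_wait(100.0, 102.0, delay))
```

In a real crawler, the result could be passed to time.sleep() before each download, using time.monotonic() for the timestamps.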
Section 3 defines a Sitemap file, which will be examined in the next section.
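The Sitemap entry can be read from the same parser. This is a small sketch, assuming Python 3.8 or later, where RobotFileParser.site_maps() was added; it only lists the sitemap URL and does not download or parse the sitemap itself.

```python
import urllib.robotparser

# Section 3 of the example robots.txt, embedded so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap
Sitemap: http://example.webscraping.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of Sitemap URLs, or None if the file has none.
sitemaps = rp.site_maps()
print(sitemaps)
```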