- Python Web Scraping (Second Edition)
- Katharine Jarmul, Richard Lawson
- 2021-07-09 19:42:43
Checking robots.txt
Most websites define a robots.txt file to let crawlers know of any restrictions when crawling their website. These restrictions are only suggestions, but good web citizens follow them. The robots.txt file is a valuable resource to check before crawling, both to minimize the chance of being blocked and to discover clues about the website's structure. More information about the robots.txt protocol is available at http://www.robotstxt.org. The following is the content of our example robots.txt, which is available at http://example.webscraping.com/robots.txt:
# section 1
User-agent: BadCrawler
Disallow: /
# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap
# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
In section 1, the robots.txt file asks a crawler with the user agent BadCrawler not to crawl the website, but this is unlikely to help because a malicious crawler would not respect robots.txt anyway. A later example in this chapter will show you how to make your crawler follow robots.txt automatically.
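As a minimal sketch of how the section 1 rule is interpreted, the standard library's urllib.robotparser module can parse the rules above and report whether a given user agent may fetch a URL. The rules are embedded as a string here so that no network request is made, and the agent name GoodCrawler is a hypothetical name chosen only for illustration.

```python
from urllib import robotparser

# The example robots.txt shown above, embedded so nothing is downloaded.
ROBOTS_TXT = """\
# section 1
User-agent: BadCrawler
Disallow: /
# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap
# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse() accepts an iterable of lines

url = "http://example.webscraping.com/"

# BadCrawler is matched by section 1, which disallows the whole site.
bad_allowed = rp.can_fetch("BadCrawler", url)

# "GoodCrawler" (hypothetical name) falls through to the "*" rules in section 2.
good_allowed = rp.can_fetch("GoodCrawler", url)

print("BadCrawler allowed:", bad_allowed)
print("GoodCrawler allowed:", good_allowed)
```

Against a live site, the same parser can instead be pointed at the robots.txt URL with set_url() and read(), which does perform a download.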
Section 2 specifies a crawl delay of 5 seconds between download requests for all user agents, which should be respected to avoid overloading the server(s). There is also a /trap link, intended to catch malicious crawlers that follow disallowed links. If you visit this link, the server will block your IP for one minute! A real website would block your IP for much longer, perhaps permanently, but then we could not continue with this example.
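The section 2 rules can be read programmatically in the same way. The sketch below assumes Python 3.6 or later, where RobotFileParser.crawl_delay() is available. The helper seconds_to_wait and the agent name GoodCrawler are hypothetical names introduced for illustration, not part of any library.

```python
from urllib import robotparser

# Only the section 2 rules from the example robots.txt, as a list of lines.
SECTION_2 = [
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /trap",
]

rp = robotparser.RobotFileParser()
rp.parse(SECTION_2)


def seconds_to_wait(last_request, now, delay):
    """Return how long to pause so requests are at least `delay` seconds apart.

    `last_request` and `now` are timestamps in seconds; `last_request` is
    None before the first download. `delay` is None if no delay is set.
    """
    if delay is None or last_request is None:
        return 0.0
    return max(0.0, delay - (now - last_request))


# Crawl-delay that applies to our (hypothetical) crawler name.
delay = rp.crawl_delay("GoodCrawler")

# The /trap path is disallowed, so a polite crawler never requests it.
trap_allowed = rp.can_fetch("GoodCrawler", "http://example.webscraping.com/trap")

print("delay:", delay)
print("may fetch /trap:", trap_allowed)
# Last request at t=100 s, current time t=102 s: wait out the rest of the delay.
print("wait:", seconds_to_wait(100.0, 102.0, delay))
```

In a crawler loop, one option is to record time.monotonic() after each download and call time.sleep(seconds_to_wait(last, time.monotonic(), delay)) before the next one.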
Section 3 defines a Sitemap file, which will be examined in the next section.