Checking robots.txt

Most websites define a robots.txt file to let crawlers know of any restrictions on crawling their website. These restrictions are only a suggestion, but good web citizens will follow them. The robots.txt file is a valuable resource to check before crawling, both to minimize the chance of being blocked and to discover clues about the website's structure. More information about the robots.txt protocol is available at http://www.robotstxt.org. The following is the content of our example robots.txt, which is available at http://example.webscraping.com/robots.txt:

# section 1 
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml

In section 1, the robots.txt file asks a crawler with the user agent BadCrawler not to crawl the website, but this is unlikely to help, because a malicious crawler would not respect robots.txt anyway. A later example in this chapter will show you how to make your crawler follow robots.txt automatically.
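As a quick preview of that idea, here is a minimal sketch using Python's standard urllib.robotparser module. The GoodCrawler user agent is just an illustrative name, and the expected results assume the file still has the content shown above:

import urllib.robotparser

# Parse the robots.txt shown above (assuming it is still served at this URL)
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://example.webscraping.com/robots.txt')
rp.read()

# Section 1 bans BadCrawler from the whole site ...
print(rp.can_fetch('BadCrawler', 'http://example.webscraping.com/'))       # False
# ... while other user agents are allowed everywhere except /trap (section 2)
print(rp.can_fetch('GoodCrawler', 'http://example.webscraping.com/'))      # True
print(rp.can_fetch('GoodCrawler', 'http://example.webscraping.com/trap'))  # False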

Section 2 specifies a crawl delay of 5 seconds between download requests for all other user agents, which should be respected to avoid overloading the server. There is also a /trap link that tries to catch malicious crawlers that follow disallowed links. If you visit this link, the server will block your IP for one minute! A real website would block your IP for much longer, perhaps permanently, but then we could not continue with this example.
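One way a crawler might honor both rules is sketched below, again with urllib.robotparser; the GoodCrawler user agent, the naive urllib download, and the list of URLs are hypothetical placeholders rather than part of the example:

import time
import urllib.request
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://example.webscraping.com/robots.txt')
rp.read()

# crawl_delay() returns the Crawl-delay value for this user agent,
# or None if robots.txt does not set one
delay = rp.crawl_delay('GoodCrawler') or 0

# hypothetical list of URLs to visit on the site
urls = ['http://example.webscraping.com/']

for url in urls:
    if rp.can_fetch('GoodCrawler', url):           # skips disallowed links such as /trap
        html = urllib.request.urlopen(url).read()  # naive download
        time.sleep(delay)                          # pause 5 seconds between requests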

Section 3 defines a Sitemap file, which will be examined in the next section.
