
Checking robots.txt

Most websites define a robots.txt file to let crawlers know of any restrictions on crawling their website. These restrictions are only a suggestion, but good web citizens will follow them. The robots.txt file is a valuable resource to check before crawling, both to minimize the chance of being blocked and to discover clues about the website's structure. More information about the robots.txt protocol is available at http://www.robotstxt.org. The following code is the content of our example robots.txt, which is available at http://example.webscraping.com/robots.txt:

# section 1 
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml

In section 1, the robots.txt file asks a crawler with the user agent BadCrawler not to crawl the website, but this is unlikely to help because a malicious crawler would not respect robots.txt anyway. A later example in this chapter will show you how to make your crawler follow robots.txt automatically.
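For a quick check, Python's standard library includes urllib.robotparser, which can parse this file and answer whether a given user agent is allowed to fetch a URL. The following is a minimal sketch, assuming the example robots.txt shown above; the GoodCrawler name is just an illustrative user agent:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.webscraping.com/robots.txt')
rp.read()

url = 'http://example.webscraping.com'
print(rp.can_fetch('BadCrawler', url))   # False - disallowed by section 1
print(rp.can_fetch('GoodCrawler', url))  # True - allowed by section 2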

Section 2 specifies a crawl delay of 5 seconds between download requests for all user agents, which should be respected to avoid overloading the server. There is also a /trap link to try to block malicious crawlers that follow disallowed links. If you visit this link, the server will block your IP for one minute! A real website would block your IP for much longer, perhaps permanently, but then we could not continue with this example.
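The crawl delay can also be read programmatically. The sketch below extends the previous robotparser example to pause between requests; it assumes Python 3.6 or later (where crawl_delay() is available), and download() stands in for whatever download function your crawler uses:

import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.webscraping.com/robots.txt')
rp.read()

# Crawl-delay for user agents matched by section 2 (5 seconds here);
# crawl_delay() returns None if the directive is absent.
delay = rp.crawl_delay('GoodCrawler') or 0

for url in ['http://example.webscraping.com/view/1',
            'http://example.webscraping.com/view/2']:
    if rp.can_fetch('GoodCrawler', url):
        time.sleep(delay)
        html = download(url)  # hypothetical download helper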

Section 3 defines a Sitemap file, which will be examined in the next section.
