Checking robots.txt

Most websites define a robots.txt file to let crawlers know of any restrictions on crawling their website. These restrictions are only a suggestion, but good web citizens will follow them. The robots.txt file is a valuable resource to check before crawling, both to minimize the chance of being blocked and to discover clues about the website's structure. More information about the robots.txt protocol is available at http://www.robotstxt.org. The following code is the content of our example robots.txt, which is available at http://example.webscraping.com/robots.txt:

# section 1 
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml

In section 1, the robots.txt file asks a crawler with the user agent BadCrawler not to crawl the website, but this is unlikely to help because a malicious crawler would not respect robots.txt anyway. A later example in this chapter will show you how to make your crawler follow robots.txt automatically.
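As a quick preview, Python's built-in urllib.robotparser module can parse these rules and report whether a given user agent is allowed to fetch a URL. Here is a minimal sketch (the chapter's later example may be structured differently):

# Minimal sketch: check robots.txt rules with the standard library's
# urllib.robotparser module.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.webscraping.com/robots.txt')
rp.read()  # download and parse robots.txt

url = 'http://example.webscraping.com'
print(rp.can_fetch('BadCrawler', url))   # False -- blocked by section 1
print(rp.can_fetch('GoodCrawler', url))  # True -- falls under the wildcard rules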

Section 2 specifies a crawl delay of 5 seconds between download requests for all user agents, which should be respected to avoid overloading the server(s). There is also a /trap link that tries to block malicious crawlers that follow disallowed links. If you visit this link, the server will block your IP for one minute! A real website would block your IP for much longer, perhaps permanently, but then we could not continue with this example.
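The Crawl-delay value can also be read programmatically with urllib.robotparser. The following is a minimal sketch of honouring the delay between requests, assuming a hypothetical user agent string and example paths:

import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.webscraping.com/robots.txt')
rp.read()

user_agent = 'GoodCrawler'  # hypothetical user agent string
delay = rp.crawl_delay(user_agent) or 0  # 5 seconds for this robots.txt

for path in ['/view/1', '/view/2']:  # hypothetical paths to download
    url = 'http://example.webscraping.com' + path
    if rp.can_fetch(user_agent, url):
        # ... download the page here ...
        time.sleep(delay)  # pause between requests to respect Crawl-delay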

Section 3 defines a Sitemap file, which will be examined in the next section.
