官术网_书友最值得收藏!

Parsing robots.txt

First, we need to interpret robots.txt to avoid downloading blocked URLs. Python urllib comes with the robotparser module, which makes this straightforward, as follows:

    >>> from urllib import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.webscraping.com/robots.txt')
>>> rp.read()
>>> url = 'http://example.webscraping.com'
>>> user_agent = 'BadCrawler'
>>> rp.can_fetch(user_agent, url)
False
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True

The robotparser module loads a robots.txt file and then provides a can_fetch()function, which tells you whether a particular user agent is allowed to access a web page or not. Here, when the user agent is set to 'BadCrawler', the robotparser module says that this web page can not be fetched, as we saw in the definition in the example site's robots.txt.

To integrate robotparser into the link crawler, we first want to create a new function to return the  robotparser object:

def get_robots_parser(robots_url):
" Return the robots parser object using the robots_url "
rp = robotparser.RobotFileParser()
rp.set_url(robots_url)
rp.read()
return rp

We need to reliably set the robots_url; we can do so by passing an extra keyword argument to our function. We can also set a default value catch in case the user does not pass the variable. Assuming the crawl will start at the root of the site, we can simply add robots.txt to the end of the URL. We also need to define the user_agent:

def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp'):
...
if not robots_url:
robots_url = '{}/robots.txt'.format(start_url)
rp = get_robots_parser(robots_url)

Finally, we add the parser check in the crawl loop:

... 
while crawl_queue:
url = crawl_queue.pop()
# check url passes robots.txt restrictions
if rp.can_fetch(user_agent, url):
html = download(url, user_agent=user_agent)
...
else:
print('Blocked by robots.txt:', url)

We can test our advanced link crawler and its use of robotparser by using the bad user agent string.

>>> link_crawler('http://example.webscraping.com', '/(index|view)/', user_agent='BadCrawler')
Blocked by robots.txt: http://example.webscraping.com
主站蜘蛛池模板: 沙雅县| 曲松县| 铜川市| 潮州市| 措勤县| 沿河| 哈密市| 晴隆县| 和平区| 元谋县| 繁昌县| 沈阳市| 司法| 新津县| 都昌县| 调兵山市| 榆社县| 于田县| 鄂州市| 汪清县| 开江县| 松桃| 台前县| 阿克苏市| 喜德县| 通化县| 鞍山市| 晋江市| 三都| 建瓯市| 仙桃市| 万源市| 澄江县| 屏东市| 牡丹江市| 皋兰县| 洪泽县| 荆州市| 乐山市| 江孜县| 黎平县|