- Python Web Scraping (Second Edition)
- Katharine Jarmul, Richard Lawson
Parsing robots.txt
First, we need to interpret robots.txt so that we avoid downloading blocked URLs. Python's urllib package includes the robotparser module, which makes this straightforward, as follows:
>>> from urllib import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.webscraping.com/robots.txt')
>>> rp.read()
>>> url = 'http://example.webscraping.com'
>>> user_agent = 'BadCrawler'
>>> rp.can_fetch(user_agent, url)
False
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True
The robotparser module loads a robots.txt file and then provides a can_fetch() function, which tells you whether a particular user agent is allowed to access a web page or not. Here, when the user agent is set to 'BadCrawler', the robotparser module says that this web page cannot be fetched, as defined in the example site's robots.txt.
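For reference, the behavior above implies a robots.txt along these lines. This is only a sketch of what the example site might serve, with one section disallowing the 'BadCrawler' agent entirely while leaving other agents largely unrestricted:

User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5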
To integrate robotparser into the link crawler, we first want to create a new function to return the robotparser object:
def get_robots_parser(robots_url):
    """ Return the robots parser object using the robots_url """
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp
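The read() call fetches the file over the network, so it can raise an error if the site is unreachable. As a variation not taken from the book, you could catch that failure and return None, letting the caller decide whether to proceed:

def get_robots_parser(robots_url):
    """ Return the robots parser object using the robots_url,
        or None if robots.txt could not be fetched """
    try:
        rp = robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        return rp
    except Exception as error:
        # network failure, DNS error, and so on
        print('Error fetching robots.txt:', error)
        return None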
We need to reliably set the robots_url; we can do so by passing an extra keyword argument to our function. We can also set a default value in case the user does not pass the variable. Assuming the crawl will start at the root of the site, we can simply add robots.txt to the end of the URL. We also need to define the user_agent:
def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp'):
    ...
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
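String formatting works here because the crawl starts at the site root with no trailing slash. If start_url might include a path or trailing slash, a slightly more robust sketch (an assumption, not the book's code) builds the URL with urllib.parse.urljoin:

from urllib.parse import urljoin

    if not robots_url:
        # resolves correctly whether or not start_url has a trailing slash or path
        robots_url = urljoin(start_url, '/robots.txt')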
Finally, we add the parser check in the crawl loop:
    ...
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            html = download(url, user_agent=user_agent)
            ...
        else:
            print('Blocked by robots.txt:', url)
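To see how these pieces fit together, here is a minimal, self-contained sketch of the crawl loop with the robots.txt check in place. It assumes the download() helper built earlier in the chapter and the get_robots_parser() function above, and it uses a simple regular expression to extract links, so treat it as an illustration rather than the book's full advanced crawler:

import re
from urllib.parse import urljoin

def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp'):
    """ Crawl from start_url, following links matched by link_regex
        and skipping URLs disallowed by robots.txt for user_agent """
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
    crawl_queue = [start_url]
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            html = download(url, user_agent=user_agent)
            if html is None:
                continue
            # queue any unseen links that match link_regex
            for link in re.findall(r'<a[^>]+href=["\'](.*?)["\']', html):
                if re.match(link_regex, link):
                    abs_link = urljoin(start_url, link)
                    if abs_link not in seen:
                        seen.add(abs_link)
                        crawl_queue.append(abs_link)
        else:
            print('Blocked by robots.txt:', url)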
We can test our advanced link crawler and its use of robotparser by using the 'BadCrawler' user agent string:
>>> link_crawler('http://example.webscraping.com', '/(index|view)/', user_agent='BadCrawler')
Blocked by robots.txt: http://example.webscraping.com
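Running the same call with the default 'wswp' user agent (or any agent the robots.txt does not block) should pass the can_fetch() check and let the crawl proceed; the exact output depends on the live site:

>>> link_crawler('http://example.webscraping.com', '/(index|view)/')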