
Throttling downloads

If we crawl a website too quickly, we risk being blocked or overloading the server(s). To minimize these risks, we can throttle our crawl by waiting for a set delay between downloads. Here is a class to implement this:

from urllib.parse import urlparse
import time

class Throttle:
    """Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()

This Throttle class keeps track of when each domain was last accessed and will sleep if the time since the last access is shorter than the specified delay. We can add throttling to the crawler by calling the wait method before every download:

throttle = Throttle(delay)
...
throttle.wait(url)
html = download(url, user_agent=user_agent, num_retries=num_retries,
                proxy=proxy, charset=charset)
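To see the throttling in action outside of the crawler, here is a small self-contained sketch. It repeats the Throttle class from above and times two consecutive requests to the same domain; the example URLs and the 0.5-second delay are arbitrary choices for illustration:

```python
import time
from urllib.parse import urlparse

class Throttle:
    """Add a delay between downloads to the same domain."""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain was accessed recently, so sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()

throttle = Throttle(0.5)
start = time.time()
throttle.wait('http://example.com/page1')  # first access: no wait
throttle.wait('http://example.com/page2')  # same domain: sleeps until 0.5s has passed
elapsed = time.time() - start
print(f'elapsed: {elapsed:.2f}s')  # roughly the 0.5s delay
```

Note that the delay is tracked per domain, so requests that alternate between different domains are not slowed down by one another.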