Throttling downloads

If we crawl a website too quickly, we risk being blocked or overloading the server(s). To minimize these risks, we can throttle our crawl by waiting for a set delay between downloads. Here is a class to implement this:

from urllib.parse import urlparse
import time


class Throttle:
    """Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()

This Throttle class keeps track of when each domain was last accessed and sleeps if the time since the last access is shorter than the specified delay. Because the timestamps are stored per domain, a request to a different domain is not delayed. We can add throttling to the crawler by calling throttle.wait before every download:

throttle = Throttle(delay) 
...
throttle.wait(url)
html = download(url, user_agent=user_agent, num_retries=num_retries,
                proxy=proxy, charset=charset)
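
To see the per-domain bookkeeping in action without actually waiting, the following is a minimal sketch that repeats the Throttle logic but lets the clock and sleep functions be swapped out. The clock and sleep parameters, the FakeClock helper, the simulate function, and the example.com / example.org URLs are all assumptions made for illustration; they are not part of the crawler above.

```python
from urllib.parse import urlparse
import time


class Throttle:
    """Same per-domain delay logic as above, with an injectable
    clock and sleep function (an assumption, added for testing)."""

    def __init__(self, delay, clock=time.time, sleep=time.sleep):
        self.delay = delay
        self.domains = {}
        self.clock = clock  # hypothetical hook, defaults to time.time
        self.sleep = sleep  # hypothetical hook, defaults to time.sleep

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (self.clock() - last_accessed)
            if sleep_secs > 0:
                self.sleep(sleep_secs)
        self.domains[domain] = self.clock()


class FakeClock:
    """Simulated time, so the demonstration finishes instantly."""

    def __init__(self):
        self.now = 0.0
        self.naps = []  # every sleep duration that was requested

    def time(self):
        return self.now

    def sleep(self, secs):
        self.naps.append(secs)
        self.now += secs


def simulate(urls, delay=2.0, step=0.5):
    """Visit urls in order and return the sleeps the throttle requested.
    Each pretend download takes `step` simulated seconds."""
    fake = FakeClock()
    throttle = Throttle(delay, clock=fake.time, sleep=fake.sleep)
    for url in urls:
        throttle.wait(url)
        fake.now += step  # stand-in for the time spent downloading
    return fake.naps


if __name__ == '__main__':
    urls = [
        'http://example.com/a',
        'http://example.com/b',  # same domain, too soon: sleeps
        'http://example.org/x',  # different domain: no sleep
        'http://example.com/c',  # same domain again: sleeps
    ]
    print(simulate(urls))
```

In this simulation only the repeat visits to example.com trigger a sleep, and each sleep is shortened by the time already spent since the previous access to that domain. The request to example.org goes through immediately because it has no entry in the domains dictionary yet.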