Throttling downloads

If we crawl a website too quickly, we risk being blocked or overloading the server(s). To minimize these risks, we can throttle our crawl by waiting for a set delay between downloads. Here is a class to implement this:

from urllib.parse import urlparse
import time


class Throttle:
    """Add a delay between downloads to the same domain."""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently,
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()
This Throttle class keeps track of when each domain was last accessed and will sleep if the time since the last access is shorter than the specified delay. We can add throttling to the crawler by calling the throttle's wait method before every download:

throttle = Throttle(delay)
...
throttle.wait(url)
html = download(url, user_agent=user_agent, num_retries=num_retries,
                proxy=proxy, charset=charset)