Throttling downloads

If we crawl a website too quickly, we risk being blocked or overloading the server(s). To minimize these risks, we can throttle our crawl by waiting for a set delay between downloads. Here is a class to implement this:

from urllib.parse import urlparse
import time


class Throttle:
    """Add a delay between downloads to the same domain."""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently,
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()
This Throttle class keeps track of when each domain was last accessed and will sleep if the time since the last access is shorter than the specified delay. We can add throttling to the crawler by calling the throttle's wait method before every download:

throttle = Throttle(delay)
...
throttle.wait(url)
html = download(url, user_agent=user_agent, num_retries=num_retries,
                proxy=proxy, charset=charset)