- Python Web Scraping (Second Edition)
- Katharine Jarmul, Richard Lawson
- 2021-07-09 19:42:46
Throttling downloads
If we crawl a website too quickly, we risk being blocked or overloading the server(s). To minimize these risks, we can throttle our crawl by waiting for a set delay between downloads. Here is a class to implement this:
from urllib.parse import urlparse
import time


class Throttle:
    """Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()
This Throttle class keeps track of when each domain was last accessed and will sleep if the time since the last access is shorter than the specified delay. We can add throttling to the crawler by calling throttle.wait(url) before every download:
throttle = Throttle(delay)
...
throttle.wait(url)
html = download(url, user_agent=user_agent, num_retries=num_retries,
                proxy=proxy, charset=charset)
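To see the per-domain behaviour in isolation, here is a minimal sketch that can be run without a network connection. It repeats the Throttle logic so the block stands on its own; `fake_download`, `timed_crawl`, and the example URLs are hypothetical names introduced only for this illustration and are not part of the crawler built in the text.

```python
import time
from urllib.parse import urlparse


class Throttle:
    """Same logic as the class above, repeated so this sketch runs alone."""

    def __init__(self, delay):
        self.delay = delay
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = time.time()


def fake_download(url):
    """Hypothetical stand-in for download(); makes no network request."""
    return '<html>%s</html>' % url


def timed_crawl(urls, delay):
    """Pass each URL through the throttle and record the seconds spent waiting."""
    throttle = Throttle(delay)
    waits = []
    for url in urls:
        start = time.time()
        throttle.wait(url)
        waits.append(time.time() - start)
        fake_download(url)
    return waits


urls = [
    'http://example.com/a',
    'http://example.org/a',  # different domain, so the throttle does not sleep
    'http://example.com/b',  # same domain as the first URL, so it sleeps
]
waits = timed_crawl(urls, delay=0.2)
for url, waited in zip(urls, waits):
    print('%-22s waited %.2fs' % (url, waited))
```

Only the third request should be delayed, because the delay is tracked separately for each domain. A short delay of 0.2 seconds is used here purely to keep the demonstration quick; a real crawl would normally use a longer one.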