- Python Web Scraping (Second Edition)
- Katharine Jarmul, Richard Lawson
- 2021-07-09 19:42:46
Throttling downloads
If we crawl a website too quickly, we risk being blocked or overloading the server(s). To minimize these risks, we can throttle our crawl by waiting for a set delay between downloads. Here is a class to implement this:
from urllib.parse import urlparse
import time


class Throttle:
    """Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()
This Throttle class keeps track of when each domain was last accessed and sleeps if the time since the last access is shorter than the specified delay. Because the timestamps are stored per domain, requests to different domains do not delay one another. We can add throttling to the crawler by calling throttle.wait(url) before every download:
throttle = Throttle(delay)
...
throttle.wait(url)
html = download(url, user_agent=user_agent, num_retries=num_retries,
                proxy=proxy, charset=charset)
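To see the per-domain behaviour in isolation, the following is a minimal sketch that times each call to wait() without making any network requests. The Throttle class is repeated so the snippet runs on its own. The timed_waits helper and the example.com / example.org URLs are illustrative assumptions, not part of the crawler built in this chapter.

```python
import time
from urllib.parse import urlparse


class Throttle:
    """Add a delay between downloads to the same domain (same logic as above)."""

    def __init__(self, delay):
        self.delay = delay
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = time.time()


def timed_waits(throttle, urls):
    """Hypothetical helper: return how long each throttle.wait() call blocked."""
    waits = []
    for url in urls:
        start = time.time()
        throttle.wait(url)  # no download happens here; we only measure the pause
        waits.append(time.time() - start)
    return waits


if __name__ == "__main__":
    # Placeholder URLs: two on one domain, one on another
    urls = [
        "http://example.com/page1",
        "http://example.org/page1",
        "http://example.com/page2",
    ]
    for url, waited in zip(urls, timed_waits(Throttle(delay=0.5), urls)):
        print(f"{url}: waited {waited:.2f}s")
```

With these inputs, the first two calls should return almost immediately, since each domain is being seen for the first time. The third call should block for roughly the configured delay, because example.com was accessed only moments earlier.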