- Python Web Scraping(Second Edition)
- Katharine Jarmul Richard Lawson
- 2021-07-09 19:42:46
Supporting proxies
Sometimes it's necessary to access a website through a proxy. For example, Hulu is blocked in many countries outside the United States, as are some videos on YouTube. Supporting proxies with urllib is not as easy as it could be. Later in this chapter we will cover requests, a more user-friendly Python HTTP module that also handles proxies. Here's how to support a proxy with urllib:
import urllib.request

proxy = 'http://myproxy.net:1234'  # example string
proxy_support = urllib.request.ProxyHandler({'http': proxy})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
# now requests made with urllib.request.urlopen will go through the proxy
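Installing an opener changes the behavior of every later urllib.request.urlopen call in the process. As a minimal sketch of an alternative, the opener can be kept as a local object and used directly. The helper name build_proxy_opener is hypothetical, and the proxy address is the same placeholder string as above:

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends http:// requests through proxy_url.

    build_proxy_opener is a hypothetical helper name, not part of urllib.
    """
    handler = urllib.request.ProxyHandler({'http': proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxy_opener('http://myproxy.net:1234')  # placeholder address

# opener.open(url) would go through the proxy, while
# urllib.request.urlopen(url) stays unaffected because nothing is installed.
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)
```

The last lines only inspect the opener's configuration, so the sketch runs without a reachable proxy.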
Here is an updated version of the download function to integrate this:
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, user_agent='wswp', num_retries=2, charset='utf-8', proxy=None):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        if proxy:
            # route http requests through the given proxy
            proxy_support = urllib.request.ProxyHandler({'http': proxy})
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
        resp = urllib.request.urlopen(request)
        # prefer the charset declared by the server, else use the default
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors, keeping the same settings
                return download(url, user_agent, num_retries - 1, charset, proxy)
    return html
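One way to check that a request really goes through the proxy, without needing a real one, is to point urllib at a small local server that plays the proxy's role. The following is only a sketch under stated assumptions: FakeProxy, seen_paths, and the URL http://example.invalid/page are made up for illustration, the stand-in server returns a canned page instead of forwarding anything, and the environment is assumed not to exclude the host through a no_proxy setting.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_paths = []  # request targets received by the stand-in proxy


class FakeProxy(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a proxy: records the request target
    and answers with a canned page instead of forwarding the request."""

    def do_GET(self):
        seen_paths.append(self.path)
        body = b'<html>served via proxy</html>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free local port.
server = HTTPServer(('127.0.0.1', 0), FakeProxy)
threading.Thread(target=server.serve_forever, daemon=True).start()
proxy = 'http://127.0.0.1:%d' % server.server_port

# Same handler setup as in the text, but used as a local opener.
proxy_support = urllib.request.ProxyHandler({'http': proxy})
opener = urllib.request.build_opener(proxy_support)

request = urllib.request.Request('http://example.invalid/page')
request.add_header('User-agent', 'wswp')
with opener.open(request, timeout=5) as resp:
    cs = resp.headers.get_content_charset() or 'utf-8'
    html = resp.read().decode(cs)

server.shutdown()
server.server_close()

print(html)
print(seen_paths)
```

The host example.invalid does not resolve, so the download can only succeed if urllib hands the request to the configured proxy. The recorded request target is the full URL rather than just the path, which is how a client addresses an HTTP proxy.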
As of Python 3.5, the urllib module does not support https proxies by default. This may change in future versions of Python, so check the latest documentation. Alternatively, you can use the documentation's recommended recipe (https://code.activestate.com/recipes/456195/) or keep reading to learn how to use the requests library.