
Using the requests library

Although we have built a fairly advanced parser using only urllib, the majority of scrapers written in Python today use the requests library to manage complex HTTP requests. What started as a small library to help wrap urllib features in something "human-readable" is now a very large project with hundreds of contributors. Its features include built-in handling of character encoding, up-to-date SSL and security support, and easy handling of POST requests, JSON, cookies, and proxies.
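To illustrate a few of these conveniences, here is a minimal sketch; httpbin.org is used here only as a stand-in test endpoint, not a site from this book:

import requests

# GET request: requests detects the character encoding for us
resp = requests.get('https://httpbin.org/get', params={'q': 'web scraping'})
print(resp.status_code)     # e.g. 200
print(resp.encoding)        # encoding guessed from the response headers
print(resp.json()['args'])  # built-in JSON decoding

# POST request with form data and a cookie; a proxies dict could also be passed
resp = requests.post(
    'https://httpbin.org/post',
    data={'name': 'wswp'},
    cookies={'session': 'test'},
)
print(resp.json()['form'])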

Throughout most of this book, we will utilize the requests library for its simplicity and ease of use, and because it has become the de facto standard for most web scraping.

To install requests, simply use pip:

pip install requests

For an in-depth overview of all features, you should read the documentation at http://python-requests.org or browse the source code at https://github.com/kennethreitz/requests.

To compare the differences between the two libraries, I've also built the advanced link crawler so that it can use requests. You can see the code at https://github.com/kjam/wswp/blob/master/code/chp1/advanced_link_crawler_using_requests.py. The main download function shows the key differences. The requests version is as follows:

import requests


def download(url, user_agent='wswp', num_retries=2, proxies=None):
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)
        html = resp.text
        if resp.status_code >= 400:
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, user_agent, num_retries - 1, proxies)
    except requests.exceptions.RequestException as e:
        print('Download error:', e)
        html = None
    return html

One notable difference is the convenience of having status_code available as an attribute of each response. Additionally, we no longer need to test for character encoding, as the text attribute on our Response object handles this automatically. In the rare case of a non-resolvable URL or a timeout, the error is raised as a RequestException, making for an easy except block. Proxy handling is also taken care of by simply passing a dictionary of proxies (that is, {'http': 'http://myproxy.net:1234', 'https': 'https://myproxy.net:1234'}).
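For instance, a call to the download function shown above, routed through the placeholder proxy from the previous paragraph and pointed at a placeholder URL, might look like this:

proxies = {'http': 'http://myproxy.net:1234',
           'https': 'https://myproxy.net:1234'}
html = download('http://example.com', proxies=proxies)
if html is not None:
    print(html[:100])  # first 100 characters of the downloaded page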

We will continue to compare and use both libraries, so that you are familiar with them depending on your needs and use case. I strongly recommend using requests whenever you are handling more complex websites, or need features that make your scraper behave more like a normal browser, such as cookie and session handling. We will talk more about these methods in Chapter 6, Interacting with Forms.
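As a small preview of those topics, the following is a minimal sketch of how requests can persist cookies across requests with a Session object; httpbin.org is again used only as a stand-in test endpoint:

import requests

# a Session keeps cookies and default headers across requests
session = requests.Session()
session.headers.update({'User-Agent': 'wswp'})

# cookies set by the first response are sent automatically on later requests
session.get('https://httpbin.org/cookies/set/sessionid/12345')
resp = session.get('https://httpbin.org/cookies')
print(resp.json())  # {'cookies': {'sessionid': '12345'}}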
