- Python Web Scraping (Second Edition)
- Katharine Jarmul, Richard Lawson
- 2021-07-09 19:42:45
Downloading a web page
To scrape web pages, we first need to download them. Here is a simple Python script that uses Python's urllib module to download a URL:
import urllib.request

def download(url):
    return urllib.request.urlopen(url).read()
When a URL is passed, this function will download the web page and return the HTML. The problem with this snippet is that, when downloading the web page, we might encounter errors that are beyond our control; for example, the requested page may no longer exist. In these cases, urllib will raise an exception and, because nothing catches it, the script will exit. To be safer, here is a more robust version that catches these exceptions:
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url):
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        # Report the failure and signal it to the caller by returning None
        print('Download error:', e.reason)
        html = None
    return html
Now, when a download or URL error is encountered, the exception is caught and the function returns None.
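To see both outcomes without depending on an external site, the following is a minimal sketch that runs the function against a throwaway local server. The names DemoHandler and run_demo, the page body, and the /missing path are hypothetical and exist only for this illustration; the download function repeats the logic shown above.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url):
    """Return the response body as bytes, or None if the download fails."""
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
    return html


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: serves one page at '/', 404 for any other path."""

    def do_GET(self):
        if self.path == '/':
            body = b'<html><body>Hello</body></html>'
            self.send_response(200)
            self.send_header('Content-Type', 'text/html')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep the demo output short


def run_demo():
    # Assumption for this demo only: ignore proxy settings so that requests
    # reach the local server directly. This changes the global opener.
    urllib.request.install_opener(
        urllib.request.build_opener(urllib.request.ProxyHandler({})))
    server = HTTPServer(('127.0.0.1', 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = 'http://127.0.0.1:%d' % server.server_port
    try:
        found = download(base + '/')            # page exists
        missing = download(base + '/missing')   # server answers 404
    finally:
        server.shutdown()
        server.server_close()
    return found, missing


if __name__ == '__main__':
    found, missing = run_demo()
    print(found)
    print(missing)
```

The first call returns the page body as bytes; the second triggers an HTTPError, which is caught, so the function prints the reason and returns None. Callers should therefore check for None before trying to parse the result.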
Throughout this book, we will assume you are creating files with code that is presented without prompts (like the code above). When you see code that begins with a Python prompt >>> or an IPython prompt In [1]:, you will need to either enter it into the main file you have been using, or save the file and import those functions and classes into your Python interpreter. If you run into any issues, please take a look at the code in the book repository at https://github.com/kjam/wswp.