Scraping Python.org with Scrapy

Scrapy is a very popular open source Python framework for extracting data. It was originally designed only for scraping, but it has also evolved into a powerful web crawling solution.

In our previous recipes, we used Requests and urllib2 to fetch data and Beautiful Soup to extract it. Scrapy provides all of this functionality, along with many other built-in modules and extensions. It is also our tool of choice when it comes to scraping with Python.

Scrapy offers a number of powerful features that are worth mentioning:

  • Built-in extensions to make HTTP requests and to handle compression, authentication, caching, user-agent manipulation, and HTTP headers
  • Built-in support for selecting and extracting data with selector languages such as CSS and XPath, as well as support for using regular expressions to select content and links
  • Encoding support to deal with languages and non-standard encoding declarations
  • Flexible APIs to reuse and write custom middleware and pipelines, which provide a clean and easy way to implement tasks such as automatically downloading assets (for example, images or media) and storing data in backends such as file systems, S3, databases, and others
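
To make the last point more concrete, here is a minimal sketch of a custom item pipeline that stores scraped items on the file system. A Scrapy item pipeline is a plain Python class, so this block does not need to import Scrapy. The method names open_spider, close_spider, and process_item follow Scrapy's documented pipeline interface. The class name JsonLinesPipeline, the default file name events.jl, and the sample item are assumptions made for illustration only.

```python
import json


class JsonLinesPipeline:
    """Sketch of an item pipeline that writes each item as one JSON line."""

    def __init__(self, path="events.jl"):
        # Hypothetical default output path; Scrapy can create the
        # pipeline with no arguments, so the parameter needs a default.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every scraped item: serialize it, then return it
        # so that any later pipeline stage still receives the item.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item


if __name__ == "__main__":
    # Stand-alone demonstration with a made-up item, outside of Scrapy.
    pipeline = JsonLinesPipeline("events.jl")
    pipeline.open_spider(spider=None)
    pipeline.process_item({"name": "Example event", "location": "Example city"}, spider=None)
    pipeline.close_spider(spider=None)
```

In a Scrapy project, a pipeline like this would typically be enabled through the ITEM_PIPELINES setting, for example with an entry such as "myproject.pipelines.JsonLinesPipeline": 300, where the module path myproject.pipelines is hypothetical and the number sets the order in which pipelines run.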