Scraping the Data

In the previous chapter, we built a crawler that follows links to download the web pages we want. This is interesting but not useful: the crawler downloads a web page and then discards the result. Now we need to make this crawler achieve something by extracting data from each web page, a process known as scraping.

We will first cover browser tools for examining a web page, which you may already be familiar with if you have a web development background. Then we will walk through three approaches to extracting data from a web page: regular expressions, Beautiful Soup, and lxml. Finally, the chapter concludes with a comparison of these three scraping alternatives.

In this chapter, we will cover the following topics:

  • Analyzing a web page
  • Approaches to scrape a web page
  • Using the console
  • XPath selectors
  • Scraping results