Scraping the Data

In the previous chapter, we built a crawler that follows links to download the web pages we want. This is interesting but not useful: the crawler downloads a web page and then discards the result. Now we need to make this crawler achieve something by extracting data from each web page, a process known as scraping.

We will first cover browser tools for examining a web page, which you may already be familiar with if you have a web development background. Then, we will walk through three approaches to extracting data from a web page: regular expressions, Beautiful Soup, and lxml. Finally, the chapter will conclude with a comparison of these three scraping alternatives.
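
To make the idea of scraping concrete before going into detail, here is a minimal sketch of the first approach, regular expressions, using only the Python standard library. The HTML fragment, the class names, and the scrape_values helper are all hypothetical, invented for illustration; they are not taken from any real site or from the crawler built earlier.

```python
import re

# Hypothetical page fragment; the markup and values are invented for illustration.
HTML = """
<table>
  <tr><td class="label">Country</td><td class="value">Exampleland</td></tr>
  <tr><td class="label">Area</td><td class="value">1,234 square kilometres</td></tr>
</table>
"""


def scrape_values(html):
    """Return the text of every <td class="value"> cell, in document order."""
    # The non-greedy (.*?) stops at the first closing </td> after each match.
    return re.findall(r'<td class="value">(.*?)</td>', html)


values = scrape_values(HTML)
print(values)  # ['Exampleland', '1,234 square kilometres']
```

A pattern like this is tied to the exact markup it was written against, so even a small change to the page, such as an extra attribute on the cell, would stop it from matching. That fragility is one reason to also consider parser-based tools such as Beautiful Soup and lxml.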

In this chapter, we will cover the following topics:

  • Analyzing a web page
  • Approaches to scrape a web page
  • Using the console
  • XPath selectors
  • Scraping results