
Scraping the Data

In the previous chapter, we built a crawler that follows links to download the web pages we want. This is interesting but not yet useful: the crawler downloads a web page and then discards the result. Now, we need to make this crawler achieve something by extracting data from each web page, a process known as scraping.

We will first cover browser tools for examining a web page, which you may already be familiar with if you have a web development background. Then, we will walk through three approaches to extracting data from a web page: regular expressions, Beautiful Soup, and lxml. Finally, the chapter will conclude with a comparison of these three scraping alternatives.

In this chapter, we will cover the following topics:

  • Analyzing a web page
  • Approaches to scrape a web page
  • Using the console
  • XPath selectors
  • Scraping results
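To give a flavor of the simplest of the three approaches before we dive in, a regular expression can pull a value straight out of raw HTML. The sketch below uses a made-up HTML fragment standing in for a downloaded page; the tag names and class are illustrative assumptions, not a real site's markup:

```python
import re

# Made-up fragment standing in for a downloaded page (illustrative only).
html = '<tr id="area_row"><td class="value">244,820 square kilometres</td></tr>'

# Capture whatever sits inside the <td class="value"> cell.
match = re.search(r'<td class="value">(.*?)</td>', html)
if match:
    print(match.group(1))  # -> 244,820 square kilometres
```

Regular expressions are quick for one-off extractions like this, but as we will see, they are brittle when the page layout changes, which is why the chapter also covers Beautiful Soup and lxml.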