
Crawling your first website

To scrape a website, we first need to download the web pages that contain the data of interest, a process known as crawling. A website can be crawled in a number of ways, and the appropriate choice depends on the structure of the target website. This chapter explores how to download web pages safely and then introduces the following three common approaches to crawling a website:

  • Crawling a sitemap
  • Iterating over each page using database IDs
  • Following web page links
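
As a rough preview, the three approaches listed above might be sketched as follows. This is only a minimal illustration using Python's standard library, not the implementation developed later in the chapter. The function names (`download`, `sitemap_urls`, `id_urls`, `page_links`), the `example-crawler` user agent, and the regular expressions are all assumptions made for this example, and a real crawler would need a proper HTML parser and more careful error handling.

```python
import itertools
import re
import urllib.request
from urllib.error import URLError
from urllib.parse import urljoin


def download(url, user_agent="example-crawler", timeout=10):
    """Return the page body as text, or None if the request fails."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except URLError as error:
        # HTTPError is a subclass of URLError, so HTTP failures land here too.
        print("Download error:", error.reason)
        return None


def sitemap_urls(sitemap_xml):
    """Approach 1: extract the URLs listed in the <loc> tags of a sitemap."""
    return re.findall(r"<loc>(.*?)</loc>", sitemap_xml)


def id_urls(template, fetch=download, max_errors=3):
    """Approach 2: yield URLs for IDs 1, 2, 3, ... that download successfully.

    Stops once max_errors consecutive downloads have failed, on the
    assumption that the end of the ID range has been reached.
    """
    errors = 0
    for page_id in itertools.count(1):
        url = template.format(page_id)
        if fetch(url) is None:
            errors += 1
            if errors == max_errors:
                break
        else:
            errors = 0
            yield url


def page_links(base_url, html):
    """Approach 3: collect absolute URLs from the href attributes in a page."""
    hrefs = re.findall(r'<a[^>]+href=["\'](.*?)["\']', html, re.IGNORECASE)
    return [urljoin(base_url, href) for href in hrefs]
```

Passing the downloader in as the `fetch` parameter is a design choice made here so that the ID iteration can be tried against a stub (for example, a dictionary's `get` method) without touching the network.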

We have so far used the terms scraping and crawling interchangeably, but let's take a moment to define how the two are similar and how they differ.
