
Estimating the size of a website

The size of the target website affects how we crawl it. If the website has only a few hundred URLs, as our example website does, efficiency is not important. However, if the website has over a million web pages, downloading each page sequentially would take months. This problem is addressed later in Chapter 4, Concurrent Downloading, which covers distributed downloading.
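To make the difference in scale concrete, here is a rough back-of-the-envelope sketch. The function name estimate_crawl_days and the figure of 5 seconds per page (download time plus a polite delay between requests) are assumptions chosen for illustration, not measurements of any real website.

```python
def estimate_crawl_days(num_pages, seconds_per_page):
    """Return the days needed to download num_pages one after another."""
    total_seconds = num_pages * seconds_per_page
    return total_seconds / (60 * 60 * 24)


# Assumed figures for illustration only: 5 seconds per page.
small_site_days = estimate_crawl_days(300, 5)
large_site_days = estimate_crawl_days(1_000_000, 5)

print(f"300 pages: about {small_site_days * 24 * 60:.0f} minutes")
print(f"1,000,000 pages: about {large_site_days:.0f} days")
```

Under these assumptions the small site takes less than half an hour, while the large site takes roughly two months of continuous sequential downloading.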

A quick way to estimate the size of a website is to check the results of Google's crawler, which has quite likely already crawled the website we are interested in. We can access this information through a Google search that uses the site keyword to restrict the results to our domain. An interface to this and other advanced search parameters is available at http://www.google.com/advanced_search.
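As a minimal sketch, the search URL for such a query can be built with the standard library. The helper name google_site_search_url is hypothetical, and the code only constructs the URL to open in a browser; it does not download or parse the results page.

```python
from urllib.parse import urlencode


def google_site_search_url(site):
    """Build a Google search URL restricted to `site` with the site keyword."""
    # urlencode percent-encodes the colon in "site:..." so the URL is valid.
    query = urlencode({"q": f"site:{site}"})
    return f"https://www.google.com/search?{query}"


print(google_site_search_url("example.webscraping.com"))
```

The argument can be a bare domain, as here, or a domain followed by a URL path.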

Here are the site search results for our example website when searching Google for site:example.webscraping.com:

As we can see, Google currently estimates more than 200 web pages (this result may vary), which is close to the actual size of the website. For larger websites, Google's estimates may be less accurate.

We can filter these results to certain parts of the website by adding a URL path to the domain. Here are the results for site:example.webscraping.com/view, which restricts the site search to the country web pages:

Again, your results may vary in size; however, this additional filter is useful because ideally you only want to crawl the part of a website containing useful data rather than every page.
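The same idea can be applied inside a crawler by skipping any URL that falls outside the part of the website we care about. The following is only a sketch: the helper in_scope and the candidate URLs are made up for illustration, and a real crawler would collect its candidates from downloaded pages.

```python
from urllib.parse import urlparse


def in_scope(url, domain, path_prefix):
    """Return True if url is on domain and its path starts with path_prefix."""
    parts = urlparse(url)
    return parts.netloc == domain and parts.path.startswith(path_prefix)


# Hypothetical candidate URLs, for illustration only.
candidates = [
    "http://example.webscraping.com/view/1",
    "http://example.webscraping.com/index/1",
    "http://other.example.org/view/1",
]

# Keep only the pages under /view on our example domain.
to_crawl = [
    url for url in candidates
    if in_scope(url, "example.webscraping.com", "/view")
]
print(to_crawl)
```

Note that a plain prefix test such as "/view" would also match a path like "/viewer", so a stricter prefix such as "/view/" may be preferable on some websites.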
