
Estimating the size of a website

The size of the target website affects how we crawl it. If the website has just a few hundred URLs, as our example website does, efficiency is not important. However, if the website has over a million web pages, downloading each one sequentially would take months. This problem is addressed later in Chapter 4, Concurrent Downloading, which covers distributed downloading.
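To make the "months" figure concrete, here is a rough back-of-the-envelope sketch. The 5 seconds per page is an assumed figure (download time plus a polite delay between requests), not a measurement, and estimate_crawl_days is a hypothetical helper name used only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_crawl_days(num_pages, seconds_per_page):
    """Return the approximate number of days needed to download
    num_pages one after another at seconds_per_page each."""
    return num_pages * seconds_per_page / SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumed values: one million pages at 5 seconds per page.
    days = estimate_crawl_days(1_000_000, 5)
    print(f"Sequential crawl: about {days:.0f} days")
```

Under these assumptions a sequential crawl takes roughly 58 days, or about two months. A slower site or a longer delay between requests would push the total higher.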

A quick way to estimate the size of a website is to check the results of Google's crawler, which has quite likely already crawled the website we are interested in. We can access this information through a Google search that uses the site keyword to restrict the results to our domain. An interface to this and other advanced search parameters is available at http://www.google.com/advanced_search.

Here are the site search results for our example website when searching Google for site:example.webscraping.com:

As we can see, Google currently estimates more than 200 web pages (this result may vary), which is close to the actual size of the website. For larger websites, Google's estimates may be less accurate.

We can filter these results to certain parts of the website by adding a URL path to the domain. Here are the results for site:example.webscraping.com/view, which restricts the site search to the country web pages:

Again, your results may vary in size. However, this additional filter is useful because ideally you only want to crawl the part of a website that contains useful data, rather than every page.
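The two site searches described above can be written as ordinary search URLs. The following is a minimal sketch that only builds the URL to open in a browser; it does not fetch or parse the results. The helper name site_search_url is hypothetical, and the https://www.google.com/search?q=... form is assumed to be the address Google uses for a standard search.

```python
from urllib.parse import urlencode

# Assumed base address for a standard Google search.
GOOGLE_SEARCH = "https://www.google.com/search"


def site_search_url(domain, path=""):
    """Build a Google search URL restricted to domain, optionally
    narrowed to a URL path such as '/view'."""
    query = "site:" + domain + path
    # urlencode percent-escapes the ':' and '/' characters in the query.
    return GOOGLE_SEARCH + "?" + urlencode({"q": query})


if __name__ == "__main__":
    # Whole example website, then only the country pages under /view.
    print(site_search_url("example.webscraping.com"))
    print(site_search_url("example.webscraping.com", "/view"))
```

Opening either URL in a browser shows Google's estimated result count for that part of the website. The count is read manually here, which keeps the sketch independent of the layout of the results page.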
