- Python Web Scraping(Second Edition)
- Katharine Jarmul, Richard Lawson
Estimating the size of a website
The size of the target website will affect how we crawl it. If the website is just a few hundred URLs, such as our example website, efficiency is not important. However, if the website has over a million web pages, downloading each one sequentially would take months. This problem is addressed later in Chapter 4, Concurrent Downloading, which covers distributed downloading.
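To see why a million pages is impractical to download sequentially, a back-of-the-envelope calculation helps. The 5-seconds-per-page figure below is a hypothetical assumption (covering download time plus a polite crawl delay), not a measurement from the book:

```python
# Back-of-the-envelope: how long would a sequential crawl of one
# million pages take, assuming (hypothetically) 5 seconds per page?
pages = 1_000_000
seconds_per_page = 5
days = pages * seconds_per_page / (60 * 60 * 24)
print(f'{days:.0f} days')  # roughly 58 days, i.e. about two months
```

Even at one page per second the crawl would still take over 11 days, which is why later chapters turn to concurrent downloading.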
A quick way to estimate the size of a website is to check the results of Google's crawler, which has quite likely already crawled the website we are interested in. We can access this information through a Google search with the site keyword to filter the results to our domain. An interface to this and other advanced search parameters is available at http://www.google.com/advanced_search.
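A site-restricted query is just a regular Google search whose query string starts with `site:`. The helper below is a hypothetical sketch showing how such a search URL can be built; note that Google's result counts are rough estimates, and programmatically scraping the results page itself is fragile and against Google's terms of service:

```python
from urllib.parse import urlencode

def site_search_url(domain, path=''):
    """Build a Google search URL restricted to one domain.

    Hypothetical helper: appends an optional URL path to narrow
    the site: filter, e.g. site:example.webscraping.com/view.
    """
    query = 'site:' + domain + path
    return 'https://www.google.com/search?' + urlencode({'q': query})

print(site_search_url('example.webscraping.com'))
# https://www.google.com/search?q=site%3Aexample.webscraping.com
```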
Here are the site search results for our example website when searching Google for site:example.webscraping.com:
(Screenshot: Google results for site:example.webscraping.com)
As we can see, Google currently estimates more than 200 web pages (this result may vary), which is close to the actual size of the website. For larger websites, Google's estimates may be less accurate.
We can filter these results to certain parts of the website by adding a URL path to the domain. Here are the results for site:example.webscraping.com/view, which restricts the site search to the country web pages:
(Screenshot: Google results for site:example.webscraping.com/view)
Again, your results may vary; however, this additional filter is useful because ideally you only want to crawl the part of the website containing useful data rather than every page.
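The same idea of restricting attention to one part of a site applies inside the crawler itself: before downloading a URL, check that it lies under the path we care about. The predicate below is a minimal sketch, assuming the example domain and the /view path from the search above:

```python
import re

def should_crawl(url, allowed_path='/view'):
    """Hypothetical filter: keep only URLs under the path of interest.

    Matches http or https URLs on the example domain whose path
    begins with allowed_path (e.g. the country detail pages).
    """
    pattern = r'https?://example\.webscraping\.com' + re.escape(allowed_path)
    return re.match(pattern, url) is not None

print(should_crawl('http://example.webscraping.com/view/Afghanistan-1'))  # True
print(should_crawl('http://example.webscraping.com/user/login'))          # False
```

A crawler that applies such a filter before queuing each discovered link avoids wasting bandwidth on pages with no useful data.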