- Python Web Scraping(Second Edition)
- Katharine Jarmul Richard Lawson
Examining the Sitemap
Sitemap files are provided by websites to help crawlers locate their updated content without needing to crawl every web page. For further details, the sitemap standard is defined at http://www.sitemaps.org/protocol.html. Many web publishing platforms can generate a sitemap automatically. Here is the content of the Sitemap file listed in robots.txt:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
<url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
<url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
...
</urlset>
This sitemap provides links to all the web pages, which will be used in the next section to build our first crawler. Sitemap files provide an efficient way to crawl a website, but need to be treated carefully because they can be missing, out-of-date, or incomplete.
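As a quick illustration of how a crawler can use this file, the sketch below parses the sitemap content shown above and pulls out the URL inside each <loc> tag with a simple regular expression. The sample data is truncated to the three entries listed earlier; a real crawler would first download the sitemap over HTTP.

```python
import re

# Sample sitemap content matching the example above (truncated to three entries)
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
<url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
<url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
</urlset>"""

def extract_sitemap_links(sitemap_xml):
    """Return the list of URLs found inside <loc> tags of a sitemap."""
    return re.findall(r'<loc>(.*?)</loc>', sitemap_xml)

links = extract_sitemap_links(SITEMAP)
for link in links:
    print(link)
```

A regular expression is sufficient here because sitemap files have a small, predictable structure; for malformed or very large sitemaps, an XML parser such as `lxml` would be more robust.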