Examining the Sitemap

Sitemap files are provided by websites to help crawlers locate their updated content without needing to crawl every web page. For further details, the sitemap standard is defined at http://www.sitemaps.org/protocol.html. Many web publishing platforms have the ability to generate a sitemap automatically. Here is the content of the Sitemap file listed in the robots.txt file:

<?xml version="1.0" encoding="UTF-8"?> 
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
<url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
<url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
...
</urlset>

This sitemap provides links to all the web pages, which will be used in the next section to build our first crawler. Sitemap files provide an efficient way to crawl a website, but need to be treated carefully because they can be missing, out-of-date, or incomplete.
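
As a minimal sketch of how the links could be pulled out of a sitemap like the one above, the following uses Python's standard xml.etree.ElementTree module. The function name extract_sitemap_links and the sample variable are hypothetical names chosen for illustration, and the sample holds only the three entries shown in the listing. The function returns an empty list when the content cannot be parsed, which is one simple way to cope with a missing or broken sitemap; it is not necessarily the approach taken in the next section.

```python
import xml.etree.ElementTree as ET

# Namespace declared by the sitemap standard; the "sm" prefix is our own choice.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_links(xml_data):
    """Return the <loc> URLs in a sitemap, or [] if it cannot be parsed.

    Hypothetical helper for illustration only.
    """
    try:
        root = ET.fromstring(xml_data)
    except ET.ParseError:
        # A missing or malformed sitemap should not stop the crawler.
        return []
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# The three entries shown in the listing above (the real file is longer).
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
<url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
<url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
</urlset>"""

for link in extract_sitemap_links(sample):
    print(link)
```

Because a sitemap can also be out-of-date or incomplete, a crawler would probably treat the returned list as a starting point rather than as a complete index of the site.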