
Avoiding spider traps

Currently, our crawler will follow any link it hasn't seen before. However, some websites dynamically generate their content and can have an effectively infinite number of web pages. For example, if a website has an online calendar with links to the next month and year, then that next month will also link to the month after it, and so on for as far ahead as the widget is configured to go (which can be a very long time). A site may offer the same kind of endless navigation through simple pagination, letting a crawler page through empty search result pages until the maximum pagination is reached. This situation is known as a spider trap.

A simple way to avoid getting stuck in a spider trap is to track how many links have been followed to reach the current web page, which we will refer to as depth. Then, once the maximum depth is reached, the crawler does not add links from that web page to the queue. To implement a maximum depth, we will change the seen variable, which currently tracks visited web pages, into a dictionary that also records the depth at which each link was found:

def link_crawler(..., max_depth=4):
    seen = {}
    ...
    if rp.can_fetch(user_agent, url):
        depth = seen.get(url, 0)
        if depth == max_depth:
            print('Skipping %s due to depth' % url)
            continue
        ...
        for link in get_links(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    seen[abs_link] = depth + 1
                    crawl_queue.append(abs_link)
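
For reference, here is a self-contained sketch of the depth-limited crawl loop. The download() and get_links() helpers below are simplified stand-ins for the versions developed earlier in the chapter, and the robots.txt handling is reduced to a single RobotFileParser, so treat this as an illustration of the technique rather than the book's exact implementation:

import re
from urllib import robotparser
from urllib.parse import urljoin
from urllib.request import Request, urlopen
from urllib.error import URLError

def download(url, user_agent='wswp'):
    """Fetch a page and return its HTML, or None on error."""
    print('Downloading:', url)
    request = Request(url, headers={'User-Agent': user_agent})
    try:
        html = urlopen(request).read().decode('utf-8', errors='replace')
    except URLError as e:
        print('Download error:', e)
        html = None
    return html

def get_links(html):
    """Return all href values found in the HTML."""
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    return webpage_regex.findall(html)

def link_crawler(start_url, link_regex, user_agent='wswp', max_depth=4):
    """Crawl from start_url, following links matched by link_regex and
    stopping max_depth links away from the start page."""
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(start_url, '/robots.txt'))
    rp.read()
    crawl_queue = [start_url]
    seen = {}  # maps each URL to the depth at which it was found
    while crawl_queue:
        url = crawl_queue.pop()
        if not rp.can_fetch(user_agent, url):
            print('Blocked by robots.txt:', url)
            continue
        depth = seen.get(url, 0)
        if depth == max_depth:
            print('Skipping %s due to depth' % url)
            continue
        html = download(url, user_agent=user_agent)
        if html is None:
            continue
        for link in get_links(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    seen[abs_link] = depth + 1
                    crawl_queue.append(abs_link)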

Now, with this feature in place, we can be confident the crawl will eventually complete. To disable the feature, max_depth can be set to a negative number so the current depth is never equal to it.
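
For example, a hypothetical call such as the following (the URL and link pattern here are placeholders, not a real site) would crawl with the depth limit disabled:

# With a negative max_depth, the depth check never triggers, so the crawl
# only stops once the queue of unseen matching links is empty.
link_crawler('http://example.com', '/(index|view)/', max_depth=-1)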
