- Python Web Scraping(Second Edition)
- Katharine Jarmul Richard Lawson
- 2021-07-09 19:42:46
Avoiding spider traps
Currently, our crawler will follow any link it hasn't seen before. However, some websites generate their content dynamically and can have an effectively infinite number of web pages. For example, if a website has an online calendar with links to the next month and year, then the next month's page will also link to the month after it, and so on for as far ahead as the widget allows (which can be a very long time). A site may produce the same effect with simple pagination, paging through empty search result pages until the maximum page number is reached. This situation is known as a spider trap.
A simple way to avoid getting stuck in a spider trap is to track how many links have been followed to reach the current web page, which we will refer to as depth. Then, when a maximum depth is reached, the crawler does not add links from that web page to the queue. To implement maximum depth, we will change the seen variable, which currently tracks visited web pages, into a dictionary that also records the depth at which each link was found:
def link_crawler(..., max_depth=4):
    seen = {}
    ...
    if rp.can_fetch(user_agent, url):
        depth = seen.get(url, 0)
        if depth == max_depth:
            print('Skipping %s due to depth' % url)
            continue
        ...
        for link in get_links(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    seen[abs_link] = depth + 1
                    crawl_queue.append(abs_link)
Now, with this feature, we can be confident the crawl will complete eventually. To disable this feature, max_depth can be set to a negative number so the current depth will never be equal to it.
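To see the depth limit in action without touching a real website, the following is a minimal, self-contained sketch of the same idea. It is not the full link_crawler: the names crawl_with_depth and calendar_links are hypothetical, downloading is replaced by a function that returns the links for a URL, and example.com stands in for a site with an endless calendar. As in the snippet above, a page whose depth equals max_depth is skipped.

```python
from collections import deque
from urllib.parse import urljoin


def crawl_with_depth(start_url, get_links, max_depth=4):
    """Crawl from start_url, skipping any page found at max_depth.

    get_links is a stand-in (an assumption of this sketch) for
    downloading a page and extracting its links.
    """
    seen = {start_url: 0}  # maps each URL to the depth it was found at
    crawl_queue = deque([start_url])
    visited = []
    while crawl_queue:
        url = crawl_queue.popleft()
        depth = seen.get(url, 0)
        if depth == max_depth:
            print('Skipping %s due to depth' % url)
            continue
        visited.append(url)
        for link in get_links(url):
            abs_link = urljoin(url, link)
            if abs_link not in seen:
                seen[abs_link] = depth + 1
                crawl_queue.append(abs_link)
    return visited, seen


def calendar_links(url):
    """Hypothetical spider trap: each month page links to the next month."""
    month = int(url.rsplit('/', 1)[1])
    return ['/calendar/%d' % (month + 1)]


if __name__ == '__main__':
    pages, depths = crawl_with_depth(
        'http://example.com/calendar/1', calendar_links, max_depth=4)
    print('Visited %d pages' % len(pages))
```

Although the simulated calendar never runs out of "next month" links, the crawl stops once it reaches a page found at depth 4. Passing a negative max_depth to this sketch against the same calendar would loop forever, which is exactly the trap the limit guards against.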