
How it works

We will get into some details about Scrapy in later chapters, but let's just go through this code quickly to get a feel for how it accomplishes this scrape.  Everything in Scrapy revolves around creating a spider.  Spiders crawl through pages on the Internet based upon rules that we provide.  This spider only processes one single page, so it's not really much of a spider.  But it shows the pattern we will use through later Scrapy examples.

The spider is created with a class definition that derives from one of the Scrapy spider classes.  Ours is based on the scrapy.Spider class.

    class PythonEventsSpider(scrapy.Spider):
        name = 'pythoneventsspider'

        start_urls = ['https://www.python.org/events/python-events/',]

Every spider is given a name, and also one or more start_urls which tell it where to start the crawling.

This spider has a field to store all the events that we find:

    found_events = []

The spider then has a method named parse which will be called for every page the spider collects.

    def parse(self, response):
        for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
            event_details = dict()
            event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
            event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
            event_details['time'] = event.xpath('p/time/text()').extract_first()
            self.found_events.append(event_details)

The implementation of this method uses an XPath selection to get the events from the page (XPath is the built-in means of navigating HTML in Scrapy).  It then builds the event_details dictionary object similarly to the other examples, and then adds it to the found_events list.
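
If you want to experiment with these XPath expressions outside of a spider, you can run them directly through Scrapy's Selector class.  The following is a minimal sketch that applies the same expressions to a small, hypothetical HTML fragment shaped like the events page (the fragment and its values are made up purely for illustration):

    from scrapy.selector import Selector

    # A small, hypothetical HTML fragment shaped like the python.org events list
    html = """
    <ul class="list-recent-events menu">
      <li>
        <h3 class="event-title"><a href="#">Example Conference</a></h3>
        <p><time>01 Jan.</time><span class="event-location">Example City, Example Country</span></p>
      </li>
    </ul>
    """

    selector = Selector(text=html)
    for event in selector.xpath('//ul[contains(@class, "list-recent-events")]/li'):
        # extract_first() returns the first matching value, or None if there is no match
        print(event.xpath('h3[@class="event-title"]/a/text()').extract_first())
        print(event.xpath('p/span[@class="event-location"]/text()').extract_first())
        print(event.xpath('p/time/text()').extract_first())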

The remaining code does the programmatic execution of the Scrapy crawler.

    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()

It starts with the creation of a CrawlerProcess, which does the actual crawling and a lot of other tasks.  We pass it a LOG_LEVEL of ERROR to prevent the voluminous Scrapy output.  Change this to DEBUG and re-run it to see the difference.
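
As a minimal sketch of that change, only the settings dictionary passed to CrawlerProcess needs to be edited (the USER_AGENT line is shown purely as an illustration of another setting you could pass; it is not required for this recipe):

    # Verbose logging: swap ERROR for DEBUG in the settings dictionary
    process = CrawlerProcess({
        'LOG_LEVEL': 'DEBUG',
        # Any other Scrapy setting can be supplied here too, for example:
        # 'USER_AGENT': 'python-events-example',
    })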

Next we tell the crawler process to use our Spider implementation.  We get the actual spider object from that crawler so that we can get the items when the crawl is complete.  And then we kick off the whole thing by calling process.start().

When the crawl is completed, we can then iterate over and print out the items that were found.

    for event in spider.found_events: print(event)

This example really didn't touch much of the power of Scrapy.  We will look into some of its more advanced features later in the book.