How to do it...

We start by importing Selector from scrapy, and also requests so that we can retrieve the page:

In [1]: from scrapy.selector import Selector
...: import requests
...:

Next, we load the page. For this example, we are going to retrieve the most recent questions on StackOverflow and extract their titles. We can make this request with the following:

In [2]: response = requests.get("http://stackoverflow.com/questions")

Now create a Selector and pass it the response object:

In [3]: selector = Selector(response)
...: selector
...:
Out[3]: <Selector xpath=None data='<html>\r\n\r\n <head>\r\n\r\n <title>N'>

Examining the content of this page, we can see that each question's HTML has the following structure:

The HTML of a StackOverflow Question

With the selector, we can find these elements using XPath:

In [4]: summaries = selector.xpath('//div[@class="summary"]/h3')
...: summaries[0:5]
...:
Out[4]:
[<Selector xpath='//div[@class="summary"]/h3' data='<h3><a href="/questions/48353091/how-to-'>,
<Selector xpath='//div[@class="summary"]/h3' data='<h3><a href="/questions/48353090/move-fi'>,
<Selector xpath='//div[@class="summary"]/h3' data='<h3><a href="/questions/48353089/java-la'>,
<Selector xpath='//div[@class="summary"]/h3' data='<h3><a href="/questions/48353086/how-do-'>,
<Selector xpath='//div[@class="summary"]/h3' data='<h3><a href="/questions/48353085/running'>]

Now we drill a little further into each summary to get the title of the question:

In [5]: summaries.xpath('a[@class="question-hyperlink"]/text()').extract()[:10]
Out[5]:
['How to convert stdout binary file to a data URL?',
'Move first letter from sentence to the end',
'Java launch program and interact with it programmatically',
'How do I build vala from scratch',
'Running Sql Script',
'Mysql - Auto create, update, delete table 2 from table 1',
'how to map meeting data corresponding calendar time in java',
'Range of L*a* b* in Matlab',
'set maximum and minimum number input box in js,html',
'I created generic array and tried to store the value but it is showing ArrayStoreException']