Querying the DOM with XPath and lxml

XPath is a query language for selecting nodes from an XML document, and it is well worth learning for anyone who does web scraping. It offers several benefits over other model-based tools:

  • It navigates the DOM tree easily
  • It is more expressive than other selection techniques such as CSS selectors and regular expressions
  • It has a large library of built-in functions (200+ in later versions of the standard; XPath 1.0, the version lxml implements, has a smaller core set) and is extensible with custom functions
  • It is widely supported by parsing libraries and scraping platforms

The XPath data model defines seven kinds of nodes (we have seen some of them previously):

  • root node (top level parent node)
  • element nodes (<a>..</a>)
  • attribute nodes (href="example.html")
  • text nodes ("this is a text")
  • comment nodes (<!-- a comment -->)
  • namespace nodes 
  • processing instruction nodes
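
As a rough illustration of the node kinds listed above, the following sketch selects several of them with lxml. The sample document and the `render` processing instruction are made up for this example; namespace nodes and the root node itself are left out to keep it short.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml = """<catalog>
  <?render mode="compact"?>
  <!-- a comment -->
  <a href="example.html">this is a text</a>
</catalog>"""

root = etree.fromstring(xml)

elements = root.xpath("//a")                       # element nodes
attributes = root.xpath("//a/@href")               # attribute nodes (lxml returns their values as strings)
texts = root.xpath("//a/text()")                   # text nodes (also returned as strings)
comments = root.xpath("//comment()")               # comment nodes
pis = root.xpath("//processing-instruction()")     # processing instruction nodes

print(elements[0].tag, attributes, texts)
print(comments[0].text, pis[0].target)
```

Note that lxml hands back attribute and text nodes as string-like objects rather than as separate node objects, which is usually what a scraper wants.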

XPath expressions can return different data types:

  • strings
  • booleans
  • numbers
  • node-sets (probably the most common case)
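
A minimal sketch of how these four result types surface in lxml, assuming a small made-up list document. The XPath functions `string()`, `boolean()`, and `count()` are part of XPath 1.0; the variable names are illustrative only.

```python
from lxml import etree

# Hypothetical document, used only for illustration.
doc = etree.fromstring("<ul><li>one</li><li>two</li><li>three</li></ul>")

first = doc.xpath("string(li[1])")                 # string result
has_two = doc.xpath("boolean(li[text()='two'])")   # boolean result
count = doc.xpath("count(li)")                     # number result (a Python float)
items = doc.xpath("li")                            # node-set result (a Python list of elements)

print(type(first).__name__, type(has_two).__name__, type(count).__name__, type(items).__name__)
```

Because the Python type of the result depends on the expression, it helps to know which kind of value a given XPath expression produces before indexing or iterating over it.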

An XPath axis defines a node-set relative to the current (context) node. XPath defines 13 axes, such as child, parent, ancestor, descendant, and following-sibling, which make it easy to search in any direction from the context node or from the root node.
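
To make the idea of axes more concrete, here is a small sketch that starts from one context node and looks around it along a few axes. The HTML fragment and the `id` values are invented for the example.

```python
from lxml import etree

# Hypothetical fragment, used only for illustration.
doc = etree.fromstring(
    "<html><body>"
    "<div id='main'><p>first</p><p id='mid'>second</p><p>third</p></div>"
    "</body></html>"
)

# Pick a context node, then evaluate axis expressions relative to it.
mid = doc.xpath("//p[@id='mid']")[0]

parent_id = mid.xpath("parent::div/@id")               # the parent axis
ancestors = [e.tag for e in mid.xpath("ancestor::*")]  # every enclosing element
before = mid.xpath("preceding-sibling::p/text()")      # siblings earlier in the document
after = mid.xpath("following-sibling::p/text()")       # siblings later in the document

print(parent_id, ancestors, before, after)
```

Calling `xpath()` on an element rather than on the whole tree is what sets the context node, so relative expressions such as `parent::div` are resolved from that element.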

lxml is a Python wrapper around the libxml2 XML parsing library, which is written in C. The C implementation makes lxml faster than Beautiful Soup, but also harder to install on some computers. The latest installation instructions are available at http://lxml.de/installation.html.

lxml supports XPath, which makes it considerably easier to work with complex XML and HTML documents. We will examine several techniques for using lxml and XPath together to navigate the DOM and access data.
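
As a first taste of the combination, the sketch below parses a small HTML snippet with `lxml.html` and pulls structured data out of it with XPath. The page content, the `prices` table, and the class names are assumptions made up for this example, not taken from any real site.

```python
from lxml import html

# Hypothetical page content, used only for illustration.
page = """<html><body>
<table id="prices">
  <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
  <tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
</table>
</body></html>"""

tree = html.fromstring(page)

# Find every row of the table, then query each row relative to itself.
rows = tree.xpath("//table[@id='prices']//tr")
products = [
    (
        row.xpath("string(td[@class='name'])"),
        float(row.xpath("string(td[@class='price'])")),
    )
    for row in rows
]

print(products)
```

The row lookup uses `//tr` rather than `/tr` so that the same expression still matches if the markup wraps the rows in a `tbody` element.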
