- Python Web Scraping Cookbook
- Michael Heydt
- 2021-06-30 18:44:01
Querying the DOM with XPath and lxml
XPath is a query language for selecting nodes from an XML document, and anyone who performs web scraping should learn it. XPath offers several benefits over other model-based tools:
- It can easily navigate through the DOM tree
- It is more sophisticated and powerful than other selection techniques such as CSS selectors and regular expressions
- It has a set of built-in functions (27 core functions in XPath 1.0, the version that lxml supports; later XPath versions define many more) and is extensible with custom functions
- It is widely supported by parsing libraries and scraping platforms
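To make the point about built-in and custom functions concrete, here is a minimal sketch using lxml. The sample document and the `shout` function are hypothetical, invented for illustration; `etree.FunctionNamespace` is lxml's registry for XPath extension functions.

```python
from lxml import etree

# Hypothetical sample document, invented for illustration.
root = etree.XML(
    "<planets>"
    "<planet name='Mercury'/>"
    "<planet name='Venus'/>"
    "<planet name='Mars'/>"
    "</planets>"
)

# Built-in function: starts-with() filters on an attribute value.
m_names = root.xpath("//planet[starts-with(@name, 'M')]/@name")

# Custom function (hypothetical): lxml passes an evaluation context
# as the first argument, followed by the XPath arguments.
def shout(context, text):
    return text.upper()

# Register the function in the default (empty) function namespace.
ns = etree.FunctionNamespace(None)
ns["shout"] = shout

# Call the custom function from inside an XPath expression.
shouted = root.xpath("shout(string(//planet[1]/@name))")

print(m_names)  # ['Mercury', 'Mars']
print(shouted)  # MERCURY
```

Registering in the default namespace affects every later XPath evaluation in the process, so a prefixed namespace may be a safer choice in larger programs.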
The XPath data model defines seven kinds of node (we have seen some of them previously):
- root node (the top-level parent of all other nodes)
- element nodes (<a>..</a>)
- attribute nodes (href="example.html")
- text nodes ("this is a text")
- comment nodes (<!-- a comment -->)
- namespace nodes
- processing instruction nodes
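As a rough illustration of the node kinds listed above, the sketch below selects several of them with lxml. The XML snippet and the `robots` processing instruction are hypothetical; namespace nodes are left out to keep the example short.

```python
from lxml import etree

# Hypothetical sample document, invented for illustration.
xml = (
    '<page>'
    '<?robots index="no"?>'
    '<!-- a comment -->'
    '<a href="example.html">this is a text</a>'
    '</page>'
)
root = etree.fromstring(xml)

# The root node is the document itself; /* selects its single element child.
document_element = root.xpath("/*")[0]

elements = root.xpath("//a")                    # element nodes
attributes = root.xpath("//a/@href")            # attribute nodes
texts = root.xpath("//a/text()")                # text nodes
comments = root.xpath("//comment()")            # comment nodes
pis = root.xpath("//processing-instruction()")  # processing instruction nodes

print(document_element.tag)  # page
print(elements[0].tag)       # a
print(attributes)            # ['example.html']
print(texts)                 # ['this is a text']
print(comments[0].text)
print(pis[0].target)         # robots
```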
XPath expressions can return different data types:
- strings
- booleans
- numbers
- node-sets (probably the most common case)
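The four result types can be seen directly in lxml, which maps them to Python values. This is a small sketch over a hypothetical document; the element and attribute names are invented for illustration.

```python
from lxml import etree

# Hypothetical sample document, invented for illustration.
root = etree.XML(
    "<items>"
    "<item price='2.5'>pen</item>"
    "<item price='4'>ink</item>"
    "</items>"
)

nodes = root.xpath("//item")                           # node-set -> list of elements
first = root.xpath("string(//item[1])")                # string   -> str
has_ink = root.xpath("boolean(//item[text()='ink'])")  # boolean  -> bool
total = root.xpath("sum(//item/@price)")               # number   -> float
count = root.xpath("count(//item)")                    # number   -> float

print(len(nodes))  # 2
print(first)       # pen
print(has_ink)     # True
print(total)       # 6.5
print(count)       # 2.0
```

Note that XPath numbers always arrive as floats, so a count usually needs an `int()` conversion before it is used as an index or a loop bound.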
An XPath axis defines a node-set relative to the current node. XPath defines a total of 13 axes (for example child, parent, ancestor, and following-sibling), which make it easy to locate nodes starting from either the current context node or the root node.
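A few of these axes can be tried out with lxml as follows. This is only a sketch over a hypothetical list; the `id` value `mid` is invented for illustration.

```python
from lxml import etree

# Hypothetical sample document, invented for illustration.
root = etree.XML(
    "<ul>"
    "<li>one</li>"
    "<li id='mid'>two</li>"
    "<li>three</li>"
    "</ul>"
)

# Use the middle <li> as the context node for the relative queries.
mid = root.xpath("//li[@id='mid']")[0]

parent = mid.xpath("parent::*")[0].tag               # parent axis
ancestors = [e.tag for e in mid.xpath("ancestor::*")]  # ancestor axis
before = mid.xpath("preceding-sibling::li/text()")   # siblings before the context node
after = mid.xpath("following-sibling::li/text()")    # siblings after the context node

# A leading slash starts at the root node, whatever the context node is.
all_items = mid.xpath("/ul/child::li/text()")

print(parent)     # ul
print(ancestors)  # ['ul']
print(before)     # ['one']
print(after)      # ['three']
print(all_items)  # ['one', 'two', 'three']
```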
lxml is a Python wrapper around the libxml2 XML parsing library (and its companion library libxslt), which is written in C. The C implementation helps make lxml faster than Beautiful Soup, but it can also make lxml harder to install on some computers. The latest installation instructions are available at: http://lxml.de/installation.html.
lxml supports XPath, which makes it considerably easier to manage complex XML and HTML documents. We will examine several techniques for using lxml and XPath together to navigate the DOM and access data.
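As a first taste of the combination, the following minimal sketch parses an HTML string with `lxml.html` and extracts link text and URLs with XPath. The HTML snippet, its `events` id, and the class names are hypothetical, invented for illustration.

```python
from lxml import html

# Hypothetical HTML snippet, invented for illustration.
page = """
<html><body>
  <div id="events">
    <a class="event" href="/e/1">PyCon</a>
    <a class="event" href="/e/2">EuroPython</a>
    <a class="other" href="/about">About</a>
  </div>
</body></html>
"""

# Parse the markup into an element tree.
tree = html.fromstring(page)

# Select only the event links inside the events container.
links = tree.xpath("//div[@id='events']/a[@class='event']")

# Pull the visible text and the href attribute from each element.
events = [(a.text, a.get("href")) for a in links]

print(events)  # [('PyCon', '/e/1'), ('EuroPython', '/e/2')]
```

In a real scraper the `page` string would come from an HTTP response body rather than a literal.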