Parsing an XML file

We'll start by parsing an Extensible Markup Language (XML) file to get the raw latitude and longitude pairs. This will show you how we can encapsulate some not-quite-functional features of Python to create an iterable sequence of values.

We'll make use of the xml.etree.ElementTree module. After parsing, the resulting ElementTree object has a findall() method that returns the elements matching a given path, which we can then iterate through.

We'll be looking for constructs, such as the following XML example:

<Placemark><Point>
<coordinates>-76.33029518659048,37.54901619777347,0</coordinates>
</Point></Placemark>

The file will have a number of <Placemark> tags, each of which has a point and coordinate structure within it. This is typical of Keyhole Markup Language (KML) files that contain geographic information.
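As a minimal sketch of how such a fragment can be searched, the following parses the example above from a string and locates its coordinates element. The fragment here carries no namespace declarations, so a plain path is enough; a real KML file declares namespaces, and the names fragment, placemark, and coordinate_texts are only illustrative.

```python
import xml.etree.ElementTree as XML

# The example fragment, written without KML namespace declarations.
fragment = (
    "<Placemark><Point>"
    "<coordinates>-76.33029518659048,37.54901619777347,0</coordinates>"
    "</Point></Placemark>"
)

# fromstring() returns the root Element, here the <Placemark> tag.
placemark = XML.fromstring(fragment)

# findall() returns a list of the Element objects that match the path.
coordinate_texts = [
    coordinates.text
    for coordinates in placemark.findall("./Point/coordinates")
]
print(coordinate_texts)
```

The text of each matching element is still a single comma-separated string at this point; splitting it is a separate step.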

Parsing an XML file can be approached at two levels of abstraction. At the lower level, we need to locate the various tags, attribute values, and content within the XML file. At a higher level, we want to make useful objects out of the text and attribute values.

The lower-level processing can be approached in the following way:

import xml.etree.ElementTree as XML
from typing import Text, List, TextIO, Iterable

def row_iter_kml(file_obj: TextIO) -> Iterable[List[Text]]:
    ns_map = {
        "ns0": "http://www.opengis.net/kml/2.2",
        "ns1": "http://www.google.com/kml/ext/2.2"}
    path_to_points = (
        "./ns0:Document/ns0:Folder/ns0:Placemark/"
        "ns0:Point/ns0:coordinates")
    doc = XML.parse(file_obj)
    return (
        comma_split(Text(coordinates.text))
        for coordinates in doc.findall(path_to_points, ns_map))

This function expects a file object, typically one opened via a with statement. The result is a generator that creates one list of strings for each coordinates tag it finds. As a part of the XML processing, this function includes a simple static dict object, ns_map, that provides the namespace mapping information for the XML tags we'll be searching. This dictionary will be used by the ElementTree.findall() method.
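To see the namespace mapping at work, here is a small self-contained sketch that runs the function against an in-memory document instead of a file on disk. The SAMPLE_KML text is invented for illustration and is far simpler than a real KML file; io.StringIO stands in for an opened file, and str is used where the original uses the Text alias.

```python
import io
import xml.etree.ElementTree as XML
from typing import Iterable, List, TextIO

def comma_split(text: str) -> List[str]:
    return text.split(",")

def row_iter_kml(file_obj: TextIO) -> Iterable[List[str]]:
    # Prefixes used in the search path, mapped to namespace URIs.
    ns_map = {
        "ns0": "http://www.opengis.net/kml/2.2",
        "ns1": "http://www.google.com/kml/ext/2.2",
    }
    path_to_points = (
        "./ns0:Document/ns0:Folder/ns0:Placemark/"
        "ns0:Point/ns0:coordinates"
    )
    doc = XML.parse(file_obj)
    return (
        comma_split(str(coordinates.text))
        for coordinates in doc.findall(path_to_points, ns_map)
    )

# Hypothetical two-point document; the default namespace matches ns0.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document><Folder>
<Placemark><Point>
<coordinates>-76.33029518659048,37.54901619777347,0</coordinates>
</Point></Placemark>
<Placemark><Point>
<coordinates>-76.27383399999999,37.840832,0</coordinates>
</Point></Placemark>
</Folder></Document>
</kml>
"""

with io.StringIO(SAMPLE_KML) as source:
    rows = list(row_iter_kml(source))

for row in rows:
    print(row)
```

The prefix ns0 never appears in the sample text. The search path uses it only as a local shorthand for the namespace URI, which is why the mapping dictionary has to be passed to findall().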

The essence of the parsing is a generator expression that iterates over the sequence of tags located by doc.findall(). The text of each tag is then processed by a comma_split() function to tease the text value into its comma-separated components.

The comma_split() function is the functional version of the split() method of a string, which is as follows:

def comma_split(text: Text) -> List[Text]:
    return text.split(",")

We've used the functional wrapper to emphasize a slightly more uniform syntax. We've also added explicit type hints to make it clear that text is converted to a list of text values. Without the type hint, there are two potential definitions of split() that could be meant. The method applies to bytes as well as str. We've used the Text type name, which is an alias for str in Python 3.
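The following sketch illustrates both points, under the assumption that the input is already a list of raw coordinate strings (the raw list is made up for the example). A named function can be passed directly to map(), and the str and bytes versions of split() are distinct methods that each require a separator of their own type.

```python
from typing import List

def comma_split(text: str) -> List[str]:
    return text.split(",")

# Illustrative input: the text content of two coordinates tags.
raw = [
    "-76.33029518659048,37.54901619777347,0",
    "-76.27383399999999,37.840832,0",
]

# Functional form: the wrapper is passed to map() as an ordinary function.
rows = list(map(comma_split, raw))

# Method form: the same result written with the str.split() method.
rows_method = [text.split(",") for text in raw]

# bytes has its own split(); it needs a bytes separator and returns bytes.
byte_fields = b"-76.459503,38.331501,0".split(b",")

print(rows == rows_method)
print(byte_fields)
```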

The result of the row_iter_kml() function is an iterable sequence of rows of data. Each row will be a list composed of three strings: the longitude, latitude, and altitude of a waypoint along this path. This isn't directly useful yet. We'll need to do some more processing to pick out the latitude and longitude and to convert these two values into useful floating-point numbers.

This idea of an iterable sequence of tuples (or lists) allows us to process some kinds of data files in a simple and uniform way. In Chapter 3, Functions, Iterators, and Generators, we looked at how Comma Separated Values (CSV) files are easily handled as rows of tuples. In Chapter 6, Recursions and Reductions, we'll revisit the parsing idea to compare these various examples.

The output from the preceding function looks like the following example:

[['-76.33029518659048', '37.54901619777347', '0'], 
['-76.27383399999999', '37.840832', '0'],
['-76.459503', '38.331501', '0'],
etc.
['-76.47350299999999', '38.976334', '0']]

Each row is the source text of the <ns0:coordinates> tag, split on the comma (,) that's part of the text content. The values are the east-west longitude, north-south latitude, and altitude. We'll apply some additional functions to the output of this function to create a usable subset of this data.
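One possible shape for that additional processing is sketched below. The function name lat_lon_float is hypothetical and is not taken from the text; it assumes every row has exactly three fields in longitude, latitude, altitude order, drops the altitude, and yields (latitude, longitude) pairs as floats.

```python
from typing import Iterable, Iterator, List, Tuple

def lat_lon_float(
    rows: Iterable[List[str]],
) -> Iterator[Tuple[float, float]]:
    # Each row is [longitude, latitude, altitude] as strings.
    for lon, lat, _altitude in rows:
        # Swap into (latitude, longitude) order and convert to float.
        yield float(lat), float(lon)

# Two rows copied from the sample output above.
rows = [
    ["-76.33029518659048", "37.54901619777347", "0"],
    ["-76.27383399999999", "37.840832", "0"],
]

points = list(lat_lon_float(rows))
for latitude, longitude in points:
    print(latitude, longitude)
```

Because the function takes any iterable of rows and yields its results lazily, it can be applied directly to the generator returned by row_iter_kml().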