
Parsing an XML file

We'll start by parsing an Extensible Markup Language (XML) file to get the raw latitude and longitude pairs. This will show you how we can encapsulate some not-quite-functional features of Python to create an iterable sequence of values.

We'll make use of the xml.etree.ElementTree module. After parsing, the resulting ElementTree object has a findall() method that returns the elements matching a given path, which we can then iterate through.

We'll be looking for constructs such as the following XML example:

<Placemark><Point>
<coordinates>-76.33029518659048,37.54901619777347,0</coordinates>
</Point></Placemark>

The file will have a number of <Placemark> tags, each of which has a point and coordinate structure within it. This is typical of Keyhole Markup Language (KML) files that contain geographic information.

Parsing an XML file can be approached at two levels of abstraction. At the lower level, we need to locate the various tags, attribute values, and content within the XML file. At a higher level, we want to make useful objects out of the text and attribute values.

The lower-level processing can be approached in the following way:

import xml.etree.ElementTree as XML
from typing import Text, List, TextIO, Iterable

def row_iter_kml(file_obj: TextIO) -> Iterable[List[Text]]:
    ns_map = {
        "ns0": "http://www.opengis.net/kml/2.2",
        "ns1": "http://www.google.com/kml/ext/2.2"}
    path_to_points = ("./ns0:Document/ns0:Folder/ns0:Placemark/"
                      "ns0:Point/ns0:coordinates")
    doc = XML.parse(file_obj)
    return (comma_split(Text(coordinates.text))
            for coordinates in doc.findall(path_to_points, ns_map))

This function requires an open text file object, typically one created via a with statement. The result is a generator that creates list objects from the coordinate text of each waypoint. As a part of the XML processing, this function includes a simple static dict object, ns_map, that provides the namespace mapping information for the XML tags we'll be searching for. This dictionary will be used by the ElementTree.findall() method.
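To see why the namespace mapping matters, here is a minimal sketch that runs findall() with and without the ns0 prefix. The SAMPLE_KML text is a made-up, one-placemark document used only for illustration; it is not the data file used in this chapter.

```python
import io
import xml.etree.ElementTree as XML

# Hypothetical one-placemark document, invented for this illustration.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document><Folder><Placemark><Point>
<coordinates>-76.33029518659048,37.54901619777347,0</coordinates>
</Point></Placemark></Folder></Document>
</kml>"""

ns_map = {"ns0": "http://www.opengis.net/kml/2.2"}
doc = XML.parse(io.StringIO(SAMPLE_KML))

# Tags in the document belong to the KML namespace, so a path
# written without a prefix matches nothing.
unqualified = doc.findall("./Document/Folder/Placemark/Point/coordinates")

# The same path with the ns0 prefix, resolved through ns_map,
# locates the coordinates tag.
qualified = doc.findall(
    "./ns0:Document/ns0:Folder/ns0:Placemark/ns0:Point/ns0:coordinates",
    ns_map)

print(len(unqualified), len(qualified))
```

The prefix name itself (ns0) is arbitrary; what has to match the document is the namespace URI it maps to.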

The essence of the parsing is a generator expression that uses the sequence of tags located by doc.findall(). The text of each tag is then processed by the comma_split() function to tease the text value into its comma-separated components.

The comma_split() function is the functional version of the split() method of a string, which is as follows:

def comma_split(text: Text) -> List[Text]:
    return text.split(",")

We've used the functional wrapper to emphasize a slightly more uniform syntax. We've also added explicit type hints to make it clear that text is converted to a list of text values. Without the type hint, there are two potential definitions of split() that could be meant. The method applies to bytes as well as str. We've used the Text type name, which is an alias for str in Python 3.
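The two definitions of split() can be seen side by side in a short sketch. The sample strings here are shortened, made-up values; the point is only that the str method produces a list of str and the bytes method produces a list of bytes, and that the type hint on comma_split() commits us to the str version.

```python
from typing import List, Text

def comma_split(text: Text) -> List[Text]:
    return text.split(",")

# str.split() takes a str separator and returns a list of str.
str_parts = "-76.33,37.54,0".split(",")

# bytes.split() takes a bytes separator and returns a list of bytes.
bytes_parts = b"-76.33,37.54,0".split(b",")

# The wrapper is hinted for text only, so a type checker can flag
# a call such as comma_split(b"-76.33,37.54,0").
wrapped = comma_split("-76.33,37.54,0")

print(str_parts)
print(bytes_parts)
```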

The result of the row_iter_kml() function is an iterable sequence of rows of data. Each row will be a list composed of three strings: the longitude, latitude, and altitude of a waypoint along this path. This isn't directly useful yet. We'll need to do some more processing to extract the latitude and longitude and to convert these two values into useful floating-point numbers.
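As a quick check of the row structure, the following sketch runs the two functions against an in-memory document. The SAMPLE_KML text is a hypothetical two-waypoint stand-in for a real KML file, and io.StringIO plays the role of the file opened by the with statement.

```python
import io
import xml.etree.ElementTree as XML
from typing import Iterable, List, Text, TextIO

def comma_split(text: Text) -> List[Text]:
    return text.split(",")

def row_iter_kml(file_obj: TextIO) -> Iterable[List[Text]]:
    ns_map = {
        "ns0": "http://www.opengis.net/kml/2.2",
        "ns1": "http://www.google.com/kml/ext/2.2"}
    path_to_points = ("./ns0:Document/ns0:Folder/ns0:Placemark/"
                      "ns0:Point/ns0:coordinates")
    doc = XML.parse(file_obj)
    return (comma_split(Text(coordinates.text))
            for coordinates in doc.findall(path_to_points, ns_map))

# Hypothetical two-waypoint document standing in for a real KML file.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document><Folder>
<Placemark><Point>
<coordinates>-76.33029518659048,37.54901619777347,0</coordinates>
</Point></Placemark>
<Placemark><Point>
<coordinates>-76.27383399999999,37.840832,0</coordinates>
</Point></Placemark>
</Folder></Document>
</kml>"""

with io.StringIO(SAMPLE_KML) as source:
    # Consume the generator while the source is still open.
    rows = list(row_iter_kml(source))

for row in rows:
    print(row)
```

Each row comes back as three strings, which is why further conversion is needed before any arithmetic can be done on the values.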

This idea of an iterable sequence of tuples (or lists) allows us to process some kinds of data files in a simple and uniform way. In Chapter 3, Functions, Iterators, and Generators, we looked at how Comma Separated Values (CSV) files are easily handled as rows of tuples. In Chapter 6, Recursions and Reductions, we'll revisit the parsing idea to compare these various examples.

The output from the preceding function looks like the following example:

[['-76.33029518659048', '37.54901619777347', '0'], 
['-76.27383399999999', '37.840832', '0'],
['-76.459503', '38.331501', '0'],
etc.
['-76.47350299999999', '38.976334', '0']]

Each row is the source text of the <ns0:coordinates> tag, split on the commas (,) that are part of the text content. The values are the east-west longitude, north-south latitude, and altitude. We'll apply some additional functions to the output of this function to create a usable subset of this data.
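One possible shape for that additional processing is sketched below. The function name lat_lon_floats() is hypothetical and introduced only for this illustration; it assumes each row has exactly three items in the longitude, latitude, altitude order described above.

```python
from typing import Iterable, Iterator, List, Text, Tuple

def lat_lon_floats(
        rows: Iterable[List[Text]]) -> Iterator[Tuple[float, float]]:
    """Hypothetical helper: yield (latitude, longitude) as floats.

    Assumes each row is [longitude, latitude, altitude] text.
    """
    for lon_text, lat_text, _altitude in rows:
        yield float(lat_text), float(lon_text)

# Two rows copied from the sample output shown earlier.
sample_rows = [
    ["-76.33029518659048", "37.54901619777347", "0"],
    ["-76.27383399999999", "37.840832", "0"],
]

pairs = list(lat_lon_floats(sample_rows))
for latitude, longitude in pairs:
    print(latitude, longitude)
```

The altitude is dropped and the order is swapped so that each result reads as a conventional latitude, longitude pair.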
