官术网_书友最值得收藏!

Normalizing imports

Often data used in a spatial database is imported from other sources. As such it may not be in a form that is useful for our current application. In such a case, it may be useful to write functions that will aid in transforming the data into a form that is more useful for our application. This is particularly the case when going from flat file formats, such as shapefiles, to relational databases such as PostgreSQL.

Note

A shapefile is a de facto as well as formal standard for the storage of spatial data, and is probably the most common delivery format for vector spatial data. A shapefile, in spite of its name, is never just one file, but a collection of files. It consists of at least *.shp (which contains geometry), *.shx (an index file), and *.dbf (which contains the tabular information for the shapefile). It is a powerful and useful format but, as a flat file, it is inherently nonrelational. Each geometry is associated in a one-to-one relationship with each row in a table.

There are many structures that might serve as a proxy for relational stores in a shapefile. We will explore one here—a single field with delimited text for multiple relations. This is a not-too-uncommon hack to encode multiple relationships into a flat file. The other common approach is to create multiple fields to store what in a relational arrangement would be a single field.

Getting ready

The dataset we will be working with is a trails dataset that has linear extents for a set of trails in a park system. The data is the typical data that comes from the GIS world—as a flat shapefile, there are no explicit relational constructs in the data.

First, we load the data using the following command:

shp2pgsql -s 3734 -d -i -I -W LATIN1 -g the_geom trails chp02.trails | psql -U me -d postgis_cookbook

Looking at the linear data in a Desktop GIS, we see some categories for trail use:

Getting ready

We want to retain this information as well as the name. Unfortunately, the label_name field is a messy field with a variety of related names concatenated with an ampersand (&), as shown in the following query:

SELECT DISTINCT label_name FROM chp02.trails
 WHERE label_name LIKE '%&%' LIMIT 10; 

It will return the following output:

 label_name 
-----------------------------------------------------------------
 All Purpose Trail & Buckeye Trail & Sagamore Creek Loop Trail
 Hemlock Loop Trail & Prairie Loop Trail & Wildflower Loop Trail
 NC1 & NC2
 Hinckley Hills Loop Trail & HI3
 All Purpose Trail & North Ravine Loop Trail
 BR3 & BR4 & Buckeye Trail
 Bridle Trail & Hemlock Loop Trail
 Hemlock Trail & NC2
 Hinckley Hills Loop Trail & HI1
 Lake Isaac Trail & Lake to Lake Trail
(10 rows)

This is where the normalization of our table will begin.

How to do it...

The first thing we need to do is find all the fields that don't have ampersands and use those as our unique list of available trails. In our case, we can do this, as every trail has at least one segment that is uniquely named and not associated with another trail name. This approach will not work with all datasets, so be careful in understanding your data before applying this approach to that data. To select the fields without ampersands, we use the following query:

SELECT DISTINCT label_name, res
 FROM chp02.trails
 WHERE label_name NOT LIKE '%&%';

It will return the following output:

 label_name | res
--------------------------------------------+-------------------
South Quarry Loop Trail | Mill Stream Run Reservation
Buckeye Trail | Hinckley Reservation
Bridle Connector Trail | Rocky River Reservation
West Channel Pond Loop Trail | Rocky River Reservation
Green Milkweed Trail | Mill Stream Run Reservation
All Purpose Trail | Euclid Creek Reservation
Connector Trail | Bradley Woods Reservation
North Chagrin Reservation Bridle Trail | North Chagrin Reservation
Connector Trail | Garfield Park Reservation
BR2 | Brecksville Reservation
Connector Trail | Rocky River Reservation
Buckeye Trail | South Chagrin Reservation

For sanity, we will also sort these entries as follows:

SELECT DISTINCT label_name, res
 FROM chp02.trails
 WHERE label_name NOT LIKE '%&%'
 ORDER BY label_name, res;

Next, we want to search for all the records that match any of these unique trail names. This will give us the list of records that will serve as relations. The first step in doing this search is to append the percent (%) signs to our unique list in order to build a string on which we can search using a LIKE query:

SELECT '%' || label_name || '%' AS label_name, label_name as label, res FROM
 (
 SELECT DISTINCT label_name, res
 FROM chp02.trails
 WHERE label_name NOT LIKE '%&%'
 ORDER BY label_name, res
 ) AS label;

Finally, we'll use this in the context of a WITH block to do the normalization itself. This will provide us with a table of unique IDs for each segment in our first column, along with the associated label column. For good measure, we will do this as a CREATE TABLE procedure as shown in the following query:

CREATE TABLE chp02.trails_names AS
 WITH labellike AS
(
SELECT '%' || label_name || '%' AS label_name, label_name as label, res FROM
 (
 SELECT DISTINCT label_name, res
 FROM chp02.trails
 WHERE label_name NOT LIKE '%&%'
 ORDER BY label_name, res
 ) AS label
)
SELECT t.gid, ll.label, ll.res
 FROM chp02.trails AS t, labellike AS ll
 WHERE t.label_name LIKE ll.label_name
 AND
 t.res = ll.res
 ORDER BY gid;

Now that we have a table of the relations, we need a table of the geometries associated with gid. This, in comparison, is quite easy, as shown in the following query:

CREATE TABLE chp02.trails_geom AS
 SELECT gid, the_geom
 FROM chp02.trails;

How it works...

In this example, we have generated a unique list of possible records in conjunction with a search for the associated records, in order to build table relationships. In one table, we have the geometry and a unique ID of each spatial record; in another table, we have the names associated with each of those unique IDs. Now we can explicitly leverage those relationships.

First, we need to establish our unique IDs as primary keys with the following query:

ALTER TABLE chp02.trails_geom ADD PRIMARY KEY (gid);

Now we can use that PRIMARY KEY as a FOREIGN KEY in our trails_names table with the following query:

ALTER TABLE chp02.trails_names ADD FOREIGN KEY (gid) REFERENCES chp02.trails_geom(gid)

This step isn't strictly necessary, but does enforce referential integrity for queries such as the following:

SELECT geo.gid, geo.geom, names.label FROM
 chp02.trails_geom AS geo, chp02.trails_names AS names
 WHERE geo.gid = names.gid

There's more...

If we had multiple fields we wanted to normalize, we could write CREATE TABLE queries for each of them.

It is interesting to note that the approach framed in this recipe is not limited to cases where we have a delimited field. This approach can provide a relatively generic solution to the problem of normalizing flat files. For example, if we have a case where we have multiple fields to represent relational info, such as label1, label2, label3, or similar multiple attribute names to a single record, we can write a simple query to concatenate them together before feeding that info into our query.

主站蜘蛛池模板: 万宁市| 信阳市| 蒙城县| 武义县| 刚察县| 邯郸市| 正镶白旗| 阜平县| 密山市| 福清市| 大宁县| 龙胜| 江城| 疏附县| 临湘市| 阿拉善右旗| 宜宾市| 房产| 基隆市| 大埔县| 修水县| 贵南县| 丽水市| 侯马市| 五莲县| 襄汾县| 潮安县| 固原市| 清苑县| 汤阴县| 金阳县| 新闻| 宜丰县| 沁阳市| 芒康县| 和田市| 香河县| 台州市| 怀远县| 三明市| 佳木斯市|