- PostGIS Cookbook
- Paolo Corti Thomas J. Kraft Stephen Vincent Mather Bborie Park
- 1198字
- 2021-07-19 18:29:44
Normalizing imports
Often data used in a spatial database is imported from other sources. As such it may not be in a form that is useful for our current application. In such a case, it may be useful to write functions that will aid in transforming the data into a form that is more useful for our application. This is particularly the case when going from flat file formats, such as shapefiles, to relational databases such as PostgreSQL.
Note
A shapefile is a de facto as well as formal standard for the storage of spatial data, and is probably the most common delivery format for vector spatial data. A shapefile, in spite of its name, is never just one file, but a collection of files. It consists of at least *.shp
(which contains geometry), *.shx
(an index file), and *.dbf
(which contains the tabular information for the shapefile). It is a powerful and useful format but, as a flat file, it is inherently nonrelational. Each geometry is associated in a one-to-one relationship with each row in a table.
There are many structures that might serve as a proxy for relational stores in a shapefile. We will explore one here—a single field with delimited text for multiple relations. This is a not-too-uncommon hack to encode multiple relationships into a flat file. The other common approach is to create multiple fields to store what in a relational arrangement would be a single field.
Getting ready
The dataset we will be working with is a trails dataset that has linear extents for a set of trails in a park system. The data is the typical data that comes from the GIS world—as a flat shapefile, there are no explicit relational constructs in the data.
First, we load the data using the following command:
shp2pgsql -s 3734 -d -i -I -W LATIN1 -g the_geom trails chp02.trails | psql -U me -d postgis_cookbook
Looking at the linear data in a Desktop GIS, we see some categories for trail use:

We want to retain this information as well as the name. Unfortunately, the label_name
field is a messy field with a variety of related names concatenated with an ampersand (&
), as shown in the following query:
SELECT DISTINCT label_name FROM chp02.trails WHERE label_name LIKE '%&%' LIMIT 10;
It will return the following output:
label_name ----------------------------------------------------------------- All Purpose Trail & Buckeye Trail & Sagamore Creek Loop Trail Hemlock Loop Trail & Prairie Loop Trail & Wildflower Loop Trail NC1 & NC2 Hinckley Hills Loop Trail & HI3 All Purpose Trail & North Ravine Loop Trail BR3 & BR4 & Buckeye Trail Bridle Trail & Hemlock Loop Trail Hemlock Trail & NC2 Hinckley Hills Loop Trail & HI1 Lake Isaac Trail & Lake to Lake Trail (10 rows)
This is where the normalization of our table will begin.
How to do it...
The first thing we need to do is find all the fields that don't have ampersands and use those as our unique list of available trails. In our case, we can do this, as every trail has at least one segment that is uniquely named and not associated with another trail name. This approach will not work with all datasets, so be careful in understanding your data before applying this approach to that data. To select the fields without ampersands, we use the following query:
SELECT DISTINCT label_name, res FROM chp02.trails WHERE label_name NOT LIKE '%&%';
It will return the following output:
label_name | res --------------------------------------------+------------------- South Quarry Loop Trail | Mill Stream Run Reservation Buckeye Trail | Hinckley Reservation Bridle Connector Trail | Rocky River Reservation West Channel Pond Loop Trail | Rocky River Reservation Green Milkweed Trail | Mill Stream Run Reservation All Purpose Trail | Euclid Creek Reservation Connector Trail | Bradley Woods Reservation North Chagrin Reservation Bridle Trail | North Chagrin Reservation Connector Trail | Garfield Park Reservation BR2 | Brecksville Reservation Connector Trail | Rocky River Reservation Buckeye Trail | South Chagrin Reservation
For sanity, we will also sort these entries as follows:
SELECT DISTINCT label_name, res FROM chp02.trails WHERE label_name NOT LIKE '%&%' ORDER BY label_name, res;
Next, we want to search for all the records that match any of these unique trail names. This will give us the list of records that will serve as relations. The first step in doing this search is to append the percent (%
) signs to our unique list in order to build a string on which we can search using a LIKE
query:
SELECT '%' || label_name || '%' AS label_name, label_name as label, res FROM ( SELECT DISTINCT label_name, res FROM chp02.trails WHERE label_name NOT LIKE '%&%' ORDER BY label_name, res ) AS label;
Finally, we'll use this in the context of a WITH
block to do the normalization itself. This will provide us with a table of unique IDs for each segment in our first column, along with the associated label
column. For good measure, we will do this as a CREATE TABLE
procedure as shown in the following query:
CREATE TABLE chp02.trails_names AS WITH labellike AS ( SELECT '%' || label_name || '%' AS label_name, label_name as label, res FROM ( SELECT DISTINCT label_name, res FROM chp02.trails WHERE label_name NOT LIKE '%&%' ORDER BY label_name, res ) AS label ) SELECT t.gid, ll.label, ll.res FROM chp02.trails AS t, labellike AS ll WHERE t.label_name LIKE ll.label_name AND t.res = ll.res ORDER BY gid;
Now that we have a table of the relations, we need a table of the geometries associated with gid
. This, in comparison, is quite easy, as shown in the following query:
CREATE TABLE chp02.trails_geom AS SELECT gid, the_geom FROM chp02.trails;
How it works...
In this example, we have generated a unique list of possible records in conjunction with a search for the associated records, in order to build table relationships. In one table, we have the geometry and a unique ID of each spatial record; in another table, we have the names associated with each of those unique IDs. Now we can explicitly leverage those relationships.
First, we need to establish our unique IDs as primary keys with the following query:
ALTER TABLE chp02.trails_geom ADD PRIMARY KEY (gid);
Now we can use that PRIMARY KEY
as a FOREIGN KEY
in our trails_names
table with the following query:
ALTER TABLE chp02.trails_names ADD FOREIGN KEY (gid) REFERENCES chp02.trails_geom(gid)
This step isn't strictly necessary, but does enforce referential integrity for queries such as the following:
SELECT geo.gid, geo.geom, names.label FROM chp02.trails_geom AS geo, chp02.trails_names AS names WHERE geo.gid = names.gid
There's more...
If we had multiple fields we wanted to normalize, we could write CREATE TABLE
queries for each of them.
It is interesting to note that the approach framed in this recipe is not limited to cases where we have a delimited field. This approach can provide a relatively generic solution to the problem of normalizing flat files. For example, if we have a case where we have multiple fields to represent relational info, such as label1
, label2
, label3
, or similar multiple attribute names to a single record, we can write a simple query to concatenate them together before feeding that info into our query.
- Progressive Web Apps with React
- 基于差分進化的優(yōu)化方法及應(yīng)用
- Building an RPG with Unity 2018
- RISC-V體系結(jié)構(gòu)編程與實踐(第2版)
- Lighttpd源碼分析
- Java程序員面試筆試寶典(第2版)
- 大話Java:程序設(shè)計從入門到精通
- JavaScript動態(tài)網(wǎng)頁編程
- 人工智能算法(卷1):基礎(chǔ)算法
- Java程序設(shè)計與項目案例教程
- Android Sensor Programming By Example
- Vue.js光速入門及企業(yè)項目開發(fā)實戰(zhàn)
- STM8實戰(zhàn)
- 高性能PHP 7
- Beginning C# 7 Hands-On:The Core Language