Often data used in a spatial database is imported from other sources. As such it may not be in a form that is useful for our current application. In such a case, it may be useful to write functions that will aid in transforming the data into a form that is more useful for our application. This is particularly the case when going from flat file formats, such as shapefiles, to relational databases such as PostgreSQL.
Note
A shapefile is a de facto as well as formal standard for the storage of spatial data, and is probably the most common delivery format for vector spatial data. A shapefile, in spite of its name, is never just one file, but a collection of files. It consists of at least *.shp (which contains geometry), *.shx (an index file), and *.dbf (which contains the tabular information for the shapefile). It is a powerful and useful format but, as a flat file, it is inherently nonrelational. Each geometry is associated in a one-to-one relationship with each row in a table.
There are many structures that might serve as a proxy for relational stores in a shapefile. We will explore one here—a single field with delimited text for multiple relations. This is a not-too-uncommon hack to encode multiple relationships into a flat file. The other common approach is to create multiple fields to store what in a relational arrangement would be a single field.
Getting ready
The dataset we will be working with is a trails dataset that has linear extents for a set of trails in a park system. The data is the typical data that comes from the GIS world—as a flat shapefile, there are no explicit relational constructs in the data.
First, we load the data using the following command:
Looking at the linear data in a Desktop GIS, we see some categories for trail use:
We want to retain this information as well as the name. Unfortunately, the label_name field is a messy field with a variety of related names concatenated with an ampersand (&), as shown in the following query:
SELECT DISTINCT label_name FROM chp02.trails WHERE label_name LIKE '%&%' LIMIT 10;
It will return the following output:
label_name ----------------------------------------------------------------- All Purpose Trail & Buckeye Trail & Sagamore Creek Loop Trail Hemlock Loop Trail & Prairie Loop Trail & Wildflower Loop Trail NC1 & NC2 Hinckley Hills Loop Trail & HI3 All Purpose Trail & North Ravine Loop Trail BR3 & BR4 & Buckeye Trail Bridle Trail & Hemlock Loop Trail Hemlock Trail & NC2 Hinckley Hills Loop Trail & HI1 Lake Isaac Trail & Lake to Lake Trail(10 rows)
This is where the normalization of our table will begin.
How to do it...
The first thing we need to do is find all the fields that don't have ampersands and use those as our unique list of available trails. In our case, we can do this, as every trail has at least one segment that is uniquely named and not associated with another trail name. This approach will not work with all datasets, so be careful in understanding your data before applying this approach to that data. To select the fields without ampersands, we use the following query:
SELECT DISTINCT label_name, res FROM chp02.trails WHERE label_name NOT LIKE '%&%';
It will return the following output:
label_name | res--------------------------------------------+-------------------South Quarry Loop Trail | Mill Stream Run ReservationBuckeye Trail | Hinckley ReservationBridle Connector Trail | Rocky River ReservationWest Channel Pond Loop Trail | Rocky River ReservationGreen Milkweed Trail | Mill Stream Run ReservationAll Purpose Trail | Euclid Creek ReservationConnector Trail | Bradley Woods ReservationNorth Chagrin Reservation Bridle Trail | North Chagrin ReservationConnector Trail | Garfield Park ReservationBR2 | Brecksville ReservationConnector Trail | Rocky River ReservationBuckeye Trail | South Chagrin Reservation
For sanity, we will also sort these entries as follows:
SELECT DISTINCT label_name, res FROM chp02.trails WHERE label_name NOT LIKE '%&%' ORDER BY label_name, res;
Next, we want to search for all the records that match any of these unique trail names. This will give us the list of records that will serve as relations. The first step in doing this search is to append the percent (%) signs to our unique list in order to build a string on which we can search using a LIKE query:
SELECT '%' || label_name || '%' AS label_name, label_name as label, res FROM ( SELECT DISTINCT label_name, res FROM chp02.trails WHERE label_name NOT LIKE '%&%' ORDER BY label_name, res ) AS label;
Finally, we'll use this in the context of a WITH block to do the normalization itself. This will provide us with a table of unique IDs for each segment in our first column, along with the associated label column. For good measure, we will do this as a CREATE TABLE procedure as shown in the following query:
CREATE TABLE chp02.trails_names AS WITH labellike AS(SELECT '%' || label_name || '%' AS label_name, label_name as label, res FROM ( SELECT DISTINCT label_name, res FROM chp02.trails WHERE label_name NOT LIKE '%&%' ORDER BY label_name, res ) AS label)SELECT t.gid, ll.label, ll.res FROM chp02.trails AS t, labellike AS ll WHERE t.label_name LIKE ll.label_name AND t.res = ll.res ORDER BY gid;
Now that we have a table of the relations, we need a table of the geometries associated with gid. This, in comparison, is quite easy, as shown in the following query:
CREATE TABLE chp02.trails_geom AS SELECT gid, the_geom FROM chp02.trails;
How it works...
In this example, we have generated a unique list of possible records in conjunction with a search for the associated records, in order to build table relationships. In one table, we have the geometry and a unique ID of each spatial record; in another table, we have the names associated with each of those unique IDs. Now we can explicitly leverage those relationships.
First, we need to establish our unique IDs as primary keys with the following query:
ALTER TABLE chp02.trails_geom ADD PRIMARY KEY (gid);
Now we can use that PRIMARY KEY as a FOREIGN KEY in our trails_names table with the following query:
ALTER TABLE chp02.trails_names ADD FOREIGN KEY (gid) REFERENCES chp02.trails_geom(gid)
This step isn't strictly necessary, but does enforce referential integrity for queries such as the following:
SELECT geo.gid, geo.geom, names.label FROM chp02.trails_geom AS geo, chp02.trails_names AS names WHERE geo.gid = names.gid
There's more...
If we had multiple fields we wanted to normalize, we could write CREATE TABLE queries for each of them.
It is interesting to note that the approach framed in this recipe is not limited to cases where we have a delimited field. This approach can provide a relatively generic solution to the problem of normalizing flat files. For example, if we have a case where we have multiple fields to represent relational info, such as label1, label2, label3, or similar multiple attribute names to a single record, we can write a simple query to concatenate them together before feeding that info into our query.