Pages

Sunday, August 7, 2016

Pure Python: Loading Shapefiles into PostGIS

In addition to being the best spatial database
engine 
available, PostGIS can be manipulated
using pure Python 
without any C-library
dependencies.

PostGIS is the spatial database engine for the open source PostgreSQL database.  For a decade weArcSDE or Oracle Spatial at NVision but over the last few years we have completely switched to PostGIS unless a client forces us to use a commercial solution.

One of the main things I like about PostgreSQL is fantastic Python support with several options for database drivers.  PostGIS is just a collection of PostgresSQL extensions and functions, so any database driver works. In this post, we're going to use the pure Python, dependency-free pg8000, and pyshp modules to load a point shapefile into a posts database.  Both modules can be quickly installed using pip or easy_install.

To get started, follow the PostGIS instructions to get up and running with a spatial database.  In this example we'll use a database named "test" that is spatially enabled by PostGIS.  We have pg8000 installed as our database interface and pushup to read a shapefile.  The shapefile we'll use is a point shapefile containing information about some cities in the state of Mississippi.  You can download the shapefile here.

The following code will set up a connection to our test geodatabase, create a table based on the fields of the shapefile, and then load the shapefile data into the table.
# Import the database driver and shapefile library
import pg8000
import shapefile

# Set up the datbase connection
connection = pg8000.connect(database="test", user="test", 
                            password="test")

# Get the database cursor to execute queries
cursor = connection.cursor()

# Set up a shapefile reader for our shapefile
r = shapefile.Reader("MSCities_Geo_Pts/MSCities_Geo_Pts.shp")

# Build a query to create our "cities" table
# with an id and geometry column.
table_query = """CREATE TABLE cities (
                 id SERIAL,
                 PRIMARY KEY (id),
                 geom GEOMETRY(Point, 4326),
                 """

# Get the shapefile fields but skip the first
# one which is a deletion flag used internally
# by dbf software
fields = r.fields[1:]

# We are going to keep track of the fields as a
# string. We'll use this query fragment for our
# insert query later.
field_string = ""

# Loop throug the fields and build our table
# creation query.
for i in range(len(fields)):
    # get the field information
    f = fields[i]
    # get the field name and lowercase it for consistency
    f_name = f[0].lower()
    # add the name to the query and our string list of fields
    table_query += f_name
    field_string += f_name
    # Get the proper type for each field. Note this check
    # is not comprehensive but it convers the types in
    # our sample shapefile.
    if f[1] == "F":
        table_query += " DOUBLE PRECISION"
    elif f[1] == "N":
        table_query += " INTEGER"
    else:
        table_query += " VARCHAR"
    # If this field isn' the last, we'll add a comma
    # and some formatting.
    if i != len(fields)- 1:
        table_query += ","
        table_query += "\n"
        table_query += "                 "
        field_string += ","
    # Close the query on the field.
    else:
        table_query += ")"
        field_string += ",geom"

# Execute the table query
cursor.execute(table_query)

# Commit the change to the database or it won't stick. 
# PostgreSQL is transactional which is good so nothing 
# is stored until you're sure it works.
connection.commit()

# Create a python generator for the 
# shapefiles shapes and records
shape_records = (shp_rec for shp_rec in r.iterShapeRecords())

# Loop through the shapefile data and add it to our new table.
for sr in shape_records:
    # Get our point data.
    shape = sr.shape
    x, y = shape.points[0]
    # Get the attribute data and set it up as
    # a query fragment.
    attributes = ""
    for r in sr.record:
        if type(r) == type("string"):
            r = r.replace("'", "''")
        attributes += "'{}'".format(r)
        attributes += ","        
    # Build our insert query template for this shape record.
    # Notice we are going to use a PostGIS function
    # which can turn a WKT geometry into a PostGIS
    # geometry.
    point_query = """INSERT INTO cities 
                     ({})
                     VALUES ({}
                     ST_GEOMFROMTEXT('POINT({} {})',4326))"""    
    # Populate our query template with actual data.
    format_point_query = point_query.format(field_string, 
                                            attributes, x, y)
    # Insert the point data
    cursor.execute(format_point_query)

# Everything went ok so let's update the database.
connection.commit()

The shapefile data is now available in PostGIS for loading into a GIS as a layer or performing spatial functions in PostGIS.  Here's a screenshot of it loaded into QGIS as a PostGIS layer:
I've used this technique to download datasets from the web such as weather data, and update a table on an automated basis as new data is available.  One thing we did not do in this example is add a spatial index.  PostGIS can manage a lot of data - more than you can efficiently manage on the file system.  Spatial indexing is a big part of that efficiency.  It also allows multiple users to access that data at the same time.  Once you load a data set, you typically do a query like the following to index the data:
index_query = """CREATE INDEX cities_geom_gix 
           ON cities USING GIST (geom);"""

That's it!  PostGIS has dozens of spatial functions that make analyzing data programmatically easy as well.  With pg8000 and a PostgreSQL server, PostGIS becomes a giant pure Python GIS library with no compiled dependencies on your client platform.