Pages

Sunday, August 7, 2016

Pure Python: Loading Shapefiles into PostGIS

In addition to being the best spatial database engine available, PostGIS can be manipulated using pure Python without any C-library dependencies.

PostGIS is the spatial database engine for the open source PostgreSQL database.  For a decade we used ArcSDE or Oracle Spatial at NVision, but over the last few years we have completely switched to PostGIS unless a client requires a commercial solution.

One of the main things I like about PostgreSQL is its fantastic Python support, with several options for database drivers.  PostGIS is just a collection of PostgreSQL extensions and functions, so any database driver works. In this post, we're going to use the pure Python, dependency-free pg8000 and pyshp modules to load a point shapefile into a PostGIS database.  Both modules can be quickly installed using pip or easy_install.

To get started, follow the PostGIS instructions to get up and running with a spatial database.  In this example we'll use a database named "test" that is spatially enabled by PostGIS.  We have pg8000 installed as our database interface and pyshp to read a shapefile.  The shapefile we'll use is a point shapefile containing information about some cities in the state of Mississippi.  You can download the shapefile here.

The following code will set up a connection to our test geodatabase, create a table based on the fields of the shapefile, and then load the shapefile data into the table.
# Import the database driver and shapefile library
import pg8000
import shapefile

# Set up the database connection
connection = pg8000.connect(database="test", user="test", 
                            password="test")

# Get the database cursor to execute queries
cursor = connection.cursor()

# Set up a shapefile reader for our shapefile
r = shapefile.Reader("MSCities_Geo_Pts/MSCities_Geo_Pts.shp")

# Build a query to create our "cities" table
# with an id and geometry column.
table_query = """CREATE TABLE cities (
                 id SERIAL,
                 PRIMARY KEY (id),
                 geom GEOMETRY(Point, 4326),
                 """

# Get the shapefile fields but skip the first
# one which is a deletion flag used internally
# by dbf software
fields = r.fields[1:]

# We are going to keep track of the fields as a
# string. We'll use this query fragment for our
# insert query later.
field_string = ""

# Loop through the fields and build our table
# creation query.
for i in range(len(fields)):
    # get the field information
    f = fields[i]
    # get the field name and lowercase it for consistency
    f_name = f[0].lower()
    # add the name to the query and our string list of fields
    table_query += f_name
    field_string += f_name
    # Get the proper type for each field. Note this check
    # is not comprehensive but it covers the types in
    # our sample shapefile.
    if f[1] == "F":
        table_query += " DOUBLE PRECISION"
    elif f[1] == "N":
        table_query += " INTEGER"
    else:
        table_query += " VARCHAR"
    # If this field isn't the last, we'll add a comma
    # and some formatting.
    if i != len(fields) - 1:
        table_query += ","
        table_query += "\n"
        table_query += "                 "
        field_string += ","
    # Close the query on the field.
    else:
        table_query += ")"
        field_string += ",geom"

# Execute the table query
cursor.execute(table_query)

# Commit the change to the database or it won't stick. 
# PostgreSQL is transactional which is good so nothing 
# is stored until you're sure it works.
connection.commit()

# Create a python generator for the 
# shapefiles shapes and records
shape_records = (shp_rec for shp_rec in r.iterShapeRecords())

# Loop through the shapefile data and add it to our new table.
for sr in shape_records:
    # Get our point data.
    shape = sr.shape
    x, y = shape.points[0]
    # Get the attribute data and set it up as
    # a query fragment.
    attributes = ""
    # Use "val" here rather than "r" to avoid shadowing the shapefile reader.
    for val in sr.record:
        # Escape single quotes in strings so the query stays valid.
        if isinstance(val, str):
            val = val.replace("'", "''")
        attributes += "'{}'".format(val)
        attributes += ","
    # Build our insert query template for this shape record.
    # Notice we are going to use a PostGIS function
    # which can turn a WKT geometry into a PostGIS
    # geometry.
    point_query = """INSERT INTO cities 
                     ({})
                     VALUES ({}
                     ST_GEOMFROMTEXT('POINT({} {})',4326))"""    
    # Populate our query template with actual data.
    format_point_query = point_query.format(field_string, 
                                            attributes, x, y)
    # Insert the point data
    cursor.execute(format_point_query)

# Everything went ok so let's update the database.
connection.commit()
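Formatting values directly into the SQL works for this sample, but pg8000 also supports parameterized queries (it uses the DB-API "format" paramstyle, %s, by default), which handle quoting automatically and make the manual apostrophe escaping above unnecessary. Here's a minimal sketch of that approach; the field names and record are illustrative stand-ins, not data from the shapefile:

```python
# Build a parameterized INSERT instead of formatting values into the SQL.
# The field names and record below are illustrative, not from the shapefile.
fields = ["name", "pop2010"]
record = ["O'Brien Corner", 1200]  # embedded apostrophe needs no escaping
x, y = -90.18, 32.30

# One %s placeholder per attribute, plus one for the WKT geometry.
placeholders = ", ".join(["%s"] * len(record))
insert_sql = ("INSERT INTO cities ({}, geom) "
              "VALUES ({}, ST_GeomFromText(%s, 4326))").format(
                  ", ".join(fields), placeholders)
params = tuple(record) + ("POINT({} {})".format(x, y),)

print(insert_sql)
# With a live connection: cursor.execute(insert_sql, params)
```

The driver sends the values separately from the SQL text, so quoting is handled for you regardless of what's in the attribute data.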

The shapefile data is now available in PostGIS for loading into a GIS as a layer or performing spatial functions in PostGIS.  Here's a screenshot of it loaded into QGIS as a PostGIS layer:
I've used this technique to download datasets from the web, such as weather data, and update a table on an automated basis as new data becomes available.  One thing we did not do in this example is add a spatial index.  PostGIS can manage a lot of data, far more than you can efficiently manage on the file system, and spatial indexing is a big part of that efficiency.  It also allows multiple users to access the data at the same time.  Once you load a dataset, you typically run a query like the following to index the data:
index_query = """CREATE INDEX cities_geom_gix 
                 ON cities USING GIST (geom);"""
cursor.execute(index_query)
connection.commit()

That's it!  PostGIS has dozens of spatial functions that make analyzing data programmatically easy as well.  With pg8000 and a PostgreSQL server, PostGIS becomes a giant pure Python GIS library with no compiled dependencies on your client platform.
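As a taste of those spatial functions, here's a sketch of a distance query against the table we just loaded. It assumes the shapefile gave the cities table a name column (hypothetical here) and uses geography casts so ST_DWithin works in meters:

```python
# Sketch of a PostGIS proximity query against the "cities" table.
# The "name" column is an assumption about the loaded shapefile fields.
# Casting geom to geography lets ST_DWithin take a distance in meters.
nearby_sql = """
SELECT name, ST_AsText(geom)
FROM cities
WHERE ST_DWithin(geom::geography,
                 ST_MakePoint(-90.18, 32.30)::geography,
                 50000);
""".strip()

print(nearby_sql)
# With the connection from the loading script:
#   cursor.execute(nearby_sql)
#   for row in cursor.fetchall():
#       print(row)
```

That one WHERE clause replaces what would be a fair amount of haversine math in plain Python.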

Sunday, April 3, 2016

Pure Python NetCDF Reader

Great news! Pure Python geospatial programmer Karim Bahgat created a dependency-free pure Python NetCDF reader called pyncf.  If you're not familiar with NetCDF, it is technically a collection of software libraries and machine-independent data formats commonly used for multi-dimensional scientific data of any type.  The current version, NetCDF-4, uses a hierarchical data storage format called HDF5.  NetCDF has a concept of dimensions, variables, and attributes and can scale from one-to-one relationships up to many-to-many relationships in all directions.  In geospatial contexts, NetCDF is frequently used to store 2D and 3D raster or vector datasets over time.  It is very popular for ocean observations and climatological data.

He's posted the code and some more usage information on Github too:  https://github.com/karimbahgat/pyncf

Karim goes into detail on the ideas behind the library on his blog.  But be careful, the examples on the blog aren't as current as the ones on Github: https://thepythongischallenge.wordpress.com/2016/03/26/pynetcdf-netcdf-files-in-pure-python/

As a quick test I installed pyncf using pip directly from Github:
pip install https://github.com/karimbahgat/pyncf/archive/master.zip

I then downloaded the sample time series orthogonal point data NetCDF sample file from here:
https://www.nodc.noaa.gov/data/formats/netcdf/v1.1/

The actual link to the netcdf file is here:
/thredds/fileServer/testdata/netCDFTemplateExamples/timeSeries/BodegaMarineLabBuoy.nc

What's nice about these NOAA samples is that they have a plain-text CDL (Common Data Language) version as well that lets you see what's in the file when you're experimenting with an API like this:
http://data.nodc.noaa.gov/testdata/netCDFTemplateExamples/timeSeries/BodegaMarineLabBuoy.cdl

First, I imported the library and created a NetCDF object from the sample file:

import pyncf
nc = pyncf.NetCDF(filepath="BodegaMarineLabBuoy.nc")
Then I created a variable to hold the header dictionary:
header = nc.header
The header metadata contains a variety of summary data as nested dictionaries and lists.  We can access the product description by traversing that structure:

header["gatt_list"][1]["values"]

'These seawater data are collected by a moored fluorescence 
and turbidity instrument operated at Cordell Bank, California, USA, 
by CBNMS and BML. Beginning on 2008-04-23, fluorescence and turbidity 
measurements were collected using a Wetlabs ECO Fluorescence 
and Turbidity Sensor (ECO-FLNTUSB). The instrument depth of the 
water quality sensors was 01.0 meter, in an overall water depth 
of 85 meters (both relative to Mean Sea Level, MSL). 
The measurements reflect a 10 minute sampling interval.'
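Because pyncf exposes the header as plain nested dictionaries and lists, standard Python is all you need to explore it. Here's a sketch that loops over global attributes, using a mock header that mirrors the "gatt_list" structure shown above (the "name" key and attribute names are illustrative assumptions, not taken from the sample file):

```python
# Walk a pyncf-style header; this mock mirrors the nested structure above.
# The "name" key and the attribute contents are illustrative assumptions.
header = {
    "gatt_list": [
        {"name": "title", "values": "Bodega Marine Lab Buoy"},
        {"name": "summary", "values": "These seawater data ..."},
    ]
}

# List every global attribute name and value.
for att in header["gatt_list"]:
    print("{}: {}".format(att["name"], att["values"]))
```

With a real file you'd use the header from nc.header instead of the mock dictionary.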

The time values in this dataset are stored as epoch seconds.  I accessed those and then converted the first and last into readable dates to see exactly what period this dataset spans:

import time
t = nc.read_dimension_values("time")
print time.ctime(t[0])
'Mon Jul 28 12:30:00 2008'
print time.ctime(t[-1])
'Wed Sep 10 10:31:00 2008'
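The same conversion works with the datetime module if you want structured timestamps rather than strings. The epoch values below are illustrative stand-ins for t[0] and t[-1], not values read from the file:

```python
from datetime import datetime

# Illustrative epoch-second values standing in for t[0] and t[-1].
samples = [1217251800.0, 1221055860.0]

# Convert to UTC datetimes and print ISO 8601 strings.
for secs in (samples[0], samples[-1]):
    print(datetime.utcfromtimestamp(secs).isoformat())
```

Unlike time.ctime, this gives you objects you can compare, subtract, or reformat.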
The pyncf codebase is considered an alpha version and is currently read-only, but what a great addition to your pure Python geospatial toolbox!  I wish this library had been available when I updated "Learning Geospatial Analysis with Python"!

Friday, March 11, 2016

New Shapefile Browser

A developer named Adrián Eirís created a lightweight Python shapefile browser whose only compiled dependency is Tkinter, which is included with most Python installations or is just a package manager command away.  The tool is called "Ship-Shape-File-Navigator".  It lets you preview a shapefile's geometry and attributes, as well as perform common file operations like copying and renaming shapefiles as easily as if you were manipulating a single file (very handy!).  So if you need to do some quick shapefile management and you don't want to fire up QGIS or ArcMap, check it out.  You can find out more at: http://ShipShapeFileNavigator.org


Here's a quick screenshot showing the geometry preview:

Friday, January 15, 2016

2016 Book Sale!

My publisher, Packt Publishing, started 2016 with a 50% off sale for the eBook versions of both the
first and second editions of "Learning Geospatial Analysis with Python"!  Why would you want both?  The first edition is all Python 2 and the second edition, besides additional code samples and other significant updates, is focused on Python 3.  There are links on the right side of this page or you can just use code LGAPS50 at https://www.packtpub.com.  This deal is good until Jan. 31 and I promise you won't be disappointed.


When the first edition launched in 2013, there was one other geospatial python book out - Erik Westra's excellent "Python Geospatial Development".  Now there are 12 on Packt's site alone and other publishers are working to catch up.  I started this blog because I love Python and I love geospatial analysis but couldn't find much information out there.  Now there is a wealth of excellent geospatial python info and software coming from all directions!

You can help to continue to increase the use of Python in the geospatial field by discussing it online in blogs and forums, linking to interesting software and reading material on Twitter, releasing or contributing software on Github, and buying relevant books.  Even in this digital age, book sales have a tremendous impact on the buzz around, and therefore the use of, a given technology.  And at the end of the day, mind share usually drives an industry to adopt a given technology.  So however you participate, help spread the word that Python is the best language for geospatial technology (and pretty much everything else!).  Of all the ways of "voting" for Python, buying books is the most point-and-click, dead simple, yet still personally rewarding way to make an impact!