Monday, February 28, 2011

Changing a Shapefile's Type

A polygon, line, and point version of the same shapefile.
Sometimes you want to convert a shapefile from one type to another.  For example you may want to convert a line shapefile to a polygon or a polygon to a point or multipoint shapefile.  There are many reasons for this type of operations ranging from error checking, to special queries, to inconvenient distribution formats.  For example a lot of coastline data is distributed as line data but you may want to convert it to a polygon to estimate coastal erosion using area comparisons between two different dates.

Performing this type of conversion is very straightforward using the Python Shapefile Library.  In fact the conversion is basically a one-off version of the shapefile merge example I wrote about recently.  You read in one shapefile and write the features and records out to another of the correct type.  There are a couple of pitfalls you need to be wary of though.  One is the current version (1.0) of the PSL requires you to explicitly set the shape type of each record if you want to convert them.  The second issue is if you are converting to a single point shapefile where each point feature is a record you must compensate for the imbalance in the dbf records by copying the record from the parent feature for each point.  Instead of dealing with this issue you could simply create a multi-point shapefile where each shape record is allowed to be a collection of points.  Which method you choose depends on what you are trying to do with the output.  The examples below cover both methods.

The example in this post takes a state boundary polygon file and converts it to a line shapefile, then a multipoint shapefile, then a regular point shapefile.  Note the difference between the point shapefile and the line and multipoint examples.

Convert one shapefile type to another 

import shapefile

# Create a line and a multi-point 
# and single point version of
# a polygon shapefile

# The shapefile type we are converting to
newType = shapefile.POLYLINE

# This is the shapefile we are trying
# to convert. In this case it's a
# state boundary polygon file for 
# Mississippi with one polygon and
# one dbf record.
r = shapefile.Reader("Mississippi")

## POLYLINE version
w = shapefile.Writer(newType)
# You must explicity set the shapeType of each record.
# Eventually the library will set them to the same
# as the file shape type automatically.
for s in w.shapes():
  s.shapeType = newType
w.fields = list(r.fields)

## MULTIPOINT version
newType = shapefile.MULTIPOINT

w = shapefile.Writer(newType)
for s in w.shapes():
  s.shapeType = newType
w.fields = list(r.fields)

## POINT version
newType = shapefile.POINT

w = shapefile.Writer(newType)
# For a single point shapefile
# from another type we
# "flatten" each shape
# so each point is a new record.
# This means we must also assign
# each point a record which means
# records are usually duplicated.
for s in r.shapeRecords():
  for p in s.shape.points:
w.fields = list(r.fields)"Miss_Point")

You can download the state boundary polygon shapefile used in the example from the GeospatialPython Google Code Project Downloads section.  You can download the sample script above from the subversion repository of that same project.

And of course the Python Shapefile Library is here.

Tuesday, February 22, 2011

Clip a Raster using a Shapefile

Clipping a satellite image: Rasterize, Mask, Clip, Save
If you read this blog you see most of the material covers shapefiles.  The intent of this blog is to cover remote sensing as well and this article provides a great foundation for remote sensing in Python. In this post I'll demonstrate how to use several Python libraries to to create a script which can take any polygon shapefile and use it as a mask to clip a geospatial image.  Although I'm demonstrating a fairly basic process, this article and the accompanying sample script is densely-packed with lots of good information and tips that would take you hours if not days to piece together reading forum posts, mailing lists, blogs, and trial and error.  This post will get you well on your way to doing whatever you want to do with Python and Remote Sensing.

Satellite and aerial images are usually collected in square tiles more or less the same way your digital camera frames and captures a picture.  Geospatial images are data capturing different wavelengths of light reflected from known points on the Earth or even other planets.  GIS professionals routinely clip these image tiles to a specific area of interest to provide context for vector layers within a GIS map.  This technique may also be used for remote sensing to narrow down image processing to specific areas to reduce the amount of time it takes to analyze the image.

The Process

Clipping a raster is a series of simple button clicks in high-end geospatial software packages.  In terms of computing, geospatial images are actually very large, multi-dimensional arrays.  Remote Sensing at its simplest is performing mathematical operations on these arrays to extract information from the data. Behind the scenes here is what the software is doing (give or take a few steps):
  1. Convert the vector shapefile to a matrix which can be used as mask
  2. Load the geospatial image into a matrix
  3. Throw out any image cells outside of the shapefile extent
  4. Set all values outside the shapefile boundary to NODATA (null) values
  5. OPTIONAL: Perform a histogram stretch on the image for better visualization
  6. Save the resulting image as a new raster.
Geospatial Python Raster Clipping Workflow

Two things I try to do on this blog are build on techniques used in previous posts and focus on pure-Python solutions as much as possible.  The script featured in this post will use one of the shapefile rasterization techniques I've written about in the past.  However I did not go pure-Python on this for several reasons.  Geospatial image formats tend to be extremely complex.  You could make a career out of reading and writing the dozens of evolving image formats out there.  As the old saying goes TIFF stands for "Thousands of Incompatible File Formats".  So for this reason I use the Python bindings for GDAL when dealing with geospatial raster data.  The other issue is the size of most geospatial raster data.  Satellite and high-resolution aerial images can easily be in the 10's to 100's of megabytes size range.  Doing math on these images and the memory required to follow the six step process outlined above exceeds the capability of Python's native libraries in many instances.  For this reason I use the Numpy library which is dedicated to large, multi-dimensional matrix math.  Another reason to use Numpy is tight integration with GDAL in the form of the "GDALNumeric" module. (Numeric was the predecessor to Numpy) In past posts I showed a pure-Python way to rasterize a shapefile.  However I use the Python Imaging Library (PIL) in this example because it provides convenient methods to move data back and forth between Numpy.

Library Installation

So in summary you will need to install the following packages to make the sample script work.  Usually the Python Disutils system (i.e. the "easy_install" script) is the fastest and simplest way to install a Python library.  Because of the complexity and dependencies of some of these tools you may need to track down a pre-compiled binary for your platform.  Both Numpy and GDAL have them linked from their respective websites or the Python Package Index.

The Example

# - clip a geospatial image using a shapefile

import operator
from osgeo import gdal, gdalnumeric, ogr, osr
import Image, ImageDraw

# Raster image to clip
raster = "SatImage.tif"

# Polygon shapefile used to clip
shp = "county"

# Name of clip raster file(s)
output = "clip"

# This function will convert the rasterized clipper shapefile 
# to a mask for use within GDAL.    
def imageToArray(i):
    Converts a Python Imaging Library array to a 
    gdalnumeric image.
    return a

def arrayToImage(a):
    Converts a gdalnumeric array to a 
    Python Imaging Library Image.
    return i
def world2Pixel(geoMatrix, x, y):
  Uses a gdal geomatrix (gdal.GetGeoTransform()) to calculate
  the pixel location of a geospatial coordinate 
  ulX = geoMatrix[0]
  ulY = geoMatrix[3]
  xDist = geoMatrix[1]
  yDist = geoMatrix[5]
  rtnX = geoMatrix[2]
  rtnY = geoMatrix[4]
  pixel = int((x - ulX) / xDist)
  line = int((ulY - y) / yDist)
  return (pixel, line) 

def histogram(a, bins=range(0,256)):
  Histogram function for multi-dimensional array.
  a = array
  bins = range of numbers to match 
  fa = a.flat
  n = gdalnumeric.searchsorted(gdalnumeric.sort(fa), bins)
  n = gdalnumeric.concatenate([n, [len(fa)]])
  hist = n[1:]-n[:-1] 
  return hist

def stretch(a):
  Performs a histogram stretch on a gdalnumeric array image.
  hist = histogram(a)
  im = arrayToImage(a)   
  lut = []
  for b in range(0, len(hist), 256):
    # step size
    step = reduce(operator.add, hist[b:b+256]) / 255
    # create equalization lookup table
    n = 0
    for i in range(256):
      lut.append(n / step)
      n = n + hist[i+b]
  im = im.point(lut)
  return imageToArray(im)

# Load the source data as a gdalnumeric array
srcArray = gdalnumeric.LoadFile(raster)

# Also load as a gdal image to get geotransform 
# (world file) info
srcImage = gdal.Open(raster)
geoTrans = srcImage.GetGeoTransform()

# Create an OGR layer from a boundary shapefile
shapef = ogr.Open("%s.shp" % shp)
lyr = shapef.GetLayer(shp)
poly = lyr.GetNextFeature()

# Convert the layer extent to image pixel coordinates
minX, maxX, minY, maxY = lyr.GetExtent()
ulX, ulY = world2Pixel(geoTrans, minX, maxY)
lrX, lrY = world2Pixel(geoTrans, maxX, minY)

# Calculate the pixel size of the new image
pxWidth = int(lrX - ulX)
pxHeight = int(lrY - ulY)

clip = srcArray[:, ulY:lrY, ulX:lrX]

# Create a new geomatrix for the image
geoTrans = list(geoTrans)
geoTrans[0] = minX
geoTrans[3] = maxY

# Map points to pixels for drawing the 
# boundary on a blank 8-bit, 
# black and white, mask image.
points = []
pixels = []
geom = poly.GetGeometryRef()
pts = geom.GetGeometryRef(0)
for p in range(pts.GetPointCount()):
  points.append((pts.GetX(p), pts.GetY(p)))
for p in points:
  pixels.append(world2Pixel(geoTrans, p[0], p[1]))
rasterPoly ="L", (pxWidth, pxHeight), 1)
rasterize = ImageDraw.Draw(rasterPoly)
rasterize.polygon(pixels, 0)
mask = imageToArray(rasterPoly)   

# Clip the image using the mask
clip = gdalnumeric.choose(mask, \
    (clip, 0)).astype(gdalnumeric.uint8)

# This image has 3 bands so we stretch each one to make them
# visually brighter
for i in range(3):
  clip[i,:,:] = stretch(clip[i,:,:])

# Save ndvi as tiff
gdalnumeric.SaveArray(clip, "%s.tif" % output, \
    format="GTiff", prototype=raster)

# Save ndvi as an 8-bit jpeg for an easy, quick preview
clip = clip.astype(gdalnumeric.uint8)
gdalnumeric.SaveArray(clip, "%s.jpg" % output, format="JPEG")

Tips and Further Reading

The utility functions at the beginning of this script are useful whenever you are working with remotely sensed data in Python using GDAL, PIL, and Numpy.

If you're in a hurry be sure to look at the GDAL utility programs.  This collection has a tool for just about any simple operation including clipping a raster to a rectangle.  Technically you could accomplish the above polygon clip using only GDAL utilities but for complex operations like this Python is much easier.

The data referenced in the above script are a shapefile and a 7-4-1 Path 22, Row 39 Landsat image from 2006. You can download the data and the above sample script from the GeospatialPython Google Code project here.

I would normally use the Python Shapefile Library to grab the polygon shape instead of OGR but because I used GDAL, OGR is already there. So why bother with another library?

If you are going to get serious about Remote Sensing and Python you should check out OpenEV.  This package is a complete remote sensing platform including an ERDAS Imagine-style viewer.  It comes with all the GDAL tools, mapserver and tools, and a ready-to-run Python environment.

I've written about it before but Spectral Python is worth a look and worth mentioning again. I also recently found PyResample on Google Code but I haven't tried it yet.  

Beyond the above you will find bits and pieces of Python remote sensing code scattered around the web.  Good places to look are:
More to come!

UPDATE (May 4, 2011): I usually provide a link to example source code and data for instructional posts. I set up the download for this one but forgot to post it.  This zip file contains everything you need to perform the example above except the installation of GDAL, Numpy, and PIL:

Make sure the required libraries are installed and working before you attempt this example.  As I mention above the OpenEV package has a Python environment with all required packages except PIL.  It may take a little work to get PIL into this unofficial Python environment but in my experience it's less work than wrangling GDAL into place.

Saturday, February 12, 2011

Create a .prj Projection File for a Shapefile

An example of a cordiform map projection a.k.a. 
heart-shaped projection
. Happy Valentine's!
If you create a shapefile with ESRI software or receive one from someone who did you may see a ".prj" file included along with the shp, shx, and dbf files.  In fact, the prj file is one of up to 9 possible "official" file extensions for various indexes and other meta files.  Most of these file formats are proprietary.  There are an additional two formats created by the open source community to work around the closed formats created by ESRI for spatial indexing.

The shapefile format does not allow for specifying the map projection of the data. When ESRI created the shapefile format everyone worked with data in only one projection. If you tried to load a layer in a different projection into your GIS weird things would happen.  Not too long ago as hardware capability increased according to Moore's Law, GIS software packages developed the ability to reproject geospatial layers on the fly.  You could now load in layers in any projection and as long as you told the software what projections were involved the map would come together nicely.

ArcGIS 8.x allowed you to manually assign each layer a projection.  This information was stored in the prj file.  The prj file contains a WKT (Well-Known Text) string which has all the parameters for the map projection. So the format is quite simple and was created by the Open GIS Consortium.

But there are several thousand "commonly-used" map projections which were standardized by the European Survey Petroleum Group (EPSG). And there's no way to accurately detect the projection from the coordinates in the shapefile. For these reasons the Python Shapefile Library does not currently handle prj files.

If you need a prj file, the easiest thing to do is write one yourself. The following example creates a simple point shapefile and then the corresponding prj file using the WGS84 "unprojected" WKT.

import shapefile as sf
filename = 'test/point'

# create the shapefile
w = sf.Writer(sf.POINT)
w.point(37.7793, -122.4192)

# create the PRJ file
prj = open("%s.prj" % filename, "w")
epsg = 'GEOGCS["WGS 84",'
epsg += 'DATUM["WGS_1984",'
epsg += 'SPHEROID["WGS 84",6378137,298.257223563]]'
epsg += ',PRIMEM["Greenwich",0],'
epsg += 'UNIT["degree",0.0174532925199433]]'

I've thought about adding the ability to optionally write prj files but the list of "commonly-used" WKT strings is over .5 megs and would be bigger than the shapefile library itself.  I may eventually work something out though.

The easiest thing to do right now is just figure out what WKT string you need for your data and write a file after you save your shapefile. If you need a list of map projection names, epsg codes, and corresponding WKT strings you can download it from the geospatialpython project site here on Google Code.

A word of warning if you are new to GIS and shapefiles: the prj file is just metadata about your shapefile.  Changing the projection reference in the prj file will not change the actual projection of the geometry and will just confuse your GIS software.

Thursday, February 10, 2011

Merging Lots of Shapefiles (quickly)

Arne, over at, recently posted about merging shapefiles using a batch process. I can't remember the last time I merged two or more shapefiles but after googling around it is a very common use case.  GIS forums are littered with requests for the best way to batch merge a directory full of files.  My best guess is people have to work with automatically-generated, geographically disperse data with a common projection and database schema.  I imagine these files would be the result of some automated sensor output. If you know some use cases requiring merging many shapefiles I'd be curious to hear about it.

Arne pointed out that all the code samples out there iterate through each feature in a shapefile and add them to the merged file.  He says this method is slow. I agree to an extent (no pun intended).  However, at some point the underlying shapefile library MUST iterate through each feature in order to generate the summary information, namely the bounding box, required to write a valid shapefile header.  But it is theoretically slightly more efficient to wait until the merge is finished so there is only one iteration cycle.  At the very least, waiting till the end requires less code.

The following example merges all the shapefiles in the current directory into one file and it is quite fast.

# Merge a bunch of shapefiles with attributes quickly!
import glob
import shapefile
files = glob.glob("*.shp")
w = shapefile.Writer()
for f in files:
  r = shapefile.Reader(f)
w.fields = list(r.fields)"merged")

Wednesday, February 2, 2011

Python 3 Version of the Python Shapefile Library Released

I created a "hasty" port of the Python Shapefile Library to Python 3 at the request of a developer.  You can download it from the Subversion repository at the "pyshp" project on Google Code.  You will now find the subversion trunk contains "Python 2" and "Python 3" folders.  The documentation available in the "Downloads" section of the project site contains two sets of documentation now as well.  The Python 3 version is tagged with "Py3".

The versions function identically however the "Editor" class is broken in the Python 3 version and this issue is documented in the instructions and the code.  This class is not necessary to read and write shapefiles. It is just a convenience class and will be fixed in a future version.  I have only made the Py 3 version pass the doctests - nothing more.  So bug reports are welcome.

I also plan to make the Python 2 version compatible with Jython.  Initial testing shows that it too works except for the Editor class.  I'd appreciate any feedback if you are using it on that platform.  The same applies to IronPython and even PythonCE which I haven't tested at all.

Because of the reorganization of the source tree many of the links on previous posts may be broken.  I should have these repaired shortly. [UPDATE: All links have been corrected 2/11/11]

Both versions will be updated and maintained indefinitely as it will probably take several years for Python 3 to become mainstream - especially in the geospatial community. Many, many geospatial libraries will require porting to 3.  I used the "2to3" script but found that most of the work was in casting strings and byte arrays which is no longer implicit.  The library packs and unpacks data constantly so this change had a huge impact on the shapefile library.  Python 2 made it easy to be lazy with data types.