Pages

Tuesday, May 1, 2012

Pyshp 1.1.6-beta Release for Testing

Pyshp 1.1.6-beta ready for testing
A pre-release of pyshp 1.1.6 is available for testing.  This release addresses some major issues with reading/writing 3D shapefiles.  The issue was identified by John Burky.  I am currently working through several bug reports right now but this one was a show stopper so I wanted to get a fix out quickly as there seem to be several people working with z elevation values right now.  Also if you were having troubles with "m" measure values this release fixes a related issue.  The Editor class, z values, and m values are dark corners that are not as well tested as "regular" shapefile features so if you're working with these types of data keep a sharp eye out for anything weird.  I'll push this update out as an official release within the next couple of weeks if there are no complaints.

The release is available in the Pyshp Google Code site "Downloads" section here.

In other news we are still working on the sbn/sbx binning example for spatial indexes.  Very close but not there yet.

Tuesday, February 28, 2012

Pyshp shapeRecords() Method

The shapefile.Reader.shapeRecords()
method lets you juggle both the
geometry and dbf attributes at the
same time.
The shapefile.Reader.shapeRecords() method is a simple convenience method which allows you to simultaneously loop through both the geometry and records of a shapefile.  Normally  you would loop through the shape records and then loop through the dbf records seperately.  But sometimes it's easier to have both sides of the shapefile equation accessible at the same time.  This ability is important sometimes because the link between geometry and attributes is implied by their order in the file and not explicit which can make referencing one side or the other a pain.  Warning: the current implementation pulls everything into memory at once which can be a problem for very large shapefiles. This weakness will be updated in future versions.

Here’s a simple usage example followed by a detailed explanation and a few other posts where I use this method without much explanation.

Let’s say you have a simple point-location address shapefile named “addr.shp” with the following structure:

GEOMETRY ADDRESS CITY STATE ZIP
[-89.522996, 34.363596] 7018 South 8th Oxford MS 38655
[-89.520695, 34.360863] 1199 South 11th Oxford MS 38655
[-89.520927, 34.362924] 8005 Fillmore Ave Oxford MS 38655

You could then use the shapeRecords method like this:

>>> import shapefile
>>> r = shapefile.Reader(“addr”)
>>> sr = r.shapeRecords()
>>> # get the first shaperecord
>>> sr_test = sr[0]
>>> # Look at the geometry of the shape
>>> sr_test.shape.points
[[-89.522996, 34.363596]]
>>> # Look at the attributes of the dbf record
>>> sr_test.record
[‘7018 South 8’,’Oxford’,’MS’,’38655’]
>>> # Now let’s iterate through all of them
>>> for sr in r.shapeRecords():
...    print “x: “, sr.points[0][0]
...    print “y: “, sr.points[0][1]
...    # Output just the address field
...    print “Address: “, sr.record[0]
x: -89.522996
y: 34.363596
Address: 7018 South 8th
x: -89.520695
y: 34.360863
Address: 1195 South 11th
x: -89.520927
y: 34.362924
Address: 805 Fillmore Ave

Here’s how it works.

The shapeRecords() method returns a list.

Each entry in that list is a _ShapeRecord object instance.

A _ShapeRecord object has two attributes: shape, record

_ShapeRecord.record contains a simple list of the attributes.

_ShapeRecord.shape contains a _Shape object instance.

A _Shape object has, at a minimum, two attributes: shapeType, points

If the _Shape instance contains a polygon a “parts” attribute will appear.  This attribute contains the index in the point list of the beginning of a “part”.  Parts let you store multiple shapes in a single record.

The shapeType attribute provides a number telling you if the shapefile is  a point, polygon, line, etc. file.  These constants are listed in the shapefile spec document as well as near the top of the source code.

The points is just a list containing lists of the point coordinates.  Two things to note:  If the geometry has multiple parts, such as multiple polygons, the points for all parts are just lumped together.  You must separate them by referencing the parts index list.  Some shape types allow for z and m values which may appear in addition to the x,y pairs.

This method is really just a clumsy convenience method that basically zips up the results of the shapes() and records() methods you are already using.

I have a few blog posts where I call this method as well:

http://geospatialpython.com/2011/02/changing-shapefiles-type.html

http://geospatialpython.com/2011/01/point-in-polygon.html

http://geospatialpython.com/2010/12/dot-density-maps-with-python-and-ogr.html  (in the comments)

Wednesday, February 8, 2012

What Recession?

I know of several geospatial programmer and IT jobs along the Gulf Coast and in St. Louis, Mo. with our company and several others who ask me on a regular basis if I know of anybody.  If you're in these areas or willing to move let me know.  All of these jobs involve working on some very large geospatial systems which do amazing things that often end up as news headlines.  You must have the ability to obtain a government clearance though...

Wednesday, January 18, 2012

Today America is debating potential US laws that would affect geospatial python programmers around the world. 

Learn what you can do: https://www.google.com/landing/takeaction/

Friday, November 4, 2011

Deleting Shapefile Features

Sometimes, usually as a server-based operation, you need to delete all of the features in a shapefile. All you want left is the shapefile type, the dbf schema, and maybe the overall bounding box. This shapefile stub can then be updated by other processes. Pyshp currently doesn't have an explicit "delete" method. But because pyshp converts everything to native Python data types (strings, lists, dicts, numbers) you can usually manipulate things fairly easily. The solution is very similar to merging shapefiles but instead you are writing back to the same file instead of a new copy. There's only one hitch in this operation resulting from a minor difference in the pyshp Reader and Writer objects. In the reader the "bbox" property returns a static array of [xmin, ymin, xmax, ymax]. The Writer also has a "bbox" property but it is a method that is called when you save the shapefile. The Writer calculates the bounding box on the fly by reading all of the shapes just before saving. But in this case there are no shapes so the method would throw an error. So what we do is just override that method with a lambda function to return whatever bbox we want whether it be the original bbox or a made up one.
import shapefile 
# Read the shapefile we want to clear out
r = shapefile.Reader("myshape") 
# Create a Writer of the same type to save out as a blank
w = shapefile.Writer(r.shapeType) 
# This line will give us the same dbf schema 
w.fields = r.fields 
# Use the original bounding box in the header 
w.bbox = lambda: r.bbox 
# Save the featureless, attribute-less shapefile
w.save("myshape") 
Instead of using the original bounding box we could just populate it with 0's and let the update process repopulate it:
w.bbox = lambda: [0.0, 0.0, 0.0, 0.0]
Note coordinates in a shapefile must always be floating-point numbers. Sometimes you may not want to delete all of the features. You may want to select certain features by attribute or using a spatial operation.

Wednesday, November 2, 2011

Generating Shapefile shx Files

Shapefile shx files help software locate records
quickly but they are not strictly necessary. The
shapefile software can manually browse the
records to answer a query.
Lately I've been following traffic and responding to posts on the excellent site GIS StackExchange.  There are several questions about shapefile shx files which also point to even more questions in the ESRI forums on this topic.

If for some reason, you end up with a shapefile that is missing the shx file then most software is going to complain and refuse to deal with it.  The shapefile spec requires, at a minimum, that you have an shp, shx, and dbf file to have a complete file.  However this requirement is not a technical requirement and a lot of people seem to be confused about that. 

The shx file is a trivial index file that provides fixed-length records pointing to the byte offsets of records in  the shp file only.  It does not connect the shp file and dbf file in any way nor does it contain any sort of record number.  There are no record numbers stored in any of the three standard files which is often a point of confusion.  The software reading a shapefile has to count the number of records read to determine the record id (geometry and attributes).  If you wrote a program to randomly select a record from a shapefile there is no way to tell what the record number is by the record contents.

The purpose of the shx file is to provide faster access to a particular record in a shapefile without storing the entire record set of the shp and dbf files in memory.  The header of the shx file is 100 bytes long.  Each record is 8 bytes long.  So if I want to access record 3, I know that 2*8  = 16 and I can jump to byte 100+16=116 in the shx file, read the 8-byte record to get the offset and record length within the shp file, and then jump straight to that location in the shp file.

While the shx file is convienient it isn't necessary.  Most software balks if it is not there though.  However pyshp handles it gracefully.  If the shx index is there it is used for record access, if not then pyshp reads through the shp records into memory and handles the records as a python list.

Sometimes shx files become corrputed or go missing.  You can build a new shx index using pyshp.  It's kind of a hack but still very simple. In the following example we build an index file for a point shapefile named "myshape" that has two files: "myshape.shp" and "myshape.dbf"

# Build a new shx index file
import shapefile
# Explicitly name the shp and dbf file objects
# so pyshp ignores the missing/corrupt shx
myshp = open("myshape.shp", "rb")
mydbf = open("myshape.dbf", "rb")
r = shapefile.Reader(shp=myshp, shx=None, dbf=mydbf)
w = shapefile.Writer(r.shapeType)
# Copy everything from reader object to writer object
w._shapes = r.shapes()
w.records = r.records()
w.fields = list(r.fields)
# saving will generate the shx
w.save("myshape")

If the shx file is missing it will be created.  If it's corrupt it will be overwritten. So the moral of the story is because shapefiles consist of multiple files, it is actually a robust format. The data in the individual files can usually be accessed in isolation from the other files despite what the standard requires - assuming the software you're using is willing to cooperate.

Tuesday, October 4, 2011

Your Chance to Make GIS History

Update 3/2/12:  Getting closer still... Marc has determined there are two binning algorithms at work to determine which spatial bin a feature ends up in when a node on the tree is split as well as when a split should happen.  Also, another developer, named Francisco, has recently begun a C# implementation and seems to be making good progress.  In the comments on this post Francisco and Marc have been discussing the two algorithms.  It's hard to give a time estimate but Marc feels that this last binning/splitting case should complete the picture.


Update 11/3/11:  Work continues on this issue.  I hope to post soon on the remaining mysteries of the algorithm but we are pretty close.  I hope to do a more technical post on this issue soon to hopefully generate another round of feedback.  I will continue to post any breakthroughs here as well as create a new post when we are done.

Update 10/27/11:  Marc Pfister responded first and has been working tirelessly on this challenge. He's made a good bit of progress in unlocking the algorithm.  Si Parker joined us recently and brought some great insight that added some new direction to Marc's work.  While those guys worked on the hard stuff I created an sbx/sbn reader and writer module that can copy and manipulate these files. I believe at this point we could fool ESRI software but we are systematically closing in on the complete indexing scheme.  It's been fascinating to watch the techniques used which I'll share when it's done. 


This post is an open challenge to clever geospatial programmers everywhere. 

Background

A plot produced by sbn.py for a "world cities" shapefile spatial index
The tremendously successful shapefile format is generally considered an open format due to the fact that the shp, shx, and dbf files are documented.  But the 1998 "ESRI Shapefile Technical Description" doesn't tell the whole story.  Another part of the shapefile format is the spatial index. 

Spatial indexes create groups of features based on a given spatial clustering algorithm and define these clusters using a bounding box, usually mapped to an integer grid to avoid using floating point numbers.  When performing spatial operations you can eliminate irrelevant features by doing quick checks against rectangles before performing expensive point-in-polygon or nearest-neighbor operations.

The shapefile spatial index consists of two files: ".sbn" and ".sbx" - short for "spatial bins" and "spatial bin index".  ESRI never documented these file formats.  And whenever you edit a shapefile these indexes must be updated.  The open source community worked around this issue by creating an open spatial index for shapefiles called ".qix" - short for Quadtree index after the quadtree algorithm it uses.  But ESRI doesn't use qix and 3rd party shapefile software can't use sbn files.   This perputal incompatibility is a pain for everyone involved and results in corrupt indexes or extra coding to remove the incompatible files.

The Challenge

I am very close to ending this spatial index stalemate between ESRI software and open source.  But I have hit a wall.  I have been able to completey reverse engineer both the sbx and sbn file formats.  The problem is I don't fully understand a small portion of the sbn file.  I'm hoping the community at large will recognize the structure ESRI uses and open this format once and for all.

The Facts

The headers of both the sbx and sbn files are nearly identical to the shp and shx files.

The fixed-length record format of the sbx file is nearly identical to the documented shx file but references entries in the sbn file.  Just like the shp file any length fields are 16-bit words which you must double to get bytes.

The sbn file, as you might have guessed by now, contains "bins" which contain bounding boxes of individual features followed by the corresponding record number of that feature in the shp file.  For the bounding boxes in these bins ESRI did something very smart.  Most spatial index formats map floating point coordinates to an integer grid which is more efficient.  The point of a spatial index is not precision but relative accuracy.  However instead of using integers ESRI used chars allowing them to map coordinates to a 255x255 grid using only a single byte per coordinate instead of 4.   This trick stumped me for a while because I was looking for at least 4-byte ints.

After the header of the sbn file all of the bins are listed with a bin number and the number of records that bin contains.  After this portion the bin number is listed followed by each features bounding box and record number.  Fairly straight forward except for one thing.  In the bin list ESRI inserts empty bin records of bin # -1 to 0 with 0 # of shapes.  These "spacers" do not appear in the actual bin structure and seem to follow regular patterns of 1 to 14 empty bins in a row at different points. I considered that these empty bins were artifacts left over when an index is updated by ESRI software after editing.  However if you copy a shapefile, delete the index, and regenerate it using ArcGIS the structure looks exactly the same.  These blank bins are intentional.

Theories

I'm nearly certain ESRI did not use a quadtree for these bins because if you plot their extents they overlap which a quadtree avoids to my knowledge.  It is possible it is some version of an RTree and that the empty bins, if counted, represent tree nodes.  Because RTrees are built recursively and can contain hierarchial "empty" nodes this structure might make since.  But I just don't know enough about quadtree and RTree implementations to know what I'm looking at.  Once these zero bins are deciphered creating an ESRI-compatible reader and writer will not be difficult.  I also suspect that if we understand the structure of the file that the clustering algorithm used won't matter as long as the resulting index format is compliant.

What You Need to Get Started

I have created a zip file with enough tools and information to get a good look inside sbn and sbx files.  Download the Spatial Index Kit from the "Downloads" section of the pyshp Google Code site.  In this zip file you will find:
  • A very simple sbn format specification following the main shapefile spec conventions in pdf and excel format 
  • A very simple sbx format specification following the main shapefile spec conventions in pdf and excel format
  • A directory full of test shapefiles
  • A script, sbn.py, which reads a directory with sbn files and dumps the information in each sbn file into a text file with the same name as the shapefile. It also plots the bounding box of each bin and optionally each features box into an image using PIL and named as the shapefile. Configuration is done using variables inside the script.
  • A script, sbx.py, which also translates the contents of each sbx file into text files.
  • A script, fmtDecode.py, which is a brute force script I used to cycle through all possible data types when stepping through a binary file.  I'm vaguely aware of better tools for reverse engineering but this dumb script did the trick.
Good Luck!  Let me know if you have any questions.  I will update this post and create another post if anybody comes up with the answer.