Friday, November 4, 2011

Deleting Shapefile Features

Sometimes, usually as a server-based operation, you need to delete all of the features in a shapefile. All you want left is the shapefile type, the dbf schema, and maybe the overall bounding box. This shapefile stub can then be updated by other processes. Pyshp currently doesn't have an explicit "delete" method. But because pyshp converts everything to native Python data types (strings, lists, dicts, numbers) you can usually manipulate things fairly easily. The solution is very similar to merging shapefiles but instead you are writing back to the same file instead of a new copy. There's only one hitch in this operation resulting from a minor difference in the pyshp Reader and Writer objects. In the reader the "bbox" property returns a static array of [xmin, ymin, xmax, ymax]. The Writer also has a "bbox" property but it is a method that is called when you save the shapefile. The Writer calculates the bounding box on the fly by reading all of the shapes just before saving. But in this case there are no shapes so the method would throw an error. So what we do is just override that method with a lambda function to return whatever bbox we want whether it be the original bbox or a made up one.
import shapefile 
# Read the shapefile we want to clear out
r = shapefile.Reader("myshape") 
# Create a Writer of the same type to save out as a blank
w = shapefile.Writer(r.shapeType) 
# This line will give us the same dbf schema 
w.fields = r.fields 
# Use the original bounding box in the header 
w.bbox = lambda: r.bbox 
# Save the featureless, attribute-less shapefile"myshape") 
Instead of using the original bounding box we could just populate it with 0's and let the update process repopulate it:
w.bbox = lambda: [0.0, 0.0, 0.0, 0.0]
Note coordinates in a shapefile must always be floating-point numbers. Sometimes you may not want to delete all of the features. You may want to select certain features by attribute or using a spatial operation.

Wednesday, November 2, 2011

Generating Shapefile shx Files

Shapefile shx files help software locate records
quickly but they are not strictly necessary. The
shapefile software can manually browse the
records to answer a query.
Lately I've been following traffic and responding to posts on the excellent site GIS StackExchange.  There are several questions about shapefile shx files which also point to even more questions in the ESRI forums on this topic.

If for some reason, you end up with a shapefile that is missing the shx file then most software is going to complain and refuse to deal with it.  The shapefile spec requires, at a minimum, that you have an shp, shx, and dbf file to have a complete file.  However this requirement is not a technical requirement and a lot of people seem to be confused about that. 

The shx file is a trivial index file that provides fixed-length records pointing to the byte offsets of records in  the shp file only.  It does not connect the shp file and dbf file in any way nor does it contain any sort of record number.  There are no record numbers stored in any of the three standard files which is often a point of confusion.  The software reading a shapefile has to count the number of records read to determine the record id (geometry and attributes).  If you wrote a program to randomly select a record from a shapefile there is no way to tell what the record number is by the record contents.

The purpose of the shx file is to provide faster access to a particular record in a shapefile without storing the entire record set of the shp and dbf files in memory.  The header of the shx file is 100 bytes long.  Each record is 8 bytes long.  So if I want to access record 3, I know that 2*8  = 16 and I can jump to byte 100+16=116 in the shx file, read the 8-byte record to get the offset and record length within the shp file, and then jump straight to that location in the shp file.

While the shx file is convienient it isn't necessary.  Most software balks if it is not there though.  However pyshp handles it gracefully.  If the shx index is there it is used for record access, if not then pyshp reads through the shp records into memory and handles the records as a python list.

Sometimes shx files become corrputed or go missing.  You can build a new shx index using pyshp.  It's kind of a hack but still very simple. In the following example we build an index file for a point shapefile named "myshape" that has two files: "myshape.shp" and "myshape.dbf"

# Build a new shx index file
import shapefile
# Explicitly name the shp and dbf file objects
# so pyshp ignores the missing/corrupt shx
myshp = open("myshape.shp", "rb")
mydbf = open("myshape.dbf", "rb")
r = shapefile.Reader(shp=myshp, shx=None, dbf=mydbf)
w = shapefile.Writer(r.shapeType)
# Copy everything from reader object to writer object
w._shapes = r.shapes()
w.records = r.records()
w.fields = list(r.fields)
# saving will generate the shx"myshape")

If the shx file is missing it will be created.  If it's corrupt it will be overwritten. So the moral of the story is because shapefiles consist of multiple files, it is actually a robust format. The data in the individual files can usually be accessed in isolation from the other files despite what the standard requires - assuming the software you're using is willing to cooperate.