Wednesday, November 2, 2011

Generating Shapefile shx Files

Shapefile shx files help software locate records
quickly but they are not strictly necessary. The
shapefile software can manually browse the
records to answer a query.
Lately I've been following traffic and responding to posts on the excellent site GIS StackExchange.  There are several questions about shapefile shx files which also point to even more questions in the ESRI forums on this topic.

If for some reason, you end up with a shapefile that is missing the shx file then most software is going to complain and refuse to deal with it.  The shapefile spec requires, at a minimum, that you have an shp, shx, and dbf file to have a complete file.  However this requirement is not a technical requirement and a lot of people seem to be confused about that. 

The shx file is a trivial index file that provides fixed-length records pointing to the byte offsets of records in  the shp file only.  It does not connect the shp file and dbf file in any way nor does it contain any sort of record number.  There are no record numbers stored in any of the three standard files which is often a point of confusion.  The software reading a shapefile has to count the number of records read to determine the record id (geometry and attributes).  If you wrote a program to randomly select a record from a shapefile there is no way to tell what the record number is by the record contents.

The purpose of the shx file is to provide faster access to a particular record in a shapefile without storing the entire record set of the shp and dbf files in memory.  The header of the shx file is 100 bytes long.  Each record is 8 bytes long.  So if I want to access record 3, I know that 2*8  = 16 and I can jump to byte 100+16=116 in the shx file, read the 8-byte record to get the offset and record length within the shp file, and then jump straight to that location in the shp file.

While the shx file is convienient it isn't necessary.  Most software balks if it is not there though.  However pyshp handles it gracefully.  If the shx index is there it is used for record access, if not then pyshp reads through the shp records into memory and handles the records as a python list.

Sometimes shx files become corrputed or go missing.  You can build a new shx index using pyshp.  It's kind of a hack but still very simple. In the following example we build an index file for a point shapefile named "myshape" that has two files: "myshape.shp" and "myshape.dbf"

# Build a new shx index file
import shapefile
# Explicitly name the shp and dbf file objects
# so pyshp ignores the missing/corrupt shx
myshp = open("myshape.shp", "rb")
mydbf = open("myshape.dbf", "rb")
r = shapefile.Reader(shp=myshp, shx=None, dbf=mydbf)
w = shapefile.Writer(r.shapeType)
# Copy everything from reader object to writer object
w._shapes = r.shapes()
w.records = r.records()
w.fields = list(r.fields)
# saving will generate the shx"myshape")

If the shx file is missing it will be created.  If it's corrupt it will be overwritten. So the moral of the story is because shapefiles consist of multiple files, it is actually a robust format. The data in the individual files can usually be accessed in isolation from the other files despite what the standard requires - assuming the software you're using is willing to cooperate.

No comments:

Post a Comment