Wednesday, November 2, 2011

Generating Shapefile shx Files

Shapefile shx files help software locate records quickly but they are not strictly necessary. The shapefile software can manually browse the records to answer a query.

Lately I've been following traffic and responding to posts on the excellent site GIS StackExchange. There are several questions there about shapefile shx files, which in turn point to even more questions on this topic in the ESRI forums.

If for some reason you end up with a shapefile that is missing its shx file, most software is going to complain and refuse to deal with it. The shapefile spec requires, at a minimum, an shp, shx, and dbf file for a complete shapefile. However, that requirement comes from the standard rather than from any technical necessity, and a lot of people seem to be confused about that.

The shx file is a trivial index file made up of fixed-length records that point to the byte offsets of records in the shp file only. It does not connect the shp file and dbf file in any way, nor does it contain any sort of record number. The dbf file stores no record numbers either, which is often a point of confusion. Each shp record header does carry a record number, but nothing in the shx or dbf file refers to it, so the software reading a shapefile has to count the records it reads to match a geometry with its attributes. If you wrote a program to randomly select a row from the dbf file, there would be no way to tell its record number from the row's contents.

The purpose of the shx file is to provide faster access to a particular record in a shapefile without storing the entire record set of the shp and dbf files in memory. The header of the shx file is 100 bytes long. Each record is 8 bytes long. So if I want to access record 3, I skip the first two records (2*8 = 16 bytes), jump to byte 100+16=116 in the shx file, read the 8-byte record to get the offset and length of the record within the shp file, and then jump straight to that location in the shp file. Both values are stored in 16-bit words, so the byte position in the shp file is twice the stored offset.
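The lookup arithmetic above can be sketched in a few lines of plain Python. This is only an illustration, not how any particular library does it: the function name `shx_lookup` and the in-memory `fake_shx` buffer are made up for the example, and the buffer's header is left as 100 zero bytes because the lookup never reads it.

```python
import io
import struct

SHX_HEADER_BYTES = 100  # fixed-size header at the start of the shx file
SHX_RECORD_BYTES = 8    # two big-endian 32-bit integers per index record


def shx_lookup(shx, record_number):
    """Return (byte_offset, content_bytes) in the shp file for a 1-based record."""
    # Record 3 lives at 100 + (3 - 1) * 8 = 116 in the shx file.
    shx.seek(SHX_HEADER_BYTES + (record_number - 1) * SHX_RECORD_BYTES)
    offset_words, length_words = struct.unpack(">ii", shx.read(SHX_RECORD_BYTES))
    # The index stores both values in 16-bit words, so double them for bytes.
    return offset_words * 2, length_words * 2


# Hypothetical index for three point records: each one occupies 28 bytes
# (14 words) in the shp file, so the stored offsets are 50, 64 and 78 words.
fake_shx = io.BytesIO(
    bytes(SHX_HEADER_BYTES)
    + struct.pack(">ii", 50, 10)
    + struct.pack(">ii", 64, 10)
    + struct.pack(">ii", 78, 10)
)
print(shx_lookup(fake_shx, 3))  # (156, 20)
```

The returned length covers only the record contents; the 8-byte record header that precedes each record in the shp file is not counted.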

While the shx file is convenient, it isn't necessary. Most software balks if it is not there, though. However, pyshp handles it gracefully. If the shx index is there, it is used for record access; if not, pyshp reads the shp records into memory and handles them as a Python list.

Sometimes shx files become corrupted or go missing. You can build a new shx index using pyshp. It's kind of a hack but still very simple. In the following example we build an index file for a point shapefile named "myshape" that has two files: "myshape.shp" and "myshape.dbf".

# Build a new shx index file
# (uses the pyshp 1.x API; pyshp 2.x dropped Writer.save()
# and takes the target path in the Writer constructor instead)
import shapefile
# Explicitly name the shp and dbf file objects
# so pyshp ignores the missing/corrupt shx
myshp = open("myshape.shp", "rb")
mydbf = open("myshape.dbf", "rb")
r = shapefile.Reader(shp=myshp, shx=None, dbf=mydbf)
w = shapefile.Writer(r.shapeType)
# Copy everything from reader object to writer object
w._shapes = r.shapes()
w.records = r.records()
w.fields = list(r.fields)
# Everything is in memory now, so release the input files
myshp.close()
mydbf.close()
# saving will generate the shx
w.save("myshape")

If the shx file is missing, it will be created. If it's corrupt, it will be overwritten. So the moral of the story is that because shapefiles consist of multiple files, the format is actually quite robust. The data in the individual files can usually be accessed in isolation from the other files despite what the standard requires, assuming the software you're using is willing to cooperate.
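Because the index only records where each shp record starts and how long it is, it can also be rebuilt without pyshp by walking the record headers in the shp file. The following is a minimal sketch of that idea using only the standard library, not the method pyshp uses internally. The function name `build_shx` is made up for the example, and it assumes the shp file itself is intact and that reading until end-of-file is an acceptable way to find the last record.

```python
import struct

HEADER_BYTES = 100        # shp and shx headers share the same 100-byte layout
RECORD_HEADER_BYTES = 8   # record number + content length, both big-endian


def build_shx(shp_path, shx_path):
    """Write an shx index for shp_path and return the number of records found."""
    entries = []
    with open(shp_path, "rb") as shp:
        header = bytearray(shp.read(HEADER_BYTES))
        if len(header) != HEADER_BYTES:
            raise ValueError("shp file is shorter than its 100-byte header")
        offset = HEADER_BYTES
        while True:
            record_header = shp.read(RECORD_HEADER_BYTES)
            if len(record_header) < RECORD_HEADER_BYTES:
                break  # end of file (or a truncated trailing record header)
            _record_number, length_words = struct.unpack(">ii", record_header)
            if length_words < 0:
                raise ValueError("negative content length at byte %d" % offset)
            # The index stores the record's start offset and content length,
            # both expressed in 16-bit words.
            entries.append((offset // 2, length_words))
            # Skip over the record contents to reach the next record header.
            offset += RECORD_HEADER_BYTES + length_words * 2
            shp.seek(offset)

    # Reuse the shp header, replacing only the file-length field
    # (bytes 24-27, in 16-bit words) with the length of the index file.
    shx_length_words = (HEADER_BYTES + 8 * len(entries)) // 2
    header[24:28] = struct.pack(">i", shx_length_words)

    with open(shx_path, "wb") as shx:
        shx.write(header)
        for offset_words, length_words in entries:
            shx.write(struct.pack(">ii", offset_words, length_words))
    return len(entries)
```

Calling `build_shx("myshape.shp", "myshape.shx")` would write the index next to the existing files. Unlike the pyshp approach, this leaves the shp and dbf files untouched, which may be preferable when only the index is damaged.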

4 comments:

  1. I had an issue with AutoCAD where I would use Map Export to create my shapefile, then when I imported the file into another drawing to check that it had exported correctly, I would get an error about the dbf and shp files not matching. Your software corrected it, thanks very much!

  2. Oops, not the shp file: the error said the dbf is corrupt or does not have the same number of objects as the shx file. I ran your program and it works, pretty simple. I wonder why it's not working properly in the first place, though. Any ideas?

  3. This is just what I needed! I modified it slightly into a standalone module/program so I could fix several shapefiles which were missing their SHX.

    def RebuildShx(path):
        '''This code based on http://geospatialpython.com/2011/11/generating-shapefile-shx-files.html'''
        print(path)

        # Build a new shx index file
        import shapefile
        # Explicitly name the shp and dbf file objects
        # so pyshp ignores the missing/corrupt shx
        myshp = open(path+".shp", "rb")
        mydbf = open(path+".dbf", "rb")
        r = shapefile.Reader(shp=myshp, shx=None, dbf=mydbf)
        w = shapefile.Writer(r.shapeType)
        # Copy everything from reader object to writer object
        w._shapes = r.shapes()
        w.records = r.records()
        w.fields = list(r.fields)
        # saving will generate the shx
        w.save(path+"_fixed")

    # I got this idea from the python help: 6 Modules 6.1.1 Executing modules as scripts
    # Basically it means you can just run this module as a command with the item after the module name as input.
    if __name__ == "__main__":
        import sys
        if len(sys.argv)>1:
            RebuildShx(sys.argv[1])

  4. Hello,

    Novice Python user here. I pasted code into Pyscripter, supplied pathname to my shapefile. I get "Import Error: No module named shapefile." How to avoid this error? Thank you.
