Update 3/2/12: Getting closer still... Marc has determined there are two binning algorithms at work to determine which spatial bin a feature ends up in when a node on the tree is split as well as when a split should happen. Also, another developer, named Francisco, has recently begun a C# implementation and seems to be making good progress. In the comments on this post Francisco and Marc have been discussing the two algorithms. It's hard to give a time estimate but Marc feels that this last binning/splitting case should complete the picture.
Update 11/3/11: Work continues on this issue. I hope to post soon on the remaining mysteries of the algorithm but we are pretty close. I hope to do a more technical post on this issue soon to hopefully generate another round of feedback. I will continue to post any breakthroughs here as well as create a new post when we are done.
Update 10/27/11: Marc Pfister responded first and has been working tirelessly on this challenge. He's made a good bit of progress in unlocking the algorithm. Si Parker joined us recently and brought some great insight that added some new direction to Marc's work. While those guys worked on the hard stuff I created an sbx/sbn reader and writer module that can copy and manipulate these files. I believe at this point we could fool ESRI software but we are systematically closing in on the complete indexing scheme. It's been fascinating to watch the techniques used which I'll share when it's done.
This post is an open challenge to clever geospatial programmers everywhere.
Background
A plot produced by sbn.py for a "world cities" shapefile spatial index |
Spatial indexes create groups of features based on a given spatial clustering algorithm and define these clusters using a bounding box, usually mapped to an integer grid to avoid using floating point numbers. When performing spatial operations you can eliminate irrelevant features by doing quick checks against rectangles before performing expensive point-in-polygon or nearest-neighbor operations.
The shapefile spatial index consists of two files: ".sbn" and ".sbx" - short for "spatial bins" and "spatial bin index". ESRI never documented these file formats. And whenever you edit a shapefile these indexes must be updated. The open source community worked around this issue by creating an open spatial index for shapefiles called ".qix" - short for Quadtree index after the quadtree algorithm it uses. But ESRI doesn't use qix and 3rd party shapefile software can't use sbn files. This perputal incompatibility is a pain for everyone involved and results in corrupt indexes or extra coding to remove the incompatible files.
The Challenge
I am very close to ending this spatial index stalemate between ESRI software and open source. But I have hit a wall. I have been able to completey reverse engineer both the sbx and sbn file formats. The problem is I don't fully understand a small portion of the sbn file. I'm hoping the community at large will recognize the structure ESRI uses and open this format once and for all.
The Facts
The headers of both the sbx and sbn files are nearly identical to the shp and shx files.
The fixed-length record format of the sbx file is nearly identical to the documented shx file but references entries in the sbn file. Just like the shp file any length fields are 16-bit words which you must double to get bytes.
The sbn file, as you might have guessed by now, contains "bins" which contain bounding boxes of individual features followed by the corresponding record number of that feature in the shp file. For the bounding boxes in these bins ESRI did something very smart. Most spatial index formats map floating point coordinates to an integer grid which is more efficient. The point of a spatial index is not precision but relative accuracy. However instead of using integers ESRI used chars allowing them to map coordinates to a 255x255 grid using only a single byte per coordinate instead of 4. This trick stumped me for a while because I was looking for at least 4-byte ints.
After the header of the sbn file all of the bins are listed with a bin number and the number of records that bin contains. After this portion the bin number is listed followed by each features bounding box and record number. Fairly straight forward except for one thing. In the bin list ESRI inserts empty bin records of bin # -1 to 0 with 0 # of shapes. These "spacers" do not appear in the actual bin structure and seem to follow regular patterns of 1 to 14 empty bins in a row at different points. I considered that these empty bins were artifacts left over when an index is updated by ESRI software after editing. However if you copy a shapefile, delete the index, and regenerate it using ArcGIS the structure looks exactly the same. These blank bins are intentional.
Theories
I'm nearly certain ESRI did not use a quadtree for these bins because if you plot their extents they overlap which a quadtree avoids to my knowledge. It is possible it is some version of an RTree and that the empty bins, if counted, represent tree nodes. Because RTrees are built recursively and can contain hierarchial "empty" nodes this structure might make since. But I just don't know enough about quadtree and RTree implementations to know what I'm looking at. Once these zero bins are deciphered creating an ESRI-compatible reader and writer will not be difficult. I also suspect that if we understand the structure of the file that the clustering algorithm used won't matter as long as the resulting index format is compliant.
What You Need to Get Started
I have created a zip file with enough tools and information to get a good look inside sbn and sbx files. Download the Spatial Index Kit from the "Downloads" section of the pyshp Google Code site. In this zip file you will find:
- A very simple sbn format specification following the main shapefile spec conventions in pdf and excel format
- A very simple sbx format specification following the main shapefile spec conventions in pdf and excel format
- A directory full of test shapefiles
- A script, sbn.py, which reads a directory with sbn files and dumps the information in each sbn file into a text file with the same name as the shapefile. It also plots the bounding box of each bin and optionally each features box into an image using PIL and named as the shapefile. Configuration is done using variables inside the script.
- A script, sbx.py, which also translates the contents of each sbx file into text files.
- A script, fmtDecode.py, which is a brute force script I used to cycle through all possible data types when stepping through a binary file. I'm vaguely aware of better tools for reverse engineering but this dumb script did the trick.