Pages

Tuesday, October 4, 2011

Your Chance to Make GIS History

Update 6/13/12:  We finally reverse engineered the sbn indexing algorithm to an Esri-compatible level!  Please see this post: SBN Mystery Solved!


Update 3/2/12:  Getting closer still... Marc has determined there are two binning algorithms at work to determine which spatial bin a feature ends up in when a node on the tree is split as well as when a split should happen.  Also, another developer, named Francisco, has recently begun a C# implementation and seems to be making good progress.  In the comments on this post Francisco and Marc have been discussing the two algorithms.  It's hard to give a time estimate but Marc feels that this last binning/splitting case should complete the picture.


Update 11/3/11:  Work continues on this issue.  I hope to post soon on the remaining mysteries of the algorithm but we are pretty close.  I hope to do a more technical post on this issue soon to hopefully generate another round of feedback.  I will continue to post any breakthroughs here as well as create a new post when we are done.

Update 10/27/11:  Marc Pfister responded first and has been working tirelessly on this challenge. He's made a good bit of progress in unlocking the algorithm.  Si Parker joined us recently and brought some great insight that added some new direction to Marc's work.  While those guys worked on the hard stuff I created an sbx/sbn reader and writer module that can copy and manipulate these files. I believe at this point we could fool ESRI software but we are systematically closing in on the complete indexing scheme.  It's been fascinating to watch the techniques used which I'll share when it's done. 


This post is an open challenge to clever geospatial programmers everywhere. 

Background

A plot produced by sbn.py for a "world cities" shapefile spatial index
The tremendously successful shapefile format is generally considered an open format due to the fact that the shp, shx, and dbf files are documented.  But the 1998 "ESRI Shapefile Technical Description" doesn't tell the whole story.  Another part of the shapefile format is the spatial index. 

Spatial indexes create groups of features based on a given spatial clustering algorithm and define these clusters using a bounding box, usually mapped to an integer grid to avoid using floating point numbers.  When performing spatial operations you can eliminate irrelevant features by doing quick checks against rectangles before performing expensive point-in-polygon or nearest-neighbor operations.

The shapefile spatial index consists of two files: ".sbn" and ".sbx" - short for "spatial bins" and "spatial bin index".  ESRI never documented these file formats.  And whenever you edit a shapefile these indexes must be updated.  The open source community worked around this issue by creating an open spatial index for shapefiles called ".qix" - short for Quadtree index after the quadtree algorithm it uses.  But ESRI doesn't use qix and 3rd party shapefile software can't use sbn files.   This perputal incompatibility is a pain for everyone involved and results in corrupt indexes or extra coding to remove the incompatible files.

The Challenge

I am very close to ending this spatial index stalemate between ESRI software and open source.  But I have hit a wall.  I have been able to completey reverse engineer both the sbx and sbn file formats.  The problem is I don't fully understand a small portion of the sbn file.  I'm hoping the community at large will recognize the structure ESRI uses and open this format once and for all.

The Facts

The headers of both the sbx and sbn files are nearly identical to the shp and shx files.

The fixed-length record format of the sbx file is nearly identical to the documented shx file but references entries in the sbn file.  Just like the shp file any length fields are 16-bit words which you must double to get bytes.

The sbn file, as you might have guessed by now, contains "bins" which contain bounding boxes of individual features followed by the corresponding record number of that feature in the shp file.  For the bounding boxes in these bins ESRI did something very smart.  Most spatial index formats map floating point coordinates to an integer grid which is more efficient.  The point of a spatial index is not precision but relative accuracy.  However instead of using integers ESRI used chars allowing them to map coordinates to a 255x255 grid using only a single byte per coordinate instead of 4.   This trick stumped me for a while because I was looking for at least 4-byte ints.

After the header of the sbn file all of the bins are listed with a bin number and the number of records that bin contains.  After this portion the bin number is listed followed by each features bounding box and record number.  Fairly straight forward except for one thing.  In the bin list ESRI inserts empty bin records of bin # -1 to 0 with 0 # of shapes.  These "spacers" do not appear in the actual bin structure and seem to follow regular patterns of 1 to 14 empty bins in a row at different points. I considered that these empty bins were artifacts left over when an index is updated by ESRI software after editing.  However if you copy a shapefile, delete the index, and regenerate it using ArcGIS the structure looks exactly the same.  These blank bins are intentional.

Theories

I'm nearly certain ESRI did not use a quadtree for these bins because if you plot their extents they overlap which a quadtree avoids to my knowledge.  It is possible it is some version of an RTree and that the empty bins, if counted, represent tree nodes.  Because RTrees are built recursively and can contain hierarchial "empty" nodes this structure might make since.  But I just don't know enough about quadtree and RTree implementations to know what I'm looking at.  Once these zero bins are deciphered creating an ESRI-compatible reader and writer will not be difficult.  I also suspect that if we understand the structure of the file that the clustering algorithm used won't matter as long as the resulting index format is compliant.

What You Need to Get Started

I have created a zip file with enough tools and information to get a good look inside sbn and sbx files.  Download the Spatial Index Kit from the "Downloads" section of the pyshp Google Code site.  In this zip file you will find:
  • A very simple sbn format specification following the main shapefile spec conventions in pdf and excel format 
  • A very simple sbx format specification following the main shapefile spec conventions in pdf and excel format
  • A directory full of test shapefiles
  • A script, sbn.py, which reads a directory with sbn files and dumps the information in each sbn file into a text file with the same name as the shapefile. It also plots the bounding box of each bin and optionally each features box into an image using PIL and named as the shapefile. Configuration is done using variables inside the script.
  • A script, sbx.py, which also translates the contents of each sbx file into text files.
  • A script, fmtDecode.py, which is a brute force script I used to cycle through all possible data types when stepping through a binary file.  I'm vaguely aware of better tools for reverse engineering but this dumb script did the trick.
Good Luck!  Let me know if you have any questions.  I will update this post and create another post if anybody comes up with the answer.

9 comments:

  1. I tried a connect-the-dots approach to examine the order the bins were written in, and when you draw lines connecting the bounding box centroids in order, you start seeing the fingerprints of a z-order curve or some sort of quadtree:

    http://flic.kr/p/atrEim

    ReplyDelete
  2. I searched for an option of reading/writing shapefiles in Python without arcgisscripting and found this post. Without looking at this problem in particular I'd just like to leave a quick comment: Did you have a look if and how other open source programs solved that issue, e.g. MapWindow (http://www.mapwindow.org/apps/wiki/doku.php?id=sourcecode) ?
    Maybe you can find a hint in their code. Anyway, thanks for providing the python shapefile library!

    ReplyDelete
  3. First of all thanks for what you have done so far in the reverse ingineering process of sbn file. It help me a lot with the C# library I built to read shapefiles. Here are my two cents to this process

    The sbn is an spatial index based on a binary tree, where each node contains one or more boxes. When a node has too many boxes it splits. The split draws a line in the middle. Any box that touches the line is kept in the node, while any box on one side of the line is sent to the first child, and any box on the other side of the line is sent to the second child. A vertical split divides the square with a vertical line into two rectangles, and its first child receives the boxes to the right of the line and the second child receives the boxes to the left of the line. A horizontal split divides the rectangle with a horizontal line into two squares. The boxes above the line are sent to the first child while the boxes below to the second child. The root is a vertical split. The children of a vertical split are always horizontal splits, while the children of a horizontal split are always vertical splits.

    The list of bins is an array contaning the tree. the first element is the root node, the second and third elements its children the 4th and 5th are the children of the the 2nd bin and the 6th and 7th element are the children of the third bin, and so on. If a node has not been split, then all its descendants will be empty bin records. If a node has been split, but no box touches the splitting line, then than bin will be an empty bin record.

    The one thing I have not figure out yet is what's the splitting threshold, i.e., when a node is considered full and splitting occur. The files I have looked at have different numbers. One of them, which had a few shapes, does not have more than 30 boxes in a node, while another that has lots of shapes has no more than 700 shapes in a node. This is a problem when writing the sbn file. However is not a problem when reading it and using it. With the large file, when I read it without the index it took a few seconds to retrive the shapes that intersected certain small box, but when I started using the index file, it worked in the blink of an eye.

    ReplyDelete
  4. Francisco,

    Yes, it is a binary division. It's essentially an "Interval KD Tree" but the splits are at the midpoint of the node's range, not at the median point in the set. Also, there's no tree applied to the seam features.

    There is a splitting threshold of 8 features per cell initially. However, the tree size is fixed based on the total number of features. It selects the size of the tree so that there are 8 features or less per cell on average.

    Because you can only split the range of the 8 bit mapping so many times, the maximum size of the tree is fixed. How the space is split is also a bit odd. It doesn't split on a single value - there's actually a 1 unit "gutter" between the cells. There's also padding on the cells on the low edges.

    Once you implement all this you'll run into some strange behavior - features on the seams between splits won't end up in the bins where your algorithm predicts. It appears there's a second algorithm for the seam features. I've figured it out, but it only seems to apply to first 6 levels of the tree. I can see the justification for why there could be two algorithms, but I haven't completely solved the problem.

    But we are very close.

    ReplyDelete
  5. Mark,

    The seam problem is not a second algorithm. I figure it out back when I was trying to understand what's going on. It is not exactly a continous split, but rather a discrete one. Following these instructions you should be able to recreate the tree with no seam problem.

    Suppose we are doing the first vertical split, thus we have the whole shapefile bounding box, with xmin=0 and xmax=255. Then you calculate center = xmin + (xmax - xmin) / 2 + 1, where the division is an integer division, not real. As you shall see in a bit, this will always result in an even number.

    Suppose we have a shape and that shape.xmin and shape.xmax are the byte range for the x's. Then we have shape.xmin < shape.xmax, even if the feature is just a point. This is because when changing the real number to a byte, the xmin gets rounded down and xmax gets rounded up.

    To place the shape in the proper node in the tree, do the following. If shape.xmin <= center and shape.xmax>center, then the shape bounding box is assumed to touch the vertical line, even if the not-rounded bounding box does not touch the exactly vertical line. If shape.xmin > center, then the shape is placed on the first child or one of its descendant, and for the next vertical split xmin=center. If shape.xmax <= center, then the shape is placed on the second child and for the next vertical split xmax=center-1.

    Suppose we start with xmin=0 and xmax=255. Then center=128. Say we now want to search on the right. Then xmin=128 and xmax=255 and center=192. Say we want to go to the left now. Then xmin=128 and xmax=191 and center=160. xmin will always be an even number and xmax an odd number, in fact the width (xmax - xmin) is always a power of 2 minus one (255, 127, 63, 31, etc), which is what causes the center to always be even.

    The same principle applies to horizontal splits. I followed this "discrete" splitting scheme and checked with all the features in my shapefiles, and I got them all placed in the correct bins. Please let me know if it works for you too.

    ReplyDelete
  6. Francisco,

    Yes, that is basically the definition of binary division and we have been using it for a while. You can see it in this photo which shows the 1 integer unit seams between cells.

    However, if you test enough shapefiles with it you will start to find a small percentage of seam features do not end up where you expect them to be. Typically, they will be one to two levels higher in the tree. There appears to be a second (in the sense of additional, not different) algorithm that is run on the seam features, probably for some sort of optimization.

    ReplyDelete
  7. I think I have the second algorithm figured out. I can successfully bin my 3000 test shapefiles (randomly generated with fixed extents). I'm still getting some random misbins on the World Cities shapefile but I think these are due to minor errors in implementation.

    In a nutshell, the algorithm examines a bin and checks to see if its grandchildren contain features. If not, it pulls up any features from its child nodes.

    ReplyDelete
  8. When I was researching on some of engineering jobs, this was one of the topics they asked me about. It was actually a hot issue with them and that reverse engineering is one of their tools for security purposes.

    ReplyDelete
  9. I found the best post.I found best information to update my current skills.I got a confidence to apply for GIS Jobs in Hyderabad.

    ReplyDelete