Wednesday, June 26, 2013

Dots in Shapefile Names

I came across a post from another blogger recently which highlighted an issue I hadn't considered before.  Most operating systems allow for an arbitrary number of periods "." in a file name.  For example your shp file in a shapefile set might be named "cities.2013.v.1.2.shp".  While I'm aware of that fact I did not really take it into consideration when designing PyShp.  Here are some factors I did consider:
1. There are at least 3 and up to 9 file types in a single shapefile data set.
2. Dbf files are a fairly common data format for file-based databases.
3. The shx file is just an index and does not contain critical data.

Given these considerations I made it possible to just specify the the base name of a shapefile when reading or writing or you can add an extension IF you want.  Pyshp doesn't rely on the extension.  So in the above example I could just use "cities.2013.v.1.2" in theory.  

That way if one of the files was missing or for some reason you had no control over how the file name was passed to your program the software would just work.  I figured this approach would be the most robust and intuitive, user-friendly method.  In fact if you DO specify a file extension PyShp just chops is off to get the base name using Python's os.path.splitext() module method and then rebuilds the requisite file names before checking to see if they even exist.

There's only one problem.  If you use the base file name AND you have extra dots in your file name as in the example, Python still chops off everything after the last dot assuming it is the file extension.  This functionality is fair since there's no way programatically to be sure what is an extension and what is not.  You could use a look-up table in this context but that can lead to other problems.  So in the above example if you pass the base file name to the PyShp Reader object the base name would be shortened to "cities.2013.v.1" which would result in IO exceptions.

The safe workaround is to just always include a three-letter extension in the file name if you might have dots. I may eventually try to program around this but it's kind of a corner case with a reasonable workaround.  

1 comment:

  1. Here's a possible fix that I've used for my own scripts. Since most people use pyshp to either access the shapefile in its entirety or just to read the Dbf file you could just test if the pathname ends with .dbf or .shp with the .endswith() method. If it finds those ending letters you can safely assume that those are extension names and you can proceed with cutting them of with your .splitext() method. If they aren't found then you already have the base name so just run with that. The input pathname can have all the dots it wants :) If you're busy with other things I'd be happy to contribute the code to the google code website...