February 14, 2013

TopoJSON

Last week I updated Shape Escape to convert shapefiles to GeoJSON and TopoJSON, in the hope of making it easier for developers to quickly turn a web-unfriendly shapefile into a vector format that is useful on the client side.

First, it's worth noting that TopoJSON has got a bunch of webmap folks excited. And rightfully so. Many people seem excited because of the topology (although I haven't seen many non-demo sites taking advantage of it yet), and also because it advertises a more compact representation of data. Clearly, if the topology it makes available will help your visualization, TopoJSON is the way to go. But what's this about a smaller representation?

As noted on the wiki page, 'TopoJSON can also be more efficient to render since shared control points need only be projected once. To further reduce file size, TopoJSON uses fixed-precision delta-encoding for integer coordinates rather than floats. This eliminates the need to round the precision of coordinate values (e.g., LilJSON), without sacrificing accuracy'. Sounds good! But what it doesn't mention explicitly is that in order to be efficient about the topology (shared vertices) and the deltas, the coordinate pairs undergo quantization.

No idea what that last paragraph was all about? Delta encoding remind you of an airline? Quantization not on your daily word calendar?

Regarding the delta encoding of coordinate pairs: it's a great way to save bits by storing each vertex as an offset relative to the previous one. Encoded polylines, for example, use the technique to great effect (except per geometry, ignoring topology). Version control for your software (think diffs) works similarly. Anyway, that accounts for some great space savings; excellent.
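To make the delta idea concrete, here is a minimal sketch in Python. It is not TopoJSON's actual code; the function names and the sample coordinates are made up for illustration, and it assumes the coordinates are already integers.

```python
def delta_encode(points):
    """Store the first (x, y) pair as-is and every later pair as an offset
    from the pair before it."""
    encoded = []
    prev_x, prev_y = 0, 0
    for x, y in points:
        encoded.append((x - prev_x, y - prev_y))
        prev_x, prev_y = x, y
    return encoded


def delta_decode(deltas):
    """Rebuild absolute coordinates by accumulating the offsets."""
    points = []
    x, y = 0, 0
    for dx, dy in deltas:
        x += dx
        y += dy
        points.append((x, y))
    return points


# Hypothetical integer coordinates for a short line.
line = [(4000, 7000), (4002, 7001), (4005, 6999), (4005, 6995)]
deltas = delta_encode(line)
print(deltas)  # [(4000, 7000), (2, 1), (3, -2), (0, -4)]
```

After the first vertex, the numbers are small offsets rather than full coordinates, which is where the space savings come from when the result is written out as text.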

So what about the quantization part? One way to think of it is that the geometries are simplified in a "snap to grid" fashion, which means the size of the grid you're snapping to gives you a tradeoff between compactness and accuracy. The coarser your grid, the more vertices may get snapped to a single location (and the further they may be moved from their original location). In essence this does the same thing as rounding your original coordinates (e.g. lopping off significant digits from your lat/lng), so the quantization part of the conversion does cause some loss of accuracy. So what is the accuracy loss? Even if it's not discernible for a non-zoomable map, what does it mean for the traditional slippy-map developer?
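A rough sketch of the "snap to grid" step, in Python. This is my own illustration under simple assumptions (the grid spans the bounding box of the input, and n is the number of distinguishable positions per dimension); it is not the topojson implementation, and the function names and sample points are hypothetical.

```python
def quantize(points, n):
    """Snap (x, y) floats onto an n-by-n integer grid spanning their bounding box.

    Assumes the points have a non-zero extent in both dimensions.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x0, y0 = min(xs), min(ys)
    kx = (n - 1) / (max(xs) - x0)  # grid steps per coordinate unit
    ky = (n - 1) / (max(ys) - y0)
    grid = [(round((x - x0) * kx), round((y - y0) * ky)) for x, y in points]
    return grid, (x0, y0, kx, ky)


def dequantize(grid, transform):
    """Map grid positions back to (approximate) original coordinates."""
    x0, y0, kx, ky = transform
    return [(x0 + qx / kx, y0 + qy / ky) for qx, qy in grid]


# Two nearby vertices inside a 10 x 10 degree extent.
points = [(0.0, 0.0), (5.0001, 5.0001), (5.0004, 5.0004), (10.0, 10.0)]

coarse, _ = quantize(points, 11)
print(coarse)  # [(0, 0), (5, 5), (5, 5), (10, 10)]

fine, transform = quantize(points, 100001)
restored = dequantize(fine, transform)
```

On the coarse grid the two nearby vertices collapse into one location; on the fine grid they stay distinct. In this sketch a vertex moves by at most half a grid step in each dimension, so the worst case shrinks as n grows.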

To help illustrate, ShpEscape outputs TopoJSON with a few different grid sizes (or, as the documentation describes the -q option, 'max. differentiable points in one dimension'), so that you can select from a variety of output options (figure 1) for each upload.

Ok great, but which option should you choose for your mapping needs?

A quick experiment: I uploaded some Natural Earth country borders and the US Census California Counties to ShpEscape. There's a big difference in these images, and my conclusion is that if you want to use TopoJSON in a slippy map, you should consider how far your users may zoom in and how important it is not to lose detail, and probably avoid the default 10k quantize parameter.

Additionally, if you do decide to stick with GeoJSON, don't feel too bad: The 90% savings that first jumps out at you might not end up as big as you think. Below, for example, are the numbers (in kB) for the CA Counties:



Format                  Size (kB)
GeoJSON                     6,322
TopoJSON [default]            454
TopoJSON [100M]             1,539
GeoJSON [gzip]              1,418
TopoJSON [100M gzip]          556


TopoJSON is still the clear winner in this experiment, at a bit over 1/3 the size of the gzipped GeoJSON when sent over the wire. But there's still some cost (for example, the additional topojson.js library), and I also didn't experiment with liljson, which could potentially save some space on the GeoJSON side.
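For anyone who wants to check the arithmetic, here is a small Python sketch that recomputes the ratios from the CA Counties numbers above. The dictionary keys and the helper name are my own labels, nothing more.

```python
# Sizes in kB, copied from the CA Counties figures above.
sizes_kb = {
    "geojson": 6322,
    "topojson_default": 454,
    "topojson_100m": 1539,
    "geojson_gzip": 1418,
    "topojson_100m_gzip": 556,
}


def savings(smaller, larger):
    """Fraction of the larger size that is saved by using the smaller one."""
    return 1 - smaller / larger


raw_default = savings(sizes_kb["topojson_default"], sizes_kb["geojson"])
raw_100m = savings(sizes_kb["topojson_100m"], sizes_kb["geojson"])
wire_ratio = sizes_kb["topojson_100m_gzip"] / sizes_kb["geojson_gzip"]

print(f"default TopoJSON vs GeoJSON, uncompressed: {raw_default:.0%} smaller")
print(f"100M TopoJSON vs GeoJSON, uncompressed:    {raw_100m:.0%} smaller")
print(f"100M TopoJSON vs GeoJSON, both gzipped:    {wire_ratio:.0%} of the size")
```

The uncompressed saving is roughly 93% at the default grid and roughly 76% at 100M, while the gzipped 100M TopoJSON comes out at roughly 39% of the gzipped GeoJSON.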

Finally, don't take the above figures too seriously -- YMMV with different datasets; this one for example has polygons with shared borders, and a relatively even distribution. Instead, be thankful we have a new awesome option for sending vectors around, and use it with care.
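If you'd like to run the same comparison on your own data, something along these lines should do. It is a sketch using only Python's standard library; the sample geometry is a made-up stand-in, and in practice you would read the text of your own GeoJSON and TopoJSON files instead.

```python
import gzip
import json


def wire_sizes(text):
    """Return (raw_bytes, gzipped_bytes) for a JSON document given as a string."""
    raw = text.encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Hypothetical stand-in for a real file: a GeoJSON-style line with 1000 vertices.
sample = json.dumps({
    "type": "LineString",
    "coordinates": [[-120.0 + i * 0.001, 35.0 + i * 0.002] for i in range(1000)],
})

raw_size, gz_size = wire_sizes(sample)
print(f"raw: {raw_size} bytes, gzipped: {gz_size} bytes")
```

Comparing the gzipped numbers rather than the raw ones gives a better picture of what actually travels over the wire, assuming your server compresses its responses.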