February 14, 2013


Last week I updated Shape Escape to convert shapefiles to GeoJSON and TopoJSON, in the hopes of making it easier for developers to quickly get that web-unfriendly shapefile into a useful client-side vector format.

First, it's worth noting that TopoJSON has got a bunch of webmap folks excited. And rightfully so. Many people seem excited because of the topology itself (although I haven't seen many non-demo sites taking advantage of it yet), and also because it advertises a more compact representation of data. Clearly, if the topology it makes available will help your visualization, TopoJSON is the way to go. But what's this about a smaller representation?

As noted on the wiki page, 'TopoJSON can also be more efficient to render since shared control points need only be projected once. To further reduce file size, TopoJSON uses fixed-precision delta-encoding for integer coordinates rather than floats. This eliminates the need to round the precision of coordinate values (e.g., LilJSON), without sacrificing accuracy'. Sounds good! But what it doesn't mention explicitly is that in order to be efficient about the topology (shared vertices) and the deltas, the coordinate pairs undergo quantization.

No idea what that last paragraph was all about? Delta encoding remind you of an airline? Quantization not on your daily word calendar?

Regarding the delta encoding of coordinate pairs, it's a great way to save bits by referencing the location of vertices as a relative offset -- for example encoded polylines use the technique to great effect (except per geometry, ignoring topology). Version control for your software (think diffs) works similarly. Anyway, that accounts for some great space savings; excellent.
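Delta encoding itself is simple enough to sketch in a few lines of Python. This is illustrative only -- real TopoJSON quantizes coordinates to integers first and then delta-encodes each arc -- but the space-saving idea is the same:

```python
# Illustrative delta encoding of a coordinate ring: each vertex is stored
# as an offset from the previous one, so the stored numbers stay small.
def delta_encode(points):
    """Store each (x, y) vertex as a delta from the previous vertex."""
    encoded = []
    prev_x, prev_y = 0, 0
    for x, y in points:
        encoded.append((x - prev_x, y - prev_y))
        prev_x, prev_y = x, y
    return encoded

def delta_decode(deltas):
    """Reverse the encoding by accumulating the offsets."""
    points, x, y = [], 0, 0
    for dx, dy in deltas:
        x += dx
        y += dy
        points.append((x, y))
    return points

ring = [(1000, 2000), (1001, 2003), (999, 2005)]
# Deltas like (1, 3) and (-2, 2) need far fewer digits than the absolute
# coordinates, which is where the space savings come from.
```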

So what about the quantization part? One way to think of it is that the geometries are simplified in a "snap to grid" fashion, which means the size of the grid you're snapping to gives you a tradeoff between compactness and accuracy. The coarser your grid, the more vertices may get snapped to a single location (and the further they may be moved from their original locations). Since rounding your original coordinates (e.g. lopping significant digits off your lat/lng) does essentially the same thing, the quantization step of the conversion does cause some loss of accuracy. So how big is the accuracy loss? Even if it's not discernible on a non-zoomable map, what does it mean for the traditional slippy-map developer?
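A sketch of what that "snap to grid" quantization looks like, assuming a q-by-q integer grid over the data's bounding box. This captures the spirit of the -q option, not the actual topojson implementation:

```python
# Snap coordinates onto a q-by-q integer grid covering the bounding box
# (illustrative sketch of quantization, not the real topojson code).
def quantize(lon, lat, bbox, q=10000):
    """Map a coordinate onto a q-by-q integer grid covering bbox."""
    x0, y0, x1, y1 = bbox
    gx = round((lon - x0) / (x1 - x0) * (q - 1))
    gy = round((lat - y0) / (y1 - y0) * (q - 1))
    return gx, gy

def dequantize(gx, gy, bbox, q=10000):
    """Map grid coordinates back to (approximate) real coordinates."""
    x0, y0, x1, y1 = bbox
    return (x0 + gx * (x1 - x0) / (q - 1),
            y0 + gy * (y1 - y0) / (q - 1))

bbox = (-125.0, 32.0, -114.0, 42.0)  # roughly California's extent
gx, gy = quantize(-122.4194, 37.7749, bbox)
lon, lat = dequantize(gx, gy, bbox)
# With q=10000 over ~11 degrees of longitude, a grid cell is about 0.0011
# degrees (~100 m) across, so the round trip moves the point slightly.
```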

To help illustrate, ShpEscape outputs TopoJSON at a few different grid sizes (or, as the documentation describes the -q option, 'max. differentiable points in one dimension'), so you can select from a variety of output options (figure 1) for each upload.

Ok great, but which option should you choose for your mapping needs?

A quick experiment: I uploaded some Natural Earth country borders, and the US Census California Counties to ShpEscape. There's a big difference in these images, and my conclusion is if you want to use TopoJSON in a slippy map, you should consider how far your users may zoom in, the importance of not losing detail, and probably avoid the default 10k quantize parameter.

Additionally, if you do decide to stick with GeoJSON, don't feel too bad: The 90% savings that first jumps out at you might not end up as big as you think. Below, for example, are the numbers (in kB) for the CA Counties:

GeoJSON 6,322
TopoJSON [default] 454
TopoJSON [100M] 1,539
GeoJSON [gzip] 1,418
TopoJSON [100M gzip] 556

TopoJSON is still the clear winner in this experiment, at a bit over a third of the size when sent over the wire. But there's still some cost (for example, the additional topojson.js library), and I also didn't experiment with liljson, which could potentially save some space on the GeoJSON side.
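If you want to run the over-the-wire comparison on your own data, the measurement itself is a one-liner. Synthetic coordinates here, purely illustrative:

```python
import gzip
import json

# Verbose GeoJSON-style coordinate lists compress very well, which is why
# gzip closes much of the raw-size gap between GeoJSON and TopoJSON.
coords = [[-122.4194 + i * 1e-5, 37.7749 + i * 1e-5] for i in range(1000)]
geojson_like = json.dumps(
    {"type": "LineString", "coordinates": coords}).encode("utf-8")

raw_size = len(geojson_like)              # bytes on disk
gzip_size = len(gzip.compress(geojson_like))  # bytes over the wire
```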

Finally, don't take the above figures too seriously -- YMMV with different datasets; this one for example has polygons with shared borders, and a relatively even distribution. Instead, be thankful we have a new awesome option for sending vectors around, and use it with care.

June 12, 2011


I've played briefly with TileMill before, but after learning more about the advances that Development Seed is putting into their MapBox stack (such as Node.js integration with Mapnik, utfgrid and more) I realized it was time to sit down and play with it for real.

My main interest right now is in getting a feel for the CSS-like syntax used in Carto, but as long as I needed to set up a full instance of TileMill to play with I figured I might as well make it into an Amazon EC2 AMI, so anyone can easily boot up an instance and get started.

Setup was extremely simple (thanks to Dane telling me how to cut and paste their very straightforward install instructions; not that he noticed I was installing everything to /tmp). After some much needed sleep and attempt #2 here on the plane, you can now go to the Amazon AWS console and load up ami-56ae563f (or search for tilemill), and you're done. Just wait for the instance to start up, and TileMill should be running on port 80. If you want to get to the TileMill console, ssh in with the keypair you associated with the instance and type:

sudo su
screen -r tilemill

December 20, 2010

Shape Escape

Well it's been a while, so just a quick note: Since the last post, I started working full time for Google. And with that out of the way, here's a post on how and why I made shpescape.com, which lets you upload shapefiles to Google Fusion Tables.

Why shpescape?

Google Fusion Tables makes it easy to import and visualize data from spreadsheets and KML, and while it has increasingly robust spatial support, it does not currently let you upload shapefiles directly. Since shapefiles are still incredibly common in the wild, I thought I'd make a quick tool to let people upload them to Fusion Tables.

Which platform?

I thought I'd try Google App Engine to avoid any hosting costs (given this will likely not be an extremely popular website), but while there's a decent shapefile reader or two for Python, there's not a lot of support for things like reprojection and other geometry manipulation without additional C++ libraries that App Engine won't run. So I just went with a simple GeoDjango app.


I used my colleague Kathryn's Fusion Tables Python client to handle the authentication (OAuth). I decided against adding OpenID to the mix for associating an account with various uploads. The downside is that you can't log in and view your previous uploads, but you can always go to the main Fusion Tables page to see all your tables, and the upside was one less thing for me to consider (for example, if you are logged in with multiple accounts in the same browser, OAuth does not tell you which account granted the permissions). [Edit: It turns out you can actually request the email address of an authorized user using the scope noted at http://sites.google.com/site/oauthgoog/Home/emaildisplayscope]

Handling a Shapefile Upload

I used a simple fork of Dane Springmeyer's django-shapes app to handle the shapefile import. The customizations let users upload a zipfile that has a shapefile in a subfolder, and/or multiple shapefiles in a single zip. I had never really noticed shapefiles being zipped up this way, and it really surprised me how common these scenarios are with shapefiles from various US counties and other agencies -- my first 3 test users all had their uploads fail until I added this. After the upload is verified as valid, it creates a shapeUpload object, which is processed separately so the end user can view a reloading page with updated status.

Processing the Upload

My initial attempt was pretty straightforward:
  • Attempt to get the projection of the shapefile (from the .prj)
  • For each feature, get its KML and attributes
  • Upload 'em to Fusion Tables, ensuring a max of a few hundred, or <1MB, at a time (the API can handle at most 500 rows and 1MB per POST)
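The batching constraint in that last step can be sketched as a simple generator. The 500-row and 1MB caps are the API limits mentioned above; the rows here stand in for pre-serialized INSERT statements (a hypothetical representation):

```python
# Group rows into batches that each stay under the Fusion Tables API's
# per-POST limits: at most 500 rows and 1MB of data per request.
def batch_rows(rows, max_rows=500, max_bytes=1_000_000):
    """Yield lists of rows, each list safe to send as one POST."""
    batch, size = [], 0
    for row in rows:
        row_size = len(row.encode("utf-8"))
        if batch and (len(batch) >= max_rows or size + row_size > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(row)
        size += row_size
    if batch:
        yield batch

# Each yielded batch would then be sent as a single POST to the API.
```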

Additional Features

Next up, I started adding a few extra bits, which led to an import method begging for a refactor.
  • Simplification for KML over 1M characters long (which is the max characters allowed by Fusion Tables per cell)
  • Process/Upload 10k rows at a time (so we don't use too much memory for very large shapefiles)
  • Added numeric styling columns for string fields that don't have too many unique values (Fusion Tables only allows robust styling like gradients and buckets on numeric fields)
  • Allow users to specify some additional geometry columns:
    • Simplified Geometry
    • Centroid (only works for polygon shapefiles)
    • Buffered Centroid (so you can apply the more robust polygon styling rules on the 'centroid')
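The centroid and buffered-centroid columns boil down to standard geometry operations (in the real app these presumably come via GeoDjango's GEOS bindings; the "buffer" below is a simple square rather than GEOS's round buffer). A self-contained sketch:

```python
# Self-contained sketches of the centroid and buffered-centroid columns.
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon ring [(x, y), ...]."""
    a = cx = cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        a += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    a *= 0.5
    return cx / (6 * a), cy / (6 * a)

def buffered_centroid(ring, radius):
    """A small square around the centroid, so polygon styling rules apply."""
    cx, cy = polygon_centroid(ring)
    return [(cx - radius, cy - radius), (cx + radius, cy - radius),
            (cx + radius, cy + radius), (cx - radius, cy + radius)]
```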

Finishing up

This whole project was a pretty quick attempt at what I hope is a useful solution to a common problem, so any comments on how to make it better are appreciated. And if you want to see how it all works in more detail, I also open sourced the code. Enjoy!

February 5, 2010

Oceans Showcase

Last night at the San Francisco Ocean Film Festival Google launched the Oceans Showcase, which is the second contract I've had the opportunity to work on with them. The showcase is a set of Google Earth-based Tours for playing in a webpage (plugin required) or via download.

The Ocean Film Festival is going on until Sunday, and has a really interesting lineup - check it out if you're in the Bay Area. Either way, take a peek at some of the Tours: There's some really amazing content available for the Oceans layer in Google Earth that I was totally unaware of before looking more closely.

November 8, 2009

Fun with Layars

Last night I installed Layar on my phone, and had some fun checking out the twitter and wikipedia layers. So I signed up for an API key, and 30 seconds later saw a tweet mentioning the California Data Camp. Perfect! After a rare and blissful sleep-in, I wandered over to see what was going on at Citizen Space, thinking I'd try to get a proof of concept demo showing some City of SF data in Layar.

Turns out, despite a number of interesting conversations taking precedence over my coding, I managed to get a simple demo working, and even win Honorary Mention (and an iPod touch) for my efforts. And a couple of Layars (crime data and handicapped parking spaces) are just waiting for publishing approval from Layar, and will hopefully be available in a few hours. Just search for "datasf" in your Layar app.

Since GeoDjango was the reason I was able to get a mockup going so quickly, I thought I'd just write a few short notes on the steps I took to build the Layar-compatible API, and make the code available. Note that the code here is not particularly pretty - it's the result of a partial afternoon of work (including finding/downloading the layers, going over the Layar API docs, and dealing with incredibly spotty internet connectivity). Nonetheless it may be interesting to some GeoDjango newbies as an example of a quick proof of concept.
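The Layar-compatible API is ultimately just a view that returns JSON. A minimal sketch of building that response body -- the field names follow my recollection of the early Layar API (lat/lon as integer microdegrees), so treat them as illustrative and check the API docs:

```python
import json

# Build a Layar-style getPOIs JSON response. Field names are from memory
# of the 2009-era Layar API and are illustrative, not authoritative.
def layar_response(layer_name, pois):
    """pois: iterable of dicts with id, title, lat, lon (decimal degrees)."""
    hotspots = [{
        "id": p["id"],
        "title": p["title"],
        "lat": round(p["lat"] * 1e6),  # Layar expected integer microdegrees
        "lon": round(p["lon"] * 1e6),
    } for p in pois]
    return json.dumps({
        "layer": layer_name,
        "hotspots": hotspots,
        "errorCode": 0,
        "errorString": "ok",
    })

# In the GeoDjango view, `pois` would come from a spatial query against the
# datasets (crime reports, parking spaces) filtered by the user's location.
```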

The homepage is at http://code.google.com/p/geodjango-layar/, and the wiki includes some play-by-play instructions if you're just getting started with this stuff. Enjoy!

[EDIT]: I now see that someone else (the Sunlight Foundation) had already built a more robust generic view for Django, called django-layar. You should definitely use theirs instead -- but if you're curious about more ways to load spatial data into GeoDjango, you might find my notes on Google Code interesting anyway.

October 15, 2009

Tiling Kibera

The upcoming Map Kibera project acquired some imagery recently, and I got ahold of it yesterday to set up a quick tilecache preview. There have actually been quite a few requests here recently for getting tiles up quickly from various sets of source imagery, so I thought I'd write a few blog posts on some different ways to go about it.

First, I'm assuming the end user will be requesting tiles, and that these tiles will be projected in Spherical Mercator for viewing on the web in a client like OpenLayers or Google Maps (so I'm skipping over the bits for creating tiles that might be used in a client like Google Earth). With that in mind, there are a few ways to get your tiles. Note that the Kibera imagery is a nice simple example, because the area of the imagery is not that large (about 25 square km), and the source file is only a couple hundred megs as an uncompressed TIF.

Option A: Pre-generate all your tiles in advance

The easiest way to generate all your tiles in advance is probably to use the newish MapTiler software, which is a nice graphical interface for the gdal2tiles project. After installing MapTiler, I just selected my projection, selected my single TIF file (note your source files do not have to match the output projection -- and because my data was a GeoTIFF with appropriate metadata, MapTiler automatically figured out the projection info and appropriate transformation by itself), selected my zoom levels and other options, and hit Render. Because I wanted tiles all the way up to zoom level 18, it took just under 15 minutes. The output of MapTiler is just awesome - it creates not only the tiles but also sample Google Maps and OpenLayers html, each of which is full of nice features. I'm impressed (though I'd like to see a CloudMade tile layer or two in the OpenLayers example).

If for any reason MapTiler isn't working out for you, you can also use gdal2tiles directly. Mano Marks recently wrote a nice tutorial on using gdal2tiles for creating KML superoverlays. The concept is the same for creating spherical mercator tiles - you just need to change the warping projection (to use EPSG:3785 instead of EPSG:4326) and remove the geodetic option from gdal2tiles, and you should be good to go (note that EPSG:900913 is equivalent to EPSG:3785, and if you do not have one of them in your epsg file, you may need to add it manually).

The source imagery was 0.6 meters/pixel, and because we're near the equator, tiles at level 18 are close to the scale of the original image. Going up to zoom 19 added a little viewing clarity, but it took ~4x the space required for my level 18 tiles, not to mention the time to render them. In this case, rendering zoom level 19 alone took over 45 minutes.
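The zoom-level arithmetic is easy to check: a 256px spherical mercator tile scheme spreads the earth's circumference over 256 * 2^zoom pixels at the equator. A quick sketch:

```python
import math

# Ground resolution of 256px spherical mercator tiles at the equator.
EARTH_CIRCUMFERENCE = 2 * math.pi * 6378137  # WGS84 equatorial radius, meters

def meters_per_pixel(zoom, tile_size=256):
    return EARTH_CIRCUMFERENCE / (tile_size * 2 ** zoom)

# Near the equator (as in Kibera), zoom 18 works out to roughly 0.6 m/px --
# right at the source imagery's resolution, which is why zoom 19 adds little.
# Each extra zoom level also quadruples the tile count, hence the ~4x space.
```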

Option B:  Generate tiles on demand

Often, you are dealing with a larger dataset than Kibera, and rendering all the tiles might take many hundreds of gigabytes (or much more). In addition, it's very likely that the vast majority of your tiles will never be requested by any user -- rendering the middle of a 'boring' area up to zoom level 20 is basically a waste of space. But since you can't be exactly sure which tiles will be requested, you may want to render them on demand, and then cache those requested tiles under the assumption that if they were requested once, they're more likely to be requested again. Another reason to do this is time: It only took 15 minutes to render Kibera up to zoom level 18, but what if you just got imagery for Afghanistan, and you'd like to start looking at the tiles _right now_ instead of waiting overnight (or longer) for the pre-rendering to finish? One answer is TileCache.

A common use-case for TileCache is to put it as middleware between an existing WMS server and the end users. This works great, but requires that you already have a WMS server configured. However, TileCache can also read GDAL data formats directly and then spit out the tiles. To use this, it's important that you have both PIL and NumPy installed (along with GDAL and TileCache, of course). Here's a simple TileCache configuration for creating Google Maps-compatible tiles:
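A minimal tilecache.cfg along these lines should do it (the layer name and file paths are placeholders, and I'm going from memory on the GDAL layer options, so double-check against the TileCache docs):

```ini
[cache]
type=Disk
base=/tmp/tilecache

[kibera]
type=GDAL
file=/data/kibera.tif
spherical_mercator=true
```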


In addition, however, you need to make sure your source data is in the matching spherical mercator projection.  To reproject (or transform) the Kibera imagery, I used this command:

gdalwarp -t_srs epsg:3785 09FEB19_BOOST.tif kibera.tif

Finally, you can also use tilecache_seed to pre-render some or all of the tiles using tilecache itself.  It can be useful for example to seed all but the last couple zoom levels (these will take relatively little disk space) so the first users of the map won't have to wait for tiles to render until they zoom way in to see some detail.

Tips and Tricks

There's a few things you can do to speed up tile generation and lessen the load on your server.  With a small dataset like this, it's not a big deal - but when dealing with bigger data sources, speeding up your render time can mean hours or days of computer time saved.

Transforming your Source Data: Making sure your source data is in the same projection as your output tiles means more than creating a VRT with the metadata for the projection transformation - it means actually transforming the raw data so it doesn't have to be transformed on the fly during tile creation. This has to be done for the tilecache option above, but if using MapTiler or gdal2tiles, you may wish to use gdalwarp as noted at the end of the TileCache section above to actually output a new tiff file to use as your source. The disadvantage of this is that you end up using extra space for the source data while you render, but if your plan is to pre-render all the tiles then disk space is probably not your concern.

Creating Overviews: In the Kibera example above, only zoom levels 18 and 19 were near the source dataset resolution. All of the lower zoom levels could have been rendered more quickly if we had them reading from a coarser (downsampled) data source. Fortunately GDAL ships with a utility to let us create these downsampled "overviews", which will in turn be used by any of the above rendering methods. To create overviews of my GDAL data source I run:

gdaladdo -r average kibera.tif 2 4 8 16

I can also add the "-ro" parameter to the gdaladdo command, which will create a separate overview file rather than incorporating the overviews directly into my source tiff. Either way, this can potentially speed up rendering time for all but your most detailed zoom levels.

Post Processing:  As mentioned by MapTiler during the tile creation process, you can save half your disk space or more by minimizing the output tile size using PNGNQ.  There's a thread here discussing ways to recurse through all your png files on windows or linux.

June 11, 2009

Featureserver on AppEngine

AppEngine is awesome. The more we use it, the more we like it.

Recently, someone contacted us who needed a site up, in a hurry, to serve up some points on a google map. The catch was there were about 50k points (so it seemed server side clustering might be nice). Also they wanted to be able to serve up at _least_ tens of millions of requests a day. And maybe quite a lot more.

Given the scaling requirements, it seemed like AppEngine might be a nice fit, since then we wouldn't have to worry so much about tons of caching, or ensuring clients made similar bounding box requests, and so forth. And as for the posting/getting of points to/from appengine, we decided to go for using FeatureServer as a base.

If you're not familiar with featureserver, a quick overview: It makes it easy to (amongst other things) post/update your features to some datastore, and pull them out with bounding box and/or attribute queries in a variety of vector formats (kml, json, wfs, etc). Also it not only supports a bunch of different backend datastores (shapefiles, twitter, postgis, flickr, etc.), but it makes creating new ones simple. And, thanks to crschmidt's usual paving-the-way, setting up FeatureServer on AppEngine was trivial.

So there I am with a nice little featureserver running on AppEngine. We set up some cron jobs to do the clustering, and with 50,000 points I run some tests at about 75 queries/second. Everything seems great.

But on further examination, the FeatureServer datastore that currently exists for AppEngine has a couple problems:
* Because it is based on geohash, it uses up your only inequality query on your location (bounding box) search, which means you can't filter on other stuff.
* The geohash implementation it uses has some quirks (but that's for another post)

Fortunately, WhereCamp was on during the time I was thinking about how to solve this, so I was able to ask all kinds of smart people their advice. One of them immediately pointed out to me that a colleague had implemented a clever method for storing points on AppEngine that might just do the trick: GeoModel

And so it was that I gave GeoModel a try, and it did indeed solve the problems I was having with the geohash implementation. On the downside GeoModel currently only works with points, but as that is all this particular project needs, it's not a problem at all. Long story short, I simplified our custom datastore this morning, and committed it to the featureserver codebase. So if you want to very quickly put up a scalable, reasonably robust geo-point datastore, with a restful (sorry, sean) interface, GeoModel on AppEngine might be a good way to go.