February 20, 2009

Walk On

A few months ago, Walkscore.com launched their new API, aimed at real estate sites, academic and other large-scale studies, and anyone else who might want programmatic access to the walkability of a set of locations.
Recently, a number of large sites (zillow.com, Estately.com, BaseEstate.com, and ColoProperty.com, amongst many others) have started to incorporate Walk Score in their listings, which has led to the happy challenge of making sure the site can handle the popularity.

When the good folks at Walkscore first contacted me about designing the API, we talked about the likelihood we'd be serving many, many millions of daily requests shortly after launch. This quickly led to discussions about what framework we wanted to build it on, and how much IT we were interested in taking on. Eventually, we decided to use Google's AppEngine. Given the current, and quickly growing, popularity of the API, I think this ended up being a great choice - no worries about having to even think about spawning new EC2 instances or load balancing, let alone how to best optimize the database and apache and caching configurations.

The rest of this post is about the first quirk we encountered on AppEngine -- I'll probably post a few more items in the future with details about some of my other experiences working on this very fun project.

Quirk 1: Counting things is not quite as simple on what is basically a key/value datastore as it is in a relational database.

The obvious, and nicely documented, choice for replacing a count(*) type query is using sharded counters to keep track of how many things you have. Unfortunately, we do quite a bit of counting in this app: not only counting all the various items in the datastore, but also tracking what type of request each user makes. And we need to summarize a bunch of these counts reasonably frequently (mostly to ensure users are under quota, but also for various other reports).
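For readers who haven't seen the pattern, here is a minimal sketch of the sharded counter idea: writes go to one of N randomly chosen shards (so no single entity takes every write), and reads sum all the shards. The shard count, names, and the plain dict standing in for the datastore are all illustrative - in the real app each shard would be its own datastore entity, incremented inside a transaction.

```python
import random

NUM_SHARDS = 20  # illustrative; more shards = less write contention

# Stand-in for the datastore: maps shard key -> count.
shards = {}

def increment(counter_name):
    """Add 1 to a randomly chosen shard for this counter."""
    shard_key = '%s-%d' % (counter_name, random.randint(0, NUM_SHARDS - 1))
    shards[shard_key] = shards.get(shard_key, 0) + 1

def get_count(counter_name):
    """Sum every shard belonging to this counter."""
    prefix = counter_name + '-'
    return sum(v for k, v in shards.items() if k.startswith(prefix))
```

The tradeoff is that reads now cost N fetches instead of one, which is fine when writes vastly outnumber reads - as they do for request counting.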

Although AppEngine scales super nicely in a lot of areas, there are still limits to how long a request can take before AppEngine decides you are using too many resources of one kind or another, and kills that request. In my testing, as the overall queries/second went up, requests that had multiple datastore reads (let alone writes) took longer and longer -- once we hit about 40QPS, HTTP 500 errors due to requests timing out increased dramatically. This held true while testing even their own sample sharded counter application, with about as simple a model as you can build.

My solution to this, after some more tests and various discussions, was building a (sharded) counter that relies more heavily on memcache - it only writes to the datastore when there have been a few hundred new counts to update. This makes for a couple orders of magnitude less datastore chitchat, and does not seem to have caused us any inaccurate counts (which we can test by manually counting each of a certain type of entity, and comparing against our counter).
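The buffering logic can be sketched roughly like this - increments accumulate in memcache, and only when the buffered total crosses a threshold does anything touch the datastore. The threshold value, function names, and the dicts standing in for memcache and the datastore are all my own illustration, not the actual Walkscore code; a real version would also have to live with memcache evictions silently dropping buffered counts, which is the accuracy risk mentioned above.

```python
FLUSH_EVERY = 500  # hypothetical threshold: datastore write per 500 counts

cache = {}      # stand-in for memcache: pending, not-yet-persisted counts
datastore = {}  # stand-in for the (sharded) datastore counters

def increment(name):
    """Buffer the increment in cache; flush to datastore past the threshold."""
    cache[name] = cache.get(name, 0) + 1
    if cache[name] >= FLUSH_EVERY:
        flush(name)

def flush(name):
    """Move all pending counts for this counter into the datastore."""
    pending = cache.pop(name, 0)
    if pending:
        datastore[name] = datastore.get(name, 0) + pending

def get_count(name):
    """Persisted total plus whatever is still buffered."""
    return datastore.get(name, 0) + cache.get(name, 0)
```

With a threshold of 500, a thousand increments cost two datastore writes instead of a thousand - which is where the orders-of-magnitude savings comes from.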

So quirk 1 was not a major issue, but a great reminder that you still need to think about scaling, even when building on top of a massively scalable infrastructure.