Tribune DataViz

Matters of interest, from the data reporters and developers across Tribune Publishing

Election night traffic survival tips from scarred apps developers

with 13 comments

Today our friend Andy Boyle wrote a blog post, Servers on fire, recounting his bad experience with last night’s election traffic. Unfortunately, he wasn’t the only one who had a bad experience.

The Chicago Tribune Elections Center did not go down–not when our results widget went up on the homepage, not when the traffic started to climb, and not when the first races were called. It didn’t even go down when our traffic reached truly epic proportions. By epic, I refer to Chartbeat’s reckoning that there were 15,128 concurrent users on our site at 9:30 PM.

The reasons for this are not intrinsic to our Tribune-branded awesomeness, but stem from the same pain some of you are feeling now. During this year’s February primaries our then-new Elections Center crashed… hard. Within moments of the AP numbers beginning to come in, the site flat-lined and was essentially down for half an hour. We were running memcached, but it wasn’t enough for the volume of traffic we had to handle, and as a consequence the site was never as stable as we would have liked. We learned a lot of lessons from that experience, as well as from others we’ve had through the year. So here, in a nutshell, is how to keep your elections application from falling over:

The Election Night Performance Checklist

  1. Do not serve widgets.
    That awesome elections widget you are putting on your newspaper’s homepage should not be served by your Slicehost 256 VPS. Use the magic of caching and put that thing on Amazon S3. It can handle the traffic, and you can push out updates as often as you need to without your servers falling over. Trust us: this is the single best thing you can do to save your servers and your bacon.
  2. Use database connection pooling.
    If you’re using Postgres, use pgpool. If you’re using another database, find out what options it has for connection pooling. If you open more connections than your database can process simultaneously, your database server is going to fry. Keep the connection count low and let users queue naturally.
  3. Build read-only applications.
    Election results need to be disseminated. Do not clutter up your results page with content that people will be editing. That wreaks havoc with your caching and means your application has to work through loads of updates while also serving a very high read load.
  4. Bring 3 extra servers online.
    If your application servers get behind on election night they will never catch up–they will crash. Do yourself a favor: put down $10 and bring three extra servers online for the day before and the day of the election. It’s cheap. It’s easy. It will save your bacon. (You are using cloud hosting, right?)
  5. Use state-of-the-art caching–this means Varnish.
    Cache your site and cache it very aggressively. Even the results pages on the Elections Center are cached for an hour. Targeted invalidation takes care of purging pages from the cache when new content comes in. Need an example to get started? You can find our configuration on GitHub!
  6. Enable grace mode.
    Even the best cache won’t save you if it’s improperly configured. If you don’t enable grace mode with at least a one-minute timeout, your servers won’t be able to regenerate pages as fast as they are purged from the cache. They will fall into the clutches of the dog-pile/thundering-herd problem, and they will die.
  7. Massage your cache hit rates.
    On election night, if your cache hit rate isn’t 99% then it isn’t good enough. Even 1% of thousands is still a lot. With a 99% cache hit ratio our application servers were essentially idle most of the night. If necessary, use Edge Side Includes and other techniques to ensure your pages are very well cached, but whatever you do, do not let users past the cache unless you are certain that page content has changed.
  8. Next year: use Thinner.
    We didn’t do it this year, but we learned from that, and next year we hope to use something like ProPublica’s Thinner to stagger our cache invalidation. Invalidating large sections of the cache all at once risks invoking the dog-pile problem. Thinner is an elegant solution for large sites.
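Points 5 and 6 together look roughly like this in Varnish 2.x-era VCL (an illustrative fragment, not our published configuration):

```vcl
sub vcl_recv {
    # Be willing to serve an object up to a minute past its TTL
    # while a fresh copy is being fetched (grace mode, point 6).
    set req.grace = 60s;
}

sub vcl_fetch {
    set beresp.grace = 60s;
    # Results pages are cacheable for an hour; targeted PURGEs
    # invalidate them sooner when new numbers arrive (point 5).
    set beresp.ttl = 1h;
}
```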
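To make point 1 concrete, here is a minimal sketch of the pattern: render the widget to a static file on a timer, then push that file to S3 so no reader request ever reaches your app servers. All of the names here are illustrative, and the actual S3 upload is left as a comment rather than shown.

```python
import time

def render_widget(results):
    """Render a list of race dicts into a self-contained HTML fragment."""
    rows = "\n".join(
        "<li>{race}: {leader} ({pct:.1f}%)</li>".format(**r) for r in results
    )
    return '<ul class="election-widget">\n' + rows + "\n</ul>"

def publish_forever(get_results, interval=60):
    """Every `interval` seconds, re-render the widget and write it out.
    In production the write would be followed by an upload, e.g.
    `aws s3 cp widget.html s3://your-bucket/widget.html`."""
    while True:
        with open("widget.html", "w") as f:
            f.write(render_widget(get_results()))
        time.sleep(interval)
```

The homepage then embeds the S3 copy, and your own servers never see a byte of the widget traffic.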
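The principle behind point 2 can be illustrated without pgpool itself: a fixed-size pool hands out already-open connections, and extra requests wait their turn instead of piling new connections onto the database. This is a toy stdlib sketch (sqlite3 standing in for your real driver), not a substitute for pgpool:

```python
import queue
import sqlite3  # stand-in for a real driver such as psycopg2

class BoundedPool:
    """Hold a fixed number of open connections; when all are busy,
    callers block in line rather than opening new connections."""

    def __init__(self, connect, size=5):
        self._conns = queue.Queue(maxsize=size)
        for _ in range(size):
            self._conns.put(connect())

    def acquire(self, timeout=None):
        # Blocks (up to `timeout` seconds) until a connection is free;
        # raises queue.Empty if the wait times out.
        return self._conns.get(timeout=timeout)

    def release(self, conn):
        self._conns.put(conn)
```

With the pool capped safely below your database’s connection limit, a traffic spike turns into orderly queueing instead of a connection storm.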
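The idea behind point 8 fits in a few lines: issue PURGE requests at a deliberately slow, fixed pace instead of all at once. This is a sketch of the concept, not Thinner’s actual Ruby implementation, and the `purge` argument is just a test seam:

```python
import time
import urllib.request

def staggered_purge(urls, per_second=2.0, purge=None):
    """Invalidate `urls` one at a time at `per_second` requests/sec,
    so backends regenerate pages gradually rather than all at once."""
    send = purge or urllib.request.urlopen
    for url in urls:
        send(urllib.request.Request(url, method="PURGE"))
        time.sleep(1.0 / per_second)
```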

For those who had problems this year: if you do these things, I can all but guarantee a better experience next time around. We all look forward to many more stable election sites in the years to come. Thanks for reading.


Written by Christopher Groskopf

November 3, 2010 at 1:40 pm

Posted in Infrastructure

13 Responses


  1. Great tips. We’re already installing Varnish as I write this, and setting it up so we’ve got multiple servers that we can spin up on high traffic apps.

    Andy Boyle

    November 3, 2010 at 2:10 pm

  2. This is terrific. A couple three things I’d add:

    1) Use Varnish for load balancing to all those extra servers. It’s nice and lightweight and works great. The Trib crew does this already if you look at their vcl; just wanna say it again.

    2) A 99% cache hit rate is ideal, but if you’ve got a few pages that have short TTLs, like RSS feeds or your stylesheets — or if your admin users see uncached pages, it’s probably fine. If your hit rate goes under 90%, I’d freak out if I were you.

    3) Deploy a simple app that does some math and returns a number. Call it “heartbeat” or something clever like that. Use Varnish’s “probe” config to poke that app every 5 seconds. If it doesn’t return, it means your whole server is borked and Varnish should increase its grace period to a long time. Our Varnish vcl does this:

    4) Look at the bit about X & Y coordinates in our vcl around line 69 — if you use graphical submit buttons in your form, X and Y get sent along with the posted data, which makes your pages pretty much uncacheable, so tell Varnish to ignore them.
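A backend probe stanza along the lines point 3 describes might look like this in Varnish 2.x-era VCL (illustrative values, not the actual ProPublica configuration):

```vcl
backend app {
    .host = "127.0.0.1";
    .port = "8000";
    .probe = {
        # Poke the heartbeat app every 5 seconds; mark the backend
        # sick (and lean on grace) if it stops answering.
        .url = "/heartbeat/";
        .interval = 5s;
        .timeout = 2s;
        .window = 5;
        .threshold = 3;
    }
}
```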

    Scott Klein

    November 3, 2010 at 2:47 pm

    • 1) Yes, absolutely. Varnish makes a fine load balancer and introducing another dependency is just silly.

      2) You’re right, of course. Not all sites will achieve 99%, but I think for traffic spikes like election night it’s a pretty reasonable target to be aiming for. If load testing has shown you can survive at the 90% mark, bless your app servers.

      3) That’s a good point. I’ve never gotten around to setting up a heartbeat monitor, but it’s certainly worth doing. Thanks for sharing your VCL!

      4) Right, we haven’t had that problem, but good tip.

      Christopher Groskopf

      November 4, 2010 at 12:41 pm

  3. The best part about a night like last night is watching the needle on Chartbeat do 360-degree spins as it struggles to keep up.

    Managed to hit 6 out of 8 on the checklist (and still mopping up sweat from the ones I didn’t think of — Varnish moves up the to-do list)

    Jeffrey Long

    November 3, 2010 at 2:54 pm

    • Absolutely. Chartbeat gave us some wonderful metrics, and their manager was kind enough to give us some extra juice when we exceeded capacity.

      Christopher Groskopf

      November 4, 2010 at 12:42 pm

  4. Been waiting for this post. Awesome tips for other news hackers trying to scale django and convince others that it really does scale.

    I suppose you use a post_save signal and absolute_url or something to invalidate single pages for objects, but I always struggle to find a decent way of invalidating filter pages, pages listing objects, etc. Do you have some neat solution that you guys use for that? Or is it just standard caching for a couple of minutes? As always: caching is easy, invalidation is not.

    Anders Eriksen

    November 4, 2010 at 12:03 pm

    • I still haven’t gotten invalidation right, because doing it right (handling all those extraneous pages you describe) takes as much time as building the app!

      With that in mind we’ve always taken the low road: invalidate where it’s easy or necessary (post_save is obviously great for this) and use fixed TTLs for other parts of the site.

      ESI fragments can also be great for breaking a page into components that can easily be invalidated, even when the whole page can’t be.
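A purge helper for that post_save route can be sketched in a few stdlib lines. The cache address and path here are illustrative, and `opener` is just a test seam:

```python
import urllib.request

CACHE = "http://127.0.0.1:6081"  # wherever Varnish listens (illustrative)

def purge(path, opener=urllib.request.urlopen):
    """Issue an HTTP PURGE for a single path. Wire this to a post_save
    signal (passing the instance's get_absolute_url()) so each saved
    object invalidates only its own page."""
    request = urllib.request.Request(CACHE + path, method="PURGE")
    return opener(request)
```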

      Christopher Groskopf

      November 4, 2010 at 12:45 pm

      • My colleague Jeff Larson wrote a small ruby script — which has a command-line executable for you Djangonauts and PHPniks — called Thinner, which helps you invalidate a big mess of URLs at once without melting down. Thinner takes a list of URLs and invalidates them at the (slow) pace you set.

        Scott Klein

        November 6, 2010 at 8:07 am

  5. […] to keeping its performance nimble. Here are a few tips I picked up this week after reading this post and thinking about how to apply those ideas to my own site:1. Optimize the database.WordPress […]

  6. Can I ask a painfully stupid question?

    If you’re relying on caching, you’re pretty much acknowledging that the results might not be quite real-time. So why bring so much database horsepower into play at all, when you can use one database server to periodically generate static pages?

    I gotta be missing something obvious. But some things are less obvious at 1:45 a.m. when the dog woke you up. =)

    Mike Stucka

    November 5, 2010 at 11:46 pm

    • Mike, are you referring to point #4? I’m really only talking about the web tier in this case; we only use one small database server. If what you’re suggesting is that we could use the cron->update->render-static-pages approach (Movable Type-style) rather than letting user requests kick off the updates, you’re absolutely right. In general the reason we don’t do that is that it doesn’t generalize as well, and we try to keep all our deployments fairly methodologically consistent. Having user requests update your page renders isn’t intrinsically bad, so long as they are appropriately throttled.

      As for the logic of having the extra web servers when we know the traffic is throttled, it’s simply a case of better safe than sorry. Election nights are hard to predict, and one mistake on our part could bring down the whole elections rig. Servers are dirt cheap and the overhead of bringing them online is very low.

      Christopher Groskopf

      November 6, 2010 at 8:11 am

    • I guess it depends on what you mean by “real-time.” You can use varnish to cache every page in your app for a minute, which means a) your results are never more than a minute old, assuming your underlying data are also changing that fast, and b) your back-end webservers are only handling one request per minute for each page of your app, which even an Amazon Small instance can probably do.

      Varnish can do sub-minute TTLs, though I suspect if you need genuine real-time updates at scale you’d be better off with a rack/building full of servers.

      Scott Klein

      November 6, 2010 at 8:14 am
