Archive for the ‘Infrastructure’ Category
Today we released Tarbell 0.9 Beta 6. (Tarbell is our open-source static site generator based on Google spreadsheets, made with newsrooms in mind. Read more here!) This is our biggest release to date and should be the last before a stable 1.0 arrives in the next week or two. Here are some notable changes:
- New naming conventions: The “base template” naming was confusing to users. We have switched to the term “Tarbell blueprints” to better reflect the role and function of this key Tarbell concept. The “_base” directory has been renamed “_blueprint” and the documentation now refers to “Tarbell blueprints” instead of “base templates.” Projects created with previous versions of Tarbell will still work.
- Expanded documentation: We greatly expanded and improved the Tarbell documentation, including a more in-depth tutorial.
- New hook system: Developers can now trigger actions at key points such as project creation and publishing: create a repository and tickets when creating a new project, or refresh the Facebook cache when publishing. (See the sketch after this list.)
- Improved command line interface: Better wording, formatting, and line-wrapping.
- Better credential handling: Tarbell now supports publishing from non-interactive environments.
- Support for project requirements: Tarbell projects and blueprints can now specify 3rd party Python libraries as dependencies.
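As a rough illustration of the hook idea, here is a minimal sketch; the decorator, hook name, and callback signature shown are assumptions based on the description above, so check the Tarbell documentation for the exact API:

from tarbell.hooks import register_hook

# Runs after a new project is created: a natural place to create a
# repository, file tickets, or ping a chat room. (The 'newproject'
# hook name and the site/git arguments are assumptions.)
@register_hook('newproject')
def create_tickets(site, git):
    print("Created a new project at %s" % site.path)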
Get started by installing Tarbell! Already a Tarbell user? Upgrade with:
pip install -U tarbell
Special thanks goes to Heather Billings, who did tremendous work on this release.
For the last few months my coworkers Ryan Mark and Ryan Nagle have been engaged in rebuilding the Tribune's blog network, ChicagoNow.com (new version not yet launched). Much of the rebuild has revolved around migrating to a new architecture: Apache, WordPress, and Varnish. Being the "Varnish guy" on the team, I was asked to implement a caching strategy similar to the one we developed for TribLocal.com last summer. Perhaps the most important lesson I learned from TribLocal's development was that our cache layer was extremely fragile: it was prone to break when a new version of the site was deployed. Worse yet, minor changes to the VCL configuration frequently had unforeseen consequences that cascaded across the site.
My intention with ChicagoNow was to do a much better job of compartmentalizing the various elements of the configuration to ensure changes were idempotent. Brian, however, had an even better idea: unit tests. For those who want to cut to the chase, here is the Python unittest script for testing our Varnish cache layer. The script knits together three unrelated technologies to let us test how changes to the site affect the cache:
- Python’s standard library unittest module
- Kenneth Reitz’s Python requests library
- Max Cutler’s python-wordpress-xmlrpc library
The unittest module provides a convenient framework for isolating tests from one another, and python-requests dramatically simplifies the process of making various types of requests and testing the response headers. The VCL for the site attaches "X-Cache" and "X-Cache-Hit" debug headers indicating whether the response was a cache hit or miss and, if it was a hit, how many hits there have been. The final component in the stack is the python-wordpress-xmlrpc library, which makes it nearly trivial to create posts and comments programmatically. This is crucial for testing whether the cache is being correctly invalidated. Let's take a look at a couple of tests to see how these three components work together.
class TestLoggedOut(TestCachingBase):
    """
    Tests for logged-out users.
    """
    def test_homepage(self):
        url = build_url('/')
        response = self.get_twice(url)
        self.assertHit(response)
        self.assertMaxAge(response, 300)

    def test_homepage_login_logout(self):
        url = build_url('/')
        cookies = self.login()
        cookies = self.logout(cookies)
        response = self.get_twice(url, cookies=cookies)
        self.assertHit(response)
Here we see two tests for logged-out users. The first test fetches the homepage of the site twice and asserts that the second request was a hit, and that the max-age header is what we declared it to be in the VCL. The second test logs the user in and then out, and then, while maintaining the same cookie jar, hits the homepage twice to ensure that the user gets cached results after logging out. Let's look at one more test from TestLoggedOut to see how python-wordpress-xmlrpc comes into play:
def test_edit_post(self):
    url = build_url('/%s/' % BLOG)
    post_id = self.new_post(BLOG)
    response = self.get_twice(url)
    self.assertHit(response)
    self.edit_post(BLOG, post_id)
    response = requests.get(url)
    self.assertMiss(response)
This test creates a new post on a particular blog, then hits the blog homepage twice to ensure it is cached. It then edits the post and hits the blog homepage once more to ensure that the cache was invalidated. Most of the actual functionality has been abstracted into the TestCachingBase base class, so the individual tests are brief and readable. The full test suite includes several other complex scenarios.
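To give a flavor of those helpers, here is a minimal sketch of what TestCachingBase might look like; the linked script is the authoritative version, and the build_url, USER, and PASSWORD names here are placeholders (the login/logout helpers are omitted):

import unittest

import requests
from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods.posts import NewPost, EditPost


class TestCachingBase(unittest.TestCase):
    """Shared helpers for exercising the Varnish cache layer (sketch)."""

    def _client(self, blog):
        # One XML-RPC client per blog; USER and PASSWORD are placeholders.
        return Client(build_url('/%s/xmlrpc.php' % blog), USER, PASSWORD)

    def get_twice(self, url, cookies=None):
        # The first request warms the cache; the second is the one
        # whose debug headers we inspect.
        requests.get(url, cookies=cookies)
        return requests.get(url, cookies=cookies)

    def assertHit(self, response):
        # The VCL attaches an X-Cache debug header to every response.
        self.assertEqual(response.headers['X-Cache'], 'HIT')

    def assertMiss(self, response):
        self.assertEqual(response.headers['X-Cache'], 'MISS')

    def assertMaxAge(self, response, seconds):
        self.assertIn('max-age=%d' % seconds,
                      response.headers['Cache-Control'])

    def new_post(self, blog):
        # python-wordpress-xmlrpc makes programmatic posting trivial.
        post = WordPressPost()
        post.title = 'Cache test post'
        post.content = 'Created to exercise cache invalidation.'
        post.post_status = 'publish'
        return self._client(blog).call(NewPost(post))

    def edit_post(self, blog, post_id):
        post = WordPressPost()
        post.title = 'Cache test post (edited)'
        post.content = 'Edited to force a cache purge.'
        self._client(blog).call(EditPost(post_id, post))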
It's unfortunate that we lack a good "code coverage" metric for this sort of testing, and of course these aren't true "unit tests," but using unit-testing tools this way goes a long way toward organizing the logic. The best thing about this cache testing rig is knowing that if we change something, we can easily check whether we have broken anything else (i.e., regression testing). That was a frequent problem with TribLocal, and I think this implementation will go a long way toward preventing it from happening again.
The Chicago Tribune Elections Center did not go down: not when our results widget went up on the homepage, not when the traffic started to climb, and not when the first races were called. It didn't even go down when our traffic reached truly epic proportions. By epic, I refer to Chartbeat's reckoning that there were 15,128 concurrent users on our site at 9:30 PM.
The reasons for this are not intrinsic to our Tribune-branded awesomeness, but stem from the same pain some of you are feeling now. During the February primaries of this year our then-new Elections Center crashed… hard. Within moments of the AP numbers beginning to come in, the site flat-lined and was essentially down for half an hour. We were running memcache, but it wasn't enough for the volume of traffic we had to handle, and as a consequence the site was never as stable as we would have liked. We learned a lot of lessons from that experience, as well as from others we've had throughout the year. So here, in a nutshell, is how to keep your elections application from falling over:
The Election Night Performance Checklist
- Do not serve widgets.
That awesome elections widget you are putting on your newspaper's homepage should not be served by your Slicehost 256 VPS. Use the magic of caching and put that thing on Amazon S3. It can handle the traffic, and you can push out updates as often as you need to without your servers falling over. Trust us: this is the single best thing you can do to save your servers and your bacon. (See the sketch after this checklist.)
- Use database connection pooling.
If you’re using Postgres, use pgpool. If you’re using another database find out what options it has for connection pooling. If you open more connections than your database can process simultaneously then your database server is going to fry. Keep the connections low and let users queue naturally.
- Build read only applications.
Elections results need to be disseminated. Do not clutter up your results page with something that people are going to be editing. This will wreak havoc with your caching and mean your application is trying to work through loads of updates while also being optimized for a very high read load.
- Bring 3 extra servers online.
If your application servers get behind on election night, they will never catch up; they will crash. Do yourself a favor, put down $10 and bring three extra servers online for the day before and the day of elections. It's cheap. It's easy. It will save your bacon. (You are using cloud hosting, right?)
- Use state-of-the-art caching; this means Varnish.
Cache your site and cache it very aggressively. Even the results pages on the Elections Center are cached for an hour. Targeted invalidation takes care of purging pages from the cache when new content comes in. Need an example to get started? You can find our configuration on Github!
- Enable grace mode.
Even the best cache won’t save you if improperly configured. If you don’t enable grace mode with at least a one-minute timeout your servers won’t be able to regenerate pages as fast as they are purged from the cache. Your servers will fall into the clutches of the dog-pile/thundering-herd problem and they will die.
- Massage your cache hit rates.
On election night, if your cache hit rate isn't 99% then it isn't good enough. Even 1% of thousands is still a lot. With a 99% cache hit ratio, our application servers were literally idle most of the night. If necessary, use Edge Side Includes and other techniques to ensure your pages are very well cached, but whatever you do, do not let users past the cache unless you are certain that page content has changed.
- Next year: use Thinner.
We didn't do it this year, but we learned from it, and next year we hope to use something like ProPublica's Thinner to stagger our cache invalidation. Invalidating large sections of cache all at once risks invoking the dog-pile problem. Thinner is an elegant solution for large sites.
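To make the first item on the checklist concrete, here is a minimal sketch of pushing a widget to S3 with boto; the bucket and file names are made up, and a short max-age lets you repush fresh numbers every few minutes while S3 absorbs the homepage traffic:

from boto.s3.connection import S3Connection

# Reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment.
conn = S3Connection()
bucket = conn.get_bucket('media.example.com')  # hypothetical bucket

# Upload the rendered widget with a short cache lifetime so browsers
# and any intermediaries come back for fresh results frequently.
key = bucket.new_key('elections/results-widget.html')
key.set_contents_from_filename(
    'results-widget.html',
    headers={'Cache-Control': 'max-age=60', 'Content-Type': 'text/html'},
    policy='public-read')

Rerun something like that on a cron every few minutes and your origin servers never see a single widget request.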
For those who had problems this year: if you do these things, I can all but guarantee you will have a better experience next time around. We look forward to seeing many more stable elections sites in the future. Thanks for reading.
You don’t get to start optimizing until you’ve made the servers fucking cry.
In June we launched Chicago Breaking Business, which is, at least in terms of traffic, our largest site so far. Both before and during that project I had been iterating toward a load-testing solution that would meet the challenge of the above quote and provide us with a reasonable certainty that sites of this size (and larger) would not collapse under sudden waves of traffic. The traditional load-testing tools we tried (ab, flood, jmeter, etc.), while often useful, suffered from a number of problems when trying to simulate this sort of event:
- Testing clients may have an operating-system-imposed maximum number of outbound connections.
- Testing a staging environment from a development workstation introduces network latency, firewalling, and traffic shaping as variables in the tests.
- Testing from a single client will cause Amazon's Elastic Load Balancer to route all connections to a single application server (so-called "sticky sessions"), no matter how many are available.
What we needed was a solution that would let us use multiple, topologically close clients to simultaneously assault our servers with traffic.
Put another way, what we needed were bees with machine guns.
Bees with Machine Guns is a Fabric script that I put together to aggressively load-test our sites. It provides methods to spin up an army of EC2 instances (the bees), assault a target server with a configurable amount of traffic, and then spin those instances back down.
Here is an example of what a Bees session might look like:
$ fab -f bees.py up:2
Connecting to the hive.
Attempting to call up 2 bees.
Waiting for bees to load their machine guns...
.
.
.
.
Bee i-df450ab4 is ready for the attack.
.
Bee i-dd450ab6 is ready for the attack.
The swarm has assembled 2 bees.
Done.

$ fab -f bees.py attack:http://YOUR-DOMAIN.com/,1000,100
Read 2 bees from the roster.
Connecting to the hive.
Assembling bees.
Each of 2 bees will fire 500 rounds, 50 at a time.
Stinging URL so it will be cached for the attack.
[localhost] run: curl http://YOUR-DOMAIN.com/ >> /dev/null
Organizing the swarm.
Bee 0 is joining the swarm.
Bee 1 is joining the swarm.
Bee 0 is firing his machine gun. Bang bang!
Bee 1 is firing his machine gun. Bang bang!
Bee 0 is out of ammo.
Bee 1 lost sight of the target (connection timed out).
Offensive complete.
    Target failed to fully respond to 1 bees.
    Complete requests:      500
    Requests per second:    81.670000 [#/sec] (mean)
    Time per request:       612.188000 [ms] (mean)
    50% response time:      358.000000 [ms] (mean)
    90% response time:      1541.000000 [ms] (mean)
Mission Assessment: Target successfully fended off the swarm.
The swarm is awaiting new orders.
Done.

$ fab -f bees.py down
Read 2 bees from the roster.
Connecting to the hive.
Calling off the swarm.
Stood down 2 bees.
Done.
(Note: Bees assumes you have AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set as environment variables–just as you would if you were using the Amazon command line tools.)
Using the Bees I was able to verify that our new load-balanced and Varnished servers could handle a sustained load significantly beyond our requirements. Obviously this testing strategy doesn’t attempt to address all scenarios and it should be used in conjunction with traditional testing tools such as jmeter. That being said, it also provides a pretty interesting example of using EC2 to solve problems that can really only be properly addressed with distributed computing. This particular example is very well-suited to validating caching and scaling strategies.
If you decide to use the Bees, please keep in mind the following important caveat: they are, more or less, a distributed denial-of-service attack in a fancy package; therefore, if you point them at any server you don't own, you will be behaving unethically, will have your Amazon Web Services account locked out, and will be liable in a court of law for any downtime you cause.
You have been warned.
Now then–how does your server handle thousands of simultaneous connections?
For some time now we have been eagerly awaiting the release of a new tool for data normalization called Freebase Gridworks. Developed by David Huynh and Stefano Mazzocchi, research hackers at Metaweb Technologies (developers of Freebase), Gridworks is an open source, Java-based tool for refining big ugly datasets into glistening gems of consistency.
It accomplishes this by providing quick access to a robust set of tools for identifying commonality amongst values in a dataset and then collapsing all of those values into a single representation. This might sound abstract, but if you've ever worked with a raw dataset (especially one keyed by hand) you've probably faced these exact problems: there always seems to be one user who spells it "Gray," another who spells it "grey," and a third who can't be bothered to decide and spells it "GRY." These sorts of inconsistencies might be all right when the data is referenced one row at a time, but they completely undermine any ability to properly analyze or aggregate the data in a useful way.
Gridworks addresses these issues with some familiar tools: programmatic column splitting, collapsing of internal whitespace, regular expression transformations, faceting (finding all unique values in a column), Levenshtein distance clustering; as well as some methods that you may never have heard of: n-gram clustering, metaphone clustering, and reconciliation to Freebase types.
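To make the clustering idea concrete, here is a toy sketch (ours, not Gridworks code) that groups the "Gray"/"grey"/"GRY" variants from the example above using Levenshtein distance:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cluster(values, threshold=1):
    # Greedy single-link clustering on case-folded edit distance.
    clusters = []
    for value in values:
        for group in clusters:
            if any(levenshtein(value.lower(), v.lower()) <= threshold
                   for v in group):
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters

print(cluster(['Gray', 'grey', 'GRY', 'blue']))
# [['Gray', 'grey', 'GRY'], ['blue']]

Once a cluster is identified, every member can be collapsed to a single canonical spelling, which is exactly the operation Gridworks puts a friendly front-end on.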
However, the key to the incredible utility of Gridworks lies not in its selection of algorithms, but in providing one simple front-end for applying all of them. All of these normalization techniques are well known, but using them traditionally requires writing code, and that implementation inevitably ends up tied to a particular dataset. The genius of Gridworks is that it is generic enough to work for a wide variety of datasets without the need to write any code at all. Even better, the resulting operations are portable, so the process used to clean up 2009's data can be repeated on 2010's. It is almost impossible to overestimate how much time this will save us.
We want to thank the incredible hackers at Metaweb for making Gridworks open source. This is a tremendous contribution to the community, especially given that comparable commercial products are priced well out of range of small businesses, independent developers, and newsrooms. While we can't tell you about any of the projects we are using it on just yet, we can say that it's changing the way we look at data on a daily basis. We really can't say enough about what a great application Gridworks is and about its myriad uses for hacker journalists and data nerds of all stripes.
The impossibility of running a private staging environment behind an Amazon EC2 Elastic Load Balancer (ELB)
As part of our advocating best practices, we encourage anyone developing software to have a proper staging environment. What that means, in a nutshell, is an environment that is in every way identical to your production environment but is not accessible by the public. This allows the developer to thoroughly test a site (bug testing, usability testing, load testing) before it is deployed, and to reasonably expect the behavior of both sites to be identical.
Normally this is a straightforward task, but we have discovered a fundamental problem when trying to set up a staging environment behind an Elastic Load Balancer, Amazon's fully automated pseudo-server that magically routes traffic amongst any number of application servers.
Here is the rub: requests to the app servers appear to come from the ELB. (The original client's IP address is added as an X-Forwarded-For header.) ELBs do not exist within a security group; they are a completely independent entity. Thus the ELB must be specified as allowed within the staging security group in order for traffic to reach the servers. In a production environment this isn't an issue, since traffic from any IP address is allowed on port 80. It causes a big problem for staging, however, because there we only allow traffic from specific IPs. If we add the ELB's IP to the security group, then we have effectively punched a hole in our firewall: all traffic appears to come from the ELB, so all traffic is allowed.
Our resident artist, Brian, has illustrated it thusly:
This means that we cannot control access to our staging environment at the security-group level. Of course, we could use Apache filtering on the X-Forwarded-For header, but this (and similar solutions) must be applied at the server level and would therefore cause our staging and production environments to diverge.
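For the record, such a server-level filter is not complicated. A rough WSGI-middleware sketch of the idea might look like this (the allowlist addresses are made up), but it is precisely the staging-only divergence we would rather avoid:

ALLOWED_IPS = set(['203.0.113.10', '203.0.113.11'])  # hypothetical office IPs

class XForwardedForFilter(object):
    # Rejects requests whose originating client, per the
    # X-Forwarded-For header the ELB sets, is not on the allowlist.
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        forwarded = environ.get('HTTP_X_FORWARDED_FOR', '')
        client_ip = forwarded.split(',')[0].strip()
        if client_ip not in ALLOWED_IPS:
            start_response('403 Forbidden',
                           [('Content-Type', 'text/plain')])
            return [b'Forbidden']
        return self.app(environ, start_response)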
What options does this leave us if we insist on maintaining parity between production and staging? Well, we could dispense with the ELB and implement our own load balancing, but that requires making a change to a production setup that is working fine. More to the point, this seems like an EC2 design flaw: none of us can contrive a reason why the ELBs aren’t embedded in security groups.
And while we are on the subject of ELB problems: why can’t they have Elastic IP addresses? Even if we could just add the ELB to the security group, it wouldn’t be stable because its IP address is not static. This is another problem with Amazon’s implementation.
(Incidentally, there isn’t even a straightforward way to determine the appropriate internal IP of the ELB. It is not the same IP that nslookup resolves, either within Amazon’s network or from outside. The only way I have found it is by examining the traffic logs on the application server.)
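If you find yourself doing the same, a quick one-off like this (log path assumed) will surface the dominant client address, which behind an ELB is the balancer itself:

from collections import Counter

# Tally the client-address field of each access-log line; behind an
# ELB, the balancer's internal IP dominates the count.
with open('/var/log/apache2/access.log') as log:
    counts = Counter(line.split()[0] for line in log if line.strip())

print(counts.most_common(3))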
It is probably reasonable to assume that Elastic Load Balancers have these problems because Amazon implemented them in the simplest way possible and that precluded access to all the rest of the magic in their infrastructure. Certainly, they work fine without all that in the typical case. However, the need for a controlled environment that is functionally identical to our production environment (especially for load testing) really makes this a significant issue.
This all leaves us feeling a bit exasperated. Does anyone have a better solution for these issues? How do you handle your EC2 staging environment?
Yesterday, our friends at ProPublica released the source to their tremendously useful table-setter application (which our own Brian Boyer helped develop during his time there). In brief, table-setter allows you to quickly generate beautiful interactive tables from CSV files or Google Spreadsheets. For a really great example, see this table documenting stimulus spending (click through for the interactive version):
As you can see, this table has sorting, highlighting, graphical components: the works.
These tables can be generated dynamically and hosted on a Rack-capable webserver, or they can be built out to static files. While we are perfectly capable of deploying a Ruby application, it was this second option that really piqued our interest.
Fast, robust and dirt cheap
After a bit of consideration, we decided to use the same solution we use for other static content: S3! By deploying the HTML and all assets to S3, we can serve these tables with no overhead to the rest of our operations. It takes five minutes to set up and will literally cost us a few dollars a month. And by doing it this way we dodge the twin bullets of configuring and maintaining a Ruby environment. To make it work, we put together a small deployment script that:
- Rebuilds the tables
- Removes the table index and makes tables individual files in the root directory, rather than index.html files in their own directory (this cleans up the URLs since S3 does not support default files for directories)
- Gzips the tables and all non-image assets (sketched after this list)
- Synchronizes the entire directory tree to our staging or production S3 instances
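As a rough illustration of the gzip-and-sync steps (with a made-up bucket and paths), the trick is that S3 will not compress on the fly, so you gzip each file yourself and set Content-Encoding so browsers know to decompress:

import gzip
import shutil

from boto.s3.connection import S3Connection

def push_gzipped(bucket, local_path, remote_path):
    # Write a gzipped copy of the file alongside the original.
    gzipped_path = local_path + '.gz'
    with open(local_path, 'rb') as src:
        with gzip.open(gzipped_path, 'wb') as dst:
            shutil.copyfileobj(src, dst)
    # Upload with Content-Encoding: gzip so browsers transparently
    # decompress what S3 serves; Content-Type is explicit because the
    # flattened table files have no .html extension.
    key = bucket.new_key(remote_path)
    key.set_contents_from_filename(
        gzipped_path,
        headers={'Content-Encoding': 'gzip', 'Content-Type': 'text/html'},
        policy='public-read')

conn = S3Connection()  # reads AWS keys from the environment
bucket = conn.get_bucket('tables.example.com')  # hypothetical bucket
push_gzipped(bucket, 'build/some-table.html', 'some-table')

A loop over the built-out tables, plus plain uploads for the images, covers the rest of the sync.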
You can see the completed script on Gist. And because we have a DNS redirect for our production S3 domain, our URLs end up looking really nice.
We would like to thank ProPublica for releasing this really tremendous tool for putting tabular data online. Gracious sharing such as this is what makes open source work.