Tribune DataViz

Matters of interest, from the data reporters and developers across Tribune Publishing

Archive for the ‘Python’ Category

Better web cartography with dot density maps and new tools

with 19 comments

Between Brian, Joe, and myself there hasn’t been a time in the last six months where at least one of us wasn’t working with census data.

Back in February we attacked the less-detailed (redistricting) data for print and the web. In April, May, and June we contributed to a joint effort with an esteemed cadre of news nerds to develop census.ire.org, a site intended to make it easier for journalists to report from census data. And to prepare for this recent release, we even spent a week hacking near-complete prototype maps using data the Census Bureau had already released for Kings County, New York.

We learned hard lessons about the scale and nuance of the census in the last few months, and along the way, further built out our toolkit for making maps. Last week the Census Bureau released detailed (summary file) data for Illinois, and we used our new tools to produce a couple of maps we’re pretty excited about:

These maps demonstrate a map style we haven’t attempted before: dot density mapping. Dot maps let us represent multi-variate data more richly than choropleth maps–for example, illustrating variation in race and population density simultaneously. We were inspired in this effort by Bill Rankin’s Radical Cartography project and Dennis McClendon’s race map for the Encyclopedia of Chicago.

Many of the tools needed to create the maps we wanted didn’t exist. Using the fantastic TileMill as our starting point, we began to build a toolkit.

Invar

Invar automates the generation of map tiles and the deployment of tiles to S3. It is the first and least glamorous of the tools we created, but crucially, it’s very, very fast.

The first time we ever tried to create our own tileset, it took hours to render and twice as long to deploy. Because invar parallelizes these tasks, we can now produce a map in minutes and deploy it just as fast. In fact, we now deploy our maps to four separate S3 buckets so that we can take advantage of Leaflet’s support for round-robining tile requests to multiple subdomains. Fast!
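
The speed comes from parallelization: enumerate every tile coordinate in the zoom range, then farm the rendering out to a pool of worker processes. Here’s a minimal sketch of the idea (not invar’s actual code; render_tile() stands in for a function that renders one tile with Mapnik):

#!/usr/bin/env python
# A simplified sketch of the parallel rendering idea behind invar.
# render_tile() is a stand-in for a function that renders a single
# tile with Mapnik and writes it to disk.
from multiprocessing import Pool

def render_tile(tile):
    zoom, x, y = tile
    # ... render with Mapnik and save to tiles/{zoom}/{x}/{y}.png ...
    pass

def tile_coordinates(min_zoom, max_zoom):
    # Yield every tile coordinate in the zoom range. (This covers the
    # whole world; invar restricts the range to a bounding box.)
    for zoom in range(min_zoom, max_zoom + 1):
        for x in range(2 ** zoom):
            for y in range(2 ** zoom):
                yield (zoom, x, y)

if __name__ == '__main__':
    pool = Pool(processes=8)
    pool.map(render_tile, tile_coordinates(9, 14))

Deployment parallelizes the same way, with workers pushing finished tiles up to S3.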

Englewood

Next we needed to distribute dots across geographies. We found one implementation of dot distribution in Python, which we extended into a module for reuse.

Englewood (named after an ailing Chicago neighborhood that the newspaper writes many sad stories about) uses the Python bindings for GDAL to load data from a PostGIS database or a shapefile. It scatters points within each feature and then writes the points out to a table or a new shapefile.
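
The scattering itself is essentially rejection sampling: draw random points from a feature’s bounding box and keep the ones that fall inside its geometry. A minimal sketch of that idea with the GDAL/OGR bindings (Englewood’s internals differ in the details):

import random

from osgeo import ogr

def random_points_in(geometry, count):
    # GetEnvelope() returns the geometry's bounding box as
    # (min_x, max_x, min_y, max_y).
    min_x, max_x, min_y, max_y = geometry.GetEnvelope()
    points = []
    while len(points) < count:
        point = ogr.Geometry(ogr.wkbPoint)
        point.AddPoint(random.uniform(min_x, max_x),
                       random.uniform(min_y, max_y))
        # Keep the point only if it actually falls inside the geometry.
        if geometry.Contains(point):
            points.append(point)
    return points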

A small snippet of Python is required to configure Englewood. The following code renders the dots for our map of children under five from a database. (A demo using shapefiles can be found in the repository.)

#!/usr/bin/env python

from englewood import DotDensityPlotter 

def get_data(feature):
    """
    This function is called for each feature Englewood processes and needs to return a
    dictionary of classes, with a number assigned to each. Englewood will divide this
    number by a "dots_per" value set below and create that many dots for that class
    within the geography.
    """
    return {
        'hispanic': feature.GetFieldAsInteger(feature.GetFieldIndex('hispanic_under5')),
        'black': feature.GetFieldAsInteger(feature.GetFieldIndex('black_under5')),
        'asian': feature.GetFieldAsInteger(feature.GetFieldIndex('asian_under5')),
        'nhwhite': feature.GetFieldAsInteger(feature.GetFieldIndex('nhwhite_under5'))
    }

# Example argument values passed into the DotDensityPlotter
# In this case features are read from a PostGIS table (under_5_by_race_blocks_shapes)...
source = 'PG:dbname=chicagocensus host=localhost'
source_layer = 'under_5_by_race_blocks_shapes'
# ...and written into another PostGIS table (under_five_dots)
dest_driver = 'PostgreSQL'
dest = 'PG:dbname=chicagocensus host=localhost'
dest_layer = 'under_five_dots'
get_data_callback = get_data
dots_per = 1

dots = DotDensityPlotter(source, source_layer, dest_driver, dest, dest_layer, get_data_callback, dots_per)
dots.plot()

Deployment

A fast and stable process is useless if you can’t repeat it. We’ve built out a Fabric configuration which allows us to make these maps in the quickest and most efficient way possible. Among other things, it allows us to keep some configuration (such as a bounding box) in a per-map YAML file. It parses this file and handles passing the correct arguments to invar for rendering and deployment. Perhaps most exciting, if you’re using the new TileMill 0.4 (available for OSX or Ubuntu) it can completely automate the production of Wax interactivity grids, like the ones we used for the highlighting in our recent maps.
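
To give you a flavor of it, here’s a stripped-down sketch of the pattern (the YAML keys and the rendering command’s arguments are illustrative, not copied from our fabfile):

# A sketch of the per-map Fabric pattern--illustrative, not our fabfile.
import yaml
from fabric.api import local

def render(map_name):
    # Each map keeps its own YAML config, e.g. a bounding box and zoom range.
    config = yaml.safe_load(open('%s/map.yaml' % map_name))
    bbox = ' '.join(str(coord) for coord in config['bbox'])
    # Hand the bounding box and zoom range off to invar for rendering.
    # (Hypothetical arguments--see the invar docs for the real interface.)
    local('ivtile %s.xml tiles %s %s %s' % (
        map_name, bbox, config['min_zoom'], config['max_zoom']))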

[Image via Crayonsman (CC BY-SA 3.0)]

Styling dots

Dot density maps presented new challenges with regard to styling. Brian tried numerous approaches to color and size the dots, but ultimately we settled on a few principles that worked pretty well:

  • Use a dark, sparse base layer (we used a custom-styled Google Maps layer, but would like to move to an OpenStreetMap base layer in the future).
  • Make your dots stand out brightly. Try fluorescent colors from the palette of Crayola crayons.
  • Play with transparency–you may want to take advantage of the effect of overlapping transparent dots.
  • Make dots scale on zoom.
  • Whenever possible, use one dot per individual. It’ll make for a more interesting map.

Here is the style we settled on:

#under-five {
  marker-fill:#f00;
  marker-line-width:0;
  marker-allow-overlap: true;
  marker-opacity:1;
  [group="asian"] {marker-fill:#FF496C;}
  [group="black"] {marker-fill:#76FF7A;}
  [group="hispanic"] {marker-fill:#FFCF48;}
  [group="nhwhite"] {marker-fill:#7366BD;}
  
  [zoom=9] {marker-height:.2;}
  [zoom=10] {marker-height:.3;}
  [zoom=11] {marker-height:.5; marker-opacity:.7;}
  [zoom=12] {marker-height:.8; marker-opacity:.7;}
  [zoom=13] {marker-height:1; marker-opacity:.8;}
  [zoom=14] {marker-height:1.5; marker-opacity:.8;}
}

Wrapping up

Although I’ve linked to a number of projects and code snippets in this post, you may find it useful to see a complete project. This week, with Illinois under our belt, I decided to apply the same methodology to my side project, Hack Tyler. I produced a map of race in Smith County, Texas (related blog post). Part of Hack Tyler’s modus operandi is developing in a completely transparent manner. As a result, you can see complete examples of both our backend and client-side mapping rigs in the Hack Tyler projects on GitHub.

We hope that we’ve pushed the envelope a bit with these new maps. Someone said that this was the year cartographers retake the internet. I hope that’s true. It’s about time that online maps were more than just shaded boxes.


Written by Christopher Groskopf

August 12, 2011 at 4:02 pm

Don’t break the homepage! Using unit tests to validate your WordPress + Varnish caching strategy

with 3 comments

For the last few months my coworkers Ryan Mark and Ryan Nagle have been engaged in rebuilding the Tribune’s blog network, ChicagoNow.com (new version not yet launched). Much of the rebuild has revolved around migrating to a new architecture–Apache, WordPress, and Varnish. Being the “Varnish guy” on the team, I was asked to implement a caching strategy similar to the one we developed for TribLocal.com last summer. Perhaps the most important lesson I learned from TribLocal’s development was that our cache layer was extremely fragile–prone to breaking whenever a new version of the site was deployed. Worse yet, minor changes to the VCL configuration frequently had unforeseen consequences that cascaded across the site.

My intention with ChicagoNow was to do a much better job of compartmentalizing the various elements of the configuration to ensure changes were idempotent. Brian, however, had an even better idea: unit tests. For those who want to cut to the chase, here is the Python unittest script for testing our Varnish cache layer. The script knits together three unrelated technologies to let us test how changes to the site affect the cache:

The unittest module provides a convenient framework for isolating tests from one another, and python-requests dramatically simplifies the process of making various types of requests and testing the response headers. The VCL for the site attaches “X-Cache” and “X-Cache-Hit” debug headers indicating whether the response was a cache hit or miss and, if it was a hit, how many hits there have been. The final component in the stack is the python-wordpress-xmlrpc library, which makes it nearly trivial to create posts and comments programmatically. This is crucial for testing whether the cache is being correctly invalidated.
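
Creating a test post programmatically looks roughly like this (a sketch of the library’s documented API; the URL and credentials are placeholders, and the helpers on our base class wrap something similar):

from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods.posts import NewPost

# Placeholder URL and credentials--point this at a test blog.
client = Client('http://example.com/xmlrpc.php', 'username', 'password')

post = WordPressPost()
post.title = 'Cache invalidation test post'
post.content = 'If this shows up right away, the purge on save worked.'
post.post_status = 'publish'

post_id = client.call(NewPost(post))

Let’s take a look at a couple of tests to see how these three components work together.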

class TestLoggedOut(TestCachingBase):
    """
    Tests for logged-out users.
    """
    def test_homepage(self):
        url = build_url('/')

        response = self.get_twice(url)

        self.assertHit(response)
        self.assertMaxAge(response, 300)

    def test_homepage_login_logout(self):        
        url = build_url('/')

        cookies = self.login()
        cookies = self.logout(cookies)

        response = self.get_twice(url, cookies=cookies)

        self.assertHit(response)

Here we see two tests for logged-out users. The first test fetches the homepage of the site twice and asserts that the second request was a hit, and also that the max-age header is what we declared it to be in the VCL. The second test logs the user in and then out, and then, while maintaining the same cookie jar, hits the homepage twice to ensure that the user gets cached results after logging out. Let’s look at one more test from TestLoggedOut to see how python-wordpress-xmlrpc comes into play:

    def test_edit_post(self):
        url = build_url('/%s/' % BLOG)

        post_id = self.new_post(BLOG)

        response = self.get_twice(url)
        self.assertHit(response)

        self.edit_post(BLOG, post_id)

        response = requests.get(url)
        self.assertMiss(response)

This test creates a new post on a particular blog, then hits the blog homepage twice to ensure it is cached. It then edits the post and hits the blog homepage once more to ensure that the cache was invalidated. Most of the actual functionality has been abstracted into the TestCachingBase base class, so the individual tests are brief and readable. The full test suite includes several other complex scenarios.
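
For the curious, the helpers are mostly thin wrappers around python-requests and the debug headers described above. A condensed sketch of the idea (the real base class does more, and the exact header values depend on your VCL):

import unittest

import requests

class TestCachingBase(unittest.TestCase):
    # A condensed sketch--assumes the VCL sets X-Cache to "hit" or "miss".
    def get_twice(self, url, cookies=None):
        # The first request warms the cache; the second is the one we test.
        requests.get(url, cookies=cookies)
        return requests.get(url, cookies=cookies)

    def assertHit(self, response):
        self.assertEqual(response.headers['x-cache'], 'hit')

    def assertMiss(self, response):
        self.assertEqual(response.headers['x-cache'], 'miss')

    def assertMaxAge(self, response, seconds):
        self.assertIn('max-age=%i' % seconds, response.headers['cache-control'])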

It’s unfortunate that we lack a good “code coverage” metric for this sort of testing, and of course these aren’t true “unit tests,” but using unit testing tools in this way goes a long way toward organizing the logic. The best thing about this cache testing rig is knowing that if we change something, we can easily check whether we have broken anything else (i.e. regression testing). This was a frequent problem with TribLocal, and I think this implementation will keep it from happening again.

Get the code for the unittest script.
Get the code for the ChicagoNow Varnish Configuration.
Get the code for our simple Newsapps Varnish plugin for WordPress.

Written by Christopher Groskopf

June 28, 2011 at 4:01 pm

Recent open source work and other projects we’re excited about

with 3 comments

In case you missed ’em, here’s some fun stuff we’ve been up to lately. For updates on our work in-between blog posts, follow us at @tribapps and github.com/newsapps.

appengine-autotweeter

A simple application for auto-tweeting an RSS feed. Built because TwitterFeed and other services weren’t fast enough for breaking news. Powers @ChicagoBreaking.

census.ire.org

A simple browser for census data, built with journalists in mind. Made in conjunction with news hackers from CNN, NYT, Spokesman-Review, University of Nebraska and USA Today. Check out the source code and the other great census resources provided by IRE.

Cloud Commander

Scripts you can use to quickly build and launch servers in Amazon’s cloud — you won’t forget how you built the server if you wrote a script to do it.

csvkit

csvkit is a library of utilities for working with CSV, the king of tabular file formats. If you’re working with CSV files or other spreadsheets of data, try it out. We use it every day.
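
For example, slicing a couple of columns out of a big CSV is a one-liner (csvcut is the tool that started it all):

$ csvcut -n data.csv                  # list the column names
$ csvcut -c 2,5 data.csv > subset.csv # keep only columns 2 and 5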

invar

Command line tools for generating map images from Mapnik configuration. Use it to render the custom map tile sets you’re working on in TileMill. It’s fast. (If you’re not making your own tile sets, check out the tutorial and get on it!)

The Newsapps Boundary Service, now with 100% more django

The Boundary Service is a ready-to-deploy system for aggregating regional boundary data. We’ve recently made it a pluggable django app. Check out Hack Tyler’s instance for the hot demo: boundaryservice.hacktyler.com.

PANDA

PANDA is your new data research toolkit (to help you search and compare data) & newsroom data management system (’cuz data stuck on your hard drive is sad data). Special thanks to the Knight Foundation for funding the development! For more details and updates on our progress, follow @pandaproject.

Written by Brian Boyer

June 24, 2011 at 8:39 am

PyCon 2011: Newshackers assemble!

with one comment

PyCon 2011, Atlanta, March 9-17

At PyCon 2011 I will be delivering a talk entitled Best Practices for Impossible Deadlines, which will go in depth on the incredible variety of ways we’ve found to shave precious hours off the software development process. Special attention will be paid to Python and Django tools developed in the newsroom.

I know that many of you reading this will be blowing your conference budgets to attend the 2011 CAR conference in February, but if you’re not, or if you can make it to both, I strongly encourage you to come to PyCon this year. It’s got an impressive line-up of talks and tutorials. I’m also planning to organize an OpenSpace for news developers, and perhaps another for those working with government data, if someone doesn’t beat me to them.

If you’re going to be there, please let me know so we can plan to connect early in the conference. Hope to meet many of you there!

Written by Christopher Groskopf

January 10, 2011 at 9:58 am

Posted in Events, Python, Talks

From spreadsheet to HTML in 15 minutes with python-tablefu, Jinja and Flask

with 2 comments

[Image: The best Christmas carol]

We often need to take a spreadsheet of info and lay it out in HTML on deadline. Typically we use ProPublica’s TableSetter, which takes a Google spreadsheet and generates an HTML table. TableSetter can be tweaked in a bunch of different ways to customize the generated table, but as with all specialized tools it has its limits. Luckily, it’s easy to create a rudimentary TableSetter clone in Python quite quickly.

This week I got a spreadsheet of Christmas carols with YouTube embed codes to go along with a story about a Northern Illinois University professor and carol expert who recently died. The shape of the data in the spreadsheet lent itself more to a top-ten-style list than a table, so TableSetter was not the best tool for the job. The spreadsheet was only 25 rows, but I was not about to build all the HTML by hand.

I exported the spreadsheet to a CSV file, created a simple HTML template, and wrote a short script to mash the two together. I used Chris Amico’s Python clone of ProPublica’s TableFu and the great templating library Jinja.

Here is the result.

Here is the code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import jinja2
import table_fu

TEMPLATE_DIR = 'templates'
DATA_FILE = 'data/xmas-carols.csv'
TEMPLATE_FILE = 'xmas-carols.html'
OUTPUT_FILE = 'build/xmas-carols.html'

# Set up jinja templates. Look for templates in the TEMPLATE_DIR
env = jinja2.Environment(loader=jinja2.FileSystemLoader(TEMPLATE_DIR))

# Open the csv file and load it up with TableFu
table = table_fu.TableFu(open(DATA_FILE, 'U'))

# Get the template and render it to a string. Pass table in as a var called table.
html = env.get_template(TEMPLATE_FILE).render(table=table)

# Write the html string to our OUTPUT_FILE
o = open(OUTPUT_FILE, 'w')
o.write(html)
o.close()

Jinja templates are very similar to Django templates. The biggest difference I can discern is Jinja’s much more powerful template syntax. It does stuff that I always thought Django templates should be able to do.
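
For example, Jinja will evaluate arbitrary expressions and pass arguments to filters right in the template, neither of which Django’s template language allows. A contrived illustration:

from jinja2 import Template

# Inline arithmetic and filter arguments--no custom template tags needed.
template = Template("{{ (carols|length / 2)|round(1) }} carols is half the list")
print(template.render(carols=['Silent Night', 'O Holy Night', 'The First Noel']))
# prints: 1.5 carols is half the list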

Well, this script is kinda boring without the template:

<ul id="xmas-songs">
{% for row in table.rows %}
    <li>
        {% if row['Year featured']|trim %}
        <div class="featured">{{ row['Year featured'] }}</div>
        {% endif %}
        <div class="name">{{ row['Name'] }}</div>
        {% if row['youtube'] %}
        <div class="youtube">{{ row['youtube']|safe }}</div>
        {% endif %}
        <div class="written"><small>Written:</small>
            <strong>{{ row['Year written'] }}</strong></div>
        <div class="origin"><small>Country of origin:</small>
            <strong>{{ row['country of origin'] }}</strong></div>
        <div class="composer"><small>Originally by:</small>
            <strong>{{ row['original composer/lyricist'] }}</strong>
            {{ row['Other composers/lyricists'] }}</div>
        {% if row['keywords']|trim %}
        <div class="keywords"><small>Memorable lyrics</small>
            {{ row['keywords'] }}</div>
        {% endif %}
        {% if row['famous versions']|trim %}
        <div class="famous"><small>Famous renditions</small>
            {{ row['famous versions'] }}</div>
        {% endif %}
        {% if row['noteable']|trim %}
        <div class="notable"><small>Did you know?</small>
            {{ row['noteable'] }}</div>
        {% endif %}
    </li>
{% endfor %}
</ul>

That’s it. There are a lot of cool things you can do with TableFu that aren’t illustrated here, and Jinja does a ton of stuff; both projects’ documentation is pretty extensive.

Oh there’s one other thing. If you don’t feel like dropping back to the shell to run the script to update your HTML with your new CSS or HTML changes, you may want to …

Sprinkle a little Flask into the mix

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import jinja2
import table_fu
from flask import Flask

TEMPLATE_DIR = 'templates'
DATA_FILE = 'data/xmas-carols.csv'
TEMPLATE_FILE = 'xmas-carols.html'
OUTPUT_FILE = 'build/xmas-carols.html'

app = Flask(__name__)

@app.route("/")
def main():
    env = jinja2.Environment(loader=jinja2.FileSystemLoader(TEMPLATE_DIR))
    table = table_fu.TableFu(open(DATA_FILE, 'U'))

    html = env.get_template(TEMPLATE_FILE).render(table=table)

    o = open(OUTPUT_FILE, 'w')
    o.write(html)
    o.close()

    return html

if __name__ == "__main__":
    app.run()

There you go. Just run your script, visit http://localhost:5000/ and see your handiwork. Reload the page when you make changes. And when you’re all done, just grab the output file and drop it where it needs to go.

Written by Ryan Mark

December 7, 2010 at 5:07 pm

Posted in Python, Recipes

Bees with machine guns! Low-cost, distributed load-testing using EC2.

with 36 comments

The challenge

You don’t get to start optimizing until you’ve made the servers fucking cry.
Brian Boyer

In June we launched Chicago Breaking Business, which is, at least in terms of traffic, our largest site so far. Both before and during that project I had been iterating toward a load-testing solution that would meet the challenge of the above quote and provide us with a reasonable certainty that sites of this size (and larger) would not collapse under sudden waves of traffic. The traditional load-testing tools we tried (ab, flood, jmeter, etc.), while often useful, suffered from a number of problems when trying to simulate this sort of event:

  • Testing clients may have an operating system imposed maximum number of out-bound connections.
  • Testing a staging environment from a development workstation introduces network latency, firewalling, and traffic shaping as variables in the tests.
  • Testing from a single client will cause Amazon’s Elastic Load Balancer to route all connections to a single application server, no matter how many are available (so-called “sticky sessions”).

What we needed was a solution that allowed us to use multiple, topographically-close clients to all simultaneously assault our servers with traffic.

Put another way, what we needed were bees with machine guns.

The solution

Bees with machine guns is a Fabric script that I put together to aggressively load-test our sites. It provides methods to spin up an army of EC2 instances (the bees), assault a target server with a configurable amount of traffic, and then spin those instances back down.

Here is an example of what a Bees session might look like:

$ fab -f bees.py up:2
Connecting to the hive.
Attempting to call up 2 bees.
Waiting for bees to load their machine guns...
.
.
.
.
Bee i-df450ab4 is ready for the attack.
.
Bee i-dd450ab6 is ready for the attack.
The swarm has assembled 2 bees.
Done.

$ fab -f bees.py attack:http://YOUR-DOMAIN.com/,1000,100
Read 2 bees from the roster.
Connecting to the hive.
Assembling bees.
Each of 2 bees will fire 500 rounds, 50 at a time.
Stinging URL so it will be cached for the attack.
[localhost] run: curl http://YOUR-DOMAIN.com/ >> /dev/null
Organizing the swarm.
Bee 0 is joining the swarm.
Bee 1 is joining the swarm.
Bee 0 is firing his machine gun. Bang bang!
Bee 1 is firing his machine gun. Bang bang!
Bee 0 is out of ammo.
Bee 1 lost sight of the target (connection timed out).
Offensive complete.
     Target failed to fully respond to 1 bees.
     Complete requests:		500
     Requests per second:	81.670000 [#/sec] (mean)
     Time per request:		612.188000 [ms] (mean)
     50% response time:		358.000000 [ms] (mean)
     90% response time:		1541.000000 [ms] (mean)
Mission Assessment: Target successfully fended off the swarm.
The swarm is awaiting new orders.
Done.

$ fab -f bees.py down
Read 2 bees from the roster.
Connecting to the hive.
Calling off the swarm.
Stood down 2 bees.
Done.

(Note: Bees assumes you have AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set as environment variables–just as you would if you were using the Amazon command line tools.)
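
Under the hood there’s nothing exotic: boto talks to EC2, and the attack divides the requested load evenly among the bees (you can see the division in the transcript above: 1,000 requests across 2 bees is 500 rounds each). Here’s a simplified sketch of the spin-up step (not the actual bees.py code; the AMI, key pair, and security group are placeholders):

import boto

def up(count):
    # boto reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the
    # environment, per the note above.
    conn = boto.connect_ec2()
    # Placeholder AMI, key pair and security group.
    reservation = conn.run_instances(
        'ami-XXXXXXXX',
        min_count=count,
        max_count=count,
        key_name='your-key-pair',
        security_groups=['your-security-group'],
        instance_type='m1.small')
    return [instance.id for instance in reservation.instances]

Each bee then fires its share of the requests at its share of the concurrency and reports its stats back for aggregation.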

Using the Bees I was able to verify that our new load-balanced and Varnished servers could handle a sustained load significantly beyond our requirements. Obviously this testing strategy doesn’t attempt to address all scenarios and it should be used in conjunction with traditional testing tools such as jmeter. That being said, it also provides a pretty interesting example of using EC2 to solve problems that can really only be properly addressed with distributed computing. This particular example is very well-suited to validating caching and scaling strategies.

The caveat

If you decide to use the Bees, please keep in mind the following important caveat: they are, more or less, a distributed denial-of-service attack in a fancy package. If you point them at any server you don’t own, you will be behaving unethically, will have your Amazon Web Services account locked out, and will be liable in a court of law for any downtime you cause.

You have been warned.

Now then–how does your server handle thousands of simultaneous connections?

Written by Christopher Groskopf

July 8, 2010 at 9:12 am

Advanced django project layout

with 24 comments

[Image: default Django project layout versus news apps project layout]

We’re releasing our project layout for Django, based on Gareth Rushgrove’s lovely django-project-templates. If you’ve found yourself unsatisfied with the default layout, or you’re using our fabfile or EC2 image, you might be interested in using our project layout.

The default Django project layout makes it dead simple to learn the framework and get an application up and running. But it can quickly get cumbersome as your application grows and you have to figure out how to handle deployment. A few projects, most notably Pinax, have their own ways to organize large projects.

Here are the things we need that the default layout doesn’t provide:

  • Separate settings, Apache configuration files and WSGI handlers for local development, a staging server and a production server.
  • A separate place for the various primary source data files (CSV, JSON, shape files) we typically have in a project.
  • A place to put Django apps that does not clutter up the root directory of the project.
  • A library directory to keep various reusable helper functions that are not Django applications.
  • A template directory and media directory for the entire project.
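
To make that concrete, here’s roughly the kind of tree those requirements imply (a sketch, not the template’s exact output):

example_project/
    apps/             # Django apps, out of the project root
    lib/              # reusable helpers that aren't Django apps
    data/             # primary source data: CSV, JSON, shapefiles
    media/            # project-wide media
    templates/        # project-wide templates
    config/
        staging/      # settings, Apache config and WSGI handler...
        production/   # ...one set per environment
    settings.py       # local development settings
    fabfile.py
    manage.py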

Gareth’s project is really well organized and addresses all of these issues. We tweaked his templates to match our use case.

Getting off the ground

  1. Clone my fork of django-project-templates.
    git clone git://github.com/ryanmark/django-project-templates.git
  2. Install the templates. It will install the dependencies: PasteScript, Cheetah and Fabric. You may want to use a new virtualenv.
    python setup.py install
  3. Create a new project from the News Apps Paste template.
    paster create --template=newsapps_project example_project
  4. You’ll be asked for staging and production domains, a git repository location and a database password. These settings will be put in the fabfile and used for deployment. You’ll also be asked for a secret key, which is used internally by Django. It’s okay to press enter and accept the defaults. The template will still get created; you’ll just have to edit the fabfile later if you plan on deploying the project to a staging or production server.

The template contains a lot of personal preference, but it’s been very useful for us on a handful of projects. We are all quite satisfied with it. Take it, use it, tell us what you think!

Written by Ryan Mark

March 8, 2010 at 2:30 pm

Posted in Infrastructure, Python