Tribune DataViz

Matters of interest, from the data reporters and developers across Tribune Publishing

Archive for the ‘Open Source’ Category

Better web cartography with dot density maps and new tools

with 19 comments

Between Brian, Joe, and myself there hasn’t been a time in the last six months when at least one of us wasn’t working with census data.

Back in February we attacked the less-detailed (redistricting) data for print and the web. In April, May, and June we contributed to a joint effort with an esteemed cadre of news nerds to develop census.ire.org, a site intended to make it easier for journalists to report from census data. And to prepare for this recent release, we even spent a week hacking near-complete prototype maps using data that the Census Bureau had already released for Kings County, New York.

We learned hard lessons about the scale and nuance of the census in the last few months, and along the way, further built out our toolkit for making maps. Last week the Census Bureau released detailed (summary file) data for Illinois, and we used our new tools to produce a couple of maps we’re pretty excited about:

These maps demonstrate a map style we haven’t attempted before: dot density mapping. Dot maps let us represent multi-variate data more richly than choropleth maps–for example, illustrating variation in race and population density simultaneously. We were inspired in this effort by Bill Rankin’s Radical Cartography project and Dennis McClendon’s race map for the Encyclopedia of Chicago.

Many of the tools needed to create the maps we wanted didn’t exist. Using the fantastic TileMill as our starting point, we began to build a toolkit.

Invar

Invar automates the generation of map tiles, and the deployment of tiles to S3. It is the first and least glamorous of the tools we created, but crucially, it’s very, very fast. Fast!

The first time we ever tried to create our own tileset, it took hours to render and twice as long to deploy. Thanks to invar’s parallelizing these tasks, we can now produce a map in minutes and deploy it just as fast. In fact, we now deploy our maps to four separate S3 buckets so that we can take advantage of Leaflet’s support for round-robining tile requests to multiple subdomains. Fast!
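To make the subdomain trick concrete, here is a minimal Python sketch of how a client can spread tile requests across several hosts. The host names and URL layout are hypothetical, and the selection rule (sum of the tile's x and y, modulo the number of hosts) is one common deterministic scheme, offered as an illustration rather than a description of any particular Leaflet release.

```python
# Hypothetical host names; in practice each would point at a separate S3 bucket.
SUBDOMAINS = ["maps1", "maps2", "maps3", "maps4"]

# Hypothetical URL layout: {s} is the subdomain, followed by zoom/x/y.
URL_TEMPLATE = "http://{s}.example.com/tiles/{z}/{x}/{y}.png"


def pick_subdomain(x, y, subdomains=SUBDOMAINS):
    """Choose a host deterministically so a given tile always maps to the same URL."""
    return subdomains[abs(x + y) % len(subdomains)]


def tile_url(z, x, y):
    """Build the URL for one tile, spreading neighbouring tiles across hosts."""
    return URL_TEMPLATE.format(s=pick_subdomain(x, y), z=z, x=x, y=y)


if __name__ == "__main__":
    # Four horizontally adjacent tiles land on four different hosts.
    for x in range(262, 266):
        print(tile_url(10, x, 380))
```

Because the choice depends only on the tile coordinates, the same tile is always requested from the same host, so browser caching keeps working while neighbouring tiles download in parallel.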

Englewood

Next we needed to distribute dots across geographies. We found one implementation of dot distribution in Python, which we extended into a module for reuse.

Englewood (named after an ailing Chicago neighborhood that the newspaper writes many sad stories about) uses the Python bindings for GDAL to load data from PostGIS or shapefile. It scatters points within each feature and then writes the points out to a table or new shapefile.

A small snippet of Python is required to configure Englewood. The following code renders the dots for our map of children under five from a database. (A demo using shapefiles can be found in the repository.)

#!/usr/bin/env python

from englewood import DotDensityPlotter 

def get_data(feature):
    """
    This function is called for each feature Englewood processes and needs to return a
    dictionary of classes, with a number assigned to each. Englewood will divide this
    number by a "dots_per" value set below and create that many dots for that class
    within the geography.
    """
    return {
        'hispanic': feature.GetFieldAsInteger(feature.GetFieldIndex('hispanic_under5')),
        'black': feature.GetFieldAsInteger(feature.GetFieldIndex('black_under5')),
        'asian': feature.GetFieldAsInteger(feature.GetFieldIndex('asian_under5')),
        'nhwhite': feature.GetFieldAsInteger(feature.GetFieldIndex('nhwhite_under5'))
    }

# Example argument values passed into the DotDensityPlotter
# In this case features are read from a PostGIS table (under_5_by_race_blocks_shapes)...
source = 'PG:dbname=chicagocensus host=localhost'
source_layer = 'under_5_by_race_blocks_shapes'
# ...and written into another PostGIS table (under_five_dots)
dest_driver = 'PostgreSQL'
dest = 'PG:dbname=chicagocensus host=localhost'
dest_layer = 'under_five_dots'
get_data_callback = get_data
dots_per = 1

dots = DotDensityPlotter(source, source_layer, dest_driver, dest, dest_layer, get_data_callback, dots_per)
dots.plot()
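For readers curious about what "scattering points within each feature" involves, the idea can be sketched in plain Python with rejection sampling: draw random points from the feature's bounding box and keep the ones that fall inside the polygon. This is an illustration under stated assumptions, not Englewood's actual implementation (which works through GDAL). The function names, the integer division by dots_per, and the sample polygon are all hypothetical, and the polygon is assumed to have non-zero area.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as [(x, y), ...]?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def scatter_dots(polygon, count, dots_per=1, rng=random):
    """Place count // dots_per random points inside the polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)

    dots = []
    while len(dots) < count // dots_per:
        # Draw from the bounding box and keep only points that fall inside.
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        if point_in_polygon(x, y, polygon):
            dots.append((x, y))
    return dots


if __name__ == "__main__":
    # A hypothetical L-shaped census block holding 25 people, 5 people per dot.
    block = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
    for dot in scatter_dots(block, count=25, dots_per=5, rng=random.Random(42)):
        print(dot)
```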

Deployment

A fast and stable process is useless if you can’t repeat it. We’ve built out a Fabric configuration that lets us make these maps as quickly and efficiently as possible. Among other things, it allows us to keep some configuration (such as a bounding box) in a per-map YAML file. It parses this file and handles passing the correct arguments to invar for rendering and deployment. Perhaps most exciting, if you’re using the new TileMill 0.4 (available for OS X or Ubuntu) it can completely automate the production of Wax interactivity grids, such as we used to do the highlighting in our recent maps.
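As a rough illustration of the config-to-command step, the sketch below turns a per-map configuration into an ivtile argument list. It is not the team's Fabric file: the dictionary stands in for a parsed YAML file, every key name is hypothetical, and the argument order simply mirrors the ivtile invocation shown in the side-by-side maps post on this blog.

```python
# Stand-in for a parsed per-map YAML file; every key name here is hypothetical.
MAP_CONFIG = {
    "mapnik_xml": "map.xml",
    "tiles_dir": "tiles_dir",
    "bbox": {"lat_1": 42.123, "lon_1": -88.0401, "lat_2": 41.5446, "lon_2": -87.4241},
    "min_zoom": 10,
    "max_zoom": 16,
    "process_count": 2,
    "buffer": 512,
}


def build_ivtile_command(config):
    """Turn a config dictionary into an argument list for invar's ivtile."""
    bbox = config["bbox"]
    return [
        "ivtile",
        config["mapnik_xml"],
        config["tiles_dir"],
        str(bbox["lat_1"]), str(bbox["lon_1"]),
        str(bbox["lat_2"]), str(bbox["lon_2"]),
        str(config["min_zoom"]), str(config["max_zoom"]),
        "--process_count", str(config["process_count"]),
        "--buffer", str(config["buffer"]),
    ]


if __name__ == "__main__":
    # Print the command rather than running it; pass the list to
    # subprocess.check_call() to actually render.
    print(" ".join(build_ivtile_command(MAP_CONFIG)))
```

Keeping the bounding box and zoom range in data rather than in a shell history is what makes the render repeatable.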

Via Crayonsman (CC BY-SA 3.0)

Styling dots

Dot density maps brought new styling challenges. Brian tried numerous approaches to coloring and sizing the dots, and ultimately we settled on a few principles that worked pretty well:

  • Use a dark, sparse base-layer (we used a custom-styled Google Maps layer, but would like to move to an OpenStreetMap base-layer in the future).
  • Make your dots stand out brightly. Try the fluorescent colors from the palette of Crayola crayons.
  • Play with transparency–you may want to take advantage of the effect of overlapping transparent dots.
  • Make dots scale on zoom.
  • Whenever possible, use one dot per individual. It’ll make for a more interesting map.

Here is the style we settled on:

#under-five {
  marker-fill:#f00;
  marker-line-width:0;
  marker-allow-overlap: true;
  marker-opacity:1;
  [group="asian"] {marker-fill:#FF496C;}
  [group="black"] {marker-fill:#76FF7A;}
  [group="hispanic"] {marker-fill:#FFCF48;}
  [group="nhwhite"] {marker-fill:#7366BD;}
  
  [zoom=9] {marker-height:.2;}
  [zoom=10] {marker-height:.3;}
  [zoom=11] {marker-height:.5; marker-opacity:.7;}
  [zoom=12] {marker-height:.8; marker-opacity:.7;}
  [zoom=13] {marker-height:1; marker-opacity:.8;}
  [zoom=14] {marker-height:1.5; marker-opacity:.8;}
}

Wrapping up

Although I’ve linked to a number of projects and code snippets in this post, you may find it useful to see a complete project. This week, with Illinois under our belt, I decided to apply the same methodology to my side-project, Hack Tyler. I produced a map of race in Smith County, Texas (related blog post). Part of Hack Tyler’s modus operandi is developing in a completely transparent manner. As a result, you can see complete examples of both our backend and client-side mapping rigs in the following projects:

We hope that we’ve pushed the envelope a bit with these new maps. Someone said that this was the year cartographers retake the internet. I hope that’s true. It’s about time that online maps were more than just shading boxes.


Written by Christopher Groskopf

August 12, 2011 at 4:02 pm

Making maps, part 6: side-by-side comparison and high speed tile rendering and deployment

with 8 comments

This is a bonus post of a five-part series about our recent explorations making choropleth maps using PostGIS, TileMill, Mapnik and Google Maps. Read parts one, two, three, four and five for the full effect. The impatient may grab the source code from GitHub, or try out the demo.

Reshaping Chicago's political map: race, ward-by-ward

If you haven’t noticed, we here at News Apps love our maps and are always looking for new ways to improve each iteration. With Census data for Illinois dropping any week now, we brainstormed ways to visualize what people were most interested in: the changes between 2000 and 2010. The ideas thrown on the table were many and varied, but we were particularly geeked about the idea of creating side-by-side maps and created a rig we ended up using for the recent ward redistricting story.

Side-by-side map comparisons aren’t new. Texas Tribune recently created one showing a congressional redistricting proposal. The Center for Urban Research created comparison maps for block-by-block race and ethnicity trends. Both are great maps that show big picture trends and shifts, but we were itching to take our map a step further and bring in interactivity, old-school style. Instead of a sliding bar, which can make changes and trends difficult to track, why not use two maps that show the same center and move and zoom synchronously?

Using some snappy JavaScript and Chris’ new tile rendering and deployment code, we made ourselves a beautiful side-by-side comparison of population change by race.

Synchronization

Using the Google Maps JavaScript API, we added an event listener to each map. When one map moves, the listener loops over the others and sets each buddy map to the same center and zoom. A guard flag, cleared by a very brief setTimeout, keeps the maps from re-triggering one another and reduces the jerky movement caused by syncing the maps as they move.

function MapMoveListener(map) {
    this.listener = google.maps.event.addListener(map, 'bounds_changed', function() {
        if (!window.ignore_move_end) {
            window.ignore_move_end = true;
            for (var i = 0; i < maps.length; i++) {
                if (maps[i] != map) {
                    maps[i].setCenter(map.getCenter());
                    maps[i].setZoom(map.getZoom());
                }
            }
            window.location.hash = make_hash(map);
            // Pass a function rather than a string so nothing is eval'd.
            setTimeout(function() { window.ignore_move_end = false; }, 100);
        }
    });
}

function sync_maps(maps) {
    window.maps = maps;
    window.ignore_move_end = false;
    for (var i = 0; i < maps.length; i++) {
        var mml = new MapMoveListener(maps[i]);
    }
}
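The reason for the guard flag is easier to see in a stripped-down model. The Python sketch below is not the Google Maps API: MapView and MapSynchronizer are hypothetical stand-ins, and because this model is synchronous the flag is cleared as soon as propagation finishes rather than after a timeout. It shows that without the flag, each programmatic move would fire the listener again and the maps would keep echoing each other.

```python
class MapView:
    """Tiny stand-in for a map object; it only tracks center and zoom."""

    def __init__(self, name):
        self.name = name
        self.center = (0.0, 0.0)
        self.zoom = 0
        self.listeners = []

    def set_view(self, center, zoom):
        self.center = center
        self.zoom = zoom
        # Like 'bounds_changed', this fires for programmatic moves too.
        for listener in self.listeners:
            listener(self)


class MapSynchronizer:
    """Mirror any move on one map to all the others, without echoing it back."""

    def __init__(self, maps):
        self.maps = maps
        self.ignore_moves = False
        self.sync_count = 0
        for m in maps:
            m.listeners.append(self.on_move)

    def on_move(self, source):
        if self.ignore_moves:
            return  # This move was caused by us; do not propagate it again.
        self.ignore_moves = True
        try:
            self.sync_count += 1
            for m in self.maps:
                if m is not source:
                    m.set_view(source.center, source.zoom)
        finally:
            self.ignore_moves = False


if __name__ == "__main__":
    left, right = MapView("2000"), MapView("2010")
    sync = MapSynchronizer([left, right])
    left.set_view((41.88, -87.63), 11)
    print(right.center, right.zoom, sync.sync_count)
```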

Invar

Say goodbye to the days of leaving the computer on overnight in hopes that the tiles will be properly rendered and deployed. Now, with Chris’ new toolkit, Invar, we just pass in a few parameters (the latitude and longitude of the bounding box’s corners, plus the minimum and maximum zoom levels) and we can render and deploy tens of thousands of tiles to S3 in less than 15 minutes.

Here’s the command that we used to render tiles. In this case we are using two processes and rendering a buffer of 512 pixels around every tile. This buffer prevents large features, like ward numbers, from being clipped at high zoom levels.

ivtile map.xml tiles_dir 42.123 -88.0401 41.5446 -87.4241 10 16 --process_count 2 --buffer 512

To deploy the tiles, we use the ivs3 command from invar:

ivs3 tiles_dir media.example.com/map/tiles --concurrency 32

If you are interested in making your own synchronized maps, we have made our base tile layers and data resources available, so hack away. Download invar from GitHub.

There are many more maps to come, so please check out the finished map. We are always looking for feedback.

Written by Emily Chow

July 18, 2011 at 3:06 pm

Apps for Metro Chicago

with 2 comments

Sharing our work is one of the core values of the News Apps team. With every project, we ask ourselves if we can feasibly offer the base data for download, and the answer is usually yes. We also release open-source code whenever we find something that can be generalized for re-use.

We’d like to invite the community of developers participating in the Apps for Metro Chicago contest to use our data and resources however you see fit. Here’s a list of things which may be useful:

JSON Data APIs

  • School data: we’ve recently enabled JSON access to the data which drives our School Report Cards application.
  • Census data: we’re part of a team that built a 2010 US Census site which also provides JSON access to the data, as well as bulk exports which help with the complexity of comparing data with the 2000 census. Illinois data hasn’t been released yet, but you can experiment with 18 other states so far, so that you’re ready when Illinois data drops.
  • Geographic data: we have a running instance of django-boundaryservice with locally-oriented shapefiles. This is not designed for heavy load, so for a production project, you should plan to set up your own boundary service. We may need to retract the service if traffic is too high, but we’re happy to have you use it for building a proof-of-concept.

Base Tile Layers for Maps

Slippy-map javascript libraries like the Google Maps JavaScript API and Leaflet.js support custom layers. You can use the layers we’ve built in your own app.

In general, custom layers work by providing a number of files under a root URL so that the map javascript can compute a URL for a specific tile and retrieve it. Using the Leaflet URL template syntax, the basic URL format for using any of our base layers is

You can see layers we’ve made at our map tile browser. The name in the browser control panel is the value you should use for “layer_name” in the URL format above.

Whenever we can, we’ll make layers we create for our own maps available through that browser. We’re still working out exactly how to document things, so for now, comment on this post or hit us up on Twitter (@TribApps) if you need guidance.

The “wards” and “neighborhoods” layers are adapted from maps we’ve made for other stories and are pretty straightforwardly derived from shapefiles provided at http://data.cityofchicago.org. The “wardchange2000” and “wardchange2010” layers come from this map: Reshaping Chicago’s political map: race, ward-by-ward.

Other tools

Besides the above, we’ve open-sourced a number of tools on the team’s GitHub account, and Chris, in particular, has made other useful projects available on his own GitHub account.

Written by Joe Germuska

July 15, 2011 at 4:02 pm

Posted in Mapping, Open Source

Don’t break the homepage! Using unit tests to validate your WordPress + Varnish caching strategy

with 3 comments

For the last few months my coworkers Ryan Mark and Ryan Nagle have been engaged in rebuilding the Tribune’s blog-network, ChicagoNow.com (new version not yet launched). Much of the rebuild has revolved around migrating to a new architecture–Apache, WordPress, and Varnish. Being the “Varnish guy” on the team, I was asked to implement a similar caching strategy to what we had developed for TribLocal.com last summer. Perhaps the most important lesson I learned from TribLocal’s development was that our cache layer was extremely fragile–it was prone to break when a new version of a site was deployed. Worse yet, minor changes to the VCL configuration frequently had unforeseen consequences that cascaded across the site.

My intention with ChicagoNow was to do a much better job of compartmentalizing the various elements of the configuration to ensure changes were idempotent. However, Brian had an even better idea: unit tests. For those who want to cut to the chase, here is the Python unittest script for testing our Varnish cache layer. The script knits together three unrelated technologies in order to allow us to test how changes to the site affect the cache:

The unittest module provides a convenient framework for isolating tests from one another, and python-requests dramatically simplifies the process of making various types of requests and testing the response headers. The VCL for the site attaches “X-Cache” and “X-Cache-Hit” debug headers indicating whether the response was a cache hit or miss and, if it was a hit, how many hits there have been. The final component in the stack is the python-wordpress-xmlrpc library, which makes it nearly trivial to create posts and comments programmatically. This is crucial to testing whether the cache is being correctly invalidated. Let’s take a look at a couple of tests to see how these three components work together.

class TestLoggedOut(TestCachingBase):
    """
    Tests for logged-out users.
    """
    def test_homepage(self):
        url = build_url('/')

        response = self.get_twice(url)

        self.assertHit(response)
        self.assertMaxAge(response, 300)

    def test_homepage_login_logout(self):        
        url = build_url('/')

        cookies = self.login()
        cookies = self.logout(cookies)

        response = self.get_twice(url, cookies=cookies)

        self.assertHit(response)

Here we see two tests for logged-out users. The first test fetches the homepage of the site twice and ensures that the second request was a hit and that the max-age header is what we declared it to be in the VCL. The second test logs the user in and then out and then, while maintaining the same cookie jar, hits the homepage twice to ensure that the user gets cached results after logging out. Let’s look at one more test from TestLoggedOut to see how python-wordpress-xmlrpc comes into play:

    def test_edit_post(self):
        url = build_url('/%s/' % BLOG)

        post_id = self.new_post(BLOG)

        response = self.get_twice(url)
        self.assertHit(response)

        self.edit_post(BLOG, post_id)

        response = requests.get(url)
        self.assertMiss(response)

This test creates a new post on a particular blog, then hits the blog homepage twice to ensure it is cached. Then it edits the post that was created and hits the blog homepage once more to ensure that the cache was invalidated. Most of the actual functionality has been abstracted into the TestCachingBase base class so the individual tests are brief and readable. The full test suite includes several other complex scenarios.
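To give a feel for what such a base class might contain, here is a hypothetical sketch of the caching helpers used above. The real TestCachingBase lives in the linked script and may differ. In particular, the literal "HIT" and "MISS" header values and the use of Cache-Control for max-age are assumptions, and the login, logout, and post-editing helpers are left out.

```python
import re
import unittest


class TestCachingBase(unittest.TestCase):
    """Hypothetical helper methods; see the linked script for the real ones."""

    def fetch(self, url, **kwargs):
        # Imported here so the assertion helpers can be used without requests.
        import requests
        return requests.get(url, **kwargs)

    def get_twice(self, url, **kwargs):
        """Request a URL twice and return the second response."""
        self.fetch(url, **kwargs)  # First request warms the cache.
        return self.fetch(url, **kwargs)

    def assertHit(self, response):
        # Assumes the VCL sets X-Cache to the literal string "HIT".
        self.assertEqual(response.headers.get("X-Cache"), "HIT")

    def assertMiss(self, response):
        # Assumes the VCL sets X-Cache to the literal string "MISS".
        self.assertEqual(response.headers.get("X-Cache"), "MISS")

    def assertMaxAge(self, response, seconds):
        # Assumes max-age is delivered in the Cache-Control header.
        cache_control = response.headers.get("Cache-Control", "")
        match = re.search(r"max-age=(\d+)", cache_control)
        self.assertIsNotNone(match, "no max-age in %r" % cache_control)
        self.assertEqual(int(match.group(1)), seconds)
```

Routing every request through a single fetch method also makes it easy to substitute a fake response when checking the assertion helpers themselves.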

It’s unfortunate that we lack a good “code coverage” metric for this sort of testing and of course these aren’t true “unit tests,” but using unit testing tools in this way goes a long way toward organizing the logic. The best thing about this cache testing rig is knowing that if we change something we can easily check if we have broken anything else (i.e. regression testing). This was a frequent problem with TribLocal and I think this implementation is going to go a long way toward preventing that from happening again.

Get the code for the unittest script.
Get the code for the ChicagoNow Varnish Configuration.
Get the code for our simple Newsapps Varnish plugin for WordPress.

Written by Christopher Groskopf

June 28, 2011 at 4:01 pm

Recent open source work and other projects we’re excited about

with 3 comments

In case you missed ’em, here’s some fun stuff we’ve been up to lately. For updates on our work in-between blog posts, follow us at @tribapps and github.com/newsapps.

appengine-autotweeter

A simple application for auto-tweeting an RSS feed. Built because TwitterFeed and other services weren’t fast enough for breaking news. Powers @ChicagoBreaking.

census.ire.org

A simple browser for census data, built with journalists in mind. Made in conjunction with news hackers from CNN, NYT, Spokesman-Review, University of Nebraska and USA Today. Check out the source code and the other great census resources provided by IRE.

Cloud Commander

Scripts you can use to quickly build and launch servers in Amazon’s cloud — you won’t forget how you built the server if you wrote a script to do it.

csvkit

csvkit is a library of utilities for working with CSV, the king of tabular file formats. If you’re working with CSV files or other spreadsheets of data, try it out. We use it every day.

invar

Command line tools for generating map images from Mapnik configuration. Use it to render the custom map tile sets you’re working on in TileMill. It’s fast. (If you’re not making your own tile sets, check out the tutorial and get on it!)

The Newsapps Boundary Service, now with 100% more django

The Boundary Service is a ready-to-deploy system for aggregating regional boundary data. We’ve recently made it a pluggable django app. Check out Hack Tyler’s instance for the hot demo: boundaryservice.hacktyler.com/

PANDA

PANDA is your new data research toolkit (to help you search and compare data) & newsroom data management system (’cuz data stuck on your hard drive is sad data). Special thanks to the Knight Foundation for funding the development! For more details and updates on our progress, follow @pandaproject.

Written by Brian Boyer

June 24, 2011 at 8:39 am

Making maps, part 5: Overlay on a Google Map and deploy

with 9 comments

This is part five of a five-part series about our recent explorations making choropleth maps using PostGIS, TileMill, Mapnik and Google Maps. Read parts one, two, three and four for the full effect. Bonus post: part six! The impatient may grab the source code from GitHub, or try out the demo.

Running the render script should have created a mess of directories and images. Let’s finally put them on a Google map. The key bits of javascript are relatively simple. You create a callback function to hand off the URL of any given tile to Google, and configure a new overlay map, and push it onto the stack, like so:

snip from census-demo/map.js

function fetch_tile(coord, zoom) {
    return "http://localhost:8000/census-demo/.tiles/" + zoom + "/" + coord.x + "/" + coord.y + ".png";
    //return "http://media.apps.chicagotribune.com/election-maps/mayor-2011/.tiles/election2011/" + zoom + "/" + coord.x + "/" + coord.y + ".png";
}

…and…

$(document).ready(function() {

    census_demo_options = {
        getTileUrl: fetch_tile,
        tileSize: new google.maps.Size(256, 256),
        isPng: true
    }
    census_demo = new google.maps.ImageMapType(census_demo_options);

    map_options = {
        minZoom: 7,
        maxZoom: 10,
        zoom: 7,
        center: center,
        mapTypeControl: false,
        mapTypeId: "simple"
    };

    simple = new google.maps.StyledMapType(backdrop_styles, { name: "Illinois population 2010" });
    map = new google.maps.Map(document.getElementById("map_canvas"), map_options);
    map.mapTypes.set("simple", simple);
    map.overlayMapTypes.push(census_demo);
});

You may have noticed a mysterious variable, backdrop_styles, in the code above. We do a few tricky things to make the base map look nicer with an overlay — turning off some labels, etc.:

snip from census-demo/map.js

backdrop_styles = [
        {
            featureType: "administrative",
            elementType: "labels",
            stylers: [
                { lightness: 10 }
            ]
        },{
            featureType: "poi",
            elementType: "labels",
            stylers: [
                { visibility: "off" }
            ]
        },{
            featureType: "poi.park",
            elementType: "geometry",
            stylers: [
                { visibility: "off" }
            ]
        },{
            featureType: "road",
            elementType: "geometry",
            stylers: [
                { visibility: "simplified" },
                { saturation: -100 },
                { lightness: 0 }
            ]
        },{
            featureType: "road.arterial",
            elementType: "labels",
            stylers: [
                { gamma: 10 }
            ]
        }
    ];

The geo search, the legend and the tricky hash-URLs are left as an exercise to the reader. It’s all in the demo code.

(Not to miss: The legend background color is the same background color as the base map, and the legend opacity matches the overlay opacity. This ensures that the colors in the legend match the colors on the map.)
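The reasoning behind the legend trick is ordinary alpha compositing: the color a reader sees is a weighted mix of the overlay color and whatever sits beneath it. The short sketch below uses hypothetical colors and a hypothetical opacity to show why a legend swatch only matches the map when both the background and the opacity match.

```python
def blend(overlay_rgb, base_rgb, opacity):
    """Composite a semi-transparent overlay colour over an opaque base colour."""
    return tuple(
        int(round(opacity * o + (1.0 - opacity) * b))
        for o, b in zip(overlay_rgb, base_rgb)
    )


if __name__ == "__main__":
    # Hypothetical values: a red choropleth class at 70% opacity over a
    # light grey base map.
    overlay = (200, 0, 0)
    base_map = (230, 230, 230)
    print(blend(overlay, base_map, 0.7))
    # A legend swatch drawn over a different background (white) would not
    # match what the reader sees on the map:
    print(blend(overlay, (255, 255, 255), 0.7))
```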

To try it all out, fire up your handy one-line web server in the project directory…

Run in your project directory (on Python 3, the equivalent is python3 -m http.server)

python -m SimpleHTTPServer

…and point your browser at http://localhost:8000/census-demo/

Joy.

Deploying your map

Finally, something simple. Just copy the census-demo directory (containing your HTML, JavaScript, styles and tiles) out to Amazon’s cheap and easy S3 hosting service and if you’ve got all the paths right, it’ll work like a charm. No fancy servers necessary.

Beware: Deploying to S3 seems quick until you try to push out a ton of files. The number of tiles you cut roughly quadruples with each zoom level. Our census map of northeast Illinois, max zoom level 16, amounted to more than 130,000 tiles, and took several hours to deploy.
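To estimate how many tiles a map will need before committing to a deploy, you can count them with the standard Web Mercator tile arithmetic used by Google Maps and OpenStreetMap. The sketch below is a rough estimator, not part of our rendering code, and the bounding box in the example is illustrative rather than the exact extent of the census map.

```python
import math


def tile_index(lat, lon, zoom):
    """Return the (x, y) slippy-map tile containing a latitude/longitude."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the far edge still land on a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def count_tiles(north, west, south, east, zoom):
    """Count the tiles needed to cover a bounding box at one zoom level."""
    x_min, y_min = tile_index(north, west, zoom)
    x_max, y_max = tile_index(south, east, zoom)
    return (x_max - x_min + 1) * (y_max - y_min + 1)


if __name__ == "__main__":
    # Illustrative bounding box around the Chicago area, not the exact
    # extent of the census map described above.
    north, west, south, east = 42.5, -88.5, 41.2, -87.2
    total = 0
    for zoom in range(7, 17):
        tiles = count_tiles(north, west, south, east, zoom)
        total += tiles
        print(zoom, tiles, total)
```

Each zoom level splits every tile into four, so the highest zoom level dominates the total; dropping the maximum zoom by one is the quickest way to shrink a deploy.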

We’re working on strategies to speed up rendering and deployment. Stay tuned to the apps blog for further developments. We’ll post it when we’ve got it.


Written by Brian Boyer

March 8, 2011 at 2:10 pm

Making maps, part 4: Render with Mapnik

with 7 comments

This is part four of a five-part series about our recent explorations making choropleth maps using PostGIS, TileMill, Mapnik and Google Maps. Read parts one, two, three and five for the full effect. Bonus post: part six! The impatient may grab the source code from GitHub, or try out the demo.

Requirements

You’ll need pip, virtualenv and virtualenvwrapper to follow along. (You’ll also need Mapnik, which you should have installed during part three.) If you’re not familiar with virtualenv yet, this will be a happy moment in your life as a Python developer.

First, let’s fire up an env and install our requirements (I would normally use --no-site-packages here, but I want Mapnik).

Run this in your project directory…

mkvirtualenv making-maps-demo
pip install -r requirements.txt

Carto -> Mapnik

To render your tile layer, you first have to compile the Carto to Mapnik XML. We can use the copy of Carto that’s installed with TileMill, so there’s no need to compile your own. I’ve got a little script that copies the files to the project directory (good for version control!) and compiles things. You will need to adjust the path to TileMill for your installation location.

Run the style script…

./style

#! /bin/sh
cp /Applications/TileMill/files/project/census_demo/* tilemill
/Applications/TileMill/bin/carto tilemill/census_demo.mml > tilemill/census_demo.xml

Scripting Mapnik

Mapnik is not a tile-rendering system. It is an image generator. So, we must explicitly instruct it on how to cut our tiles. Luckily, the folks at OpenStreetMap already did the hard work. Our Mapnik script (source included in the demo code) is nearly the same, just generalized a bit.

To run the rendering script, you’ll need to know the extents of your state in latitude and longitude. There’s certainly a better way to do this, but my solution has been the LatLng plugins for Google Maps Labs. Grab the coordinates of the upper-left and lower-right corners of the area you’d like to render.

The script clears out the tiles directory (prefixed with a . so as not to upset TextMate, which is bad at handling lots of files) and tells our Mapnik script where to find your styles, where it should put the tiles, the extents of the area to be rendered, the minimum and maximum zoom levels, and the number of processor cores to use.

Finally, run the render script…

./render

#! /bin/sh
rm -rf census-demo/.tiles
mkdir census-demo/.tiles
python render_tiles.py tilemill/census_demo.xml census-demo/.tiles/ 42.57 -91.65 36.9 -87.23 7 10 2
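Since Mapnik only draws images, a tile renderer has to work out the geographic extent of each tile itself. The sketch below shows that calculation for the standard Web Mercator tile scheme, along with the zoom/x/y.png path layout the map's tile callback expects. It is an illustration, not an excerpt from render_tiles.py; the function names and the example tile are hypothetical, and no Mapnik calls are made.

```python
import math
import os


def tile_bounds(x, y, zoom):
    """Return (west, south, east, north) in degrees for one slippy-map tile."""
    n = 2.0 ** zoom

    def lon(tx):
        return tx / n * 360.0 - 180.0

    def lat(ty):
        return math.degrees(math.atan(math.sinh(math.pi * (1.0 - 2.0 * ty / n))))

    return lon(x), lat(y + 1), lon(x + 1), lat(y)


def tile_path(tiles_dir, x, y, zoom):
    """Build the zoom/x/y.png path that the map's tile callback expects."""
    return os.path.join(tiles_dir, str(zoom), str(x), "%d.png" % y)


if __name__ == "__main__":
    # A hypothetical tile at zoom 7; a renderer would hand these bounds to
    # Mapnik and save the resulting 256x256 image at the printed path.
    print(tile_bounds(32, 47, 7))
    print(tile_path("census-demo/.tiles", 32, 47, 7))
```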

The full cycle

If you’ve got everything right, then rendering your tiles should be as easy as pie. (And by that, I mean as difficult as pie. Pie is a damned hard thing to get right. But I digress…) When I’m fine-tuning my maps, I’ll run through this cycle many times:

./setup
./style
./render

Looking in your .tiles dir, you will hopefully find a pile of pretty tiles, ready to be overlaid on a map.

Written by Brian Boyer

March 8, 2011 at 2:10 pm