Tribune DataViz

Matters of interest, from the data reporters and developers across Tribune Publishing


Flat Files And Server Denials: Covering Elections At Three News Orgs



Covering elections is a staple in American journalism. I’ve covered elections as a reporter and I’ve helped display election data in drastically different ways at three news organizations.

So first, a little primer on elections data. Generally speaking, on election night, the data for vote totals is tabulated by county boards of election and then sent to a state-level board. Next, the data is harvested by vendors such as Ipsos and the Associated Press. Until recently, the only nationwide election data vendor for news organizations was the AP. While other data vendors exist, they usually focus on more niche markets, such as campaigns and political parties.

The AP has a person physically present in every U.S. county to report back the current vote totals for different races. It’s incredibly costly, but it means you can dive deep into trends in the data. The AP has a system that lets you FTP in and download the data in XML or CSV format, which your publication can then display.
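To give a flavor of what consuming a feed like that involves, here’s a rough sketch in Python. The tag and attribute names are invented for illustration; the AP’s actual schema is more involved.

```python
import xml.etree.ElementTree as ET

# A toy payload in the spirit of the AP's feeds. The tag and attribute
# names here are made up; the real schema differs.
SAMPLE = """
<race id="12345" office="Governor">
  <candidate name="Jane Smith" party="Dem" votes="10482"/>
  <candidate name="John Doe" party="GOP" votes="9810"/>
</race>
"""

def parse_race(xml_text):
    """Turn one race's XML into a plain dict, ready for a database insert."""
    race = ET.fromstring(xml_text)
    return {
        "race_id": race.get("id"),
        "office": race.get("office"),
        "candidates": [
            {"name": c.get("name"), "party": c.get("party"),
             "votes": int(c.get("votes"))}
            for c in race.findall("candidate")
        ],
    }

race = parse_race(SAMPLE)
```

From there it’s a loop over races and an insert per candidate, rerun every time a fresh file comes down.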

The AP doesn’t always get state, county or local-level election data in this same manner. Thankfully, most states (and some counties) have online data portals, RSS feeds or APIs that can be downloaded, scraped or accessed to get the data you’re looking for. In some places, though, a real person has to sit in an election board’s offices and get the election data back to the news organization somehow, typically by calling or emailing.

While displaying data online may get a lot of attention these days, remember that many news organizations still print something every day. So news organizations have also needed to solve the problem of importing AP election data into their print editions, too — generally through decades-old pagination systems.

Now let’s talk about the differences between the three places I’ve wrangled election data for.

The New York Times Regional Media Group’s Election Data

In 2010, I was a newbie developer at the now-renamed The New York Times Regional Media Group. I started a few weeks before the 2010 midterm elections. My new coworkers had already built a system to FTP into the AP, import the data into a MySQL database and then display it on our 14 news websites using iframes that pointed at tables built in PHP.

I helped by load-testing, or seeing how much traffic the project could take, while we ran imports of the AP’s test data. By my estimates using Siege, I thought we were in the clear, with 2,500 hits a minute not crippling anything. If election night traffic had indeed been 2,500 hits a minute, we might have been in the clear. We were not.

If memory serves, we had one EC2 medium instance running to import and display the data and a medium MySQL instance running for the database. I didn’t know about caching and thought it was just something that was turned on automatically. It wasn’t.

On election night, we had two newspapers that received election data first, and things ran smoothly with them, as they were in smaller markets. Then the Florida papers started getting heavy traffic. Our EC2 instances became bottlenecked, stuck at 99 percent CPU usage, unable to read the AP data, let alone write updates to the database.


This brought all 14 of the newspaper websites to a crawl because these iframes were getting loaded before almost anything else on the page. In the end, homepage editors took the iframes off the pages, a coworker wrote some SQL to hand-optimize the election tables and, by then, traffic to the sites had subsided to reasonable levels.

It was the scariest night of my professional life. Thankfully, most of the newspapers were happy, as they hadn’t ever even attempted to display live election data on their websites, so this was still an improvement for them. And I learned to set up caching — in later cases, Varnish — when attempting to hit a live database in any way.
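For a sense of what caching buys you, here’s a toy sketch in Python. In production this role was played by Varnish in front of the app; this little decorator just illustrates the principle of serving a saved copy of an expensive query instead of hitting the database on every request.

```python
import time

# A simple time-based memoizer. Everything here is illustrative, not our
# production code; Varnish did this job for real.
_cache = {}

def cached(ttl_seconds):
    """Reuse a function's result for ttl_seconds before recomputing it."""
    def decorator(fn):
        def wrapper(*args):
            key = (fn.__name__, args)
            hit = _cache.get(key)
            if hit is not None and time.time() - hit[0] < ttl_seconds:
                return hit[1]
            value = fn(*args)
            _cache[key] = (time.time(), value)
            return value
        return wrapper
    return decorator

db_reads = {"count": 0}

@cached(ttl_seconds=15)
def results_for_race(race_id):
    db_reads["count"] += 1   # stands in for an expensive database query
    return {"race_id": race_id, "leader": "Smith"}

results_for_race(1)
results_for_race(1)          # second call is served from the cache
```

On election night, a 15-second-old result is plenty fresh, and the database only sees a handful of queries a minute no matter how many readers show up.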

The Boston Globe’s Election Data

Next, I was at the Boston Globe during the 2012 general primaries. As then-hopeful Mitt Romney was the former governor of Massachusetts, the Boston Globe was a major source for news and coverage of the GOP primary battle. And the New Hampshire primaries were that paper’s bread and butter.

But the team I worked on had a fun logistical problem: We needed to display the data on two websites, and the newly launched Each ran in a different content management system, each had different styles and each wanted the data displayed a little differently.

The first problem we had to solve was how to pull in the data. The Boston Globe’s CMS was Methode, which stored everything — stories, photos, etc. — as pieces of content in XML. As the AP already provided data in an XML format, we would just need to import it, change some of the tags to better suit the Methode ingestion system and then I would write the code necessary to display the data.

Thankfully, the Boston Globe’s systems staff figured out quickly how to go in and download the XML data and put it into a spot in the CMS that I could access. We had created mockups and styles for displaying the data responsively — still a new concept at the time — and now had to pull in the data, via some incredibly ugly Java I wrote.

We didn’t have time to do something similar with the CMS, which was at the time, I believe, going on 12 years old, and was somewhat fragile. So we decided to build separate styles and templates in that we could iframe into Not the best way to do things, but it’s how we did it.

And then, as the primaries came more and more frequently, I had to make each primary its own chunk of code, violating the DRY principle repeatedly, while scrambling to get everything deployed to production in time for the producers to slot the items on the various homepages.

Another coworker had an old Python script that created basic HTML tables for county and town election totals and pushed them into for a more in-depth look. Lots of moving parts, different content management systems, different styles: a lot of work for the small number of people working on it.

The Chicago Tribune Way(s)

Now I’m at the Chicago Tribune. In 2012, my coworkers built a system that pulled AP election data into a Django site, with Varnish in front for caching. For local races, Chicago Tribune staffers entered data into Google spreadsheets, based on information gleaned from various county board of elections sites, and that data was turned into flat files as well. The AP data was also pulled into our pagination system for the print product through tables the AP sent, just as it had been done in previous elections.

Fast forward to a month ago. The Chicago Tribune no longer subscribes to the Associated Press, but Reuters has entered the election data game. Instead of having to FTP and download XML files, we hit an API and receive JSON. It’s pretty nifty and much more conducive to building web-facing applications.
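To give a flavor of the reformatting step, here’s a sketch in Python. The payload shape and field names are invented for illustration; Reuters’ actual API looks different.

```python
import json

# A made-up payload in the spirit of a JSON election feed. The field
# names here are illustrative, not Reuters' real schema.
sample_payload = json.loads("""
{
  "races": [
    {"raceId": "IL-GOV", "candidates": [
      {"fullName": "Jane Smith", "voteCount": 512044, "winner": true},
      {"fullName": "John Doe", "voteCount": 498120, "winner": false}
    ]}
  ]
}
""")

def reformat(payload):
    """Flatten the vendor's nested JSON into the rows our templates expect."""
    rows = []
    for race in payload["races"]:
        total = sum(c["voteCount"] for c in race["candidates"])
        for c in race["candidates"]:
            rows.append({
                "race": race["raceId"],
                "name": c["fullName"],
                "votes": c["voteCount"],
                "pct": round(100.0 * c["voteCount"] / total, 1),
                "winner": c["winner"],
            })
    return rows

rows = reformat(sample_payload)
```

The flattened rows are what the flat-page templates actually consume; computing percentages once here means the templates stay dumb.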

We wrote a Python wrapper to hit the Reuters API and reformat the data for our purposes, and then we again built flat pages based on that data, using Django Medusa. And for local elections and referenda that Reuters wasn’t covering, we again had Tribune staffers entering data into Google spreadsheets.

We still had to write a custom system that takes the Reuters and Google spreadsheet data and sends it to our pagination system. This required figuring out how the data needed to look — basically a mix of XML-ish template tags and tables — and then FTPing the files to an area where our pagination system could ingest them, give them proper templating and allow page designers to put them on pages.
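The conversion itself is mostly string-building. Here’s a toy sketch; the tag names are made up, since the markup our pagination system actually expects is proprietary.

```python
# Build XML-ish template tags plus tab-separated table rows for print.
# The <race:> and <cand> tags here are invented for illustration.

def to_pagination_markup(race, rows):
    lines = ["<race:%s>" % race]
    for row in rows:
        lines.append("<cand>%s\t%s\t%.1f%%" % (row["name"], row["votes"], row["pct"]))
    lines.append("</race>")
    return "\n".join(lines)

markup = to_pagination_markup("Governor", [
    {"name": "Smith", "votes": 512044, "pct": 50.7},
    {"name": "Doe", "votes": 498120, "pct": 49.3},
])
```

The resulting text file is what gets FTPed to the drop folder for the pagination system to ingest.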

So what have I learned?

Elections are big events traffic-wise, and static sites take large traffic pretty well. With the Boston Globe and Chicago Tribune solutions of using basically static sites (XML and sites baked to S3), it meant little freaking out at 9 p.m. when you’re getting thousands of pageviews a second. If you’re having to deal with lots of calls to your database while it’s also reading and writing, you’re going to have a bad time. Static sites are wicked great.

Testing is important, but knowing what to test is more important. At The New York Times Regional Media Group, I thought I knew what I was doing, but I was testing for unrealistically low traffic and didn’t think about what would happen while it was trying to write election data to the database, too. I now know I could have asked folks on the NICAR listserv for help, or tweeted questions or really just asked anyone with a few years of experience, “Hey, will this work?”

Election nights are stressful, so be cheerful and smiley. We at team Trib Apps try to be cheerful and kind whenever working with anyone, but with this many moving parts, it never hurts to just think “smile while saying words” when conversing with other folks. We’re all working hard on these nights, and I’m a big fan of not adding any extra stress on people’s lives. That’s also part of what our technology is supposed to do — make things easier for folks in the newsroom.

Have a point person from the tech side to coordinate with the newsroom. When local election data started coming in, I stood in the area where folks were entering it into Google spreadsheets, just so someone was around to help answer any questions on the spot, while David Eads, who was the lead developer on the elections project, made sure the technical side was running smoothly. We had only one minor hiccup that was quickly fixed and we were able to identify it because we were all near one another, able to communicate more effectively. Even though we work with machines, this job is mostly about communication between humans.

Know that you’re going to be covering an election again and make your code reusable. When we were writing our code for the primary, we knew a general was coming up in November. We also knew that other Tribune newspapers would be itching to show election results so we needed to get the fundamentals right the first time.

We would love to hear about your experiences with election data. Please feel free to add a comment and tell us your story.

Written by Andy Boyle

April 25, 2014 at 12:32 pm

Responsive Charts with D3 and Backbone


So, you started building charts with D3 and quickly realized there are certain behaviors you want all of your charts to have. Being the excellent developer you are, you decided to wrap the basics up in a nicely packaged, reusable bit of code that will help you build charts faster in the future.

For our team, this meant creating a simple Backbone view to encapsulate all of our charts’ must-haves. Why Backbone? Because we like it, though you can certainly accomplish all of what you’ll see here with a jQuery plugin or your own homebrew JavaScript lib.

Here’s how you can use it.

Simple bar chart

Just an example of a simple bar chart. If you’re not familiar with the code below, you’ll want to check out Michael Bostock’s Let’s make a bar chart.

Making your chart code reusable

At this point you’ve created a single, crappy little bar chart and you can hardly contain your joy. You can’t deny your desire to plaster these things all over your site. You WILL make this code reusable. Soon you will have loads of these things EVERYWHERE.

Before you do, however, consider a few things that might make your crappy little bar chart a little less crappy. You need:

  • Responsiviosity, responsiveness, responsivity — whatever you call it, it’s a way to redraw the chart when the window resizes
  • A simple way to extend and modify your chart to create different versions
  • A way to vary the appearance of the chart at certain viewport breakpoints
  • A fallback mechanism for browsers that don’t support D3

Enter ChartView.js — a simple Backbone view put together by my teammate David Eads to address these very needs.

So let’s wrap our bar chart up in a `ChartView`-based view.

First thing we do is define `BarChartView` by extending `ChartView`. The only function we must override is the `draw` member function of `ChartView`. The code within our `draw` function looks a lot like the code we wrote to render our simple chart, but it takes advantage of some of the values that `ChartView` calculates and tracks for us. For example, instead of defining a `width` variable, we use `this.dimensions.width`, which is calculated based on the chart’s parent element.

var BarChartView = ChartView.extend({
  draw: function() {
    var scale = d3.scale.linear()
      .domain([0, d3.max(])
      .range([0, this.dimensions.width]);

    // Bind the data and append one div per value
      .enter().append('div')
        .attr('class', 'bar-chart')
        .style('width', function(d) { return scale(d) + 'px'; })
        .style('height', (this.dimensions.wrapperHeight / 5) + 'px')
        .html(function(d) { return '' + d + ''; });

    return this;
  }
});

The next thing we do is create a new instance (or two, or three, etc — remember charts EVERYWHERE) of our `BarChartView`. The minimum you need to get started is an options object with `el` and `data` or `collection` defined (note: you can only use one of `data` or `collection` with views that extend `ChartView` — not both).

var chart_one_data = [3, 8, 12, 7, 17];
var chart_two_data = [4, 10, 13, 14, 7];

var chart_one = new BarChartView({
  el: '#one',
  data: chart_one_data,
  base_height: 220

var chart_two = new BarChartView({
  el: '#two',
  data: chart_two_data,
  base_height: 220

Remember to call .render() or you won’t see your chart!

If you resize your browser, you’ll see these charts are now responsive. At viewport breakpoints of 420 and 728 pixels wide, the height of the charts’ containers will be adjusted to 0.7 and 0.9 of the base_height option we passed when creating them. These breakpoints are, of course, customizable:

var chart_one = new BarChartView({
  el: '#responsive-bar-chart',
  data: chart_data,
  base_height: 220,
  breakpoints: {
    728: 0.9,
    420: 0.7,
    380: 0.65

An advanced example

I know what you’re thinking. The examples are awesome, but yo, you don’t even use SVG for these charts. How about showing some real code?

You’re right. To show how to use this in a real life scenario, I thought I’d refactor some of the code we wrote for our Broken Bonds series. Trust me when I tell you you don’t want to see the original code. You can, however, see the refactored code by clicking here. Look at this code in action below.

Note that this chart has a few more options specified:

var obli_chart = new ObligationDebtChartView({
  el: '#ob_chart_container',
  collection: new Backbone.Collection(bonds),
  y_key: 'debt_per_capita',
  y_scale_max: '7e3',
  base_height: 425,
  breakpoints: {
    600: 0.75,
    380: 0.5

The `y_key` is used to determine which key to pluck from each item in our dataset to draw the chart’s bars. The `y_scale_max` option is used to adjust the maximum value that can be plotted on the chart — in this case, $7,000.

Also, notice we’re specifying a `collection`. When this option is present, `ChartView` will bind to the collection’s “sync” event, triggering a re-rendering of the chart any time the collection data changes. A big yay for events!

I won’t go through ObligationDebtChartView.js line-by-line. What’s important here is that the process is the same regardless of how complex your chart’s D3 render code is.

  draw: function() {
    // ... calls out to helper functions that do the heavy D3 lifting ...
    return this;
  }

We have our draw function, which calls a handful of other functions that do the heavy D3 lifting. Our chart is responsive — the bars squish as the viewport is constrained and axis labels change to be legible on smaller screens.

Fallback for older browsers

One last thing to cover — `ChartView` checks whether the browser supports D3 and will display a message if it does not.

In this example, I call .fallback_draw() directly to illustrate the point, but you should never have to do this. `ChartView` will replace the default .draw() method with .fallback_draw() when needed:

    // Fallback if d3 is unavailable, add some formatters otherwise.
    if (!this.d3) {
      this.draw = this.fallback_draw;
    }

Again, you can customize the fallback behavior by overriding this method. For example, you might want to show an image of your chart:

  fallback_draw: function() {
    this.$el.append('<img src="" />');
  }


Written by Ryan Nagle

March 7, 2014 at 4:12 pm

Lots of data on a little screen


Ever since David Eads, Joe Germuska and I launched the Chicago Tribune crime site more than a year ago, we’ve wanted to revisit it. Things that we thought we’d be able to return to shortly after launch ended up sitting unaddressed for months. New pages got shoehorned into a design that was never meant to accommodate them. Finally, at the end of last year, my colleagues (notably Andy Boyle and Mr. Eads) and I got a chance to take a crack at revising the massive app. I’m pretty happy with how it ended up, and I thought it would be helpful to share some of the things I learned along the way.

Responsive design

I designed the site originally before I’d heard of this concept of “mobile-first design,” and, boy, did it show. On community area pages, the order of information was weird, the map just disappeared, and the performance was terrible.

Click to see the original responsive site in its full “glory.”

Mobile-first design has been derided as a method that over-simplifies for desktop, forcing one small bit of information at a time upon a user who has screen real estate to spare. It’s true that a columnar, white-space-driven layout adapts easily to mobile. Sometimes, that’s just the ticket. But for a site like this, dense with numbers, graphs and data, that is not a viable layout. So we dropped the multi-column layout from the community area page in favor of a more focused experience.

The prior site’s multi-column layout resulted in an odd informational hierarchy, with demographic information and community area stories superseding nearly everything else.

A community area page on the original site. Click for the full page. (Pardon the repeating headers — sticky navbar doesn’t play nice with AwesomeScreenshot.)

For the new site, there was a lot of careful planning about what needed to be placed high on the page at every breakpoint. I laid out columns-within-columns to allow those sections to break cleanly. For instance, each crime trends section is a column that contains three columns: a time-of-day-graph, a type table and a location table. On phones, these collapse nicely, one on top of another, in a logical order.

The new, more streamlined version drops the sidebar. (Click to see the full page.)


Responsive graphs are hard. These still need a bit of work (the interaction is a bit tough on phones and tablets), but moving to D3 and Backbone.js from jqplot helped immensely. David came up with a brilliant solution for the historical chart: display only as many prior years of data as there is screen real estate for. This ensures that the chart is legible even on small screens.
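David’s trick boils down to a little arithmetic: divide the width you have by the width a year’s worth of bars needs. A sketch in Python, with made-up numbers:

```python
# Illustrative only: the pixels-per-year and cap are invented, and the
# real logic lives in the chart's JavaScript.

def years_that_fit(container_width, px_per_year=60, max_years=12):
    """Show as many prior years of data as the container has room for."""
    return max(1, min(max_years, container_width // px_per_year))

years_that_fit(320)  # 5
years_that_fit(960)  # 12
```

A phone gets a handful of recent years; a desktop gets the full history, and the chart stays legible either way.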


The crime site expanded after launch to include pages about shootings and homicides in the city as a whole. We hadn’t designed the site to be very flexible, anticipating that we would only have community area-level pages. As a result, the shootings and homicides pages ended up looking a bit disjointed. Accessing them from the rest of the site was difficult. They felt orphaned.

The new site was designed to be flexible. Indeed, part of the redesign’s goal was to incorporate a large amount of data about crime in Chicagoland suburbs. It was clear that good navigation could solve the orphaned-page issue and allow for future expansion of the site.

So all I have to do is find a way to logically display more than 350 links? I thought to myself. No problem! Oh, and a search bar? And branding?

Two days later, I’d found a style that allowed me to display all 77 community areas easily. The city-wide pages are nestled in the same dropdown. The suburban pages required a little more work. I did misguidedly try to figure out how to display all 250+ cities in the dropdown (try making THAT mobile-friendly), but ended up displaying just the top 20% most populous towns with a link to the full list. The search box got its own dropdown — not ideal, but when faced with needing to make space for the Trib’s logo, it seemed like the best way to save space.


Our project manager, Kaitlen, originated the term “crime confetti,” because the community area map sported so many colorful dots. The colors on the site have always been a bit disproportionately bright and cheery for their subject matter, so Alex Bordens and I sat down to try to come up with better colors. After about a day of experimentation, we realized that the original site’s colors, funky though they are, solved a hard problem: They worked for colorblind users, didn’t imply a hierarchy and didn’t conflict when in proximity to each other. Coming up with three other colors that worked just as well in each of these situations proved a Really Difficult Problem, so we eventually tweaked them a bit and called it good.

All in all, the site functions so much better on devices now than it did, and it’s cleaner and more user-friendly across the board. Stay tuned as we use our new-found flexible design to add more analysis to the site!

Written by Heather Billings

February 11, 2014 at 12:44 pm

Creating compelling visualizations


At ONA13 in Atlanta, Georgia, Emily Chow of the Washington Post and I spoke about visual literacy and picked apart a few projects we’ve worked on over time. These are some rough notes from that talk.

Got it?

This is the type of thing that jumps to mind when I hear the term “infographic.” Please don’t make anything like it.

Infographics, especially done like the above, usually suck at communicating information. This wild rainbow of colors, random decorative icons, and confusing layout could probably be distilled to a sentence or two of text. It’s important to think about function. At the end of the day, graphics is a storytelling medium. If you can’t get the message across, you’re doing it wrong.

While this is a fun art piece, I wouldn’t train a barista with it.

Comprehensive vs. complementary graphics

The above is an example of a comprehensive infographic, one that tries to cover an entire concept in a single image. Here’s another fun example:

Sometimes a comprehensive graphic is the answer, but it’s rare. Issues that are difficult to visualize (national debt, global warming) can be the best choices for comprehensive infographics. One of the biggest pros for a graphic like this: it’s incredibly easy to share, because it encapsulates a topic, and may have a longer shelf life.

That said, graphic elements can be helpful. News apps like KPCC’s Firetracker can benefit greatly from smart iconography. In this particular case, the colors used cause the icons to recede into the background, and the icons are, for the most part, also labelled. (Relying on someone’s memory to recall what an icon means is a recipe for confusion.) Apps like this provide much more information in a more structured way than infographics ever can.

KPCC’s Firetracker app attempts to follow California wildfire response.

Complementary graphics are those which can’t stand alone, but need to be placed inline in the context of a story. These can be used to illustrate small bits of the story that cannot be told as well in text as they can be visualized. [examples?]

If you have a graphics desk, you can often explode those graphics into web-friendly inline graphics. At the Trib, we use our Tarbell platform to create these as quickly as possible.

Here’s another great example, from KQED, of the possibilities for using your graphics desk well. (Since KQED is a broadcast outlet, I’m pretty sure this was created for the web, but these sorts of maps are fairly common among graphics teams.)

Explanatory vs. exploratory graphics

Some graphics lead people through a concept. Some let people play with a concept. Depending on your story and data, you may want to create a visual that displays an overall trend, or one that lets people see very granular information.

Often, for exploratory graphics, a simple table or searchable database is the best tool for the job.

Maps can often be both. A zoomed-out view can show the overall trend, while a zip code search allows folks to drill down to their own location.

Another way to look at this is to balance the two. Annotate your graphic. Make it clear what the trends are, where the story is. Then allow for exploration as an additional layer.

Speaking of maps…

Maps are tricky things. Having geodata does not necessitate the creation of a map. For instance, this powerful package about the houses of fallen soldiers uses a map as its main navigation tool, interrupting the experience. While the information isn’t irrelevant, it’s also not the point of the package. It would be better shown as a small locator on the story itself, instead of playing such a starring role.

What do you call a traffic jam caused by drivers slowing down to look at an accident or other diversion on the side of the road?

Maps can also pose color difficulties, even more so than other graphics. A heatmap like this dialect survey map can be indecipherable if the wrong colors are used. (Red-green colorblindness is especially prevalent.)

A few tools to help you choose “safe” colors:

Your colors should also be meaningful. One of the biggest weaknesses of the Chicago Crime home page is that the colors on the map don’t actually mean anything. We made that map trying to make neighborhood boundaries clear, but the end result looks like we’re trying to tell you something about crime in those areas.

Deciding what viz to do

Sometimes, all you really need is a bar chart. For instance, this small part of the Seattle Times’ report on methadone abuse:


Though the chart depicts usage and death levels of several drugs over time, it makes a single point that is immediately clear. (Usually, graphics that make a single point are the most effective.)

However, sometimes a bar chart is exactly the wrong answer. These shooting average graphics from Grantland would be terribly difficult to understand as bar charts:


LeBron James, what you doin’?

The blog post that accompanies those graphs gives you a peek into the process behind creating such a graphic. There’s often quite a bit of ideological ping-pong that occurs. Emily has a fantastic post on how the Post’s team created a map-less geographic breakdown of gun homicides and suicides by race. (Again, maps are tricky!) Her breakdown of the struggle of creating the graphic illustrates the importance of bouncing your ideas off of others when possible, and also that you have to make a lot of things before you make the *right* thing. It’s always a process.

Written by Heather Billings

October 19, 2013 at 11:49 pm

Mapping Chicago’s shooting victims



This week we launched Chicago shooting victims, which tracks where and when people are shot in Chicago, a city that’s seen more than 1,000 people shot in the first six months of 2013. This project coincided with a detailed story that ran in the Chicago Tribune written by the very reporters gathering the data.

First, let me explain what the data represent and where these numbers come from. In Chicago, if six people are shot in a single incident, such as a drive-by, the police record it as one shooting, even though it has six victims. Our Chicago Breaking News desk started keeping track of every victim shot in late 2011 because they wanted a complete record of how many people were struck by gunfire, data the city doesn’t provide.

When a shooting occurs, Tribune reporters track down as much information as possible about the victim — name, age, gender, where the shooting happened, which hospital they went to, etc. — and they enter it into a Google spreadsheet. But in many cases, all reporters can discover is that someone, of some age and some gender was shot at a specific location.

With about a week to go before heading to print, Alex Bordens, Ryan Mark and I set to work turning the spreadsheet into usable data we could visualize. First we geocoded and cleaned the data, then loaded it into a database. I’ll be covering that in this blog post. The next one will focus on displaying the data in a map, charting the data and displaying a recent list of shootings.

The project runs on Django, PostgreSQL with PostGIS and gunicorn on an already-existing Amazon Web Services rig we’ve set up for Django projects, with Fabric for deployment.

Geocoding the data

We have a Bing Maps API Key, and I have a basic Django management command to pull data from a Google doc, attempt to geocode an address and then save it into a PostgreSQL database. (Oh, and a quick FYI, my colleague Ryan Nagle wrote a geopy reverse geocoder for Bing. Here’s the pull request. And here’s an example of a similar Django management command.)

One of the first problems we encountered – which will happen when you’re trying to geocode 4,000+ location points – was data duplication. The data points are sometimes so similar that you can’t rely on unique attributes to detect duplicates. Sometimes multiple victims were shot at the same location, on the same day, and the only recorded information in the spreadsheet could be just their gender, typically male.

So we created a new column of unique IDs, with the year first and a number after, such as 2013-789. We then communicated with the Breaking News desk staff so they could include the unique ID in their workflow. The unique ID also allows us to check for any updates to existing entries in the database, which I will discuss later.
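The scheme itself is simple enough to sketch. This is an illustration of the idea, not our actual command; it assumes you can see the IDs already assigned so new rows continue the sequence.

```python
# Generate IDs like "2013-789": the year, a dash and a running counter.

def next_victim_id(existing_ids, year):
    """Return the next unused ID for the given year."""
    counters = [
        int(uid.split("-")[1])
        for uid in existing_ids
        if uid.startswith("%d-" % year)
    ]
    return "%d-%d" % (year, max(counters, default=0) + 1)

next_victim_id(["2013-787", "2013-788", "2012-4"], 2013)  # "2013-789"
```

Because the counter is scoped to the year, the sequence resets naturally every January without colliding with old entries.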

Next, we discovered some locations just don’t geocode well, sometimes because addresses weren’t specific enough and sometimes because Bing was having trouble. If Bing can’t find a specific location, which happened a lot on any street with “Martin Luther King” in the name, it just returns a latitude and longitude in the northwest part of the Loop. And some locations only described where a highway and a street intersect. So I added a boolean field to the database that defaults to “True” but is set to “False” if either of two things happens: Bing geocodes the address to that spot in the northwest part of the Loop (proving Bing couldn’t find it), or Bing returns nothing at all. Otherwise, it stays “True.”
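The check itself is just a comparison. A sketch, with made-up coordinates standing in for the fallback point we actually observed in the data:

```python
# Hypothetical coordinates for Bing's "couldn't find it" point in the
# northwest Loop; in practice we compared against the point we saw
# repeated in our own results.
BING_FALLBACK = (41.8850, -87.6400)

def geocode_ok(result, tolerance=1e-4):
    """False if geocoding failed outright or snapped to the fallback point."""
    if result is None:
        return False
    lat, lng = result
    if abs(lat - BING_FALLBACK[0]) < tolerance and abs(lng - BING_FALLBACK[1]) < tolerance:
        return False
    return True

geocode_ok((41.7508, -87.6214))  # a plausible hit elsewhere in the city
geocode_ok(None)                  # no result at all
```

That boolean is what populates the field described above, so everything downstream can filter on it.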

We were also able to use the boolean field to keep the incorrectly geocoded locations off our map and out of the neighborhood level data. Any locations we couldn’t geocode were left off the map, counted and included in a disclaimer at the bottom of the map. More transparency about the data is always better for us, so it was important to include that.

After geocoding all of the addresses, we used the boolean field to identify which ones weren’t found. We could then find the latitude and longitude by hand. We did this because we wanted to be able to compare neighborhoods, and being off by a few yards was close enough for that.

So now we had latitude and longitude for each record. At first I thought about entering them into the database by hand, but that seemed silly: I knew we might need to reimport the data, and these same locations would get mangled over and over. Instead, we added latitude and longitude fields at the end of the spreadsheet and entered the coordinates for the 70+ addresses that failed to geocode correctly. We can continue to use this technique for future bad addresses, too.
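The override step during import might look like this sketch (column names are assumptions): prefer the hand-entered spreadsheet coordinates, and fall back to the geocoder’s answer only when they’re absent.

```python
def final_coordinates(row, geocoded):
    """Prefer hand-entered spreadsheet coordinates; fall back to the geocoder's result."""
    lat, lng = row.get("latitude"), row.get("longitude")
    if lat and lng:
        return float(lat), float(lng)
    return geocoded
```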

Cleaning and importing the data

The data had more than 4,000 rows, entered by humans, who, despite doing their best, occasionally entered typos or wrote ages as “40ish” instead of an integer. Anyone who’s dealt with any large dataset made by humans can attest this is pretty normal. So I had to write many functions to clean every field of data we were importing. We also wrote a function that checks if any data has changed for each record and updates the database accordingly.

We build our projects using virtual environments locally, and then test them on staging and production servers with the same virtual environments. This way we know what software is installed on each server and how everything should interact, which leads to (hopefully) fewer errors in our code.

For this project, I geocoded everything using the Django management command on my local machine. Once the geocoder ran perfectly (sometimes it would hiccup more than our tests allowed and break partway through), I made a fixture, exported the data into json and committed it to our git repository, so we could easily load the data.

We were also able to turn this json file into a csv using csvkit with this simple command:

in2csv -f json path/to/filename.json > path/to/filename.csv

We used the csv to create the static maps that appeared in the newspaper.

Making the map and chart

The map is powered by Leaflet.js, Stamen Design map tiles and OpenStreetMap data. We pull community area shapes as json from our Crime app. On the backend we calculate how many shooting victims each community area has and send that to the page as json, where it determines the shading of each community area. We also pass the data for the current year and previous year as json, which feeds the pop-up generated from a template. The individual shootings are passed as an array of latitudes and longitudes, which Leaflet uses to draw a circle for each one.
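Stripped of the framework specifics, the backend aggregation that drives the shading is a simple tally serialized to json (the field name is an assumption):

```python
import json
from collections import Counter

def area_counts_json(shootings):
    """Tally shooting victims per community area and serialize for the map's shading."""
    counts = Counter(s["community_area"] for s in shootings)
    return json.dumps(counts)
```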

The chart is built with Rickshaw, using more json we pass to the page from the backend. We initially tried using D3.js, but it was kind of complicated. Rickshaw was used for the graphs on our Homicides page, so we already knew what it could do.

We just used basic Django templating to send the last 30 days’ worth of shootings to the page. We originally listed homicide information, showing which shootings were fatal, but our numbers don’t always jibe with our Homicides page: people sometimes die days or weeks after being shot, and our spreadsheet isn’t always updated to reflect that.

Lastly, we styled it. Thankfully, we’ve rolled our own version of Twitter Bootstrap and created our own style guide, which makes it much easier to start projects. Having base templates like this ready to go makes it easy to focus on the engineering and mechanics of any new project, and I would strongly recommend investing time to having base styles that you can use on projects that match the look and feel of your website.

Written by Andy Boyle

July 15, 2013 at 11:40 am

Style and Substance: Analyzing a Beach Ball Chart

with 5 comments

This morning our friend Scott Klein tweeted about a chart published in the Susan G. Komen Foundation 2009-10 annual report:

The original "beach ball" chart.

For a variety of reasons, pie charts can be a misleading illustration of numbers, and this one did seem like it might suffer from similar problems. After a light prod from my PANDA colleague Chris Groskopf, I set out to see how well the visual representation of the numbers matched the reality. I wanted to determine if each segment of the chart actually represented the proportion of the whole that was promised by the labels.

There are probably more formally correct ways to do this, but Chris suggested using the Python Imaging Library (PIL) to count the colors and determine their proportions. I hadn’t previously explored PIL, but looking at the documentation, I found the Image.getcolors method, which seemed promising. It returns a list of pairs, where the first value in each pair is the number of pixels of a given color and the second value is the RGBa specification for that color. (As an aside, simply calling getcolors() returns None when the image contains more colors than the method’s default maximum. The docs mention passing a maximum number of colors as an argument, and using the image’s area as that maximum, im.getcolors(im.size[0]*im.size[1]), returned data.)

A simplified version of the beach ball.

This wasn’t immediately usable data, however. That method on the original image returns a list of over 3,600 colors, while when I look at the chart, I see only six. The problem is that digital images use a technique called “anti-aliasing” to make images appear smoother: shadows and the edges of text and shapes use many subtle variations on the most significant colors.

To simplify the image, I used Acorn, a Mac image editing tool. I cut off the black text labels. I used a magic wand selection tool to select the remaining “white” background (including the drop shadow) and cut that out, leaving a transparent background. I then used the magic wand to select each segment and filled it with the most representative color in the segment. This produced an image with 407 colors (shown right), and 400 of those colors appear in fewer than ten pixels each.

Using the simplified image, I wrote some python code using the interactive interpreter. The following code roughly reconstructs what I did:

from PIL import Image

COLOR_LOOKUP = {
  (236, 133, 191, 255): "Education",
  (209, 207, 212, 255): "Research",
  (204, 34, 132, 255): "Screening",
  (226, 156, 188, 255): "Administration",
  (241, 177, 211, 255): "Treatment",
  (162, 163, 167, 255): "Fundraising",

im ="/tmp/simplified.png")
colors = im.getcolors(im.size[0] * im.size[1])
colors = colors[1:7]  # I know the first is transparent and I only care about the six most common after that
total = sum(count for count, color in colors)
for count, color in colors:
    print "%s %.1f%%" % (COLOR_LOOKUP[color], float(count) / total * 100)

which produced the following:

Education 36.0%
Research 24.5%
Screening 16.0%
Administration 8.1%
Treatment 8.0%
Fundraising 7.4%

I created the COLOR_LOOKUP dict after the fact, comparing the colors PIL found to the colors in Acorn. They didn’t match exactly, which is strange, but they’re close enough that it’s pretty clear how to match up the PIL colors to the labels.
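Since the sampled colors were close but not identical, a nearest-color match (squared Euclidean distance in RGBa space) is one way to make that pairing automatic. This is a sketch of the idea, not the lookup I actually used:

```python
def nearest_label(color, lookup):
    """Return the label whose reference color is closest to `color` by squared RGBa distance."""
    best = min(lookup, key=lambda ref: sum((a - b) ** 2 for a, b in zip(ref, color)))
    return lookup[best]
```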

Here are my results laid alongside the labeled values:

Category        Computed  Labeled
Education       36.0%     34%
Research        24.5%     24%
Screening       16.0%     15%
Administration   8.1%     12%
Treatment        8.0%      7%
Fundraising      7.4%      8%

It is likely that some of the mismatch can be explained by the white borders between segments of the chart. I wonder how one might mathematically compute the data loss those borders create, and how one might prove (in the geometric sense) which shapes are least subject to that sort of problem. I suspect there are also natural perceptual tricks in how we see circles and interpret those kinds of curved lines. They suggest a three-dimensional object, which means we may subconsciously adjust our understanding of the parts of the “ball” we interpret as “farther away.”

While I’m not suggesting that anyone at Komen willfully misrepresented these numbers, it is interesting to see that the percentages of expense for administration and fundraising (a.k.a. “overhead”) come out rather lower when the actual area of colored pixels is assessed. On the other hand, administration is “farther away” in the visual field, so we may perceive it as larger than its pixel count suggests.
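The overhead gap is easy to check from the two sets of numbers above:

```python
# Computed (pixel-area) vs. labeled percentages for the two "overhead" categories.
computed = {"Administration": 8.1, "Fundraising": 7.4}
labeled = {"Administration": 12, "Fundraising": 8}

overhead_computed = sum(computed.values())  # 15.5% of the colored pixels
overhead_labeled = sum(labeled.values())    # 20% as labeled
```

So the chart's geometry gives overhead about four and a half percentage points less visual weight than the labels claim.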

As data visualization becomes more common, it is important for readers to learn to critically interpret charts and infographics. This seems to be a case where the choice of a visually novel graph is probably interfering with the clear and accurate communication of the numbers the graph is meant to convey.

Written by Joe Germuska

February 7, 2012 at 2:58 pm

Posted in Data Visualization