Tribune DataViz

Matters of interest, from the data reporters and developers across Tribune Publishing

Author Archive

Style and Substance: Analyzing a Beach Ball Chart

with 5 comments

This morning our friend Scott Klein tweeted about a chart published in the Susan G. Komen Foundation 2009-10 annual report:

The original "beach ball" chart.

For a variety of reasons, pie charts can be a misleading illustration of numbers, and this one seemed like it might suffer from exactly those problems. After a light prod from my PANDA colleague Chris Groskopf, I set out to see how well the visual representation of the numbers matched the reality. I wanted to determine whether each segment of the chart actually represented the proportion of the whole promised by its label.

There are probably more formally correct ways to do this, but Chris suggested using the Python Imaging Library (PIL) to count the colors and determine their proportions. I hadn't previously explored PIL, but looking at the documentation, I found the Image.getcolors method, which seemed promising. It returns a list of pairs, where the first value in each pair is the number of pixels of a given color and the second value is the RGBA specification for that color. (As an aside, simply calling getcolors() returned None. The method's maxcolors argument defaults to 256, and this image has far more distinct colors than that, so passing the area of the image as the argument, im.getcolors(im.size[0]*im.size[1]), guaranteed enough room and returned data.)
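Here's a tiny demonstration of the getcolors behavior on a synthetic two-color image. This is my example, not a transcript of the original session, and it's written for current Python/Pillow:

```python
from PIL import Image

# A small synthetic two-color image standing in for the chart.
im = Image.new("RGB", (10, 10), (255, 0, 0))  # all red
im.paste((0, 0, 255), (0, 0, 10, 5))          # top half blue

# getcolors(maxcolors) returns a list of (count, color) pairs. The
# default maxcolors of 256 is why calling it bare on a busy image
# yields None; passing the pixel count guarantees enough room.
colors = im.getcolors(im.size[0] * im.size[1])
print(sorted(colors, reverse=True))  # [(50, (255, 0, 0)), (50, (0, 0, 255))]
```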

A simplified version of the beach ball.

That didn't immediately yield usable data, however. Using that method on the original image returned a list of over 3,600 colors, while when I look at the chart, I see only six. The problem is that digital images use a technique called "anti-aliasing" to make shapes appear smoother: shadows and the edges of text and shapes are rendered with many subtle variations on the most significant colors.

To simplify the image, I used Acorn, a Mac image editing tool. I cut off the black text labels. I used a magic wand selection tool to select the remaining "white" background (including the drop shadow) and cut that out, leaving a transparent background. I then used the magic wand to select each segment and filled it with the most representative color in the segment. This produced an image with 407 colors (shown right), and 400 of those colors appear in fewer than ten pixels each.

Using the simplified image, I wrote some Python code in the interactive interpreter. The following roughly reconstructs what I did:

from PIL import Image

COLOR_LOOKUP = {
  (236, 133, 191, 255): "Education",
  (209, 207, 212, 255): "Research",
  (204, 34, 132, 255): "Screening",
  (226, 156, 188, 255): "Administration",
  (241, 177, 211, 255): "Treatment",
  (162, 163, 167, 255): "Fundraising",
}

im = Image.open("/tmp/simplified.png")
colors = im.getcolors(im.size[0] * im.size[1])
colors.sort(reverse=True)  # order by pixel count, most common first
colors = colors[1:7]  # the most common "color" is the transparent background; keep the six segments after it
total = sum(count for count, color in colors)
for count, color in colors:
  print "%s %.1f%%" % (COLOR_LOOKUP[color], float(count) / total * 100)

which produced the following:

Education 36.0%
Research 24.5%
Screening 16.0%
Administration 8.1%
Treatment 8.0%
Fundraising 7.4%

I created the COLOR_LOOKUP dict after the fact, comparing the colors which PIL found to the colors in Acorn. They didn’t match exactly, which is strange, but they’re close enough that it’s pretty clear how to match up the PIL colors to the labels.
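If the colors hadn't been close enough to eyeball, a nearest-color match would do the pairing automatically. A minimal sketch of that idea (my addition, with a made-up palette and probe color, not code from the original session):

```python
def nearest_color(color, palette):
    """Return the palette entry with the smallest squared RGB distance."""
    return min(palette,
               key=lambda ref: sum((a - b) ** 2 for a, b in zip(color, ref)))

# Hypothetical reference palette and a slightly-off color as PIL might report it.
palette = [(204, 34, 132), (236, 133, 191), (162, 163, 167)]
print(nearest_color((205, 36, 130), palette))  # (204, 34, 132)
```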

Here are my results laid alongside the labeled values:

Category Computed Labeled
Education 36.0% 34%
Research 24.5% 24%
Screening 16.0% 15%
Administration 8.1% 12%
Treatment 8.0% 7%
Fundraising 7.4% 8%

It is likely that some of the mismatch can be explained by the white borders between segments of the chart. I wonder how one might mathematically compute the data loss those borders create, and how one might prove (in the geometric sense) which shapes are least subject to that sort of problem. I suspect there are also natural perceptual tricks to how we see circles and how we interpret those kinds of curved lines. They suggest a three-dimensional object, which means we may subconsciously be adjusting our understanding of the parts of the "ball" we interpret as "farther away."
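For what it's worth, a crude estimate of the border loss is possible: each white border between two segments is roughly a strip one radius long, so n borders of width w in a disc of radius r consume about n·w·r of the π·r² total. A quick sketch with invented dimensions (these numbers are illustrative, not measured from the Komen chart):

```python
import math

def border_fraction(n_segments, border_width, radius):
    # Each of the n borders is approximately a strip of length `radius`
    # and width `border_width`; compare that to the full disc area.
    border_area = n_segments * border_width * radius
    return border_area / (math.pi * radius ** 2)

# Hypothetical numbers: six segments, 4-pixel borders, 200-pixel radius.
print("%.1f%%" % (border_fraction(6, 4, 200) * 100))  # 3.8%
```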

While I’m not suggesting that anyone at Komen willfully misrepresented these numbers, it is interesting to see that the percentages of expense for administration and fundraising (a.k.a. “overhead”) come out rather lower when the actual area of colored pixels is assessed. On the other hand, administration is “farther away” in the visual field, so we may perceive it as larger than the actual number of pixels.

As data visualization becomes more common, it is important for readers to learn to interpret charts and infographics critically. This seems to be a case where the choice of a visually novel graph interferes with the clear and accurate communication of the numbers the graph is meant to convey.


Written by Joe Germuska

February 7, 2012 at 2:58 pm

Posted in Data Visualization

Analyzing Illinois Pension Codes using the Open States Project

with one comment

Tribune reporter Jason Grotto has been leading a team of Tribune and WGN-TV reporters on a series of investigations into the morass which is the Illinois pension system. For a recent report, I helped them analyze the laws passed by the Illinois General Assembly to tell the broad story of how we got into this situation.

If you look carefully, you can see me in the back corner of this room, hunched over my laptop. To my left is Ian Bicking, and to his left, Brian Boyer. (Photo by Joshua Ruihley)

To do the analysis, we built upon the Open States Project framework developed by our friends at Sunlight Labs. I’ll write some more about the technical details below, but first, there’s a good story. I met Brian Boyer, my editor, at a Sunlight Labs hackathon during the PyCon 2009 code sprints. At the event, we were working on code for the precursor project to Open States, although neither of us worked on code for Illinois—someone else had claimed that work. This was before the TribApps team was officially formed, but thanks to Sunlight, I was in a good position to apply for a spot on the team when it was created.

Furthermore, in my first few weeks on the job, some of our reporters came to us interested in General Assembly research. By that point, it appeared that the Open States Illinois work was not going anywhere, so I worked out the basics of scraping the site and contributed my work to the project.

Sunlight has continued to invest in Open States, which means that the code I contributed then has gotten better without a bit of work from me—you can probably see why we get enthusiastic about open source development. As a foundation for Open States, they've made a very robust library called scrapelib, which can be used to manage caching and robots.txt compliance for any web scraper. It even has an interesting tool called scrapeshell, which is appealing to me as someone who likes interactive code development. If I find myself scraping any other websites, I'll definitely use Sunlight's library rather than build my own.
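To picture what a library like scrapelib takes off your hands, here's a toy, stdlib-only sketch of the same concerns: robots.txt compliance, rate limiting, and response caching. This is my illustration of the ideas, not scrapelib's actual API, and the URLs and fetcher are placeholders so the example runs offline:

```python
import time
import urllib.robotparser

class PoliteScraper:
    """Toy illustration of what a library like scrapelib handles for you."""

    def __init__(self, fetch, robots_txt, requests_per_minute=60):
        self.fetch = fetch                      # injected fetch function
        self.delay = 60.0 / requests_per_minute
        self.cache = {}
        self.last_request = 0.0
        self.robots = urllib.robotparser.RobotFileParser()
        self.robots.parse(robots_txt.splitlines())

    def get(self, url):
        if not self.robots.can_fetch("*", url):
            raise ValueError("disallowed by robots.txt: %s" % url)
        if url in self.cache:                   # cached: no request, no delay
            return self.cache[url]
        wait = self.delay - (time.time() - self.last_request)
        if wait > 0:
            time.sleep(wait)                    # stay under the rate limit
        self.last_request = time.time()
        self.cache[url] = self.fetch(url)
        return self.cache[url]

# Usage with a fake fetcher and placeholder URLs:
pages = {"http://example.com/bills": "<html>bill index</html>"}
scraper = PoliteScraper(pages.get, "User-agent: *\nDisallow: /private/",
                        requests_per_minute=6000)
print(scraper.get("http://example.com/bills"))  # <html>bill index</html>
```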

Some minor changes that were made to the Illinois General Assembly website had broken the scraper for past legislative sessions, but that turned out to be pretty easy to fix, so I was quickly up and running with the 2011 version of the code, and I was able to turn to gathering the data Jason needed. (And of course I offered my fixes back to the core project.)

How we used Open States for this story

Here’s how Jason and Tribune reporter Ray Long described Illinois’ system in the recent story:

Pension experts say healthy pension plans require stable, well-structured rules that are rarely altered. But Illinois’ pension code is a tangled, convoluted mess of provisions riddled with giveaways, funding flaws, excessive borrowing and pension holidays. Since 2003 alone, the pension laws have been revised nearly 700 times, according to a Tribune/WGN-TV analysis.

The first step in establishing figures like this for the story was to run a complete scrape of the General Assembly’s legislative activity back to 2003.  Running on my MacBook, this took about four days and produced about one gigabyte’s worth of JSON data files (not to mention over 3.7 GB of cached files from the ILGA website.)

Since 2003, the General Assembly has filed nearly 60,000 pieces of legislation, but only a few of those became laws, and of course, most of those laws don't touch the pension codes. I wrote Python code to analyze the data produced by the scrape and identify only those bills which were actually passed and signed by the governor. I then wrote some custom code to analyze the full text of each public act (that is, the version of the bill which was signed into law) to find those which made changes to the Illinois Pension Code. (Specifically, I looked for the text 40 ILCS 5, which refers to that section of the Illinois Compiled Statutes.) This allowed me to produce an index, in spreadsheet form, of changes to the pension laws since the 93rd session of the Illinois General Assembly. On a technical note, one of the nice features of the scrapelib library mentioned above was that, with the 3.7+ GB of cached web pages on hand, I could run each refinement of my indexing code against the cached documents, which made things much faster.
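The pension-code check described above boils down to a text search. A minimal reconstruction (my sketch, not the actual project code) might look like this:

```python
import re

# "40 ILCS 5" is the Illinois Pension Code's citation in the Illinois
# Compiled Statutes; allow for variable whitespace, and use a word
# boundary so "40 ILCS 505" doesn't match.
PENSION_CODE = re.compile(r"40\s+ILCS\s+5\b")

def amends_pension_code(act_text):
    return bool(PENSION_CODE.search(act_text))

print(amends_pension_code("...amending 40 ILCS 5/14-110..."))  # True
print(amends_pension_code("...amending 35 ILCS 200/1-1..."))   # False
```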

Some of the code I wrote, like that which extracts the list of statutes impacted by a bill, is probably of general use. I haven’t shared that back with the Open States project yet because I want to do it in a way which they believe is extensible to other states. I started a conversation about that on the Open States developer list, but we haven’t settled the question yet. Some other parts are more unique to how the Illinois legislature works, or just to how our team wanted to look at this specific issue. Open States has some provisions for “custom” data properties, so perhaps in the future I’ll find time to determine a good way to include that as well.

Written by Joe Germuska

December 19, 2011 at 3:22 pm

Posted in Craft, Open Source, Python

Steal this code: Presentation to Hacks/Hackers-OpenGovChicago

with 4 comments

On Wednesday, October 19, I presented a talk on using TribApps tools and libraries to a joint meetup of the Hacks/Hackers Chicago and OpenGov Chicago groups. It was roughly based on my earlier blog post, Apps for Metro Chicago, and aimed at spreading information about the resources described there.

Here are the slides I presented.

Here are links to a few things I referenced:

Written by Joe Germuska

October 19, 2011 at 6:42 pm

Posted in Events, Open Source, Talks

Going Deep with Census Demographic and Economic Data

leave a comment »

The McCormick Foundation and the Cronkite School of Journalism produced a series of 17 videos covering various Census demographic and economic data resources. You can watch all of them at no cost.

In June, I was part of a project to help journalists learn more about the wealth of data provided by the Census Bureau. The end result of the weekend was seventeen 40-minute videos on diverse topics, which, as our convener Steve Doig observed, is effectively a semester-long course, and it's all available for you to watch online.

About half of the presenters were seasoned journalists (Pulitzer Prize winners, even) who made a point of suggesting specific stories you could write from the data, and most of the rest were Census staff who know their stuff cold. And then there was me! Here's a direct link to the video of my presentation on Census Geography, Maps, and Apps and a bitly bundle of the links I referenced in my presentation.

In some cases redundant to the bitly bundle, here is a summary of Census resources we’ve made or been involved with:

Written by Joe Germuska

July 19, 2011 at 10:47 am

Posted in Talks

Apps for Metro Chicago

with 2 comments

Sharing our work is one of the core values of the News Apps team. With every project, we ask ourselves if we can feasibly offer the base data for download, and the answer is usually yes. We also release open-source code whenever we find something that can be generalized for re-use.

We’d like to invite the community of developers participating in the Apps for Metro Chicago contest to use our data and resources however you see fit. Here’s a list of things which may be useful:


  • School data: we’ve recently enabled JSON access to the data which drives our School Report Cards application.
  • Census data: we’re part of a team that built a 2010 US Census site which also provides JSON access to the data, as well as bulk exports which help with the complexity of comparing data with the 2000 census. Illinois data hasn’t been released yet, but you can experiment with 18 other states so far, so that you’re ready when Illinois data drops.
  • Geographic data: we have a running instance of django-boundaryservice with locally oriented shapefiles. This is not designed for heavy load, so for a production project, you should plan to set up your own boundary service. We may need to retract the service if traffic is too high, but we're happy to have you use it for building a proof of concept.

Base Tile Layers for Maps

Slippy-map JavaScript libraries like the Google Maps JavaScript API and Leaflet.js support custom layers. You can use the layers we've built in your own app.

In general, custom layers work by providing a set of files under a root URL so that the mapping JavaScript can compute a URL for a specific tile and retrieve it. Using the Leaflet URL template syntax, the basic URL format for using any of our base layers is

You can see layers we’ve made at our map tile browser. The name in the browser control panel is the value you should use for “layer_name” in the URL format above.

Whenever we can, we’ll make layers we create for our own maps available through that browser. We’re still working out exactly how to document things, so for now, comment on this post or hit us up on Twitter (@TribApps) if you need guidance.

The “wards” and “neighborhoods” layers are adapted from maps we’ve made for other stories and are pretty straightforwardly derived from shapefiles provided at The “wardchange2000” and “wardchange2010” layers come from this map: Reshaping Chicago’s political map: race, ward-by-ward.

Other tools

Besides the above, we’ve open-sourced a number of tools on the team’s GitHub account, and Chris, in particular, has made other useful projects available on his own GitHub account.

Written by Joe Germuska

July 15, 2011 at 4:02 pm

Posted in Mapping, Open Source

NewsApps hosts ChiPy-Hacks/Hackers Meetup April 14

leave a comment »

The Chicago Tribune News Applications team wants to introduce hacks to hackers. We’re hosting the April meeting of the Chicago Python Users Group (ChiPy) and planning an agenda that should be of crossover interest to journalists and coders alike.

The main event will be a reprise of Christopher Groskopf's PyCon 2011 talk, "Best Practices for Impossible Deadlines," where he provides a general overview of how the Tribune NewsApps team has developed its methodology for building applications at the speed of news.

Also on the agenda:
* Jason Grotto (Chicago Tribune) and Jeff Kelly Lowenstein (Hoy, formerly Chicago Reporter) will present brief case studies of computer assisted reporting projects they’ve done, explaining the theory and practice of finding stories in data

* Larry Adams and Nate Nichols will present the how and why of using Python to create domain specific languages (DSLs) as part of Narrative Science’s algorithmic news production process.

* Lightning talks: each speaker has exactly five minutes to present on any topic of likely interest to the audience. News Apps team members will be presenting on projects, tools and/or tricks of the trade. You can present as well!

If you’re interested in presenting a lightning talk on anything pertinent to news and technology, please send an email to — only people who make contact before the event will be allowed to present, and if we get too much advance interest, we may need to subject the list of topics to some kind of vote.

Don’t be shy, five minutes is not long—you probably have something interesting to share.

If you have a topic that might be longer than lightning-talk length, email with a brief description of the topic and the timing, and we can see if it fits in.

ADVANCE RSVP IS REQUIRED in order to gain admission to Tribune Tower. RSVP at the Hacks/Hackers meetup site or by email to


Written by Joe Germuska

March 31, 2011 at 11:03 am

Posted in Events

CrowdMapping the Chicago Blizzard

with 12 comments

A cargo train on the south side of Chicago

A cargo train on the south side of Chicago, a day after the blizzard, Thursday, February 3, 2011. (Alex Garcia/Chicago Tribune)

Last week Chicago experienced its third-largest blizzard on record. When we came into the newsroom on Monday morning, forecasters were warning us that a big one was on the way. Our team considered various ways we could contribute to covering the story and decided to focus on implementing a variation on the Snowmageddon Cleanup project originally launched by PICnet with the Washington Post in the wake of the February 2010 DC blizzard.

Put simply, Snowmageddon Cleanup is a tool meant to help connect people who need shovels with people who have them. The tool is built on an open source application called Ushahidi which drew much attention after it was deployed to help with crisis response after the January 2010 earthquake in Haiti. The Ushahidi team provides a hosted solution at which makes setting up an instance quite easy, so we spent Monday configuring a basic setup modeled off of Snowmageddon Cleanup hosted at While we were setting up, our friend Justin Massa of the Metro Chicago Information Center volunteered to help with administration and outreach, including asking MCIC intern Resney Gugwor to serve as a report moderator as well. Deborah Shaddon, the Chicago city lead for CrisisCommons, also found us and introduced us to a number of enthusiastic volunteers from the wider community.

Graph of CrowdMap reports

Number of reports submitted per hour during the height of our CrowdMap activity (click to view full-size)

On Tuesday we configured our CrowdMap to accept reports by email and SMS in addition to the web form, and we worked with our colleagues around the Tribune to publicize the service. By midnight on our first day we'd approved 192 reports, and many more came in overnight. Things really picked up speed at the end of the workday on Wednesday: as our team went home and left the administration unattended for a couple of hours, more than 200 reports came in. We rallied the troops and spent the evening approving reports and looking for any which seemed to merit additional follow-up (more on that later), set up a project Twitter handle (@ChiSnowMap), and wondered what we'd gotten ourselves into.

Thursday brought nearly 300 more reports, although spaced more evenly through the day. By Sunday morning, when we disabled submission of new reports, we had received about 1,100 reports, of which 1,053 were approved and posted on the site.

Chicago Snow Crowdmap report statistics

Overview statistics of reports submitted to the Chicago Snow CrowdMap. (Click to view full-size.)


We were excited that so many Chicagoans heard about the map and chose to use it to report issues, offer help, and tell stories of “victories”. However, looking back, it’s hard to gauge just how much help the system provided. Here are a few things we’ve identified which we will do differently if we do something like this again.

More Partners

Erik Hersman, one of the creators of Ushahidi, observed that technology was only 10% of their solution. That sounds about right, especially because in our case we had only to do minimal configuration of an existing tool. In the future, we’d like to work more closely with government agencies, other media organizations, and independent civic groups to be more sure of connecting all the dots.

Specifically, even if we work more closely with the city, I don’t think we are going to be dispatching plows based on crowd reports. Streets & Sanitation has a process, and they aren’t looking to us to supersede the city’s own 311 call center. I’d prefer a tool that is more clearly organized around helping citizens help each other in the many small ways that 311 could never handle.  (Of course, it’s also technically feasible that we could set up a bridge between a crowd-sourced solution and the city’s 311 system so that calls they get which are not “actionable” could be routed to the crowd tool, and also administrators could forward requests which are for the city to cover directly into the 311 system.)

And if the site is more organized around neighborly help, it would be good to work with volunteer organizations like Rotary, Kiwanis, scout troops, churches, and the like to get more offers for help in the system and to have more eyes reviewing the site and helping to match offers for help with requests.

Finally, while we appreciated the story leads we gathered from the tool (see below), we’re much more interested in making sure that everyone who could benefit from it knows about it. Next time we’ll work with any interested local media organizations to set up a site in partnership, so that there are no obstacles to them promoting it widely.

Review Software Options

CrowdMap was a good solution because it didn’t require writing any code, and only took a few hours to configure. However, the flip-side of easy-to-launch is that less customization is available. We might have liked to have a few extra pages of information or resources, or to put more specific text on the “submit a report” form. Ushahidi (CrowdMap’s big brother) is also available for installation on our own servers, and if we went that route, we could almost certainly have added those kinds of customizations.

While Ushahidi and CrowdMap dominate the discussion of crowd-based crisis response applications, we began to wonder if it was actually designed to solve problems other than ours. Ushahidi’s origin was in collecting reports of violence during elections in Kenya. Administrators would then edit the reports with assessments of “source reliability” and “information probability,” with the ultimate goal of marking reports as “verified.” We weren’t particularly concerned with making those calls for the reports we received.

In retrospect, we were thinking of problem reports as bugs which needed fixing. (We are software developers, after all.) We were looking for ways to adapt CrowdMap to work more like a bug tracker, allowing us to indicate that a report had been assigned to someone, or that it had been resolved and could be hidden from the default view. We tried a few hacks using categories and editing report titles, but didn’t quite find a strategy that satisfied. If Ushahidi will continue to be deployed in this kind of situation, it might benefit from building in a few concepts from the world of bug-tracking software.

One specific challenge we encountered was maintaining contact with people reporting issues. In some cases people made pretty specific requests for help, but did not provide any contact information, or if they did, they provided it in fields which are not public by default. Since we weren’t going to be able to help each person making a request, it would have been better if helpers could directly contact people in need. We had at least one case where a helpful citizen went to a specific address which was posted, but couldn’t reach the resident to establish exactly what help was needed.

Of course, there are conventional bug-tracker applications which I guess could be applied to a situation like this, although I don’t know of any which integrate with a map or with SMS-reporting, and most bug-trackers would be intimidating to regular folks. There are also existing “civic bug trackers” like SeeClickFix which are in some ways closer to what, in retrospect, we wanted, although we’d have to do more development of our own to customize it for a specific event.

The News Angle

While it wasn’t our primary goal, it’s worth noting that reports to the Crowd Map did support some of the Tribune’s reporting. Early on Wednesday morning, an editor identified about a dozen sources to contact for more detailed accounts of their blizzard experiences, and those supported several stories including this one.

In Closing

It's hard to tell just what impact we had with the crowd map. While they were a small fraction of our overall reports, we did have about 50 people offering help and another 50 reporting victories (my favorite: "Here in Logan Square, I went outside with my boyfriend to build a snowman and came back inside with a fiance. This blizzard's not so bad after all.") I'd like to think that even more people would have helped each other out if we could tune the tools and the process a little bit. And there was clearly a lot of enthusiasm for the general idea.

Major thanks are due to:

  • my NewsApps teammates, Brian Boyer, Christopher Groskopf, Ryan Mark and Ryan Nagle, for all kinds of contributions to setting up and managing the application;
  • Justin Massa and Resney Gugwor of MCIC, who also helped administer, and to Justin also for working his many contacts to spread the word about the site;
  • Jessica Jackson for approving reports, researching and entering dozens of shelters and parking accommodations as 'solutions', and going out with a shovel when a "dig me out" report from her neighborhood came in;
  • Jenni Prokopy for researching medical options for people who feared missing dialysis treatments if they didn’t get plowed out;
  • Daniel X. O’Neil for helping clear out reports during the Wednesday evening burst;
  • Deb Shaddon and Heather Blanchard for connecting us to the CrisisCommons and CrisisCamp Chicago communities;
  • the Ushahidi/CrowdMap team, for providing an out-of-the-box solution;
  • Mobile Commons, for donating an SMS code;
  • Laura Lanford and Daniel Edwards from Chicago CERT, Cathy Graham and Chris Thompson from Humanity Road, and Jus Mackinnon from southern England, for helping review reports and look for people in urgent need;
  • @ColonelTribune and many other folks on twitter for helping to spread the word.

Things were moving fast most of the week, and I may well have overlooked other helpers; if so, I sincerely apologize.

Written by Joe Germuska

February 9, 2011 at 10:35 am

Posted in Events