Archive for the ‘Python’ Category
Do you consider yourself a fan of Python? Are you passionate about reporting in Chicago? Have you ever had long discussions about documentation? Then you might be right for TribGraphics. Check out the job posting here and apply soon!
We want our newsroom to look more like the city that we serve. If you’re a person from a group that’s underrepresented in engineering or journalism, we really want to see your application!
Today we released Tarbell 0.9 Beta 6. (Tarbell is our open-source static site generator based on Google spreadsheets, made with newsrooms in mind. Read more here!) This is our biggest release to date and should be our last release before a stable 1.0 version to come in the next week or two. Here are some notable changes:
- New naming conventions: The “base template” naming was confusing to users. We have switched to the term “Tarbell blueprints” to better reflect the role and function of this key Tarbell concept. The “_base” directory has been renamed “_blueprint” and the documentation now refers to “Tarbell blueprints” instead of “base templates.” Projects created with previous versions of Tarbell will still work.
- Expanded documentation: We greatly expanded and improved the Tarbell documentation, including a more in-depth tutorial.
- New hook system: Developers can now trigger actions during project installation. Create a repository and tickets when creating a new project or refresh the Facebook cache when publishing.
- Improved command line interface: Better wording, formatting, and line-wrapping.
- Better credentials: Tarbell now supports publishing from non-interactive environments.
- Support for project requirements: Tarbell projects and blueprints can now specify 3rd party Python libraries as dependencies.
Get started by installing Tarbell! Already a Tarbell user? Upgrade with:
pip install -U tarbell
Special thanks goes to Heather Billings, who did tremendous work on this release.
Covering elections is a staple in American journalism. I’ve covered elections as a reporter and I’ve helped display election data in drastically different ways at three news organizations.
So first, a little primer on elections data. Generally speaking, on election night, the data for vote totals is tabulated by county boards of election and then sent to a state-level board. Next, the data is harvested by vendors such as Ipsos and the Associated Press. Until recently, the only nationwide election data vendor for news organizations was the AP. While other data vendors exist, they usually focus on more niche markets, such as campaigns and political parties.
The AP has a physical person in every U.S. county to report back to them what the current vote totals are for different races. It’s incredibly costly, but means you can dive deep into trends in data. The AP has a system that lets you FTP in and download the data in XML or CSV format, which your publication can then display.
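For the curious, here's roughly what that fetch-and-parse step can look like in Python. This is only a sketch: the host, credentials, path and file layout below are placeholders, not the AP's actual feed.

# A minimal sketch (not the AP's actual endpoints) of pulling a results
# file over FTP and reading it as CSV. Host, path and credentials are
# placeholders; the real feed layout depends on your wire agreement.
import csv
import io
from ftplib import FTP

def fetch_results(host, user, password, remote_path):
    buf = io.BytesIO()
    with FTP(host) as ftp:
        ftp.login(user, password)
        ftp.retrbinary("RETR " + remote_path, buf.write)
    text = buf.getvalue().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))

rows = fetch_results("ftp.example.com", "user", "secret", "/results/us_topofticket.txt")
for row in rows[:5]:
    print(row)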
The AP doesn’t always get state, county or local-level election data in this same manner. Thankfully, most states (and some counties) have online data portals, RSS feeds or APIs that can be downloaded, scraped or accessed to get the data you’re looking for. In some places, though, a real person has to sit in an election board’s offices and get the election data back to the news organization somehow, typically by calling or emailing.
While displaying data online may get a lot of attention these days, remember that many news organizations still print something every day. So news organizations have also needed to solve the problem of importing AP election data into their print editions — generally through decades-old pagination systems.
Now let’s talk about the differences between the three places I’ve wrangled election data for.
The New York Times Regional Media Group’s Election Data
In 2010, I was a newbie developer at the now-renamed The New York Times Regional Media Group. I started a few weeks before the 2010 midterm elections. My new coworkers had already built a system to FTP into the AP, import the data into a MySQL database and then display it on our 14 news websites using iframes hitting tables built in PHP.
I helped by load-testing, or seeing how much traffic the project could take, while we were running importation tests of the AP’s test data runs. By my estimations using Siege, I thought we were in the clear, with 2,500 hits a minute not crippling anything. If election night traffic had indeed been 2,500 hits a minute, we might have been in the clear. We were not.
If memory serves, we had one EC2 medium instance running to import and display the data and a medium MySQL instance running for the database. I didn’t know about caching and thought it was just something that was turned on automatically. It wasn’t.
On election night, we had two newspapers who received election data first, and things ran smoothly with them, as they were in smaller markets. Then the Florida papers started getting heavy traffic. Our EC2 instances became bottlenecked, stuck at 99 percent CPU usage, unable to read the AP data, let alone write it to the database with updates.
This brought all 14 of the newspaper websites to a crawl because these iframes were getting loaded before almost anything else on the page. In the end, homepage editors took the iframes off the pages, a coworker wrote some SQL to hand-optimize the election tables and, by then, traffic to the sites had subsided to reasonable levels.
It was the scariest night of my professional life. Thankfully, most of the newspapers were happy, as they hadn’t ever even attempted to display live election data on their websites, so this was still an improvement for them. And I learned to set up caching — in later cases, Varnish — when attempting to hit a live database in any way.
The Boston Globe’s Election Data
Next, I was at the Boston Globe during the 2012 general primaries. As then-hopeful Mitt Romney was the former governor of Massachusetts, the Boston Globe was a major source for news and coverage of the GOP primary battle. And the New Hampshire primaries were that paper’s bread and butter.
But the team I worked on had a fun logistical problem: We needed to display the data on two websites, Boston.com and the newly-launched BostonGlobe.com. Each ran in a different content management system, each had different styles and each wanted the data displayed a little differently.
The first problem we had to solve was how to pull in the data. The Boston Globe’s CMS was Methode, which stored everything — stories, photos, etc. — as pieces of content in XML. As the AP already provided data in an XML format, we would just need to import it, change some of the tags to better suit the Methode ingestion system and then I would write the code necessary to display the data.
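To give a flavor of that tag-conversion step, here's a small sketch using Python's ElementTree. The tag names on both sides are made up; the real AP and Methode schemas are considerably more involved.

# A sketch of the "change some of the tags" step, with hypothetical tag
# names on both sides; the actual AP and Methode schemas differ.
import xml.etree.ElementTree as ET

TAG_MAP = {
    "ReportingUnit": "region",     # hypothetical AP tag -> hypothetical Methode tag
    "Candidate": "candidate",
    "VoteCount": "votes",
}

def convert(ap_xml_path, out_path):
    tree = ET.parse(ap_xml_path)
    for element in tree.iter():
        if element.tag in TAG_MAP:
            element.tag = TAG_MAP[element.tag]
    tree.write(out_path, encoding="utf-8", xml_declaration=True)

convert("ap_results.xml", "methode_results.xml")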
Thankfully, the Boston Globe’s systems staff figured out quickly how to go in and download the XML data and put it into a spot in the CMS that I could access. We had created mockups and styles for displaying the data responsively — still a new concept at the time — and now had to pull in the data, via some incredibly ugly Java I wrote.
We didn’t have time to do something similar with the Boston.com CMS, which was at the time, I believe, going on 12 years old, and was somewhat fragile. So we decided to build separate styles and templates in BostonGlobe.com that we could iframe into Boston.com. Not the best way to do things, but it’s how we did it.
And then, as the primaries started happening more and more frequently, I had to make each primary its own chunk of code, violating the DRY principle repeatedly, while trying to get everything deployed to production in time for the producers to slot the items on the various homepages.
Another coworker had an old Python script that just created basic HTML tables for county/town election totals and pushed them into Boston.com, for a more in-depth look. Lots of moving parts, different content management systems, different styles, a lot of work for the small number of people working on it.
The Chicago Tribune Way(s)
Now I’m at the Chicago Tribune. In 2012, my coworkers built a system that pulled AP election data into a Django site with Varnish in front for caching. For local races, Chicago Tribune staffers entered data gleaned from various county board of elections sites into Google spreadsheets, which were then turned into flat files as well. And then the AP data was pulled into our pagination system for the print product through tables the AP sent, just as it had been done in previous elections.
Fast forward to a month ago. The Chicago Tribune no longer subscribes to the Associated Press, but Reuters has entered the election data game. Instead of having to FTP and download XML files, we hit an API and receive JSON. It’s pretty nifty and much more conducive to building web-facing applications.
We wrote a Python wrapper to hit the Reuters API and reformat the data for our purposes, and then we again built flat pages based on that data, using Django Medusa. And for local elections and referenda that Reuters wasn’t covering, we again had Tribune staffers entering data into Google spreadsheets.
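The wrapper itself isn't magic. Here's a sketch of the general idea: request JSON, then reshape it into whatever structure your templates expect. The URL, parameters and field names below are placeholders, not Reuters' actual API.

# A sketch of a thin API wrapper: fetch JSON, reshape it for templates.
# The URL, parameters and field names are placeholders, not Reuters'
# actual election API.
import requests

BASE_URL = "https://api.example.com/elections/2014-03-18/races"

def get_races(api_key, state="IL"):
    response = requests.get(
        BASE_URL,
        params={"state": state, "format": "json"},
        headers={"Authorization": api_key},
        timeout=30,
    )
    response.raise_for_status()
    races = []
    for race in response.json().get("races", []):
        races.append({
            "office": race.get("office_name"),
            "candidates": sorted(race.get("candidates", []),
                                 key=lambda c: c.get("vote_count", 0),
                                 reverse=True),
        })
    return races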
We still had to write a custom system that takes the Reuters and Google spreadsheet data and sends it to our pagination system. This required us figuring out how the data needed to look — basically a mix of XML-ish template tags and tables — and then FTPing it to an area where our pagination system could ingest the files, give them proper templating and allow page designers to put them on pages.
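That last mile looks something like the sketch below, with the tag names, host and path standing in for whatever a given pagination system actually expects.

# A sketch of the print hand-off: write results as tagged text and FTP
# it to a drop folder the pagination system watches. The markup, host
# and path are placeholders; every pagination system wants its own format.
from ftplib import FTP
from io import BytesIO

def render_race(race):
    lines = ['<race name="%s">' % race["office"]]
    for candidate in race["candidates"]:
        lines.append("<row>%s\t%s</row>" % (candidate["name"], candidate["votes"]))
    lines.append("</race>")
    return "\n".join(lines)

def push_to_pagination(races, host, user, password, remote_name):
    payload = "\n".join(render_race(r) for r in races).encode("utf-8")
    with FTP(host) as ftp:
        ftp.login(user, password)
        ftp.storbinary("STOR " + remote_name, BytesIO(payload))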
So what have I learned?
Elections are big events traffic-wise, and static sites take large traffic pretty well. With the Boston Globe and Chicago Tribune solutions of using basically static sites (XML and sites baked to S3), it meant little freaking out at 9 p.m. when you’re getting thousands of pageviews a second. If you’re having to deal with lots of calls to your database while it’s also reading and writing, you’re going to have a bad time. Static sites are wicked great.
Testing is important, but knowing what to test is more important. At The New York Times Regional Media Group, I thought I knew what I was doing, but I was testing for unrealistically low traffic and didn’t think about what would happen while it was trying to write election data to the database, too. I now know I could have asked folks on the NICAR listserv for help, or tweeted questions or really just asked anyone with a few years of experience, “Hey, will this work?”
Election nights are stressful, so be cheerful and smiley. We at team Trib Apps try to be cheerful and kind whenever working with anyone, but with this many moving parts, it never hurts to just think “smile while saying words” when conversing with other folks. We’re all working hard on these nights, and I’m a big fan of not adding any extra stress on people’s lives. That’s also part of what our technology is supposed to do — make things easier for folks in the newsroom.
Have a point person from the tech side to coordinate with the newsroom. When local election data started coming in, I stood in the area where folks were entering it into Google spreadsheets, just so someone was around to help answer any questions on the spot, while David Eads, who was the lead developer on the elections project, made sure the technical side was running smoothly. We had only one minor hiccup that was quickly fixed and we were able to identify it because we were all near one another, able to communicate more effectively. Even though we work with machines, this job is mostly about communication between humans.
Know that you’re going to be covering an election again and make your code reusable. When we were writing our code for the primary, we knew a general was coming up in November. We also knew that other Tribune newspapers would be itching to show election results so we needed to get the fundamentals right the first time.
We would love to hear about your experiences with election data. Please feel free to add a comment and tell us your story.
We’ve been hard at work on the next version of our Tarbell project, which makes it easy to build static websites.
Tarbell has been through many iterations over the past year. The first open source release of the library came out in May. This version of Tarbell served us and others well and pointed a way forward.
Five months and many projects later, we are releasing our first preview of the new, entirely overhauled Tarbell. The new release reflects our experience managing dozens of projects and incorporates all the incredibly helpful feedback we’ve received from the community.
The old version of Tarbell has been preserved as v0.8 for posterity. The new version is v0.9.
A few highlights:
- One line install: Run pip install tarbell on just about any *NIX-based operating system to install.
- Command line app: The tarbell command line application replaces the hodgepodge of tools previously used by Tarbell.
- Standalone projects: Instead of maintaining projects and base templates in one big directory, base templates and projects live in their own git repositories.
- As many base projects as you want: Need templates for long-reads, map projects and data-driven projects? No problem! Tarbell’s new project template system can accommodate as many or as few project templates as you need.
- Google Drive API for spreadsheet access: The old Google spreadsheets API is going away some day. Tarbell now uses the future-proof Drive API.
- Faster preview and publishing: Improved performance across the board.
- Google spreadsheets and Amazon S3 publishing are optional: It is now easy to create a Tarbell project without configuring Google Drive API access or Amazon S3.
To use Tarbell 0.9, head over to the project page at http://tarbell.tribapps.com to learn more. We expect a final release of 0.9 some time in the next few weeks as folks find bugs and work with the system while we plan a roadmap for version 1.0.
If you’d like support or to discuss Tarbell, please join our new Google Group.
If you’re going to MozFest (October 25 – 27, 2013), please come to the Tarbell workshop, which will be in the software for journalism conference track. Details to be announced.
The Tribune company’s CMS is powerful but monolithic. It makes publishing news for several huge metro areas possible, but it also means that it is hard to build new applications. Lucky for us, the IT team has been developing an API that lets us create custom tools to read from and write to the CMS.
Our NATO applications were designed to elevate the awesome work done by our reporters, photographers, and editors, and to make their jobs easier as they covered the summit, from the streets to the boardroom.
The gallery generator
We knew our photographers would be taking great photos during the NATO summit, and believed that readers would find them compelling. We wanted to show the photos at their best.
First, we built a tool to create big, beautiful, responsive photo galleries from the galleries photo editors already produce with their existing workflow and tools. The gallery generator adds a final, lightweight step to that process for photo editors who want to publish high-resolution galleries after they’ve built them in the CMS. The gallery tool has been used for covering NATO, sports, and international reporting, among other things.
The gallery generator is a simple Flask app that takes the ID of a photo gallery. It downloads and processes images, provides a preview, and finally publishes the gallery as static HTML on Amazon S3.
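In rough outline, the app looks something like the sketch below. The CMS call, template and bucket name are stand-ins, not our production code; the point is just the preview-then-publish-to-S3 shape.

# A minimal sketch of the gallery generator's shape, not the production
# app: fetch_gallery() stands in for the CMS API call, and the bucket
# and template names are placeholders. Uses Flask and boto3.
import boto3
from flask import Flask, render_template

app = Flask(__name__)
s3 = boto3.client("s3")

def fetch_gallery(gallery_id):
    """Placeholder for the CMS API call that returns gallery metadata."""
    return {"id": gallery_id, "title": "Example gallery", "photos": []}

@app.route("/gallery/<gallery_id>/preview")
def preview(gallery_id):
    return render_template("gallery.html", gallery=fetch_gallery(gallery_id))

@app.route("/gallery/<gallery_id>/publish", methods=["POST"])
def publish(gallery_id):
    html = render_template("gallery.html", gallery=fetch_gallery(gallery_id))
    s3.put_object(
        Bucket="example-bucket",
        Key="galleries/%s/index.html" % gallery_id,
        Body=html.encode("utf-8"),
        ContentType="text/html",
    )
    return "Published gallery %s" % gallery_id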
Editors start on a friendly page, then preview and publish the gallery:
Voila! A beautiful photo gallery that works on a wide range of devices, from gigantic monitors to the iPad, iPhone and Android devices.
The live blog editor
In addition to great photography, we knew that our reporters would be gathering stories from all over the city, and that readers would want to read those stories right away—even before the details could be woven into a full story. To improve our “incremental” publishing, we designed a live blog editor, called Layercake. A great example of Layercake in action is the Tribune’s live blog from the first day of the NATO summit.
While the gallery generator reads data using the API and spits out HTML files, our little Flask app reads from and writes to the content management system. The goal of Layercake is to take the tedious parts out of building live blog-style posts so that editors can get on with their work.
As reports come in, our breaking news staff uses the web interface to add updates to the story. The app takes care of generating HTML, making it much easier for the team to publish photos and videos. Layercake also provides consistent formatting for headlines and timestamps, making it easier for readers to quickly scan the story.
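The formatting piece is simple but valuable. Here's a sketch of the idea, with made-up field names and markup rather than Layercake's actual output:

# A sketch of the "consistent formatting" idea: every update gets the
# same headline and timestamp markup. Field names and HTML are
# assumptions, not Layercake's actual code.
from datetime import datetime
from html import escape

def render_update(update):
    stamp = update["time"].strftime("%I:%M %p").lstrip("0")  # e.g. "3:07 PM"
    return (
        '<div class="update">\n'
        '  <span class="timestamp">{stamp}</span>\n'
        '  <h3>{headline}</h3>\n'
        '  <div class="body">{body}</div>\n'
        '</div>'
    ).format(stamp=stamp, headline=escape(update["headline"]), body=update["body"])

print(render_update({
    "time": datetime(2012, 5, 20, 15, 7),
    "headline": "Protesters gather in Grant Park",
    "body": "<p>Crowds began forming around noon.</p>",
}))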
The live preview shows the post as it will appear on chicagotribune.com. Here’s a published example of a NATO Layercake post:
Frosting on the cake
For our front page, we wanted our readers to know that our coverage was being updated frequently. To show this, we reused the Layercake content in a carousel-style widget, cycling through the five latest updates automatically. We generated a simple HTML page each time the Layercake was updated, and included it on the front page using an iframe.
Using the tools
We’re open sourcing these tools, even though the code has little relevance outside of the Tribune company. Even for internal tools, open source gives us a great way to share software and a common language for collaboration. Enjoy!
We are looking for an experienced web developer who can help us build sites for the Chicago Tribune and Chicago Tribune Media Group. Somebody who has a passion for code and getting things done. Somebody who likes having a problem to solve.
You will work with a group of talented, passionate folks who enjoy making websites and software. We have short deadlines so we work iteratively and try to work closely with our users and stakeholders. It can be stressful at times, but it’s worth it. We build good stuff fast and you will become a better programmer. You will always be refining your tools and trying out bleeding edge web technologies. You will make things you will be proud to show mom.
Acronyms and buzzwords:
These are the tools we use. Apply if you can rock them.
- WordPress & PHP development
- HTML5, CSS3, SASS and Responsive Design
- Python and Django
- Amazon Web Services: EC2, S3, RDS
- Linux/Ubuntu server administration: Apache, Nginx, Varnish
P.S. You don’t have to know them all to apply.
Stuff we’ve done:
You will be working on these sites, and new ones like them.
And you’ll be contributing to our blog and our github.
Gear you’ll get:
- One shiny, new MacBook Pro (or an iMac, if you’d prefer)
- One CDM (Cheap Dell Monitor)
- One comfy Aeron chair
- …all at a desk somewhere in the Tribune newsroom, where you’ll be surrounded by reporters arguing with the cops, yelling about the ball game, telling crazy stories, and otherwise practicing their trade.
Tribune reporter Jason Grotto has been leading a team of Tribune and WGN-TV reporters on a series of investigations into the morass which is the Illinois pension system. For a recent report, I helped them analyze the laws passed by the Illinois General Assembly to tell the broad story of how we got into this situation.
To do the analysis, we built upon the Open States Project framework developed by our friends at Sunlight Labs. I’ll write some more about the technical details below, but first, there’s a good story. I met Brian Boyer, my editor, at a Sunlight Labs hackathon during the PyCon 2009 code sprints. At the event, we were working on code for the precursor project to Open States, although neither of us worked on code for Illinois—someone else had claimed that work. This was before the TribApps team was officially formed, but thanks to Sunlight, I was in a good position to apply for a spot on the team when it was created.
Furthermore, in my first few weeks on the job, some of our reporters came to us interested in General Assembly research. By that point, it appeared that the Open States Illinois work was not going anywhere, so I worked out the basics of scraping the site and contributed my work to the project.
Sunlight has continued to invest in Open States, which means that the code I contributed then has gotten better without a bit of work from me—you can probably see why we get enthusiastic about open source development. As a foundation for Open States, they’ve made a very robust library called scrapelib, which can be used to manage caching and robots.txt compliance for any web scraper. It even has an interesting tool called scrapeshell, which is appealing to me as someone who likes interactive code development. If I find myself scraping any other web sites, I’ll definitely use Sunlight’s library rather than build my own.
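Typical usage looks something like the sketch below; the exact attribute names may differ depending on your scrapelib version, so check its documentation.

# A sketch of typical scrapelib usage as I understand its API (verify
# against the docs for your version): rate limiting, retries, and an
# on-disk cache of fetched pages.
import scrapelib

scraper = scrapelib.Scraper(requests_per_minute=60, retry_attempts=3)
scraper.cache_storage = scrapelib.FileCache("ilga-cache")  # cache responses on disk
scraper.cache_write_only = False                           # reuse cached responses on later runs

response = scraper.get("http://www.ilga.gov/legislation/")
print(response.status_code, len(response.text))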
Some minor changes that were made to the Illinois General Assembly website had broken the scraper for past legislative sessions, but that turned out to be pretty easy to fix, so I was quickly up and running with the 2011 version of the code, and I was able to turn to gathering the data Jason needed. (And of course I offered my fixes back to the core project.)
How we used Open States for this story
Here’s how Jason and Tribune reporter Ray Long described Illinois’ system in the recent story:
Pension experts say healthy pension plans require stable, well-structured rules that are rarely altered. But Illinois’ pension code is a tangled, convoluted mess of provisions riddled with giveaways, funding flaws, excessive borrowing and pension holidays. Since 2003 alone, the pension laws have been revised nearly 700 times, according to a Tribune/WGN-TV analysis.
The first step in establishing figures like this for the story was to run a complete scrape of the General Assembly’s legislative activity back to 2003. Running on my MacBook, this took about four days and produced about one gigabyte’s worth of JSON data files (not to mention over 3.7 GB of cached files from the ILGA website).
Since 2003, the General Assembly has filed nearly 60,000 pieces of legislation, but only a few of those became laws, and of course, most of those laws don’t impact the pension codes. I wrote Python code to analyze the data produced by the scrape to identify only those bills which were actually passed and signed by the governor. I then wrote some custom code to analyze the full text of the public act (that is, the version of the bill which was signed into law) to find those which made changes to the Illinois Pension Code. (Specifically, I looked for the text 40 ILCS 5, which refers to that section of the Illinois Compiled Statutes.) This allowed me to produce an index to changes to the pension laws since the 93rd session of the Illinois General Assembly, in spreadsheet form. On a technical note, one of the nice features of the scrapelib library mentioned above was that, thanks to the 3.7+ GB of cached web pages, I could run my code against the cached documents as I refined the index, which made things much faster.
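The filtering step itself was straightforward. Here's a simplified sketch; the JSON field names are approximations of the Open States bill format, and the "became law" test is boiled down to a single action check, which the real code handled more carefully.

# A simplified sketch of the index-building step. The JSON field names
# approximate the Open States bill format, and "became law" is reduced
# to one action check; the real logic was more involved.
import csv
import glob
import json

PENSION_CITATION = "40 ILCS 5"

def became_law(bill):
    return any("signed" in action["action"].lower() and "governor" in action["action"].lower()
               for action in bill.get("actions", []))

def build_index(json_dir, out_csv):
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["session", "bill_id", "title"])
        for path in glob.glob(json_dir + "/*.json"):
            with open(path) as f:
                bill = json.load(f)
            if not became_law(bill):
                continue
            # full_text stands in for the public act text, fetched separately
            if PENSION_CITATION in bill.get("full_text", ""):
                writer.writerow([bill.get("session"), bill.get("bill_id"), bill.get("title")])

build_index("data/il/bills", "pension_changes.csv")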
Some of the code I wrote, like that which extracts the list of statutes impacted by a bill, is probably of general use. I haven’t shared that back with the Open States project yet because I want to do it in a way which they believe is extensible to other states. I started a conversation about that on the Open States developer list, but we haven’t settled the question yet. Some other parts are more unique to how the Illinois legislature works, or just to how our team wanted to look at this specific issue. Open States has some provisions for “custom” data properties, so perhaps in the future I’ll find time to determine a good way to include that as well.