This is an introductory guide on how to produce the beginnings of a piece of data journalism. We’re going to walk through it together, as I outline the key things to consider before starting, how to structure your work, a basic process to follow, and then use a real case study to show how the process works with a real story.
Be at ease, there is hope
The glitz and glamour of data journalism (the animations, the striking maps, those great infographics) are all over the Internet. It’s easy to think, then, that it’s all about the data and how cool you can make it look, sing, or dance. Our wise friends at Code4SA, Raymond and Adi, keep reminding us (and the salivating Internet-at-large) that the focus should be on the journalism in data journalism, and not on the data.
Data journalism is no different than the journalism we all know and consume every day. Where traditional journalism relies on human sources (insiders, experts, scholars, scientists), data journalism treats data sources (spreadsheets, websites, databases) with all the rigor and scrutiny journalists treat human sources.
The animations and snazzy work are part of communicating the final product – the story – but they will never replace the actual story.
The grand start
A data journalism story can start from an important event or it can simply be a question. You could have seen a breaking headline and wondered, how much x did it take for y to happen? Or, you start thinking about, say, food in a supermarket and wonder, what percentage of dog food features on the average shopper’s bill? Both questions are equally valid and are great starts for considering a piece of data journalism.
What I’ve learned so far in my work is that there is little difference between doing the work of basic science and that of data journalism. You make an observation, you come up with a question (hypothesis for purists and fancy people), and then you go about doing some work to answer that question. Your work will show either that your initial hypothesis was incorrect or that yes, it was indeed correct.
So, as I mentioned earlier, it’s not about the fancy graphic or how much data you trawled through. It’s about what your question was and whether you answered it.
Don’t believe the hype.
I live and work in South Africa, so I’ll be basing this guide on data on the workforce from the country’s statistics agency. (The results of the quarterly survey were released just recently and the official unemployment rate is at a grim 25%.) The agency cares (in my head) about my feelings and thus has released the data in an Excel spreadsheet format. I will write other posts about how to deal with sources of data where the publishers don’t care as much about your feelings.
The dataset is here and there are enough sheets in there to warrant exploration. This exploration is important because an excited and hurried deep dive into the data, without knowing what it’s about, what it covers, and so on, may leave you looking at data that doesn’t answer your question, attempting to answer the wrong question or – the nightmare of every data journalist – wasting hours and achieving little.
So, before we talk about the process, let’s look at the data and see what it tells us. We don’t usually work with all the data (unless our initial idea or question requires this). It’s better to first spend some time looking at all the data and then focus on a particular section that catches your attention.
The spreadsheet from the stats agency covers different characteristics of the workforce (by province, age, gender, and demographic group). Even if this is your first time and you’re following along, throw a quick glance at each of the sheets. It’s part of developing that methodical work ethic that will become invaluable as you progress in this type of journalism.
As an important sidenote, you’ll need to have only a basic working knowledge of Excel. I won’t be wielding any sort of magic on the worksheets, so anyone can follow along. For the sake of brevity (and so you don’t drop into a catatonic stupor from me detailing every single step), I will leave you to figure out how to do the basic manipulations in Excel after I explain them.
Now the journey begins
We’ve talked about what it really means to produce a work of data journalism, how we start considering an idea that will lead towards a piece, and some introductory remarks about how to look at a dataset. Finally. The process, the good stuff. How does it work?
Step 1: Take a bite out of the data
For this guide, I want to see the size of the workforce in all the provinces in South Africa and how it has fared between 2013 and the second quarter of 2015. That data is in the very first worksheet. (You’re welcome to look through all the others and see what other interesting insights you can mine from them.)
So we went from an original spreadsheet of more than 20 worksheets:
… to just this one entitled Table 1: Population of working age (15-64 years):
Let’s copy ’n’ paste the bottom part of it into a new sheet, since that’s the view of the data we want to work with. To move towards a clean dataset, I took out the heading and “thousands” rows, and the cell labelled “South Africa”. I also took out the totals row, so it doesn’t come up later to confuse us. (I will adjust all the values, to reflect millions, in a minute.) It now should look like this:
Now, let’s change all the cells to show the full values, which run into the millions, rather than thousands. I created a column next to each original column and multiplied each value by 1,000. It now will look like this:
I also removed all borders and decimal places, and made the thousands separator a comma; this will help us make our charts readable and accessible later. At this point, you’d think (as I did, too, at some point) that you’re ready to take this table and analyse it. Not quite yet. Although it is indeed cleaner, the data structure we need is not there. Why does this matter? Because the data needs to be organized in a way that lets us aggregate or group it. The wise old sages of data journalism say, if your data is not summarized [or aggregated], it is not ready for analysis.
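If you’d rather see the conversion and formatting from this step as code, here is a minimal sketch in Python. It assumes the published figures are in thousands, as the “thousands” row suggests. The province labels and numbers are made-up placeholders, not the agency’s data, and `to_full_count` is a hypothetical helper name.

```python
# Placeholder figures in thousands (NOT the real survey numbers).
in_thousands = {"Province A": 1234.5, "Province B": 678.9}

def to_full_count(value_in_thousands):
    """Convert a figure reported in thousands to a whole-number count."""
    # round() also drops the decimal places, as we did in Excel.
    return round(value_in_thousands * 1000)

full_counts = {name: to_full_count(v) for name, v in in_thousands.items()}

for name, count in full_counts.items():
    # The ',' format spec inserts a comma as the thousands separator.
    print(f"{name}: {count:,}")
```

This mirrors the extra Excel columns: the original values stay untouched in `in_thousands`, and the converted ones live alongside them in `full_counts`.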
Step 2: Transform the data into an analysis/visualization-ready structure
What factors are we ultimately looking to expose from this data set? They are province, year and the total number of workers. But, before that, we’re going to create this new data structure with the following columns:
If you studied database design or are a working programmer, you would have failed your database design test or received the chiding of your life if you proposed this dataset design. And your lecturer (or boss) would have been right; it’s not a normalized (computer science speak for optimised) dataset. However, this is data analysis for a piece of data journalism, so you may scorn those rules! We need to have duplicate rows in order to aggregate the data later (remember?).
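To make the idea of an “analysis-ready” structure concrete, here is a minimal Python sketch of the reshaping, from one row per province with a column per quarter to one row per province per quarter. The column names (`province`, `year`, `quarter`, `total`) are my assumption for illustration, and the figures are made-up placeholders, not the agency’s data.

```python
# Wide layout: one row per province, one column per quarter.
# Figures are made-up placeholders, NOT the real survey numbers.
wide_header = ["Province", "2013 Q1", "2013 Q2", "2014 Q1"]
wide_rows = [
    ["Province A", 100, 110, 120],
    ["Province B", 200, 210, 220],
]

def wide_to_long(header, rows):
    """Turn each (province, quarter) cell into its own row."""
    long_rows = []
    for row in rows:
        province = row[0]
        # Pair every quarter label with the value in that column.
        for label, total in zip(header[1:], row[1:]):
            year, quarter = label.split()
            long_rows.append(
                {
                    "province": province,
                    "year": int(year),
                    "quarter": quarter,
                    "total": total,
                }
            )
    return long_rows

long_rows = wide_to_long(wide_header, wide_rows)
for r in long_rows:
    print(r)
```

Notice how the province name and the year repeat on many rows. That repetition is exactly the “duplication” a database designer would frown at, and exactly what makes grouping possible later.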
Step 3: Produce the final dataset
In the screenshot above, I put in the structure to be followed for all years. So, copy in the totals for 2013, 2014 and 2015. You will then have a dataset that should look like this. You should have 91 rows and only Q1, Q2 for 2015.
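A quick sanity check on that row count, as a small Python sketch. It assumes the 91 rows include one header row; South Africa’s nine provinces and the quarters covered come from the guide itself.

```python
PROVINCES = 9  # South Africa has nine provinces

# Quarters available per year: full years for 2013 and 2014, Q1 and Q2 for 2015.
quarters_per_year = {2013: 4, 2014: 4, 2015: 2}

data_rows = PROVINCES * sum(quarters_per_year.values())
total_rows = data_rows + 1  # assumption: one header row on top

print(data_rows, total_rows)  # prints: 90 91
```

If your sheet has a different count, a province or a quarter has probably been skipped or pasted twice.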
We’re almost there! The last step is actually aggregating the data. So, take a deep breath and create a PivotTable in a new sheet. Your summarized data should look like this:
Clean up the table: put in the thousands separator, remove decimal places, and take out that cell labelled “Row values”. It should now look like this:
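As a rough picture of what the PivotTable does behind the scenes, here is a minimal Python sketch that groups the rows by year and province and summarizes each group. I’m assuming an average per year here, which keeps 2015 (with only two quarters) comparable to the full years; your PivotTable may use a different summary function. The figures are made-up placeholders and `pivot_mean` is a hypothetical helper name.

```python
# (province, year, total) rows; figures are made-up placeholders.
long_rows = [
    ("Province A", 2013, 100),
    ("Province A", 2013, 110),
    ("Province A", 2014, 120),
    ("Province B", 2013, 200),
    ("Province B", 2014, 220),
    ("Province B", 2014, 240),
]

def pivot_mean(rows):
    """Group rows by (year, province) and average the totals in each group."""
    groups = {}
    for province, year, total in rows:
        groups.setdefault((year, province), []).append(total)

    # Build a year -> province -> average table (years as rows).
    table = {}
    for (year, province), totals in groups.items():
        table.setdefault(year, {})[province] = sum(totals) / len(totals)
    return table

summary = pivot_mean(long_rows)
for year in sorted(summary):
    print(year, summary[year])
```

Swapping `sum(totals) / len(totals)` for `sum(totals)` would give a sum instead, which is the other common PivotTable setting.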
Step 4: Produce the visualisation
Congratulations! You have a dataset that is ready to be visualised.
We’re going to use Infogr.am to produce an infographic. This guide won’t cover how to sign up and use Infogr.am, so (as with Excel) you’ll have to become acquainted with the tool. I do assure you that it’s straightforward and intuitive; you’ll use it like a professional in no time! You shall see.
Create a new infographic, choose any template you like, and look at the blank work area. It will look like this:
Give the infographic a title like “Total workforce in provinces, 2013 – 2015” or something similar, as you see fit. Then, add a grouped bar chart from the popup wizard. You’ll see the chart show up on the work area. (Delete the existing chart that comes with the template, which is now below the one you just created.)
Double-click on the chart and you’ll see an interface appear, not too different from Excel. Delete all the data you see, copy the data in your Excel worksheet from the last step we created (the PivotTable), and paste it into the Infogr.am spreadsheet interface. It should look like this:
When you pasted the data in, the graphic should have automatically updated itself. It’s starting to look great!
Have a look at the infographic. Everything is in there, but it may not be immediately understandable. You have to scroll down to the legend to see which colors denote which provinces. So, instead of re-formatting the data, click on the two-directional arrows icon in the top right-hand corner of the spreadsheet interface. This nifty feature will swap the rows and the columns, so that the provinces are now the rows and the years are now the columns.
Always aim to show the values on the chart (where appropriate, obvs), so click the “Show values” switch and the totals will reflect on the chart. Also, click on the Settings button and scroll down to add “total (in millions)” in the X-axis textbox. This will help the reader (and you) understand the chart better.
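What that icon does is a transpose. A minimal Python sketch of the same swap, with made-up placeholder figures rather than the real totals (`transpose` is a hypothetical helper name):

```python
# Years as rows, provinces as columns (placeholder figures).
table = [
    ["Year", "Province A", "Province B"],
    [2013, 105, 200],
    [2014, 120, 230],
]

def transpose(rows):
    """Swap rows and columns: the first column becomes the first row."""
    # zip(*rows) pairs up the n-th cell of every row into one new row.
    return [list(column) for column in zip(*rows)]

swapped = transpose(table)
for row in swapped:
    print(row)
```

After the swap each province gets its own row and each year its own column. The top-left cell still says “Year”, so relabel it if you keep the header.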
If you click the “Publish” button, you can give your graphic a title and then choose whether you want it to be an interactive graphic or an image. This is what the final image would look like:
And you have produced your first visualisation. Pat yourself on the back, have a coffee or beer, and get ready because you’ve just started the process. 🙂
Before we look at the rest of the work needed, let’s review what we’ve done:
- We looked at a data source and extracted a view of the data that we want to look at. In this case, we asked the question, what was the size of the workforce in all of South Africa’s provinces between 2013 and 2015?
- We followed a basic process of cleaning, formatting, transforming, and summarizing the data until we produced a table showing the data we need to answer our question.
- We then inserted the data into our visualisation tool and produced an infographic, shown above.
At this point, you’re so excited that you jump on Twitter or email, and send out your work to everyone you know. Hold on! Not yet.
What do your findings really mean?
Yes, you analysed the data and you answered your question. Gauteng province has had the largest workforce within the time period we chose, but it’s been decreasing in size since 2013. The Northern Cape has been consistently below 5 million since the same year. Why is this?
That’s why the second part of the title for this guide has the disclaimer: “start of a story”, because now starts the work of journalism that you know or were trained to do. At this point, you would:
- contact analysts, experts, academics to interpret and comment on the data
- depending on the scope of the story or your editor’s instructions, look at other data sets or speak to experts to explain the context behind the findings
- even analyse/visualise other datasets to test and refine your findings
- and, do anything else required to make sure the piece is balanced and fair.
Once you’ve done any or all of these steps, you write the final article, include the infographic we produced above, and submit it for publication. If you run your own blog or website, you would just publish it live.
There’s no place like the end!
And the end, it is. I hope that you’ve come this far and your appetite has been whetted to do further (and more sophisticated) work in data journalism.
If anything hasn’t worked for you or you’d like some help with a certain section, follow me on Twitter @minaddotcom and we can figure it out together. Please also check out the Johannesburg chapter of Hacks/Hackers @HacksHackersJHB for more information and resources on data journalism.
I’ve included below all spreadsheets, tools, and links, so you can pick up this guide any time and see how I arrived at the final infographic.