How to collect data better than others

A key issue for all media organizations around the world is the loss of exclusivity combined with a democratization of publishing technology. Can data help?

Journalism has always depended on technology. The invention of the printing press, and later the technologies to transmit radio and TV signals - these platforms provided a competitive advantage for journalism. A media outlet was usually the only organization able to reach an audience overnight, or directly.

No more.

Today everyone can rent a web server, install blogging software and start publishing. People with mobile phones can take pictures of any event and inform big audiences via social media in an instant. There are millions of reporters out there now.

This is why working with data provides a new opportunity: If journalists learn how to collect data over time better than others, they might still not win the race to report first. But they could shift to providing context, depth and background. Give better answers.

Here is a great example of how and why this approach can work: crime databases. The Los Angeles Times Homicide Report collects data on homicides, incident by incident. Unlike reporting in the past, where usually only the story would be published, the people from the data desk collect all the data over time: time, place, name of the victim and so on. Of course, they still write articles too.

But the database grows, enabling the newsroom to run an analysis at the push of a button, comparing one year with another. The geo data feeds a regularly updated map, providing an overview. And so on.
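To make this concrete, here is a minimal sketch of such an incident database, assuming a simple SQLite backend; the field names are illustrative, not the Times' actual schema:

    # Minimal sketch of an incident database (illustrative fields,
    # not the LA Times' actual schema).
    import sqlite3

    conn = sqlite3.connect("homicides.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS incidents (
            date   TEXT,   -- ISO date, e.g. '2012-08-01'
            place  TEXT,
            lat    REAL,   -- geo data feeds the map
            lon    REAL,
            victim TEXT
        )
    """)

    # Each new report adds one structured row to the archive.
    conn.execute(
        "INSERT INTO incidents VALUES (?, ?, ?, ?, ?)",
        ("2012-08-01", "Downtown", 34.05, -118.25, "John Doe"),
    )
    conn.commit()

    # 'Hit a button': compare one year against another.
    query = (
        "SELECT substr(date, 1, 4) AS year, COUNT(*) "
        "FROM incidents GROUP BY year ORDER BY year"
    )
    for year, count in conn.execute(query):
        print(year, count)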

Eventually this database can become the most complete and most accessible overview of violent crime in the region. Yes, there might be public records at the police office containing basically the same information. But often these are not openly accessible. They lack information about what happened later. They do not provide context on how a city reacted to a rise in crime. So the best resource for anyone worried about his or her own safety will be the newspaper's offering.

Used this way, data journalism is not only a new avenue; it helps to make reporting easier and faster, too. Paul Bradshaw, who teaches journalism in the UK, thinks that this aspect is often overlooked.

Crime is just one topic around which data can be collected. There are now numerous other projects like the Homicide Report, for other regions and other topics.

Oakland Crimespotting, for example, provides filters, so it is easy to see the patterns of all crimes committed and whether there are hotspots where it is really dangerous. The Washington DC Crime Spotting project is another example. In August 2012 it started a campaign on Kickstarter, a site where projects can be crowdfunded. And it seems people are willing to finance this form of journalism and structured information: in the first 15 days, around $18,000 of the needed budget of $40,000 had come in.
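A minimal sketch of the kind of filtering such sites offer, assuming the incidents are kept in a CSV file; the column names ('type', 'block') are assumptions:

    # Crime-map style filtering over a CSV of incidents.
    # Column names ('type', 'block') are assumptions.
    import csv
    from collections import Counter

    with open("incidents.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Filter: only one category of crime.
    burglaries = [r for r in rows if r["type"] == "burglary"]

    # Hotspots: which blocks see the most incidents?
    hotspots = Counter(r["block"] for r in burglaries)
    for block, count in hotspots.most_common(5):
        print(block, count)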

What other areas are there? What about traffic? Car crashes? House prices, then and now? Living costs? Once the key concept is understood, it is easy to see: the idea is simply to collect data better than others and make it understandable. The technology needed for such projects has become dramatically cheaper, and many components are open source, so all that is needed is a strategy and some know-how in development and coding. The rest is simply doing it, day after day.
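The "day after day" part can be as small as a script run once daily, for example via cron, that appends the latest figure to a growing file. A sketch, with a hypothetical source URL:

    # Daily collection sketch: fetch today's figure and append it.
    # The source URL is hypothetical.
    import csv
    import datetime
    import urllib.request

    URL = "http://example.gov/road-accidents/today"  # hypothetical

    today = datetime.date.today().isoformat()
    figure = urllib.request.urlopen(URL).read().decode("utf-8").strip()

    with open("accidents_over_time.csv", "a", newline="") as f:
        csv.writer(f).writerow([today, figure])

Run for a year or two, the file becomes a time series nobody else has.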

These projects - finally - fulfil a demand formulated in 2006 by Adrian Holovaty. In a blog post titled "A fundamental way newspaper sites need to change" he argued that by mangling data points into HTML pages we lose a lot of value over time. Simply recording events - when they happened, where they happened, who was involved - combined with as much additional data as possible, can become a pillar of local and regional reporting in many newsrooms, everywhere in the world.
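In the spirit of Holovaty's argument, compare the same event as prose and as a structured record (values invented for illustration):

    # The same event, once as prose and once as structured data.
    story = "A fire broke out on Main Street on Tuesday, injuring two."

    event = {
        "type": "fire",
        "date": "2012-08-07",
        "location": "Main Street",
        "injured": 2,
    }

    # Years of records like 'event' can be counted, mapped and
    # compared in seconds; years of sentences like 'story' cannot.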

This is just starting. Journalists are slowly coming to understand that having access to structured, checked data is an advantage.

As a result, a number of initiatives are under way to help newsrooms handle data:

FreeDive makes it easy to turn a spreadsheet into a searchable database (a rough sketch of the underlying idea appears below).

The PANDA Project, steered by a team at the Chicago Tribune, is a newsroom archive for datasets.
 
CKAN, a platform from the Open Knowledge Foundation, is a full-blown data CMS. The Open Data website of the UK government, for example, uses it.
 
rNews, by the New York Times, is a standard that helps to get the metadata of published articles under control, enabling easier exchange of linked data across platforms, topics and time.
 
And should you already have the data and just want to visualize what you found, Datawrapper* helps to chart it in four steps and embed the resulting charts in any website.
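To illustrate the first of these, here is a rough sketch of the idea behind FreeDive - loading a spreadsheet into a database so every column becomes searchable - under the assumption of a plain CSV export. It is not FreeDive's actual code:

    # FreeDive-like idea: turn a CSV spreadsheet into a searchable
    # SQLite database. Not FreeDive's actual code.
    import csv
    import sqlite3

    with open("spreadsheet.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    conn = sqlite3.connect("searchable.db")
    cols = ", ".join('"%s" TEXT' % c for c in header)
    conn.execute("CREATE TABLE data (%s)" % cols)
    marks = ", ".join("?" * len(header))
    conn.executemany("INSERT INTO data VALUES (%s)" % marks, rows)
    conn.commit()

    # Any column can now be queried; 'name' is an assumed column.
    for row in conn.execute('SELECT * FROM data WHERE "name" LIKE ?',
                            ("%smith%",)):
        print(row)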
 

Most of these tools are open source and free to use. Many have been developed by journalists who teamed up with coders or knew how to do it themselves. Getting them to work requires a bit of tinkering, but is usually not overly complicated.

"Policing the data“

All these projects - or rather, the thinking behind them - can over time create a solid base for reporting. It's about "policing the data", as the website Spreadsheetjournalism puts it.

In order to catch on in more newsrooms, a basic misunderstanding must be overcome: such projects do not add to the workload, they simplify a reporter's work. All that is needed is to install one of these tools and then patiently collect the available data over time. After one or two years, local statistics can be analyzed in minutes, and this ability is valuable. For reporters, it can cut time spent on investigative groundwork from hours to minutes.

Turn your offering into something people can trust

There is another perspective that is important: ask yourself, when searching for information on the web, how often can you really trust a source? There is so much content out there, published just to get a bit of attention, that trustworthy news sources can become a destination amid all that shallowness and incorrectness. Journalists and newsrooms have a strong incentive to report as accurately as possible. Others, including official institutions, might report correctly but could be tempted to keep a sharp rise in incidents under wraps.

Equipped with regularly updated, quality-checked data, journalists can regain a competitive edge. It's not so much about being the first to report any more; it's about being the one who can clearly show the context of something that happened: How does it affect the community? How does it affect every single person?

Over time, we might find that in a world where there is often too much information, many pressing questions from the public remain unanswered. By collecting relevant data, cleaning it, visualizing it and, step by step, finding the answers, such collections build lasting value for journalism.

*Disclaimer: Mirko Lorenz had the idea for Datawrapper in 2011. The project was funded by ABZV, a German training institution for newspaper journalists. The current beta has been live since February 2012. He is now working with Nicolas Kayser-Bril and Gregor Aisch (driven-by-data) on a second, more advanced version.