Last updated January 12, 2021

How Analytics Is Being Used In Data Journalism


Published on September 14, 2020

by Vishal Chawla

Journalism has been changing continuously over the past decade or so. Today it is shaped by big data and new computational tools. Data analysis and visualisation have become widely used techniques for telling stories in the media, thanks to the growing intersection of journalism and computation.

One of the many things AI is doing for journalism is making it easier and faster to analyse data and to synthesise that data into stories. Automatic story-writing tools, for instance, use natural language understanding and natural language processing to generate stories. AI is also being used to help generate imagery and video.

Major news publications are struggling to find the budget for a strong reporting staff. In response, media houses have been exploring data and related computational tools to keep public accountability journalism affordable while still delivering fact-based reporting.

Computational Journalism Leveraging Algos

Computational journalism involves processes that use an assortment of analytical tools for storytelling. Data and computation are changing how journalists discover, write and distribute stories in public affairs reporting. Media houses are showing a deep interest in datasets they can mine for information that has not yet been revealed, particularly at the local level.

Journalists often have to take unstructured data, turn it into structured information, and tell their readers about a pattern that caught their attention. Political reporting is one example, where reporters are trying to use big data to analyse political events. Another is the Paradise Papers, the journalistic investigation from the International Consortium of Investigative Journalists (ICIJ), which used software developed in digital humanities work at Stanford. There is also a website called OpenSecrets that publishes content based on campaign finance data. It offers a tool called Anomaly Tracker to track money in politics and its effect on elections and public policy.

Case Study

The International Consortium of Investigative Journalists (ICIJ), a global network of journalists that carries out cross-border investigations into issues of global concern, specialises in investigations built on large datasets. One of its projects, the Panama Papers, is among the most prominent data journalism stories to date. It had a significant global impact and led to resignations and legal trials of politicians on charges of corruption and tax evasion. Journalists from Süddeutsche Zeitung, a newspaper in Germany, first got access to a large dataset (2.6TB) containing over 11 million documents from an anonymous source.

ICIJ quickly assembled a team of more than 370 journalists from 76 countries, along with a group of data-savvy developers, to work in secret on a year-long project analysing the files. In the end, the team exposed powerful billionaires, celebrities and politicians who were involved in money laundering and tax evasion through offshore companies and a Panama-based law firm called Mossack Fonseca. Partner media organisations published hundreds of stories. Handling all 11.5 million files, mostly leaked emails, PDFs, scanned images and records of incorporation, took many processing steps. ICIJ turned to open-source tools that allowed it to work on the files.

According to ICIJ, it built a secure cloud network of between 30 and 40 G2 AWS instances at a time to process all these documents in parallel. The team used open-source technology such as Apache Tika, a Java-based toolkit that detects and extracts metadata and text from over a thousand different file types, and Tesseract, an OCR engine.
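The general shape of such a pipeline, where each file is routed either to a text extractor or to an OCR engine and many files are handled in parallel, can be sketched roughly as below. This is a minimal illustration and not ICIJ's code: `extract_text` and `run_ocr` are hypothetical stand-ins for calls to tools such as Apache Tika and Tesseract, the routing rule based on file suffix is an assumption, and the file names are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePath

# Suffixes treated as scanned images that need OCR. This routing rule is an
# assumption made for the sketch, not ICIJ's actual logic.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def extract_text(name: str) -> str:
    """Hypothetical stand-in for a text extractor such as Apache Tika."""
    return f"text extracted from {name}"


def run_ocr(name: str) -> str:
    """Hypothetical stand-in for an OCR engine such as Tesseract."""
    return f"text recognised in {name}"


def process(name: str) -> tuple[str, str, str]:
    """Route one file to OCR or plain extraction based on its suffix."""
    suffix = PurePath(name).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return name, "ocr", run_ocr(name)
    return name, "extract", extract_text(name)


def process_all(names: list[str], workers: int = 4) -> list[tuple[str, str, str]]:
    """Process files in parallel; pool.map() keeps results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, names))


if __name__ == "__main__":
    for name, method, text in process_all(["email.eml", "scan.TIF", "record.pdf"]):
        print(name, method, text)
```

In a real pipeline, a scanned PDF would also need OCR, so routing by suffix alone would not be enough. The sketch only shows the dispatch-and-parallelise pattern.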

The team also maintains a small internal project, Extract, hosted on its GitHub. Extract is a cross-platform command-line tool for parallelised, distributed content extraction built on top of Apache Tika. It supports Redis-backed queueing for distributed, parallel extraction and can write to Solr, plain text files or standard output. The team also created a search engine that let journalists look up particular people and find specific data. It used Neo4j to build a knowledge graph, which was then used for visualisations that revealed links between specific politicians and the offshore companies.
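To give a feel for what a knowledge graph of people and companies makes possible, here is a minimal sketch in plain Python. It does not use Neo4j or its query language. It simply stores invented relationships as an adjacency map and runs a breadth-first search to find the shortest chain connecting two entities. All names and relationships are hypothetical.

```python
from collections import defaultdict, deque

# Invented example relationships: (source, relationship, target).
EDGES = [
    ("Person A", "SHAREHOLDER_OF", "Company X"),
    ("Law Firm M", "REGISTERED", "Company X"),
    ("Law Firm M", "REGISTERED", "Company Y"),
    ("Person B", "DIRECTOR_OF", "Company Y"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (source, relation, target) triples."""
    graph = defaultdict(set)
    for source, _relation, target in edges:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of entities, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sort neighbours so the traversal order is deterministic.
        for neighbour in sorted(graph[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


if __name__ == "__main__":
    graph = build_graph(EDGES)
    print(" -> ".join(shortest_path(graph, "Person A", "Person B")))
```

A graph database answers the same kind of question at far larger scale and keeps the relationship types, which this sketch discards for brevity.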

Open Data Is Democratising Storytelling

We’ve seen many examples of computer-assisted reporting being used to gather information and analyse it to create stories. Open data is democratising storytelling by letting more people tell engaging stories.

We’re seeing a lot of effort today going into making data more publicly available. There are websites where you can obtain government data, census data and all types of demographic data. Some tools also make this data accessible to a broader spectrum of the public and to journalists who want to dig deep, study interesting patterns and create stories.

One example is the Stanford Open Policing project, where the journalism department had its students file Freedom of Information Act requests with all 50 states, asking for electronic versions of state police stop data. The effort yielded about 130 million records from 31 states in two years.

The data was then analysed (using certain algorithms) to infer what rule of thumb a police officer uses when someone is pulled over. Stanford opened up the data for media houses and local reporters to download. It helped them understand how their state police operate, leading to stories that highlighted how US police actions differ across racial demographics.
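One simple kind of analysis that stop records allow is comparing, for each demographic group, how often stops lead to a search and how often those searches find anything. The sketch below illustrates the idea only. The records are invented, the field names (`group`, `searched`, `contraband_found`) are hypothetical, and it is not a description of the algorithms the Stanford project used.

```python
from collections import defaultdict

# Invented records for illustration only; field names are hypothetical.
STOPS = [
    {"group": "A", "searched": True, "contraband_found": True},
    {"group": "A", "searched": False, "contraband_found": False},
    {"group": "A", "searched": False, "contraband_found": False},
    {"group": "A", "searched": False, "contraband_found": False},
    {"group": "B", "searched": True, "contraband_found": False},
    {"group": "B", "searched": True, "contraband_found": True},
    {"group": "B", "searched": False, "contraband_found": False},
    {"group": "B", "searched": False, "contraband_found": False},
]


def rates_by_group(stops):
    """Return the search rate and hit rate (searches that found something) per group."""
    counts = defaultdict(lambda: {"stops": 0, "searches": 0, "hits": 0})
    for stop in stops:
        c = counts[stop["group"]]
        c["stops"] += 1
        if stop["searched"]:
            c["searches"] += 1
            if stop["contraband_found"]:
                c["hits"] += 1
    result = {}
    for group, c in counts.items():
        result[group] = {
            "search_rate": c["searches"] / c["stops"],
            # The hit rate is undefined when a group has no searches.
            "hit_rate": c["hits"] / c["searches"] if c["searches"] else None,
        }
    return result


if __name__ == "__main__":
    for group, rates in sorted(rates_by_group(STOPS).items()):
        print(group, rates)
```

A gap between groups in such rates is a starting point for reporting and not proof of anything on its own. Real analyses control for many more factors.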

This push for transparency is reflected in media firms opening up the datasets and code behind their stories. The Economist's GitHub, for example, includes a repo containing code for a dynamic multilevel Bayesian model to predict US presidential elections. Written in R and Stan, the model is updated every day and combines state and national polls with economic indicators to predict a range of outcomes.
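The basic Bayesian idea of blending a prior expectation with poll results can be shown with a far simpler calculation than the Economist's model. The sketch below is an assumption-laden toy and not the Economist's code. It treats the prior and every poll as a normal distribution with a known standard deviation and combines them by precision weighting. The function name `combine` and all numbers are invented.

```python
import math


def combine(prior_mean, prior_sd, polls):
    """Combine a normal prior with polls given as (mean, sd) pairs.

    Each source is weighted by its precision (1 / sd^2), which is the
    standard result for normal distributions with known variances.
    Returns the posterior mean and standard deviation.
    """
    precision = 1 / prior_sd**2
    weighted_sum = prior_mean * precision
    for mean, sd in polls:
        poll_precision = 1 / sd**2
        precision += poll_precision
        weighted_sum += mean * poll_precision
    return weighted_sum / precision, math.sqrt(1 / precision)


if __name__ == "__main__":
    # Invented numbers: a prior vote share of 50 (sd 2) and two polls.
    mean, sd = combine(50.0, 2.0, [(54.0, 2.0), (52.0, 4.0)])
    print(f"posterior mean {mean:.2f}, sd {sd:.2f}")
```

More precise sources pull the estimate harder, and each added source narrows the uncertainty. A full model of the kind the article describes goes much further than this toy.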

Visualisation Is An Important Aspect Of Journalism

Data visualisation is one area many journalists are focusing on: taking complex datasets that would otherwise look mundane and turning them into compelling visuals. Visualisations add a lot of context to a story while engaging readers.

While creating compelling visuals for a given dataset is challenging for most people, many tools are now available that make it easier and faster for journalists to create useful visualisations.


Vishal Chawla

Vishal Chawla is a senior tech journalist at Analytics India Magazine and writes about AI, data analytics, cybersecurity, cloud computing, and blockchain. Vishal also hosts AIM's video podcast, Simulated Reality, featuring tech leaders, AI experts, and innovative startups of India.