Alternative Data Practices in China

A couple of years ago, I delivered a presentation introducing data journalism in China at the Google News Summit, organized by Google News Lab. It was a beautiful winter day in the heart of Silicon Valley, and the audience was a full room of a hundred or so senior media professionals, mostly from western countries. I started by asking them to raise their hands if they thought, firstly, that there is no good data in China and, secondly, that there is no real journalism in China. Both questions got quite a few hands up, along with some laughter.

These are two common comments, if not biases, that I often encounter when I attend or speak at international journalism conferences. From what I have observed over the past six years, far from there being no data, huge amounts of data exist and are being accumulated every day in China, and the quality is improving. And far from there being no real journalism, many journalists produce impressive stories every day, though not all of them ultimately get published.

Issue-driven Data Creation

Even before the term “data journalism” was introduced to China, data stories existed. While nowadays we normally use the term “data-driven stories” in China, there was a period when we saw the contrary: instead of data driving stories, we witnessed stories, or particular issues, driving the production of data. These were always issues that resonate with regular citizens, such as air pollution.

Since 2010, the Ministry of Environmental Protection has published a real-time air pollution index, but one important figure was missing.1 Data on PM2.5, the fine particles measuring less than 2.5 micrometers in diameter that can cause irreversible harm to the human body, was not published.

Given the severity of air pollution and the lack of official PM2.5 data, a nationwide campaign called “I test the air for the motherland” started in November 2011, urging every citizen to help monitor air quality and publish the results on social media platforms.2 The campaign was initiated by an environmental non-profit, which crowd-funded testing equipment from citizens and provided training to interested volunteers. The movement spread widely after a few online influencers joined forces, including Pan Shiyi, a well-known business leader, who then had more than 7 million followers on Sina Weibo, one of China’s most widely used social media platforms.3

After two years of public campaigning, PM2.5 data was finally included in the government data release, starting from January 2012. It was a good start, but challenges remained. Observers immediately noted discrepancies with the data released by the U.S. Embassy, which raised doubts about the accuracy and accountability of the data.4

In terms of functionality, the release was also not journalist-friendly. Despite hourly updates from more than 100 cities, the information was provided only on a rolling basis on the webpage and could not be downloaded as a dataset in any format. Although the data had been centralized, historical data was not accessible to the public. In other words, unless you could write a script to scrape the data every hour and save it locally, it was impossible to analyze trends over time or compare cities.
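To make the idea of an hourly scraper concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how any newsroom or the official site actually worked: the URL is a placeholder, the page is assumed to show the readings as an HTML table with the city in the first column and the PM2.5 value in the second, and all function and column names are hypothetical.

```python
import csv
import urllib.request
from datetime import datetime, timezone
from html.parser import HTMLParser
from pathlib import Path

# Placeholder address (an assumption); replace with the page showing the rolling readings.
SOURCE_URL = "https://example.org/air-quality/latest"


class ReadingTableParser(HTMLParser):
    """Collects every table row that contains <td> cells as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows made of <th> cells stay empty and are dropped
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def parse_readings(html):
    """Return (city, pm25) pairs, assuming city in column 1 and value in column 2."""
    parser = ReadingTableParser()
    parser.feed(html)
    readings = []
    for row in parser.rows:
        if len(row) < 2:
            continue
        try:
            readings.append((row[0], float(row[1])))
        except ValueError:
            continue  # skip rows whose value is missing or not numeric
    return readings


def append_snapshot(csv_path, fetched_at, readings):
    """Append one timestamped snapshot to a local CSV, writing a header if the file is new."""
    path = Path(csv_path)
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["fetched_at", "city", "pm25"])
        for city, value in readings:
            writer.writerow([fetched_at, city, value])


def fetch_and_store(csv_path, url=SOURCE_URL):
    """Download the current page and store its readings with a UTC timestamp."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    fetched_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
    append_snapshot(csv_path, fetched_at, parse_readings(html))
```

Calling `fetch_and_store("pm25.csv")` once an hour, for instance from a scheduler such as cron, would gradually build the historical dataset that the rolling webpage itself did not offer. Storing the fetch time with every row is the design choice that makes later trend analysis and city comparisons possible.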

That was not the end of the story. Issue-driven data generation continues. When the data is not well structured or in a user-friendly format, and when data journalists struggle with limited technical skills, civil society or tech geeks can come in to provide support.

One early example, dating back to 2011, was PM25.in, which scrapes air pollution data and releases it in a clean format. The site claims to have handled more than 1 billion search queries since it started operating.5 Another example is Qing Yue, a non-governmental organization which collects and cleans environmental data from government websites at all levels, and then releases it to the public in user-friendly formats. Their processed data has turned out to be widely used not only by data teams in established media outlets but also by government agencies themselves for better policy making.

The generation of data and the rising awareness around certain issues have gone hand in hand. In 2015, a documentary investigating the country's serious air pollution took the country by storm. The self-funded film, entitled Under the Dome, exposed the environmental crisis of noxious smog across the country, and then traced the roots of the problem and the various parties responsible.6 The film has been compared with Al Gore's An Inconvenient Truth in both style and impact. In its storytelling, it presented a great deal of scientific data, charts that analyze and explain the trends over the years, and social network visualizations of corruption within the environment and energy industries. As soon as it was released online, the film went viral and reached 200 million hits within three days, before it was censored and taken down within a week. But it had successfully raised public awareness and ignited a national debate on the issue, including around the accessibility and quality of air pollution data, and it made the country’s leadership aware of the significance of the issue.

Two weeks after the release of the documentary, at the press conference of the National People’s Congress, Premier Li Keqiang was asked a question about air pollution which referred to the film. He admitted that the government was failing to satisfy public demands to halt pollution, acknowledged some of the problems raised by the documentary, including lax enforcement of pollution restrictions, and emphasized that the government would impose heavier punishments to cut the toxic smog.7 At the end of August 2015, the new Air Pollution Prevention and Control Law was released, and it came into effect in January 2016.8

Air pollution is only one example illustrating that even when data availability or accessibility is challenging, public concern about an issue can lead to citizen contributions to data generation, as well as to changes in government attitudes and in the availability of public sector data. In more established ecosystems, data can be more readily available and easier to use, and the journalist’s job can be more straightforward: to take data and use it as the basis for stories. In China the process can be less linear, and the dynamics between citizens, government, civil society and media are much more interactive. Data, instead of just serving as the starting point for stories, can also be brought into the picture at a later stage and further enable new kinds of relations between journalists and the public.

Evolving Data Culture

The data environment in China has been changing rapidly in the past decade, partly driven by the dynamics described above, and partly due to other factors such as the global open data movement, rapidly growing internet companies and a surprisingly high mobile penetration rate. Data culture has been evolving around these trends as well.

Government legislation provides the policy backbone for data availability. To the surprise of many, China does have laws around freedom of information. The State Council Regulations on the Disclosure of Government Information were adopted in 2007 and came into force on May 1, 2008; they include a disclosure mandate and affirm a commitment to government transparency. Following the regulations, government agencies at all levels set up dedicated web pages to disclose the information they hold, including data. However, even though the regulations gave journalists the right to request certain data or information from the authorities, in the first three years after they came into force there were no publicly known cases of any media organization or journalist requesting data disclosure, according to a 2011 study published by Caixin media.9 The study revealed that in 2010 the Southern Weekly, a leading newspaper, sent test requests for information disclosure to 29 environmental bureaus and got a response rate of only 44%, and that media organizations normally have no supporting system, such as a legal team, that could help journalists push their demands further. One journalist who, in his personal capacity, actually took the government to court for not disclosing information ended up losing his job. The difficulties and risks that Chinese journalists encounter when leveraging legal tools can be much greater than those faced by their western peers.

In the wake of the global open data movement and increasing interest in big data, China also reacted to these trends. In 2012, both Shanghai and Beijing launched their own open data portals, each with hundreds of datasets covering areas such as land usage, transportation, education and pollution monitoring. In the following years, more than a dozen open data portals were set up, not only in the biggest cities, but also in local districts and less developed provinces. The development was rather bottom-up, without a standard template or structure for data release at the local level, which did not make data collection much easier for users. In 2015, the State Council released the Big Data Development Action Plan, in which open data was officially recognized as one of the ten key national projects and a concrete timeline for opening government data was presented.10 However, official data is not always where journalists start, and it is not always aligned with public interests and concerns.

On the other hand, the private sector, especially technology giants such as Alibaba or Tencent, has over the years accumulated huge amounts of data. According to its latest official results, Alibaba’s annual active consumers had reached 601 million by September 30, 2018.11 The e-commerce data from such a strong user base, equivalent to the entire Southeast Asian population, can reveal a great deal about trading trends, demographic shifts, urban migration directions and changes in consumer habits. There are also vertical review sites where more specific data is available, such as Dianping, the Chinese equivalent of Yelp. Despite concerns around privacy and security, these platforms, if used properly, provide rich resources for data journalists to mine.

One outstanding example of leveraging big data is the Rising Lab, a team under the Shanghai Media Group that specializes in data stories about urban life.12 The Lab was set up in response to the emerging trend of urbanization: China has more than 600 cities now, compared to 193 in 1978, with 56% of the population living in urban areas, according to a government report in 2016.13 Alongside rapid urbanization has come the rise of internet and mobile use, as well as lifestyle changes such as the rapid adoption of sharing economy models. These trends are having a big impact on data aggregation.

With partnership agreements and technical support from tech companies, the Lab collected data from websites and apps frequently used by city dwellers, such as property prices, the number of coffee shops and bars, the number of co-working spaces and the ease of public transportation, reflecting various aspects of urban life. Using its own original methodology, the Lab has produced a series of city rankings on different aspects, such as commercial attractiveness, level of innovation and diversity of life. The rankings and the stories are updated every year based on newly available data, following the same methodology to ensure consistency. The concept and stories have been well received and are even starting to influence urban planning policies and companies’ business decisions, according to SHEN Congle, Director of the Lab.

The Lab’s success illustrates the new dynamics emerging between data providers, journalists, and citizens. It shows how softer topics have also become a playground for data journalism, alongside pressing issues such as environmental crisis, corruption, judicial injustice, public health and money laundering. It also explores potential new business models for data journalism, as well as how data-based products can bring value to governments and businesses.

Readers’ news consumption has also had an impact on the development of data journalism, in two respects: it has become more visual and more mobile. Since 2011, infographics have become popular thanks to a few major news portals’ efforts to build dedicated verticals for infographic stories, mostly driven by data. In 2014, the downfall of the former security chief Zhou Yongkang, one of the nine most senior politicians in China, was the biggest news story of the year. Together with the news story, Caixin produced an interactive social network visualization to illustrate the complex network around Zhou, including 37 people and 105 companies or projects connected to him, and the relationships between these entities, all based on the 60,000-word investigative piece by its reporting team. The interactive received 4 million hits within one week, and another 20 million views on social media, according to Caixin.14 The wide reach of this project introduced new ways of data storytelling to the public, and created an appetite which didn’t exist before.

Almost at the same time, the media industry was welcoming the mobile era. As with the Rising Lab’s work, more and more data stories, like any other online content in China, are now disseminated mostly on mobile. According to the China Internet Network Information Center (CNNIC), more than 95% of internet users in the country used a mobile device to access the internet in 2016.15 WeChat, the popular domestic messaging app and social media platform, reached 1 billion users in March 2018.16 The dominance of mobile platforms means data stories in China are now not only mobile-first, but in many cases mobile-only. Such market demand has led to a lot of lean, simple and sometimes creative interactives that are mobile friendly.

In short, data culture in China has been evolving, driven by various factors: from global movements to government legislation, from public demand to media requests, and from new generations of data providers to new generations of news consumers. The interdependent relationships between players have created very complex dynamics, where constraints and opportunities coexist. Data journalism has bloomed and advanced along its own path in China.

Practical Tips

This section is specifically for those who are looking to work on China-related stories and wondering where to even get started. It won’t be easy. You will face language barriers first, as most data sources are only available in Chinese. You will then face all the common issues with data anywhere: accuracy, completeness, inconsistency, and so on. Let’s assume you have all the right skills to spot those issues and work on them.

First of all, who are the biggest players? Quite a number of the leading media outlets have established data teams, and it is good to follow their stories and talk to their reporters for tips. Here are a few you should know:

Secondly, where to find the data? A comprehensive list would be a separate handbook so here are just a few suggestions to get started:

Start with government websites, both central ministries and local agencies. You will need to know which departments are the right ones for the data you are looking for, and you should check both the thematic areas of ministries (for example the Ministry of Environmental Protection) and the dedicated data website at the local level if it exists.

There will be data that you don’t even expect. For example, would you expect the Chinese government to have published millions of court judgments in full text after 2014? Legal documents are relatively transparent in the U.S. but not in China. Yet the Supreme People’s Court (SPC) started a database called China Judgments Online that does just that.

Once you find some data that could be useful online, make sure you download a local copy.

Sometimes the data is not available online; this is still common. Sometimes it comes in the form of a published government annual report that you can order online, and sometimes it is only available in paper archives held by certain offices. For example, certain government agencies hold records of private companies, but not all of them are available online.

If the data is not released by the government at all, check whether any user-generated content is available. For example, public health data is very limited, but there are dedicated websites for hospital registration or elderly centres, among others. If you can scrape and clean up the data, you will have valuable material for a good overview of the topic.

Utilize databases in Hong Kong, from official ones like the Hong Kong Companies Registry to independent ones such as Webb-site Reports. As mainland China and Hong Kong become politically and financially closer, more information is available there thanks to Hong Kong’s transparent environment and legal enforcement, which may be valuable for tracing money.

Some data about China is not necessarily held in China. International organizations and academic institutions have rich China-related data. For example, The Paper used data from NASA and Harvard in one of its latest stories.

Last but not least, while some of the challenges and experiences are unique to China, many of them could provide useful lessons for other countries, where the social, cultural and political arrangements have a different shape but similar constraints.