Over the weekend of November 25-26, ONE had the unique privilege of partnering up with dozens of data scientists at DataKind UK's Autumn DataDive. After years of advocating with our partners for extractive companies to publish information about their payments to governments, we finally had a large set of data on these financial flows at a granular, project-by-project level. This data is important because it helps enable citizens to demand that their country's natural resource wealth goes towards things like education, health, infrastructure and poverty eradication.

Our challenge for the weekend was to dive into this new mandatory disclosure data, alongside voluntary payment data from the Extractive Industries Transparency Initiative (EITI), and answer a set of complex questions with the help of our volunteer, expert data wranglers. This wouldn't have been possible without the heavy lifting of the data team from the Natural Resource Governance Institute (NRGI), who have scraped thousands of pages of PDF documents to build a tidy data set of mandatory disclosure payment information. Part of our objective for the weekend was to put this groundbreaking data to use, and to explore ways of making it actionable for researchers and advocates.

The data set we analysed contained details on $292bn of payments made by 499 companies, related to projects located in 135 different countries. Most of the data is for payments made in 2015 and 2016, although a smaller number are for 2014 and 2017. $44bn (15%) of the payments reported were made to governments in Africa, with the majority to Angola ($17bn) and Nigeria ($15bn). The payments relate to approximately 3,400 different extractives projects around the world, although companies vary in how they define a "project". But this was just a snapshot: the data is regularly updated by NRGI as companies make more disclosures and as new PDFs are ingested.

DEBUNKING INDUSTRY CLAIMS WITH DATA

As we push for this data to be made available by all companies around the world, we often run into resistance and push-back from the oil, gas and mining industries. What we found in the data sheds new light on long-running debates.

One industry group, the American Petroleum Institute (API), has staunchly opposed efforts to require this payment information to be published in the US, in part because it claims that doing so would be too burdensome. But our volunteers found that several API member companies are already publishing this information in other jurisdictions. In fact, we found that API members (such as Shell, Chevron, BP, and others) have already disclosed at least $145 billion of payments to governments, many through subsidiaries. This represents nearly half of the total payments reported so far. That undermines the API's assertion that publishing these reports under US law would be burdensome, since many members are already doing so as required by EU and Canadian laws.

Our analysis of the data also fully debunks another evidence-free claim advanced by the API, namely, that four countries (Angola, Cameroon, China, and Qatar) would prohibit them from disclosing payments and punish them if they did. Guess what? Five of the largest API members we identified in the data have collectively reported payments of nearly $20 billion to those countries, without experiencing any negative effects. The data undermines their claims that publishing this information was prohibited or would cause them harm.

Some opponents of this data also claim that publishing the information would put them at a competitive disadvantage to state-owned competitors, on the assumption that state-owned companies would not need to report their payments to their own or other governments. However, the data shows that this is simply not the case: state-owned companies account for 2 in 5 of the total payments reported to date. These include several large state-owned companies from countries like Russia (e.g. Gazprom and Rosneft) and China (e.g. CNOOC) that are hardly models of transparency, as well as Norway (Statoil). (See Figure 1).

Figure 1: Many of the companies with the largest reported payments are state-owned.

VISUALISING PAYMENT FLOWS

A team of volunteers also set out to visualise the data in ways that would make it more actionable for activists and journalists. In doing this, we found that the mix of payment types varies widely across company and recipient government, and that there were differences in the mix of payments made to African governments vs. non-African governments.

One team focused on payments to Africa and visualised the payments from the top 20 companies in an interactive Sankey diagram (Figure 2). Doing so revealed that production entitlements were the largest payment type made by these companies to African governments, and that the majority of payments unsurprisingly went to Angola and Nigeria, the continent's two largest oil producing countries.

Figure 2: A Sankey diagram showing payments made to African governments from the 20 largest paying companies.

When we plotted the same companies' payments to governments outside of Africa, the picture looked different: taxes represent a larger share of the payment mix (see Figure 3). This reveals an interesting issue that merits further exploration. Production entitlements often flow to state-owned entities in the form of in-kind payments (e.g. barrels of oil). While this can be a legitimate arrangement, state-owned entities can be notoriously opaque, particularly in Africa, where several such companies have come under scrutiny in recent years for misplacing or mismanaging billions in revenues. The revelation that these types of payments are more extensively used in countries like Angola and Nigeria, where state-owned oil companies are particularly secretive and scandal-prone, highlights the importance of more closely examining these types of payments to ensure that they are handled appropriately.

Figure 3: A Sankey diagram that shows the same companies as the previous, but now reflecting the payments they made to governments outside of Africa.

Maps also featured at the Data Dive as the data experts attempted to link project-level data to individual concessions using OpenOil's Concession Map. In time, this could be a great way for activists to explore this new payment data. However, more work will be needed in cleaning the project names so that we can cleanly link them to individual concessions.

The volunteers also tried using machine learning techniques such as clustering to identify patterns in the types of payments that companies make to governments. Clear patterns emerged (see Figure 4), so we think that this approach could eventually become a tool to help researchers to spot "red flags" in the data.

Figure 4: An example of clustering analysis of the payments data.

IN-KIND PAYMENTS

Another team of volunteers worked with Alex Malden of NRGI to explore in-kind payments. This partially included the production entitlements described earlier, but also meant analysing the free-text notes and annotations that companies use to describe the payments data.

For example, ENI's 2016 Report on Payments to Governments shows taxes and royalties paid to Libya's National Oil Corporation (see Figure 5). Footnotes on these payments explain that at least part of these payments were made as direct transfers of oil instead of cash.

Over the weekend, volunteers developed a provisional methodology to flag line items in the data that refer to in-kind payments. Using this methodology, they estimated that roughly 20% of all payments reported in the mandatory disclosures are made in-kind. This equates to roughly $80 billion of value, a huge number that highlights the urgent need for more transparency on the volumes and transfer pricing of non-cash payments. We also found that the share of in-kind payments varied significantly from one receiving government to the next. Doing further work to perfect this methodology will allow investigators to target their investigations to the areas most susceptible to corruption.
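The volunteers' methodology is not spelled out here, so the following is only a minimal sketch of what keyword-based flagging of free-text notes could look like. The keyword list, the field names (`payment_type`, `notes`, `value_usd`) and the sample rows are all hypothetical and made up for illustration.

```python
import re

# Hypothetical keyword list; the terms the volunteers actually used are not given.
IN_KIND_PATTERN = re.compile(
    r"\b(in[- ]kind|barrels?|boe|bbl|paid in oil)\b",
    re.IGNORECASE,
)

def is_in_kind(payment_type: str, notes: str) -> bool:
    """Return True if the payment type or free-text note hints at a non-cash payment."""
    return bool(IN_KIND_PATTERN.search(f"{payment_type} {notes}"))

def in_kind_share(rows) -> float:
    """Share of total payment value flagged as in-kind (0.0 if there are no payments)."""
    total = sum(r["value_usd"] for r in rows)
    flagged = sum(
        r["value_usd"] for r in rows if is_in_kind(r["payment_type"], r["notes"])
    )
    return flagged / total if total else 0.0

# Made-up rows for illustration only.
sample = [
    {"payment_type": "Taxes", "notes": "Paid in cash", "value_usd": 600.0},
    {"payment_type": "Royalties", "notes": "Settled in kind: 12,000 barrels of oil", "value_usd": 300.0},
    {"payment_type": "Fees", "notes": "", "value_usd": 100.0},
]
print(in_kind_share(sample))
```

A keyword approach like this is easy to audit but will miss notes phrased in unexpected ways, so any flagged share should be treated as an estimate to be checked by hand.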

LINKING THE DATA

While this new data tells us a great deal, we think its real potential will be realised when it is combined with other available data, such as data from the Extractive Industries Transparency Initiative (EITI), commodity data, budget data, financial statements, corporate ownership data, contracts, and more. So part of our exploratory work at the Data Dive involved trying to build methodologies to link the information from the mandatory disclosures to these other data sources. This proved to be difficult, but in the process we learned a lot about the specific challenges we face and the next steps to overcome them as a community.

One team focused on the EITI data, with the aim of linking individual companies between it and the mandatory disclosures data. Why did we home in on this data? In short, we think that finding a way to combine them could result in a more comprehensive picture of extractives payments. EITI member countries submit annual reports that detail the payments their governments receive from extractive companies. In essence, the information provided through this process is similar to what companies are supposed to report in the mandatory disclosures, particularly going forward, since EITI countries will soon begin reporting project-level data. But since neither EITI nor the mandatory disclosures are yet implemented globally, the two data sets each reflect a different, overlapping patchwork of countries and companies. Linking them together would enable us to compare two different accounts of the same underlying system.

Using text matching tools on company names, the team was able to find 35 companies from the mandatory disclosures in the EITI data. While this number was small, the matched companies accounted for over 40% of the total financial flows in the mandatory disclosures. These overlaps can now be analysed in further detail to check for validity and consistency.

But we also saw very clearly what we were missing: a tidy dataset of company ownership information. The entities reporting the mandatory disclosures were predominantly large parent companies, while the entities reported in the EITI data were usually smaller, local operating subsidiaries. We used text analysis of the company names to link some of these together, but we know this method left a lot of connections undiscovered. The next step would be to locate information from the larger parent companies about their subsidiaries and build a comprehensive ownership dataset, which could then be used to decisively connect the two data sets. We look forward to continuing this work with OpenOwnership and the wider community of partners.

Linking companies solely through text matching proved to be a messy and time-consuming process. So a team also worked on connecting the EITI data to OpenCorporates, which maintains a vast data set of corporate entity information organised with unique corporate IDs. At first we were able to make exact matches on 230 names, which represented c.23% of the total flows reported in the EITI data. After a lot of cleaning and fuzzy matching the volunteers found matches for 600 names, which correspond to c.28% of the financial flows. We would love to share our code and learning from this work with others who are keen to help take it forward. This work also highlighted the value of Legal Entity Identifiers being incorporated in all company reporting, as our research would have been much easier if we could easily identify and link unique corporate entities.
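To give a feel for the exact-then-fuzzy matching described above, here is a minimal sketch using only Python's standard library. It is not the volunteers' code: the suffix list, the 0.85 cutoff and the company names are hypothetical, and a real pipeline would need far more cleaning rules.

```python
import difflib
import re

# Hypothetical list of legal-form suffixes to strip before comparing names.
SUFFIXES = {"ltd", "limited", "plc", "inc", "llc", "corp", "corporation", "company", "co"}

def normalise(name: str) -> str:
    """Lower-case the name, drop punctuation and remove common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

def match_names(source_names, target_names, cutoff=0.85):
    """Map each source name to its best target match, or None below the cutoff."""
    lookup = {normalise(t): t for t in target_names}
    matches = {}
    for name in source_names:
        key = normalise(name)
        if key in lookup:  # exact match after cleaning
            matches[name] = lookup[key]
            continue
        close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
        matches[name] = lookup[close[0]] if close else None
    return matches

# Fictional company names for illustration only.
source = ["Northfield Petroleum Co.", "Kestrel Minning Ltd", "Example Minerals Ltd"]
targets = ["NORTHFIELD PETROLEUM COMPANY LIMITED", "Kestrel Mining plc", "Unrelated Holdings Inc"]
result = match_names(source, targets)
for name, match in result.items():
    print(name, "->", match)
```

Raising the cutoff reduces false matches at the cost of missing more true ones, which is part of why unique identifiers are so much more reliable than name matching.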

WHAT'S NEXT

As a community, we are still at the beginning of a journey to maximise the potential of data about governments' natural resource revenues. But the DataDive was an energetic, whistle-stop tour of a groundbreaking data set: we left with a deeper understanding of what the data meant and feel inspired by the new questions and possibilities that volunteers unlocked. In coming weeks we will publish further detailed documentation of the work done at the dive, along with links to code. Please contact Kate Vang or Joseph Kraus with questions, contributions, or to discuss anything in more detail.

All of us at ONE give huge thanks to DataKind UK, to NRGI and to all of the DataDive volunteers. This project would not have been possible without the incredible volunteer Data Ambassadors: Victoria Bauer, Stephen Gaw and Nick Jewell. And a special thanks to the Elsevier Foundation and University College London for sponsoring the event.

Introduction

Extractives data comes from a wide array of sources. It’s the job of the person working with this data to extract, combine, analyse and communicate it. That’s where data visualisation comes in. If you have a basic grasp of working with data, you are probably familiar with basic chart types that are used in data visualisation.

There are many online resources which list taxonomies of data visualisation types and guide you in choosing the appropriate tool and type of data to be used with each type of visualisation. Datavizcatalogue.com and Datavizproject.com are great examples of taxonomy sites. Dataviz.tools is a very useful site that catalogs all the different tools available for data visualisation. The Financial Times’ Visual Vocabulary chart provides a handy guide to match data types to visualisation types.

When we are working with data from the extractives sector, we often find that there is a need to communicate “flows”. According to FT’s visual vocabulary chart, sometimes we need to “Show the reader volumes or intensity of movement between two or more states or conditions. These might be logical sequences or geographical locations”. Examples of flows in extractives data include:

Flows of financial revenues from companies to various government departments

Flows of imports and exports from one country to another

Connections between companies and their owners and subsidiaries

Flows of supply chains from raw materials to finished goods

Flows of people to and from mining sites as migrant labor or displaced populations

In this module, we will create visualisations of the first of those examples, and then we will use data about the global uranium industry to demonstrate how to prepare data and create these visualisations.

We will be using a free-to-use online tool called RAW. RAW is designed to create highly customizable, static (i.e. not interactive) visualisations, and it makes it easy to create interesting visualisation types that many off-the-shelf tools do not provide. RAW is an especially useful tool for designers because it lets you export visualisations as SVG files which can be further edited using vector graphics software such as Illustrator. You can check out some of the beautiful visualizations that are created using RAW in the gallery section of their website.

For developers, RAW offers a lot of customizability too. It is completely open source, and is built on top of D3.js, which is the most popular web data visualisation framework. If you’re ambitious and know a bit of D3, you can even add your own new chart types to RAW.

If you’re sold on how awesome RAW is, great! Let’s get started.

Sankey Diagram of Revenue Flows

The first kind of chart we will make is a sankey diagram that shows the revenue flows to government from the extractives sector as presented in an EITI report. Specifically, we will look at the revenue flows reported in Kazakhstan’s 2014 EITI report. “Wait, I thought we were going to focus on data from the uranium industry? What’s that got to do with Kazakhstan?” I hear you ask. Stay tuned, you will see why in the next section.

You can access the PDF file of the report from the global EITI website here, or from the Google Drive linked in the data section of the EITI website. We are specifically interested in this table on page 61 of the report:

We can use Tabula to extract the table from the report. After a bit of manual cleaning, we can get a table in Excel or Google Sheets that looks like this:

The first 3 rows of data are directly from the PDF table’s first five rows of data (excluding the share of total as % row). The last row, “Non-Extractive Receipts”, is calculated with a simple formula: the “Tax Receipts Total” row minus the sum of the “Oil and Gas Receipts” and “Mining Receipts” rows.
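As a quick illustration of that formula, here is a small Python sketch that applies it column by column. The figures are placeholders and are not taken from the EITI report; only the row names and the RB, LB and TR column abbreviations come from the table described here.

```python
def non_extractive_receipts(tax_total, oil_and_gas, mining):
    """Tax Receipts Total minus the two extractive rows, column by column."""
    return {col: tax_total[col] - (oil_and_gas[col] + mining[col]) for col in tax_total}

# Placeholder figures, NOT taken from the EITI report.
tax_total = {"RB": 1000, "LB": 400, "TR": 300}
oil_and_gas = {"RB": 450, "LB": 50, "TR": 300}
mining = {"RB": 150, "LB": 100, "TR": 0}

print(non_extractive_receipts(tax_total, oil_and_gas, mining))
# {'RB': 400, 'LB': 250, 'TR': 0}
```

In a spreadsheet the same calculation is a single subtraction formula copied across the row.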

Let’s look at what the column names mean:

SB + NF: State Budget + National Fund

SB: State Budget

RB: Republican Budget

LB: Local Budget

TR: Oil Sector Tax

We have to always remember the context behind the data we are trying to visualise. Especially in the extractives sector, the data is very complicated and tied up with the individual country and/or company’s policies. In Kazakhstan’s case, the total receipts are broken down into state budget and the national fund. The state budget is then broken into the republican budget and the local budget. In addition to that, there is also a special tax that companies in the oil sector have to pay that is not included in the state budget but included in the national fund. Writing it all out in text makes it sound quite confusing, and that’s why we’re visualising it in the first place.

RAW is incredibly easy to use, but the most difficult step is making sure the data is in the correct shape for the chart that you are trying to make. Notice the color coded cells in Table 2 above? Those are the figures that we want to chart using RAW. But why are we ignoring the first two columns and the first row? It’s because the figures in those columns and row are just sums of the other figures, and RAW will automatically sum up the figures for you. In general, you only want to give to RAW the most disaggregated data.

Now that we know which are the figures we want to use, we still have to reshape it into a format that works for RAW’s sankey diagrams. Sankey diagrams have a series of stages, with the flows diverging or converging at each stage. Hence, we have to reshape the data so that it will look like Table 3 below. The color codings on the cells with the numbers are the same as in Table 2, so you know where each of the numbers go.

As you can see, we have divided the categories into different steps to show how each item is broken down into subcategories. Once you have this table of data prepared, we can go over to the RAW web app (apps.rawgraphs.io) to start visualising.
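If you prefer to do this reshaping with a script rather than by hand, the sketch below shows one way it could be done in Python. It assumes a layout with one row per flow and one column per step plus a value column; the step column names and all figures are hypothetical placeholders, and the path for each column follows the hierarchy described earlier (state budget split into republican and local budgets, with the oil sector tax going to the national fund).

```python
import csv
import io

# How each disaggregated column fits into the hierarchy (an assumed layout).
COLUMN_PATH = {
    "RB": ("State Budget", "Republican Budget"),
    "LB": ("State Budget", "Local Budget"),
    "TR": ("National Fund", "Oil Sector Tax"),
}

def to_long_rows(wide):
    """Turn {sector: {column: value}} into one row per flow, skipping empty cells."""
    rows = []
    for sector, cells in wide.items():
        for column, value in cells.items():
            if value:
                fund, budget = COLUMN_PATH[column]
                rows.append([sector, fund, budget, value])
    return rows

def to_tsv(rows, header=("Step 1", "Step 2", "Step 3", "Value")):
    """Serialise the rows as tab-separated text that can be pasted into a web form."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Placeholder figures, NOT taken from the EITI report.
wide = {
    "Oil and Gas Receipts": {"RB": 450, "LB": 50, "TR": 300},
    "Mining Receipts": {"RB": 150, "LB": 100, "TR": 0},
    "Non-Extractive Receipts": {"RB": 400, "LB": 250, "TR": 0},
}
long_rows = to_long_rows(wide)
print(to_tsv(long_rows))
```

Only the most disaggregated cells are passed in, because totals would otherwise be counted twice when the flows are summed.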

In the first screen that you see, you just copy and paste the data directly from your spreadsheet. Make sure to change the format of the numbers in your table so that they don’t contain any thousand separator commas (i.e. we want it not like this: 1,000,000, but like this: 1000000).
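Changing the number format in the spreadsheet is usually enough, but if you are preparing the data with a script, the commas can be stripped programmatically. This is a minimal sketch that assumes tab-separated text (what a spreadsheet copy typically produces) and that commas are thousands separators rather than decimal marks; the sample rows are made up.

```python
def strip_thousands(cell: str) -> str:
    """Remove thousands-separator commas from a numeric cell; leave other text alone."""
    candidate = cell.replace(",", "").strip()
    try:
        float(candidate)
    except ValueError:
        return cell  # not a number, e.g. a category label
    return candidate

def clean_table(text: str) -> str:
    """Clean every cell of a tab-separated table copied from a spreadsheet."""
    lines = []
    for line in text.splitlines():
        lines.append("\t".join(strip_thousands(c) for c in line.split("\t")))
    return "\n".join(lines)

# Made-up rows for illustration only.
pasted = "Step 1\tValue\nOil and Gas Receipts\t1,000,000\nMining Receipts\t250,500.75"
print(clean_table(pasted))
```

Labels that happen to contain commas are left untouched, because only cells that parse as numbers once the commas are removed get rewritten.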

If the data is accepted by RAW, the bar below the text box will turn green with a little thumbs up icon, and it will tell you how many rows of data have been loaded. In the top right corner, you can change the view of the data to a table view to see the data more clearly.

Scroll down. Once the data is loaded, RAW will let you choose the type of chart you want. The sankey diagrams that we want are called Alluvial Diagrams in RAW (there’s a subtle difference between the two, but the terms are often used interchangeably; you can refer to the dataviz project’s pages on sankey and alluvial diagrams to see the difference). Click on Alluvial Diagram in the list of charts.

Next, we have to choose which columns from the data we want to visualise. Since we have pre-prepared the data to fit the sankey diagram format on RAW, this step is quite simple. Drag the column names into the boxes as shown below.

After that, you’re basically done! Scroll down further to see what the chart looks like.

The chart updates live depending on what columns you drag into the boxes in the “Map Your Dimensions” section, so you can play around to see what kind of changes your choices make to the chart. For example, if you don’t include anything in the “Size” box, RAW will just assume each of the flows are of the same size, as seen below:

There are some limited options for changing colors and dimensions on the left, but for real customization, RAW itself is not the best tool. It is best used in conjunction with a vector graphics editor like Illustrator to really polish up your charts. RAW is designed to accommodate this kind of export to graphics editing software. Scroll down to the Download section to see how.

If you are satisfied with your chart and want to use it as an image, choose “image (png)” from the dropdown, give your file a name, click download, and you’re done! However, there are two other formats that you can get the chart in. Select “vector graphics (svg)” to get it in a format which can be edited further in vector graphics software. If you want to embed the chart in a web page, you can copy and paste the code in the “Embed SVG Code” box into your HTML. There is an additional option to download the chart’s data model in JSON format, but that option is for more advanced users and we won’t cover it in this tutorial.

That’s it! Making charts in RAW is super quick and simple. No need to register for accounts, everything is completely free (not “freemium”), and it’s all done on a simple web app on a single page.

Bump Chart of Uranium Production by Country

Next, let’s try another useful chart type that takes data in a different shape from the sankey chart.

In this section, we will visualise how uranium production has changed over time by country. The dataset we will use is from the World Nuclear Association’s page on World Uranium Mining Production. We want this table: