Digging Deep into Oil, Gas, and Mining Data

In our first video training session, we presented a walkthrough of how to organize USEITI data for use in the open source mapping software QGIS. Fortunately, that dataset included geographic identifiers called Federal Information Processing Standard (FIPS) county codes: five-digit codes identifying counties and county equivalents throughout the United States. However, not every dataset will include a geographic identifier alongside data attributed to a location. Google Refine is a powerful and versatile tool that allows users to clean, manipulate, and transform their data. In this post, we will walk through the process of using Google Refine to add geographic coordinates to a dataset.

Google Refine has since been renamed OpenRefine. Instructions for downloading and installing OpenRefine are available here. This post uses the older Google Refine software, as it was the only stable version available at the time of writing. You may encounter problems accessing the most recent file, “OpenRefine v.2.6 release candidate 2,” and we recommend that you use Google Refine 2.5 instead. The instructions detailed in Step 1 will also apply to OpenRefine v.2.6 if it becomes accessible at a later date.

Step 1 - Download and Install Google Refine

Navigate to the OpenRefine download page, and download Google Refine 2.5 for your operating system. Google Refine operates as a hybrid desktop and web application: when you run it, a browser window should open automatically and present the Google Refine web interface. Despite operating within a web browser window, Google Refine does not require an active internet connection. As long as the application is running, you can navigate to http://127.0.0.1:3333/ to access the web interface.

Before we move to the next step, take a moment to download the following .csv file. This dataset was downloaded from ResourceProjects.org and was reduced to include only 2015 projects carried out by Tullow Oil. Google Refine is a powerful piece of software; however, it can quickly get bogged down with very large datasets, so this file was limited to one company for the purposes of this tutorial.

Step 2 - Upload your dataset to Google Refine

To get started, click ‘Create Project’. You will be presented with a number of options for data inputs. We will create a new project using data from ‘this computer.’ Select the file downloaded in the step above, and click ‘Next’ to begin uploading the dataset.

Step 3 - Add a new column to fetch location information

With the dataset uploaded, Google Refine will present a preview of the entries. Review the data and headers to make sure everything appears as it should. At the bottom of the window, check that the ‘Parse next 1 line(s) as column headers’ box is ticked so that the first row is parsed as column headers.

Click the ‘Create Project’ button in the upper right corner to proceed to the main workspace of Google Refine. As noted above, we will be adding geographic information to this dataset. To do so, click the triangle in the ‘Paid to’ column and navigate to ‘Edit column’ > ‘Add column by fetching URLs…’

A window will pop up as shown below. Name the new column and enter the following expression into the ‘Expression’ box. (Click here to learn more about the General Refine Expression Language.)
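As a sketch (the exact expression used in the training may have differed), a typical GREL expression for this step concatenates the Google Maps geocoding URL with the URL-escaped contents of each cell:

"http://maps.google.com/maps/api/geocode/json?sensor=false&address=" + escape(value, "url")

Here, value refers to the ‘Paid to’ entry in each row, and escape(value, "url") encodes it so it can be appended safely to the URL.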

Click ‘OK’ and the expression will produce a column containing what is essentially the output of a query to the Google Maps geocoding application programming interface (API) for each term in the ‘Paid to’ column. This operation can take several minutes to complete, depending on the size of the dataset. While you wait, you can experiment to get a better sense of how this function works. Enter the expression we just used, leaving off the last portion, into the address bar of another browser window:

http://maps.google.com/maps/api/geocode/json?sensor=false&address=

Fill in the name of any location around the world after the “=” and you will see a page with the relevant location information. This should give you a better sense of what is happening under the hood with the fetching URLs function in Google Refine.
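For example, querying the API for ‘Ghana’ returns a JSON response along these lines (abbreviated and illustrative; the live API returns additional fields):

{
  "results": [
    {
      "formatted_address": "Ghana",
      "geometry": {
        "location": { "lat": 7.9465, "lng": -1.0232 }
      }
    }
  ],
  "status": "OK"
}

The lat and lng values nested under geometry and location are what we will extract in the next step.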

Step 4 - Add another column to parse the information from the previous step

Once the process has completed, you will see a column filled with long strings of text and numbers.

To clean this up, we will add another column that parses that data. Click the triangle in the column you created in Step 3 containing the Google Maps information, and select ‘Edit column’ > ‘Add column based on this column…’ Enter a title for the new column and the following expression into the ‘Expression’ box:
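A common GREL recipe for this step is the following sketch, which assumes the JSON response structure shown in Step 3 and pulls the coordinates from the first geocoding result:

with(value.parseJson().results[0].geometry.location, pair, pair.lat + ", " + pair.lng)

Here, parseJson() turns the fetched text into a queryable object, and with() binds the location object to the variable pair so its latitude and longitude can be joined into a single comma-separated string.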

Click ‘OK’ and the new column will populate with a neat set of latitude and longitude coordinates, separated by a comma, derived from the data in the column we produced in Step 3.

Step 5 - Export your project

The final step is to click ‘Export’ in the upper right corner of the Google Refine window. Select ‘Comma-separated value’ or ‘Excel’ from the dropdown list of file types.

You can then open the exported file in a desktop application and delete the column containing the unparsed location information, while keeping the column we created that holds the latitude and longitude coordinates. Google Refine is ideal for refining, cleaning, and adding to a dataset, but operations like deleting rows and columns are best done in programs like Excel.
While this post demonstrates how latitude and longitude coordinates can be derived from a country name, the same process can be carried out for any other location. If the dataset contained the names of cities or provinces instead of country names, the same steps would yield their latitude and longitude coordinates. Location information can help you create persuasive maps and other visualizations of your data. To learn more about what can be done with extractives data and mapping, navigate to the training section of Extract-A-Fact.

Always on the lookout for interesting data, I was excited when I recently came across a comprehensive trove of data on offshore production in the Gulf of Mexico from the Bureau of Ocean Energy Management (BOEM), published at data.boem.gov.

However, because the data is spread across four separate datasets, it is downright complicated to find out which leases a company owns, what the lease attributes are, and how much oil and gas has been produced from each lease. With that in mind, I have created an interactive map application that combines the geographic data, the ownership data, and the production data into one easy-to-use tool. With the map, you can see:

the leases owned by some of the biggest leaseholders in the Gulf of Mexico

which leases produced oil and/or gas in 2014, 2015, and 2016, and how much

all the owners of a lease by percentage

the royalty rate on the lease

I created the map using Carto, an online platform for mapping and analysis, and used their JavaScript library, Carto.js, to add custom features. I used R to clean and merge the BOEM data into a usable format for the application. The data is hosted on the Carto website at pwypusa.carto.com, where it can be viewed and downloaded.

ResourceProjects.org is an open-source repository of data on oil, gas and mining projects across the world. It provides a platform to collect, display, download and search extractive project information using open data. It aims to harvest data on project-by-project payments to governments—based on recent mandatory disclosure legislation in the EU, U.S. and Canada as well as EITI reports—and link it to associated information about the project from a variety of sources. The platform will make it easier for journalists, CSOs, researchers and government officials to search, access and download relevant data.

As we continue to develop the platform and connect it to new data sources, we are inviting contributors and collaborators to get involved.

Why does project-level data matter?

Projects are the physical, tangible presence of extractive operations in a country. A project is the mine that people see out of their window or the oil field along their coastline. But a project also has a concession area where it is located, one or more participating companies, contract documents detailing their obligations and payment information giving an insight into their economic contribution.

Governments and citizens groups can also use project data to model revenues and forecast budgets, such as in Ghana, where all interested parties could see how different oil prices affected the money available for the budget. Others, such as CCSI, Global Witness and Open Oil, have modelled contracts to evaluate extractive deals, while IMF economists routinely use project-level information for fiscal design and technical assistance using their publicly available FARI model. Project information has a multitude of applications beyond fiscal modeling. It can be tied to spatial data to help better understand local impacts or environmental consequences, as highlighted by recent academic papers.

Why did NRGI build this tool?

Information on extractive projects is scattered across different company and government websites, in EITI reports, and in databases compiled by regulators, international organizations and civil society. It comes in multiple formats: PDFs, spreadsheets and computer-queryable databases. These sources are rarely linked to each other.

ResourceProjects.org brings this information into one place. We are also working on linking the data gathered to other repositories on related entities, such as OpenCorporates for associated companies; ResourceContracts for oil and mining contracts; and Open Oil's concession map. All information on the platform is stored with details on what source it came from and how it was retrieved. By bringing this information together in a standardized and accessible format, we are allowing users to explore extractive projects in greater depth.

How to get involved?

We are now looking for people who are interested in getting involved in the site. By the end of April, we will have added company disclosures from the U.K. that are starting to be released. Beyond the U.K., many companies are beginning to release project-by-project tax payment data. We would welcome any organisations or individuals who wish to take the lead on sourcing data for specific countries from upcoming mandatory disclosures.

Additionally, we are inviting feedback, as well as interested collaborators, to help develop the site and its content. Further features and enhancements will be rolled out in the coming weeks, and we are looking for partners who want to get more closely involved.

Finally, we are seeking to support the growing community of data users. Please sign up to the ResourceProjects mailing list if you want to keep up to date with what’s happening and how different organizations are using project-level information for improving resource governance.

On June 7, Publish What You Pay - United States held the second training workshop on using extractives data and QGIS, an open source GIS mapping application. This time around, we delved deeper into visualizing data, and explored how to calculate a new data point from imported datasets in QGIS.

The first training workshop, held on May 19, centered around cleaning and organizing data in preparation for use in data analysis and data visualization software. If you missed that training, we recommend that you watch the first workshop video recording and follow along in the training manual.

The training instructor, Diana Parker, walked participants through the steps of importing data and shapefiles, gave a refresher on the basic visualization options introduced in the first training session, and explained more advanced visualization techniques in QGIS.

Diana also demonstrated how QGIS can be used to compare datasets and calculate a new field from pre-existing data. For example, we determined the Federal revenue received per barrel of oil, by county, for a single year. We then visualized that data on a map of the US.

To determine the estimated revenue per barrel, participants were instructed to import the shapefile depicting US county boundaries as a vector layer in QGIS. Next, we imported the Federal oil and gas production by county and the Federal oil and gas revenue by county datasets into QGIS.

For the purposes of this training, these datasets were already sorted and organized from the original USEITI datasets for use in QGIS and made available to training participants here. To learn how to prepare the Federal production and revenue USEITI datasets, refer to pages 1-4 of the workshop 2 training manual or watch the full walkthrough covered in the first workshop.

Both of these datasets include FIPS codes, which are unique identifiers for geographic information throughout the US. For data to be placed on a map using software like QGIS, the data will need to have some form of geographical identifier. The FIPS codes in the datasets used in this training relate to state and county locations. To associate the data with the correct location in QGIS, the two datasets were joined to the US county boundaries shapefile and the FIPS field was matched to the GEOID target field in the US county shapefile.
With the production and revenue datasets joined, Diana then instructed participants to duplicate the shapefile layer and rename the new layer “USD per BBL (2013).” With this new layer selected, participants used the QGIS Field Calculator to create a new attribute, “USDBBL13”, using the expression "Oil_bbl13" / "ProdOil13". The expression instructs QGIS to divide the Federal oil revenue by the oil production on Federal land for each county, yielding an estimated revenue per barrel of oil for each county for 2013. The “USDBBL13” attribute can then be selected as the value to be visualized in the new layer, as seen below.
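As a sketch, the Field Calculator expression can also be written with a guard; the CASE wrapper below is our addition, not part of the original training, and returns NULL rather than a misleading value for counties with no recorded oil production:

CASE
  WHEN "ProdOil13" > 0 THEN "Oil_bbl13" / "ProdOil13"
  ELSE NULL
END

Without the guard, the expression is simply "Oil_bbl13" / "ProdOil13", as used in the training.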

The visualization raises some interesting questions. If you look at the legend in the map above, you will see that the counties highlighted in red have a negative revenue per barrel. On the other end of the spectrum, the data shows revenue per barrel of up to $362,022. The way the data is presented above is a good example of why you should check your data and visualizations as you go, and of the importance of using the right data classifications. By changing the data classifications for the legend, we reveal that the negative and extremely high revenue-per-barrel values account for only three counties in the dataset, as can be seen in the map below (look for the lowest values in yellow and the highest in light blue). There are still serious questions about these revenue-per-barrel values, but by adjusting the legend to better fit the underlying data we get a more useful visualization. For more on how to choose the right data classification, refer to p. 12 of the workshop 2 training manual.

When working with several datasets and numerous layers of shapefiles, things can get complicated and it is important to keep your work organized. As Diana guided participants through the training she cited best practices to follow while working with QGIS.

Before starting a project, set aside a folder to house all the relevant files, including the csv files, shapefiles, and the QGIS project file.

Save your work often.

When working with multiple datasets in a single map you will often duplicate layers, as Diana demonstrates in this training. It is important to rename duplicated layers to keep things organized.

After datasets are joined to shapefiles, the datasets themselves can be safely deleted from the layers panel window.

As additional layers with visualized data are added they can be dragged up and down in the layers panel to change the stacking order of the visualizations on the map.

The visualization of each layer can be toggled on and off, allowing you to check that the data you want depicted in your final map is not obscured behind other layers.

This training also covered some additional interesting ways to visualize data and produce exportable maps as shown in the screenshot above. The map depicts the revenue for oil & gas production on Federal lands for each county in Montana in one layer. In an additional layer, pie charts present the percentage of total revenue from different resource types or exploratory projects.
You can watch the full recording of the second workshop here to learn more about visualizing data in QGIS. Follow this link to download the training manual and datasets. After viewing the course, please share your feedback by filling out our post-course survey. We have additional data skills training workshops planned for the future, and your input can help determine the topics we cover.

------------

Waseem Mardini is the Policy Advisor at PWYP-US. Follow him on Twitter @pwypusa.