Combining tables using SQL

Enigma Public includes tens of thousands of datasets, each of which may contain useful information, but often you need to combine datasets to gain insights that aren’t otherwise apparent. For example, the United States Senate issues annual reports of lobbying activities, but these are compiled in a way that makes it hard to see which lobbyists and clients tried to influence specific issues, and which client spent the most money. In this article, we’ll show you how to combine two datasets to answer those questions.

About the datasets

All the Senate lobbying activity reports are available on Enigma Public in the U.S. Senate - Lobbying Reports collection:

The Primary Report lists registrant and client transactions, but it doesn’t provide information about the specific issues lobbyists worked on for their clients.

The Issues dataset provides information about the issues worked on by lobbyists, but doesn’t provide information about the lobbyists or their clients, or how much was spent.

Fortunately, the two tables use a common “Filing ID” (shown in the “ID” column). By downloading and joining the two tables using a SQL query, you can find out what issues the lobbyists and their clients worked on.

There are many ways you can do this kind of a join. In this article, we’ll show you how to do this using a free app called DB Browser for SQLite, which is available for PCs and Macs.

Importing the files into DB Browser

If you don’t already have DB Browser for SQLite, download it from http://sqlitebrowser.org/ and install it on your computer. Then follow the steps below to import the datasets.

In DB Browser, click File > New Database.

Enter a file name and click Save. Then click Cancel to close the “Edit table definition” window.

In DB Browser, click File > Import > Table from CSV file.

Select the first CSV file and click Open.

In the Table name field, change the table name to primary2017.

Make sure the Column names in first line option is selected.

Click OK.

Repeat for the second CSV file, setting the table name to issues2017.
After importing both files, you’ll see the tables listed in the Database Structure view and the DB Schema view.

Joining the two tables

The SQL command to combine rows from two tables is the JOIN command (specifically an “inner join,” which is the default join type). In this case, you want to combine rows with the same ID. The query to do this is:

This query says to include all columns from the issues2017 table, plus the registrant name, client name, and amount from the primary2017 table.

To perform the join:

In DB Browser for SQLite, click the Execute SQL tab.

Copy the SQL query above and paste it into the command query window.

Click the Execute
button. You’ll see the query results showing all of the columns you specified for each matching ID.

Locating specific information

In the screen above, there’s a note below the table indicating that the query returned 35,045 rows. This means there are 35,045 rows in the issues2017 table with an ID that matches a row in the primary2017 table. Next, we’ll narrow this down to answer the question, which lobbyists and clients tried to influence the Senate on health issues relating to prescription drugs?

In DB Browser, add an AND statement to the query as shown below. This narrows the results to include only rows where the code field is HEALTH ISSUES (this is one of the standard issue categories).

To sort the rows by client name, add an ORDER BY statement to the query.

SELECT issues2017.*, primary2017.registrant_name, primary2017.client_name, primary2017.amount
FROM issues2017 JOIN primary2017
ON issues2017.id = primary2017.id
AND issues2017.code="HEALTH ISSUES"
AND issues2017.specific_issue LIKE "%prescription%"
AND issues2017.specific_issue LIKE "%drugs%"
ORDER BY primary2017.client_name

Click the Execute
button. This time the query returns 61 matching rows showing which clients and lobbyists worked on issues relating to prescription drugs.

Aggregating amounts

The second question we wanted to answer was, of the companies lobbying on health issues relating to prescription drugs, which spent the most money? Since each “filing ID” in the primary report may map to multiple issues in the issues report, we can’t know how much money was targeted on a specific issue type, but we can sort the results by total amount spent. To do this, you’ll need to perform some calculations on the data – specifically to aggregate the amount spent by each client company.

When you import data from a CSV file, all columns are defined as text fields. Before you can perform calculations, you’ll need to define the amount column as as a numeric field. To do this:

In DB Browser, click the Database Structure tab.

Select the primary2017 table and click Modify Table.

Set the Type field for amount to NUMERIC, as shown above.

Click OK.

With the amount column specified as a numeric field, you can now perform the calculations:

Click the Execute SQL tab and paste in the query below.

SELECT registrant_name, client_name, SUM(amount) AS total_amount
FROM issues2017 JOIN primary2017 ON issues2017.id = primary2017.id
AND issues2017.code="HEALTH ISSUES"
AND issues2017.specific_issue LIKE "%prescription%"
AND issues2017.specific_issue LIKE "%drugs%"
GROUP BY client_name
ORDER BY total_amount DESC

Click the Execute
button.

In this modified version of the query, the SUM command works with the GROUP BY command to compute the total amount for each client. It then orders the results using the computed total amount and displays the results in descending (DESC) order.

When reporting lobbying activities, companies can upload amendments to originally reported amounts. Each upload is given a separate filing ID and appears as a separate row in the dataset. This means a simple analysis like the one here may count the same transaction multiple times. As with all data analysis, it is important to understand the dataset you are using and account for possible anomalies.

Summary

Sometimes you can’t find the information you need from a single dataset. In this example, you combined two datasets containing a common key using a SQL query. You were then able to quickly narrow down over 35,000 rows to just 61 rows with the information you were seeking.

Enigma, the Enigma logo, Assembly, Concourse and Enigma Public are trademarks of Enigma Technologies, Inc. and may not be used without the express prior written permission of Enigma Technologies, Inc. This documentation is provided for guidance and informational purposes only and does not modify, add to, or replace any provisions of any agreement you may have with Enigma Technologies, Inc.