Data We Wrangle

Catalyst Cooperative is organizing public data to help shape policies and tell stories about the ongoing transformation of the U.S. energy sector.

We combine information from the U.S. Energy Information Administration (EIA), the Federal Energy Regulatory Commission (FERC), and the Environmental Protection Agency (EPA) in a single database. Together, these data shed light on key questions about how our power sources are being transformed.

A few examples of the kinds of questions our database allows us to explore:

Which coal-fired power plants have a marginal cost of electricity (MCOE) that makes them unable to compete with the low and falling price of new wind power?

To what extent has electricity demand flattened over the last decade, rendering bulk generating capacity additions unnecessary?

Which existing power plants have operational characteristics that will allow them to play a useful role in integrating a high fraction of forecastable renewable generation?

Which coal mines and mining companies will be directly impacted by scheduled coal plant retirements and by the replacement of additional uneconomic coal plants with renewable generation or new natural gas capacity?

How do various future fuel price scenarios affect the MCOE for different utilities?

Has the natural gas build-out been warranted? Which utilities potentially have underutilized dispatchable gas capacity that could be used to support new renewable generation?

We believe that public data should be usable by the public, so we are developing our database under a liberal open source license. For more information, check out our repository on GitHub. Below is an overview of the data sources we have integrated, or are working on integrating, and the types of information they provide. While all of this data is already available to the public, it is often not well standardized or interconnected.

The data from EIA Form 860 is the backbone of much of our database, as it provides the most extensive and well-structured plant-, utility-, and generator-level information. This includes data on electric utilities and non-utility generation at the plant and generator level. It also contains plant in-service dates, information about prime movers, generating capacity, energy sources, proposed generators, county and state location, ownership, and FERC qualifying facility status. In recent years the EIA 860 has also begun providing detailed information about renewable energy facilities.

EIA Form 923 focuses primarily on fuel-based thermal plants. It contains data on electricity generation, fuel consumption, useful thermal output, fossil fuel stocks, fuel deliveries, quantity delivered, supplier, coal mine type, and fuel heat, sulfur, and ash content, as well as receipts at the power plant and prime mover level. This data allows us to see where a plant or utility’s coal is coming from, to track how costs have changed over time, and to identify which generating units are responsible for a utility’s power output.

How has the length of time remaining on coal (black) and natural gas (blue) contracts changed over the last eight years? EIA 923 data can help us quickly understand.

The EIA publishes Form 923 data as a Microsoft Excel spreadsheet, updated monthly. The organization of the reported data changed in 2009, and our work so far has primarily focused on current plant operations and expenses, so our database currently only integrates information going back to 2009. The plant and operator IDs reported in Form 923 allow its data to be easily linked back to the more extensive facility information reported in Form 860, as well as, in some cases, to data collected by the Mine Safety and Health Administration (MSHA) and the Environmental Protection Agency (EPA).
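As a rough illustration of this kind of linkage, the sketch below attaches plant attributes to fuel receipts using a shared plant ID. The field names (`plant_id`, `fuel_cost_usd`, and so on) and all of the records are hypothetical, invented for the example; they are not the actual EIA spreadsheet columns or our database schema.

```python
# Hypothetical records; field names and values are illustrative only.
plants_860 = [
    {"plant_id": 1, "plant_name": "Plant A", "state": "CO", "capacity_mw": 500.0},
    {"plant_id": 2, "plant_name": "Plant B", "state": "WY", "capacity_mw": 750.0},
]
receipts_923 = [
    {"plant_id": 1, "fuel": "coal", "mmbtu": 1000.0, "fuel_cost_usd": 2000.0},
    {"plant_id": 1, "fuel": "coal", "mmbtu": 500.0, "fuel_cost_usd": 1100.0},
    {"plant_id": 2, "fuel": "gas", "mmbtu": 800.0, "fuel_cost_usd": 2400.0},
    {"plant_id": 3, "fuel": "gas", "mmbtu": 100.0, "fuel_cost_usd": 300.0},
]


def link_receipts_to_plants(plants, receipts):
    """Attach plant attributes to each receipt that has a matching plant ID."""
    plants_by_id = {plant["plant_id"]: plant for plant in plants}
    linked, unmatched = [], []
    for receipt in receipts:
        plant = plants_by_id.get(receipt["plant_id"])
        if plant is None:
            # Keep receipts with no known plant so they can be reviewed later.
            unmatched.append(receipt)
        else:
            linked.append({**plant, **receipt})
    return linked, unmatched


linked, unmatched = link_receipts_to_plants(plants_860, receipts_923)
```

Keeping the unmatched receipts in a separate list, rather than silently dropping them, makes it easier to spot IDs that appear in one form but not the other.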

Environmental Protection Agency

Much of the electric utility data reported to the Environmental Protection Agency (EPA) is related to pollution, which has recently come to include annual greenhouse gas (GHG) emissions under the Greenhouse Gas Reporting Program. However, far more detailed information is available from the EPA’s Air Markets Program Data, which contains hourly CO2 emissions as well as traditional air pollutants, collected under the EPA’s Continuous Emissions Monitoring System (CEMS).

The CEMS dataset also includes hourly generation loads and fuel heat content consumed, which gives the most detailed publicly available view of power plant operations we are aware of. This data allows us to place quantitative constraints on power plant operational characteristics that are often considered proprietary, and thus not made directly available in public datasets or state-level regulatory proceedings.

The EPA publishes CEMS data as comma-separated values (CSV) files, separated by month and by state. The files include several other facility IDs (including the EIA plant and unit IDs), allowing for relatively straightforward integration into the database.
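To give a sense of what hourly data makes possible, here is a minimal sketch that reads a CEMS-like CSV and derives two operational characteristics per generating unit: an average heat rate and the largest hour-to-hour change in output. The column names and sample values are hypothetical and do not match the actual EPA file layout. The sketch also assumes each record covers one full operating hour (so MW can be summed as MWh) and that the hours for each unit are consecutive.

```python
import csv
import io
from collections import defaultdict

# Hypothetical extract; real CEMS files use different column names.
SAMPLE_CSV = """plant_id,unit_id,hour,gross_load_mw,heat_input_mmbtu
1,U1,0,100,1000
1,U1,1,150,1450
1,U1,2,250,2350
1,U2,0,50,600
1,U2,1,50,600
"""


def summarize_units(csv_text):
    """Return average heat rate and maximum hourly ramp for each unit."""
    rows_by_unit = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (int(row["plant_id"]), row["unit_id"])
        rows_by_unit[key].append(
            (int(row["hour"]), float(row["gross_load_mw"]), float(row["heat_input_mmbtu"]))
        )

    summary = {}
    for key, rows in rows_by_unit.items():
        rows.sort()  # order by hour before computing hour-to-hour changes
        loads = [load for _, load, _ in rows]
        total_mwh = sum(loads)  # assumes one full operating hour per record
        total_heat = sum(heat for _, _, heat in rows)
        ramps = [abs(later - earlier) for earlier, later in zip(loads, loads[1:])]
        summary[key] = {
            "heat_rate_mmbtu_per_mwh": total_heat / total_mwh if total_mwh else None,
            "max_ramp_mw_per_hour": max(ramps, default=0.0),
        }
    return summary


summary = summarize_units(SAMPLE_CSV)
```

With real data, partial operating hours and gaps in the record would need to be handled before quantities like these could be trusted.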

Federal Energy Regulatory Commission

Fewer electricity producers are required to report data to the Federal Energy Regulatory Commission (FERC) than to the EIA, but those that do report provide information via FERC Form 1 about their non-fuel production and non-production costs on a plant-by-plant basis. This data is key for estimating the marginal cost of electricity (MCOE) generated by a facility, allowing for economic comparisons to other generation and demand side management options. To our knowledge this is the only public reporting of non-fuel operating and maintenance expenses.
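A simplified way to see how the pieces fit together: combine fuel costs (of the kind reported in EIA Form 923) with non-fuel operating and maintenance costs (of the kind reported in FERC Form 1) and divide by net generation. The function below is only a sketch under that assumption. It treats all reported costs as variable, which a careful MCOE estimate would not, and the function name, parameter names, and figures are hypothetical.

```python
def estimate_mcoe(fuel_cost_usd, nonfuel_om_usd, net_generation_mwh):
    """Rough cost per MWh: (fuel + non-fuel O&M) / net generation.

    Simplifying assumption: every reported dollar is treated as a
    variable cost. Fixed and variable O&M are not separated here.
    """
    if net_generation_mwh <= 0:
        raise ValueError("net generation must be positive")
    return (fuel_cost_usd + nonfuel_om_usd) / net_generation_mwh


# Hypothetical annual figures for a single plant.
mcoe = estimate_mcoe(
    fuel_cost_usd=20_000_000,
    nonfuel_om_usd=5_000_000,
    net_generation_mwh=1_000_000,
)
# mcoe is 25.0 dollars per MWh for these made-up inputs
```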

FERC and EIA unfortunately use different IDs and different groupings of plant infrastructure, but we are able to link the datasets together by correlating several plant-level variables reported to both entities. These include net electricity generation, total cost by fuel, and total heat content by fuel. We also identify individual utilities and their infrastructure by matching similar collections of plants by name, type, plant capacity, and other reported characteristics.
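The correlation idea can be sketched with a single variable. The example below compares annual net generation reported under a FERC plant name against the series for each EIA plant ID, and accepts the best candidate only if its Pearson correlation clears a threshold. This is an illustration rather than our actual matching procedure: the plant names, IDs, values, and the 0.95 threshold are all hypothetical, and a real matcher would combine several variables, since correlation alone ignores differences in scale.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical annual net generation (GWh) over four years.
ferc_generation = {
    "Plant Alpha": [100, 120, 90, 80],
    "Plant Beta": [50, 55, 60, 65],
}
eia_generation = {
    101: [101, 119, 91, 79],
    202: [49, 56, 61, 64],
    303: [10, 10, 10, 10],
}


def best_matches(ferc, eia, min_r=0.95):
    """Map each FERC plant name to the EIA plant ID with the highest correlation."""
    matches = {}
    for name, series in ferc.items():
        scored = [(pearson(series, candidate), plant_id)
                  for plant_id, candidate in eia.items()]
        r, plant_id = max(scored)
        if r >= min_r:  # leave weak candidates unmatched for manual review
            matches[name] = plant_id
    return matches


matches = best_matches(ferc_generation, eia_generation)
```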

FERC also collects detailed annual accounts of utility plant in service. While this data is aggregated on a utility-wide basis, rather than at the plant level, it can still provide some insight into how a utility’s capital additions and retirements have affected its overall financial picture.

Data of Future Interest:

The above agencies collect far more information than we have enumerated or thus far attempted to integrate into our database. Several that may be of interest for future integration include:

Additional FERC Form 1

We’ve only scratched the surface of the FERC Form 1 data. Beyond operating and maintenance expenses, Form 1 offers a wealth of financial information ranging from the utility’s balance sheet down to the distribution of its salaries and wages. Operationally, Form 1 provides some data on power purchases and sales, sales of ancillary grid services, and detailed transmission statistics, including peak transmission system load and wheeling.

EIA Form 861 mainly deals with what happens to electricity after it’s generated, and with other aspects of the distribution-level electricity system. This includes electricity sales, revenues, customer counts, peak load, electricity purchases, energy efficiency and demand-side management programs, green pricing and net metering programs, and distributed generation capacity. We haven’t had occasion to integrate this data yet, but it would be useful for analyzing the impact of generation changes on consumer rates, and for exploring the role of distributed generation and demand-side management, especially in more competitive electricity markets.

FERC Form 714 contains balancing authority level generation, power purchase, transmission, and load statistics, as well as data on hourly incremental energy pricing in the balancing area. This data provides a broad overview of operations not only within balancing areas but also between them, including, for example, actual and scheduled inter-balancing authority area power transfers. Planning area data includes summer and winter demand forecasts and actual hourly demand values for each planning area.

While the EPA CEMS data has far more detail about the timing and sources of CO2 emissions, the EPA’s Greenhouse Gas Reporting Program (GHGRP) dataset is potentially interesting because it includes non-CO2 GHGs. However, in order to make the best use of it, we would need to find or construct an accurate mapping between EPA Facility IDs and EIA Plant IDs.