Keeping data sources confidential

OMB guide will help agencies that collect statistical information

When it comes to statistical data, the Energy Department is between a rock and a hard place. On one hand, the department must offer definitive data on energy use trends nationwide. How much oil is being extracted? How much is in reserve? How much is being processed in refineries? What price is it being sold for?

But at the same time, it must keep the data confidential. Large energy companies are wary of competitors getting hold of their information, either through legal challenges or by teasing apart aggregated data to find a company's figures. If Energy can safeguard the information, the Exxons and Shells of the world will be more forthcoming with details.

For agencies that gather statistical information, protecting participants' identities has always been a concern. Soon the Office of Management and Budget will issue a guide to help agencies make sure they are securing their statistical data against privacy invasions, as required by a provision of the E-Government Act of 2002.

On the technology side, agencies have worked to perfect techniques to obscure the origin of data. On the legal side, Title V of the E-Gov Act, the Confidential Information Protection and Statistical Efficiency Act, set the rules for protecting the sources of that data.

Now, the administration's CIPSEA guidance will make sure agencies gathering statistical data comply with the privacy regulations set forth by Congress.

OMB expects to issue the CIPSEA guide by year's end. It will standardize the language federal agencies should use when collecting data for statistical purposes.

'The law is pretty general, and the guidance has specific requirements for the agencies,' said Jay Casselberry, an Energy statistician and agency clearance officer in the Statistics and Methods Group of the Energy Information Administration.

Statistics vs. administration

The government collects data for two fundamentally different purposes, said Katherine Wallman, OMB's chief statistician and overseer of the CIPSEA guide. Most data the government collects is for administrative purposes. Think tax forms and passports.

Other agencies collect data strictly for statistical purposes. Agency workers or contractors survey people and businesses to get an idea of the trends and activities in the country. These surveys sometimes probe sensitive personal and corporate information.

More than 70 federal agencies conduct statistical surveys, including 11 agencies with statistical work as their primary function, Wallman said.

One thing these agencies all promise their survey participants is that the data will remain private and be used for statistical purposes only.

But ensuring this privacy can be trickier than it might seem. Alvan Zarate, confidentiality officer for the National Center for Health Statistics, said his center goes to unusual lengths to ensure that information it holds about anyone surveyed cannot be sorted from its data sets.

The agency provides public statistics on death rates and on such conditions as obesity among U.S. residents. It polls people, then aggregates the data and draws summaries.

To protect participants' privacy, the agency releases only the final summaries, never the raw data. It also generalizes the data sets so that they are not specific to any one geographic area.

'There are all kinds of other databases out there. They usually cover limited areas, so we make sure our data doesn't cover those same areas,' Zarate said.

For instance, NCHS never releases health data broken down by ZIP code, birth date or occupation. In a relatively sparsely populated region, it might be possible to take a voter registration list or other source of names and match someone to a specific ailment if outbreaks were broken down by ZIP code.
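The risk Zarate describes can be illustrated with a minimal Python sketch. It is not NCHS's actual procedure: the records, the `MIN_CELL_SIZE` threshold and the function names are all hypothetical, chosen only to show why a breakdown by ZIP code can leave a cell with so few respondents that an outside list of names could be matched against it.

```python
from collections import Counter

# Hypothetical survey records: (ZIP code, condition). Not real NCHS data.
records = [
    ("59001", "asthma"),
    ("59001", "asthma"),
    ("59001", "diabetes"),
    ("59002", "asthma"),
    ("10001", "asthma"),
    ("10001", "asthma"),
    ("10001", "asthma"),
    ("10001", "diabetes"),
    ("10001", "diabetes"),
    ("10001", "diabetes"),
]

MIN_CELL_SIZE = 3  # assumed threshold, chosen only for illustration


def risky_cells(rows, min_size=MIN_CELL_SIZE):
    """Return the (ZIP, condition) cells with fewer than min_size respondents."""
    counts = Counter(rows)
    return {cell: n for cell, n in counts.items() if n < min_size}


def totals_without_geography(rows):
    """Drop the ZIP code and count respondents per condition only."""
    return Counter(condition for _zip_code, condition in rows)


small = risky_cells(records)
totals = totals_without_geography(records)
print(small)
print(totals)
```

In this toy data the sparsely populated ZIP codes produce cells with one or two respondents, while the totals without geography do not single anyone out. That is the intuition behind releasing generalized summaries rather than fine-grained breakdowns.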

Casselberry's Energy office is also careful about how it releases its data.

'It is business information, and [participants] have an interest in making sure it is protected and is not available to their competitors,' Casselberry said.

Energy collects most of the information through surveys, as frequently as every week or as infrequently as every four years. Surveys are sent through the mail, posted electronically on the Internet and even completed in person.

'A lot of federal agencies are trying to balance the confidentiality of the data against disseminating information derived from the data,' said Alan Karr, a researcher at the National Institute of Statistical Sciences of Research Triangle Park, N.C.

To better help agencies like Energy, the National Science Foundation funded a team led by Karr's institute to develop new techniques for increasing data disclosure while maintaining user confidentiality. The team developed a data toolkit that swaps attributes of individual entries within data sets without affecting the integrity of the data as a whole, Karr said at NSF's recent National Conference on Digital Government Research in Seattle.

Shared-data concerns

Exchanging attributes between entries obscures the characteristics of individual participants.
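The idea of attribute swapping can be sketched in a few lines of Python. This is not the toolkit Karr's team built, whose internals the article does not describe. It is a simple random pairwise swap, and the function name, parameters and sample records are assumptions made for illustration.

```python
import random
from collections import Counter


def swap_attribute(rows, key, fraction=0.5, seed=0):
    """Return a copy of rows with the `key` attribute swapped between
    randomly chosen pairs of records. Other attributes are untouched."""
    rng = random.Random(seed)
    swapped = [dict(row) for row in rows]  # copy so the input is not modified
    n = int(len(swapped) * fraction)
    n -= n % 2  # an even number of records is needed to form pairs
    chosen = rng.sample(range(len(swapped)), n)
    for i, j in zip(chosen[0::2], chosen[1::2]):
        swapped[i][key], swapped[j][key] = swapped[j][key], swapped[i][key]
    return swapped


# Hypothetical company records, invented for this example.
original = [
    {"id": 1, "region": "West", "barrels": 120},
    {"id": 2, "region": "East", "barrels": 340},
    {"id": 3, "region": "South", "barrels": 95},
    {"id": 4, "region": "West", "barrels": 410},
    {"id": 5, "region": "East", "barrels": 60},
    {"id": 6, "region": "North", "barrels": 275},
]

masked = swap_attribute(original, "region", fraction=1.0, seed=42)

# Overall counts per region are identical before and after the swap,
# even though individual records may now carry another record's region.
before = Counter(row["region"] for row in original)
after = Counter(row["region"] for row in masked)
print(before == after)
```

Because values are only exchanged, never altered, totals for the swapped attribute are preserved. A purely random swap can still distort relationships between attributes, so a practical tool would presumably choose which records to swap more carefully.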

Because agencies hire statisticians and social scientists to conduct surveys, they have long been aware of potential breaches of privacy and, as a result, have taken pains to avoid even the possibility of someone pinpointing personal data's origin within a batch of records.

But one gray area agencies have worried about is the use of data if it's shared with other agencies. Before CIPSEA, there was a legal question of whether one agency could use data about someone that another agency collected. Agencies also feared that Freedom of Information Act requests might force the release of data they had promised to keep anonymous.

CIPSEA established a clear set of defenses agencies must use to protect sources and even spelled out punishments for disclosing collected information.

Agency workers who disclose confidential information face up to five years' imprisonment and fines of $250,000. Previously, infractions could only be punished through the Privacy Act of 1974, which stipulated a $5,000 fine and misdemeanor conviction.

CIPSEA also specifies that data collected under the law cannot be released through a FOIA request.

Plus, no information collected under CIPSEA can be shared with other agencies unless permission is obtained from survey participants. This condition provides statistical agencies with legal safeguards from law enforcement agencies and others that may be hungry for personal data. Previously, statistical agencies had been vulnerable to court challenges compelling them to divulge information.

'It is very reassuring to know we have the law on our side,' Zarate said.

Many see OMB's pending CIPSEA guidance as the last piece of the statistical-data protection puzzle. To create the guide, OMB consulted experts from the principal statistical agencies about how best to train staff and draft confidentiality agreements.

'The law is great in setting one bar for the entire government,' Casselberry said. 'I think the OMB guidance will do the job of trying to get the agencies to [apply CIPSEA] consistently.'

About the Author

Joab Jackson is the senior technology editor for Government Computer News.