Open government reboot focuses on APIs instead of data

The White House hopes for an explosion of commercial application development.

Issued by the Office of Management and Budget, the Digital Government Strategy is the basis of a new White House directive to expose "high-value" Federal data through Web APIs.

Have you ever wanted a mobile app that ties your location to crime statistics, government environmental and health data, and weather and solar flare data to calculate the hourly probability of a zombie apocalypse? While that may not be exactly what the White House has in mind, it’s the sort of mobile mash-up that a new Federal IT policy could make a lot less difficult to create. The Obama administration has added another twist on “open government”—open, as in open API.

On May 23, the White House issued a directive that requires all agencies to establish programming interfaces for internal and external developers to use, and make “applicable Government information open and machine-readable by default.” As part of an effort to push government toward a cloud-computing future, the White House is encouraging agencies to make their data more developer friendly, and to create a shared platform for providing mobile access to data for both citizens and government employees. And they have 12 months to start delivering.

The goal of the new policy, called the Digital Government Strategy, is to jump-start the government’s three-year-old open data initiative, draw more private developer interest, and encourage the development of mobile applications that connect citizens and government employees more effectively with data that has previously been public, but nearly inaccessible.

Federal CIO Steven VanRoekel hopes the move will spawn an explosion of commercial application development. “Treating the government as an open platform in this way encourages innovation,” he wrote in a White House blog post. “Just look at how the government’s release of GPS and weather data fueled billion dollar industries. It also makes government more efficient and able to adapt to inevitable changes in technology.”

This isn’t the first attempt by the Obama administration to create an app ecosystem around government data. In many ways, the new initiative is an attempt to correct the failings of the government’s first “open data” effort, Data.gov. Launched in 2009, Data.gov was conceived as a clearinghouse for government data sets published in open formats.

Then-Federal CIO Vivek Kundra hoped Data.gov would seed thousands of commercial and citizen apps, in addition to shining some much-needed light on the inner workings of government. In an interview with Fortune Magazine’s Geoff Colvin in July 2011, Kundra pointed to some of the early successes of Data.gov—such as a mobile app that tells parents whether the crib they’re about to buy has been recalled, Microsoft Bing’s use of Medicare/Medicaid data to rate hospitals, and a web app that combines FAA statistics with traveler “tweets” to help airline customers decide which airline and flight to pick and when to leave for the airport.

Flyontime.us, a web app that uses government data and Twitter feeds, is an example of what the Obama Administration would like to see more of.

But despite some modest successes, Data.gov has had some significant problems. Most of the first wave of data posted to Data.gov was in “open” formats, but ones that required data to be downloaded in order to be processed and used—such as comma-separated value (CSV) format. And because it was bulk-exported, there were problems with the data quality in many data sets—including the issue that much of it was stale before it was even posted.

The new presidential directive aims to change all this. First, it aims to decouple information from applications. “Rather than thinking primarily about the final presentation—publishing web pages, mobile applications, or brochures,” VanRoekel wrote in the Digital Government Strategy document, government agencies need to take an “information-centric” approach, “ensuring our data and content are accurate, available and secure. We need to treat all content as data, turning any unstructured content into structured data, then ensure all structured data are associated with valid metadata.” The data would then be accessed by all applications through a common set of Web APIs.
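
To make the "information-centric" idea concrete, here is a minimal sketch of what a record served through such a Web API might look like: content wrapped in an envelope of machine-readable metadata. The endpoint, field names, and schema URL are illustrative assumptions; the strategy document doesn't specify a payload format.

```python
import json

# Hypothetical sketch only: the field names and schema URL are assumptions,
# not anything the strategy document prescribes. The idea it illustrates:
# every record is structured data wrapped in machine-readable metadata.
record = {
    "data": {
        "station_id": "USW00093721",
        "observation": "temperature",
        "value_c": 22.4,
    },
    "metadata": {
        "agency": "NOAA",                      # publishing agency
        "collected": "2012-05-23T14:00:00Z",   # when the data was captured
        "published": "2012-05-23T14:05:00Z",   # freshness, not bulk-export lag
        "license": "public-domain",
        "schema": "https://example.gov/schemas/observation/v1",  # hypothetical
    },
}

print(json.dumps(record, indent=2))
```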

And at the center of the strategy is a transformation of Data.gov itself from a publishing site to a developer resource. In a White House blog post, VanRoekel wrote, “To make sure there’s no wrong door for accessing government data, we will transform Data.gov into a data and API catalog that in real time pulls directly from agency websites.”

Just what the nature of those APIs will be has yet to be determined, though the approach outlined in the Digital Government Strategy favors XML data formats. The Office of Management and Budget will issue a government-wide policy on web APIs, open data, and content formats within the next six months, after which agencies will have six months to “ensure all new IT systems follow the open data, content, and web API policy,” set up developer pages with API information, and make data from at least two existing “customer-facing” systems available through those APIs.

Many others have more thoughtfully described the problems with providing government data as APIs, and I'll provide links. Basically, the first step should always be providing data as downloads, in easy-to-use text formats like CSV. APIs are much harder to design and support, and because they're typically underfunded in the government space, they perform too poorly for third parties to rely on. Read more here:

Government agencies shouldn't be spinning their wheels debating XML vs. JSON or RESTful vs. SOA. They should be concerned with providing accurate, timely, and extensive information in simple plain-text formats. Once that is achieved, it _may_ make sense to start thinking about APIs for a subset of those datasets.

A lot of federal agencies already have developer sites and/or APIs; one even provides SDKs for developers to use if they wish. Basically, this new strategy will have the effect of getting the other agencies on board.

That's my big concern about the APIs: that there won't be enough horsepower behind them to serve up data reliably to apps, and that developers will end up using them just to grab snapshots of data and load them into their own infrastructure to get the required performance. Of course, if the APIs are hosted in a government cloud, there's at least a chance the back end could be scaled up to meet demand.
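
For illustration, a minimal sketch of that snapshot pattern, assuming a hypothetical API endpoint: pull from the government API occasionally, cache the result locally, and serve from the copy. Error handling and real storage are elided.

```python
import json
import time
import urllib.request

SNAPSHOT = "dataset.json"
API_URL = "https://api.example.gov/v1/dataset"  # hypothetical endpoint
MAX_AGE = 3600  # refresh at most once an hour

def load_dataset():
    # Serve the local snapshot if it is fresh enough.
    try:
        with open(SNAPSHOT) as f:
            cached = json.load(f)
        if time.time() - cached["fetched_at"] < MAX_AGE:
            return cached["data"]
    except (FileNotFoundError, KeyError, ValueError):
        pass  # no usable snapshot yet; fall through to fetch
    # Otherwise hit the API once and write a new snapshot.
    with urllib.request.urlopen(API_URL) as resp:
        data = json.load(resp)
    with open(SNAPSHOT, "w") as f:
        json.dump({"fetched_at": time.time(), "data": data}, f)
    return data
```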

This makes me worry a bit about security. Opening something up via API seems to leave more wiggle room for nefarious deeds.

The full strategy, which I didn't dive into here, includes separating security from the API by using metadata. That would allow security attributes to be attached to the data itself, with access controlled based on that metadata.

"APIs are much harder to design and support."

I disagree. Flat files might be slightly easier to support, but it's not difficult to support a proper API.

I think Data.gov was a great version 1, but to support a truly open data site, developers are going to need proper tools, and downloads of flat files aren't going to cut it going forward. Let's not half-ass version 2 to save a minimal amount of extra effort.

"Government agencies shouldn't be spinning their wheels debating XML vs. JSON or RESTful vs. SOA."

As amazing as I am, I didn't write this article. That quote? Not mine.

However, I disagree completely with your opinion on formats. The government should in fact be spinning its wheels trying to deliver that timely data in standardized, structured, machine-readable formats that are actually in use by today's technology. Maybe the bloated, anachronistic machine needs to get with the times and leave those flat-file export formats behind. I'm not a fan of the current state of stagnation.

As someone who has worked on implementing a web service and its design (IHE.net), I can tell you that committees (and government committees are the worst) love SOAP. But SOAP is horrible for publishing data.

They need a layered approach, as suggested by Bluewater. First get the data out there; then, once it's in a format everyone can read, maybe move to a REST interface for "current data." If anyone is interested in SOAP services (with defined business-logic transactions), let companies implement the SOAP on top of the REST, but don't put the government on the hook for it. When you get into SOAP, you are getting into the realm of business logic, and we don't want the government having to support every possible line of business the data is used in.

Step 1: Publish all data, including historical data.
Step 2: Use a REST interface for determining updates to entities within the data.
Step 3: Build your own web services over the REST interfaces, or your own database maintained from steps 1 and 2.
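
As a rough sketch of step 2, a client might poll a REST endpoint for entities changed since its last sync; the base URL and query parameter here are hypothetical, just to show the shape of the layered approach.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://data.example.gov/api/records"  # hypothetical endpoint

def fetch_updates(since_iso8601):
    # Ask only for entities changed since the last sync; the caller
    # upserts the results into its own database (step 3).
    query = urllib.parse.urlencode({"changed_since": since_iso8601})
    with urllib.request.urlopen(BASE + "?" + query) as resp:
        return json.load(resp)

# e.g., fetch_updates("2012-05-23T00:00:00Z") after an initial bulk load
```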

"Basically, the first step should always be providing data as downloads, in easy-to-use text formats like CSV. APIs are much harder to design and support."

Is a programmatic interface to data not an API by definition? Yes, designing good APIs is hard, but if the alternative is informally-specified, bug-ridden APIs, they're only easier to support when they're so bad that nobody uses them. As you clearly aren't saying that the government should use crappy API design to cut costs by discouraging use, I'm curious what you have in mind.

I'm not trying to pick nits here, but "CSV" is a great example of what not to do. Not because it lacks fancy semantic tagging, but because it lacks a widely-accepted specification. I've worked on lots of systems that have "standardized" data interchange via "CSV", and there are always problems, even when I've been able to get the various parties to sign off on detailed, unambiguous specifications of the format to be used.

I'll stop here, but suffice it to say that such a wide variety of CSV "flavors" that arbitrary files can't even be parsed reliably without special per-flavor configuration isn't even the worst part. Because CSV is ostensibly so "easy", it's frequently generated by half-assed software written by programmers who are surprised to learn that data contains significant whitespace and commas. I've honestly been told that "our code isn't buggy, it's just that most peoples' names don't contain commas or spaces."
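
A quick way to see the flavor problem is Python's csv module, which parses the same made-up sample lines differently depending on dialect settings:

```python
import csv

# Two lines that look like "the same" CSV but parse differently depending
# on dialect assumptions. The sample data is made up for illustration.
lines = ['"a 1","b2", 55.678',   # space after a comma
         'name, "Smith, Jane"']  # quoted field preceded by a space

for row in csv.reader(lines):
    print(row)
# ['a 1', 'b2', ' 55.678']      <- leading space kept in the last field
# ['name', ' "Smith', ' Jane"'] <- quote not honored mid-field, so the
#                                  comma inside the name splits it

# The "fix" is a per-flavor knob, which is exactly the problem:
for row in csv.reader(lines, skipinitialspace=True):
    print(row)
# ['a 1', 'b2', '55.678']
# ['name', 'Smith, Jane']
```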

"'CSV' is a great example of what not to do... because it lacks a widely-accepted specification."

Quoted CSV is both reliable and well-understood. I can't help it if a programmer doesn't know how to do it right.

"a 1","b2", 55.678 is just fine.

I do not want the government spending any more time than it has to on developing data formats for the hundreds of thousands of relations it must publish; column names are enough.

Admittedly, CSV does lose some utility. A generic XML specification would bring the data into the modern era, since it would allow querying with XSLT, XQuery, and XPath facilities. Nothing more than the bare minimum, though:

    <table id="artists">
      <tr id="header">
        <td name="internalid">id</td>
        <td name="artist name">name</td>
        <td name="company">record company</td>
      </tr>
      <tr id="0">
        <td name="internalid">4534</td>
        <td name="artist name">Nine Inch Nails</td>
        <td name="company">Nothing Studios Inc.</td>
      </tr>
    </table>

This way the government does not have to publish per-dataset XML specs, which would look like:

    <artist>
      <name>Nine Inch Nails</name>
      <company>Nothing Studios Inc</company>
    </artist>

One could write a single routine to export any table, and though the data may not be in the most efficient format, you are just one XSLT transform away from having it in the format most efficient for you.
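
Here is a minimal sketch of that single generic routine, emitting the table-style XML from the example above. For simplicity it reuses the column name as both the name attribute and the header cell text, which is an assumption on my part.

```python
from xml.sax.saxutils import escape, quoteattr

# Sketch of the "single routine" idea: one generic exporter for any table,
# using the tr/td layout from the example above.
def export_table(table_id, columns, rows):
    parts = ['<table id=%s>' % quoteattr(table_id), '  <tr id="header">']
    parts += ['    <td name=%s>%s</td>' % (quoteattr(c), escape(c))
              for c in columns]
    parts.append('  </tr>')
    for i, row in enumerate(rows):
        parts.append('  <tr id="%d">' % i)
        parts += ['    <td name=%s>%s</td>' % (quoteattr(c), escape(str(v)))
                  for c, v in zip(columns, row)]
        parts.append('  </tr>')
    parts.append('</table>')
    return '\n'.join(parts)

print(export_table("artists",
                   ["internalid", "artist name", "company"],
                   [[4534, "Nine Inch Nails", "Nothing Studios Inc."]]))
```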

"Quoted CSV is both reliable and well-understood."

Once the data storage schema is in place, such as a massive relational database, exporting the data upon request in multiple formats should be relatively trivial.

However, CSV is terrible. Any format that supports the direct translation of relational table data could be used, it's true, including CSV. But why adopt old and broken formats when we clearly have the opportunity to move forward? Vendors can't even agree on how to implement comments in CSV.

I'm also not a huge fan of XML when it requires transformation/translation, custom DTDs, or namespaces... ugh. Tables have fields; those fields have names; those fields hold data. Sending anything back heavier than JSON or very slender XML is overkill and will just further complicate the issue.
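
As a sketch of how trivial the export step can be once rows come out of the relational store: given column names and rows, emitting JSON or quoted CSV on demand takes a few lines. The data is the artists example from earlier in the thread.

```python
import csv
import io
import json

# Once rows come out of the store, emitting whichever representation the
# caller asked for is the easy part.
def export(columns, rows, fmt="json"):
    if fmt == "json":
        return json.dumps([dict(zip(columns, r)) for r in rows], indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC)
        writer.writerow(columns)
        writer.writerows(rows)
        return buf.getvalue()
    raise ValueError("unsupported format: " + fmt)

cols = ["internalid", "name", "company"]
rows = [[4534, "Nine Inch Nails", "Nothing Studios Inc."]]
print(export(cols, rows, "json"))
print(export(cols, rows, "csv"))
```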

"Sending anything back heavier than JSON or very slender XML is overkill and will just further complicate the issue."

Very much in agreement here. I'd rather get JSON back than CSV, and light XML would be better than CSV too... but XML seems to have some sort of siren song that attracts people who make entire careers of thinking about schemas, DTDs, ontologies, and other nonsense that rarely yields practical value.

I laugh every time I hear the words "government" and "cloud" mentioned together. The government is not moving toward cloud computing. They say they are, but there is a huge misinterpretation on the government side of what cloud computing actually is (speaking from experience as a government contractor here). They think they want cloud computing, but they really don't: they want to keep all their data in their own hands. In fact, they have very specific rules about storing data externally (hint: it's a big no-no).

Having said that, I think the move to create open APIs for accessing data is a great one. At the very least it will create a bunch of jobs as the various agencies contract out the work. But in reality, it has nothing to do with cloud computing.

As a veteran of enterprise integration and the Semantic Web, I will venture to define three laws of Open Data:

1. Data model and a shared data space

Data must be described with a model (schema) that keeps the data highly interconnected and richly decorated with metadata, and every piece of data and metadata must have a Web URL. It is hard to overstate how much automation can be built when data are richly modeled, including automatic user-interface generation, so that the same data can be browsed by humans (see law #3). A common data model and a Wiki-style API will allow data to be cleansed, updated, and enriched, and links between data to grow, with the participation of government, commercial entities, the community, and apps. A shared data space will let apps be created faster and let them collaborate to build something bigger, in essence becoming an accelerator for useful apps with lasting effects, which is what the government was hoping for with the Open Data initiative.

2. Universal simple bi-directional API for all types of data

This is based on the CRUD principle (create, read, update, delete) known from the database world, but on the Web it must also include two more operators, to download and upload multimedia data (all six operators can easily use regular HTTP 1.1 methods, as commonly done by REST-based APIs). With read-only access, the data will go stale no matter how good the government agencies' real-time data feeds are, and the new nodes and connections between data will never grow. Maintaining the quality of data that is open for writing will require full history and attribution of changes, the ability to roll changes back, and the ability to inform interested parties of changes (email, news feeds). That in turn requires data-ownership management, identity management, reputation management, and throttling and limits based on all of the above. There is no need to quarrel over data formats. As @longhairedboy put it above, "once the data storage schema is in place, such as a massive relational database, exporting the data upon request in multiple formats should be relatively trivial," be it JSON, XML, CSV, RDF, or anything else. This same API can also be used to export full datasets or subsets, so it will include and extend the former Data.gov features.
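
One plausible mapping of those six operators onto plain HTTP 1.1 methods, in the style of common REST APIs; the paths are hypothetical, not anything the strategy or the comment prescribes:

```python
# Hypothetical mapping of the six operators to HTTP 1.1 methods and paths.
OPERATORS = {
    "create":   ("POST",   "/records"),            # add a new data item
    "read":     ("GET",    "/records/{id}"),       # fetch item plus metadata
    "update":   ("PUT",    "/records/{id}"),       # attributed, reversible edit
    "delete":   ("DELETE", "/records/{id}"),       # subject to ownership rules
    "download": ("GET",    "/records/{id}/media"), # retrieve multimedia payload
    "upload":   ("PUT",    "/records/{id}/media"), # attach multimedia payload
}

for op, (method, path) in OPERATORS.items():
    print(op, "->", method, path)
```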

3. Equal opportunity for humans and machines

The early Web focused on humans, and machines were deprived. Now we see humans becoming the underprivileged ones, as tons of Web APIs appear that do not allow normal people to interact with data and require apps to be built for that. What needs to happen instead is for the data to become intuitively viewable and editable by the common citizen, to allow further exploration by geeks, and yet to be easily accessible by machines. In other words, the data themselves, without app toppings, will form a citizen dashboard, yet the same items of data will be accessible to apps via the API, which can add more flavors and interactivity and donate back into the shared data space according to laws #1 and #2.
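
A toy sketch of that "equal opportunity" idea: the same record served as HTML to a citizen's browser and as JSON to a program, selected by the Accept header. The record, paths, and port are placeholders.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder record; a real service would pull this from the data space.
RECORD = {"agency": "EPA", "site": "Site 42", "status": "monitored"}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if "application/json" in self.headers.get("Accept", ""):
            body = json.dumps(RECORD).encode()       # machine-friendly
            ctype = "application/json"
        else:
            rows = "".join("<li>%s: %s</li>" % kv for kv in RECORD.items())
            body = ("<html><body><ul>%s</ul></body></html>" % rows).encode()
            ctype = "text/html"                      # human-friendly
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```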

As a professional in the information/content management field, I applaud the Rensselaer Polytechnic Institute team for the work they've done on this initiative, and I fully support this line of thinking about content management. That being said, I completely disagree with the "why" of what we are doing, i.e. the "3 strategies" listed on the Digital Government webpage ( http://www.whitehouse.gov/sites/default ... nment.html ).

Those are not strategies, they're tactics. They tell me what we are doing, not why, or what problem(s) we are solving by providing the content this way.
* #1 contains two strategic objectives, cost savings and communication, but that's not easily understood. Bonus: not only are those true strategic objectives, but we can quantify the program pretty easily.
* #2 again has two objectives, privacy and cost savings.
* #3: I don't know what the hell they are trying to accomplish with this one. It's useless as written; I can't even extract a goal from it.

The point of my rant is that the government dropped the ball here and missed a very big opportunity with the "why" of going forward with this initiative. Beyond the objectives I extracted above, how about providing free, high-quality, unique data for people and companies to leverage for better decision-making? Just a thought.

I think a really major issue with these initiatives is the interaction of the standards organizations and government.

One of the problems with these initiatives is that data-standards organizations like HL7 in healthcare have trouble scaling up to manage the diversity and complexity of the data that exists. It's something I covered in a blog post I wrote recently, and the comments on that post are quite indicative of the mindset within the standards organizations that makes them part of the problem with opening up data:

Sean Gallagher is Ars Technica's IT Editor. A former Navy officer, systems administrator, and network systems integrator with 20 years of IT journalism experience, he lives and works in Baltimore, Maryland.