Data.gov survey: dataset modification dates compared

This demonstration shows when data.gov datasets were last updated, as reported in their HTTP responses. How up-to-date are the datasets listed in data.gov? We start with a timeline histogram taken at a single moment in time.

On the afternoons of __ and __, we crawled data.gov's links to datasets and recorded the modification dates reported by the source agency's HTTP servers. Of __ data.gov datasets, __ datasets returned modification dates on __ and __ returned modification dates on __. The list of data.gov datasets came from the 13 Sept 2010 version of their Dataset 92.

Number of Datasets Modified since __

Process used

The following diagram shows the process used to obtain the last-modified dates of the data.gov datasets.

3 - The data.gov details page is requested and scraped to obtain references to data sources. Redirects are followed and HTTP HEADs are recorded in RDF using Turtle syntax and terminology from the Proof Markup Language ontology.
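As a rough illustration of the header-recording part of this step, the following Python sketch issues an HTTP HEAD request and extracts the `Last-Modified` date. It is not the crawler used here; the function names `parse_last_modified` and `head_last_modified` are hypothetical, and the scraping of the details page and the RDF/Turtle output are omitted.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.request import Request, urlopen


def parse_last_modified(headers: dict) -> Optional[datetime]:
    """Return the Last-Modified header as an aware UTC datetime, or None."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    raw = normalized.get("last-modified")
    if raw is None:
        return None
    try:
        return parsedate_to_datetime(raw).astimezone(timezone.utc)
    except (TypeError, ValueError):
        # Some servers send malformed dates; treat those as missing.
        return None


def head_last_modified(url: str, timeout: float = 10.0):
    """Issue a HEAD request and return (final_url, last_modified).

    urlopen follows redirects by default. Depending on the Python
    version, the follow-up request after a redirect may be sent as GET
    rather than HEAD, so treat this as a sketch only.
    """
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        headers = dict(response.headers.items())
        return response.geturl(), parse_last_modified(headers)
```

Datasets whose servers omit the header or send an unparseable date come back as `None`, which corresponds to the datasets that returned no modification date in a crawl.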

6 - This JavaScript is included within this page to query the LOGD SPARQL endpoint for dataset modification dates (using this query and these nightly-cached results). It calculates the histogram and uses Google Visualization's Annotated Timeline to render the results. A few notes are hard-coded to provide a frame of reference.
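The histogram calculation in this step can be sketched in Python as follows. The page's own script is JavaScript and its exact binning is not described here, so the per-day bins, the cumulative "modified since" count, and the function names are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def daily_histogram(modified_dates):
    """Count how many datasets were last modified on each calendar day."""
    return dict(sorted(Counter(modified_dates).items()))


def modified_since(histogram):
    """For each day, count datasets last modified on or after that day."""
    running = 0
    result = {}
    # Walk from the most recent day backwards, accumulating counts.
    for day in sorted(histogram, reverse=True):
        running += histogram[day]
        result[day] = running
    return dict(sorted(result.items()))


# Made-up dates for illustration only, not crawl results.
sample = [date(2010, 8, 15), date(2010, 8, 15), date(2010, 9, 1)]
per_day = daily_histogram(sample)
cumulative = modified_since(per_day)
```

Either mapping could then be passed to a charting library as a series of (date, count) rows.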

Use of provenance

The following diagram illustrates how the Proof Markup Language was used to record the HTTP headers obtained from the government agency servers. A portion of this data graph is selected by the SPARQL query listed above (the headers are excluded). The red links show the HTTP redirects and HTML anchor hrefs from the data.gov details page to the actual dataset URL on the hosting agency's domain. The green boxes represent URIs for the information obtained (HTTP headers), while the purple boxes associate the information received with its source and when it was received. For more discussion on how provenance is used in LOGD and csv2rdf4lod, see A look at how csv2rdf4lod incorporates provenance into its tabular conversions.

The sharp peak at the ends shows that 200-350 datasets are updated every couple of days. The large jump in the middle of August 2010 is not a temporal lag, because it appears in both crawls, which were taken three weeks apart.

To the extent possible under law, Tetherless World Constellation has waived all copyright and related or neighboring rights to TWC LOGD. TWC LOGD is an educational project on open government data using Semantic Web technologies.
Datasets hosted on this site are converted from a number of data sources such as data.gov. All data created by us are open for reuse, and usage of data created and managed by other sources should follow their own licenses.

The data contained on this site is automatically repopulated from US government or other open data sites, and any personal data in our linked-data versions comes from those sources. If your information is removed from the government sources, it will be automatically removed from ours on the next update.