At the TRB annual meeting, Xentity Architect Jim Barrett presented a workshop session on one of Xentity’s concepts: shifting where data, information, and knowledge are invested in support of shifting needs. The session was titled “Knowledge, Information, and Data (KID) Architecture Implementations”. First, the Concept Presentation explains the general concept, then […]

Data veracity is still struggling to support integrated information products – we need to learn from these challenges to move to knowledge products. The common theme is that the various geospatial feature classes being managed are typically managed in silos, which causes permanent and major veracity issues in projection, granularity, level of detail, provenance, intended use vs. requested […]

The National Science Foundation (NSF) has tasked the EarthCube Science Support Office (ESSO) with creating a detailed architecture and implementation plan from the EarthCube architecture blueprint defined at the May 2016 Architectural Framework Workshop and in discussions at the 2016 EarthCube All Hands meeting. NSF has outlined an aggressive schedule, and ESSO is working with a […]

If the answer is “of course”, then why haven’t we done so already?

It is a simple concept, but one without an implementation strategy. Twenty years after the establishment of Circular A-16 and the FGDC metadata content standards, we are still looking at metadata from a dataset-centric point of view – that is, for “what has been” and not for “what will be”. Knowing what is coming, and when it is coming, enables one to plan.

The model can be shifted to the “what will be” perspective if we adopt a systems-driven data lifecycle view, which means looking at data predictability and crowdsourcing.

It may seem ironic, in the age of crowdsourcing, to argue for predictable data lifecycle releases of pedigreed information and seemingly deny the power of the crowd. But the fact remains that civilian government entities in the US systematically collect and produce untold volumes of geospatial information (raster, vector, geocodable attributes) through many systems, including earth observation systems, mission programs using human capital, business IT systems, regulatory mandates, funding processes, and cooperative agreements among multiple agencies and all levels of government. Governments in the US are enormous geospatial data aggregators, but much of this work is accomplished in systems that their owners and operators view as special but not “spatial”.

An artificial boundary, or perception, has been created that geospatial data is different from other types of data and, by extension, so are the supporting systems.

There remain challenges with data resolution, geometry types, attribution, etc., but more importantly there is a management challenge here. All of these data aggregation systems have, or could have, a predictable data lifecycle accompanied by publishing schedules and processing-authority metadata. The crowd and geospatial communities could then use their digital muscle to complement these systems’ resources if they so desire, and all government programs would be informed by having predictable data resources.
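
To make the idea concrete, here is a minimal sketch, assuming a simple record shape of our own invention (the class and field names are illustrative, not an FGDC or ISO construct), of metadata that carries the “what will be” alongside the “what has been”:

```python
# Hypothetical sketch of a lifecycle-aware metadata record. The field
# names and values are illustrative assumptions, not a standard schema.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DatasetLifecycleRecord:
    title: str
    processing_authority: str    # the system/program that produces the data
    published_releases: list = field(default_factory=list)  # "what has been"
    planned_releases: list = field(default_factory=list)    # "what will be"

record = DatasetLifecycleRecord(
    title="National Hydrography Snapshot",        # hypothetical dataset
    processing_authority="Example Mission Program System",
    published_releases=[date(2013, 4, 1)],
    planned_releases=[date(2013, 10, 1), date(2014, 4, 1)],
)

# A user (or the crowd) can now plan around the next scheduled release.
print("Next planned release:", min(record.planned_releases))
```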

What is required is communicating each system’s outputs, owner, and timetables.

Once a data baseline is established, geospatial users and the crowd could determine the most valuable content gaps and use their resources more effectively; in essence, creating an expanded and informed community. To date, looking for geospatial information is more akin to an archaeological discovery process than searching for a book at the library.

What to do?

Not to downplay the significance of geospatial and subject matter experts publishing value-added datasets and metadata into clearinghouses and catalogs, but we would stand to gain much more by determining which finite number of systems aggregate and produce the geospatial data, and by creating a predictable publishing calendar.

In the current environment of limited resources, Xentity seeks to support efforts such as the FGDC, data.gov, other National Geospatial Data Assets, and OMB to help shift the focus to these primary sources of information, which enable the community of use and organize the community of supply. This model would include publishing milestones, both past and future, that could be used to evaluate mission and geospatial end-user requirements, allow crowdsourcing to contribute, and simplify searching for quality data.

We just finished some work for a large national government data provider that measures its files in the millions, its records in the tens to hundreds of millions, and its storage in the sub-petabyte range. Below are the obfuscated general requirements to consider if you are looking to deliver your bulk data in the cloud: storage, access methods, discovery, communications, and applications.

These requirements have been generalized, completely redacted, or in some cases added to, so that anyone in government delivering open data with large public datasets can consider them. These are simply the business requirements; the technologies, vendors, cost models, capacity planning, etc. were addressed separately.

For faster and likely larger file requests, the user requests a directory, a set of directories, a set of files, or a mix to be put onto a storage device by the service provider, and the device is delivered back to the user. Bulk media minimum specifications apply for external hard drives.

Users who have existing cloud storage accounts, or who have virtual machine processing points in the cloud, will request or pull a directory, a set of directories, a set of files, or a mix, with the data pushed to the user’s cloud point.
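
As a rough sketch of what that could look like, assuming the provider exposes its holdings in an S3-style object store (the bucket and prefix names below are hypothetical), a server-side copy keeps the transfer entirely inside the cloud:

```python
# Hypothetical sketch: pulling a provider's public directory into the
# user's own S3 bucket ("cloud point") with server-side copies.
import boto3

s3 = boto3.client("s3")
SOURCE_BUCKET = "provider-public-data"   # hypothetical provider bucket
SOURCE_PREFIX = "elevation/1m/"          # hypothetical directory
DEST_BUCKET = "my-processing-bucket"     # the user's own bucket

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PREFIX):
    for obj in page.get("Contents", []):
        # Server-side copy: the data never leaves the cloud provider's network.
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=obj["Key"],
            CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
        )
```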

Data products are usually downloaded via keyword, geospatial, or temporal product discovery applications: users filter their search, create an order, and download the products in small groups.
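
A minimal sketch of that filter-order-download flow, assuming a hypothetical REST discovery endpoint and parameter names (none of these are a real provider API):

```python
# Hypothetical sketch of a keyword + bounding-box + date-range discovery
# query. The endpoint and parameter names are assumptions.
import requests

DISCOVERY_URL = "https://example.gov/api/products"  # hypothetical endpoint

params = {
    "q": "elevation",                       # keyword filter
    "bbox": "-105.3,39.9,-105.1,40.1",      # geospatial filter (W,S,E,N)
    "start": "2013-01-01",                  # temporal filter
    "end": "2013-06-30",
}
products = requests.get(DISCOVERY_URL, params=params, timeout=30).json()

# Build an order from the filtered results, then download a small group.
order = [p["download_url"] for p in products["items"]]
for url in order[:10]:                      # first small group
    filename = url.rsplit("/", 1)[-1]
    with open(filename, "wb") as f:
        f.write(requests.get(url, timeout=300).content)
```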

Public file directory listings should be discoverable and optimized for indexing by search engines.

Public collections should be discoverable and optimized for indexing by search engines.
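
One common way to make a public listing indexable is to publish a sitemap. A minimal sketch, assuming a hypothetical base URL and file keys:

```python
# Minimal sketch: generating a sitemap.xml so search engines can index a
# public file directory listing. The base URL and file keys are assumptions.
from xml.sax.saxutils import escape

BASE_URL = "https://data.example.gov"       # hypothetical public listing
public_files = ["elevation/1m/tile_001.zip", "elevation/1m/tile_002.zip"]

entries = "\n".join(
    f"  <url><loc>{escape(BASE_URL + '/' + key)}</loc></url>"
    for key in public_files
)
sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    f"{entries}\n</urlset>"
)
with open("sitemap.xml", "w") as f:
    f.write(sitemap)
```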

Explicitly demonstrate how bulk data registrations will be discoverable and registered in both Sciencebase.gov and data.gov

Catalogs should be able to pull-harvest or push-harvest public FGDC, ISO 19115, or RDF metadata for the files in the directories, for transactional or bulk loading into their catalog.
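
For the pull-harvest case, a rough sketch, assuming the metadata is exposed as FGDC CSDGM XML at hypothetical URLs (the element path follows the CSDGM layout):

```python
# Rough sketch of a pull-harvest: fetch FGDC CSDGM metadata XML files from
# a public directory and extract titles for catalog loading.
import requests
import xml.etree.ElementTree as ET

metadata_urls = [
    "https://data.example.gov/elevation/tile_001.xml",  # hypothetical
    "https://data.example.gov/elevation/tile_002.xml",
]

for url in metadata_urls:
    root = ET.fromstring(requests.get(url, timeout=30).content)
    # FGDC CSDGM puts the title under idinfo/citation/citeinfo/title.
    title = root.findtext("idinfo/citation/citeinfo/title", default="(no title)")
    print(f"Would load into catalog: {title} <- {url}")
```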

The file directory listing can be queried via an open-standard discovery service to assist in developing a download filter list.

The National Map can be discoverable in the proposed service provider’s catalog, but the catalog reference needs to follow the metadata provided along with each file, at minimum presenting the source, created date, updated date, title, basic description, and the provided DOI link for the file or directory.

The service provider should support being called via a Digital Object Identifier (DOI).
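
A minimal sketch of DOI-based access, resolving the identifier through doi.org and following the redirect; the DOI below is a placeholder, not a registered one:

```python
# Resolve a DOI to the hosted file or directory it points at.
import requests

doi = "10.9999/EXAMPLE.DOI"  # hypothetical DOI for a file or directory
resp = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
print("DOI resolves to:", resp.url)
```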

Users can subscribe to changes to a directory, sub-directory, or specific files.

Users can be notified of such changes via push notifications: per change, as daily digests, via RSS updates, or through other notification techniques.

Users can use these notifications to trigger requests for bulk file updates.
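
As one hedged illustration of the subscription model, assuming changes are published as a standard RSS feed at a hypothetical URL, a subscriber could poll the feed and turn changed items into a bulk update request:

```python
# Sketch: poll an RSS feed of directory changes and collect the changed
# paths as candidates for a bulk file update request. The feed URL is an
# assumption; the item layout is standard RSS.
import requests
import xml.etree.ElementTree as ET

FEED_URL = "https://data.example.gov/elevation/changes.rss"  # hypothetical

root = ET.fromstring(requests.get(FEED_URL, timeout=30).content)
changed = [
    (item.findtext("title"), item.findtext("link"), item.findtext("pubDate"))
    for item in root.iter("item")
]
for title, link, when in changed:
    print(f"{when}: {title} -> {link}")  # feed a bulk update request from these
```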

7. Download API – Supporting or including applications that help the user download in bulk

Provide a download API that can be controlled by api.data.gov, which can uniquely identify requests, provide HTTP access via a GET parameter in a URL query, and support an hourly limit on the number of requests based on API key settings. If the api.data.gov rate limit is exceeded, an HTTP status code of 503 should be returned.
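
A sketch of a client for such an API, passing the api.data.gov API key as a GET parameter and backing off on the 503 rate-limit response the requirement specifies; the endpoint path and parameter names are assumptions:

```python
# Hypothetical client for a rate-limited download API behind api.data.gov.
import time
import requests

API_URL = "https://api.data.gov/example/download"  # hypothetical endpoint
API_KEY = "DEMO_KEY"                               # user's api.data.gov key

def fetch(file_id: str, retries: int = 3) -> bytes:
    for attempt in range(retries):
        resp = requests.get(
            API_URL, params={"api_key": API_KEY, "file_id": file_id}, timeout=300
        )
        if resp.status_code == 503:          # rate limit exceeded, per the spec
            time.sleep(2 ** attempt * 60)    # back off, then retry
            continue
        resp.raise_for_status()
        return resp.content
    raise RuntimeError("rate limit still exceeded after retries")
```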

Third-party applications should be able to make HTTP, REST, FTP, or SCP calls.

The file download capability should support multiple file requests, allow parallel downloads, handle restarting partial downloads, and govern anonymous high-volume requests.
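
A sketch of how a client could satisfy the parallel and restartable parts of that requirement, assuming the server honors standard HTTP Range requests (the URLs are hypothetical):

```python
# Parallel, resumable bulk downloads via HTTP Range requests.
import os
import requests
from concurrent.futures import ThreadPoolExecutor

def download_resumable(url: str) -> str:
    filename = url.rsplit("/", 1)[-1]
    have = os.path.getsize(filename) if os.path.exists(filename) else 0
    headers = {"Range": f"bytes={have}-"} if have else {}
    with requests.get(url, headers=headers, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        # 206 means the server resumed our partial file; 200 means full restart.
        mode = "ab" if resp.status_code == 206 else "wb"
        with open(filename, mode) as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    return filename

urls = [f"https://data.example.gov/tiles/tile_{i:03d}.zip" for i in range(1, 5)]
with ThreadPoolExecutor(max_workers=4) as pool:   # parallel downloads
    for done in pool.map(download_resumable, urls):
        print("finished", done)
```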

Peer-to-peer solution support (e.g., BitTorrent) must comply with federal regulations.

Identify what user training and what sanctioned or third-party consulting for software developers are available, along with their cost.

8. Applications – Supporting the end-user experience for unzipping files and loading them into a geospatial database

If the user receives multiple zipped files that require clicking each link to download, unzipping each file, and then manually loading each file into a database using the provided metadata, can this be automated?
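
It can be, at least in sketch form. Assuming the archives contain shapefiles and the target is a PostGIS database (the connection string and paths below are hypothetical), GDAL’s ogr2ogr can replace the hand-loading step:

```python
# Sketch automating the manual flow: unzip each downloaded archive and load
# the contents into a geospatial database via GDAL's ogr2ogr CLI.
import glob
import subprocess
import zipfile

PG_CONN = "PG:host=localhost dbname=geodata user=loader"  # hypothetical

for archive in glob.glob("downloads/*.zip"):
    extract_dir = archive[:-4]
    with zipfile.ZipFile(archive) as z:
        z.extractall(extract_dir)
    for shp in glob.glob(f"{extract_dir}/*.shp"):
        # Append each shapefile into PostGIS instead of hand-loading it.
        subprocess.run(["ogr2ogr", "-f", "PostgreSQL", PG_CONN, shp, "-append"],
                       check=True)
```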

The vendor can create premium offerings – an accelerator, increased access, or additional formats – as part of the delivery, if branded separately as a vendor-branded product and as long as one version is published that is clearly marked as the authoritative government version, published and controlled by the government in its original published form.