Tuesday, November 20, 2007

Harvesting Data for Conservation, Part 2: The Solution

In 2002, The Heinz Center released its report on ecosystem conditions in the United States, “The State of the Nation’s Ecosystems.” The authors declared the assessment incomplete due to the lack of data collection, reporting, and systems infrastructure needed to sufficiently assess ecosystem condition. Yet the reality is that between the efforts of academic researchers, local and national governments, and conservation organizations, an enormous amount of information deeply relevant to conservation is being collected, much of it already in digital form. However, variation in syntax (e.g., file format) and semantics (e.g., terminology) prevents practical aggregation and analysis of the collected data.

In my last post, I described this fundamental problem in conservation information systems: data model variability. I described how variation in the schema of common conservation entities such as observations, protected areas, conservation projects, and conservation activities frustrates our ability to provide rich data entry and management applications as well as aggregation and analysis capabilities, the very capabilities that would significantly inform assessments like “The State of the Nation’s Ecosystems.”

The solution is to develop a system that supports rich data entry/management/reporting and yet is independent of any specific data model. The system would treat the definition of entities such as observations or protected areas as data itself. Each of these entities has a core schema, the set of attributes that makes it what it is. A species observation, for instance, consists of an observer (person), location, date and time, and a species identifier. This core can then be augmented with observation attributes from a library (e.g., egg count, nest height, stratum, life stage). When needed, more advanced users can build their own attributes (solarization, acidity) and submit them to the common attribute library. Finally, the entity core schema and a selected set of extended attributes can be combined into an extended entity schema. Extended entity schemas are designed for repeated use, potentially reflecting and enforcing a standard or protocol. The library, shared across the natural resources management community, can thus contain core and extended entity schemas along with their component entity attributes.
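To make the idea concrete, here is a minimal sketch of the schema-as-data approach: a shared attribute library, a core entity schema, and an extended schema assembled from both. All names (`Attribute`, `EntitySchema`, the specific attributes) are illustrative assumptions, not the actual NatureServe implementation.

```python
# A sketch of schemas treated as data rather than hard-coded tables.
# Attribute and schema names here are illustrative only.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Attribute:
    name: str
    data_type: str          # e.g. "int", "float", "text", "datetime", "point"
    description: str = ""

# The shared attribute library, contributed to by the community.
ATTRIBUTE_LIBRARY = {
    "egg_count":   Attribute("egg_count", "int", "Number of eggs observed"),
    "nest_height": Attribute("nest_height", "float", "Height of nest in meters"),
    "life_stage":  Attribute("life_stage", "text", "Egg, larva, juvenile, adult"),
}

@dataclass
class EntitySchema:
    name: str
    core: list                              # the attributes that make the entity what it is
    extensions: list = field(default_factory=list)

    def extend(self, new_name, attribute_names):
        """Combine the core with library attributes into an extended schema."""
        extras = [ATTRIBUTE_LIBRARY[a] for a in attribute_names]
        return EntitySchema(new_name, self.core, self.extensions + extras)

# Core schema for a species observation: observer, location, date/time, species.
observation_core = EntitySchema("SpeciesObservation", core=[
    Attribute("observer", "text"),
    Attribute("location", "point"),
    Attribute("observed_at", "datetime"),
    Attribute("species_id", "text"),
])

# An extended schema capturing a nesting-survey protocol for repeated use.
nesting_survey = observation_core.extend(
    "NestingSurveyObservation", ["egg_count", "nest_height", "life_stage"])

print([a.name for a in nesting_survey.core + nesting_survey.extensions])
```

The extended schema carries the full observation core, so any dataset built from it remains aggregable with every other observation dataset.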

For instance, a standard field protocol for surveying invasive species can be captured in an invasive-species observation schema and applied in numerous invasive species surveys. The schema can be extended to support surveys of specific invasives. (For instance, researchers determined that the length of the hind legs of the invasive Cane Toad correlated with the geographical front edge of the invasion. "Hind Leg Length" would then be a key attribute of a Cane Toad survey.)

The data management application parses the entity schema and component attributes as definitions of data types and behaviors. It then provides rich data entry, management, mapping, reporting, and spatial and statistical analysis on the entered data. We can afford to invest heavily in the development of this system because the functionality is not specific to any given conservation data model. Attribute definitions, including labels, help text, and error messages, are localizable into other languages, a feature critical for global conservation. Thus ends the tyranny of the software engineer. No longer are users beholden to software developers to create custom applications with rich functionality to support their data models, data models that can evolve with the needs of conservation and basic scientific understanding.
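The paragraph above can be sketched as a schema-driven validation layer: the application interprets the schema (here a plain dict, as data) to validate records and pick localized labels, so the application code never hard-codes a conservation data model. The schema format and field names are assumptions for illustration.

```python
# A sketch of schema-driven data entry: the schema, not the application,
# defines the fields, their types, and their localized labels.
SCHEMA = {
    "name": "SpeciesObservation",
    "fields": [
        {"name": "observer", "type": "text", "required": True,
         "labels": {"en": "Observer", "es": "Observador"}},
        {"name": "egg_count", "type": "int", "required": False,
         "labels": {"en": "Egg count", "es": "Número de huevos"}},
    ],
}

CASTERS = {"text": str, "int": int, "float": float}

def validate(record, schema, locale="en"):
    """Validate a raw record against a schema; return (cleaned, errors)."""
    cleaned, errors = {}, []
    for f in schema["fields"]:
        label = f["labels"].get(locale, f["name"])   # localized message text
        value = record.get(f["name"])
        if value is None:
            if f["required"]:
                errors.append(f"{label}: required")
            continue
        try:
            cleaned[f["name"]] = CASTERS[f["type"]](value)
        except (ValueError, KeyError):
            errors.append(f"{label}: expected {f['type']}")
    return cleaned, errors

ok, errs = validate({"observer": "J. Smith", "egg_count": "4"}, SCHEMA)
print(ok, errs)  # {'observer': 'J. Smith', 'egg_count': 4} []
```

Swapping in a different schema changes the form, the validation, and the error messages without touching a line of application code, which is the point of the whole design.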

The other win for conservation is the ability to aggregate and analyze the resulting datasets arising from conservation organizations, the academic research community, even state agencies. When users populate datasets based on shared entity schemas and extended attribute definitions, those datasets are inherently standardized and available for rich analysis. For instance, species observation data can be harvested and mapped across all species surveys (using secure web services) based on their common core (observer, observed species, date/time, location). Even this basic map would constitute a major breakthrough for conservation. Analysis of invasive species observations, based on an invasive species observation schema, would similarly bring insights into patterns in invasions. Population reductions or migrations over time associated with climate change can be mapped and analyzed based on surveys where climate change, per se, was not the primary focus. Again, while this approach would enable conservation to make use of an enormous wealth of basic observation data, these concepts apply equally well to information about lands managed for conservation (protected areas), stewardship activities, conservation projects themselves, etc.

We developed such a system with a small team at NatureServe to support observations. The data entry/management/reporting tool is rich in functionality and yet supports any data model we could throw at it. Parks Canada is the first customer and is already excited about the ability to support users conducting specialized surveys within parks as well as those carrying out high-level analyses of observation data across parks.

There is nothing specific to conservation in this technology. Indeed, I see examples of related systems existing and emerging on the web. Freebase is the closest I've seen yet to supporting what we need. But I'm not sure Metaweb is going where we need to go.

For instance, the support for combinations of attribute definitions into entity schema, corresponding to basic entities in conservation like observations and protected areas, is critical to support user-driven standards and protocols. We must have the ability to search and browse a community repository for core entity schema and their associated attributes. This open-source style resource would allow schema authors to post their submissions for use by the conservation community, solicit feedback, post modifications, and report on usage. In this way, subject-matter experts in various areas of conservation and biodiversity can directly share their expertise with the community in the form of widely-used entity and attribute definitions.
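The community repository described above could be sketched as a small service supporting submission, search, and usage reporting. The class and method names are assumptions for illustration; a real repository would add versioning, feedback threads, and access control.

```python
# A sketch of a community schema repository: authors submit schemas,
# practitioners search them, and usage reports credit the authors
# without exposing anyone's underlying data.
from collections import defaultdict

class SchemaRepository:
    def __init__(self):
        self.schemas = {}                 # name -> schema definition (as data)
        self.usage = defaultdict(int)     # name -> times applied in surveys

    def submit(self, name, definition):
        """Authors post schemas for community review and reuse."""
        self.schemas[name] = definition

    def search(self, term):
        """Find schemas whose name or description mentions the term."""
        term = term.lower()
        return [n for n, d in self.schemas.items()
                if term in n.lower() or term in d.get("description", "").lower()]

    def record_use(self, name):
        self.usage[name] += 1

    def usage_report(self):
        """Usage counts, most-used first: shared counts, not shared data."""
        return sorted(self.usage.items(), key=lambda kv: -kv[1])

repo = SchemaRepository()
repo.submit("InvasiveSpeciesObservation",
            {"description": "Field protocol for invasive species surveys"})
repo.record_use("InvasiveSpeciesObservation")
print(repo.search("invasive"), repo.usage_report())
```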

By separating the data model from the data management application functionality, we can provide conservation practitioners on the ground with powerful and usable tools to capture and manage their information. This same approach, to the extent we succeed in building a rich, common library of data model components, will enable unprecedented aggregation and analysis of similar, though not identical, data sets. The efficacy and efficiency of conservation at the project level can thus be improved as well as our overall understanding of the status and dynamics of nature.

5 Comments:

I've read your past few posts on data technologies for ecology and conservation and find your analysis very insightful.

Providing flexible, collaborative data structures while maintaining some semblance of shared "semantics" is a real challenge.

In an area like conservation, the importance of sharing distributed observations is even more critical than in other domains. As an economist, I look at this as a design question: how do you develop a collaborative repository that makes the utility of the shared repository, for any individual researcher, greater than that of a private "spreadsheet" collection method?

I work at Metaweb and think these are the important questions which will determine whether shared information systems will be able to improve our world in meaningful ways.

I have been working with a small, but growing community of biologists who have been developing schemas on Freebase for taxonomy, ecological models and bioinformatics. I would welcome the opportunity to work with you on the problems you outline here and talk with you about the direction Metaweb is taking.

Utility to the data producers is indeed the critical component. To achieve our goals, we simply have to beat Excel, and that includes usability, performance, and powerful functionality for this problem space. The very good news is that, when focused on the specific problem domain of managing conservation datasets, we definitely can beat Excel.

If we beat Excel and systems like Excel, the conservation data producer not only wins with a system more suited to his/her problem; the conservation data consumer (aggregator/analyzer) also wins. Standards conformance, like armies in a Trojan horse, is embedded in the data management system. The data producer produces conforming datasets without paying any additional cost. His/her data is available for downstream use not because he/she has succumbed to altruistic arguments about data sharing and then taken the extra time and effort to cross-walk the data to standards, but because sharing amounts to checking a box. Literally, checking a box.

It's possible and exciting. I very much believe that, as you put it, shared information systems can improve our world in meaningful ways.

I welcome the opportunity to collaborate. I'll be emailing you shortly.

Kristin, we've already discussed these ideas in some detail and I'm totally on board. I just wanted to make one comment in response to yours: there is still some cost to the data producer, in that there must still be compliance with the attribute library for the full goals to be achieved. So maybe consider it a watered-down conformance to standards. In the best-case scenario, everything you need is already in place because somebody else did the work. In the worst case, you may have to research the attributes and semantics already in the library and establish your extended schema. No argument about the benefits to both producers and consumers, and still a dramatically lower cost than developing a completely new system to handle the different context.

Excellent point, Paul. You are quite right that if a required attribute is truly missing from the library (or, worst of all, hard to find), the data producer either abandons the system (reverting to Excel) or pays the non-trivial cost of describing a new attribute and template.

The hope is that the costs to each individual data producer converge to zero over time, precisely because of the contributions of the data producers who came before him or her.

In practical terms, we know what's required here to make this work for conservation: organizations like The Nature Conservancy, NatureServe, Cornell Lab of Ornithology and others can lend their expertise and capacity to the "seeding" effort. We take our existing conservation data standards and describe them in the library, thus at least reducing the costs for data producers who follow by giving them a good head start.

First, there will always be a cost when the data producer is collecting truly novel information. For instance, if "soil acidity" were only recently measurable, the associated attribute would have to be described, potentially by the first researcher to measure it in the field.

Second, we can mitigate the costs by providing powerful and user friendly functionality for defining, even localizing, new attributes and submitting them back to the community repository.

Third, besides the "seeding" idea mentioned in my previous comment, the open source approach to the shared repository might help us as well by enlisting the power of ego. Is it hard to imagine egomaniac biologists investing themselves in the task of creating attributes and templates based on their expertise? We'll create attribute and entity template usage reports (sharing the usage counts without sharing the data) that have the effect of esteeming their authors.