RDF123 and Spotter: Tools for generating OWL and RDF for biodiversity data in spreadsheets and unstructured text

OWL (the Web Ontology Language) and the related RDF (Resource Description Framework) are languages, typically serialized as XML, designed to represent the semantics of data. These languages enable systems to go beyond simple controlled vocabularies and specify the contexts of terms and the logical relationships among them. Formal ontologies use classes (e.g., Species A) and properties (e.g., is a member of, or eats, or has body mass) to represent concepts and relationships as assertions. For example, two assertions might be “Species A is a member of Family B” and “Family B is a taxon whose members eat plants.” A machine can then use logic to reason that all individuals of Species A should also eat plants; other assertions would make it clear that, in this context, plants are organisms and not factories. The Semantic Web is the collection of web documents that use semantic languages such as OWL and RDF. On the Semantic Web, specialized search engines can use such assertions to find and integrate information more sensibly. For example, applications can determine whether a web document refers to a “crow” that is a bird or the “Crow” that is a Native American tribe. They can merge data for “mass” from different body mass datasets but ignore data related to other meanings of the word “mass”.
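The inference in the Species A example can be sketched in a few lines of Python. This is only an illustration of the reasoning step, not an OWL reasoner: the triples are plain tuples, and the identifiers (`SpeciesA`, `FamilyB`, `memberOf`, `membersEat`) are hypothetical names chosen for this sketch rather than terms from any published ontology.

```python
# Assertions as (subject, property, object) triples.
# All identifiers are hypothetical and used for illustration only.
facts = {
    ("SpeciesA", "memberOf", "FamilyB"),
    ("FamilyB", "membersEat", "Plant"),
}


def infer_diets(triples):
    """Apply one rule until nothing new is derived:
    if X memberOf Y and Y membersEat Z, then X membersEat Z."""
    closed = set(triples)
    while True:
        new = {
            (child, "membersEat", food)
            for (child, p1, parent) in closed
            if p1 == "memberOf"
            for (taxon, p2, food) in closed
            if p2 == "membersEat" and taxon == parent
        } - closed
        if not new:
            return closed
        closed |= new


inferred = infer_diets(facts)
print(("SpeciesA", "membersEat", "Plant") in inferred)
```

Repeating the rule until no new triples appear lets the conclusion propagate through longer chains (for example, a subspecies that is a member of Species A).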

Although many authors have claimed that OWL and RDF will solve data discovery and integration issues, which are pressing problems in biodiversity science, adoption of these formats has so far been largely limited to computer scientists, database administrators, and highly trained ontologists. The SPIRE project has developed two tools designed to make it easier for individual scientists to convert their information to RDF and OWL. We report on tests of these tools with biodiversity data.

RDF123 (http://rdf123.umbc.edu/) is a highly flexible open-source tool for transforming spreadsheet data to RDF. It is intended for use with ontologies in any content area. Two RDF123 interfaces are available. The first is a graphical interface that allows users to map their spreadsheet columns and rows to ontology classes and properties in an intuitive manner. The second is a web service, intended for machine-to-machine communication, that takes a Google spreadsheet and an RDF123 map as input and provides RDF as output. RDF123 was tested using spreadsheet data from the first annual Blogger BioBlitz, a biodiversity survey that recorded sightings of a broad range of taxa in 17 localities in April 2007. We mapped spreadsheet columns to concepts in SPIRE’s ETHAN and observation ontologies so that RDF123 could generate OWL representations. The resulting OWL data was posted on the web, where it was indexed by Swoogle, the semantic web search engine.
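The general idea of mapping spreadsheet columns to ontology properties can be sketched as follows. This is a minimal illustration under stated assumptions, not RDF123's map format or web service interface: the namespace `http://example.org/obs#`, the column names, the property names, and the sample row are all invented for the sketch, and the output is written as N-Triples lines for simplicity.

```python
import csv
import io

# Placeholder namespace and column-to-property map (assumptions, not SPIRE terms).
EX = "http://example.org/obs#"
COLUMN_TO_PROPERTY = {
    "species": EX + "observedTaxon",
    "locality": EX + "locality",
    "date": EX + "date",
}


def escape(value):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(csv_text):
    """Emit one subject per spreadsheet row and one triple per mapped, non-empty cell."""
    lines = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader, start=1):
        subject = f"<{EX}observation{i}>"
        for column, prop in COLUMN_TO_PROPERTY.items():
            value = (row.get(column) or "").strip()
            if value:
                lines.append(f'{subject} <{prop}> "{escape(value)}" .')
    return lines


# Made-up sample row, not taken from the Blogger BioBlitz data.
sample = "species,locality,date\nRana clamitans,Site A,2007-04-21\n"
for line in rows_to_ntriples(sample):
    print(line)
```

Keeping the column-to-property map as data, separate from the conversion loop, mirrors the separation between a spreadsheet and its map: the same rows can be re-expressed against a different ontology by changing only the map.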

Spotter (http://spire.umbc.edu/firefox/) is a Firefox extension that lets citizen scientists publish RDF for observations of organisms reported in unstructured text (e.g., in blog entries, discussions, photo-sharing sites). The user fills out a simple form and pastes a link into a comment or blog entry. By following the link, semantic web crawlers then generate and index the appropriate RDF. We continue to test Spotter on our own blog, https://ebiquity.umbc.edu/fieldmarking, and in cooperation with an environmental education summer camp.

The RDF data produced by both RDF123 and Spotter can be discovered and integrated with related data collected in different contexts, using our TripleShop application or a mapping application. For example, it is possible to conduct queries such as “What invasive species were observed in the Blogger BioBlitz?” or “Where have people observed frogs this year?” We found that 47 of the 1200 Blogger BioBlitz observations were of species defined as “of concern” by the US Fish and Wildlife Service. We plan to extend this work by taking advantage of existing technologies such as RSS to alert subscribers to new data of interest on the Semantic Web.
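A query such as “Where have people observed frogs?” depends on combining observation data with taxonomic assertions from another source. The sketch below illustrates that integration step with plain Python sets; it does not use TripleShop, and the observation identifiers, property names, and localities are made up for illustration.

```python
# Made-up triples standing in for three independently published sources.
survey = {
    ("obs1", "taxon", "Rana clamitans"),
    ("obs1", "locality", "Site A"),
    ("obs2", "taxon", "Corvus brachyrhynchos"),
    ("obs2", "locality", "Site B"),
}
blog = {
    ("obs3", "taxon", "Rana clamitans"),
    ("obs3", "locality", "Site C"),
}
taxonomy = {
    ("Rana clamitans", "memberOf", "Anura"),  # frogs
    ("Corvus brachyrhynchos", "memberOf", "Corvidae"),
}


def localities_for_group(graph, group):
    """Return sorted localities of observations whose taxon belongs to `group`."""
    members = {s for (s, p, o) in graph if p == "memberOf" and o == group}
    observations = {s for (s, p, o) in graph if p == "taxon" and o in members}
    return sorted(o for (s, p, o) in graph if p == "locality" and s in observations)


# Integration here is just a set union, because all sources share identifiers.
merged = survey | blog | taxonomy
print(localities_for_group(merged, "Anura"))
```

Neither observation source mentions frogs directly; the answer only becomes available once the taxonomy triples are merged in, which is the kind of cross-context integration the shared identifiers make possible.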