AP30 - Bird Species Distribution Project

Wednesday, October 24, 2012

1. Introductory Product Information

The web services developed as part of this project are extensions to a product known as the Biocache. This tool serves as the ALA's aggregation software for specimen and species observation data.

This product provides data access, including mapping facilities, for a number of external portals (OZCAM, AVH, AMRiN) as well as the main ALA website. These portals all require search capabilities.

The primary focus of the Biocache is to:

aggregate occurrence data from multiple sources

provide data quality checks and cleaning of the data

support assertions made against the data by software or people

provide web service access to this data to facilitate re-use in other portals.

2. Instructional Product Information

This product is intended to provide bulk access to occurrence data, to enable the JCU Edgar team to develop a portal for vetting occurrence data. In addition, this project has developed a number of data quality processes and services for accessing the results of these offline processes. By their nature, these processes generally require scanning across the entire index to analyse records.

Bulk occurrence (localities) downloads

This service provides the ability to download occurrence records; the download will include all records that satisfy the q, fq and wkt parameters. The number of records for a data resource may be restricted based on a download limit configured in the collectory. An example request is shown after the parameter list. Params:

q - the initial query. q=*:* will match all records, q=macropus will do a free text search for "macropus", q=kingdom:Fungi will search for records with a kingdom of Fungi.

fq - filters to be applied to the original query. These are additional params of the form fq=INDEXEDFIELD:VALUE e.g. fq=kingdom:Fungi

wkt - a filter polygon to be applied to the original query, expressed as Well-Known Text (WKT).

email - the email address of the user requesting the download

reason - the reason for the download

file - the name to use for the file to download

fields - a CSV list of fields to include in the download (a default list of fields is used when this is not supplied)
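As an illustration, assuming the biocache web service base URL used elsewhere in this document (http://biocache.ala.org.au/ws) and an occurrence download path of /occurrences/download (this path is an assumption and may differ between deployments), a request for all Fungi records within a bounding polygon could look like the following, with parameter values URL-encoded as required:

http://biocache.ala.org.au/ws/occurrences/download?q=*:*&fq=kingdom:Fungi&wkt=POLYGON((140 -37,151 -37,151 -26,140 -26,140 -37))&email=user@example.org&reason=AP30 testing&file=fungi-records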

View Query Assertion details: /assertions/query/{uuid}
This service will return the assertion information. It will NOT return the details of the query.

View Query Assertions details: /assertions/queries/{csv list of uuid}
This service will return the information for all the listed assertions. It will NOT return the details of the queries.

Apply Query Assertion: /assertions/query/{uuid}/apply
This service will apply the supplied query assertion against the biocache records.
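For example, assuming the biocache web service base URL used elsewhere in this document (http://biocache.ala.org.au/ws) and hypothetical assertion identifiers aa45bc21 and bb56cd32 (the real identifiers are UUIDs issued by the service), the three services above would be invoked as:

http://biocache.ala.org.au/ws/assertions/query/aa45bc21
http://biocache.ala.org.au/ws/assertions/queries/aa45bc21,bb56cd32
http://biocache.ala.org.au/ws/assertions/query/aa45bc21/apply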

3. Product Re-usability Information

The web services developed as part of this project are open access and are available to be incorporated into other websites and toolsets. In addition, all of this software is built on an open source stack (Apache Cassandra, Apache SOLR) so is "free" to set up in other environments. The code developed by the Atlas for this project is accessible in the google code repository http://code.google.com/p/ala-portal/

Wednesday, August 22, 2012

One of the most important aspects of the AP30 project is fast access to large numbers of occurrence records at a taxon by taxon level. As an example, at the time of writing there were 420,039 distinct records in the Atlas for the Australian Magpie. All of these individual records will need to be retrieved and cached by the Edgar project to facilitate the vetting tool Edgar is developing. Edgar is focussing on bird distributions and modelling. Currently, there are 1,986 taxonomic concepts for Australian birds supplied by the Australian Faunal Directory.

In addition, the persistence of vettings against the data in the Atlas will mean other tools and portals will benefit from the improved data quality. Typically researchers who work with these data will have to undertake a complete gathering of the data and a cleaning process to remove duplicate and erroneous records. This data cleaning is typically not shared or persisted with the source data, leading to duplication of effort within the research community.

By submitting the vettings to the Atlas, Edgar will be sharing the improved data quality with any researcher accessing these data through the Atlas.

The primary customers for this project are the Edgar team, with whom we are working to produce a platform that supports:

integration of occurrence data for presentation to users on a map interface

the ability to filter environmental outliers

the ability to filter duplicate records

persistence of record vettings, and application of the vettings to records within a polygon

However, once these services and data processing are in place, they will benefit the ALA and additional portals wishing to access occurrence data. This will include portals such as the Online Zoological Collections of Australian Museums (OZCAM), which is built upon Atlas web services.

The stack we are putting together to support AP30 requirements includes:

Apache Cassandra Database. The database will house the full record details and will store the results of duplicate & outlier detection. It will also provide the persistence for the record vettings provided by Edgar.

Apache SOLR search indexes. These indexes will support the searching capabilities required by Edgar.

A processing chain implemented in Scala. This will include the algorithms for detecting duplicate records and environmental outliers. This custom code will then update the search indexes to allow Edgar to filter for non-duplicates and non-outliers (an example filter is shown after this list), hence improving the quality of the model and reducing the number of records to be vetted. The code for this component is accessible in the google code repository http://code.google.com/p/ala-portal/

Java Spring MVC web services. These web services will provide the interface for the Edgar project to download snapshots of data for modelling and vetting purposes. They will also provide a write interface for submission of the vettings of bird records by expert users. The code for this component is accessible in the google code repository http://code.google.com/p/ala-portal/. An additional important functional requirement is for Edgar to be able to query for record deletions. Services will be developed to allow Edgar to keep track of these deletions that occur periodically when the ALA harvests from data providers.
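As an illustration of how the duplicate and outlier flags could be used once they are written to the SOLR index, Edgar could exclude flagged records with filter queries of the following form (the field names and values used here are assumptions rather than the final index schema):

q=taxon_name:"Cracticus tibicen"&fq=-duplicate_status:D&fq=outlier_layer_count:0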

Tuesday, July 31, 2012

The ALA has a web service that accepts a POST request with a JSON body that contains the information to support a vetting.

The URL for the service is: http://biocache.ala.org.au/ws/assertions/query/add

Example JSON for the POST body:
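As a minimal sketch, assuming the body carries the species name, the WKT for the area and the caller's apiKey (the JSON field names below are assumptions rather than the documented schema), a submission from Scala might look like:

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object SubmitVetting extends App {
  // Hypothetical JSON body: the field names are assumptions, not the documented schema
  val json =
    """{
      |  "apiKey": "YOUR-API-KEY",
      |  "species": "Cracticus tibicen",
      |  "area": "POLYGON((140 -37,151 -37,151 -26,140 -26,140 -37))",
      |  "comment": "Records in this area fall outside the known range of the species"
      |}""".stripMargin

  val conn = new URL("http://biocache.ala.org.au/ws/assertions/query/add")
    .openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setRequestProperty("Content-Type", "application/json")
  conn.setDoOutput(true)
  conn.getOutputStream.write(json.getBytes(StandardCharsets.UTF_8))

  // 400 indicates an invalid body, species name or WKT; 403 indicates a missing or invalid apiKey
  println("Response code: " + conn.getResponseCode)
  conn.disconnect()
}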

Validating the supplied information

When the JSON body is invalid, an HTTP Bad Request (400) will be returned.

When an invalid apiKey, or no apiKey, is provided, an HTTP Forbidden (403) will be returned.

Otherwise the supplied information will be validated in two ways: first to ensure that the species name exists in the current system, and second to ensure that the area is in a valid WKT format. If either of these checks fails, an HTTP Bad Request (400) is returned as the status with a message indicating the issue.

Insert/Update

When inserting a new vetting, a first-loaded date is populated. This date is never updated. The purpose of this date is to provide a context for “historic” vettings. In the future the ALA may provide additional QA checks around records that appear in a “historic” region after the vetting's first-loaded date.

Each vetting that is posted to the web service will be stored in the database as raw JSON. Other fields populated include a last modified date and a query. The query will be constructed using the species name and the WKT for the area. This query will be used to identify the records that are considered part of the vetting.
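As an illustration (the indexed field name used here is an assumption, not the actual query syntax stored by the service), a vetting for the Australian Magpie over a given polygon might generate a stored query along the lines of:

q=taxon_name:"Cracticus tibicen"&wkt=POLYGON((140 -37,151 -37,151 -26,140 -26,140 -37))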

Deleting

When a “delete” is issued against an existing vetting it is marked as deleted in the database. It is not physically deleted until the action is filtered through to the ALA data. A "deleted" assertion can NOT be resurrected.

Applying Vettings to ALA Data

The exact process that will be used to apply the vettings to the ALA data is not yet known. It will be a batch process, run nightly, that updates records based on the queries that were generated for each vetting.

New/updated vettings that have not been pushed through to the ALA data will be applied to all records that satisfy the query. Old vettings will be applied to records that have been inserted/modified since the previous batch process.
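Since the exact process is still to be decided, the following is only a sketch of that selection logic, with hypothetical types and data-access functions standing in for the real Cassandra and SOLR calls:

object VettingBatch {
  // Hypothetical representation of a stored vetting
  case class Vetting(uuid: String, query: String, lastModified: Long, deleted: Boolean)

  def applyVettings(vettings: Seq[Vetting],
                    lastBatchRun: Long,
                    recordsMatching: String => Seq[String],              // uuids of records satisfying a vetting query
                    recordsModifiedSince: (String, Long) => Seq[String], // matching records changed since a given time
                    markRecord: (String, Vetting) => Unit): Unit = {
    vettings.filterNot(_.deleted).foreach { v =>
      val targets =
        if (v.lastModified > lastBatchRun) recordsMatching(v.query)      // new/updated vetting: apply to all matching records
        else recordsModifiedSince(v.query, lastBatchRun)                 // old vetting: only records changed since the last run
      targets.foreach(recordUuid => markRecord(recordUuid, v))
    }
  }
}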

Thursday, July 12, 2012

As a part of a range of data quality checks, the Atlas has identified potential duplicate records within the Biocache. This allows users to discard duplicate records from searches, analysis and mapping where this is appropriate. A discussion on duplicate records is available here.

Records are considered for duplication within a species (with synonyms mapped to the accepted name).

Collection dates are duplicates when the individual components are identical (year, month and day). Empty values are considered the same.

Collector names are compared; if one is null it is considered a duplicate, otherwise a Levenshtein distance is calculated, with an acceptable threshold indicating a duplicate.

Latitudes and Longitudes are duplicates when they are identical at the same precision. Null values are excluded from consideration.

When a group of records is identified as duplicates, one needs to be identified as the "representative" record. The representative record can be used to represent the entire group of duplicates and should not itself be considered a duplicate.
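A minimal sketch of these pairwise checks, assuming a simplified record type (the field names and the Levenshtein threshold are placeholders rather than the values used by the Atlas processing chain), might look like:

// Simplified occurrence record: field names and types are illustrative only
case class Occ(species: String, year: String, month: String, day: String,
               collector: Option[String], lat: Option[Double], lon: Option[Double])

object DuplicateCheck {

  // Standard Levenshtein edit distance between two strings
  def levenshtein(a: String, b: String): Int = {
    val d = Array.tabulate(a.length + 1, b.length + 1)((i, j) => if (i == 0) j else if (j == 0) i else 0)
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      d(i)(j) = math.min(math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), d(i - 1)(j - 1) + cost)
    }
    d(a.length)(b.length)
  }

  // Collection dates are duplicates when each component is identical (empty values compare as equal)
  def sameDate(r1: Occ, r2: Occ): Boolean =
    r1.year == r2.year && r1.month == r2.month && r1.day == r2.day

  // A missing collector is treated as a match; otherwise names must fall within an edit-distance threshold
  def sameCollector(r1: Occ, r2: Occ, threshold: Int = 3): Boolean =
    (r1.collector, r2.collector) match {
      case (Some(c1), Some(c2)) => levenshtein(c1.toLowerCase, c2.toLowerCase) <= threshold
      case _                    => true
    }

  // Coordinates must both be present and identical (precision normalisation omitted here)
  def sameCoordinates(r1: Occ, r2: Occ): Boolean =
    (for (la1 <- r1.lat; la2 <- r2.lat; lo1 <- r1.lon; lo2 <- r2.lon)
      yield la1 == la2 && lo1 == lo2).getOrElse(false)

  // Two records are duplicate candidates when they share the accepted species and all checks agree
  def isDuplicate(r1: Occ, r2: Occ): Boolean =
    r1.species == r2.species && sameDate(r1, r2) && sameCollector(r1, r2) && sameCoordinates(r1, r2)
}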


This project is supported by the Australian National Data Service (ANDS) through the National Collaborative Research Infrastructure Strategy Program and the Education Investment Fund (EIF) Super Science Initiative, as well as through the Queensland Cyber Infrastructure Foundation (QCIF).