Abstract

To date, the processing of wildlife location data has relied on a diversity of software and file formats. Data management and the following spatial and statistical analyses were undertaken in multiple steps, involving many time-consuming importing/exporting phases. Recent technological advancements in tracking systems have made large, continuous, high-frequency datasets of wildlife behavioural data available, such as those derived from the global positioning system (GPS) and other animal-attached sensor devices. These data can be further complemented by a wide range of other information about the animals' environment. Management of these large and diverse datasets for modelling animal behaviour and ecology can prove challenging, slowing down analysis and increasing the probability of mistakes in data handling. We address these issues by critically evaluating the requirements for good management of GPS data for wildlife biology. We highlight that dedicated data management tools and expertise are needed. We explore current research in wildlife data management. We suggest a general direction of development, based on a modular software architecture with a spatial database at its core, where interoperability, data model design and integration with remote-sensing data sources play an important role in successful GPS data handling.

1. Introduction: GPS data, new perspectives, new challenges

Global navigation satellite systems (GNSS) are constellations of orbiting satellites working in conjunction with a network of ground stations that provide geo-spatial positioning of a user's receiver with global coverage (Tomkiewicz et al. 2010). At present, the most widely used GNSS is the global positioning system (GPS). This technology represents a powerful tool for wildlife studies (Hebblewhite & Haydon 2010). GPS tracking systems can record huge amounts of highly accurate animal locations with minimal work by operators, thus allowing reduced sampling intervals, and increased accuracy and performance when compared with very high-frequency (VHF) radio-tracking systems (Rodgers 2001; Frair et al. 2004, 2010; Ropert-Coudert & Wilson 2005). Furthermore, data can be remotely transferred to operators (e.g. via the global system for mobile communications (GSM) network, or the Argos satellite system), making near-real-time monitoring of animals possible.

However, the availability of large datasets also poses a number of challenges, for example the need for appropriate analytical techniques to deal with spatially and temporally autocorrelated data (e.g. Boyce et al. 2010; Fieberg et al. 2010), or the development of modelling approaches to exploit the information embedded in continuous time series of animal locations (e.g. Kie et al. 2010; Smouse et al. 2010). On a more pragmatic and underlying level, GPS tracking routinely generates larger datasets than software tools commonly used by biologists in the recent past could handle (Rutz & Hays 2009).

Existing dedicated software tools for wildlife studies were mainly developed on the basis of VHF radiotracking data, which are characterized by small and discontinuous datasets, and focused on data analysis rather than data management (e.g. Ranges V: Kenward & Hodder 1996; HRE extension for ESRI ArcView: Rodgers & Carr 1998; Animal Movement extension to ESRI ArcView: Hooge & Eichenlaub 2000; Biotas: Ecological Software Solutions LLC 2004; Hawth's Tools extension to ESRI ArcGIS: Beyer 2004; HRT extension for ESRI ArcGIS: Rodgers et al. 2005). Spatial data, such as animal locations and home ranges, were traditionally stored locally in flat files, accessible to a single user at a time, and analysed by a number of independent applications without any common standards for interoperability. This approach can require data replication and export/import procedures that are time-consuming and potential sources of error. Moreover, data preprocessing has to be repeated for every scientific question to be addressed, resulting in task replication and wasted time. Instead, good scientific practice requires that data are securely, consistently and efficiently managed, to minimize errors, increase the reliability and reproducibility of inferences, and ensure data persistence (e.g. consistent use of data on multiple occasions and by several persons).

In this paper, we critically evaluate the requirements for good management of GPS wildlife tracking data. We highlight that dedicated data management tools and expertise are needed. We explore current research in wildlife data management and suggest a possible direction of development, based on a modular software architecture with a spatial database at its core. Specific concerns, including interoperability, data model design and integration with remote-sensing data sources are discussed.

2. Exploring the new information framework: a requirement analysis

GPS data characteristics and users' needs drive the conceptual definition of a suitable software architecture that can be developed with specific tools on different platforms. The main requirements and needs can be summarized as follows:

Data scalability. The main challenge is that GPS-based devices can record hundreds, or in some cases thousands, of locations per animal per day, and improvements in automated data collection are increasing the volume of data available from individual animals. Moreover, as the cost of this technology decreases, the number of monitored individuals, and thus the overall volume of data, will grow further. Recent multi-sensor devices (Cooke et al. 2004) amplify the data volume by orders of magnitude and also complicate the data structure. To handle this large amount of data consistently, a persistent and very large data storage capability is needed.

Long-term storage for data reuse. Data must be consistently stored in the long term, independently from a specific application, to permit data reuse for different studies.

Periodic and automatic data acquisition. Many GPS data management issues are linked to the frequency of data recording and near-real-time access to data (Tomkiewicz et al. 2010). This requires automated procedures to receive, review and store data from GPS telemetry devices either continuously or at regular intervals.

Global spatial and time references. Studies with regional or global perspectives imply the development of specific tools to manage global time and spatial reference systems efficiently (i.e. to handle Coordinated Universal Time versus local time and different spatial reference systems in a common framework).

Heterogeneity of applications. The complex nature of movement ecology implies that GPS data should be visualized, explored and analysed by a wide range of specific task-oriented applications (e.g. for mapping, spatial statistics and reporting). This requires a software architecture that supports the integration of different software tools.

Integration of different data sources. The availability of spatial data derived from remote sensing, environmental and socio-economic databases, and other animal-related data (e.g. capture details, life-history traits) provides many opportunities to expand the information embedded in GPS-based locations, if these spatial and non-spatial datasets can be correctly managed and efficiently integrated into a comprehensive data structure.

Multi-user support. Within research groups, public institutions and environmental organizations, several users might need to access data simultaneously, both locally and remotely, with different access privileges (e.g. Wong et al. 2007). This multi-user environment calls for concurrency-control mechanisms and security functionalities.

Cost-effectiveness. The cost of software tools is an important accessibility factor: institutions with limited financial resources should be able to direct their funds to the production and analysis of data, rather than to data handling.

These requirements must all be satisfied to take full advantage of the information that wildlife tracking devices can provide, and to avoid the risk of drowning in data. The software used in the past cannot provide for all these needs with the increasing volume and complexity of data. There is thus an urgent need for new software architectures.

3. The backbone of the system: database management systems

Database management systems (DBMSs) provide a mature foundation for such architectures. Their most relevant features for the management of wildlife tracking data can be summarized as follows:

— data consistency: full support of transactions, i.e. units of work that are either performed in full or not at all, ensuring complete data operations and correct data management (requirements 2, 9);

— automation of processes: DBMSs can be empowered by defining internal functions and triggers, so that a wide range of routine but complex workflows can be performed automatically and efficiently inside the database itself (requirement 3);

— data retrieval performance: querying hundreds of thousands of GPS-based locations can be very time consuming—DBMSs can speed up this process using database indexes (requirement 4);

— client/server architecture: an advanced DBMS provides data as a central service, to which a number of different applications can be connected and used as database front-ends, including spatial, statistical and internet tools (requirements 7, 10, 12);

— multi-user environment: once centralized in a DBMS, data can be accessed by multiple users at the same time, while the coherence of their concurrent operations is maintained (requirement 10);

— data security: a wide range of data access controls can be implemented, where each user is constrained to the use of specific sets of operations on defined subsets of data (requirement 10);

— standards: consolidated industry standards for databases facilitate interoperability with client applications and data sharing among different research groups (requirement 11; see §4).
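Several of these features can be demonstrated in miniature with the file-based SQLite engine (one of the DBMSs discussed below). The following Python sketch is illustrative only: the table layout, device identifier and validity rule are assumptions, not part of any system described in this paper.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# A single raw-data table for one hypothetical device.
cur.execute("""CREATE TABLE gps_raw_data (
    device_id TEXT, acq_time TEXT, lon REAL, lat REAL, valid INTEGER)""")

# Automation of processes: a trigger validates coordinates on insert.
cur.execute("""CREATE TRIGGER check_fix AFTER INSERT ON gps_raw_data
BEGIN
    UPDATE gps_raw_data
    SET valid = (NEW.lon BETWEEN -180 AND 180 AND NEW.lat BETWEEN -90 AND 90)
    WHERE rowid = NEW.rowid;
END""")

# Data consistency: the whole download is one transaction -- either every
# fix is committed or none is.
fixes = [("D001", "2005-10-23 20:00:54", 11.04413, 46.01096),
         ("D001", "2005-10-23 23:00:54", 999.0, 999.0)]  # a failed fix
with con:
    con.executemany("INSERT INTO gps_raw_data (device_id, acq_time, lon, lat) "
                    "VALUES (?, ?, ?, ?)", fixes)

# Data retrieval performance: an index speeds up time-based queries.
cur.execute("CREATE INDEX idx_acq_time ON gps_raw_data (acq_time)")

n_valid = cur.execute(
    "SELECT COUNT(*) FROM gps_raw_data WHERE valid = 1").fetchone()[0]
print(n_valid)  # the failed fix is flagged invalid by the trigger
```

In a production system the same mechanisms (triggers, transactions, indexes) would run on a client/server DBMS such as PostgreSQL rather than an embedded database.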

In addition to these features, most DBMSs can also support the spatial dimension of GPS tracking data. Spatial tools are increasingly integrated within databases that now accommodate native spatial data types (e.g. points, lines, polygons). These spatial DBMSs are designed to store, query and manipulate spatial data, including spatial reference systems (requirements 5, 6). Spatial databases integrate the geometric data types of spatial objects with standard data types that store the object's associated attributes (requirement 9). Some spatial databases also include support for storing and managing raster data (requirement 9). In a spatial database, spatial data types can be manipulated by a spatial extension of the structured query language (SQL; Shekhar & Chawla 2003; OGC 2006; Yeung & Hall 2007), where complex geospatial queries can be generated and optimized with specific spatial indexes. Today, practically all major relational databases offer native spatial information capabilities and functions in their products, including IBM DB2 (Spatial Extender), SQL Server (SQL Server 2008 Spatial), Oracle (Oracle Spatial), Informix (Spatial DataBlade), and the open source PostgreSQL (PostGIS), MySQL (Spatial Extension) and SQLite (Spatialite and SQLiteGeo), while ESRI ArcSDE is a middleware application that can spatially enable a set of DBMSs.

Spatial databases can be integrated with GIS software connected to the server database as client applications. Traditional GIS software focuses on specific analyses and data visualization, providing a rich set of spatial operations, but few GIS packages are optimized for managing large vector datasets and complex data structures. Spatial databases, in turn, offer simple spatial operations that can be efficiently performed on large sets of elements (Shekhar & Chawla 2003), such as GPS datasets. Thus, in an ideal information system, simple but massive operations are preprocessed within the spatial database, while more advanced spatial analyses rely on the GIS and spatial statistics packages connected to it (requirement 8).
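This division of labour can be sketched as follows, using SQLite and a hypothetical animal_location table: the database performs a simple but massive filtering operation, while the client computes a summary statistic (a stand-in for GIS or statistical processing). A true spatial database would use geometry types and a spatial index rather than plain coordinate columns.

```python
import random
import sqlite3

# 100 000 synthetic fixes for a hypothetical animal, stored in the database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE animal_location (animal_id INTEGER, lon REAL, lat REAL)")
random.seed(42)
con.executemany("INSERT INTO animal_location VALUES (1, ?, ?)",
                [(random.uniform(10, 12), random.uniform(45, 47))
                 for _ in range(100_000)])

# Simple but massive operation inside the database: a bounding-box filter.
rows = con.execute("""SELECT lon, lat FROM animal_location
                      WHERE lon BETWEEN 10.9 AND 11.1
                        AND lat BETWEEN 45.9 AND 46.1""").fetchall()

# Advanced analysis in the client: here a simple centroid stands in for
# GIS or statistical processing of the pre-filtered subset.
cx = sum(r[0] for r in rows) / len(rows)
cy = sum(r[1] for r in rows) / len(rows)
print(len(rows), round(cx, 2), round(cy, 2))
```

Only the filtered subset (about one per cent of the fixes in this example) crosses the database/client boundary, which is the point of preprocessing inside the database.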

Finally, the need for a cost-effective system architecture (requirement 13) can be met by using, partially or totally, open source software (James 2003). Highly relevant open source tools are available for wildlife GPS applications (Tufto & Cavallini 2005; Hall & Leahy 2008), addressing all the requirements listed in §2, including strong adherence to standards. Notable open source software tools available for the most widely used operating systems include: spatial databases (e.g. PostgreSQL, MySQL, SQLite), spatial libraries (e.g. PROJ4, GDAL/OGR, GEOS, Geotools, Geoserver), desktop GIS (e.g. GRASS GIS, Quantum GIS, uDig, gvSIG, ILWIS, SAGA), Web-GIS packages (e.g. UMN MapServer, OpenLayers, MapFish) and statistical tools (R and its several specific packages). R (R Development Core Team 2009, http://www.r-project.org), in particular, as an advanced open source environment for statistical computing and graphics, supports the quick implementation of new statistical and spatial analysis algorithms (requirement 8). Popular examples are the home range and trajectory analysis tools in the adehabitat package (Calenge et al. 2009). R functions can also be loaded inside the database itself as native procedures (Conway 2008).

A general schema of a client/server architecture based on a spatial database able to accommodate all the relevant data sources is illustrated in figure 1.

Schema of a possible client/server software system. Information from several data sources, including the core GPS data, is integrated in the central spatial database, where it is accessed, locally or remotely, by client applications for manipulation, visualization and analysis. Outputs are stored back in the database.

An example of a modular software platform developed to handle GPS telemetry data and built on a spatial database is offered by ISAMUD, described by Cagnacci & Urbano (2008). This is based on an open source spatial database (PostgreSQL and PostGIS) and includes, as client applications, a statistics and spatial statistics package (R), GIS software (QGIS; GRASS), Web-GIS (MapServer; Ka-Map), a database management tool (PGAdmin) and a user-friendly interface for data entry, querying and reporting (Microsoft Access).

4. Interoperability and standards

To meet requirements 7, 10 and 11 (heterogeneity of software client applications, multi-user environment, data sharing in a collaborative framework), international data and metadata standards must be adopted to ensure interoperability between different software platforms, both within and between organizations. Standards play an important role in improving data quality and can decouple data structures from the specific aim for which the data were collected. Adhering to such standards ensures that data can be reused for a wide range of purposes, maximizing the returns of research funding and facilitating multi-species, large-scale and long-term ecological studies.

Standards have been developed and applied to many forms of spatial data, but not yet for wildlife GPS data. For example, GPS devices from different vendors produce data outputs with no standardization, making different data sources hard to manage inside the same information system. At the moment, the recognized standards and initiatives relevant to GPS data and the spatial domain include: the Dublin Core Metadata Initiative (http://dublincore.org); Darwin Core (http://www.tdwg.org/activities/darwincore); and Access to Biological Collections Data (ABCD, http://rs.tdwg.org/abcd). Spatial functionality standards for database systems (De Smith et al. 2007) have been defined by the Open Geospatial Consortium (OGC 2006, http://www.opengeospatial.org). The International Organization for Standardization, Technical Committee for Geographic Information/Geomatics (ISO TC211, http://www.isotc211.org) is another organization that works on geospatial standardization, including spatial databases.

Standards for GPS tracking data urgently require further attention because the development and adoption of standards is a prerequisite for the integration of wildlife tracking data into national and international Spatial Data Infrastructures (SDI; Onsrud 2007).

5. At the core of the system: data modelling

As database systems grow in size and complexity, and user requirements get more sophisticated, (spatial) data modelling becomes more important. A data model in a database framework (typically, relational or object-relational) describes what type of data are stored and how they are organized. It can be defined as a conceptual representation of the real world in database structures that include data objects (i.e. tables), associations between data objects, and rules that govern operations on the objects, thus explicitly determining their spatial and non-spatial relationships (Shekhar & Chawla 2003; Arctur & Zeiler 2004; Yeung & Hall 2007).

The definition of a suitable data model for wildlife GPS data should take into account the research aims, the structure of GPS data, the technical environment, the policies governing the use of information and the expected performance of the application (Yeung & Hall 2007). Data modelling permits easy update, modification and adaptation of the database structure to accommodate changing goals, constraints and spatial scales of studies, and the evolution of wildlife tracking systems. Without a rigorous data modelling approach, an information system built for GPS tracking data might lose the flexibility to manage data efficiently in the long term, reducing its utility to a simple storage device for raw data, and thus failing to address many of the needs identified in the requirement analysis.

A reference conceptual data model in the context of GPS tracking should include at least two main objects, namely GPS devices and monitored individual animals. In figure 2, we propose a schema for the core elements of an example data model. Once GPS-based locations are stored in the database as spatial features (i.e. points), they can be intersected with other spatial layers using appropriate spatial SQL code, thus enriching the record with all the relevant environmental and socio-economic information describing the animal's habitat. Many other objects can be added and linked to these to extend the initial data model according to species, study objectives, information context, sensor devices, user environment and analysis. Thus, locations are managed not just as a pair of numbers (coordinates), but as complex multi-level objects. This perspective reduces the distance between physical reality and the way data are structured in the database, filling the semantic gap between the user's view of biological systems and its implementation in an information system (Shekhar & Chawla 2003; Nathan et al. 2008).

General representation of a possible standard database data model for core wildlife GPS data. When the GPS device provides the coordinates of a location at a specified time, they are uploaded into the database in a table (GPS_raw_data). A device is associated with an animal for a defined time range (table animal_GPS_device, connected to the tables animal and GPS_device with foreign keys, represented by solid arrows in the figure). According to the acquisition time, GPS-based locations are therefore assigned to a specific animal (i.e. the animal wearing the device at that particular moment) using the information in the table animal_GPS_device. Thus, the spatial table animal_location is filled (dashed arrow) with the identifier of the individual, its position (as a spatial object), the acquisition time and the identifier of the GPS device.
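The assignment step described in the caption can be sketched in a few lines of Python and SQL, using SQLite for illustration. Table names follow figure 2; the sample animal, device and deployment window are hypothetical, and coordinates are kept as plain columns rather than a true spatial data type.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE animal (animal_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE GPS_device (device_id TEXT PRIMARY KEY);
CREATE TABLE animal_GPS_device (   -- which animal wore which device, and when
    animal_id INTEGER REFERENCES animal,
    device_id TEXT REFERENCES GPS_device,
    start_time TEXT, end_time TEXT);
CREATE TABLE GPS_raw_data (device_id TEXT, acq_time TEXT, lon REAL, lat REAL);
CREATE TABLE animal_location (animal_id INTEGER, acq_time TEXT,
    lon REAL, lat REAL, device_id TEXT);

INSERT INTO animal VALUES (1, 'F01');
INSERT INTO GPS_device VALUES ('D001');
INSERT INTO animal_GPS_device VALUES (1, 'D001', '2009-01-01', '2009-06-30');
INSERT INTO GPS_raw_data VALUES
    ('D001', '2009-03-15 08:00:00', 11.05, 46.02),  -- within the deployment
    ('D001', '2009-08-01 08:00:00', 11.10, 46.05);  -- after device removal
""")

# The dashed arrow of figure 2: each raw fix is assigned to the animal
# wearing the device at acquisition time.
cur.execute("""
INSERT INTO animal_location
SELECT d.animal_id, r.acq_time, r.lon, r.lat, r.device_id
FROM GPS_raw_data r
JOIN animal_GPS_device d
  ON d.device_id = r.device_id
 AND r.acq_time BETWEEN d.start_time AND d.end_time""")

n = cur.execute("SELECT COUNT(*) FROM animal_location").fetchone()[0]
print(n)  # only the fix inside the deployment window is assigned
```

The join on the deployment time range is what makes the location a property of the animal rather than of the device, exactly as the data model intends.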

In the software platform cited as an example in §3 (ISAMUD; Cagnacci & Urbano 2008), the system imports location data from GPS collars via GSM connection into the database. Then, triggers automatically launch functions that transform the coordinates into a spatial attribute, intersect new points with vector environmental layers stored in the database (e.g. land cover, protected areas) and update the related attributes. Another SQL statement computes the linear distance, the time interval, and the relative and absolute angles between successive locations. A script connects GRASS to the database and intersects points with raster environmental layers (e.g. DEM, slope). PostGIS computes and stores trajectories and minimum convex polygons for selected points. Another script connects R to PostgreSQL to perform ecological spatial analyses (e.g. home range) and stores the results back in the database. A set of additional tools combines functions from PostgreSQL, PostGIS and GRASS to compute spatial statistics on locations, trajectories and home ranges based on their environmental attributes.
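The trajectory parameters mentioned above (step length, time interval, absolute and relative angles) can be sketched outside the database as well. This Python fragment uses hypothetical fixes in a projected coordinate system (metres) and is a planar approximation of what the SQL statements compute:

```python
import math
from datetime import datetime

# Hypothetical fixes (time, x, y); x and y are projected coordinates in metres.
fixes = [("2009-01-01 00:00", 0.0, 0.0),
         ("2009-01-01 04:00", 300.0, 400.0),
         ("2009-01-01 08:00", 300.0, 900.0)]

steps = []
for (t0, x0, y0), (t1, x1, y1) in zip(fixes, fixes[1:]):
    dt = (datetime.strptime(t1, "%Y-%m-%d %H:%M")
          - datetime.strptime(t0, "%Y-%m-%d %H:%M")).total_seconds()
    dist = math.hypot(x1 - x0, y1 - y0)       # step length (m)
    abs_angle = math.atan2(y1 - y0, x1 - x0)  # absolute direction (rad)
    steps.append((dt, dist, abs_angle))

# Relative (turning) angle: change of direction between successive steps,
# wrapped to (-pi, pi].
rel_angles = [(a2 - a1 + math.pi) % (2 * math.pi) - math.pi
              for (_, _, a1), (_, _, a2) in zip(steps, steps[1:])]

print([round(d, 1) for _, d, _ in steps])  # step lengths
print(round(rel_angles[0], 3))             # turning angle of the second step
```

In ISAMUD these quantities are computed inside the database, so every client application sees the same derived movement parameters without recomputing them.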

A further promising extension to spatial data models is the adoption of spatio-temporal data models. Animal locations are characterized by both a spatial and a temporal dimension, two inseparable facets of animal movement. Spatio-temporal databases (Pelekis et al. 2004) offer the opportunity to extend the spatial data model by integrating data types and functions specifically related to the spatio-temporal nature of animal movements, thus fully satisfying requirements 5 and 6 (e.g. considering movement as an attribute of the animal instead of relying on ‘location’ objects, and modelling the environment as an object changing over time). This approach would help to decipher the continuity of animal movement in relation to habitat use. Although commonly used DBMSs do not yet support an integrated spatio-temporal extension, we foresee that spatio-temporal databases, which are undergoing intense development (Pelekis et al. 2004), will be the natural evolution of wildlife tracking data management tools in the future.
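The conceptual shift from discrete ‘location’ objects to movement as a function of time can be illustrated with a minimal sketch. The linear interpolation between fixes is an assumption made for illustration, not a feature of any existing spatio-temporal DBMS:

```python
from datetime import datetime

class Trajectory:
    """Movement as an attribute of the animal: position is a function of time."""

    def __init__(self, fixes):
        # fixes: list of (datetime, x, y), sorted here by time
        self.fixes = sorted(fixes)

    def position_at(self, t):
        """Interpolated (x, y) at time t, or None outside the tracked period."""
        for (t0, x0, y0), (t1, x1, y1) in zip(self.fixes, self.fixes[1:]):
            if t0 <= t <= t1:
                f = (t - t0) / (t1 - t0)  # fraction of the step elapsed
                return (x0 + f * (x1 - x0), y0 + f * (y1 - y0))
        return None

traj = Trajectory([(datetime(2009, 1, 1, 0), 0.0, 0.0),
                   (datetime(2009, 1, 1, 4), 400.0, 0.0)])
print(traj.position_at(datetime(2009, 1, 1, 1)))  # (100.0, 0.0)
```

A spatio-temporal database would make such queries ("where was the animal at time t?") first-class operations, alongside environmental layers that themselves change over time.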

6. Integrating remote-sensing data

Using classical methods of applied ecological remote sensing (e.g. statistical image classification or manual/semi-automated digitizing of patches), information on animal habitat use can be extracted from classified satellite or aerial images (Fuller et al. 1998). Landscape metrics can also be applied to the classified images in order to obtain information about connectivity, patchiness, diversity and more (e.g. Frohn 1998; Leyequien et al. 2007). Texture analysis and image segmentation are frequently used (e.g. Tuttle et al. 2006). Terrain modelling is another relevant source of information to further enhance the production of habitat maps (Hengl & Reuter 2009). The extracted ecological variables may then be used as input for habitat or population models (e.g. Handcock et al. 2009). However, it is important to use relevant environmental data at the appropriate temporal and spatial scales (Martin et al. 2009; Gaillard et al. 2010). For ecological applications, a choice must usually be made between very high-resolution data (spatial resolution of a few metres, e.g. SPOT, or in the sub-metre range, e.g. QuickBird), which are often expensive and only generated on demand (acquisition requires a prior data order, hence no continuous temporal and spatial coverage is guaranteed), and high-resolution, low-cost data (spatial resolution of 15–30 m, e.g. ASTER and Landsat-ETM) or medium-resolution data (spatial resolution of 250 m or larger, e.g. MODIS), which are often available online for a nominal fee, or even for free. While the latter have high temporal and spatial coverage, their resolution does not usually match the scale of animal home ranges well. Ultimately, which kind of remote-sensing data and processing algorithm is selected for a wildlife tracking data management system strictly depends on study goals, scale and financial resources.
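As a minimal illustration of how resolution enters the workflow, the following sketch extracts a raster value at a GPS location by nearest-cell lookup; the grid, its origin and the 250 m cell size are hypothetical stand-ins for a medium-resolution product (in practice, libraries such as GDAL handle this):

```python
# Hypothetical 2 x 2 land-cover grid (codes arbitrary), 250 m cells,
# top-left corner at projected coordinates (0, 500).
def raster_value(grid, origin_x, origin_y, cell_size, x, y):
    """Nearest-cell lookup of a raster value at point (x, y)."""
    col = int((x - origin_x) // cell_size)
    row = int((origin_y - y) // cell_size)  # rows grow downwards from the top
    return grid[row][col]

grid = [[1, 2],
        [3, 4]]
v1 = raster_value(grid, 0, 500, 250, 60, 400)   # falls in row 0, col 0
v2 = raster_value(grid, 0, 500, 250, 300, 100)  # falls in row 1, col 1
print(v1, v2)
```

With a coarser cell size, distinct locations collapse into the same cell, which is precisely the mismatch with the scale of small home ranges noted above.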

7. Concluding remarks

Data management is often not considered a core scientific issue in ecological studies, probably as a result of a general reluctance to change work habits and undertake the initial costs of workflow and software updates, including expert consultancy and the training of personnel. This paper advocates the view that good management of GPS-based locations is an essential step towards better science. Proving this statement empirically is difficult, although the clearest evidence is the enhanced efficiency of analyses and the consistency of results. We suggest that dedicated management tools are needed and propose a client/server architecture based on a spatial database. This offers the opportunity to model location data as objects characterizing the presence of individuals in space and time within their habitat. We believe that the intrinsic consistency and integrity of spatial databases represent a necessary scientific infrastructure for rigorous science per se, preventing error propagation, optimizing the performance of analyses and improving the robustness of inferences. In particular, this favours the move from simple descriptive approaches towards mechanistic models with higher explanatory and predictive power (e.g. Millspaugh & Marzluff 2001; Morales et al. 2010; Smouse et al. 2010), focusing wildlife research on biological, rather than statistical, significance (Johnson 1999; Otis & White 1999; Ropert-Coudert & Wilson 2005).

To conclude, data management is increasingly becoming a necessary skill for ecologists, as has already happened with statistics and GIS. We expect further research into innovative software solutions to assist the wildlife scientific community in adopting better data management techniques.

Acknowledgements

We thank Mathieu Basille and five anonymous reviewers for their important notes and suggestions, which considerably improved the paper. We also thank Kamran Safi, Bart Kranstauber and Hawthorne Beyer, who provided interesting input and comments on an early version of the manuscript. Most ideas presented in this paper were stimulated by discussions at the GPS-Telemetry Data: Challenges and Opportunities for Behavioural Ecology Studies workshop organized by the Edmund Mach Foundation (FEM) in September 2008, held in Viote del Monte Bondone, Trento, Italy. Funding of the workshop by the Autonomous Province of Trento is gratefully acknowledged.

References

Fuller, R. M. et al. 1998 The integration of field survey and remote sensing for biodiversity assessment: a case study in the tropical forests and wetlands of Sango Bay, Uganda. Biol. Conserv. 86, 379–391. (doi:10.1016/S0006-3207(98)00005-6)

2010Selective reprogramming of mobile sensor networks through social community detection. Proc. of the 7th European conference on wireless sensor networks (EWSN 2010), Coimbra, Portugal, 17–19 February 2010.