Abstract: Recent studies have shown a worrying decline in the quantity and diversity of insects at a number of locations in Europe (Hallmann et al. 2017) and elsewhere (Lister and Garcia 2018). Although the downward trend these studies show is clear, they are limited to certain insect groups and geographical locations. Most available studies (see overview in Sánchez-Bayo and Wyckhuys 2019) were performed in nature reserves, leaving rural and urban areas largely understudied. Moreover, most studies are based on long-term collaborative efforts of entomologists and volunteers performing labor-intensive repeat measurements, inherently limiting the number of locations that can be monitored.
We propose a monitoring network for insects in the Netherlands, consisting of a large number of smart insect cameras spread across nature, rural, and urban areas. The aim of the network is to provide labor-extensive, continuous monitoring of different insect groups. In addition, we aimed to develop the cameras at a relatively low price point, so that they can be installed at a large number of locations and encourage participation by citizen science enthusiasts. The cameras are made smart through deep-learning-based image processing: image enhancement, insect detection, and species identification. Each camera takes a picture of a screen, measuring ca. 30×40 cm, every 10 seconds, capturing insects that have landed on the screen (Fig. 1). Several screen setups were evaluated. Vertical screens were used to attract flying insects, with different screen colors, and with lighting at night to attract night-flying insects such as moths. In addition, two horizontal screen orientations were used: (1) to emulate pan traps that attract several pollinator species (bees and hoverflies), and (2) to capture ground-based insects and arthropods such as beetles and spiders.
Time sequences of images were analyzed semi-automatically, in the following way. First, single insects were outlined and cropped using boxes in every captured image. The cropped insects in every image were then given a preliminary identification using previously developed deep-learning-based species identification software, the Nature Identification API (https://identify.biodiversityanalysis.nl). Next, single insects were linked between consecutive images using a tracking algorithm based on screen position and the preliminary identifications. This step yields, for every individual insect, a linked series of outlines and preliminary identifications. Because the preliminary identifications of an individual insect can differ between images, they were combined into a single identification using a fusing algorithm. The result is a series of tracks of individual insects with species identifications, which can subsequently be translated into an estimate of the counts of insects per species or species complex.
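The sketch below illustrates the shape of this track-and-fuse step, assuming hypothetical per-frame detection records (frame index, box centre, preliminary label with a confidence score). The greedy nearest-neighbour linker and the confidence-sum vote stand in for the actual tracking and fusing algorithms, which the abstract does not specify.

```python
# Minimal track-and-fuse sketch; not the authors' implementation.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Detection:
    frame: int
    x: float          # box centre on the screen, in pixels
    y: float
    species: str      # preliminary identification for this frame
    confidence: float

@dataclass
class Track:
    detections: list = field(default_factory=list)

def link_tracks(detections, max_dist=40.0):
    """Greedily link each detection to the nearest track open in the previous frame."""
    tracks, open_tracks = [], []
    by_frame = defaultdict(list)
    for d in detections:
        by_frame[d.frame].append(d)
    for frame in sorted(by_frame):
        next_open = []
        for d in by_frame[frame]:
            best, best_d2 = None, max_dist ** 2
            for t in open_tracks:
                last = t.detections[-1]
                d2 = (last.x - d.x) ** 2 + (last.y - d.y) ** 2
                if d2 < best_d2:
                    best, best_d2 = t, d2
            if best is None:            # no nearby track: start a new one
                best = Track()
                tracks.append(best)
            best.detections.append(d)
            next_open.append(best)
        open_tracks = next_open
    return tracks

def fuse_identification(track):
    """Combine per-frame identifications into one label by summed confidence."""
    score = defaultdict(float)
    for d in track.detections:
        score[d.species] += d.confidence
    return max(score, key=score.get)

dets = [Detection(0, 100, 100, "Episyrphus balteatus", 0.7),
        Detection(1, 104, 98, "Eristalis tenax", 0.4),
        Detection(2, 101, 102, "Episyrphus balteatus", 0.8)]
print(fuse_identification(link_tracks(dets)[0]))  # -> Episyrphus balteatus
```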
Here we show the first set of results, acquired during the spring and summer of 2019. We will discuss practical experiences with setting up cameras in the field, including the effectiveness of the different setups. We will also show how well automatic species identification performs on the type of images acquired (see attached figure) and discuss to what extent individual species can be identified reliably. Finally, we will discuss the ecological information that can be extracted from the smart insect cameras.

Abstract: The potential of citizen scientists to contribute information about occurrences of species and other biodiversity questions is large, because of the ubiquitous presence of organisms and the approachable nature of the subject. Online platforms that collect observations of species from the public have existed for several years now and have seen rapid growth recently, partly due to the widespread availability of mobile phones. These online platforms, and many scientific studies as well, suffer from a taxonomic bias: the effect that certain species groups are overrepresented in the data (Troudet et al. 2017). One of the reasons for this bias is that accurate identification of species, by non-experts and experts alike, is limited by the large number of species that exist. Even in the geographically limited area of the Netherlands and Belgium, the number of species that are regularly observed is in the thousands, making it difficult or impossible for an individual to identify them all.
Recent advances in deep-learning-based species identification from images (Norouzzadeh et al. 2018) suggest a large potential for a new set of digital tools that can help the public (and experts) identify species automatically. The online observation platform Observation.org has collected over 93 million occurrences in the Netherlands and Belgium over the last 15 years. About 20% of these occurrences are supported by photographs, yielding a rich database of 17 million photographs covering all major species groups (e.g., birds, mammals, plants, insects, fungi). Most of the observations with photos were validated by human experts at Observation.org, creating a unique database suitable for machine learning. We have developed a deep-learning-based species identification model using this database, covering 13,767 species, 1,530 species groups, 734 subspecies and 117 hybrids. The model is made available to the public through a web service (https://identify.biodiversityanalysis.nl) and a set of mobile apps (ObsIdentify).
In this talk we will discuss our technical approach for dealing with the large number of species in a deep learning model. We will evaluate the results in terms of performance for different species groups and what this could mean for addressing part of the taxonomic bias. We will also consider the limitations of (image-based) automated species identification and identify avenues to further improve it. We will illustrate how the web service and mobile apps are used to support citizen scientists and the observation validation workflows at Observation.org. Finally, we will examine the potential of these methods to provide large-scale automated analysis of biodiversity data.

Abstract: <p class="MsoNormal">Research processes in biodiversity are evolving at a rapid pace, particularly regarding data-related steps from collection to analysis. This evolution, mainly due to technological advances, offers equipment that is more powerful and generalizes the digitalization of research data and associated products. It is now urgent to accelerate good practices in scientific data management and analysis in order to offer products and services corresponding to the new context, presenting more and more openness, requiring more and more FAIRness (Wilkinson et al. 2016). Using Information and Communication Technology (ICT) as international standards and software (Ecological Metadata Language and associated solutions for metadata management, Galaxy web platform for data analysis), we propose, through the national research e-infrastructure called "Pôle national de données de biodiversité" (or PNDB, formerly ECOSCOPE), to build a new type of Biodiversity Virtual Research Environment (VRE) for French communities. Although deployment of this kind of environment is challenging, it represents an opportunity to pave the way towards better research processes through enhanced collaboration, data management, analysis practices and resources optimization.

Abstract: <p class="MsoNormal">
Most biodiversity research aims at understanding the states and dynamics of biodiversity and ecosystems. To do so, biodiversity research increasingly relies on the use of digital products and services such as raw data archiving systems (e.g. structured databases or data repositories), ready-to-use datasets (e.g. cleaned and harmonized files with normalized measurements or computed trends) as well as associated analytical tools (e.g. model scripts in Github). Several world-wide initiatives facilitate the open access to biodiversity data, such as the Global Biodiversity Information Facility (GBIF) or GenBank, Predicts etc. Although these pave the way towards major advances in biodiversity research, they also typically deliver data products that are sometimes poorly informative as they fail to capture the genuine ecological information they intend to grasp. In other words, access to ready-to-use aggregated data products may sacrifice ecological relevance for data harmonization, resulting in over-simplified, ill-advised standard formats. This is singularly true when the main challenge is to match complementary data (large diversity of measured variables, integration of different levels of life organizations etc.) collected with different requirements and scattered in multiple databases. Improving access to raw data, and meaningful detailed metadata and analytical tools associated with standardized workflows is critical to maintain and maximize the generic relevance of ecological data. Consequently, advancing the design of digital products and services is essential for interoperability while also enhancing reproducibility and transparency in biodiversity research. To go further, a minimal common framework organizing biodiversity observation and data organization is needed. In this regard, the Essential Biodiversity Variable (EBV) concept might be a powerful way to boost progress toward this goal as well as to connect research communities worldwide.
</p>
<p class="MsoNormal">As a national Biodiversity Observation Network (BON) node, the French BON is currently embodied by a national research e-infrastructure called "Pôle national de données de biodiversité" (PNDB, formerly ECOSCOPE), aimed at simultaneously empowering the quality of scientific activities and promoting networking within the scientific community at a national level. Through the PNDB, the French BON is working on developing biodiversity data workflows oriented toward end services and products, both from and for a research perspective. More precisely, the two pillars of the PNDB are a metadata portal and a workflow-oriented web platform dedicated to the access of biodiversity data and associated analytical tools (Galaxy-E). After four years of experience, we are now going deeper into metadata specification, dataset descriptions and data structuring through the extensive use of Ecological Metadata Language (EML) as a pivot format. Moreover, we evaluate the relevance of existing tools such as Metacat/Morpho and DEIMS-SDR (Dynamic Ecological Information Management System - Site and dataset registry) in order to ensure a link with other initiatives like Environmental Data Initiative, DataOne and Long-Term Ecological Research related observation networks. Regarding data analysis, an open-source Galaxy-E platform was launched in 2017 as part of a project targeting the design of a citizen science observation system in France (“65 Millions d'observateurs”). </p>
<p class="MsoNormal">
Here, we propose to showcase ongoing French activities towards global challenges related to biodiversity information and knowledge dissemination. We particularly emphasize our focus on embracing the FAIR (findable, accessible, interoperable and reusable) data principles Wilkinson et al. 2016 across the development of the French BON e-infrastructure and the promising links we anticipate for operationalizing EBVs. Using accessible and transparent analytical tools, we present the first online platform allowing the performance of advanced yet user-friendly analyses of biodiversity data in a reproducible and shareable way using data from various data sources, such as GBIF, Atlas of Living Australia (ALA), eBIRD, iNaturalist and environmental data such as climate data.

Conference Abstract: Lessons from the First Year of the International Bio-Logging Society's Data Standardisation Working Group (https://biss.pensoft.net/article/38919/)

Biodiversity Information Science and Standards 3: e38919

DOI: 10.3897/biss.3.38919

Authors: Peggy Newman, Francesca Cagnacci, Sarah Davidson
Abstract: The Data Standardisation Working Group pursues the recently formed International Bio-Logging Society's (IBioLS) objective "to progress standardisation of data protocols used within the bio-logging community, with a view to making databases interoperable". During 2017 and 2018, the group garnered considerable interest across the sector, with well over 200 colleagues and broad international representation from device manufacturers, researchers, biodiversity data experts and bio-logging database managers.
Through a series of remote meetings, the group has explored a range of existing, relevant standards, projects and platforms that could be leveraged to facilitate data decoding, exchange, archiving and discoverability.
This presentation will highlight some of the research and examples discussed by the group, including the Open Geospatial Consortium (OGC) and W3C sensor-based standards being adopted in similar sectors; Darwin Core and the OBIS-ENV-DATA Darwin Core format as a way to define completed datasets; NASA's Oceanographic In-situ data Interoperability Project (OIIP), which developed prototype templates for bio-logging data; and the NERC Vocabulary Server for managing persistent terms.
Considerable challenges lie ahead in resourcing the development of standards, enabling technical leadership, and negotiating governance and consensus in a domain where most of the stakeholders participate in a common market as either manufacturers or consumers of sensor infrastructure.

Abstract: 'Notebook' is one of the primary data management systems of the Finnish Biodiversity Information Facility (FinBIF). It is a web solution for recording opportunistic as well as sampling-event-based species observations, used for systematic monitoring schemes, various citizen science projects, and platforms for species enthusiasts. Notebook's main software component is LajiForm, the engine that renders a given JSON Schema into a web form. LajiForm is a separate, reusable module that is fully independent of other FinBIF systems. Notebook as a whole includes other features embedded in FinBIF, such as linking users' geographical data to observation documents, spreadsheet document importing, and form templates. We will demonstrate how the Notebook system works as a whole and also focus on LajiForm's technical aspects (Fig. 1).
All Notebook forms use FinBIF's ontological schema in JSON Schema format. Rendering user-friendly web forms from a single schema is a difficult task, because the web form should ask meaningful questions instead of simply rendering the schema fields. We want to present questions in an interactive style. For instance, after a user draws a geographical location on a map for a potential flying squirrel nesting tree, we would ask "did you see droppings at the nest?", and answering "yes" would update the document to include a flying squirrel taxon identification with the fields "breeding" and "record basis" filled in but not rendered on the form. A simpler form engine without a user interface (UI) customization layer would just render the "taxon", "breeding" and "record basis" fields, and the user would have no understanding of why there are so many fields to answer and how they relate to their work or study. Some forms are complex, e.g., for experienced biology enthusiasts who need a form that is advanced, customizable, and compact. Some forms are simple, e.g., for elementary school children.
To tackle these challenges, LajiForm uses a separate schema for the UI (a minimal example follows the list) that allows everything from simple customization, such as:
- defining widgets for fields (e.g., date widgets, a taxon autocomplete widget, a map widget),
- changing the field order, or
- customizing field labels;
to more complex customization, such as:
- transforming the schema object structure,
- defining conditions under which certain fields are shown, or
- specifying whether updating a field should affect other fields.
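As a minimal illustration of this split between a data schema and a UI schema, here is a hedged sketch in the style of react-jsonschema-form, on which LajiForm is built (see below), shown as Python dictionaries for brevity. The field names and the custom widget name are hypothetical, not LajiForm's actual form definitions; "ui:order" and "ui:widget" follow RJSF conventions.

```python
# A data schema and a separate UI schema; field and widget names are illustrative.
schema = {
    "type": "object",
    "properties": {
        "taxon": {"type": "string", "title": "Species"},
        "date": {"type": "string", "format": "date"},
        "count": {"type": "integer", "minimum": 1},
        "recordBasis": {"type": "string"},
    },
    "required": ["taxon", "date"],
}

ui_schema = {
    "ui:order": ["date", "taxon", "count", "recordBasis"],  # reorder fields
    "taxon": {"ui:widget": "taxonAutocomplete"},  # app-registered custom widget
    "recordBasis": {"ui:widget": "hidden"},       # filled in, but not rendered
}
```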
All the functionality is split into a loosely coupled collection of components, which can either be used as standalone components or composed together to achieve more advanced customization. The programming philosophy has drawn inspiration from functional programming, which has been helpful in writing isolated, composable functionality.
LajiForm is written with the JavaScript framework React. It is built on top of react-jsonschema-form (RJSF), an open-source JSON Schema web form library provided by Mozilla. RJSF handles only simple customization, but its flexible design allows us to build extensions with more powerful features. Some features and design proposals were submitted back to Mozilla: FinBIF is the largest code contributor to RJSF outside of Mozilla, with a dozen pull requests merged.

Abstract: DaRWIN (Data Research Warehouse Information Network) is an in-house solution developed by the Royal Belgian Institute of Natural Sciences (RBINS) as a natural history collections management system for biological and geological samples. In 2014, the Royal Museum for Central Africa (RMCA) adopted this system for its collections and started to take part in new developments.
The DaRWIN database currently manages information on more than 600,000 records (about 4 million specimens) housed at the RBINS and more than 650,000 records (more than 1 million specimens) at the RMCA.
DaRWIN is an open-source system, consisting of a PostgreSQL database and a customizable web interface based on the Symfony framework (https://symfony.com).
DaRWIN is divided into two parts:
- a public section that gives read-only access to digitised specimens, and
- a section for registered users, with different levels of access rights (user, encoder, conservator and administrator), customizable for each collection, allowing updates of specimens and collections, daily collection management, and the handling of sensitive information.
DaRWIN stores sample data and related information such as place and date of collection, missions and collectors, identifiers, technicians involved, taxonomy, identification information (type, stage, state, etc.), bibliography, related files, storage, etc. Other features deal with day-to-day curation operations: loans, printing of labels for storage, statistics and reporting. DaRWIN features its own JSON (JavaScript Object Notation) webservice for specimens and scientific names, and can export data in tab-delimited, Excel, PDF and GeoJSON formats.
More recently, a procedure for importing batches of data has been developed, based on tab-delimited files, making integration of data from (old/historical) databases faster and more controlled.
Additional improvements to the user interface and database model have been made. For example, parallel taxonomic hierarchies can be created, allowing users to work with temporary taxonomies and old scientific names (basionyms and synonyms), and to document the history of type specimens.
Finally, quality control and data cleaning have been implemented on several tables, e.g. mapping locality names to vocabularies like GeoNames, adding ISO 3166 two-letter country codes (https://www.iso.org/iso-3166-country-codes.html), and cleaning duplicates from the people/institutions and taxonomy catalogues. A tool for checking taxonomic names against GBIF (Global Biodiversity Information Facility), WoRMS (World Register of Marine Species) and DaRWIN itself, based on webservices and tab-delimited files, has been developed.
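As a minimal sketch of what such a name-checking step can look like (not DaRWIN's actual tool), the following queries GBIF's public species-match webservice for a single name:

```python
# Check one scientific name against GBIF's species-match webservice.
import requests

def check_name_gbif(name: str) -> dict:
    """Return GBIF's best backbone match for a scientific name."""
    resp = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": name, "verbose": "true"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

match = check_name_gbif("Gorilla beringei graueri")
print(match.get("matchType"), match.get("scientificName"), match.get("status"))
```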
Last year, RBINS, RMCA and Meise Botanic Garden (MBG) defined a new framework of collaboration in the NaturalHeritage project (http://www.naturalheritage.be), in order to foster interoperability among their collection data sources. This framework takes the form of a common research portal for natural history collection data (from DaRWIN and other existing collection databases) of the three partner institutions, and makes the data compliant with a standard agreed by the partners. See the poster "NaturalHeritage: Bridging Belgian Natural History Collections" for more information.
DaRWIN is accessible online (http://darwin.naturalsciences.be). A GitHub repository is also available (https://github.com/naturalsciences/natural_heritage_darwin).

Abstract: Most digitisation workflows focus on legacy material, due to the sheer number of objects already collected. However, it is just as important to develop protocols for the digitisation of incoming material, to avoid accumulating an additional backlog. This is especially crucial with the advent of molecular collections and field sequencing. In-the-field extraction and sequencing (Oxford Nanopore Technologies 2018) may lead to increasing numbers of voucher specimens without proper collection data and labels, or specimens disassociated from their data. It is easy for researchers occupied with collecting and sequencing to delay proper documentation until a later date. As a curator, I can vouch that specimens without properly recorded data (with only collecting codes, for example) are lost to science. Fortunately, a combination of best collecting and curatorial practices, simple online and offline tools, and modern technologies makes in-the-field digitisation a reality.
In the last couple of years, entomologists at National Museums Scotland (NMS) have been testing the following workflow:
1. Collecting routes and points are recorded with ViewRanger (Augmentra Ltd 2019), available as an app for mobile phones;
2. At the moment of collecting, event data are recorded with Epicollect5 (Imperial College London 2019), available as an Android app. The software's form generator allows the creation of different scenarios, depending on the method or circumstances of collection, and records the main types of data: text, dates, times, and coordinates. An individual collecting code is associated with each record;
3. Specimens collected are prepared (pinned, stored in preservative, dried, etc.) and associated with the corresponding collecting code;
4. Additional data (diary records) are recorded in a notebook with a Neo Smartpen (NEO SMARTPEN Inc. 2017) and digitised;
5. Collecting event records are imported into a collection management system (CMS), such as PAPIS (Pape and Ioannou 2019) or EarthCape (EarthCape 2019);
6. Specimen lots (if relevant) are sorted to a desirable level;
7. Multiple specimen or lot records are created in the CMS based on the collecting event records (see the sketch below);
8. Data labels and UID labels are printed and physically associated with specimens or lots;
9. Additional data (KML file of the collecting route, diary records) are imported and associated with collecting events.
Steps 1-4 and, depending on available facilities, steps 5-9 can be performed in the field, before specimens reach the depository. Alternatively, steps 5-9 should be performed immediately on returning from the field.
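A hypothetical sketch of step 7, expanding collecting-event records into per-specimen records via the shared collecting code (the column names are illustrative, not the actual Epicollect5 export or PAPIS/EarthCape schemas):

```python
# Expand collecting events into specimen records linked by collecting code.
import csv

def specimens_from_events(events_csv, counts):
    """Create per-specimen records from an event export.

    counts maps a collecting code to the number of specimens prepared for it.
    """
    records = []
    with open(events_csv, newline="") as fh:
        for event in csv.DictReader(fh):
            code = event["collecting_code"]
            for i in range(1, counts.get(code, 0) + 1):
                records.append({
                    "uid": f"{code}-{i:03d}",   # printed as a UID label (step 8)
                    "collecting_code": code,
                    "event_date": event["date"],
                    "latitude": event["lat"],
                    "longitude": event["lon"],
                    "method": event["method"],
                })
    return records
```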
There is no excuse for newly collected material not to be digitised before it reaches the collection. Recent entomological collecting trips of NMS yielded 7,358 specimens from 72 collecting events, fully documented and digitised in a matter of hours.

Abstract: The Encyclopedia of Life (EOL) currently hosts ~8 million attribute records for ~400,000 taxa (March 2019, not including geographic categories; Fig. 1). Our aggregation priorities include Essential Biodiversity Variables (Kissling et al. 2018) and other global-scale research data priorities. Our primary strategy remains partnership with specialist open data aggregators; we are also developing tools for the deployment of evolutionarily conserved attribute values that scale quickly to global taxonomic coverage, for instance: tissue mineralization type (aragonite, calcite, silica...), trophic guild in certain clades, and sensory modalities.
To support the aggregation and integration of trait information, data sets should be well structured, properly annotated, and free of licensing or contractual restrictions, so that they are 'findable, accessible, interoperable, and reusable' for both humans and machines (the FAIR principles; Wilkinson et al. 2016). To this end, we are improving the documentation of protocols for the transformation, curation, and analysis of EOL data, and associated scripts and software are made available to ensure reproducibility. Proper acknowledgement of contributors and tracking of credit through derived data products promote both open data sharing and the use of aggregated resources. By exposing unique identifiers for data products, people, and institutions, data providers and aggregators can stimulate the development of automated solutions for the creation of contribution metrics. Since different aspects of provenance will be significant depending on the intended data use, better standardization of contributor roles (e.g., author, compiler, publisher, funder) is needed, as well as more detailed attribution guidance for data users.
Global-scale biodiversity data resources should resolve into a graph, linking taxa, specimens, occurrences, attributes, localities, and ecological interactions, as well as human agents, publications and institutions. Two key data categories for ensuring rich connectivity in the graph will be taxonomic and trait data. This graph can be supported by existing data hubs if they share identifiers and/or create mappings between them, using standards and sharing practices developed by the biodiversity data community. Versioned archives of the combined graph could be published at intervals to appropriate open data repositories, with open-source tools and training provided for researchers to access the combined graph of biodiversity knowledge from all sources. To achieve this, good communication among data hubs will be needed: we will need to share information about preferred vocabularies and identifier management practices, and collaborate on identifier mappings.
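A toy sketch of the kind of linked graph described here, built with rdflib and shared identifiers; the predicate vocabulary is an illustrative placeholder, not an agreed community standard:

```python
# Link a specimen, a taxon, a person, and a trait with shared identifiers.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("https://example.org/vocab/")   # hypothetical vocabulary
g = Graph()

taxon = URIRef("https://www.wikidata.org/entity/Q140")     # shared taxon ID (lion)
person = URIRef("https://orcid.org/0000-0002-1825-0097")   # example ORCID iD
specimen = URIRef("https://example.org/specimen/12345")    # hypothetical specimen

g.add((specimen, EX.ofTaxon, taxon))
g.add((specimen, EX.recordedBy, person))
g.add((taxon, EX.trophicGuild, Literal("carnivore")))

for s, p, o in g:
    print(s, p, o)
```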

Abstract: Crop wild relatives (CWR) are wild plants that are the ancestors of crops important for human well-being. CWR hold genetic diversity that can be vital for plant breeding programs and the sustainability of agriculture, particularly given global change. Conservation of CWR genetic diversity has thus become a global food security issue, and several countries are actively developing conservation strategies, including the generation of a national checklist and inventory of CWR, the assessment of current threat status, the identification of knowledge and conservation gaps, and the establishment of genetic reserves. In this context, Mexico, Guatemala, and El Salvador, in collaboration with experts abroad (University of Birmingham, UK, and IUCN), are working together on a project to contribute towards safeguarding Mesoamerican CWR (http://www.psmesoamerica.org/en/).
One important step is to identify CWR conservation area networks framed within the systematic conservation planning approach. However, genetic diversity is generally not addressed during the planning process. As it is unfeasible to sample and perform genetic analyses of hundreds of species within limited timeframes and conservation budgets, we propose a novel approach to overcome the lack of genetic data. We used two criteria to develop proxies for genetic diversity (PGD):
1. environmental variability, as given by climatic, soil and topographic spatially defined variables; and
2. historic differentiation, as shown by phylogeographic patterns found in other species of the same habitat and region.
We tested our approach using genomic data from an empirical study of maize wild relatives distributed in Mexico. By combining species distribution models of 120 Mesoamerican CWR taxa with 102 PGD, we delimited areas of potential population differentiation. Furthermore, we considered each taxon's IUCN Red List category and habitat preference, assessed by experts during the project, to determine areas for CWR conservation in Mexico, using the Zonation conservation planning tool.
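Conceptually, the intersection step can be pictured as follows: a minimal sketch with made-up arrays standing in for the SDM and PGD rasters (the project's actual prioritization was done with Zonation):

```python
# Intersect a predicted range (SDM) with zones from genetic-diversity proxies.
import numpy as np

rng = np.random.default_rng(0)
sdm = rng.random((100, 100)) > 0.6           # presence predicted by the SDM
pgd_zones = rng.integers(1, 5, (100, 100))   # four hypothetical PGD zones

# Each zone-range intersection is treated as a separate "population" feature
# for spatial prioritization.
for zone in np.unique(pgd_zones):
    cells = int(np.sum(sdm & (pgd_zones == zone)))
    print(f"PGD zone {zone}: {cells} predicted-presence cells")
```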
Areas identified as important for CWR in situ conservation are located within sites of high cultural diversity and in areas where agriculture originated and traditional agriculture is ongoing. Our study design also maximizes the representation of CWR throughout their distributions, highlighting the need for comprehensive analyses that encompass the genetic variability of taxa. The results of this work represent a first national and regional guide to promote CWR in situ conservation and sustainable management, contributing towards achievement of the CBD Global Strategy for Plant Conservation, the Sustainable Development Goals and the Aichi Targets.

Abstract: The global aims of the biodiversity field are to understand the underlying mechanisms of nature, document and capture the state and dynamics of ecosystems, and build predictive models for the future. This understanding is based on access to and use of data, models and analysis tools, produced in ever-greater quantities, and used by diverse communities tackling different aspects of biodiversity through observations, collections, sampling and experimental data.
The analysis of biodiversity data is essential for ecosystem services, risk analysis, and human well-being. The impact goes well beyond provisioning for material welfare and livelihoods, to include food security, resilience, social relations, health, and environmental indicators. Species loss has dramatically accelerated around the world and now poses an existential threat to some ecosystems and susceptible human societies. There is an urgent need to: 1) collect, preserve and share FAIR data on species and ecosystems before they are lost to the scientific record, and 2) provide automatic workflows producing biodiversity indicators, so that researchers, planners and policy-makers have evidence-based models to understand the complex dynamics of biodiversity.
To accelerate progress, both in the completeness and coverage of data and in the richness of available information, all relevant sources of data must be aggregated, including sample-based data sets, ecogenomics, molecular research, remote sensing, literature records, local and regional checklists, and expert knowledge. These resources, records and diverse data types should be used not only as a source of occurrence information, but also as an effective discovery tool for species abundance, community composition, and interrelated genetic data.
Towards these long-term aims, the partners of the BiodiFAIRse IN plan to build a virtual research environment and tools, collectively bringing their expertise to FAIR compliance by adapting data exchange standards, promoting the use and mapping of controlled vocabularies, and collaborating in the development of registries gathering FAIR research objects and processes, analysis tools, and scalable workflows.

Abstract: The last two decades have seen the development of virtual morphology (ViMo), which emerged during the late 20th century through the application of medical imaging techniques to the study of fossil hominins (Spoor et al. 1994, Zollikofer et al. 1995, Conroy et al. 1998). The ViMo workflow has evolved successively: first through the building of digital databases of fossil hominins, followed by digital reference collections, then through the development of virtual 3D geometric morphometrics and, more recently, 3D printing (Fig. 1; Bastir et al. 2019). ViMo workflows have led to a renaissance of morphological studies of diversity in the evolutionary Earth and life sciences.
The aim of this presentation is to briefly present standard workflows in the Virtual Morphology Lab of the Museo Nacional de Ciencias Naturales, Madrid, and to show, more generally, how ViMo technologies, together with paradigmatic changes in science (open access, digital databases), contribute to boosting current research in human paleontology.
The accidental discovery, in 2013, of fossil remains of a new human species, Homo naledi, in the Rising Star cave system, South Africa, has produced a large and important collection documenting early hominin diversity (Berger et al. 2015). In light of the huge amount of fossil material, a new research strategy was adopted: social media and an open-access policy were used to organise a workshop focussed on the study of this new fossil collection, based on data sharing and global collaboration.
Thanks to this strategy, H. naledi was published very soon after its discovery (Berger et al. 2015), and the digitized fossils were simultaneously made available to the public via MorphoSource, an open-access database. As a consequence, only five years later, more than 30 scientific publications have accrued almost 600 citations. This productivity is much higher than for any other recently discovered hominin species. Thus, 13 years after "glasnost" was proclaimed for paleoanthropology (Weber 2001), H. naledi has provided the first real example illustrating how open access to digital collections accelerates and modifies research and dissemination in human paleontology.

Abstract: The world's natural history collections represent a vast repository of information on the natural and cultural world, collected over 250 years of human exploration and distributed across institutions on six continents. These collections provide a unique tool for answering fundamental questions about biological, geological and cultural diversity and how they interact to shape our changing planet.
Recent advances in digital and genomic technologies promise to transform how natural history collections are used, especially with respect to addressing scientific and socio-economic challenges ranging from biodiversity loss, invasive species and food security to climate change, scarce minerals, and emerging tropical diseases. It is not clear, however, how ready these collections are to meet this challenge, because relatively little is known about their size, composition or geographical distribution. Similarly, relatively little is known about the extent, expertise or demography of their curatorial workforce.
To address these questions, a large collaborative team of directors and scientists has collated a global database on natural history collections that comprises more than 70 of the world's largest institutions, including museums, botanic gardens, research institutes and universities. The institutions represented in the database span Africa, Asia, Australasia, Europe, and North and South America, with approximately one third of institutions from each of the Global South, Europe and North America. The database includes information on the numbers of specimens and experts with respect to both geographic regions and collection categories. Geographic regions include both the terrestrial and marine realms, and collection categories span anthropology, botany, entomology, geology, paleobiology, and vertebrate and invertebrate zoology.
Analyses of this new database reveal that the global natural history collection represents one of the most extensive distributed scientific infrastructures in the world, comprising more than 1 billion specimens curated by a workforce of more than 7,000 individuals. The analyses also indicate, however, that a major change in approach is required for these collections to realize their potential to inform future decision-making and stimulate the basic research that underpins future questions and knowledge. For instance, at a global scale the collections and expertise do exist to map change in key groups and regions, but this requires large-scale coordination across institutions and countries. Similarly, cross-institution collaboration is required to fill strategic gaps in the collection, particularly for tropical, marine and polar regions. Finally, there is an urgent need for coordinated investment in digital and genomic technologies to make collections available to the global research community and link them with other sources of information: the vast majority of collection information currently exists as 'dark data'.
We conclude that the global natural history collection comprises one of the most extensive distributed scientific infrastructures in the world, but that a major change in approach is required for collections to realize their potential to inform future decision-making. In particular, natural history collections need to work together more effectively to develop a global strategy, create a common data platform, accelerate the availability and use of specimen data, and pursue major new collecting programs.

Abstract: The Catalogue of Life Data Package (COLDP) format was developed to overcome limitations in the formats currently used for sharing taxonomic information, namely Darwin Core Archives (DwC-A) and the Catalogue of Life (CoL) submission format, also known as the Annual Checklist Exchange Format (ACEF). The tabular, text-based format, strongly influenced by DwC-A and ACEF, was designed to work with the Frictionless Data Package specification, but includes support for other well-established formats, such as BibTeX and CSL-JSON, for literature citations. It is the recommended exchange format for data published to, and downloaded from, the CoL Clearinghouse.

Conference Abstract: Challenges in the Development and Curation of Species-plot Datasets in South Africa: The National Vegetation Database of phytosociological plots as a case study (https://biss.pensoft.net/article/38675/)

Biodiversity Information Science and Standards 3: e38675

DOI: 10.3897/biss.3.38675

Authors: Brenda Daly
Abstract: The South African National Biodiversity Institute is the custodian of numerous national-level botanical and zoological datasets that have been collated over several decades, and is mandated to ensure that taxonomic and ecological data are made available to the public through responsible data sharing. Using the National Vegetation Database as a case study, this study describes the nature of the data and presents and discusses relevant standards, the process adopted in developing a vegetation-plot database, and current data management practices at the various stages of research data management. Phytosociological data are records of vegetation abundance, richness, density and associated environmental variables within a specified area or plot, usually including a record of locality. The study aims to review the diversity of approaches to storing species-plot information in databases and to provide minimum data standards for these datasets.
The surveying, classifying, and mapping of vegetation enables monitoring of ecosystems and ultimately can lead to improved conservation planning and land management. A coordinated and integrated approach is therefore needed to record, rectify, and manage these data and to capture accurate metadata. Preliminary findings indicate that a lack of version control can compromise the authenticity of the data if records are altered or deleted. Data abundance (currently 53 500 plots within 384 sample projects, totalling 1 064 770 species occurrence records) is a challenge, because the data differ in format, methodology, and metadata across research projects. The curation of plot data requires a standardised approach in the different steps from data acquisition to the provision of results. Species names need to coincide with currently accepted taxonomy, and although certain details are specific to a given species-plot project depending on its research interest, various other data should be made consistent in terms of field names and formats, to improve the quality of the resulting aggregated set of botanical records. All decisions to modify data records to achieve consistency should be clearly explained in the metadata record for the dataset.
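One such consistency step, harmonizing field names and formats across projects before aggregation, might look like the following sketch (the source column names are hypothetical, not the Institute's actual schemas):

```python
# Harmonize per-project plot tables into one standard layout.
import pandas as pd

FIELD_MAP = {  # per-project column name -> standard field name
    "spp": "scientific_name",
    "species_name": "scientific_name",
    "cover%": "cover_percent",
    "abund": "cover_percent",
    "plot_no": "plot_id",
}

def harmonize(df: pd.DataFrame, project_id: str) -> pd.DataFrame:
    out = df.rename(columns=FIELD_MAP)
    out["project_id"] = project_id   # provenance, recorded in the metadata
    out["scientific_name"] = out["scientific_name"].str.strip()
    return out[["project_id", "plot_id", "scientific_name", "cover_percent"]]
```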

Abstract: We are using Wikidata and Metaphactory to build an Integrated Flora of Canada (IFC). The IFC will be integrated in two senses: first, it will draw on multiple existing floras (e.g. the Flora of North America, the Flora of Manitoba) for content; second, it will be a portal to related resources such as annotations, specimens, literature, and sequence data.
Background
We had success using Semantic MediaWiki (SMW) as the platform for an online representation of the Flora of North America (FNA). We used Charaparser (Cui 2012) to extract plant structures (e.g. "stem"), characters (e.g. "external texture"), and character values (e.g. "glabrous") from the semi-structured FNA treatments. We then loaded these data into SMW, which allows us to query for taxa based on their character traits and enables a broad range of exploratory analysis, both for hypothesis generation and to provide support for or against specific scientific hypotheses.
Migrating to Wikidata/Wikibase
We decided to explore a migration from SMW to Wikibase for three main reasons: simplified workflow, triple-level provenance, and sustainability.
Simplified workflow: Our workflow for the FNA-based portal includes natural language processing (NLP) of coarse-grained XML to produce fine-grained XML, transforming this XML for input into SMW, and a custom SMW skin for displaying the data. We consider the coarse-grained XML to be canonical. When it changes (because we find an error, or we improve our NLP), we have to re-run the transformation and re-load the data, which is time-consuming. Ideally, our presentation would be based on API calls to the data itself, eliminating the need to transform and re-load after every change.
Provenance: Wikidata's provenance model supports multiple, conflicting assertions for the same character trait, which is something that inevitably happens when floristic data are integrated.
Sustainability: Wikidata has strong support from the Wikimedia Foundation, while SMW is increasingly seen as a legacy system.
Wikibase vs. Wikidata
Wikidata, however, is not a suitable home for the Integrated Flora of Canada. It is built upon a relatively small number of community-curated properties, while we have ~4,500 properties for the family Asteraceae alone. The model we want to pursue is to use Wikidata for a small group of core properties (e.g. accepted name, parent taxon, etc.), and to use our own instance of Wikibase for the much larger number of specialized morphological properties (e.g. adaxial leaf colour, leaf external texture, etc.). Essentially, we will be running our own Wikidata, over which we will exercise full control. Miller (2018) describes deploying this curation model in another domain.
Metaphactory
Metaphactory is a suite of middleware and front-end interfaces for authoring, managing, and querying knowledge graphs, including mechanisms for faceted search and geospatial visualizations. It is also the software (together with Blazegraph) behind the Wikidata Query Service. Metaphactory provides us with a SPARQL endpoint; a templating mechanism that allows each taxonomic treatment to be rendered via a collection of SPARQL queries; reasoning capabilities (via an underlying graph database) that permit the organization of over 42,000 morphological properties; and a variety of search and discovery tools.
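For example, a treatment template might retrieve taxa by a morphological property through the SPARQL endpoint. The sketch below assumes a placeholder endpoint URL and property IRIs, as the real ones are not given here:

```python
# Query taxa by a morphological property via a SPARQL endpoint.
from SPARQLWrapper import JSON, SPARQLWrapper

sparql = SPARQLWrapper("https://example.org/ifc/sparql")  # placeholder endpoint
sparql.setQuery("""
    PREFIX ifc: <https://example.org/ifc/property/>
    SELECT ?taxon WHERE {
        ?taxon ifc:leafExternalTexture "glabrous" .
    } LIMIT 20
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["taxon"]["value"])
```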
There are a number of ways in which Wikidata and Metaphactory can work together, and we are still exploring questions such as: Will provenance be managed via named graphs, or via the Wikidata snak model? How will data flow between the two platforms? We will report on our findings to date, and invite collaboration with related Wikimedia-based projects.

Abstract: The collaboration between LifeWatch ERIC and DiSSCo (Distributed System of Scientific Collections), both pan-European research infrastructures focusing on biodiversity, can be achieved in a number of ways. This collaboration can be initiated directly through their joint support of GBIF (Global Biodiversity Information Facility), facilitating GBIF's overall objective: "Connecting data and expertise: a new alliance for biodiversity knowledge" (Hobern and Miller 2019).
LifeWatch ERIC supports GBIF collaboratively by integrating and providing e-Services according to the Global Biodiversity Informatics Outlook (GBIO) framework objectives (Fig. 1), particularly those of the Understanding focus area. This area concentrates on building modeled representations of biodiversity patterns and properties, based on any available evidence, using the following components:
- multiscale species modelling;
- trends and predictions;
- modelling biological systems;
- visualization and dissemination;
- prioritizing new data capture.
In this regard, during the 2nd Global Biodiversity Informatics Conference, LifeWatch ERIC actively participated in one of the four parallel working groups reviewing different components of the GBIO framework. Each component was selected to capture information on a broad range of different challenges and opportunities. At the same event, DiSSCo focused mainly on the Data layer, as the main provider of data and other types of collection resources in Europe.
The Evidence layer is the fertile interface where both research infrastructures can develop sound synergies to support GBIF, through three concrete activities:
- participation in the co-design, development and deployment of a multi-purpose Virtual Research Environment (VRE) to support DiSSCo, specifically by integrating the collections e-Services and by engaging the various communities of practice;
- participation in the co-design and co-implementation of relevant e-Services in LifeBlock (LifeWatch ERIC's blockchain-based technology platform);
- the active participation of DiSSCo in integrating collections data: DiSSCo is one of the main resources needed for the integration of the GLOBIS-B (GLOBal Infrastructures for Supporting Biodiversity) work on Essential Biodiversity Variables (EBVs) (Kissling et al. 2018). Thus, EBVs together with species traits will be integrated into the LifeBlock platform in order to feed the Ecosystem Services needed to further support the Biodiversity and Ecosystem Services VRE provided by the LifeWatch ERIC distributed e-Infrastructure.

Abstract: Through the Bloodhound proof-of-concept (https://bloodhound-tracker.net), an international audience of collectors and determiners of natural history specimens is engaged in the emotive act of claiming their specimens and attributing other specimens to living and deceased mentors and colleagues. Behind the scenes, these claims build links between Open Researcher and Contributor Identifiers (ORCID, https://orcid.org) or Wikidata identifiers for people and Global Biodiversity Information Facility (GBIF) specimen identifiers, predicated on the Darwin Core terms recordedBy (collected) and identifiedBy (determined). Here we additionally describe the socio-technical challenge of unequivocally resolving people's names in legacy specimen data and propose lightweight, reusable solutions. The unique identifiers for the affiliations of active researchers are obtained from ORCID, whereas the unique identifiers for institutions where specimens are actively curated are resolved through Wikidata. By constructing closed loops of links between person, specimen, and institution, an interesting suite of potential metrics emerges, all due to the activities of employees and their networks of professional relationships. This approach balances the desire of individuals to receive formal recognition for their efforts in natural history collections with the institutional-level need to adjust budgets in response to easily obtained numeric trends in national and international reach. If handled in a coordinated fashion, this reporting technique may be a significant new driver for specimen digitization efforts, on par with Altmetric (https://www.altmetric.com), an important new tool that tracks the impact of publications and delights administrators and authors alike.
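As a small illustration of one resolution step described above (not Bloodhound's actual implementation), the following searches the public ORCID API for candidate records matching a collector's name from legacy specimen data:

```python
# Search the public ORCID v3.0 API for candidate people records.
import requests

def orcid_candidates(given: str, family: str, rows: int = 5) -> list:
    resp = requests.get(
        "https://pub.orcid.org/v3.0/search",
        params={"q": f"given-names:{given} AND family-name:{family}", "rows": rows},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
    return [hit["orcid-identifier"]["uri"] for hit in resp.json().get("result") or []]

print(orcid_candidates("Jane", "Smith"))  # placeholder name, not a real claim
```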

Abstract: The European Union (EU) is committed to tackling climate change, which poses serious risks to the global environment and human well-being. Supporting renewable energy is a key policy direction for the EU to lower its contributions to climate change. However, renewable energy technologies have diverse effects on the environment and on society. These effects can be considered a complex system of interacting elements and are challenging to assess. Conceptual models are a way of synthesizing this information to obtain an overview of the system and essential insights. We present the results of an activity to assess the impacts of EU renewable energy policies on overseas biodiversity and the United Nations (UN) Sustainable Development Goals (SDGs). This was carried out as part of EKLIPSE (Establishing a European Knowledge and Learning Mechanism to Improve the Policy-Science-Society Interface on Biodiversity and Ecosystem Services), a mechanism to synthesise environmental knowledge in response to specific requests by decision-makers at the European level. We carried out a participatory process to collate expert knowledge into a conceptual model using a fuzzy cognitive mapping approach (Özesmi and Özesmi 2004), with the Mental Modeler software for mapping (Gray et al. 2013). The participants were guided to connect significant EU policies associated with renewable energy, the technologies they support, and the known impacts of these technologies on biodiversity and the SDGs, drawing on a preliminary review of the literature. The individual models obtained were integrated into a single model (see Suppl. material 1 for images), which was then subject to network analysis, revealing the collective effects of different renewable energy technologies (RETs) on the wider socio-ecological system. Our findings highlight that RETs have complex and at times disparate effects on biodiversity and the SDGs, acting through a variety of mediating processes. They benefit the SDGs on balance, particularly the climate-related SDGs. Mitigation of biodiversity impacts remains a concern, and processes such as habitat change were found to be influential here. Our results suggest that policymakers must focus on implementing appropriate environmental impact assessments, guided by these mediating processes. This would minimize the negative environmental impacts of RETs while maximizing their benefits.
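For readers unfamiliar with the method, one common fuzzy-cognitive-map update rule (after Özesmi and Özesmi 2004) can be sketched as follows, with a made-up three-concept weight matrix rather than the study's expert-derived model:

```python
# Iterate a fuzzy cognitive map to a steady state (illustrative weights only).
import numpy as np

concepts = ["solar policy", "land-use change", "biodiversity"]
W = np.array([            # W[i, j]: signed influence of concept i on concept j
    [0.0, 0.6,  0.0],
    [0.0, 0.0, -0.7],
    [0.0, 0.0,  0.0],
])

def squash(x):            # a common FCM transfer function (logistic)
    return 1.0 / (1.0 + np.exp(-x))

state = np.array([1.0, 0.5, 0.5])     # activate the policy concept
for _ in range(20):                    # iterate until values settle
    state = squash(state @ W + state)  # update with self-memory
print(dict(zip(concepts, state.round(3))))
```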

Abstract: The aim of the Global Genome Biodiversity Network (GGBN, http://www.ggbn.org) is to foster collaboration among biodiversity biobanks on a global scale, in order to further compliance with standards and best practices and to secure interoperability and exchange of material in accordance with national and international legislation and conventions. Key aspects of GGBN's mission are thus to develop a network of trusted collections, establish standards, and identify best practices by reaching out to other communities. This is especially critical in light of new international legislation such as the Nagoya Protocol on Access and Benefit Sharing (ABS).
Biological repositories, such as but not limited to natural history collections, botanic gardens, culture collections and zoos, are facing a series of challenges triggered by the rapid acceleration of sequencing technology, which has put added pressure on the use of samples that just a few years ago were considered inaccessible for sequencing.
ABS legislation applies to nearly all collection types, and with biodiversity biobanks increasing in number worldwide, there is an urgent need to streamline procedures and ensure legislative compliance. Within Europe it is necessary to 1) reach common standards for biodiversity and environmental biobanks, 2) define best practices for the use of molecular collections, and 3) ease the exchange of samples and related information while staying compliant with legislation and conventions.
Within the EU-funded SYNTHESYS+ project (http://www.synthesys.info), GGBN is leading Network Activity 3 (NA3). An overview of planned activities and tasks will be given here, with special emphasis on linkages within and beyond SYNTHESYS+.

Abstract: Land use disturbances are having enormous adverse impacts on the biodiversity and integrity of natural and managed ecosystems around the world. Adverse impacts on biodiversity are compromising ecosystem services and processes, reducing ecosystem resilience, and leading to unpredictable ecosystem responses to environmental change. The Metagenomics-Based Ecosystem Biomonitoring Project (EcoBiomics) addresses the urgent need to better understand the extent and significance of ongoing changes to biodiversity in the soil and aquatic ecosystems that sustain essential ecosystem services upon which Canadians and the Canadian economy depend. The project uniquely recognizes that the breadth of scientific expertise required for this research is spread across Canadian government departments and agencies with biodiversity portfolios. It involves over 50 participants (researchers, technicians, bioinformaticians, software developers, managers, students, etc.) contributing to many smaller projects, organized by theme, in several locations across Canada. The objectives of this project include:
1. developing standard soil and water methods and a federal Bioinformatics Platform to harmonize analyses of metabarcoding, metagenomics and metatranscriptomics data across federal departments and agencies;
2. establishing genomic observatories and comprehensive biodiversity baselines for assessing future changes to water and soil biodiversity at long-term environmental monitoring sites in Canada;
3. developing new knowledge to improve water quality and soil health by comprehensively characterizing aquatic microbiomes, soil microbiomes, and invertebrate zoobiomes, and testing hypotheses to enhance environmental assessment, monitoring, and remediation activities.
Our poster will focus primarily on the challenges associated with the first objective. Genomic technologies are revolutionizing biodiversity assessment in soil and aquatic ecosystems, and they now offer the only practical way to comprehensively characterize this enormous biodiversity. These technologies and associated tools allow us to obtain comprehensive baseline biodiversity data that are essential to support evidence-based decision-making. However, a strong focus of this project is to enable environmental assessment, monitoring, and remediation activities by a multitude of potential end users, so standardized protocols and processes must be defined and shared. Several procedures were defined for the data generated. A minimum metadata profile for the sampling event is required for all projects, following existing standards including Darwin Core (https://dwc.tdwg.org); a minimal example is sketched below. Sample preparation was also standardized, based primarily on protocols developed and validated in earlier projects, for example the Earth Microbiome Project (http://www.earthmicrobiome.org). The procedures from DNA extraction through sequencing largely followed the Minimum Information about any (x) Sequence (MIxS) standard (gensc.org/mixs). These data are all input into an in-house, custom-built software package, Sequence Database (SeqDB), used by all participants across the entire project and made available centrally via a federal high-performance computing centre. SeqDB not only stores the metadata and data generated, but also maintains provenance based on defined workflows for metabarcoding, shotgun metagenomics, and other "omics" pipelines. It also supports project management through various metrics and visualizations.
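As referenced above, an illustrative minimum-metadata record for a sampling event, using real Darwin Core term names; the particular field list and identifier scheme are hypothetical, as the project's actual profile is not reproduced here:

```python
# A minimal sampling-event record using Darwin Core term names.
event = {
    "eventID": "ECOBIOMICS:2019:0001",       # hypothetical identifier scheme
    "eventDate": "2019-07-15",
    "decimalLatitude": 45.4215,
    "decimalLongitude": -75.6972,
    "samplingProtocol": "soil core, 0-15 cm depth",
    "habitat": "agricultural field margin",
}
```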
We will document some of the challenges of standardizing data and workflows in a large multi-domain, multi-department project like EcoBiomics, and the need for further standards development to truly support data sharing and integration across a highly diverse ecosystem of genomic observatories globally.

Abstract: Citizen science is a powerful way to undertake monitoring of biodiversity, both for detecting rare events (e.g. invasive species, animal and plant health issues, or the presence of rare species) and for assessing trends. However, in order to use citizen science effectively, we need to better understand the patterns of people's participation in projects, considering:
- variation in participation between citizen science approaches, and
- individual variation in participation within a project.
<p>Here, we particularly focus on the information content of the data collected through citizen science (although we recognise that citizen science has many other benefits, in addition to data collection).</p>
<p>Firstly, we assessed participation in five projects for biodiversity monitoring in the UK, ranging from mass participation to monitoring by volunteer experts, and representing up to two thousand people per activity per year. We quantified the patterns of participation (in terms of retention of participants, spatial patterns of participation, and unevenness of contributions per participant, as in the 90:10 rule). We found that data from mass participation projects were more strongly spatially correlated with human population density, and that retention of individuals was lower, compared to projects targeted at those with an existing interest in the subject.</p>
<p>Secondly, we quantified the recording behaviour of participants in a butterfly citizen science project, using data from four thousand users of a smartphone app designed for recording sightings of butterflies in the UK. The majority of these users were active for fewer than 10 days, a feature common to many citizen science projects; the users who engaged for longer produced most of the records for the project. We characterised their recording behaviour using 11 metrics that describe the variation in temporal and spatial recording behaviour as well as the data they recorded. Results showed that citizen scientists in this project fall on a continuum along four main axes describing their behaviour. We then used a 20-year butterfly dataset to assess the contribution of different types of recorders to the overall estimate of biodiversity trends and their precision. Overall, variation in participation, both between projects and between individuals within projects, contributes to variation in the information content (and hence the usefulness) of citizen science datasets. We show how different approaches can provide data to meet different needs of data users, and how this understanding can be used to improve analyses of these data, allowing us to better design citizen science activities in the future.
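<p>The abstract does not name the ordination method behind the four behavioural axes; as a hedged illustration, the sketch below applies principal component analysis to synthetic recorder metrics, purely as a plausible stand-in for reducing 11 behaviour metrics to a small number of axes.</p>
<pre><code>
# Illustrative only: PCA on synthetic per-recorder metrics; the study's
# actual data and method may differ.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# 4000 recorders x 11 behaviour metrics (e.g. active days, records per day,
# spatial spread, species richness of records, ...): synthetic stand-ins.
metrics = rng.lognormal(mean=0.0, sigma=1.0, size=(4000, 11))

scaled = StandardScaler().fit_transform(metrics)  # put metrics on one scale
pca = PCA(n_components=4)                         # four behavioural axes
axes = pca.fit_transform(scaled)

print("variance explained per axis:", pca.explained_variance_ratio_.round(2))
print("recorder 0 position on the axes:", axes[0].round(2))
</code></pre>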

Abstract: ELIXIR unites Europe’s leading life science organisations in managing and safeguarding the increasing volume of data being generated by publicly funded research. It coordinates, integrates and sustains bioinformatics resources across its 22 member states, plus EMBL-EBI (European Molecular Biology Laboratory - European Bioinformatics Institute), and enables end users to access services and data that are vital for their research. ELIXIR's remit spans the full breadth of life science data, including data related to human health, food production (agriculture, farming, aquaculture) and the environment (e.g. pollution remediation, ecology), all of clear socio-economic benefit. As a result, ELIXIR contributes to the delivery of several sustainable development goals. This poster will introduce ELIXIR and describe the contribution it can make to coordinating data and services relevant to biodiversity. The poster will set the context for how molecularly-derived biodiversity occurrence data can significantly enhance resources such as the Global Biodiversity Information Facility (GBIF) and the Ocean Biogeographic Information System (OBIS), e.g. by filling in acute gaps in our knowledge of species across realms.

Abstract: The distribution of species is strongly influenced by habitat quality and its changes over time. Climate change has been identified as one of the major drivers of habitat loss, threatening the survival of many range-restricted animal species. Identification of spatiotemporal hotspots of species occurrence is important for understanding basic ecological processes, particularly for the conservation of species at risk. This study models the spatiotemporal distribution of Rothschild’s giraffe (<em>Giraffa camelopardalis rothschildi</em>) with a view to explaining the possible effects of changing habitat suitability in Kenya and across Africa. The study analyzes the relative importance of different climatic variables and establishes which variables are the strongest predictors of the species’ geographic range. We apply species distribution modelling to predict the species' response to future climate and land use change scenarios. Our model is based on occurrence data from the Global Biodiversity Information Facility (GBIF) for the period 1923-2019 and climatic data from WorldClim. We fit the model using the Maximum Entropy (Maxent) algorithm to identify the combination of environmental responses that best predicts evolving hotspots of occurrence for this species and future habitat suitability in the face of climate change. The study demonstrates the usability of occurrence data over time for Rothschild’s giraffe and gives insights into the integration of land use variables, so as to link species distribution patterns, land use change and climate change to effectively inform conservation management.
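<p>As a minimal sketch of the data-assembly step, the code below pages through GBIF occurrence records for the subspecies, assuming the pygbif client (installable via pip); the query and paging loop are illustrative, not the authors' actual retrieval code.</p>
<pre><code>
# Hedged sketch: fetching georeferenced GBIF occurrences with pygbif.
from pygbif import occurrences as occ

results = []
offset = 0
while True:
    page = occ.search(
        scientificName="Giraffa camelopardalis rothschildi",
        hasCoordinate=True,
        limit=300,
        offset=offset,
    )
    results.extend(page["results"])
    if page.get("endOfRecords"):
        break
    offset += 300

coords = [
    (r["decimalLatitude"], r["decimalLongitude"])
    for r in results
    if "decimalLatitude" in r and "decimalLongitude" in r
]
print(f"{len(coords)} georeferenced records retrieved")
</code></pre>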

Abstract: Species distribution modelling, or ecological niche modelling, is a collection of techniques for the construction of correlative models based on the combination of species occurrences and GIS data. Using such models, a variety of research questions in biodiversity science can be investigated, among which are the assessment of habitat suitability around the globe (e.g. in the case of invasive species), the response of species to alternative climatic regimes (e.g. by forecasting climate change scenarios, or by hindcasting into palaeoclimates), and the overlap of species in niche space. The algorithms used for the construction of such models include maximum entropy, neural networks, and random forests. Recent advances both in computing power and in algorithm development raise the possibility that deep learning techniques will provide valuable additions to these existing approaches. Here, we present our recent findings in the development of workflows to apply deep learning to species distribution modelling, and discuss the prospects for the large-scale application of deep learning in web service infrastructures to analyze the growing corpus of species occurrence data in biodiversity information facilities.
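<p>As one minimal sketch of the kind of workflow meant here, the following trains a small feed-forward network to score habitat suitability from environmental covariates at presence versus background points; the architecture, covariate count and synthetic data are assumptions for illustration, not a production model.</p>
<pre><code>
# Hedged sketch of a deep-learning species distribution model in PyTorch.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_covariates = 19                            # e.g. 19 bioclimatic layers
X = torch.randn(2000, n_covariates)          # covariates at sample points
y = torch.randint(0, 2, (2000, 1)).float()   # 1 = presence, 0 = background

model = nn.Sequential(
    nn.Linear(n_covariates, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1),                        # logit of habitat suitability
)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):                         # simple full-batch training loop
    optimiser.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimiser.step()

suitability = torch.sigmoid(model(X[:5])).detach().squeeze()
print(suitability)                           # suitability scores in [0, 1]
</code></pre>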

Abstract: Scratchpads launched in 2007 and became an extremely popular resource, adopted by a variety of communities. Primarily, Scratchpads are used to manage and publish biodiversity data, but many sites are organised around projects, societies and regions. Demand for Scratchpads peaked at requests for more than 80 new sites over a three-month period in 2014. Today we have over 1000 Scratchpad sites.</p>
<p>This has not been a pain-free journey. In 2015 the grants funding Scratchpads support and development came to an end, and whilst the Natural History Museum, London, provided some institutional support, this was alongside several competing initiatives. For a period of nearly two years, the Scratchpads had no dedicated developers. The Scratchpads suffered from this neglect: bugs remained unfixed and the platform became increasingly unstable. This situation has now been rectified; in 2017 the Informatics Group expanded, enabling us to again provide dedicated resources to the Scratchpads.</p>
<p>This has been a challenging but valuable learning experience - and one that many Virtual Research Environments (VREs) in our community have experienced, or will experience. Current funding models encourage a boom-and-bust development cycle, described by Hine (2008) as the "dance of the initiatives," as projects constantly need to re-invent themselves in order to receive new external funding. We need to move beyond this and start working collectively to develop a common roadmap for these systems, so we can begin to benefit mutually from each other's development activities. Building on lessons learnt from the Scratchpads, we highlight a draft set of principles that may provide a framework for such a collaboration. While it is unrealistic to expect existing projects at different stages of maturity, and supporting very different use cases, to re-write their codebases in order to facilitate collaboration, we propose a microservices framework that would allow these related systems to converge on the delivery of a common set of services currently provided by many VREs. This convergence, coupled with the development of a common and mutually agreed roadmap for these systems, has the potential to build a more sustainable future for VRE user and developer communities as these systems evolve to support new use cases and improve existing functionality.

Abstract: There is a rich history of biodiversity data collection in the United Kingdom (UK), from the earliest record of a Peregrine Falcon in 1605 to the 1.5 million records from the last two years, all of which contribute to the 222 million occurrence records covering 45,500 species available on the NBN Atlas. Today, there are over 90 national schemes, usually covering specific taxa, approximately 60 Local Environmental Record Centres, a similar number of regional recording groups and a growing number of citizen science projects, all collecting data about wildlife in the UK through different mechanisms, from paper records to remote recording (e.g., using acoustics to monitor the presence of bat species), and using differing data standards. </p>
<p>Bringing these data together and making them available in the Darwin Core format makes the National Biodiversity Network's (NBN) Atlas (www.nbnatlas.org) an unparalleled resource for understanding the UK’s natural world, and contributes to the picture of global biodiversity. However, this is not a simple job: adapting record data and metadata to fit the Darwin Core format while keeping it simple for data providers to understand, the complexity of data flows, and the verification of species identifications can all cause delays in making data of known quality available for use. In this presentation we will describe how we are dealing with these issues now and our plans for how to improve things in the future. We will also talk about the strengths of the data shared by amateur experts, how they can be used for more than just species trends, and why the push for Open Data may actually be reducing the amount of accessible wildlife data in the UK.

Abstract: Plant traits – the morphological, anatomical, physiological, biochemical and phenological characteristics of plants and their organs – determine how primary producers respond to environmental factors, affect other trophic levels, influence ecosystem processes and services, and provide a link from species richness to ecosystem functional diversity. Trait data thus represent the raw material for a wide range of research, from evolutionary biology, community and functional ecology to biogeography. The importance of these topics dictates the urgent need for more and better data and for improved data availability and applicability; however, producing larger datasets that allow for more powerful, synthetic analyses increasingly relies on the integration of small, focused studies. Operationalizing plant functional traits has therefore been identified as a key issue in plant and vegetation ecology.</p>
<p>In 2007 the International Geosphere-Biosphere Programme (IGBP) and DIVERSITAS (together now Future Earth) initiated a global database of plant traits to make these data available for trait-based approaches in ecology and vegetation modelling. This was the start of the TRY initiative (https://www.try-db.org).</p>
<p>As of 2019, the TRY database contains about 12 million trait records for more than 300,000 plant taxa and 2000 traits. The data are publicly available under a CC BY license and have so far contributed to more than 200 scientific publications.</p>
<p>Based on experience in this bottom-up exercise, my presentation will provide a subjective view on what has been essential to make progress towards operationalizing plant traits and how far the plant trait community has progressed.

Abstract: <p>
Background: The NFDI process in Germany
</p>
<p>The digital revolution is fundamentally transforming research data and methods. Mastering this transformation poses major challenges for stakeholders in the domains of science and policy. The process of digitalisation creates immense opportunities, but it must be structured proactively. To this end, the establishment of effective governance mechanisms for research data management (RDM) is of fundamental importance and will be one key driver of successful research and innovation in the future. In 2016 the German Council for Information Infrastructures (RfII) recommended the establishment of a “Nationale Forschungsdateninfrastruktur” (National Research Data Infrastructure, or NFDI), which will serve as the backbone for research data management in Germany. The NFDI should be implemented as a dynamic national collaborative network that grows over time and is composed of various specialised nodes (consortia). The talk will provide a short overview of the status and objectives of the NFDI, and will then describe the goals of the NFDI4BioDiversity consortium, which was established to provide targeted research data management support to the biodiversity community.</p>
<p>
The NFDI4BioDiversity Consortium: Biodiversity, Ecology &amp; Environmental Data
</p>
<p>Biodiversity is more than just the diversity of living species. It includes genetic diversity, functional diversity, interactions and the diversity of whole ecosystems. Humankind continues to dramatically impact the Earth’s ecosystems: species are dying out, and genetic diversity as well as whole ecosystems are endangered or already lost. Next to the loss of charismatic species and conspicuous changes in ecosystems, we are experiencing a quiet loss of common species, which together have captured high-level policy attention. This has impacts on vital ecosystem services that provide the foundation of human well-being.</p>
<p>A general understanding of the status, trends and drivers of the biodiversity on earth is urgently needed to devise conservation responses. Besides the fact that data are often scattered across repositories or not accessible at all, the main challenge for integrative studies is the heterogeneity of measurements and observation types, combined with a substantial lack of documentation. This leads to inconsistencies and incompatibilities in data structures, interfaces and semantics and thus hinders the re-usability of data to answer scientifically and socially relevant questions. Synthesis as well as hypothesis generation will only proceed when data are compliant with the FAIR (Findable, Accessible, Interoperable and Re-usable) data principles.</p>
<p>Over the last five years, these key challenges have been addressed by the DFG-funded German Federation for Biological Data (GFBio) project. GFBio encompasses technical, organizational, financial, and community aspects to raise awareness of research data management in biodiversity research and the environmental sciences. To foster sustainability across this federated infrastructure, the not-for-profit association “Gesellschaft für biologische Daten e.V. (GFBio e.V.)” was set up in 2016 as an independent legal entity.</p>
<p>NFDI4BioDiversity builds on the experience and established user community of GFBio and takes advantage of GFBio e.V. GFBio already comprises data centers for nucleotide and environmental data, the seven well-established data centers of Germany's largest natural science research facilities and museums, and the world's most diverse microbiological resource collection. The network is now being extended to include the network of botanical gardens and the largest collections of crop plants and their wild relatives. Together, these collections host more than 75% of all museum objects in Germany (150 million) and &gt;80% of all described microbial species, and they represent some of the biggest and most internationally relevant data repositories.</p>
<p>NFDI4BioDiversity will extend its community engagement at the science-society-policy interface by including farm animal biology, crop sciences, biodiversity monitoring and citizen science, as well as systems biology, encompassing world-leading tools and collections for FAIR data management. Partners of the German Network for Bioinformatics Infrastructure (de.NBI) provide large-scale data analysis and storage capacities in the cloud, as well as extensive, continuous training and education experience. Dedicated personnel will be responsible for the mutual exchange of data and experiences with NFDI4Life-Umbrella, NFDI4Earth, NFDI4Chem, NFDI4Health and beyond.</p>
<p>As digitalization and liberation of data proceeds, NFDI4BioDiversity will foster community standards, quality management and documentation as well as the harmonization and synthesis of heterogeneous data. It will pro-actively engage the user community to build a coordinated data management platform for all types of biodiversity data as a dedicated added value service for all users of NFDI.

Abstract: Animal-borne sensor data, along with other types of sensor-based observations, provide a growing volume and proportion of the documentation about biodiversity. These data differ from the traditional specimen, sampling and human observation records for which the Taxonomic Databases Working Group (TDWG) originally designed the Darwin Core standard. The original intention of the new TDWG Machine Observations Interest Group is to facilitate a body of work combining the informatics expertise of TDWG with that of subject matter experts, to document best-practice guidelines for applying Darwin Core to bio-logging datasets. This session offers the opportunity to walk through some of the use cases developed so far, including a terrestrial GPS tracking and acceleration dataset from Movebank and a marine acoustic telemetry dataset from the Ocean Tracking Network using stationary as well as mobile acoustic receivers.</p>
<p>Through these examples, we will describe the strategy and rationale for the approaches taken to the application of Darwin Core using typical animal tracking scenarios laced with some of the common complexities in bio-logging and other types of machine-based biodiversity observations.
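<p>For concreteness, the sketch below maps a single GPS fix to Darwin Core terms. The term names are published Darwin Core; the mapping choices (e.g. using organismID to link fixes from one animal) are one plausible convention for illustration, not the interest group's agreed guideline.</p>
<pre><code>
# Hypothetical Darwin Core representation of one GPS fix from a tracked bird.
gps_fix_as_dwc = {
    "basisOfRecord": "MachineObservation",
    "occurrenceID": "urn:example:tag1234:fix:2019-05-01T06:15:00Z",
    "organismID": "urn:example:animal:42",    # links all fixes of one animal
    "eventDate": "2019-05-01T06:15:00Z",
    "decimalLatitude": 52.1576,
    "decimalLongitude": 5.3878,
    "coordinateUncertaintyInMeters": 15,
    "samplingProtocol": "GPS tag",            # one plausible convention
    "scientificName": "Haliaeetus albicilla",
}
</code></pre>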

Abstract: Many permits for collecting biological samples have simple conditions, such as returning specimens and information. In meeting them, researchers are ‘sharing benefits’ in the context of the Convention on Biological Diversity (CBD). However, the documents setting out such conditions are often poorly connected to long-term research by many individuals, leading to a tail-off in delivery. The conditions, and the benefits shared, are also often ineffective for biodiversity conservation because of inefficient linkage to environmental managers and policy priorities. A new model is needed to better manage permit conditions, so that users of biological and genetic material are aware of the agreements and can deliver on them with minimal additional effort, and to provide better linkage to application in provider countries. Many of the informatics tools needed for this are already available. The paper will outline the elements of the model and suggest means of implementation.

Abstract: In the wide-ranging field of biodiversity conservation, genebanks play a major role in the preservation of cultivated plants. An important focus of genebanks is the comprehensive documentation of the maintained material. This is a prerequisite for enabling users to select the most suitable material for, e.g., research or breeding programs (Hoisington et al. 1999). The German Federal <em>ex situ</em> Genebank for Agricultural and Horticultural Crops, hosted at IPK, is the largest genebank in Western Europe.</p>
<p>Within the multitude of data associated with plant material (e.g. from various -omics areas or conservation management), the so-called passport data represent the most original and oldest data in genebanks. These metadata are often subject to heterogeneity due to historically different collection and curation, especially if they were received from different institutions around the world. This leads to difficulties in handling these data and can result in misinterpretations. In addition, there are correlations between the individual attributes of the passport data which can lead to a different importance of the individual data points for the users.</p>
<p>Major challenges for users are estimating the completeness, correctness and reliability of these data. Thus, it is necessary to assess the quality of these data by defining a suitable set of metrics. Unfortunately, classical data quality measurement metrics (e.g. Klier 2008) are not sufficient to fulfill users' needs. Depending on the intention of the user, a different focus is placed on the data. Moreover, the individual attributes of the respective areas can be related to each other. Therefore, a single index value for estimating the quality of a passport record is not sufficient; rather, it seems more promising to generate more differentiated quality statements.</p>
<p>We are working on a metrics system that is sensitive to the user's focus. Through a practical set of data quality metric rules for accession-related data, the user will be able to influence the weighting of individual domains (e.g. geographical origin, biological status) according to their context (a fit-for-use index).</p>
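<p>A minimal sketch of such a user-weighted, domain-level score follows; the domains, attributes and weights are illustrative assumptions, not the metrics system under development.</p>
<pre><code>
# Hedged sketch of a "fit-for-use" quality score over passport-data domains.
DOMAINS = {
    "geographic_origin": ["countryOfOrigin", "collectingSite",
                          "latitude", "longitude"],
    "biological_status": ["biologicalStatus", "taxonName"],
}

def domain_completeness(record: dict, attributes: list) -> float:
    """Fraction of a domain's attributes that are filled in."""
    filled = sum(1 for a in attributes if record.get(a) not in (None, ""))
    return filled / len(attributes)

def fit_for_use(record: dict, weights: dict) -> dict:
    """Per-domain scores plus a user-weighted overall score, rather than
    a single undifferentiated index value."""
    scores = {d: domain_completeness(record, attrs)
              for d, attrs in DOMAINS.items()}
    total = sum(weights.values())
    scores["overall"] = sum(weights[d] * s for d, s in scores.items()) / total
    return scores

accession = {"countryOfOrigin": "DEU", "taxonName": "Hordeum vulgare L."}
print(fit_for_use(accession,
                  weights={"geographic_origin": 2.0, "biological_status": 1.0}))
</code></pre>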
<p>The presentation will discuss the background and will give an overview of the progress of this research activity.

Abstract: GBIF Togo, hosted at the University of Lomé, has published more than 62,200 occurrence records from 37 datasets and checklists. As a node participant in the Global Biodiversity Information Facility (GBIF) since 2011, it has participated actively in several projects, including the Biodiversity Information for Development (BID) programme.</p>
<p>GBIF facilitates collaboration between nodes at different levels through its Capacity Enhancement Support Programme (CESP). One of the actions included in the CESP guidelines is called ‘Mentoring activities’. Its main goal is the transfer of knowledge between partners, such as information, technologies, experience, and best practices.</p>
<p>Sharing architecture and development is a key solution to some of the technical challenges and impediments (e.g. hosting, staff turnover) that GBIF nodes occasionally face. The Atlas of Living Australia (ALA) team have developed a feature called a ‘data hub’, which allows the creation of a standalone website with a dedicated occurrence search engine that supports discovery of data (e.g. for a specific genus or geographic area) published by particular GBIF nodes.</p>
<p>In 2017, a CESP project between GBIF Benin and GBIF France led to the creation of a new portal, the Atlas of Living Benin. This portal shared the same back-end database as the Atlas of Living France portal, while each portal displayed and managed only the information relevant to its region.</p>
<p>In 2018, another CESP project, between GBIF France and GBIF Togo, shared the same goal as the previous one: to implement a new ALA-based portal, this time for Togo. This goal will be fulfilled using a similar implementation to the previous project: a shared back-end and separate front-ends. Togo will be the second African GBIF node to implement this kind of infrastructure.</p>
<p>This poster will highlight the architecture specific to the Atlas of Living Togo, and present the management procedure that distinguishes data coming from the three different countries.

Abstract: <p>
The Atlas of Living Australia (ALA) is an Australian Government supported collaborative partnership of organisations that have stewardship of Australian biodiversity data. The ALA (www.ala.org.au) provides research infrastructure that enables delivery of biodiversity information to over 45,000 unique users in research, industry and government per annum. It delivers impact and enables research excellence in fields such as biodiversity, environmental management, ecology and genetic sciences.
</p>
<p>
Integrated and consistent infrastructure and processes are fundamental to increasing the value of collections and associated data. The Atlas of Living Australia has a mature industry engagement program that provides data standardisation, quality and analytical services to decision makers in all tiers of Australian government (local, state and federal). This program is built on formal partnerships between data providers (collection institutions) and analytical services (such as Virtual Laboratories and Research and Science Clouds, e.g. www.ecocloud.org.au). The provision of high-quality, authoritative data is critical to the utilisation and uptake of these services and to sector sustainability.
</p>
<p>
This presentation will showcase data services and analytical methods for decision makers within the Australian context. It will also explore how international efforts such as DiSSCo assist in data stewardship, cultural change and system enhancement.

Abstract: Climate change, habitat loss and fragmentation, invasive species, and resource over-exploitation are among the major factors driving biodiversity loss and the current global change crisis. Maintaining and restoring connectivity throughout fragmented landscapes is key to reducing habitat isolation and mitigating anthropogenic impacts. To date, few connectivity approaches seek to identify corridors along climate gradients and least-transformed natural habitats, despite their importance in facilitating the dispersal of organisms as species' ranges shift over time to track suitable climates. In this study, we identified least-cost climatic corridors in Mexico between 2027 old-growth vegetation patches, incorporating evapotranspiration as a climatic variable, Euclidean distances, and human impact. We identified old-growth vegetation patches using the land use and vegetation map of 2011 (scale 1:250,000) by the National Institute of Statistics and Geography (INEGI). Moreover, we calculated a human impact index based on the theoretical framework of the Global Biodiversity Model (Alkemade et al. 2009) but adapted for Mexico (Mexbio, Kolb 2016), which includes the impact of land use, road infrastructure and fragmentation based on the land use and vegetation map of 2011 and a road map by the Mexican Institute of Transportation. We modeled corridors for a baseline period (1980-2009) and under three future time periods (2015-2039, 2045-2069 and 2075-2099), corresponding to four Global Circulation Models (MPI-ESM-LR, GFDL-CM3, HADGEM2-ES and CNRM-CM5), each under two emission scenarios (RCP 4.5 and 8.5). The historical and future evapotranspiration values were calculated using the climate surfaces from Cuervo-Robayo et al. 2019 and from the Center of Atmospheric Sciences of the National Autonomous University of Mexico, respectively. We used the Turc evapotranspiration equation (Turc 1954) to estimate actual evapotranspiration. Least-cost climatic corridors based on future climate projections were used to test the assumption that climatic gradients are maintained in the future. We then prioritized climatic corridors using a multicriteria analysis guided by expert knowledge, incorporating factors such as indicators of human impact, vulnerability and exposure to climate change, and priority sites for biodiversity conservation and restoration. On average, more than 4,500 least-cost climatic corridors were identified for each scenario. There is a high spatial coincidence in the geographical location of current and future climatic corridors (overlap &gt; 90%). Fewer corridors were identified in the northern part of the country, where natural vegetation is less fragmented, whereas in central and southern Mexico landscape fragmentation is greater, resulting in an increased number of corridors (Fig. 1). The use of open spatial data was key to identifying climatic corridors in support of decision-making. The results provide a spatial guide for implementing conservation and restoration actions to promote connectivity, particularly among climatically stable areas, thus supporting the achievement of the Aichi Targets and Sustainable Development Goals.
The results also inform multiple stakeholders and sectors in land-use planning decisions and promote the alignment of existing incentives to reduce habitat loss, degradation and fragmentation in key areas needed to maintain and recover landscape connectivity in the face of global change.
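<p>For illustration, the sketch below shows two of the computational steps named above, under stated assumptions: the Turc (1954) formula in its commonly cited form, and a least-cost path over a synthetic resistance surface (in the study, the resistance combines the evapotranspiration gradient, Euclidean distance and human impact).</p>
<pre><code>
# Hedged sketch: Turc actual evapotranspiration and a least-cost corridor.
import numpy as np
from skimage.graph import route_through_array

def turc_aet(precip_mm: float, temp_c: float) -> float:
    """Actual evapotranspiration (mm/yr) from annual precipitation and mean
    annual temperature, using the commonly cited form of Turc (1954)."""
    L = 300 + 25 * temp_c + 0.05 * temp_c ** 3
    return precip_mm / np.sqrt(0.9 + (precip_mm / L) ** 2)

rng = np.random.default_rng(1)
resistance = rng.uniform(1.0, 10.0, size=(100, 100))  # stand-in surface

# Least-cost path between two (illustrative) old-growth patch centroids.
path, cost = route_through_array(
    resistance, (5, 5), (90, 85), fully_connected=True, geometric=True,
)
print(f"corridor of {len(path)} cells, accumulated cost {cost:.1f}")
print(f"example Turc AET: {turc_aet(800, 18):.0f} mm/yr")
</code></pre>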

Abstract: The Finnish Biodiversity Information Facility FinBIF (https://species.fi), operational since late 2016, is one of the more recent examples of comprehensive, all-inclusive national biodiversity research infrastructures. FinBIF integrates a wide array of biodiversity information approaches under the same umbrella. These include species information (Fig. 1), e.g. descriptions, photos and administrative attributes; citizen science platforms enabling the recording, management and sharing of observation data; an e-learning environment for species identification; management and sharing of restricted data among authorities; building a national DNA barcode reference library and linking it to species occurrence data; community-driven species identification support; large-scale, multi-technology digitisation of natural history collections; and IUCN Red Listing to conduct a periodic national assessment of the status of threatened species.</p>
<p>To improve the taxonomic coverage and the content of species information, FinBIF is starting a process of collaborating with the species information community at large, in order to collate already existing but not yet openly distributed information. This also means digitisation of information from analogue sources. In addition, the attempt is to join forces with its Scandinavian counterparts, namely Artdatabanken (https://www.artdatabanken.se/) and Artsdatabanken (https://www.artsdatabanken.no/), for more efficient knowledge exchange among countries sharing the same biogeographical region and thus a similar species composition. The aim is also to reach a high-level political agreement for a deeper and wider commitment to collaborate in compiling, digitising and sharing relevant biodiversity information across national borders.

Abstract: Motivation and objective: Because biodiversity conservation in forest management planning is necessary for ensuring regular ecosystem functioning, resilience and sustainability, the specific objective of this research was to quantify biodiversity at the landscape level in a forest plantation.</p>
<p>Case study: Vale de Sousa, a Forest Intervention Zone (ZIF), is located in the north of Portugal. ZIFs were formed all over the country with the objective of preventing forest fires, desertification and the abandonment of rural areas. The total case study area is 14,773 ha, mainly covered by plantation forests. The predominant forest species are maritime pine (<em>Pinus pinaster</em>) and blue gum (<em>Eucalyptus globulus</em>), in either pure or mixed stands.</p>
<p>Methods: A fuzzy-logic system can serve as a platform for bundling expert knowledge on estimating ecosystem services provision and for examining the consequences of contradictory expert views. The method was used to evaluate biodiversity, as recently proposed and demonstrated by Biber et al. (2018) in the context of the European Union (EU) project ALTERFOR (Alternative models and robust decision-making for future forest management - https://www.alterfor-project.eu/key-facts.html). In this study, we applied a fuzzy-logic approach to test three biodiversity indicators: resident birds, heterogeneity of tree species diameter, and tree and shrub species richness. This approach generates scores between 0 (very low) and 1 (very high) for biodiversity categories over the rotation period of each plantation species. It also accommodates qualitative value rules for the above indicators. Scores are established according to stakeholders' knowledge and validated by experts. Initially, the scores for each indicator are expressed as coloured matrices, but the final fuzzy output for biodiversity is expressed as a score between 0 and 1.</p>
<p>Results: Our fuzzy outputs showed low scores for biodiversity in monoculture stands, but medium scores in mixed stands. Tree and shrub species richness and diameter heterogeneity have low scores in the analysed plantations but need to be tested in other forest types. The score for resident birds had medium values in monoculture forests; however, due to the low scores of the other biodiversity indicators, the overall biodiversity score is low.</p>
<p>Conclusion: The results demonstrate that monocultures have the lowest score for biodiversity, due to the zero level of all biodiversity indicators after clear-cutting. Mixed stands have different clear-cutting periods, and this contributes to a higher overall biodiversity score (fuzzy output). The fuzzy-logic approach is a very useful tool that may help to incorporate biodiversity conservation into forest management decisions. It can potentially be used for the assessment of other biodiversity indicators (e.g. deadwood, large trees) in other forest types (including semi-natural and natural forests).
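<p>A minimal sketch of the fuzzy-logic idea follows: triangular membership functions map each 0-1 indicator to fuzzy categories, simple expert rules combine them, and the result is defuzzified to a single 0-1 score. The breakpoints and rules below are illustrative assumptions, not those of Biber et al. (2018).</p>
<pre><code>
# Hedged sketch of fuzzy scoring for three biodiversity indicators.

def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def biodiversity_score(birds, diameter_het, richness) -> float:
    """Combine three 0-1 indicator values into one fuzzy biodiversity score."""
    inputs = (birds, diameter_het, richness)
    high = min(tri(v, 0.4, 1.0, 1.6) for v in inputs)  # rule: all high -> high
    low = max(tri(v, -0.6, 0.0, 0.6) for v in inputs)  # rule: any low -> low
    # Defuzzify: weighted mean of the rule outputs' peak values (1 and 0).
    return high / (high + low) if (high + low) > 0 else 0.5

# Monoculture after clear-cutting: indicators near zero -> low overall score.
print(round(biodiversity_score(0.5, 0.1, 0.1), 2))  # -> 0.0
</code></pre>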

Abstract: Since 2015, the Natural History Museum London has made its research and collections data available through its Data Portal (https://data.nhm.ac.uk). This website provides free and open access to important research datasets as well as digitised objects from the Museum's specimen collection. The Data Portal currently has over 4.2 million records from the specimen collection and a further 5.5 million records from other research datasets. Since 2015, more than 250 scientific publications have cited data from the Data Portal, either directly or through aggregators such as the Global Biodiversity Information Facility (GBIF), although there are many more citations than it is currently possible to track.</p>
<p>Users can download data from the Portal and are encouraged to cite the source; however, there is currently no way for users to cite subsets of the data returned by a query, nor a way to persistently identify the data subset they are citing. This is a common issue with scientific data put online, particularly when the cited data change frequently, as is the case with the Museum's specimen collection, which grows constantly as more of the collection is digitised.</p>
<p>This poster outlines a new approach designed to meet the Research Data Alliance's (RDA) Working Group on Data Citation recommendations on citing evolving data (Rauber et al. 2015). This is achieved by implementing a fully versioned search framework, ensuring that all modifications to records are tracked and that the version timestamp of each modification is combined with the data in the search index. When users search and download data from the Portal, Digital Object Identifiers (DOIs) are minted for unique searches at exact versions, allowing the dynamic, repeated retrieval of data at any version timestamp without storing the results. Combining the versioning information into the search index also allows queries against historical data.</p>
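<p>The sketch below illustrates the core idea, under assumptions about the data model: every record modification is stored with a version timestamp, so a query can be replayed "as of" any version without the results themselves being stored.</p>
<pre><code>
# Hedged sketch of versioned, "as-of" record retrieval.
import bisect

class VersionedIndex:
    def __init__(self):
        self._versions = {}  # record_id -> sorted list of (timestamp, doc)

    def put(self, record_id: str, timestamp: int, doc: dict):
        self._versions.setdefault(record_id, []).append((timestamp, doc))
        self._versions[record_id].sort(key=lambda v: v[0])

    def as_of(self, record_id: str, timestamp: int):
        """Latest version of a record at or before `timestamp`."""
        versions = self._versions.get(record_id, [])
        i = bisect.bisect_right([t for t, _ in versions], timestamp)
        return versions[i - 1][1] if i else None

idx = VersionedIndex()
idx.put("NHMUK-001", 100, {"family": "Formicidae"})
idx.put("NHMUK-001", 200, {"family": "Apidae"})  # later re-identification

# A DOI minted for (query, version=150) always resolves to the same data:
print(idx.as_of("NHMUK-001", 150))  # -> {'family': 'Formicidae'}
</code></pre>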
<p>By persistently identifying query results in this fashion, researchers can cite data precisely and have confidence that although the data may change after they use it, users of their work will be able to access the data as it looked when they studied it originally. This should also encourage the systematic use of citations, making it easier to track both the usage and impact of research and collections datasets.

Abstract: The digitisation of natural history collections is drawing increasing attention. Digitised specimens not only facilitate the long-term preservation of biodiversity information but also enable easy access to and sharing of that information. There are more than two billion specimens in the world’s natural history collections, and pinned insect specimens make up more than half of them (Tegelberg et al. 2014, Tegelberg et al. 2017). However, it is still a challenge to digitise pinned insect specimens with current state-of-the-art systems. Imaging pinned insects is slow because they are essentially 3D objects and the associated labels are pinned under the insect specimen. During the imaging process, the labels are often removed manually, which slows down the whole process. How can we avoid handling the labels pinned under often fragile and valuable specimens, in order to increase the speed of digitisation?</p>
<p>In our work (Saarenmaa et al. 2019) for task T3.1.2 of the ICEDIG project (https://www.icedig.eu), we first briefly reviewed state-of-the-art approaches to small insect digitisation. Then recent, promising technological advances in imaging were presented, some of which have not yet been used for insect digitisation. It seems that no single approach will be enough to digitise all insect collections efficiently; the approach has to be optimized based on the features of the specimens and their associated labels. To obtain a breakthrough in insect digitisation, it is necessary to combine existing and new technologies in novel workflows. To explore the options, we identified the following six approaches for digitising pinned insects with minimal manipulation of labels.</p>
1. Minimal labels: Image selected individual specimens without removing labels from the pin, using two cameras. This method is suitable for small insects with only one or a few well-spaced labels.
2. Multiple webcams: Similar to the minimal labels approach, but with multiple webcams at different positions. This has been implemented in a prototype system with 12 cameras (Hereld et al. 2017) and in the ALICE system with six DSLR cameras (Price et al. 2018).
3. Imaging of units: Similar to the multiple webcams approach, but imaging the entire unit (“units” are small boxes or trays contained in the drawers of collection cabinets, and are used in most major insect collections).
4. Camera on robot arm: Image the individual specimen or the unit with a camera mounted on a robot arm to capture a large number of images from different views.
5. Camera on rails: Similar to the robot arm approach, but with the camera mounted on rails to capture the unit. A 3D model of the insects and/or units can be created, and the labels then extracted. This is being prototyped by the ENTODIG-3D system (Ylinampa and Saarenmaa 2019).
6. Terahertz time-gated multispectral imaging: Image the individual specimen with terahertz time-gated multispectral imaging devices.
<p>Experiments on the selected approaches 2 (multiple webcams) and 5 (camera on rails) are in progress, and preliminary results will be presented.

Abstract: Camera traps have existed since the 1890s (Kucera and Barrett 2011), but they were not widely used until the introduction of commercial infrared-triggered cameras in the early 1990s (Meek et al. 2014). Since then, millions, perhaps billions, of camera trap images have been collected for many reasons, biodiversity monitoring being one of the key applications. Unfortunately, although there are camera trap deployments all over the world, these operations occur in isolation, limiting the impact they could have on a global understanding of biodiversity health. Even within individual institutions, managing and analyzing multiple camera trap deployments in aggregate can be challenging. In fact, managing even a single deployment of camera traps is non-trivial, and important data are frequently cast aside as bycatch, left unanalyzed on decaying hard drives.</p>
<p>Wildlife Insights attempts to overcome these hurdles by providing camera trap data upload, management, and analysis services. It provides the world’s largest database of camera trap images by bringing together the camera trapping efforts of several of the world’s largest conservation and research organizations, and it is open to future contributors. Artificial-intelligence-driven services sit at the heart of the platform. New camera trap data uploads are automatically analyzed to differentiate between images with people, with non-human animals, and with no animals. The images with non-human animals are further analyzed to detect specific species. The proposed labels are sent back to the submitter for review and then uploaded to the database. All uploaded images, unless specifically embargoed, are immediately available for analysis by all users of the system. A selection of tools is provided to support analyses of global biodiversity.</p>
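<p>Schematically, the pipeline described above is a two-stage cascade: a triage model separates blank, human and animal images, and a species classifier runs only on the animal images. In the sketch below the two "models" are random placeholders, since the platform's actual models and labels are not described here.</p>
<pre><code>
# Illustrative two-stage camera trap cascade with placeholder models.
import random

random.seed(7)

def stage1_triage(image_id: str) -> str:
    return random.choice(["blank", "human", "animal"])  # placeholder model

def stage2_species(image_id: str):
    species = random.choice(["Panthera onca", "Dasyprocta punctata"])
    return species, random.uniform(0.5, 1.0)            # placeholder model

def process_upload(image_ids: list) -> list:
    """Return proposed labels to send back to the submitter for review."""
    proposals = []
    for image_id in image_ids:
        label = stage1_triage(image_id)
        if label == "animal":
            species, conf = stage2_species(image_id)
            proposals.append({"image": image_id, "proposal": species,
                              "confidence": round(conf, 2)})
        else:
            proposals.append({"image": image_id, "proposal": label,
                              "confidence": None})
    return proposals

for p in process_upload(["IMG_0001", "IMG_0002", "IMG_0003"]):
    print(p)
</code></pre>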
<p>This presentation will describe Wildlife Insights and its AI implementation in detail, contextualized by case studies using analyses of the data currently stored on the platform. Challenges around integrating camera trap data within the platform and with other external services that work with the platform will also be discussed. The talk will end with some thoughts about future directions for the AI services, especially with regards to integration with related platforms.

Abstract: Featuring a large variety of ecosystems, abundant freshwater and forest resources, unique extensive karstic systems, and a high level of biodiversity and endemism, Southeast Europe (SEE) plays a crucial role in the conservation of biodiversity in Europe and beyond. In order to conserve and sustainably use these biodiversity assets and valuable natural resources, a regional concerted approach in the field of biodiversity information management and reporting (BIMR) has been strengthened. This has enabled improvement in access, transparency and exchange of biodiversity data and reporting processes among the participating economies.</p>
<p>Significant and visible progress among SEE economies and stakeholders is due to the knowledge gained about regional and national BIMR baselines; the minimum Convention on Biological Diversity (CBD) and European Union (EU) requirements on BIMR agreed and elaborated among stakeholders; and the BIMR tools implemented (e.g., a regionally unified fundamental database for the Information System for Nature Conservation (ISNC), for instance in Montenegro (http://zasticenapodrucja-cg.tk//en), Bosnia and Herzegovina/entity of Republika Srpska (<u>http://e-priroda.rs.ba/en/</u>) and the entity of the Federation of Bosnia and Herzegovina, and in North Macedonia (the Standard Data Form (SDF) application for NATURA 2000); and a compiled dataset on five taxonomic groups of endemic taxa using the Darwin Core standard). As a result, BIMR activities and priorities from the region have become more evident and better supported, and ownership of the BIMR tools has been acquired by the partner institutions and recognized at the global level through the Global Biodiversity Information Facility (GBIF).

Abstract: In the region of Southeast Europe (SEE), the obligation to establish and maintain information systems for nature conservation is scarcely mentioned in national legislation and is not adequately covered in legislative documents. There is therefore a great need for a more detailed regional policy paper consisting of a set of measures and a template regulation. A set of measures was proposed and agreed upon among Biodiversity Information Management and Reporting (BIMR)* Regional Platform members, and prepared so as to be feasible, clear, resourceful and adjusted to national circumstances, and thus easier to implement. The regulation tackles all information system aspects in order to improve reporting processes towards the Convention on Biological Diversity (CBD, https://www.cbd.int/mechanisms) and other relevant conventions, covering, e.g., exchange and provisioning of data, access and usage rights, technical and functional requirements/standards, and compliance with relevant international standards and European Union (EU) directives such as the EU INSPIRE Directive (Infrastructure for Spatial Information in Europe, https://inspire.ec.europa.eu) and the Birds (http://ec.europa.eu/environment/nature/legislation/birdsdirective/index_en.htm) and Habitats Directives (http://ec.europa.eu/environment/nature/legislation/habitatsdirective/index_en.htm). The capacities and skills of BIMR Regional Platform partner institutions are utilized, while other available policy and strategic documents are used to emphasize BIMR priorities in the BIMR policy paper. Stakeholders have the opportunity to express national data gaps and needs through a questionnaire, from which BIMR priorities are selected at the regional level and presented in a proposed set of measures and regulation. Consultative meetings of the BIMR Regional Platform are used for drafting and preparing the document in a form ready for endorsement. The BIMR policy paper will be delivered to the Biodiversity Taskforce (BD TF, https://www.rcc.int/docs/443/biodiversity-task-force-of-south-east-europe--technical-and-advisory-body-of-the-regional-working-group-on-environment), an intergovernmental technical and advisory body of the Regional Working Group on Environment (RWGE), which coordinates regional activities, facilitates the implementation of the SEE 2020 Strategy (https://www.rcc.int/pages/86/south-east-europe-2020-strategy) and creates a framework for more efficient implementation of biodiversity policies in the framework of accession to the EU. As a final outcome, the BD TF will report on the BIMR policy paper to the RWGE for further endorsement.</p>
<p>Main result: Cooperation between economies is strengthened, and their willingness to implement EU standards and fulfill international obligations is fostered, by improving the capacities and skills of partner institutions for an active regional exchange, including learning, knowledge transfer and shared practices. This regional paper enables BIMR issues to be amplified in national legislation by improving the decision-making processes of stakeholders in their own institutions and the reporting of progress towards international biodiversity agreements.</p>
<p>The BIMR Regional Platform is a consultative technical group which represents focal points from the Ministries of Environment, Environmental Protection Agencies and Institutes for Nature Conservation from SEE and Croatia. It facilitates consultants' work at the national and regional levels; communicates and disseminates information on BIMR activities in the respective institutions and other biodiversity-relevant sectors and initiatives; verifies and presents BIMR deliverables; and mobilizes institutional, scientific and technical networks in support of BIMR activities.

Abstract: Citizen science is well known as a very efficient means of collecting large amounts of data at a global scale. However, even if it seems nice to collect observations about flowering plants and singing birds, people living in today’s world need to understand that the global biodiversity crisis is here to stay. We need to move past the human-sensor paradigm and learn to incorporate the general public into the entire research process. We need to move from cheap data labour to truly empowered citizen scientists, and realise that stakeholders may not have complex scientific questions but still have questions about their environment. We need to move from citizen science to participatory science (Hinckson et al. 2017, Katapally 2019, Poncet and Turcati 2017) if we want to tackle the challenges we will be facing in the coming years.</p>
<p>Natural Solutions has developed a number of gamified citizen science applications in the past (ecoBalade, Biolit, <em>Sauvage de ma rue</em>, <em>INPN espèces</em>, GeoNature Citizen), through which we have gained a good understanding of what works. Our latest project is to create a citizen-action mobile platform that uses cognitive biases to nudge citizens into acting for biodiversity. The application will be part of the IUCN congress taking place in Marseille in 2020.

Abstract: GBIF France, the French node of the Global Biodiversity Information Facility, has been hosted by the National Museum of Natural History (MNHN) since 2006 and is actively involved in various national and international projects related to data mobilization.</p>
<p>The INPN (National Inventory of Natural Heritage) information facility, hosted since 2005 by the MNHN, lists the ecological, faunistic, floristic, geological, mineralogical and paleontological data of France and its overseas territories. This system has been recognized since 2012 as the national platform of the French Nature and Landscape Information System.</p>
<p>In January 2017, GBIF France and the INPN team were brought together in the Mixed Unit of Natural Heritage Services (UMS PatriNat), which provides expertise and knowledge management missions under the supervision of the French Agency for Biodiversity (AFB), the National Center for Scientific Research (CNRS) and the MNHN.</p>
<p>In order to support both systems and enhance biodiversity data quality, a flow of occurrence and taxonomic data has been established between GBIF and the INPN. In July 2018, the latest delivery of French data coming from the INPN marked the one billionth occurrence record in the GBIF network. Data from French territories connected to GBIF will be included in the INPN in 2019.</p>
<p>In this poster, we will explain the detailed process of data exchange between the two platforms, as well as the protocols, the standards and the tools used for data validation, data transformation, data dissemination and data update.

Abstract: Honey is a naturally sweet and viscous product to which the addition of any substance is prohibited by international regulation. Detection of adulteration in honey is a technical problem: adulteration of honey with invert sugar and syrup may not be reliably detected by direct sugar analysis, because these constituents are identical to the major natural components of honey. It is therefore important to develop a rapid and reliable analytical method to detect such additions. We used near-infrared spectroscopy (NIR) combined with principal component analysis (PCA) and artificial neural network (ANN) modelling to discriminate between honey and corn syrup in adulterated honey. Fifteen honey samples from north-west Croatia (Krapina-Zagorje County) were intentionally supplemented with differing proportions of corn syrup, ranging from 10% to 90%. We collected a total of 460 NIR spectra using a Control Development NIR128L-1.7 spectrophotometer (Control Development, South Bend, Indiana, USA) with its Spec32 software and an HL-2000 halogen light source. For each of the prepared samples, we measured water content by refractometer (Brouwland, Belgium), conductivity by conductometer (SevenCompact, Mettler Toledo, Switzerland), and colour using a PCE-CSM3 colorimeter (PCE Instruments, Germany).</p>
<p>Prior to ANN modelling, PCA was used to identify patterns and highlight similarities and differences in the data of each experimental set. The goal of PCA is to extract the important information from a data table and to express it as a set of new orthogonal variables called principal components or factors (PCs or Fs). We conducted PCA of the raw spectra using the Unscrambler<sup>®</sup> X 10.4 software (CAMO Software, Norway). Data were divided into ANN model training, test, and validation datasets at a 70:15:15 ratio, using the first five PCs as inputs. ANNs were calibrated using the training data and evaluated using the test and validation datasets for their ability to predict: i) the amount of adulterant added to the honey, ii) water content, iii) conductivity and iv) colour of the adulterated honey. Multilayer perceptron (MLP) networks were developed in Statistica v.10.0 software (StatSoft, Tulsa, USA). The back-propagation algorithm available in Statistica v.10.0 was applied for model training. Model performance was evaluated using <em>R</em><sup>2</sup> and root mean squared error (RMSE) values for the training, test, and validation datasets.</p>
<p>Results show that the MLP 5-8-6 network, with five neurons in the input layer, eight in the hidden layer and six in the output layer, predicts the analysed output variables with high precision (<em>R</em><sup>2</sup><sub>validation,concentration </sub>= 0.995, <em>R</em><sup>2</sup><sub>validation,water content </sub>= 0.993, <em>R</em><sup>2</sup><sub>validation,conductivity </sub>= 0.992, <em>R</em><sup>2</sup><sub>validation,L </sub>= 0.939, <em>R</em><sup>2</sup><sub>validation,a </sub>= 0.895, <em>R</em><sup>2</sup><sub>validation,b </sub>= 0.924).
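<p>A hedged reconstruction of this pipeline using scikit-learn (in place of the original software) might look as follows; the spectra and target values are synthetic stand-ins, so the printed scores will not reproduce the reported values.</p>
<pre><code>
# Hedged sketch: PCA on NIR spectra, first five PCs into a 5-8-6 MLP.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
spectra = rng.normal(size=(460, 128))  # 460 spectra x 128 wavelengths (fake)
targets = rng.normal(size=(460, 6))    # syrup %, water, conductivity, L, a, b

pcs = PCA(n_components=5).fit_transform(spectra)  # five PCs as MLP inputs

# 70:15:15 split into training, test and validation sets.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    pcs, targets, test_size=0.30, random_state=0)
X_te, X_va, y_te, y_va = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=0)

mlp = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
mlp.fit(X_tr, y_tr)                    # 5 inputs -> 8 hidden -> 6 outputs

print("validation R^2 per output:",
      r2_score(y_va, mlp.predict(X_va), multioutput="raw_values").round(3))
</code></pre>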

Abstract: The Biodiversity Map is a long-term project of the Polish Biodiversity Information Network (KSIB) aimed at the integration, presentation and management of comprehensive scientific data about species. The website (www.biomap.pl) was launched in 2012, following a period of extensive digitization work covering bibliographic information, specimen collections, research notes and other sources of data. Initially, the project focused on aggregating data about three insect orders reported from Poland: Coleoptera, Hemiptera and Lepidoptera. Having achieved this goal, the geographic limits were removed and the taxonomic scope is being gradually widened, currently including Araneae, Diptera, Hymenoptera, Odonata, Orthoptera and some other smaller insect orders, with a checklist of the Polish fauna intended as a starting point. So far, the system covers ca. 21,000 species concepts, including their taxonomic hierarchy and synonymy, and more than 1.1 million occurrence records with 19,000 bibliographic sources.</p>
<p>The key functionality of the toolset supports the visualization and management of links between different types of data and the related underlying sources of information, such as scientific collections, literature, taxonomy, and occurrence records. The database can be accessed through a number of views, called "perspectives", and also by spatial queries through the map server as an additional interface. This enables users to discover connections between information entities, e.g. publications based on studies from areas adjacent to a chosen locality on a map, or collections containing species covered in a publication. This approach is not common in existing systems, and we believe it supports a wide range of potential scientific uses.</p>
<p>The project database uses PostgreSQL with PostGIS for spatial queries. Two web applications are used for data presentation: the main text-based PHP browser (baza.biomap.pl) and the dynamic map, relying on jQuery, OpenLayers and MapServer (gis.biomap.pl). The latter provides users with an additional, spatial dimension of interaction with the database and direct links to the main application. Recently, a third tool was built, making it possible for users to add and edit occurrence records, taxa, publications and authors; this solution is based on a combination of PHP and JavaScript.</p>
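<p>As an illustration of the kind of spatial query this interface exposes, the sketch below retrieves occurrence records within a radius of a chosen locality via PostGIS; the table and column names are assumptions for illustration, not the project's actual schema.</p>
<pre><code>
# Hedged sketch: occurrences within 10 km of a point, via PostGIS.
import psycopg2

conn = psycopg2.connect("dbname=biomap user=reader")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT o.record_id, t.species_name, p.citation
        FROM occurrences o
        JOIN taxa t ON t.taxon_id = o.taxon_id
        LEFT JOIN publications p ON p.pub_id = o.source_pub_id
        WHERE ST_DWithin(
            o.geom::geography,
            ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
            %s)  -- radius in metres
        """,
        (21.0122, 52.2297, 10000),
    )
    for record_id, species, citation in cur.fetchall():
        print(record_id, species, citation)
</code></pre>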
<p>The data held within the system are planned to be connected to the Global Biodiversity Information Facility and thus opened to a broad international community. The main obstacle hindering this step is the limited resources available to improve and scale up the database and software, as well as for the efforts needed to mobilize and organize the data. We need to ascertain the optimal method for dealing with the numerous datasets derived from publications, and resolve the dilemma of whether to keep them separate or to merge them. Publications, as a source of scientific information and occurrence records, have also been one of the main drivers for building the project software as an independent solution: existing generic software packages popular in the GBIF community do not provide a direct way to link occurrence records with the scientific literature, which is essential for scientific communities, at least in Poland.

Abstract: The standardized description of collections is an important means for prioritizing collection digitization at a supra-institutional level. Differing organizational and systematic structures prevent easy comparison of collection sizes and foci, which is needed to make informed decisions on setting priorities and distributing tasks efficiently. In autumn 2018, the consortium of German Natural Sciences Collections (DCOLL), consisting of seven natural-history-collection-holding institutions integrated as a National Research Infrastructure, performed a top-level description of the consortium's collections based on a subset of the criteria defined by the One World Collection Working Group (OWC). OWC is based on an initiative of the directors of the world's largest natural science collections and aims at making collections and resource allocations comparable. Categories relating to the staff structure of institutions were omitted, as these were considered sensitive and of little use for the purpose; the survey focused on collection size and geographical distribution only. Since some partners had already assembled the necessary data, and since the OWC criteria operate at a reasonably high level, allowing the integration of heterogeneous collections, the OWC dashboard presented an opportunity to achieve quick, standardized results. However, as one purpose of the survey was to support decisions on the consortium's digitization strategy, arguably not the objective of OWC, a field describing the digitization rate had to be added. Another shortcoming of the OWC dashboard for this purpose was the difficulty of subsuming some important sub-collections under the given criteria, namely digital collections (such as animal sound archives) and non-biological or non-geological collections (e.g. historical objects and archives). On the positive side, the survey proved very helpful for indicating the consortium's collection focus in comparison with other institutions at an international level. This can provide valuable information for establishing an integrated collection development and digitization strategy on a supra-institutional level. It can be shown, for example, that 41% of the objects of European origin in natural science collections are held by DCOLL. While it may be possible to derive meaningful strategic goals from this information, it is a big challenge to implement practical objectives based on the same criteria. From a bottom-up perspective, the OWC dashboard aggregates data that were collected in non-standardized forms within institutions and across organizational structures; increasing the granularity beyond this level would require unreasonable effort. This presentation discusses the process of collecting information based on the OWC criteria and will present the collection structure of DCOLL.

Abstract: <p>The Antarctic Biodiversity portal (biodiversity.aq) is a gateway to a wide variety of Antarctic biodiversity information and tools. Launched in 2005 as the Scientific Committee on Antarctic Research (SCAR) - Marine Biodiversity Information Network (SCAR-MarBIN, scarmarbin.be) and the Register of Antarctic Marine Species (RAMS, marinespecies.org/rams/), the system has grown in scope from purely marine to include terrestrial information.</p>
<p style="margin-left:0cm; margin-right:0cm">Biodiversity.aq is a SCAR product, currently supported by Belspo (Belgian Science Policy) as one of the Belgian contributions to the European Lifewatch-European Research Infrastructure Consortium (Lifewatch-ERIC). The goal of Lifewatch is to provide access to: distributed observatories/sensor networks; interoperable databases, existing (data-)networks, using accepted standards; high performance computing (HPC) and grid power, including the use of the state-of-the-art of cloud and big data paradigm technologies; software and tools for visualization, analysis and modeling.</p>
<p style="margin-left:0cm; margin-right:0cm">Here we provide an overview of the most recent advances in the biodiversity.aq online ecosystem, a number of use cases as well as an overview of future directions. Some of the most notable components are:</p>
<ul>
The Register of Antarctic Species (RAS, ras.biodiversity.aq) is a component of the Lifewatch Taxonomic Backbone and provides an authoritative and comprehensive list of names of marine and terrestrial species in Antarctica and the Southern Ocean. It serves as a reference guide for users to interpret taxonomic literature, as valid names and other names in use are both provided.
Integrated Publishing Toolkit (IPT, ipt.biodiversity.aq) allows disseminating Antarctic biodiversity data into global initiatives such as the Ocean Biogeographic Information System (OBIS, obis.org), as the Antarctic node of OBIS (Ant-OBIS, formerly known as SCAR-MarBIN), and the Global Biodiversity Information Facility (GBIF, gbif.org), as the Antarctic Biodiversity Information Facility (AntaBIF). Data that can be made available include metadata, species checklists, species occurrence data and, more recently, sampling event-based data. Data from these international portals can be accessed through data.biodiversity.aq.
</ul>
<p style="margin-left:0cm; margin-right:0cm">Through SCAR, Biodiversity.aq builds on an international network of expert that provide expert knowledge on taxonomy, species distribution,and ecology. It provides a strong and tested platform for sharing, integrating, discovering and analysing Antarctic biodiversity information originating from a variety of sources into a distributed system.

Abstract: Biodiversity Information Serving Our Nation (BISON - bison.usgs.gov) is the US Node application for the Global Biodiversity Information Facility (GBIF) and the most comprehensive source of species occurrence data for the United States of America. It currently contains more than 460 million records and provides significant augmentation and integration of US occurrence data in terrestrial, marine and freshwater systems. Publicly released in 2013, BISON has built a large community of stakeholders, who have asked many questions over the years via email (bison@usgs.gov), at presentations and through other means. In this presentation, some of the most common questions will be addressed in detail. For example: why not all BISON data are in GBIF; how BISON differs from GBIF; what the relationship is between BISON and other US providers to GBIF; and what the exact role of the Integrated Taxonomic Information System (ITIS - www.itis.gov) is in BISON.

Abstract: The Flanders Marine Institute (VLIZ) is responsible for the set-up of the LifeWatch Taxonomic Backbone (LW-TaxBB), as a central part of the European LifeWatch Infrastructure. The LW-TaxBB aims to (virtually) bring together different component databases and data systems, all of them related to taxonomy, biogeography, ecology, genetics and literature. By doing so, the LW-TaxBB standardises species data and integrates biodiversity data from different repositories and operating facilities, and is the driving force behind the species information services of the Belgian LifeWatch.be e-Lab and of the Marine Virtual Research Environment that are being developed.</p>
<p>The mission of LifeWatch is to advance biodiversity research and to provide major contributions to address the big environmental challenges, such as knowledge-based solutions for environmental managers in the field of conservation or dealing with long-standing ecological questions that could so far not be addressed due to a lack of data or a lack of good and easy access to data. This is being achieved by giving access to data and information through a single infrastructure which (virtually) brings together a large range and variety of datasets, services and tools. Scientists can use these tools and services to construct so-called Virtual Research Environments (VREs), where they are able to address specific questions related to biodiversity research, including e.g. topics related to conservation. They are not only offered an environment with unlimited computer and data storage capacity, but there is also transparency at all stages of the research process and the generic application of the e-infrastructure opens the door towards more inter- and multidisciplinary research.</p>
<p>The LW-TaxBB – virtually – brings together different component databases and data systems, dealing with five major components: (1) taxonomy, through regional, national, European, global and thematic databases, (2) biogeography, based on databases dealing with species occurrences, (3) ecology, in the form of species-specific traits, (4) genetics and (5) literature, by linking all available information to the relevant sources and through tools that can intelligently search this literature.</p>
<p>The LifeWatch Taxonomic Backbone is a two-way street: besides using the tools and functionalities it offers – which are often developed based on needs identified within the scientific community – scientists can also contribute to making it more complete. Feedback on all available data and information (e.g. taxonomy and traits) is highly appreciated and is communicated to the experts involved in the different component databases. All distribution information collected by individual scientists can become part of the biogeographic component of this backbone, by contributing occurrence data to the system.</p>
<p>Through the LW-TaxBB, users benefit in several ways, amongst others by:</p>
<ul>
Easy access to data and information from a variety of resources
The opportunity to quality control their own data, by cross-checking with data available through the LW-TaxBB
Free and easy access to a wide range of data services and web services
Possibility to combine available services into workflows, and link several systems together
</ul>
<p>Major components of the LW-TaxBB are – amongst others – the World Register of Marine Species (WoRMS) and the European node of the Ocean Biogeographic Information System (EurOBIS). WoRMS is an authoritative classification and catalogue of marine names, currently containing 233,275 accepted marine species. EurOBIS publishes distribution data on marine species, collected within European marine waters or collected by European researchers outside European marine waters, and currently contains 24.8 million distribution records. Both these systems have strong links and collaboration agreements with international initiatives such as the Catalogue of Life (CoL), the Ocean Biogeographic Information System (OBIS) and the Global Biodiversity Information Facility (GBIF), and aim to collaborate with other ESFRIs such as DiSSCo and ELIXIR.

Abstract: The DINA Consortium (“DIgital information system for NAtural history data”, https://dina-project.net) was formed in order to provide a framework for like-minded large natural history collection-holding institutions to collaborate through a distributed Open Source development model to produce a flexible and sustainable collection management system. Target collections include zoological, botanical, mycological, geological and paleontological collections, living collections, biodiversity inventories, observation records, and molecular data. DINA is funded by the participating member institutions. DINA Core Members are organizations or individuals who commit at least one half-time equivalent of resources to the development of the consortium goals, at least half of which should be available for code development.</p>
<p>The DINA system is architected as a loosely-coupled set of several web-based modules. The conceptual basis for this modular ecosystem is a compilation of comprehensive guidelines for Web application programming interfaces (APIs) to guarantee the interoperability of its components. Thus, all DINA components can be modified or even replaced by other components without crashing the rest of the system as long as they are DINA compliant. Furthermore, the modularity enables the institutions to host only the components they need. DINA focuses on an Open Source software philosophy and on community-driven open development, so the contributors share their development resources and expertise outside of their own institutions.</p>
<p>One of the overarching reasons to develop a new collection management system is the need to better model complex relationships between collection objects (typically specimens), research data and associated workflows. We will present the enhancements provided by the approach of the DINA system focussing on the flexibility to plug in compliant components and accommodate additional (meta-)data and specimen related research data with the help of a generic data module.</p>
<p>Furthermore, we will discuss challenges in the governance of the development activities such as organizing the distributed code development of the core modules, the code review process and the choice of the software stack. These organizational challenges will be overcome with the help of a revised Memorandum of Understanding.

Abstract: Integrated Digitized Biocollections (iDigBio) is the United States’ (US) national resource and coordinating center for biodiversity specimen digitization and mobilization. It was established in 2011 through the US National Science Foundation’s (NSF) Advancing Digitization of Biodiversity Collections (ADBC) program, an initiative that grew from a working group of museum-based and other biocollections professionals working in concert with NSF to make collections' specimen data accessible for science, education, and public consumption. The working group, Network Integrated Biocollections Alliance (NIBA), released two reports (Beach et al. 2010, American Institute of Biological Sciences 2013) that provided the foundation for iDigBio and ADBC.</p>
<p>iDigBio is restricted in focus to the ingestion of data generated by public, non-federal museum and academic collections. Its focus is on specimen-based (as opposed to observational) occurrence records. iDigBio currently serves about 118 million transcribed specimen-based records and 29 million specimen-based media records from approximately 1600 datasets. These digital objects have been contributed by about 700 collections representing nearly 400 institutions, making iDigBio the most comprehensive biodiversity data aggregator in the US.</p>
<p>Currently, iDigBio, DiSSCo (Distributed System of Scientific Collections), GBIF (Global Biodiversity Information Facility), and the Atlas of Living Australia (ALA) are collaborating on a global framework to harmonize technologies towards standardizing and synchronizing ingestion strategies, data models and standards, cyberinfrastructure, APIs (application programming interface), specimen record identifiers, etc. in service to a developing consolidated global data product that can provide a common source for the world’s digital biodiversity data. The collaboration strives to harness and combine the unique strengths of its partners in ways that ensure the individual needs of each partner’s constituencies are met, design pathways for accommodating existing and emerging aggregators, simultaneously strengthen and enhance access to the world’s biodiversity data, and underscore the scope and importance of worldwide biodiversity informatics activities. Collaborators will share technology strategies and outputs, align conceptual understandings, and establish and draw from an international knowledge base.</p>
<p>These collaborators, along with Biodiversity Information Standards (TDWG), will join iDigBio and the Smithsonian National Museum of Natural History as they host Biodiversity 2020 in Washington, DC. Biodiversity 2020 will combine an international celebration of the worldwide progress made in biodiversity data accessibility in the 21<sup>st</sup> century with a biodiversity data conference that extends the life of Biodiversity Next. It will provide a venue for the GBIF governing board meeting, TDWG annual meeting, and the annual iDigBio Summit as well as three days of plenary and concurrent sessions focused on the present and future of biodiversity data generation, mobilization, and use.

Abstract: Over the last decade, the Integrated Digitized Biocollections (iDigBio) organization and the Advancing the Digitization of Biodiversity Collections (ADBC) grant program, both funded by the US National Science Foundation (NSF), have made large strides in the aggregation of pre-existing siloed digital collections data as well as the new digitization of previously dark collections data across the United States. The impact of iDigBio leadership in community engagement (e.g., through discipline-specific workshops and webinars) and data mobilization (e.g., aggregation assistance, portal development) is widespread and with impact across all collection types and sizes. Moreover, the funding model for the ADBC program, which required the development of digitization-based Thematic Collection Networks (TCNs), facilitated engagement and community building across collections, which previously often worked independently from one another or with a smaller group of institutions and/or collaborators. The attempt to create ever-growing biodiversity data aggregators to improve global research access to digital biodiversity data has made huge progress over the past decade and has resulted in increased availability of biodiversity data from fewer, larger data stores. It has also motivated unselfish collaboration between major aggregators in search of strategies for merging these data silos into a consolidated global data product. We describe an ongoing collaboration between the Global Biodiversity Information Facility (GBIF), The Atlas of Living Australia (ALA), Integrated Digitized Biocollections (iDigBio), and the Distributed System of Scientific Collections (DiSSCo) to establish a global framework for integrating technologies, processes, standards, Application Programming Interfaces (APIs), ingestion, data, and data services, with the goal of building a well-documented linked system that relies on the various areas of expertise of the initial partners but with definitive pathways for incorporating new and existing entities as they desire or are developed.</p>
<p>We use the case of paleontological data as an exemplar of the potential impact of this collaboration. The iDigBio Paleontology Digitization Working Group, which was originally created by iDigBio as part of their community engagement program, has continued to be an active and engaged community of data providers and end-users, organizing numerous workshops and webinars. Currently, working group members, in collaboration with iDigBio staff and developers, are examining issues specific to paleontologic data aggregation that were identified by data providers; they are also working on a series of best-practices guidelines for sharing paleontologic data that will ideally help to reduce the number of mistakes introduced by downstream data aggregation and manipulation. The focus of the working group is, and has been, largely community driven and supported by iDigBio through the provision of virtual meeting space for participants and by hosting the group's wiki-page of resources. Additionally, iDigBio has been proactive in working with other digitization initiatives in the paleontologic community (e.g., Paleobiology Database) on projects such as ePANDDA (enhancing Paleontological and Neontological Data Discovery API), which seeks to link existing digital resources through API development.

Abstract: Studying deep-time biodiversity and environments is largely based on collections of fossils and sedimentary rocks, and on the information acquired from them. The sedimentary bedrocks of Estonia and neighbouring areas constitute a well-preserved archive of Earth history from the late Precambrian to the Devonian period. This interval of geological time hosts several key events in the diversification of life, notably the Cambrian explosion, the Great Ordovician Biodiversification Event and the Hirnantian mass extinction. Documenting and understanding these events has benefited from the geological and paleontological collections of the Baltic region, a large part of which are deposited in Estonia.</p>
<p>Since 2004 Estonia has had a 'national geological collection' that virtually joins the archives of three major collection-holding institutions: Tallinn University of Technology, University of Tartu and the Estonian Museum of Natural History (Hints et al. 2008). A key to the functioning of this national consortium is the common database system 'SARV', which started as a simple collection management tool but has grown into a geoscience data platform linking various types of geoscientific information and also supporting the needs of researchers. Technically, the system is based on a relational data model and a central database server, a REST API and a number of web-based user interfaces, from data management tools to public portals and more specialized applications. Individual components of the system are now built on open source software including MySQL, Apache Solr, the Django REST framework, and the Angular and Vue JavaScript frameworks. The data model and all recently developed software are available in a Github repository (https://github.com/geocollections).</p>
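As an illustration of how such a REST API can be consumed programmatically, the minimal sketch below queries a specimen search endpoint; the endpoint path, parameters and response fields are assumptions made for this example, not the documented SARV API:

```python
# A minimal sketch of consuming a geocollections-style REST API.
# Endpoint path, parameters and response shape are assumed for illustration;
# consult the actual API documentation before use.
import requests

BASE = "https://api.geocollections.info"  # assumed base URL

def search_specimens(taxon, limit=5):
    """Query the (assumed) specimen endpoint for a taxon name."""
    resp = requests.get(f"{BASE}/specimen/",
                        params={"taxon": taxon, "paginate_by": limit},
                        timeout=30)
    resp.raise_for_status()
    return resp.json().get("results", [])

for specimen in search_specimens("Chitinozoa"):
    print(specimen.get("id"), specimen.get("locality"))
```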
<p>Data on individual fossil specimens, digital images, localities, regional stratigraphic units, rock samples, datasets, published references, field notebooks etc. are publicly accessible in the Estonian geoscience collections portal (https://geocollections.info). A separate gateway provides access to information on fossil taxa and their distribution in the Baltic region (https://fossiilid.info). Another example of using the same underlying data platform specifically for paleobiodiversity research is the Baltic chitinozoan database CHITDB (http://chitinozoa.net; Hints et al. 2018). Chitinozoans are an enigmatic group of Paleozoic microfossils, very useful in biostratigraphy. Some of the largest collections of these fossils worldwide derive from the Baltica paleocontinent and are deposited in Estonia. The chitinozoan portal was developed for managing and publishing occurrence-level data on chitinozoans, and for quantitatively analysing their diversification history and biotic crises through the Ordovician and Silurian periods. The main benefit of using such an integrated data system is that a user can easily return to individual samples and specimen images (for instance, to verify identifications), and combine the paleontological data with information about past environments and climate that might derive from publications, first-hand geochemical data or even from descriptions in field notebooks. Global tools, such as the Paleobiology Database, cannot provide such functionality for the time being.</p>
<p>The next steps in enhancing the national geoscience data platform in Estonia are related to the development of new data collection and publication modules, building a complete digital library of geoscience publications related to Estonia and widening the user base of the system. Participation in the national research infrastructure roadmap project NATARC as well as the Pan-European DiSSCo will support achieving this and safeguarding the sustainability of geoscience data and corresponding e-services in Estonia.

Abstract: Widespread interest in the study of metabarcoding has resulted in data proliferation and the development of a multitude of powerful computational tools. Yet consistent and reproducible interpretation of the data remains challenging. The integration of different data types, software tools, and analytical parameters poses a barrier to scaling research. Further, though the majority of the necessary tools for performing these analyses are already implemented, there is limited support for high-throughput analysis due to the requirement for heavy computational capacity. As a result of these complexities, many researchers lack the time, training, or infrastructure to work with larger datasets.</p>
<p>mBRAVE, the Multiplex Barcode Research And Visualization Environment, is a cloud-based data storage and analytics platform with standardized pipelines and a sophisticated web interface for transforming raw high-throughput sequencing (HTS) data into biological insights. mBRAVE integrates common analytical methods and links to the Barcode of Life Data (BOLD) System for reference datasets, presenting users with the ability to analyze large volumes of data, without requiring special technical training. mBRAVE's cloud architecture provides centralized and automated storage and compute capacity, thereby reducing the burden on individual researchers.</p>
<p>The mBRAVE platform seeks to alleviate the main informatic challenges faced by the metabarcoding research community: the storage and consistent interpretation of HTS data. It is now available for researcher use at www.mbrave.net.

Abstract: Understanding natural communities and ecosystems, and the services they provide to humanity, is highly dependent on knowledge about species composition and diversity through space and time. This is especially difficult in aquatic systems, where traditional census methods provide species compositions that are usually truncated, since rare species tend to go undetected. Detecting rare species is important because they are either threatened or invasive species at the earliest stage of invasion. One recent approach allowing the detection of rare species uses environmental DNA (eDNA), present in water or soil, as traces of their existence.</p>
<p>Here we propose to make use of recent technological developments in the area of high throughput sequencing to characterize freshwater fish communities and detect rare species, using a combination of eDNA metabarcoding and bulk eDNA metagenomics. A case-study will be conducted on the River Tagus (Portugal), which is inhabited by several rare fish species including both native and introduced taxa. In addition, the applicability of eDNA metagenomics for estimating the genetic diversity of populations will be assessed by comparing the results against those produced by traditional genetic screening of individual fish samples.

Abstract: In the context of the French law for the reconquest of biodiversity (Legifrance 2016), public and private stakeholders must share environmental impact assessment data as open data with the French National Inventory of the Natural Heritage (Muséum national d'Histoire naturelle 2019). In order to achieve this, the Information System for Nature and Landscape (SINP) provided standards and guidelines for protocols, taxonomy, and metadata in order to comply with the FAIR (Findability, Accessibility, Interoperability, Reusability; Wilkinson et al. 2016) concept of data management. However, private institutions, which must run environmental impact assessments, can be confused by the number of technical details and the high level of data literacy needed to comply with these standards. Here, we will present several tools (GeoNature 2019, Natural Solutions 2019) that we are currently developing to facilitate the conversion and export of raw biodiversity data using SINP standards (Jomier et al. 2018). Although the SINP and Darwin Core (Wieczorek et al. 2012) standards share common concepts and properties, SINP standards focus on data reusability in the framework of French environmental programs, resulting in the creation of specific mandatory attributes (Chataigner et al. 2014). Our tools perform extract, transform and load (ETL) operations as well as RDF (Resource Description Framework) exports using an ad-hoc ontology adapted to the specificities of the SINP standard. Finally, we observed that despite the success of the process (after one year, nearly one thousand datasets are available on the SINP web platform), several issues still need to be addressed, including data quality issues, which could hamper data reuse by stakeholders.
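A minimal sketch of the ETL-to-RDF idea using rdflib; the namespace and property names below are invented stand-ins for the SINP ontology, and the input row is hypothetical:

```python
# Lift one raw occurrence row into RDF triples under a placeholder namespace.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

SINP = Namespace("http://example.org/sinp#")  # placeholder, not the real ontology

row = {"id": "obs-001", "taxon": "Lutra lutra",
       "date": "2018-05-04", "municipality": "Gap"}

g = Graph()
g.bind("sinp", SINP)
obs = URIRef(f"http://example.org/obs/{row['id']}")
g.add((obs, RDF.type, SINP.Observation))
g.add((obs, SINP.taxonName, Literal(row["taxon"])))
g.add((obs, SINP.observationDate, Literal(row["date"], datatype=XSD.date)))
g.add((obs, SINP.municipality, Literal(row["municipality"])))

print(g.serialize(format="turtle"))
```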

Abstract: Microbial organisms - including Archaea, Bacteria and unicellular Eukaryota - collectively dominate the Earth in terms of bio- and functional diversity. Their study, often constrained by technology, has strongly benefited from recent advancements in high-throughput DNA sequencing techniques. The vast amounts of microbial data generated in the wake of these developments, however, remain severely underrepresented in open access biodiversity data repositories (e.g. the Global Biodiversity Information Facility; GBIF). Moreover, when sequencing data have been made publicly available, they are often poorly annotated with metadata and environmental variables, making them difficult to find or query. Therefore, the microbial Antarctic Resource System (mARS) aims to fill this lacuna by documenting and geo-referencing microbial datasets and by linking the sequence data in the International Nucleotide Sequence Database Collaboration (INSDC) repositories with the associated environmental measurements on mARS, which is intended to be interoperable with both INSDC and GBIF. In this way, mARS helps to preserve environmental data and the metadata that are crucial for the correct processing and interpretation of sequence data, while it also connects researchers via its web portal to the existing wealth of molecular information, and allows these datasets to be accessed more effectively. Given the general complexity of microbial ecological datasets, mARS needs to operate between different data archiving standards, such as MIxS (see https://press3.mcs.anl.gov/gensc/mixs/), which is oriented towards DNA sequence data, and the biodiversity-based Darwin Core standard.</p>
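To make the gap between the two standards concrete, here is a minimal sketch crosswalking a MIxS-annotated sample to Darwin Core occurrence terms; the mapping choices and sample values are illustrative, not mARS's own:

```python
# Map a few MIxS fields onto Darwin Core terms; MIxS packs both coordinates
# into a single lat_lon field, which Darwin Core splits in two.
MIXS_TO_DWC = {
    "geo_loc_name": "locality",
    "collection_date": "eventDate",
    "depth": "verbatimDepth",
}

def mixs_to_dwc(sample):
    occurrence = {dwc: sample[mixs]
                  for mixs, dwc in MIXS_TO_DWC.items() if mixs in sample}
    if "lat_lon" in sample:
        lat, lon = sample["lat_lon"].split()[:2]
        occurrence["decimalLatitude"] = float(lat)
        occurrence["decimalLongitude"] = float(lon)
    return occurrence

sample = {"lat_lon": "-70.6667 -68.5833", "collection_date": "2016-01-12",
          "geo_loc_name": "Antarctica: Alexander Island"}
print(mixs_to_dwc(sample))
```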
<p>Currently, mARS tries to address the challenges of integrating microbial data with these existing systems, as well as connecting with the communities behind them, by documenting the datasets using GBIF's extensions and by investigating the feasibility of routinely processing raw sequence data into occurrence datasets using the open computing facilities offered by the European Molecular Biology Laboratory's (EMBL) MGnify resource.

Abstract: Major efforts are being made to digitize natural history collections and make these data available online for retrieval and analysis (Beaman and Cellinese 2012). Georeferencing, an important part of the digitization process, consists of obtaining geographic coordinates from a locality description. For many natural history specimens, the coordinates of the sampling location were not recorded; instead, the labels contain a description of the site. Inaccurate georeferencing of sampling locations negatively impacts data quality and the accuracy of any geographic analysis of those data. In addition to latitude and longitude, it is important to define the degree of uncertainty of the coordinates, since in most cases it is impossible to pinpoint the exact location retrospectively. This is usually done by defining an uncertainty value represented as a radius around the center of the locality where the sampling took place.</p>
<p>Georeferencing is a time-consuming process requiring manual validation; as such, a significant part of all natural history collection data available online are not georeferenced. Of the 161 million records of preserved specimens currently available in the Global Biodiversity Information Facility (GBIF), only 86 million (53.4%) include coordinates. It is therefore important to develop and optimize automatic tools that allow a fast and accurate georeferencing.</p>
<p>The objective of this work was to test existing automatic georeferencing services and evaluate their potential to accelerate the georeferencing of large collection datasets. Several georeferencing services that provide an application programming interface (API) for batch georeferencing are currently available; we evaluated five of them: Google Maps, MapQuest, GeoNames, OpenStreetMap, and GEOLocate. A test dataset of 100 records (reference dataset), which had previously been individually georeferenced following Chapman and Wieczorek 2006, was randomly selected from the <em>Museu Nacional de História Natural e da Ciência</em>, <em>Universidade de Lisboa</em> insect collection catalogue (Lopes et al. 2016). An R (R Core Team 2018) script was used to georeference these records using the five services. In cases where multiple results were returned, only the first one was considered and compared with the manually obtained coordinates of the reference dataset (a minimal sketch of this comparison is shown after the list below). Two factors were considered in evaluating accuracy:</p>
Total number of results obtained and
Distance to the original location in the reference dataset.
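A minimal sketch of this comparison, assuming one of the tested services (OpenStreetMap's Nominatim geocoder) and a hypothetical reference record; the study itself used an R script, so this Python version is purely illustrative:

```python
# Georeference a locality string with Nominatim (OpenStreetMap) and measure
# the distance to manually obtained reference coordinates.
import math
import requests

def georeference(locality):
    """Return (lat, lon) for the first Nominatim match, or None."""
    resp = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": locality, "format": "json", "limit": 1},
        headers={"User-Agent": "georeferencing-evaluation-demo"},
        timeout=30,
    )
    results = resp.json()
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371000.0 * math.asin(math.sqrt(a))

# Hypothetical reference record: a manually georeferenced locality with an
# uncertainty radius in metres, as in the reference dataset.
record = {"locality": "Sintra, Lisboa, Portugal",
          "lat": 38.7992, "lon": -9.3882, "uncertainty_m": 5000}

match = georeference(record["locality"])
if match is not None:
    dist = haversine_m(record["lat"], record["lon"], *match)
    print(f"distance: {dist:.0f} m; "
          f"within uncertainty radius: {dist <= record['uncertainty_m']}")
```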
<p>Of the five programs tested, Google Maps yielded the most results (99) and was the most accurate, with 57 results &lt; 1000 m from the reference location and 79 within the uncertainty radius. GEOLocate provided results for 87 locations, of which 47 were within 1000 m of the correct location, and 57 were within the uncertainty radius. The other three services tested all had fewer than 35 results within 1000 m of the reference location, and fewer than 50 results within the uncertainty radius. Google Maps and OpenStreetMap had the lowest average distance from the reference location, both around 5500 m. Google Maps has a usage limit of around 40,000 free georeferencing requests per month, beyond which the service is paid, while GEOLocate is free with no usage limit. For large collections, this may be a factor to take into account.</p>
<p>In the future, we hope to optimize these methods and test them with larger datasets.

Abstract: Biodiversity-related studies in the northern part of West Siberia are relatively recent, coinciding with the intensive industrial development of the region in recent decades. The region possesses few biological collections, held within universities and nature reserves. Still, the Department of Natural Resources pays considerable attention to the sustainable use of natural resources. On the global scale, the success of biodiversity informatics goals largely depends on local initiatives and progress in data mobilization and sharing. Therefore, the organization of regional biodiversity portals is important to promote data mobilization, education and citizen science on a local scale.</p>
<p>There was previously little experience with biodiversity information systems in the region. A program to digitize observations of Red Listed species was launched in 2010 with the support of the Department of Natural Resources of Yugra. The information system for Red Listed species registrations was developed through this project and currently includes about three thousand observations. Another example of digitization in Western Siberia was developed by the biological collection of Yugra State University: its database is based on the database management system Specify and is available online through its web portal (http://bioportal.ugrasu.ru). Some collections of nature reserves have their catalogues in digital form. The need for biodiversity data mobilization is well understood and is discussed at regular workshops on biological collections management held in Khanty-Mansiysk.</p>
<p>Recently, the biologists curating several biological collections in the region started a project on a regional biodiversity portal development (https://nwsbios.org). The portal has three major components:</p>
the database of collections based on Specify software (http://bioportal.ugrasu.ru),
the metadata of different sources of biodiversity information in the region,
an educational platform for learning biodiversity informatics, using data published via GBIF and DwC standards.
<p>This initiative for biodiversity data mobilization in the region includes the organization of workshops, discussions and newsletters, helping to reach potential data holders and coordinate work. Through this work, four different organizations from the Khanty-Mansi region have registered accounts in GBIF since 2019 and have started uploading data to the GBIF portal. At present there are about 25,000 observations mobilized in GBIF from the Khanty-Mansi and Yamalo-Nenets regions.</p>
<p>The integrated massive publishing of data in the portal will provide new opportunities for biodiversity research and sustainable management of nature resources in the northern part of West Siberia.

Abstract: ‘Data Quality Test and Assertions’ Task Group 2 (https://www.tdwg.org/community/bdq/tg-2/) has taken another year to clarify the 102 tests (https://github.com/tdwg/bdq/issues?q=is%3Aissue+is%3Aopen+label%3ATest). The original mandate to develop a core suite of tests that could be widely applied, from data collection to user evaluation of aggregated data, seemed straightforward. Two years down the track, we have proven that to be incorrect. Among the final tests are complexities that none of the core group anticipated, for example, the need for a definition of ‘empty’ or of the ‘Expected response’ from a test under various scenarios.</p>
<p>The record-based tests apply to Darwin Core terms (https://dwc.tdwg.org/terms/) and have been classified as of type validation (66), amendment (29), notification (3) or measure (5). Validations test one or more Darwin Core terms against known characteristics, for example, VALIDATION_MONTH_NOTSTANDARD. Amendments may be applied to Darwin Core terms where we can unambiguously offer an improvement to the record, for example, AMENDMENT_MONTH_STANDARDIZED. Notifications are made where we believe a flag will help alert users to an issue that needs evaluation, for example, NOTIFICATION_DATAGENERALIZATIONS_NOTEMPTY. Measures are summaries of test outcomes at the record level, for example, MEASURE_AMENDMENTS_PROPOSED.</p>
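As an illustration only (not the Task Group's reference implementation, whose response structure is richer), a minimal sketch of one validation and one amendment applied to a Darwin Core record:

```python
# Simplified sketch of VALIDATION_MONTH_NOTSTANDARD and
# AMENDMENT_MONTH_STANDARDIZED; response labels are simplified.
ROMAN = {"i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5, "vi": 6,
         "vii": 7, "viii": 8, "ix": 9, "x": 10, "xi": 11, "xii": 12}

def validation_month_notstandard(record):
    """COMPLIANT if dwc:month is an integer in 1..12."""
    value = record.get("month")
    if value is None or str(value).strip() == "":
        return "INTERNAL_PREREQUISITES_NOT_MET"  # 'empty' had to be defined
    try:
        return "COMPLIANT" if 1 <= int(value) <= 12 else "NOT_COMPLIANT"
    except ValueError:
        return "NOT_COMPLIANT"

def amendment_month_standardized(record):
    """Propose an unambiguous standardization of dwc:month, if possible."""
    value = str(record.get("month", "")).strip().lower()
    if value in ROMAN:  # e.g. 'IV' -> 4
        return {"status": "AMENDED", "month": ROMAN[value]}
    return {"status": "NOT_AMENDED"}

record = {"occurrenceID": "urn:example:1", "month": "IV"}
print(validation_month_notstandard(record))  # NOT_COMPLIANT
print(amendment_month_standardized(record))  # {'status': 'AMENDED', 'month': 4}
```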
<p>We note that 41 tests require some parameters to be established at the time of test implementation, 20 tests require access to a currently accepted vocabulary and 3 tests rely on ISO/DCMI standards. The dependency on vocabularies to circumscribe permissible values for Darwin Core terms led to the establishment, by Paula Zermoglio, of DQ Task Group 4 (https://github.com/tdwg/bdq/tree/master/Vocabularies). A vocabulary of 154 terms associated with the tests and assertions has been developed.</p>
<p>At the time of writing this abstract, test data and demonstration code implementations of each test are yet to be completed. We hope these will be finalized by the time of this presentation.

Abstract: Initially populated with information gathered from The Plant List, the WFO’s taxonomic backbone will be augmented and modernized by newer taxonomic sources delivered by global plant Taxonomic Expert Networks (TENs). At the same time, descriptive data from floras, such as text descriptions, images, geographic distributions, keys, and trait data will be linked to this evolving taxonomic backbone. Semi-automated workflows have to be built to efficiently pull together the work of the TENs and digital floras and monographs into a single source: the WFO portal.</p>
<p>These semi-automated workflows will have to ensure multiple functionalities: format checks of the data submitted by the different backbone data providers (TENs or descriptive data providers), integrity and quality checks of the data provided, and comparison of these data with the current WFO taxonomic backbone (matching against WFO IDs and names, checks at the family level, taxonomic status attribution, etc.). Applications will also be developed as decision-making helpers for the TENs: thanks to these applications, TENs will be able to make decisions concerning the resolution of names not present in the current backbone, or on conflicts concerning the family assignment or taxonomic status of a proposed name.</p>
<p>At the end of these processes, new versions of the taxonomic backbone will regularly be delivered to be integrated into the WFO portal.

Abstract: Essential Biodiversity Variables (EBVs) are integrated information products typically derived from disparate sources of primary observations, combined by the use of biodiversity models and data integration algorithms. Furthermore, developing policy-relevant indicators from EBVs requires an additional level of integration between datasets that inform on different facets of biodiversity, e.g. at levels from species to ecosystems. The development and dissemination of EBVs require that the origin of the primary observations, the models and algorithms, the measurement uncertainties, and the scope of application are consistently reported and traceable. To support this process, the GEO BON Taskforce for Essential Biodiversity Variables - Data is developing the Minimum Information Standards (MIS) for EBVs. MIS are sets of specifications for describing datasets that aim to standardize data reporting and to maximize their discoverability and interoperability. Here we present a community effort to generate minimum standards that can be used across all the EBV classes, including genetic composition, species populations and traits, and the composition, structure and function of ecosystems. The MIS for Essential Biodiversity Variables are founded on the description of the EBV data cube as the unifying framework to deliver interoperable biodiversity observations. They summarize aspects of the spatial and temporal domains of the datasets, as well as uncertainty and bias reporting. Furthermore, they ensure traceability along the EBV production workflow, from the identification of primary observations to the derivation of a spatially and temporally consistent EBV product. The MIS also incorporate the GEOSS proposed principles for data management. Finally, a metadata publishing toolkit ensures that EBVs are discoverable and used under the auspices of GEO BON.
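As an illustration of the data-cube framing (the MIS do not prescribe a file format or library; xarray and all labels below are this example's assumptions):

```python
# Build a tiny EBV-style cube with taxonomic, temporal and spatial dimensions,
# an uncertainty layer, and minimum-information-style attributes.
import numpy as np
import xarray as xr

shape = (2, 3, 4, 5)  # species x year x lat x lon
cube = xr.Dataset(
    {
        "occupancy": (("species", "year", "lat", "lon"), np.random.rand(*shape)),
        "uncertainty": (("species", "year", "lat", "lon"),
                        np.random.rand(*shape) * 0.1),
    },
    coords={
        "species": ["species A", "species B"],
        "year": [2013, 2014, 2015],
        "lat": np.linspace(50.0, 53.0, 4),
        "lon": np.linspace(4.0, 8.0, 5),
    },
    attrs={  # illustrative stand-ins for MIS fields
        "ebv_class": "Species populations",
        "source": "primary occurrence records + occupancy model",
        "uncertainty_reporting": "posterior standard deviation",
    },
)
print(cube)
```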

Abstract: Based on our own work on species and trait recognition, and on complementary studies from other working groups, we present a workflow for data extraction from digitized herbarium specimens using convolutional neural networks. Digitized herbarium sheets contain:</p>
preserved plant material as well as additional objects:
the label containing information on the collection event,
annotations such as revision labels, or notes on material extraction,
identifiers such as barcodes or numbers,
envelopes for loose plant material and
often scale bars and color charts used in the digitization process.
<p>In order to treat these objects appropriately, segmentation techniques (Triki et al. 2018) will be applied to localize and identify the different kinds of objects for specific treatments. Detecting the presence of plant organs such as leaves, flowers or fruits is already a first step in data extraction, potentially useful for phenological studies. Plant organs will be subject to quantitative (Gaikwad et al. 2018) and qualitative (Younis et al. 2018) trait recognition routines. Text-based objects can be treated as described by Kirchhoff et al. 2018, using OCR techniques and considering the many collection-specific terms and abbreviations as described in Schröder 2019. Additionally, species recognition (Younis et al. 2018) will be applied in order to help further identification of incompletely identified collection items or to detect possible misidentifications. All steps described above need sufficient training data, including labelling that may be obtained from collection metadata and trait databases.</p>
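A minimal sketch of the segmentation and routing step, assuming a torchvision Faster R-CNN fine-tuned on annotated herbarium sheets; the class list and weight file are hypothetical, and the cited works use their own architectures:

```python
# Detect objects on a digitized herbarium sheet and route each to its
# specific treatment (OCR for labels, trait recognition for plant organs...).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

CLASSES = ["background", "plant_material", "label", "annotation",
           "barcode", "envelope", "scale_bar", "color_chart"]

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    num_classes=len(CLASSES))
# model.load_state_dict(torch.load("herbarium_detector.pt"))  # hypothetical weights
model.eval()

image = to_tensor(Image.open("sheet.jpg").convert("RGB"))
with torch.no_grad():
    detections = model([image])[0]

for box, label, score in zip(detections["boxes"],
                             detections["labels"],
                             detections["scores"]):
    if score < 0.5:
        continue
    kind = CLASSES[label]
    print(kind, [round(v) for v in box.tolist()])
    # if kind == "label": run OCR; if kind == "plant_material": crop organs...
```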
<p>In order to deal with new incoming digitized collections, unseen data or categories, we propose the implementation of a new deep learning approach, so-called Lifelong Learning: past knowledge of the network is dynamically saved in a latent space using an autoencoder and generatively replayed while the network is trained on new tasks. This enables the network to solve complex image processing tasks without forgetting former knowledge while incrementally learning new classes and knowledge.

Abstract: Wetland Ecological Integrity Assessment refers to "an assessment approach that measures overall wetland condition with an emphasis on the structure, composition, and function of an ecosystem in reference to a natural habitat of the region". It provides government agencies and key stakeholders with critical information on factors that may be degrading, maintaining or helping to restore an ecosystem, thereby supporting decision making. With the continued degradation of protected areas over the last four decades, wetland ecosystems have served as soft edges for biodiversity and genetic resources in Rwanda. However, there is no holistic framework to assess the ecological integrity of these important ecosystems, and the little and incomplete information collected is scattered and hence difficult to access. Consequently, wetlands are subject to continued overexploitation, reclamation, pollution, and loss of biodiversity and ecological functions (Singh et al. 2015). Over a six-month period, through a participatory planning process that involved individual expert consultations and workshops, supported by a robust literature review and an established multidisciplinary technical team, we came up with a holistic and integrated framework for Wetland Ecological Integrity Assessment. The framework is cost-effective and was designed to collect and analyse data on the structure (soil, hydrology), composition (biota), and functions (ecosystem services mapping and valuation) as well as the socio-economic status within and around wetland landscapes in the region. In addition, the ARCOS Biodiversity Information Management System (ARBIMS), an online platform for data sharing, was developed and upgraded. The developed assessment framework will help to mobilize and integrate biodiversity data into relevant documents to inform wetland management plans and support other investment decisions in Rwanda.

Abstract: Life history accounts and taxonomic monographs are series of publications covering a higher taxonomic group, where each account is a compilation of existing knowledge detailing many aspects of a species' life history. These life history accounts are extensively used by researchers, ornithologists and conservationists as a main source for the current state of knowledge of a species. Birds, being one of the more easily seen and studied taxa, have a number of specialized life history accounts where data from a wide variety of disciplines are combined into a single easily accessible resource.</p>
<p>The Cornell Lab of Ornithology (CLO) currently manages two of these series focused on different regions of the world, Birds of North America (BNA) and Neotropical Birds (NB). Lynx Edicions has published the Handbook of Birds of the World (HBW), an extensive set of avian monographs covering every species of bird in the world. A recently announced collaboration between CLO and Lynx Edicions provides us with the opportunity to bring together the extreme detail of the life history accounts from Birds of North America with the global coverage of HBW to produce a global, in-depth treatment of every species of bird in the world.</p>
<p>The integration of life history information from these existing projects with different underlying taxonomies presents a variety of real-world examples of the challenges to be overcome to bring these life history accounts into alignment and provide the scientific and lay communities with taxonomically accurate and up to date information.</p>
<p>The Handbook of Birds of the World currently follows the HBW and BirdLife Taxonomic Checklist v3 (with 11,126 species recognized), while Birds of North America and Neotropical Birds both follow the eBird/Clements checklist of birds of the world, v2018 (with 10,585 species recognized). Of the roughly 11,000 species of birds, nearly 9,500 are direct matches between HBW/BirdLife and Clements, at the species level or between species and subspecies. The remaining concept mismatches fall into several basic categories, including lump and split differences as well as differences in which subspecies are included or excluded.</p>
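The name-level part of such an alignment can be automated; the concept-level mismatches are what demand the labor-intensive review discussed below. A minimal sketch, with hypothetical file and column names (real concept mapping also draws on resources such as Avibase):

```python
# Join two checklists on scientific name and flag the records that need
# concept-level review (candidate lumps, splits, and rank changes).
import pandas as pd

hbw = pd.read_csv("hbw_birdlife_v3.csv")           # 11,126 species
clements = pd.read_csv("ebird_clements_2018.csv")  # 10,585 species

merged = hbw.merge(clements, on="scientific_name", how="outer",
                   suffixes=("_hbw", "_clements"), indicator=True)

direct = merged[merged["_merge"] == "both"]           # name-level matches
hbw_only = merged[merged["_merge"] == "left_only"]    # in HBW/BirdLife only
clements_only = merged[merged["_merge"] == "right_only"]  # in Clements only

print(f"direct matches: {len(direct)}")
print(f"needs concept-level review: {len(hbw_only) + len(clements_only)}")
```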
<p>In this talk we will discuss the challenges we have faced in managing and merging life history accounts where the underlying taxonomies are fundamentally different. With a requirement to ensure that life history accounts remain accurate when the underlying concepts of the original sources differ, we employ a variety of processes, some very labor intensive and some requiring in-depth taxonomic knowledge, to produce consolidated species accounts. Existing resources are integral to these types of integration and, in addition to the taxonomies themselves, cross-taxonomy mapping databases such as Avibase are key. Working through this process of consolidating life history accounts highlights the basic need for taxonomic management and publication toolsets built on underlying taxonomic and life history standards. Cross-institutional collaboration to produce these toolsets will be key to their development and successful adoption across the biodiversity and taxonomic communities. I will also discuss and propose a set of taxonomic management tools based on taxonomic concepts, some of which already exist and are used by bird taxonomists to annually update the Clements Checklist, and some of which need to be implemented before we can accurately manage and consolidate biodiversity information and the evolving taxonomies on which those data are based.

Abstract: A coherent framework for building Essential Biodiversity Variables (EBVs) is now emerging, but there are few examples of EBVs being produced at large extents. I describe the creation of a species distribution EBV for the United Kingdom, covering 5293 species from 1970-2015. The data product contains an annual occupancy estimate for every species in each year, each with a measure of uncertainty. I will describe the workflow to produce this data product. The data collation step brings together different sources of occurrence records; the data standardisation step harmonizes these records to a common spatio-temporal resolution. These data are then converted into a set of 'detection histories' for each species within each taxonomic group, before being passed to the occupancy-detection model. Outputs from this model are then summarised as 1000 samples from the posterior distribution of occupancy estimates for each species:year combination. I will also describe the infrastructure requirements to create the EBV and to update it annually. This endeavour has been made possible because the vast majority of the 34 million species records have been collated and curated by 31 taxon-oriented citizen science groups. I go on to describe the challenges of harmonizing and integrating these occurrence records with other data types, such as those from systematic surveys, including count data. Such "integrated models" are statistically challenging, but now within reach, thanks to the development of new tools that make it possible to conceive of modelling everything, everywhere. However, a substantial and concerted effort is required to curate biodiversity data in a way that maximises their potential for the next generation of models, and for truly global EBVs.
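A minimal sketch of the 'detection histories' step, assuming occurrence records with species, site, year and visit columns (an illustrative schema; the real workflow runs at far larger scale before feeding an occupancy-detection model):

```python
# Build detection histories: visits where other species were recorded but the
# focal species was not become inferred non-detections.
import pandas as pd

records = pd.DataFrame({
    "species": ["A", "A", "B", "A"],
    "site":    ["S1", "S1", "S1", "S2"],
    "year":    [2014, 2014, 2014, 2015],
    "visit":   [1, 2, 1, 1],
})

# Every surveyed site/year/visit combination defines the sampling effort.
effort = records[["site", "year", "visit"]].drop_duplicates()

def detection_history(species):
    """1 where the species was recorded on a visit; 0 where the visit took
    place but the species was not recorded (an inferred non-detection)."""
    detected = records.loc[records["species"] == species].drop_duplicates()
    history = effort.merge(detected, on=["site", "year", "visit"], how="left")
    history["detected"] = history["species"].notna().astype(int)
    return history[["site", "year", "visit", "detected"]]

print(detection_history("B"))  # S1 visit 2 and S2 become inferred non-detections
```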

Abstract: With the digitisation of natural history collections over the past decades, their traditional roles — for taxonomic studies and public education — have been greatly expanded into the fields of biodiversity assessments, climate change impact studies, trait analyses, sequencing, 3D object analyses etc. (Nelson and Ellis 2019; Watanabe 2019). Initial estimates of the global natural history collection range between 1.2 and 2.1 billion specimens (Ariño 2010), of which 169 million (8-14%, as of April 2019) are available at some level of digitisation through the <u>Global Biodiversity Information Facility</u> (GBIF). With iDigBio (<u>Integrated Digitized Biocollections</u>) established in the United States and with the European DiSSCo (<u>Distributed Systems of Scientific Collections</u>) accepted on the <u>ESFRI</u> roadmap, it has become a priority to digitise natural history collections at an industrialised scale. Both iDigBio and DiSSCo aim at mobilising, unifying and delivering bio- and geo-diversity information at the scale, form and precision required by scientific communities, and thereby transform a fragmented landscape into a coherent and responsive research infrastructure. In order to prioritise digitisation based on scientific demand, and to digitise efficiently using industrial digitisation pipelines, a uniform and unambiguously accepted collection description standard is required that allows natural history collections to be compared, grouped and analysed at diverse levels.</p>
<p>Several initiatives attempt to unambiguously describe natural history collections using taxonomic and storage classification schemes. These initiatives include One World Collection, the Global Registry of Scientific Collections (GRSciColl), the <u>TDWG</u> (Taxonomic Databases Working Group) Natural Collection Descriptions (NCD) and the <u>CETAF</u> (Consortium of European Taxonomy Facilities) passports, among others. In a collaborative effort of DiSSCo, ICEDIG (<u>Innovation and consolidation for large scale digitisation of natural heritage</u>), iDigBio, TDWG and the Task Group Collection Digitisation Dashboards, the various schemes were compared in a cross-walk analysis to propose a preliminary natural collection description standard that is supported by the wider community. In the process, two main user groups of collection description standards were identified: scientists and collection managers. The classification produced aims to meet the requirements of both, resulting in three classification schemes that exist in parallel to each other (van Egmond et al. 2019). For scientific purposes a ‘Taxonomic’ and a ‘Stratigraphic’ classification were defined, and for management purposes a ‘Storage’ classification. The latter is derived from specimen preservation types (e.g. dried, liquid preserved) defining storage requirements and the physical location of specimens in collection holding facilities. The three parallel collection classifications can be cross-sectioned with a ‘Geographic’ classification to assign sub-collections to major terrestrial and marine regions, which allows scientists to identify particular taxonomic or stratigraphic (sub-)collections from major geographical or marine regions of interest.</p>
<p>Finally, to measure the level of digitisation of institutional collections and the progress of digitisation through time, the number of digitised specimens for each geographically cross-sectioned (sub-)collection can be derived from institutional collection management systems (CMS). As digitisation has different levels of completeness, a ‘Digitisation’ scheme from Saarenmaa et al. 2019 has been adopted to quantify the level of digitisation of a collection, ranging from ‘not digitised’ to extensively digitised, recorded on the progressive MIDS scale (Minimal Information for Digital Specimen).</p>
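One way to picture these parallel schemes is as a single record per (sub-)collection carrying all classifications at once; the field names and example values below are illustrative, not the proposed standard itself:

```python
# A (sub-)collection description with taxonomic, stratigraphic and storage
# classifications, cross-sectioned by geographic region, plus a MIDS-style
# digitisation level.
from dataclasses import dataclass

@dataclass
class SubCollectionDescription:
    name: str
    taxonomic: str        # scientific scheme, e.g. "Insecta: Coleoptera"
    stratigraphic: str    # scientific scheme, e.g. "Ordovician" ("" if n/a)
    storage: str          # management scheme, e.g. "dried", "liquid preserved"
    geographic: str       # cross-section, e.g. "Western Palaearctic"
    specimen_count: int
    digitised_count: int
    mids_level: int       # 0 (not digitised) .. 3 (extensively digitised)

    def digitisation_rate(self) -> float:
        return self.digitised_count / self.specimen_count

beetles = SubCollectionDescription(
    name="Coleoptera dry collection", taxonomic="Insecta: Coleoptera",
    stratigraphic="", storage="dried", geographic="Western Palaearctic",
    specimen_count=250_000, digitised_count=40_000, mids_level=1)
print(f"{beetles.name}: {beetles.digitisation_rate():.0%} digitised")
```

A dashboard as described above could then aggregate such records across institutions to expose taxonomic, geographic and digitisation gaps.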
<p>The applicability of this preliminary classification will be discussed and visualized in a Collection Digitisation Dashboard (CDD) to demonstrate how the implementation of a collection description standard allows the identification of existing gaps in taxonomic and geographic coverage and levels of digitisation of natural history collections. This set of common classification schemes and dashboard design (van Egmond et al. 2019) will be contributed to the <u>TDWG Collection Description</u> interest group to ultimately arrive at the common goal of a 'World Collection Catalogue'.

Abstract: <u>DiSSCo</u> – the Distributed System of Scientific Collections – will mobilise, unify and deliver bio- and geo-diversity information at the scale, form and precision required by scientific communities, and thereby transform a fragmented landscape into a coherent and responsive research infrastructure. At present DiSSCo has 115 partners from 21 countries across Europe. The DiSSCo research infrastructure will enable critical new insights from integrated digital data to address some of the world's greatest challenges, such as biodiversity loss, food security and the impacts of climate change. A requirements analysis for DiSSCo was conducted through a large survey using epic user stories, to ensure that all of its envisioned future uses are accommodated. An epic user story has the following format:</p>
<p>As [e.g. scientist] I want to [e.g. map the distribution of a species through time] so that I [e.g. analyse the impact of climate change] for this I need [e.g. all georeferenced specimens records through time]</p>
<p>Several consultation rounds within the <u>ICEDIG</u> community resulted in 78 unique user stories that were assigned to one, or more, out of seven recognized stakeholder categories:</p>
Research,
Collection management,
Technical support,
Policy,
Education,
Industry, and
External.
<p>Each user story was assessed for the level of collection detail it required; four levels of detail were recognised: Collection, Taxonomic, Storage unit, and Specimen level. Furthermore, it was assessed whether the envisioned future uses of digitised natural history collections would be possible without the DiSSCo research infrastructure.</p>
<p>Subsequently 1243 identified stakeholders were invited to review the DiSSCo user stories through a Survey Monkey questionnaire. Additionally, an invitation for review was posted in several Facebook groups and announced on Twitter. A total of 379 stakeholders responded to the invitation, which led to 85 additional user stories for the envisioned use of the DiSSCo research infrastructure. In order to assess which component of the DiSSCo data flow diagram should facilitate the described user story, all user stories were mapped to the five phases of the DiSSCo Data Management Cycle (DMC), including data:</p>
acquisition,
curation,
publishing,
processing, and
use.
<p>At present, the user stories are being analysed and the results will be presented in this symposium.

Abstract: For computer vision based approaches such as image classification (Krizhevsky et al. 2012), object detection (Ren et al. 2015) or pixel-wise weed classification (Milioto et al. 2017), machine learning is used for both feature extraction and processing (e.g. classification or regression). Historically, feature extraction (e.g. PCA; Ch. 12.1 in Bishop 2006) and processing were sequential and independent tasks (Wöber et al. 2013). Since the breakthrough of convolutional neural networks (LeCun et al. 1989), a deep machine learning approach optimized for images, in 2012 (Krizhevsky et al. 2012), feature extraction for image analysis has become an automated procedure. A convolutional neural net uses a deep architecture of artificial neurons (Goodfellow 2016) for both feature extraction and processing. Based on prior information, such as image classes, and supervised learning procedures, the parameters of the neural nets are adjusted. This is known as the learning process.</p>
<p>Simultaneously, geometric morphometrics (Tibihika et al. 2018, Cadrin and Friedland 1999) is used in biodiversity research for association analysis. These approaches use deterministic two-dimensional locations on digital images (landmarks; Mitteroecker et al. 2013), where each position corresponds to a biologically relevant region of interest. Since this methodology is based on scientific results and compresses image content into deterministic landmarks, no uncertainty regarding those landmark positions is taken into account, which leads to information loss (Pearl 1988). Both the reduction of this loss and novel knowledge detection can be achieved using machine learning.</p>
<p>Supervised learning methods (e.g., neural nets or support vector machines; Ch. 5 and 6 in Bishop 2006) map data onto prior information (e.g. labels). This increases the performance of classification or regression but affects the latent representation of the data itself. Unsupervised learning (e.g. latent variable models) uses assumptions concerning data structures to extract latent representations without prior information. Those representations do not have to be useful for data processing tasks such as classification; therefore, the use of supervised and unsupervised machine learning, or combinations of both, must be chosen carefully according to the application and data.</p>
<p>In this work, we discuss unsupervised learning algorithms in terms of explainability, performance and theoretical restrictions in the context of known deep learning limitations (Marcus 2018, Szegedy et al. 2014, Su et al. 2017). We analyse extracted features based on multiple image datasets and discuss shortcomings and performance for processing (e.g. reconstruction error or complexity measurement; Pincus 1997) using principal component analysis (Wöber et al. 2013), independent component analysis (Stone 2004), deep neural nets (autoencoders; Ch. 14 in Goodfellow 2016) and Gaussian process latent variable models (Titsias and Lawrence 2010, Lawrence 2005).
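<p>To make the comparison concrete, the following minimal sketch (our illustration, not the study's code) fits two of the unsupervised feature extractors named above on a stand-in image dataset and scores them by mean squared reconstruction error, one of the metrics mentioned; scikit-learn and its bundled digits dataset are assumed.</p>
<pre><code>
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, FastICA

X = load_digits().data.astype(float)       # stand-in image dataset (n x 64 pixels)
X = X - X.mean(axis=0)                     # centre the pixel features

for name, model in [("PCA", PCA(n_components=16)),
                    ("ICA", FastICA(n_components=16, max_iter=1000))]:
    Z = model.fit_transform(X)             # latent representation
    X_hat = model.inverse_transform(Z)     # back-projection to pixel space
    err = np.mean((X - X_hat) ** 2)        # mean squared reconstruction error
    print(f"{name}: reconstruction MSE = {err:.4f}")
</code></pre>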

Abstract: Nowadays, more and more biodiversity and biogeography studies are conducted with the help of gene sequences. Fresh samples obtained using consistent collection methods can provide DNA for analysis and yield the current status of target species. Because historical samples providing a timescale are often lacking, it is difficult to draw conclusions and to provide an evolutionary explanation of the observed biogeographical patterns. The huge natural history specimen collections in museums could possibly provide this missing evidence.</p>
<p>The gypsy moth, <em>Lymantria dispar</em> (Linnaeus), is a worldwide forest pest species. Our analyses of mitochondrial COI gene sequencing data in specimens from disparate locations revealed previously unknown genetic relationships in gypsy moth populations across space (in and around China) and time (1955–2012). We recovered 103 full-length COI gene sequences from eight fresh samples and from 95 <em>Lymantria dispar</em> collection specimens that had been captured between 1955 and 1996. Combining these 103 full-length COI gene sequences with 146 COI gene sequences from GenBank (https://www.ncbi.nlm.nih.gov) or DNA barcode libraries, we analyzed the genetic differentiation, gene flow and haplotypes (distinct sequence variants carried by individuals) within the gypsy moth populations in order to characterize the genetic structure and population dynamics of this pest.</p>
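<p>The haplotype step can be illustrated with a minimal sketch (an assumed workflow, not the study's pipeline): aligned COI sequences that are identical are collapsed into one haplotype and their frequencies counted. Biopython and a hypothetical input file 'coi_aligned.fasta' are assumed.</p>
<pre><code>
from collections import Counter
from Bio import SeqIO

# Equal-length, aligned COI sequences; identical strings = one haplotype.
seqs = [str(rec.seq).upper() for rec in SeqIO.parse("coi_aligned.fasta", "fasta")]
haplotypes = Counter(seqs)

print(f"{len(haplotypes)} haplotypes among {len(seqs)} individuals")
for i, (hap, n) in enumerate(haplotypes.most_common(), start=1):
    print(f"H{i}: {n} individuals")
</code></pre>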
<p>Twenty-five previously unknown haplotypes were discovered. Regional populations from the same location, but collected at different times, showed high genetic diversity. In some geographical populations (Heilongjiang, Liaoning and Beijing populations), the genetic differentiation was greatest in 1979, but much lower in 1992 and 2012.</p>
<p>This study is an example of how specimen collections can help fill gaps in biodiversity studies carried out through genetic sequencing.

Abstract: The Royal Belgian Institute of Natural Sciences (RBINS), the Royal Museum for Central Africa (RMCA) and Meise Botanic Garden house more than 50 million specimens covering all fields of natural history.</p>
<p>While many different research topics have their own specificities, throughout the years it became apparent that, with regard to collection data management and to data publication and exchange via community standards, collection-holding institutions face similar challenges (James et al. 2018, Rocha et al. 2014). In the past, these have been tackled in different ways by Belgian natural history institutions. In addition to local and national collaborations, there is a great need for a joint structure to share data between scientific institutions in Europe and beyond. It is the aim of large networks and infrastructures such as the Global Biodiversity Information Facility (GBIF), Biodiversity Information Standards (TDWG), the Distributed System of Scientific Collections (DiSSCo) and the Consortium of European Taxonomic Facilities (CETAF) to further implement and improve these efforts, thereby gaining ever-increasing efficiency.</p>
<p>In this context, the three institutions mentioned above submitted the NaturalHeritage project (http://www.belspo.be/belspo/brain-be/themes_3_HebrHistoScien_en.stm), granted in 2017 by the Belgian Science Policy Office and running from 2017 to 2020.</p>
<p>The project provides links among databases and services. The unique qualities of each database are maintained, while the information can be concentrated and exposed in a structured way via one access point. This approach aims also to link data that are unconnected at present (e.g. relationship between soil/substrate, vegetation and associated fauna) and to improve the cross-validation of data.</p>
<p>(1) The NaturalHeritage prototype (http://www.naturalheritage.be) is a shared research portal with an open-access infrastructure, still in the development phase. Its backbone is an ElasticSearch catalogue, with Kibana, and a Python aggregator gathering several types of (re)sources: relational databases, REpresentational State Transfer (REST) services of object databases and bibliographical data, collections metadata, and the GBIF Integrated Publishing Toolkit (IPT) for observational and taxonomic data. Semi-structured data in English are semantically analysed and linked to a rich autocomplete mechanism. Keywords and identifiers are indexed and grouped in four categories (“what”, “who”, “where”, “when”). The portal can also act as an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) service and eases indexing of the original web pages on the internet with microdata enrichment.</p>
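<p>As an illustration of how such a catalogue might be queried across the four keyword categories, here is a minimal sketch (assumed index name and field layout, not the NaturalHeritage schema) using the official Elasticsearch Python client.</p>
<pre><code>
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")        # assumed local endpoint

query = {
    "bool": {
        "must": [
            {"match": {"what": "Lymantria dispar"}},   # taxon keyword
            {"match": {"where": "Belgium"}},           # locality keyword
        ]
    }
}
response = es.search(index="naturalheritage", query=query)  # assumed index name
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
</code></pre>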
<p>(2) The DaRWIN (Data Research Warehouse Information Network) collection data management system of RBINS and RMCA has been improved as well.</p>
<ul>
External (meta)data requirements – foremost, publication into or according to the practices and standards of GBIF and OBIS (Ocean Biogeographic Information System: https://obis.org) for biodiversity data, and INSPIRE (https://inspire.ec.europa.eu) for geological data – have been identified and evaluated. New and extended data structures have been created to be compliant with these standards, and the necessary procedures have been developed to expose the data.
Quality control tools for taxonomic and geographic names have been developed. Geographic names can be hard to confirm, as their lack of context often requires human validation; to address this, a similarity measure is used to help map the results (a minimal matching sketch follows this list). Species, locations, sampling devices and other properties have been mapped to the World Register of Marine Species (http://www.marinespecies.org) and Darwin Core, Marine Regions and GeoNames, the AGRO agronomy and vertebrate trait ontologies (http://www.obofoundry.org/ontology/agro.html), and the British Oceanographic Data Centre (BODC) vocabularies. Extensive mapping is necessary to make use of the ExtendedMeasurementOrFact extension of Darwin Core (https://tools.gbif.org/dwca-validator/extensions.do).
</ul>
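<p>The similarity-measure idea mentioned above can be sketched as follows (an assumed approach, not RBINS's actual tool): free-text geographic names are scored against a gazetteer with an edit-distance ratio, and low-scoring names are flagged for human validation.</p>
<pre><code>
from difflib import SequenceMatcher

GAZETTEER = ["North Sea", "Scheldt Estuary", "Congo Basin"]   # stand-in entries

def best_match(name: str, threshold: float = 0.8):
    scores = [(SequenceMatcher(None, name.lower(), g.lower()).ratio(), g)
              for g in GAZETTEER]
    score, match = max(scores)
    # Below the threshold, the mapping is only a suggestion for a curator.
    return match if score >= threshold else None

print(best_match("Schelde estuary"))    # -> 'Scheldt Estuary'
print(best_match("unknown locality"))   # -> None (needs human validation)
</code></pre>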

Abstract: Essential Biodiversity Variables (EBVs) are the latest push toward supporting state-of-the-environment indicators (Pereira et al. 2013). The European Union-funded Creative-B project (see https://cordis.europa.eu/project/rcn/100345/brief/en) outlined the status and strategy for interoperability between what it termed Biodiversity Research Infrastructures (BRIs), such as the Global Biodiversity Information Facility (GBIF), the Atlas of Living Australia (ALA) and Integrated Digitized Biocollections (iDigBio). Toward the end of that project, the group decided that a logical follow-on project should position BRIs to support the production of EBVs. This idea became the GLOBal Infrastructures for Supporting Biodiversity research (GLOBIS-B) project (http://www.globis-b.eu), and this presentation provides a summary of a case study on generating EBVs (Hardisty et al. 2019).</p>
<p>As a part of GLOBIS-B, I suggested that a small team of GLOBIS members should document, in detail, each step in the production of an EBV from GBIF and ALA data for a few invasive species. We wanted to address the rarity of detailed recording of, and justification for, each step in the production of a dataset for environmental evaluation. I anticipated that the team would encounter many practical issues, but this case study raised far more significant issues than any of us had anticipated.</p>
<p>The EBV chosen for this study was Area of Occupancy (IUCN Standards and Petitions Subcommittee 2017), and the species selected represented various invasion scenarios: <em>Acacia longifolia</em>, <em>Vespula germanica</em> and <em>Bubulcus ibis</em>. The workflow included 20 steps between locating data and publishing an EBV, and these steps were radically different between GBIF and the ALA. The workflow required manual steps such as resolving the invasive status of <em>Acacia longifolia</em> subspecies, only one of which was ‘invasive’. Datasets of occurrence records had to be exported from the ALA and GBIF to enable filtering for purpose; for example, not all Darwin Core terms are exposed in the current public interface of the ALA. After the record filtering, the ALA and GBIF datasets then required merging and deduplication, for which one-off code had to be written.</p>
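<p>A minimal sketch of that merging and deduplication step might look as follows (assumed file names and columns, not the project's one-off code), using pandas and a few Darwin Core terms shared by both exports.</p>
<pre><code>
import pandas as pd

gbif = pd.read_csv("gbif_occurrences.csv", sep="\t")   # GBIF exports are tab-separated
ala = pd.read_csv("ala_occurrences.csv")

cols = ["species", "decimalLatitude", "decimalLongitude", "eventDate"]
merged = pd.concat([gbif[cols], ala[cols]], ignore_index=True)

# Records identical in taxon, coordinates and date are treated as duplicates.
deduplicated = merged.drop_duplicates(subset=cols)
print(f"{len(merged)} records merged, {len(deduplicated)} after deduplication")
</code></pre>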
<p>A few of the 15 significant messages from this study included: a lack of consistency of data between BRIs (e.g., GBIF records should be a superset of ALA records); the consistency and adequacy of filtering tools between BRIs; exported data structures that differed massively between BRIs; and that automation of the workflows may be possible, but many manual intervention steps were required. By my figuring, the case study took approximately 10 times longer than anticipated, but the message to BRIs was clear – consistency and adequacy of data and tools require urgent work.

Abstract: Data contained in the Biodiversity Heritage Library (BHL) describes collections held in the world's major museums. Finding those collections data, however, remains a challenge – a literal needle in a <em>Festuca</em> stack, as some have noted. BHL is actively incorporating tools (including Digital Object Identifiers (DOIs) and the recently launched full-text search) to make finding and linking to collection specimen information better. Still, it is not easy to find specific collections information in the non-semantically tagged BHL content. This session will call for ideas on how to locate this content. BHL is an international consortium, making research literature openly available to the world as part of a global biodiversity community. The BHL was created in 2006 as a direct response to the needs of the taxonomic community for access to early literature. The original BHL organizational model, based on United States and United Kingdom partners, provided a template for what is now over 80 global partners. Through this extensive network of Members, Affiliates, and partners, over 56 million pages of biodiversity literature are available through the BHL portal. BHL changes the lives of researchers and assists the work of collections managers. By enhancing daily research at the Smithsonian and Harvard, BHL provides a global network of researchers with an easy-to-use digital library of content and services.

Abstract: Citizen science biodiversity monitoring projects are becoming very common. It is generally accepted that these joint projects of scientists and the public have a positive effect on biodiversity and conservation education programs, as well as on policy-makers' opinions (Ganzevoort et al. 2017). Yet there is still a debate on the quality of the data collected in citizen science monitoring schemes, and especially on the benefits to high-quality research. Here, I present an example of how collection-based research and involvement of the public (non-taxonomists) in taxonomic education, i.e., advanced citizen science, can enhance research on scorpion diversity in Israel. Furthermore, the process of public involvement in monitoring, and especially the prerequisites needed for this process, contributed to high-quality research, which in turn is enhancing biodiversity science. Considering this, I will discuss the basic stages required for successful public engagement in high-quality biodiversity research and monitoring schemes.

Abstract: The increased availability of digital floras and the application of optical character recognition (OCR) to digitized texts have resulted in exciting opportunities for flora data mining. For example, the software package CharaParser has been developed for the semantic annotation of morphological descriptions from taxonomic treatments (Cui 2012). However, after digitization and OCR processing, and before parsing of morphological treatments can begin, content types must be annotated (i.e., whether a block of text represents names, morphology, discussion or distribution). In addition to enabling morphological parsing, content type annotation also facilitates content search and data linkage. For example, by annotating pieces of a floral treatment, assertions of the same type from various floras can be combined into a single document (i.e., a "mash-up" floral treatment). Several products and pipelines have been developed for the semantic annotation, or mark-up, of taxonomic documents (e.g., GoldenGATE, FlorML; Sautter et al. 2012, Hamann et al. 2014). However, these products lack a combination of both ease of implementation (e.g., the ability to run as a script in a programmatic workflow) and the use of modern parsing methods, such as text mining and Natural Language Processing (NLP) approaches.</p>
<p>Here I present a pilot project, implemented in Python, that applies text mining and NLP approaches to marking up floras. I will describe the success of the project and summarize lessons learned, especially in relation to previous flora markup projects. Annotation of existing flora documents is an essential step towards building next-generation floras (i.e., mash-ups and enhanced floras as platforms) and enables automated trait extraction. Building an easy-to-use access point to modern text mining and NLP techniques for botanical literature will allow for more flexible and responsive flora annotation, and is an important step towards realizing botanical data integration goals.
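<p>As a flavour of what content-type annotation can look like, here is a minimal rule-based sketch (illustrative patterns only, not the pilot's code) that tags lines of OCR'd flora text as name, morphology or distribution before any downstream parsing.</p>
<pre><code>
import re

RULES = [
    ("name", re.compile(r"^[A-Z][a-z]+ [a-z]+ (L\.|[A-Z][a-z]+\.?)")),  # binomial + author
    ("distribution", re.compile(r"\b(endemic|distributed|native to)\b", re.I)),
    ("morphology", re.compile(r"\b(leaves|petals|stems?|flowers?)\b", re.I)),
]

def annotate(line: str) -> str:
    for label, pattern in RULES:
        if pattern.search(line):
            return label
    return "discussion"            # fallback content type

for line in ["Festuca rubra L.",
             "Leaves flat, 2-4 mm wide.",
             "Native to temperate Europe."]:
    print(annotate(line), "->", line)
</code></pre>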

Abstract: The need for scientists to exchange, share, and organise data has resulted in a proliferation of research data portals in the past decades. These cyberinfrastructures have had a major impact on taxonomy and helped to revitalise the discipline by allowing quick access to bibliographic information, biological and nomenclatural data, and specimen information. In addition, several specialised portals aggregate particular data types for a large number of species and can be queried to extract information for a particular taxonomic group. Because of their ecological and economic importance, several early initiatives to develop and deploy information technologies for capturing, sharing, and disseminating information focused specifically on the plant family Leguminosae (Fabaceae). Initiatives such as ILDIS (International Legume Database and Information Service), which was created in 1985, led the way in developing methods and thinking with regard to taxonomic data management more generally. More recently, the Legume Phylogeny Working Group (LPWG) was founded in 2010 with the objective of facilitating collaboration amongst systematists working on the family. As part of this endeavour, the LPWG has explored whether it would be desirable and pertinent to develop a new portal focused on the legume family. We argue that, despite access to numerous data-aggregation portals, a taxon-focused portal curated by a community of researchers specialised in a particular taxonomic group remains valuable: a community such as the LPWG has the interest, commitment, existing collaborative links, and knowledge necessary to verify data quality, thereby providing a valuable resource and actively contributing to other, more general data providers. We consider that a new portal focused on Leguminosae would thus serve a useful function in parallel to, and different from, large international data-aggregation portals. We explored best practices for developing a legume-focused portal that will enable long-term sustainability, data sharing, a better understanding of what data are available, missing, or erroneous, and ultimately facilitate cross analyses and the development of novel research. We surveyed existing data portals to see which features are of interest for our goal, and we present a general way forward for developing a legume-focused portal that would respond to the needs of the legume systematics research community as well as the broader user community. We propose to take full advantage of existing data sources, informatics tools, and protocols to develop an easily manageable, scalable, and interactive portal that will be used, contributed to, and fully endorsed and supported by the legume systematics community.

Abstract: The Nagoya Protocol on access to genetic resources and the fair and equitable sharing of benefits arising from their utilization, under the Convention on Biological Diversity, entered into force on October 12, 2014. Accordingly, attention toward securing sovereignty over, and discovering the utilization value of, biological resources has been increasing in order to secure national competitiveness. We are developing a freshwater biodiversity information platform for the systematic conservation and industrialization of freshwater biodiversity in South Korea. The platform comprises an integrated management system of freshwater bioresources for the systematic registration and management of freshwater biodiversity information based on databases; a storage management system for freshwater biological specimens; a utilization information system that manages efficacy, experimental method, and activity data produced by the Nakdonggang National Institute of Biological Resources, together with external big data such as literature and patents; and a freshwater bioresources culture collection for the preservation, ordering and deposition of biological resources. These systems are connected organically. Text mining, one of the big data technologies, helps to determine the utility of biological resources through comprehensive analysis. We tried to establish a foundation for utilization by predicting the usability of biological resources through the systematic collection, processing, and analysis of external data, such as abstracts, in order to support the industrialization of national freshwater bioresources. Through text mining, we constructed a literature-based corpus and preprocessed it with lowercase conversion and removal of stop words. Then a word cloud was created and statistical analysis was performed. As a result, genes and diseases associated with specific biological resources have been identified. In this study, through a comprehensive analysis of species, gene, and disease information using text mining, we were able to determine the utilization value of biological resources. This study will help freshwater biodiversity researchers through the future addition of a utilization-analysis function to the platform's utilization information system.
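<p>The preprocessing described above can be sketched in a few lines (a stand-in corpus, not the institute's pipeline): lowercasing, tokenizing, stop-word removal and the word frequencies that would feed a word cloud.</p>
<pre><code>
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "in", "a", "is", "to", "for"}   # abbreviated list

corpus = [
    "Antioxidant activity of freshwater algae extracts",
    "The algae extract inhibits inflammation in mice",
]

tokens = []
for doc in corpus:
    for word in re.findall(r"[a-z]+", doc.lower()):   # lowercase + tokenize
        if word not in STOP_WORDS:
            tokens.append(word)

print(Counter(tokens).most_common(5))   # top terms for the word cloud
</code></pre>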

Abstract: Herbaria are biological collections of preserved plants, algae, fungi and lichens used for scientific purposes. Fast communication and information exchange are fundamental to accelerating the investigation of biodiversity. The major world herbaria are concentrating efforts on digitising their collections and making the information available online.</p>
<p>Over the last decade, the Herbarium of the University of Coimbra (COI – acronym in <em>Index Herbariorum</em>) has made efforts to make the information of its plant collection of c. 800,000 specimens available online (http://coicatalogue.uc.pt). However, only c. 10% has been processed to date, in part due to the slowness of the methods generally used in herbaria. This work is a contribution to accelerating the digitising process, both by improving digitising procedures and by involving citizens in populating the COI database.</p>
<p>To accomplish this, a new workflow was developed to automatically create records in the database from batches of digital images with minimum information, and a collaborative platform was developed to allow the transcription of specimen labels from digital images in a web environment.</p>
<p>Creating records from the images benefits from the physical organisation of the herbarium, with specimens grouped in taxon folders. This way, when taking pictures of a set of specimens, it is possible to store them in folders with the name of the taxon. A script then reads the name of each folder and checks in the database whether each rank of the taxon exists in the taxon tree (genus, species, infraspecific ranks), creating it if it does not; it then creates a record for each of the specimens inside that folder and assigns a determination to it.</p>
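<p>The folder-driven record creation could be sketched as follows (assumed directory layout and in-memory stand-ins for the database, not COI's actual script).</p>
<pre><code>
from pathlib import Path

taxon_tree: dict[str, dict] = {}    # stand-in for the database taxon table
records: list[dict] = []

for folder in Path("images").iterdir():               # e.g. images/Festuca rubra/
    if not folder.is_dir():
        continue
    genus, *rest = folder.name.split()
    taxon_tree.setdefault(genus, {})                   # create genus if missing
    if rest:
        taxon_tree[genus].setdefault(" ".join(rest), {})   # species / infraspecific rank
    for image in folder.glob("*.jpg"):
        records.append({"image": image.name, "determination": folder.name})

print(f"{len(records)} records created under {len(taxon_tree)} genera")
</code></pre>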
<p>The collaborative application (http://coicatalogue.uc.pt/collaborative) has innovative features, such as displaying forms sequentially, revealing only one field at a time (Fig. 1). But the most differentiating feature is probably the process of validation for submitted values. Registered users are included under a category according to their contribution history. Contributors can be upgraded to the next level when they submit a certain number of validated fields. There is therefore a progression based on proficiency, allowing users to become familiar with the specimen information system as they use the platform while, simultaneously, attributing a confidence level to users. This can be used to validate data, assigning a confidence value to a submission based on user status (points system). Validation of a value submitted by users is obtained when the sum of points for a concurrent value meets a threshold, so a single answer from an expert user can be enough to reach validation, whereas five basic users would have to submit the same value for it to be accepted (Table 1).</p>
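<p>A minimal sketch of such a points system is given below; the point values and threshold are hypothetical, chosen so that one expert (5 points) or five basic users (1 point each) validate a field, in the spirit of Table 1.</p>
<pre><code>
USER_POINTS = {"basic": 1, "intermediate": 2, "expert": 5}   # assumed categories
THRESHOLD = 5

def is_validated(submissions: list[tuple[str, str]], value: str) -> bool:
    """submissions: (user_category, submitted_value) pairs for one field."""
    points = sum(USER_POINTS[cat] for cat, v in submissions if v == value)
    return points >= THRESHOLD

print(is_validated([("expert", "1931-05-02")], "1931-05-02"))      # True
print(is_validated([("basic", "1931-05-02")] * 3, "1931-05-02"))   # False: only 3 points
</code></pre>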
<p>Although collateral, there is a major and unique advantage to this project: the collaborative application can be used as a tool to make corrections to the herbarium database, easily and directly online. This quickly improves the database, as such an effortless procedure encourages this kind of contribution.

Abstract: We provide an overview and update on initiatives and approaches to add taxonomic data intelligence to distributed biodiversity knowledge networks. "Taxonomic intelligence" for biodiversity data is defined here as the ability to identify and reconcile source-contextualized taxonomic name-to-meaning relationships (Remsen 2016). We review the scientific opportunities, as well as the information-technological and socio-economic pathways, both existing and envisioned, to embed de-centralized taxonomic data intelligence into the biodiversity data publication and knowledge integration processes.</p>
<p>We predict that the success of this project will ultimately rest on our ability to up-value the roles and recognition of systematic expertise and experts in large, aggregated data environments. We will argue that these environments will need to adhere to criteria for responsible data science and interests of coherent communities of practice (Wenger 2000, Stoyanovich et al. 2017). This means allowing for fair, accountable, and transparent representation and propagation of evolving systematic knowledge and enduring or newly apparent <em>conflict </em>in systematic perspective (Sterner and Franz 2017, Franz and Sterner 2018, Sterner et al. 2019).</p>
<p>We will demonstrate in principle and through concrete use cases how to de-centralize systematic knowledge while maintaining <em>alignments</em> between congruent or conflicting taxonomic concept labels (Franz et al. 2016a, Franz et al. 2016b, Franz et al. 2019). The suggested approach uses custom-configured logic representation and reasoning methods based on the Region Connection Calculus (RCC-5) alignment language. The approach offers syntactic consistency and semantic applicability or scalability across a wide range of biodiversity data products, ranging from occurrence records to phylogenomic trees. We will also illustrate how this kind of taxonomic data intelligence can be captured and propagated through existing or envisioned metadata conventions and standards (e.g., Senderov et al. 2018).</p>
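<p>For readers unfamiliar with RCC-5, the following minimal sketch (our illustration, not the authors' reasoning toolkit) shows how articulations between taxonomic concept labels can be represented; RCC-5 distinguishes five relations: congruence, proper inclusion in either direction, partial overlap, and disjointness. The concept labels below are hypothetical.</p>
<pre><code>
from dataclasses import dataclass

@dataclass
class Articulation:
    concept_a: str      # taxonomic concept label, e.g. "Aus bus sec. Smith 1990"
    relation: str       # one of the five RCC-5 relations: ==, <, >, ><, !
    concept_b: str

alignment = [
    Articulation("Aus bus sec. Smith 1990", "<", "Aus bus sec. Jones 2005"),
    Articulation("Aus cus sec. Smith 1990", "!", "Aus bus sec. Jones 2005"),
]
for art in alignment:
    print(f"{art.concept_a} {art.relation} {art.concept_b}")
</code></pre>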
<p>Having established an intellectual opportunity, as well as a technical solution pathway, we turn to the issue of developing an implementation and adoption strategy. Which biodiversity data environments are currently the most taxonomically intelligent, and why? How is this level of taxonomic data intelligence created, maintained, and propagated outward? How are taxonomic data intelligence services motivated or incentivized, both at the level of individuals and organizations? Which "concerned entities" within the greater biodiversity data publication enterprise are best positioned to promote such services? Are the most valuable lessons for biodiversity data science "hidden" in successful social media applications? What are good, feasible, incremental steps towards improving taxonomic data intelligence for a diversity of data publishers?

Abstract: Taxon concepts are complex, dynamic representations of the real world that are labelled with the scientific names designating them. While names, taxa and classifications should be managed separately in databases (Bourgoin et al. 2019, Gallut et al. 2005), students may have difficulty comprehending the dynamic nature of the link between the three entities, because taxon circumscriptions are complex to apprehend through textual representation and because names are independently ruled by nomenclatural codes. Exploring, reporting and training users about taxonomic knowledge are complex challenges that could be alleviated through the development of efficient visualization tools.</p>
<p>We propose here a tool that generates a graphical representation visualizing the successive concepts of a taxon accepted as valid, with its different names and positions in classifications, including its synonyms, homonyms, chresonyms, and other related taxonomic and nomenclatural issues during its lifetime. This tool has been successfully implemented both for database visualisation and for training students in taxonomy.</p>
<p>In the database FLOW (Fulgoromorpha Lists On the Web; Bourgoin 2019), the tool creates a graphical translation of the referenced nomenclatural and classificatory history of a taxon, such as the one presented in Table 1.</p>
<p>Thanks to a dedicated editor, the textual/HTML chronological account of the nomenclatural history of a taxon is displayed on its taxon page, as for the taxon Elicini. JavaScript code reinterprets the chronological account, displaying the corresponding graphical view as shown in Fig. 1.</p>
<p>The graphic provides a global view of the classification and nomenclatural history of the taxon. Different shapes and colours are associated with the different types of nomenclatural acts or information of both nomenclatural and taxonomic value. The tool is also easily adaptable to any domain dealing with changing knowledge with traceable chronology. It has been used successfully over the last several years to better visualize the concepts of synonymy, homonymy and chresonymy; and, because it makes the differences between names and taxa, and the importance of contextualizing taxa in classifications, clearly understandable, it has proven particularly useful in training students in taxonomy.

Abstract: Data-sharing has become a key component in the modern scientific era of large-scale research, with numerous advantages for both data collectors and users. However, data-sharing in Uruguay remains neglected, given that the major public sources of biodiversity information (government and academia) are not open-access. As a consequence, the patterns and drivers of biodiversity in this country remain poorly understood, and so does our ability to manage and conserve its biodiversity. To overcome this critical gap, collaborative strategies are needed to communicate the importance and benefits of data openness and exchange, to provide technical tools and training on all aspects of data management and sharing practices, and to focus on incentives and motivation structures for data-holders. Here, we introduce the Biodiversidata initiative (www.biodiversidata.org) – a novel Uruguayan Consortium of Biodiversity Data. Biodiversidata is a collaboration among experts with the aim of improving the country's biodiversity knowledge and the open access of the vast resources they generate. Biodiversidata aims to collate the first comprehensive open-access database on Uruguay's whole biodiversity, to support advancements in scientific research and conservation actions. Currently, Biodiversidata consists of over 30 experts from across national and international institutions, studying diverse biodiversity groups. After less than two years, we have collected, curated and standardised a dataset of ~70,000 records of primary biodiversity data of tetrapod species – the most comprehensive open biodiversity database gathered for Uruguay to date. However, the process is hampered by multiple challenges:</p>
<ul>
the lack of support for the sampling of specimens and the maintenance of collections has contributed to a situation where data are often perceived as personal property rather than collective resources;
institutions have no plans or strategies directed at the digitisation of their collections, which actually places biodiversity data in Uruguay ‘at risk’ of being lost;
the scarce governmental and academic incentive structures for open scientific research relegate data-sharing to a personal decision;
although scientists individually are willing to share their research data, the lack of data management plans within their research groups hampers the capacity to digitise the data and thus to make them available;
former initiatives aimed at creating comprehensive biodiversity databases did not consider the balance between openness and gain for researchers, making data-sharing more of an obligation than a path to promotion, which negatively impacted scientists' willingness to open their data.
</ul>
<p>To overcome some of these challenges, we decided to direct Biodiversidata to individual researchers/experts and not institutions. We approached them with the plan of collecting the maximum possible amount of data on vertebrate, invertebrate and plant species, and using it to collaboratively generate impactful scientific research. An important aspect was that we requested only data fitting the premise of primary biodiversity data (i.e., data records that document the occurrence of a species in space and time). This meant cleaning and standardising very heterogeneous information from a variety of source types and formats, including updating scientific names and georeferencing sampling locations. However, centralising the cleaning process allowed researchers to send their raw records without spending time cleaning them themselves and, as a consequence, enlarged the amount of data being collated. Collectively, Biodiversidata's approach towards changing the culture of data-sharing practices has relied on the reinforcement of a scientific collaboration culture that benefits not only researchers at the individual level, but also the progress of larger-scale issues as a whole. There is a long way to go on the subject of open research data in Uruguay; however, aiming strategies at people, capitalising on data management, and progressing with step-by-step rewards is already showing some encouraging preliminary results.

Abstract: ELIXIR (ELIXIR Europe 2019a) is an intergovernmental organization that brings together life science resources across Europe. These resources include databases, software tools, training materials, cloud storage, and supercomputers. One of the goals of ELIXIR is to coordinate these resources so that they form a single infrastructure. This infrastructure makes it easier for scientists to find and share data, exchange expertise, and agree on best practices. ELIXIR's activities are divided into five areas – Data, Tools, Interoperability, Compute and Training – each known as a “Platform”. The ELIXIR Tools Platform works to improve the discovery, quality and sustainability of software resources. The Software Development Best Practices task of the Tools Platform aims to raise the quality and sustainability of research software by producing, adopting, and promoting information standards and best practices relevant to the software development life cycle. We have published four simple recommendations for open source software (4OSS) to encourage best practices in research software (Jiménez et al. 2017) and the Top 10 metrics for recommended life science software practices (Artaza et al. 2016).</p>
<p>The 4OSS simple recommendations are as follows:</p>
<p>Develop publicly accessible open source code from day one.</p>
<p>Make software easy to discover by providing software metadata via a popular community registry.</p>
<p>Adopt a license and comply with the licenses of third-party dependencies.</p>
<p>Have clear and transparent contribution, governance and communication processes.</p>
<p>In order to encourage researchers and developers to adopt the 4OSS recommendations and build FAIR (Findable, Accessible, Interoperable and Reusable) software, the best practices group, in partnership with the ELIXIR Training Platform, The Carpentries (Carpentries 2019, ELIXIR Europe 2019b), and other communities, is creating a collection of training materials (Kuzak et al. 2019). The next step is to adopt, promote, and recognise these information standards and best practices. The group will address this by (i) developing comprehensive guidelines for software curation, (ii) training researchers and developers in the adoption of software best practices, and (iii) improving the usability of Tools Platform products. Additionally, a direct outcome of this task will be a software management plan template, connected to a concise description of the guidelines for open research software, and the production of a white paper on the software development management plan for ELIXIR, which can subsequently be used to produce training materials. We will work with the newly formed Research Software Alliance (ReSA) to facilitate the adoption of this plan by the broader community.

Abstract: “Data is the lifeblood of decision-making and the raw material for accountability.” (Food and Agricultural Organisation of the United Nations)</p>
<p>Agriculture depends on living organisms: the resources that we grow and harvest (crops, livestock), the many naturally-occurring organisms that support agro-ecosystem services, and those that are considered pests. Biodiversity represents the pool of resources upon which production systems are based. This ecosystem-based understanding of agriculture is fast replacing the simplistic input-output paradigm that was once prevalent. Reflecting this trend, the Food and Agricultural Organisation of the United Nations created a Department of Climate, Biodiversity, Land and Water in 2016. This shift is also perceived in trends in research supporting agriculture, where research towards fast fixes (e.g. genetic engineering) for food security and production sustainability is steadily losing ground to more systemic approaches and “Slow Science”. This includes information derived from observation-based, long-term, multi-location data gathering from, for example, newly-termed Living Laboratories or Genomic Observatories. These activities generate a diverse set of data types, complemented with extensive physical-chemical and climate data, to enable more comprehensive and scientifically defensible assessments of biodiversity for policy and regulatory decision-making. However, the data associated with these entities can be very large, raising several challenges. First, a robust IT infrastructure must exist and persist, both to store the data and to facilitate analysis, often associated with high-performance computing. Next, this infrastructure must be supported by a data management policy that responsibly dictates what data are preserved and for what time period, as most institutions must be cognisant of their limited IT resources. The data, where not sensitive, must typically be openly shared to allow for use in other projects from local to global scale. This sharing relies on global data standards managed by a multitude of standards bodies and spread across a diverse set of data types. However, this new paradigm, especially with the constant advent of new technologies such as those implemented in genomic observatories and precision agriculture, has challenged standards bodies to keep up. Another challenge lies in the analysis and management of this big data, which relies on a new set of skills not traditionally found in agricultural science, including data scientists and data curators. Finally, a major challenge to this “Slow Science” approach is the research funding models themselves, which are not highly supportive of long-term, and often non-hypothesis-driven, research. We will discuss these challenges and provide some examples of progress being made to embrace this new “Slow Science” agricultural paradigm.

Abstract: Primary biodiversity data, data documenting presences of particular species at particular sites at a point in time, available in standard digital formats, provide the basis for many quantitative studies that can inform effective and reliable national, regional, and global biodiversity conservation decisions. However, these datasets are often unavailable, incomplete, or unevenly distributed across regions and landscapes. We assessed the survey completeness and gaps in current knowledge of birds of West Africa, using digital, accessible primary biodiversity data obtained from the Global Biodiversity Information Facility and eBird. Additionally, using ecological niche modeling approaches, we modeled the current and potential future geographic distributions of a diverse suite of range-restricted and ecologically important bird species, and used the resulting models to identify priority areas for conservation and future surveys (Fig. 1). The survey completeness and gap analyses revealed marked spatial, seasonal, and temporal (historical) gaps and biases in the coverage of bird records across the region (Fig. 1). Well-surveyed sites were clustered around points of access such as major cities, roads, and national reserves or parks, mainly in Ghana, The Gambia, Senegal, Côte d’Ivoire, and Cameroon (Fig. 1). For our distributional analysis, we found broad present-day potential distributions with respect to climate. Future potential distributions, taking into account climate change processes, tended to be still broader and more inclusive than present-day distributions, so climate-change-driven range losses and gains were minimal. Our models identified Liberia, southeastern Sierra Leone, southwestern Côte d’Ivoire, and southwestern Ghana as having high climate suitability in the present and in the future for most species. These results illustrate the spatial and temporal biases and gaps in West African bird data, and emphasize the need to promote high-quality biodiversity data mobilization and publication in West Africa and, by extension, the developing world. To address these biases at the regional level, research institutions and individuals need to engage in more systematic planning and biodiversity research, taking into account the potential for spatial, temporal, and seasonal biases.

Abstract: The Beaty Biodiversity Museum (BBM), at the University of British Columbia, houses over two million biological research specimens, with nearly 40% of the specimen records digitized into databases, unlocking a wealth of information for research and teaching (Table 1). However, these collection databases were neither available nor unified for users. Even museum and collections staff could not digitally access each other’s collections. With a total of 6 collections (in different colors in Fig. 1) in 13 separate databases in differing stages of development, across several varying interfaces and systems, our goal was to unify the collection databases through the development of a single search interface (Fig. 1).</p>
<p>This was a large collections project with multiple stages of development. Integration of the data was made possible through the efforts of multiple groups to standardize the fields of each database so they conformed to the Darwin Core standard (Darwin Core Task Group 2009). This mapping of fields allows each of the databases to be displayed and shared in a consistent format. It also simplified the integration of data for popular data aggregators (Canadensys, VertNet, FishNet2, Consortium of Pacific Northwest Herbaria, Electronic Atlas of the Plants of British Columbia, and the Global Biodiversity Information Facility). Once this first step was achieved, many features such as standardized georeferencing, simplified reporting and unified search interfaces could be implemented to aid all users, e.g., curators, museum staff, researchers, and the public. Through this new interface, it is possible to browse nearly every digitized record within the museum with an in-house solution provided by the Beaty Biodiversity Museum.
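<p>The field-standardization step can be pictured with a minimal sketch (assumed field names, not the BBM schema): each source database gets a mapping of its local fields onto Darwin Core terms, after which records from all collections share one format.</p>
<pre><code>
DWC_MAPPINGS = {
    "herbarium": {"sci_name": "scientificName", "coll_date": "eventDate",
                  "lat": "decimalLatitude", "lon": "decimalLongitude"},
    "fish":      {"taxon": "scientificName", "date_collected": "eventDate",
                  "latitude": "decimalLatitude", "longitude": "decimalLongitude"},
}

def to_darwin_core(record: dict, collection: str) -> dict:
    """Rename one record's local fields to their Darwin Core equivalents."""
    mapping = DWC_MAPPINGS[collection]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

print(to_darwin_core({"taxon": "Oncorhynchus nerka", "latitude": 49.26}, "fish"))
</code></pre>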

Abstract: Although still strongly intertwined, taxonomy and systematics are diverging more and more in their paradigms, methods, and agendas: it is no longer possible to consider them synonyms. While taxonomy remains an analytical science based on abductive reasoning (trying to find the historical pathways leading to present species delineations), systematics has diverged into an information science since the rise of computers: summarizing, organising, and exposing taxonomic, biological, and ecological data, information, and knowledge in the most efficient ways, with respect to various targeted audiences. One could even consider synonymising biodiversity informatics with systematics instead!</p>
<p>Schematically, this led to two different types of information systems: one dedicated to pure taxonomic and nomenclatural data, and one oriented to recording life traits. Obviously, the latter must be built on reliable taxonomic backbones; therefore the former should have been built before the latter. This did not happen, as exemplified in fishes by the Food and Agriculture Organisation (FAO) Fisheries Global Information System (FIGIS) and related systems, the Catalog of Fishes, and FishBase, later by the databases of the International Union for Conservation of Nature (IUCN) and the World Register of Marine Species (WoRMS), or, for aggregators, by the Catalogue of Life and, e.g., GBIF and the Encyclopedia of Life. This has created some confusion for end users, “which system should I use” being their regular question, with the invariable answer, “it depends”.</p>
<p>In the absence of a formal preexisting taxonomic information system, each biodiversity information system has developed its own way to manage its taxonomic backbone with more or less impact of taxonomists. For fishes, we are not far from reconciling the various systems, and data to knowledge flows are becoming clearer, but it is not without unnecessary extra work. A real breakthrough is necessary to move from a collaboration stage to a cooperative stage, where systems are interconnected (not necessarily integrated) in such a way that the same taxonomic work is not repeated over and over to synchronise the systems. The difficulty is that taxonomic information systems must be designed for the needs of taxonomists, while their resulting classifications and the way they are exposed must fit the needs of systematics/biodiversity systems purposes, and by extension of the rest of scientific domains and the society in general. Conditions for this breakthrough to happen are discussed.</p>
<p>The breakthrough does not reside in only one action but rather is the result of multiple simultaneous advances in the theory of taxonomy, its (mathematical?) formalization and informatics implementation, technology (although progress in that domain may be well in advance of others), data entry, networking, and the sociology of science. The "potential taxon" concept (Berendsohn 1995) led to important theoretical progress, but its actual implementation lags behind in many systems, probably due to the huge effort of data entry it requires. Data entry is certainly a part that was neglected at the beginning of biodiversity informatics, because it has to be sustained endlessly, while the development of new systems was seen as more rewarding in time-limited frameworks. This has been corrected, at least for occurrences and specimens, with the development of national and international digitization programs. Besides, the development of the Biodiversity Heritage Library and text extraction technologies is quite promising for taxon information.</p>
<p>As in all complex situations with many interacting dimensions (e.g., fisheries management in ichthyology), progress must be balanced among all dimensions to have effective results for the overall domain. Among others, issues in the sociology of science, for instance, must be addressed to make significant progress. In particular, the way data, information and knowledge are published, and jobs delineated and careers evaluated, must still be seriously reviewed in the light of information system development.

Abstract: Climate change is affecting ecosystems and the services they provide for human well-being. For a better understanding of the causes and effects of this change in the functioning of ecosystems, detailed information on environmental parameters needs to be considered. Researchers generate multiple databases in their academic disciplines, but often these data collections are not openly available, nor do they fulfill the FAIR principles of Findability, Accessibility, Interoperability, and Reusability (Wilkinson et al. 2016). In this regard, Long-Term Ecosystem Research (LTER) plays an important role as an open research infrastructure that helps to integrate in-situ data, remote sensing products and modelling efforts related to biodiversity and geodiversity.</p>
<p>Within LTER networks, the Dynamic Ecological Information Management System – Site and Dataset Registry (DEIMS-SDR) is used as the central site catalogue to provide information about facilities, ecosystems and environmental parameters in an openly available and standardized way (Wohner et al. 2019), organized by site. These LTER sites, located all around the globe, receive a special protection status due to their ecological value; the research and observation conducted there enhance the protection and conservation of these areas.</p>
<p>The LTER Ria de Aveiro site in Portugal (DEIMS.ID; LTER website), classified under the Natura 2000 network, is of paramount importance for the regional and national economy, agriculture, commercial fisheries, aquaculture, manufacturing, tourism, sports and recreational activities (e.g. Lillebø et al. 2019).</p>
<p>Since the establishment of the LTER Ria de Aveiro site in 2011, its research has focused on contributing to the effective implementation of the Water Framework Directive and the European Union Biodiversity Strategy targets. Studies have been developed to target key policies within Natura 2000 areas, including Action 5 habitat mapping and assessment of ecosystems and their services, data collection on important fauna groups, and the engagement of stakeholders in common frameworks for the conservation of biodiversity. Currently, the site is part of the Portuguese e-Infrastructure for Information and Research on Biodiversity (PORBIOTA) and is aligned with the European Research Infrastructure Consortium LifeWatch-ERIC.</p>
<p>The LTER site team infrastructure includes laboratory facilities for field observation and for environmental monitoring of water quality and environmental parameters that are used to feed models. A recently obtained Unmanned Aerial Vehicle (UAV), commonly known as a drone, will contribute to ecological observations, generating data for biodiversity monitoring in space and time.</p>
<p>As LTER site managers and data providers, we have to deal with how to make the transition from our metadata to FAIR ecological data. Our aim is to implement standardized data profiles in our own data: to upload data files to the central repositories (e.g. DEIMS-SDR), to store and publish our raw data (e.g. B2Share), to create online distribution links and digital object identifiers (DOIs), and to use a convenient vocabulary (e.g. the Environmental Thesaurus) that is understandable by everyone (Pérez-Luque et al. 2019). This will substantially increase the potential of databases in the scientific community and will contribute to the successful building of LTER.

Catalogue of Life Plus: A collaborative project to complete the checklist of the world's species (https://biss.pensoft.net/article/37652/)
Biodiversity Information Science and Standards 3: e37652

DOI: 10.3897/biss.3.37652

Authors: Olaf Banki, Donald Hobern, Markus Döring, David Remsen

Abstract: Although the Catalogue of Life (CoL) continues to expand, its coverage is still far from complete, with several important megadiverse groups mostly lacking. Additionally, some segments of the Catalogue require major work to resolve synonymy or to incorporate recent names. As a result, key infrastructures that seek to organise data on the species of the world or particular regions (including GBIF, EOL, BHL and BOLD, among others) are unable to use CoL as a complete checklist of the world's species. It has become common for such platforms to augment CoL with automatically constructed arrangements of additional names and taxa to ensure that all species can be placed in a common framework. Since these efforts usually depend on algorithmic interpretation of text strings, the results are variable and lead in some cases to undesirable results. There is no easy way for all of these infrastructures to fix these issues consistently and interoperably.</p>
<p>Accordingly, Catalogue of Life Plus (CoL+) was started in 2017 as a collaborative project between the Catalogue of Life, the Global Biodiversity Information Facility, Naturalis Biodiversity Center and other partners, with financial support from the Netherlands Biodiversity Information Facility and the Netherlands Ministry of Education, Science, and Culture. CoL+ seeks to replace existing disconnected efforts with a shared, extended catalogue and to complete coverage of expert-reviewed names without sacrificing quality. The project aims to create an open, shared, and sustainable consensus taxonomy to serve the proper linking of data within and between biodiversity information initiatives.</p>
<p>The goals for the CoL+ project are to build a new infrastructure for the Catalogue of Life that:</p>
<ul>
Extends and enriches the catalogue with more taxonomic sources to create a larger pool of scientific names, fills taxonomic gaps where feasible, and provides other enrichments such as linkages to literature and references;
Replaces the current GBIF Backbone Taxonomy;
Separates names and taxonomy, with different identifiers and authorities for names and taxa, for better reuse (a minimal data-model sketch follows below); and
Provides avenues for (infrastructural) support for the completion and strengthening of taxonomic and nomenclatural content authorities.
</ul>
<p>Specifically, the project will establish a clearinghouse for taxonomy and nomenclature and rebuild all existing infrastructure of the Catalogue of Life, including the webservices, the portal, and the software for the assembly of the catalogue and its editorial work.
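<p>The separation of names and taxa can be pictured with a minimal sketch (our illustration, not the CoL+ data model): names and taxa carry distinct identifiers, so a name can be referenced by any taxonomic treatment, for example as a synonym.</p>
<pre><code>
from dataclasses import dataclass, field

@dataclass
class Name:
    name_id: str                    # nomenclatural identifier
    scientific_name: str
    authorship: str

@dataclass
class Taxon:
    taxon_id: str                   # taxonomic identifier, distinct from name IDs
    accepted_name: Name
    synonyms: list[Name] = field(default_factory=list)

n1 = Name("n:1", "Lymantria dispar", "(Linnaeus, 1758)")
n2 = Name("n:2", "Phalaena dispar", "Linnaeus, 1758")   # hypothetical synonym entry
taxon = Taxon("t:1", accepted_name=n1, synonyms=[n2])
print(taxon.taxon_id, "->", taxon.accepted_name.scientific_name)
</code></pre>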

Abstract: Within the Netherlands, large-scale digitization efforts of natural science collections have taken place in recent years. This has led to a wealth of digital information on natural science collections. Still, large quantities of collection data remain untapped and undigitized. The usage of all these digital collection data as a driver for science and society remains underexplored. Especially important is the opportunity for such data to be combined and/or enriched with other data types, with the aim of empowering different user groups.</p>
<p>A consortium of Dutch partners has committed itself to working together to make biological and geological collections into a joint research infrastructure, underpinning other research infrastructures and scientific uses also beyond the biodiversity research domain. This consortium combines the Dutch contributions to the Distributed System of Scientific Collections (DiSSCo), LifeWatch, the Catalogue of Life and the Global Biodiversity Information Facility, under the coordination of the Netherlands Biodiversity Information Facility.</p>
<p>As part of a preparatory project for DiSSCo, funded by the Dutch science council, we connected the different user groups of collection managers (data providers), scientists (end-users), IT specialists and policymakers. With collection managers, we explored how to move towards an overview of all natural science collections in the Netherlands. In addition, we studied to what extent the collection holdings of different museums could be combined, managed, and shared in one research infrastructure. Using a research data management cycle perspective, we surveyed and interviewed the Dutch research community about the barriers and opportunities in using natural science collections and related data.</p>
<p>The outcomes of the project should lead to the next steps in creating a more comprehensive and inclusive biodiversity research data infrastructure in the Netherlands that interacts seamlessly with existing international research infrastructures, including DiSSCo.

Abstract: Capturing data from specimen images is the most viable way of enriching specimen metadata cheaply and quickly compared to traditional digitisation. Advances in machine learning and computer vision-based tools, and their increasing accessibility and affordability, are greatly increasing the potential to take automated measurements and capture other data from specimens themselves, as well as to transcribe label data.</p>
<p>More sophisticated segmentation of images allows us to find parts of interest: particular labels, individual specimens on a slide, or barcodes. Following segmentation, there is the potential to use colour analysis of specimens to perform condition checking, such as looking for bad cases of verdigris in pinned insects or discoloration of gum-chloral mountant. Automating measurements and landmark analysis of specimens can be used to create trait datasets, all of which will enrich our knowledge of specimens. Segmentation of labels can allow us to cluster similar labels based on their visual properties, including colour, shape and patterns; this in turn can be used to make optical character recognition, handwriting recognition and manual transcription much more efficient. Atomising, validating and resolving label data will create structured label data that can be more easily stored, searched and linked to other datasets.</p>
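<p>As a taste of such colour-based condition checking, here is a minimal sketch (an assumed heuristic, not the SYNTHESYS+ tooling) that estimates the fraction of green-dominant pixels in a specimen image as a crude verdigris flag; Pillow and NumPy, plus a hypothetical image file, are assumed.</p>
<pre><code>
import numpy as np
from PIL import Image

img = np.asarray(Image.open("pinned_insect.jpg").convert("RGB"), dtype=float)
r, g, b = img[..., 0], img[..., 1], img[..., 2]

# Pixels where green clearly dominates both other channels.
green_mask = (g > r * 1.2) & (g > b * 1.2)
fraction = green_mask.mean()
print(f"green-dominant pixels: {fraction:.1%}")
if fraction > 0.05:                 # assumed threshold for curatorial review
    print("flag specimen for inspection")
</code></pre>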
<p>We present a landscape analysis of the approaches, summarising previous work, and outline our plan to build future tools and systems in the SYNTHESYS+ project as part of the Specimen Data Refinery. This will cover sharing tools, reducing barriers to access, and integrating workflow engines into a software architecture that allows the components to be re-used and re-purposed with provenance data for repeatability, and that conforms with the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles (Wilkinson et al. 2016).

Abstract: The <em>bdverse</em> is a collection of packages that form a general framework for facilitating biodiversity science in R. We built it to serve as a sustainable and agile infrastructure that enhances the value of biodiversity data by allowing users to conveniently employ R for data exploration, quality assessment, data cleaning, and standardization. The <em>bdverse</em> supports users with and without programming capabilities. It includes six unique packages in a hierarchical structure, representing different functionality levels (Fig. 1). Major features of three core packages will be highlighted and demonstrated: (i) <em>bdDwC</em> provides an interactive Shiny app and a set of functions for standardizing field names in compliance with the Darwin Core (DwC) format; (ii) <em>bdchecks</em> is an infrastructure for performing, filtering and managing various biodiversity data checks; (iii) <em>bdclean</em> is a user-friendly data-cleaning Shiny app for the inexperienced R user. It provides features to manage a complete workflow for biodiversity data cleaning, including data upload; user input, in order to adjust cleaning procedures; data cleaning; and, finally, generation of various reports and versions of the data.</p>
<p>We are now working on submitting the <em>bdverse</em> packages to rOpenSci software review, and as soon as the packages meet core requirements, we will officially release the <em>bdverse</em>. The <em>bdverse</em> project won the 2nd prize in the 2018 Ebbe Nielsen Challenge.

Abstract: Specimens held in private natural history collections form an essential, but often neglected, part of the specimens held worldwide in natural history collections. When engaging in regional, national, or international initiatives aimed at increasing the accessibility of biodiversity data, it is paramount to include private collections as much and as often as possible. Compared to larger collections in natural history institutions, private collections present a unique set of challenges: they are numerous, anonymous, small, and diverse in all aspects of collection management. In ICEDIG, a design study for DiSSCo, these challenges were tackled in Task 2, "Inventory of content and incentives for digitisation of small and private collections", under Work Package 2, "Inventory of current criteria for prioritization of digitization".</p>
<p>First, we need to understand the current state and content of private collections within Europe, in order to identify and tackle challenges more effectively. While some private collections will duplicate material already held in public collections, many are likely to fill more specialised or unusual niches, relevant to the particular collector(s). At present, there is little evidence about the content of private collections, and this needs to be explored. In 2018, a European survey was carried out amongst private collection owners to gain more insight into the volume, scope, and degree of digitisation of these collections.</p>
<p>Based on this survey, all of the respondents' collections combined are estimated to contain between 9 and 33 million specimens. This is only the tip of the iceberg for private collections in Europe and underlines their importance. Digitising collections and sharing collection data are generally considered important activities among private collection owners. The survey also showed that, for those who have not yet started digitising their collection, the provision of tools and information would be most valuable. These and other highlights of the survey will be presented. In addition, protocols for inventories of private collections will be discussed, as well as ways to keep these inventories up to date.</p>
<p>To enhance the inclusion of private collections in Europe's digitisation efforts, we recognise that we mainly have to focus on the challenges regarding the 'how' (work process) and the sharing of information residing in private collections (including ownership, legal issues, and sensitive data). Where necessary, we will also draw attention to the 'why' (motivation) of digitisation. A communication strategy aimed at raising awareness about digitisation, offering insight into the practicalities of implementing digitisation, and providing answers to issues related to sharing information is an essential tool. Elements of a communication strategy to further engage private collection owners will be presented, as will conclusions and recommendations.</p>
<p>Finally, digitisation and communication aspects related to private collection owners will need to be tested within the community. Therefore, a pilot project is currently (2018-2019) being carried out in Estonia, Finland, and the Netherlands to digitise private collections in a variety of settings. Preliminary results will be presented, zooming in on different approaches to include data from private collections in the overall (research) infrastructures.

Abstract: Numerous studies over the past decades have shown that species phenologies are shifting. Behind the large-scale patterns of shifting phenologies lies, however, large variability across species and space, in terms of both the sign of the shifts (advance or delay) and their magnitude (rate of change). Shifts in the timing of seasonal events are usually studied by measuring change in one part of the phenological distribution over the season, such as the mean or first appearance of the event. This, however, gives us a mere glimpse of how (part of) a population is changing, thus limiting our ability to understand the underlying mechanisms. We demonstrate the benefits of taking a holistic approach to describing phenological change by considering shifts in the complete phenological distribution and the interrelationships within it. As a case study, we use a database of bird chick banding events to understand shifts in breeding phenology for 74 bird species across 43 years and 4 bioclimatic zones in Finland. We find that bird breeding is stacking up towards earlier and more compressed peak periods. Most of the change can be attributed to a faster advance of the tail of the distribution, alongside a minor advancement of the beginning of the season. We conclude that the observed shifts potentially intensify intraspecific competition by increasing the temporal co-occurrence of broods across large geographical areas. Nevertheless, these patterns would likely not have been recognized by summarizing the data by species mean dates alone. We therefore urge scientists and data managers to compile and utilize phenological data at higher resolution, retaining the original detail. This will allow us to capture multiple modes of population-level change, thereby providing deeper insights into how species are responding to ongoing climate warming.
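<p>The core idea of this holistic approach can be sketched in a few lines of Python: instead of tracking only a mean date, track several quantiles of the seasonal distribution of events per year and fit a trend to each. The data below are synthetic and the slopes purely illustrative; they mimic a season whose late tail advances faster than its start, as described above.</p>
<pre><code>
# Sketch: per-quantile trends in a phenological distribution (synthetic data).
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(1975, 2018)

# Synthetic banding dates (day of year): the season advances and compresses.
dates_by_year = {year: rng.normal(160 - 0.10 * i, 12 - 0.08 * i, size=500)
                 for i, year in enumerate(years)}

for q in (0.1, 0.5, 0.9):
    quantiles = np.array([np.quantile(dates_by_year[y], q) for y in years])
    slope = np.polyfit(years, quantiles, 1)[0]   # days per year
    print(f"{int(q * 100)}th percentile trend: {slope:+.2f} days/year")
</code></pre>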

Abstract: Recent progress in using deep learning techniques to automate the analysis of complex image data is opening up exciting new avenues for research in biodiversity science. However, potential applications of machine learning methods in biodiversity research are often limited by the relative scarcity of data suitable for training machine learning models. Developing high-quality training data sets can be a surprisingly challenging task that can easily consume hundreds of person-hours. In this talk, we present the results of our recent work implementing and comparing several different methods for generating annotated, biodiversity-oriented image data for training machine learning models, including collaborative expert scoring, local volunteer image annotators with on-site training, and distributed, remote image annotation via citizen science platforms. We discuss error rates, among-annotator variance, and the depth of coverage required to ensure highly reliable image annotations. We also discuss time considerations and the efficiency of the various methods. Finally, we present new software, called ImageAnt (currently under development), that supports efficient, highly flexible image annotation workflows. ImageAnt was created primarily in response to the challenges we encountered in our own efforts to generate image-based training data for machine learning models. ImageAnt features a simple user interface and can be used to implement sophisticated, adaptive scripting of image annotation tasks.
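<p>As a minimal illustration of two of the quantities discussed (consensus labels and among-annotator agreement), the following Python sketch fuses labels from several annotators per image by majority vote; the image identifiers and labels are made up.</p>
<pre><code>
# Sketch: majority-vote label fusion and per-image percent agreement.
from collections import Counter

annotations = {                 # image id -> labels from three annotators
    "img_001": ["bee", "bee", "hoverfly"],
    "img_002": ["moth", "moth", "moth"],
}

for image, labels in annotations.items():
    label, votes = Counter(labels).most_common(1)[0]
    agreement = votes / len(labels)
    print(f"{image}: consensus={label!r}, agreement={agreement:.0%}")
</code></pre>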

Abstract: Ideally, an information system that automates the integration of disparate datasets should be able to minimize the loss of information from any one dataset, achieve computational complexity suitable for working with large datasets, be flexible enough to easily incorporate new data sources, and produce output that is easily analyzed and understood by data users. Achieving all of these goals within highly heterogeneous and highly complex data domains is a major challenge. In this talk, we present the results of our recent efforts to develop such a system for data about plant phenology. Our data integration system, which is built around the Plant Phenology Ontology, currently supports semantically fine-grained integration of phenological data from both field observations and herbarium specimens. We show that even with a heavily axiomatized ontology and sophisticated, machine-reasoning-based data analysis, it is possible to implement a high-throughput data integration pipeline capable of processing millions of individual records in a matter of minutes while running on modest, server-class hardware. Success requires careful ontology design and judicious application of machine reasoning techniques. We also discuss some of the many challenges that remain for designing efficient, general-purpose data integration systems.
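<p>The gist of such semantically mediated integration can be conveyed with a deliberately simplified Python sketch: records from each source are mapped onto a shared trait identifier before being merged. The term tables and the identifier "ppo:flowering" are illustrative assumptions; the real pipeline relies on the Plant Phenology Ontology and machine reasoning rather than hand-written lookup tables.</p>
<pre><code>
# Hand-written term tables standing in for ontology mappings; the shared
# trait identifier "ppo:flowering" is hypothetical.
FIELD_TERMS = {"open flowers": "ppo:flowering"}
HERBARIUM_TERMS = {"in flower": "ppo:flowering"}

def integrate(field_obs, herbarium_recs):
    """Merge records from two sources onto shared trait identifiers."""
    merged = []
    for rec in field_obs:
        term = FIELD_TERMS[rec.pop("phenophase")]
        merged.append({**rec, "source": "field", "trait": term})
    for rec in herbarium_recs:
        term = HERBARIUM_TERMS[rec.pop("annotation")]
        merged.append({**rec, "source": "herbarium", "trait": term})
    return merged

print(integrate([{"taxon": "Acer rubrum", "phenophase": "open flowers"}],
                [{"taxon": "Acer rubrum", "annotation": "in flower"}]))
</code></pre>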

Abstract: Trait-based research spans from evolutionary studies of individual-level properties to global patterns of biodiversity and ecosystem functioning. An increasing amount of trait data is available for many different organism groups, published as open-access data on a variety of file-hosting services. As a result, standardization between datasets is generally lacking, owing to heterogeneous data formats and types. The compilation of these published data into centralised databases thus remains a difficult and time-consuming task.</p>
<p>We reviewed existing trait databases and online services, as well as initiatives for trait data standardization. Together with data providers and users participating in a large long-term observation project on multiple taxa and research questions (the Biodiversity Exploratories, www.biodiversity-exploratories.de), we identified a need for a minimal trait-data terminology that is flexible enough to include traits from all types of organisms but simple enough to be adopted by different research communities.</p>
<p>In order to facilitate reproducibility of analyses, the reuse of data, and the combination of datasets from multiple sources, we propose a standardized vocabulary for trait data, the Ecological Trait-data Standard Vocabulary (ETS; hosted on the GFBio Terminology Service, https://terminologies.gfbio.org/terms/ets/pages), which builds upon and is compatible with existing ontologies. By relying on unambiguous identifiers, the proposed minimal vocabulary for trait data captures the different degrees of resolution and measurement detail for multiple use cases of trait-based research. It further encourages the use of global Uniform Resource Identifiers (URIs) for taxa, trait definitions, methods, and units, thereby readying published data for the Semantic Web. An accompanying R package (traitdataform) facilitates the upload of data to hosting services and also simplifies access to published trait data.</p>
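<p>As a sketch of what an ETS-style record might look like in practice, the following Python dictionary expresses a single trait measurement in a long, one-measurement-per-row layout. The key names follow the spirit of the ETS but are abbreviated here, and the identifiers and values are placeholders rather than resolvable URIs or real measurements.</p>
<pre><code>
# Illustrative single measurement; keys follow the ETS in spirit, but this
# is not a normative example, and the URIs are placeholders.
trait_record = {
    "scientificName": "Carabus auronitens",
    "taxonID": "https://example.org/taxon/12345",        # placeholder URI
    "traitName": "body_length",
    "traitID": "https://example.org/trait/body_length",  # placeholder URI
    "traitValue": 24.3,
    "traitUnit": "mm",
    "measurementDeterminedBy": "A. Measurer",            # hypothetical
}
print(trait_record)
</code></pre>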
<p>While originating from a current need in ecological research, the described products are, as a next step, being developed for a seamless fit with broader initiatives on biodiversity data standardisation, to foster better linkage between ecological trait data and global e-infrastructures for biological data. The ETS is maintained, and discussions on terms are managed, via GitHub (https://github.com/EcologicalTraitData/ETS).

Abstract: <p>Nucleic acid and protein sequencing-based analyses are increasingly applied to determine the origin, identity, and traits of environmental (biological) objects and organisms. In this context, the need for corresponding data structures has become evident. Existing schemas and community standards in the domains of biodiversity and molecular biological research offer comparatively few generic and specific elements, so previous schemas for describing physical and digital objects need to be replaced or expanded with new elements that cover the requirements of meta-omics techniques and operational details. Hitherto, schemas and standards have mostly focussed on elements, descriptors, or concepts relevant for data exchange and publication, while detailed operational aspects regarding origin context and laboratory processing, as well as data management details such as the documentation of physical and digital object identifiers, have been rather neglected.</p>
<p>The conceptual schema for Meta-omics Data and Collection Objects (MOD-CO; https://www.mod-co.net/) has recently been set up (Rambold et al. 2019). It includes design elements (descriptors or concepts) describing structural and operational details along the work- and dataflow, from gathering environmental samples, through the various transformation, transaction, and measurement steps in the laboratory, up to sample and data publication and archiving. The concepts are named according to a multipartite naming structure that describes internal hierarchies, and they are arranged in concept (sub-)collections. By supporting various kinds of data record relationships, the schema allows for the concatenation of individual records of the operational segments along a workflow (Fig. 1). Thus, it may serve as a logical and structural backbone for laboratory information management systems. The concept structure in version 1.0 comprises 653 descriptors (concepts) and 1,810 predefined descriptor states, organised in 37 concept (sub-)collections. The published version 1.0 is available as various schema representations of identical content (https://www.mod-co.net/wiki/Schema_Representations). A normative XSD (XML Schema Definition) for schema version 1.0 is available at http://schema.mod-o.net/MOD-CO_1.0.xsd.</p>
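<p>The concatenation of operational records along a workflow can be illustrated with a small Python sketch in which each segment record points to its predecessor, so that a complete chain from sampling to publication can be reconstructed. The field names are illustrative, not normative MOD-CO concept names.</p>
<pre><code>
# Each operational segment is a record pointing to its predecessor.
workflow = [
    {"id": "rec-1", "segment": "environmental sampling", "derived_from": None},
    {"id": "rec-2", "segment": "DNA extraction", "derived_from": "rec-1"},
    {"id": "rec-3", "segment": "amplicon sequencing", "derived_from": "rec-2"},
    {"id": "rec-4", "segment": "data publication", "derived_from": "rec-3"},
]

def chain(records, leaf_id):
    """Walk derived_from links from a leaf record back to the origin."""
    by_id = {r["id"]: r for r in records}
    rec = by_id[leaf_id]
    while rec is not None:
        yield rec["segment"]
        rec = by_id.get(rec["derived_from"])

steps = list(chain(workflow, "rec-4"))[::-1]   # origin first
print(" -> ".join(steps))
# environmental sampling -> DNA extraction -> amplicon sequencing -> data publication
</code></pre>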
<p>The MOD-CO concepts can be integrated as descriptor/element structures in the relational database DiversityDescriptions (DWB-DD), an open-source and freely available software component of the Diversity Workbench (DWB; https://diversityworkbench.net/Portal/DiversityDescriptions; https://diversityworkbench.net). Currently, DWB-DD is installed at the Data Center of the Bavarian Natural History Collections (SNSB) to build a dedicated instance for storing and maintaining MOD-CO-structured meta-omics research data packages and enriching them with 'metadata' elements from the community standards Ecological Metadata Language (EML), Minimum Information about any (x) Sequence (MIxS), Darwin Core (DwC), and Access to Biological Collection Data (ABCD). These activities take place in the context of ongoing FAIR ('Findable, Accessible, Interoperable and Reusable') biodiversity research data publishing via the German Federation for Biological Data (GFBio) network (https://www.gfbio.org/). Version 1.1 of the schema, with extended collections of structural and operational design concepts, is scheduled for 2020.

Abstract: Herbaria hold large numbers of specimens: approximately 22 million herbarium specimens exist as botanical reference objects in Germany, 20 million in France, and about 500 million worldwide. High-resolution digital images of these specimens take up substantial bandwidth and disk space. New methods of extracting information from specimen labels have been developed using OCR (optical character recognition) techniques, but applying this technology to biological specimens is particularly complex, owing to the presence of biological material in the image alongside the text, the non-standard vocabularies, and the variation and age of the fonts. Much of the information is handwritten, and natural handwriting recognition is a less mature technology than OCR. Today, our system, the eTDR (European Trusted Digital Repository), provides OCR (using the Tesseract software) adapted to the requirements of herbarium specimen images and requires minimal installation in each institution. This is what we propose to make available to botanists through our portal.</p>
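<p>At its core, the OCR step can be reproduced locally with the pytesseract wrapper around Tesseract, as in the sketch below; the file name, the grayscale preprocessing, and the language selection are illustrative choices rather than the eTDR configuration.</p>
<pre><code>
# Minimal local OCR sketch using pytesseract (assumes Tesseract is installed).
from PIL import Image, ImageOps
import pytesseract

label = Image.open("herbarium_label.png")   # hypothetical label crop
label = ImageOps.grayscale(label)           # simple contrast preprocessing

# Herbarium labels mix languages; Tesseract can combine several models.
text = pytesseract.image_to_string(label, lang="eng+fra+lat")
print(text)
</code></pre>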
<p>The goal for a museum is to be able to submit a large number of scanned images easily to a long-term archiving system in order to automatically obtain OCR texts and retrieve them by a full text search on an open data portal.</p>
<p>Most of the images are provided for reuse under CC-BY licenses. In each case, the rights of reuse associated with the data are specified in the accompanying metadata.</p>
<p>This pilot was an opportunity to test the long-term storage service eTDR provided by CINES. The services (B2SAFE, B2Handle) developed by EUDAT were used to facilitate the transfer of data to the storage repository and to provide indexing services for access to that repository.</p>
<p>This workflow, which has been tested in the European project ICEDIG, is presented as a poster: see the document (Suppl. material 1).

Abstract: Phenoscape has developed methods to render phylogenetic characters from the systematic literature machine-computable and interoperable with genetic data from model organisms, by annotating them with taxon, anatomy, quality, and other ontologies. Moving these trait data into a semantic framework enables their integration with other data types and provides the potential for powerful new computational tools to aid discovery. For example, trait similarities can be quantified, assessed against phylogenetic trees to determine whether they are based in homology or homoplasy, and linked back to candidate genes. Another example is the ability to automatically construct a character matrix on the fly for a user-selected set of traits and taxa. Guidelines for consistent representation of characters have been developed through the manual annotation of over 20,000 systematic characters. These annotations follow a limited number of design patterns that are applicable to traits from any source. The level of detail to which characters are annotated will influence how they may be used in research; a minimal approach will still enable basic trait aggregation. Manual curation effort is substantial and does not scale well to biodiversity traits. Natural Language Processing (NLP) methods, using our newly developed gold standard for semantic traits, can accelerate annotation from published text, particularly with well-provisioned ontologies. Efforts to establish new trait databases might profitably explore machine learning for morphological discrimination and semantic annotation from digitized images as a high-throughput approach.
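<p>The entity-quality style of annotation that underlies this work can be sketched in Python: each character state becomes a (taxon, anatomical entity, quality) triple, from which a taxon-by-character matrix can be assembled on the fly. The identifiers below only gesture at UBERON- and PATO-style terms and are chosen for illustration, as is the matrix-building step.</p>
<pre><code>
# Each annotation is a (taxon, anatomical entity, quality) triple using
# schematic UBERON/PATO-style labels; real annotations use ontology IRIs.
annotations = [
    ("Danio rerio", "uberon:dorsal_fin", "pato:present"),
    ("Danio rerio", "uberon:barbel", "pato:elongated"),
    ("Astyanax mexicanus", "uberon:eye", "pato:absent"),
]

# Assemble a taxon-by-character matrix on the fly; unscored cells get "?".
taxa = sorted({taxon for taxon, _, _ in annotations})
entities = sorted({entity for _, entity, _ in annotations})
matrix = {t: {e: "?" for e in entities} for t in taxa}
for taxon, entity, quality in annotations:
    matrix[taxon][entity] = quality

for taxon, row in matrix.items():
    print(taxon, row)
</code></pre>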