Developing a digital U.S. biological collections national resource: First steps towards a strategic plan

Summary

A strategic plan for a 10-year national effort to digitize and mobilize images and data associated with biological research collections is being developed. The key objective of the plan is to create a publicly available, comprehensive national collections resource that can be used to address a wide range of research questions and serve stakeholders in government agencies, academic institutions, and international biodiversity organizations. A workshop, held at the National Evolutionary Synthesis Center on February 5-7, 2010, drafted the present outline for the digitization and web mobilization of data and images associated with U.S. biological collections. Input from the community is requested as this plan develops to ensure that it builds appropriately on existing projects and reflects the missions and needs of the nation’s diverse biological collections.

Significance of Collections Digitization

Biological collections, gathered over more than two centuries of research and exploration, represent a significant national resource for research and applied biology that has been underutilized in the digital realm. Knowledge of the history of life is accessible only through biological collections of specimens, fossils, tissues, images and other data that are held in perpetuity by museums, universities and various state, federal, and non-governmental agencies. Knowledge of biodiversity, obtained through the use of collections, is critically important for studies of invasive species, biological conservation programs, land management strategies, biotic responses to climate change, the spread of pathogenic organisms, and research and management activities of many kinds. A coordinated effort to digitize existing biological collections and to mobilize the data and images in a freely available online resource is needed. Recent technological advances in the digitization of collections, combined with decades of experience and emerging efforts to standardize and integrate across collections, put the collections community in a position to address the problem in a concerted way. This effort would have major, positive impacts on U.S. scientific achievement and global scientific collaboration.

The Scope of Collections Digitization

Collections digitization is defined broadly to include transcription into electronic format of various types of data associated with specimens, the capture of digital images of specimens, and the georeferencing of specimen collection localities. In order to assess the scope of the undertaking required to digitize the nation’s collections, the collections community has conducted a survey to provide an overview of the number and diversity of specimens contained in U.S. collections. Additionally, the community has held three workshops on “Future Directions in Biodiversity and Systematics Research”. These, in addition to two recent reports (1,2), highlight the scale of the challenge, the need to address the integration of digitized biological data, the need to coordinate the capture of specimen data and images, and the necessity of providing broad accessibility to specimen data by scientists worldwide. Estimates of collection size range as high as three billion specimens globally, with as many as one billion or more specimens preserved and cared for by U.S. institutions, most of which (as high as 90%) are not accessible online.

At the current time there does not exist a comprehensive strategic plan for the digitization of the nation’s biological research collections. To be effective, such a plan should be conceived as a grand challenge and undertaken as a unified mission involving a coordinated funding program and a well-designed strategy for execution. In addition to addressing needs for physical care and housing of collections and support of collections-based research broadly (3,4), it is vital that the U.S. increase the online accessibility of its biological collections through an integrative and focused digitization effort in order to be able to best utilize the full value of our national biological collections resources. The plan also calls for the development of cyberinfrastructure to promote efficient and standard capture and mobilization of these data to make the national biological collections resource publicly available for analysis. The present focus of this strategy is on the digitization and mobilization of existing collection data. This initiative would not directly support the development of new collections or collection improvement through enhanced infrastructure, curation or management.

Objectives, Vision and an Outline for Organizing the Effort

The key objective of the plan is to create a publicly available, sustainable and comprehensive national collections resource by digitizing and mobilizing data from the nation’s biological research collections. Some of the desirable features of this new digital collections resource are:

• Images and data from all U.S. biological collections, large and small, integrated in a web accessible interface using shared standards and formats.

• New web interfaces, visualization and analysis tools, data mining, image analysis, and georeferencing processes developed and made available for using and improving the collections resource.

• The existing massive backlog of non-digitized collections digitized and web mobilized, with tools, training, and infrastructure created to prevent the recurrence of such a backlog.

A suggested framework for the digitization effort is presented here, for the purposes of obtaining community feedback on models for developing a biological collections digitization initiative.

Three tiers of effort that will accomplish this objective have been identified:

1. Develop a coordinated effort to provide technological support for the nationwide collections digitization effort, to organize new efforts with existing collections-based projects and international efforts, and to disseminate standards, techniques and best practices. This effort might take the form of a new center based at a single institution, a collaborative administrative group across institutions, or some other model that will achieve the same function.

2. Develop a network of regional collaborations for collection digitization across the U.S. These regional efforts might consist of institutions housing both large and small collections from the same region that unite to focus on digitization and web mobilization of collections in order to contribute to the national collections resource.

3. Develop investigator-driven and cross-regional collaborations driven by the specific needs of collections of a particular clade or preservation type, or motivated by a particular scientific question to be addressed by the use of collections images and data.

Strategy for Community Involvement

The plan to create a national digitized biological collections resource requires a strategic plan with broad support and input from the collections community and a diversity of stakeholders. Such a strategic plan incorporating community suggestions will be the product of this effort. The mechanism for community participation in this planning includes wide distribution of the present outline to institutions, agencies, and professional societies. The responses to the plan, collected through email and blog commentary, will be used in future meetings to complete a strategic plan. Community feedback on the initiative outlined here is critical. Feedback can be given by adding a comment on the blog page (https://digbiocol.wordpress.com/), sending an email to wg-digitization@nescent.org, or contacting individual participants in the recent meeting (www.nescent.org/wg_digitization/Main_Page). Group feedback based on institutional priorities or taxon-based needs is welcomed. Specific feedback is needed in areas such as support for the proposed model, suggestions for revision, ideas regarding the three-tiered approach suggested here, priorities for collection digitization, and ways to maximize collaboration across institutions and federal agencies, and at the international level. This feedback will be aggregated and provided to participants in future planning sessions that will develop a final strategic plan.

3. Stevenson, J. W. and D. W. Stevenson. 2003. Development of a national systematics infrastructure: a virtual instrument for the 21st century. Report to the National Science Foundation, Biodiversity Surveys and Inventories Program. New York, December, 2003.

As a member of the group that put together this first draft, I wanted to add that I think that a program like this has great potential to elevate museum collections and the data we serve to a new level of importance. A national collection resource like this will make new kinds of biodiversity science possible, and perhaps we will gain some traction in getting more solid institutional support for collections, at least during the program period when funding is available. I strongly support NSF developing this kind of program, and encourage people to make suggestions about the way to organize the program and priorities for digitization of their groups. My own perspective on our suggested three-tiered structure is that we will want to make the administrative level as thin and inexpensive as possible, and focus the effort and funding on the regional centers and the investigator-driven, self-assembled collaborations to digitize and web mobilize our collections.

This is an excellent start that I fully support. Commenting on tier number 3, it could benefit significantly from an expansion to include “collaboration driven by the specific needs of collections of a particular geographic area or collector” as well as a particular “clade or preservation type.” As one example, I am actively digitizing and georeferencing my own 5500+ botanical collections, duplicates of which have been distributed world-wide. I’m sure I’m not the only one. Other investigators develop digitized specimen data for geographic areas of floristic or other interest. Because of the personal knowledge and vested interests involved, all such data are likely to be of higher quality than if the collections were just digitized randomly by the hosting institutions. The results would be (1) saving duplicate labor and costs at multiple institutions by leveraging the existing efforts of individual investigators, and (2) higher-quality data.

One of my concerns is the veracity of identifications. In any cataloged collection, there will be some errors, and those errors become problematic in a large shared database where the chain of custody may be unknown, and those errors can only be fixed by looking at the specimens involved. I’m more concerned about that issue than about easy-to-fix things like georeferencing.

To follow up on Mark’s comment: I would argue that the best and most productive form of collaboration would be the sharing of specimens of a given taxon from multiple institutions with taxonomist(s) that are authorities on that taxon. Assuming there is funding that can pay for the taxonomists’ time, then everyone wins: the taxonomist has access to material that aids in their research, and the institutions submitting their specimens get them authoritatively identified. While it might not be considered necessary, if the digitizing takes place at the taxonomist’s institution, this would presumably allow for greater standardization of the data, and speed up the process (I believe that one person digitizing 5000 specimens of taxon X is more efficient and consistent than 10 people each independently digitizing 500 specimens apiece).

Not sure if this is the right place, but my comment would be: how does the group see this initiative crossing over with archaeological/anthropological/palaeontological collections?

Mark I agree with your sentiment

“…great potential to elevate museum collections and the data we serve to a new level of importance. A national collection resource like this will make new kinds of biodiversity science possible… ”

Isn’t a large part of any work on the long-term present and historical effect of biodiversity shifts & climate change the anthropomorphic connections? Is the digitization of those collections being covered elsewhere?

Doug’s phrase, “assuming there is funding that can pay for the taxonomists’ time”, highlights what seems to me to be a key weakness in the plan, also relating to Mark’s concern. Unless the digitization of the collections is supervised by someone knowledgeable about the taxa in question, the result will be numerous errors propagated over cyberspace and little reliable data. Even those collections that have up-to-date identifications are likely to be afflicted by typos, OCR errors, and the like if the digitization is not checked by someone who knows about the group. Yet there is very little support for doing the basic taxonomic work that must go before any successful effort at compiling data.

Unless there is significant funding for taxonomic review of the collections in question, the underlying data will not be good. Especially in understudied groups (e.g., most invertebrates), the proportion of misidentifications, outdated names, hard-to-read labels, and the like can be very high. As is, it sounds like yet another grand scheme that will draw funding without benefit while basic taxonomy fades away through lack of support.

One way that taxonomy might benefit from the effort would be to hire taxonomists, train them to digitize collections, and send them around, rather than hiring primarily on the basis of data informatics background. Taxonomic expertise will also be needed long-term to keep the classification up to date, not to mention to support future input to the data.

This is a fantastic proposal that would have a tremendous impact on natural history collections in the US. The data made available through such an initiative would be of broad interest to the general public, policy makers, and the scientific community. It would provide baseline data for long-term studies of plant and animal distributions, especially for studies of the impact of climate change on the flora and fauna of North America. As pointed out by Mark Westneat, this would also breathe new life into natural history collections by demonstrating to University and Museum administrators that collections are valuable. I strongly support the effort to combine data from large and small collections to take full advantage of the historical records that collections provide.
I know that NSF has a program for just this activity:

I would hope that this program could be expanded to provide additional funding.

Bees (and other pollinators) are an obvious group for this effort because of increasing concerns over the decline of honey bees and native pollinators. A good illustration of the utility of historical collections for analysis of pollinator decline is:

Thank you for contacting me about the proposal to digitize US Biological Collections. I am very much in favor of such a proposal. I am the director of the Herbarium at Knox College, a small liberal arts college. We have a fairly large and historically important collection, with many specimens first collected by Dr. Samuel Meade, a pioneer physician who actively collected in western Illinois from the 1830s until the late 1870s. His specimens are one of the best records of plants in western Illinois at the time of Euro-American settlement. He also traded with other plant collectors and thus we have a lot of material of similar vintage from all across the US and even other parts of the world. We estimate that we have 25,000-30,000 specimens in our collection. It is not well cataloged so we aren’t certain.

In terms of your proposal I am especially interested in item 2:

2. Develop a network of regional collaborations for collection digitization across the U.S. These regional efforts might consist of institutions housing both large and small collections from the same region that unite to focus on digitization and web mobilization of collections in order to contribute to the national collections resource.

Along with directing our herbarium, I am also a professor of biology, the chair of the department and the director of our field station. I don’t think I would ever have time to digitize all the specimens on my own. I would need some kind of assistance, preferably someone who could come to Knox to do the digitizing. I would be especially supportive of your efforts if they included such a plan for assistance to people like me who manage important herbaria but who already have heavy time commitments.

You are proposing a vital but ambitious and almost daunting undertaking. But it is so critical that we have a central storehouse for all the knowledge that sits in our collections that your proposed project is absolutely essential to understanding biodiversity in the US.

There are many different efforts to digitize and make available biological collections data online. These can be broken into two groups: those that can share data through a common set of standards, allowing a unified web portal / service like GBIF to make the data available, and those that, for whatever reason, don’t. I recommend that this distinction be made powerfully clear as you proceed so that efforts that are not serving data to GBIF can be transformed so that they do, and that efforts which do not intend to serve data to GBIF are not funded. It would be inefficient to fund projects that digitize data but fail to make them available through a common set of data standards and GBIF.

Additionally, the MANIS, ORNIS, and HERP-NET/FISH-Net projects are leading the way in the US with large-scale digitization of specimen data, and it would be best not to reinvent any wheels. Given the progress made in protocols and data standards by these projects, the majority of future efforts should go towards getting data digitized, not towards building new databases, arguing about standards, or making a new service like GBIF (although GBIF could use some serious improvement!).

1) your support for the proposed model, suggestions for revision
THIS IS AN EXCITING AND LONG-OVERDUE INITIATIVE TO WHICH I AND THE CLEMSON UNIVERSITY ARTHROPOD COLLECTION (CUAC) OFFER OUR STRONG SUPPORT. SPECIFICS ARE NOT YET AVAILABLE IN THE DRAFT MODEL, BUT THE CONCEPT IS FINE.

3) priorities for collection digitization
BEGIN WITH COLLECTIONS THAT HAVE ALREADY BEEN DIGITIZED, SUCH AS OURS, AND REFINE AND EXPAND FROM THERE. (OUR DATA ARE AT THE COUNTY LEVEL FOR SOUTH CAROLINA, MANAGED IN FILEMAKER PRO, AND AVAILABLE AS A SEARCHABLE DATABASE ON THE WEB http://bugs.clemson.edu/database/museum/search.php .)

4) ways to maximize collaboration across institutions and federal agencies, and at the international level.
REGULAR EMAILS TO PARTICIPANTS WITH LINKS TO RELEVANT DOCUMENTS ARE A GOOD WAY TO START.

I support the plan, and we (http://www4.nau.edu/cpbc/) are working towards the same goal, but at regional scales. I would like to be included on any lists for future planning meetings. I agree with the three-tiered approach, but I would reword the third focus; I had to re-read it a couple of times. I think a great first step is to simply document who is doing what with regard to imaging collections.

I have spent much of the last 6 years working with these issues. In pretty much that same period of time we have seen the development of GBIF, UBio, Discoverlife, EoL, and the like, which gives us some options on how to view this landscape. I personally am of the view that grandiose schemes will be less successful than efforts focused on individual collections with eventual aggregation of data. I say this because institutions “own” the collections and most institutions want to own their data as well.

At the same time we have seen the NSF pour a lot of money on Specify (which does not seem to have all that much community acceptance) but only lately has the foundation seen fit to do more large-scale digitization. Furthermore, TDWG has had meeting after meeting about standards, and I personally do not see that much, if anything, has come out of that effort.

Web-based approaches to data capture are the way to go based on my experience. They greatly simplify the IT and software maintenance issues. This is one of the areas where Specify has been a total flop. Web-based approaches also allow buy in by multiple institutions at little cost.

So, I do not think spending lots of money on advanced technology is the way to go. We know what kinds of data natural history collections contain, we know how to structure it, and there are some models for how to capture it through web interfaces. EoL has spent a huge amount on infrastructure, and I am not optimistic. GBIF has shown how to make the process bureaucratic and sluggish. Discoverlife has spent very little and produces the most nimble products.

Randall T. Schuh
George Willett Curator
Division of Invertebrate Zoology
American Museum of Natural History

This is a great project although the scope is daunting. I curate the Vasey (ISU) Herbarium, a small collection of about 50,000-60,000 specimens that functions largely in support of teaching. But this collection dates back some 200 years and has some notable components (specimens collected by George Vasey on Powell’s Colorado expedition, and specimens collected by Mary Strong Clemens in the Great Basin in 1909-1912 before she and her husband moved to the Philippines) which would not be readily accessible to others without digitization. Presently a HR camera exists on our campus for imaging the specimens, but the necessary software and the support for actually photographing the whole shebang do not, and clearly this will be the rock this project founders upon, if any.

Great timing. Our biological station has a small collection (~20,000 floral and faunal specimens) which we are just beginning to digitize. Given staffing constraints and variable confidence in taxonomic identifications we are using the following, low-overhead approach.

We are beginning with our largest and highest quality collections – vascular plants and bryophytes. We are collaborating with a larger institution (the University of Michigan Herbarium) from which we are receiving a database schema, hardware and software recommendations, and training for the digitization and QA/QC process.

In exchange, the UM Herbarium will receive unique specimen records and more precise location information for known collection sites in our region. Once this phase is complete, we will be better able to tackle our smaller and more complicated collections. We will also use this time to improve the quality of the identifications in those collections.

My thoughts for the project outlined on this website are as follows:

For the digitization process, I believe a regional (i.e., dispersed) effort that pairs complementary institutions is best. A mentor-mentee relationship, if you will.

However, how to get our collection online, integrated with multi-institutional databases and how to leverage these databases is an outstanding question. For this phase of the project I think the ability to consult with technical experts located at a centralized institute would be best. I think this is the phase where uniformity should be enforced. I could foresee help in mapping our database schema to whichever data standard is adopted as well as how to best serve up and leverage data from an IT perspective.

Kyle Kwaiser, Information Manager
University of Michigan Biological Station
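The schema-mapping step mentioned in the comment above could look roughly like the following. This is only a minimal sketch: it assumes Darwin Core is the adopted standard (the comment leaves the choice open), and the local column names on the left of the map are hypothetical, invented for illustration.

```python
# Hypothetical local column names (left) mapped to Darwin Core terms (right).
# The local names are illustrative assumptions, not any institution's real schema.
FIELD_MAP = {
    "accession_no": "catalogNumber",
    "taxon": "scientificName",
    "collector": "recordedBy",
    "date_collected": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "site": "locality",
}


def to_darwin_core(record, field_map=FIELD_MAP):
    """Rename mapped fields; return (mapped record, list of unmapped local fields)."""
    mapped = {}
    unmapped = []
    for key, value in record.items():
        if key in field_map:
            mapped[field_map[key]] = value
        else:
            # Keep track of local fields with no agreed target term,
            # so they can be reviewed rather than silently dropped.
            unmapped.append(key)
    return mapped, unmapped
```

Returning the unmapped fields separately is one way a central technical group could see where a local schema and the shared standard still disagree.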

Support for specimen digitization is surely needed, so I think this NSF initiative is great. However I share the concerns expressed above about data quality. I would hazard a guess that the majority of specimens in existence have never been vetted by an expert, much less kept taxonomically up to date. So I am troubled by this sentence:

“This initiative would not directly support the development of new collections or collection improvement through enhanced infrastructure, curation or management.”

This seems to reflect a vision of one-way data mobilization: from the dusty shelves of museums to the eager hands of scientists. Surely it would be better if it were a two-way street. Assuming that at least some of those scientists using the data are taxonomists, it seems very important that that expert knowledge be integrated back into the collections themselves. So I would argue that this initiative should explicitly allow and encourage informatics research for the purpose of collections improvement, with the goal of positive feedback between specimen curation and taxonomy.

I think this is a fabulous idea and I want to point out a strength of a database of specimen images that was not stressed above. If, like me, you are interested in studying morphological evolution, you need images of your specimens to score the appearance or presence or absence of orthologous traits across taxa. Someone else, using the same images, may later look at the evolution of a different trait. They will not need to handle the specimens directly again. Images contain thousands of information points about a species’ phenotype, just like genomes contain information about thousands of genes. Most people will only study a few genes and traits at a time, but it’s good to know that the potential to study all genes and all traits will be there.

Every institution and collection will have its own unique way of capturing information. Most database/imaging efforts have focused upon a single institution/collection or on search queries that try to pull this information from various sources. As others have commented, the problem is that many institutions don’t have the money or labor to get their data onto the internet and to maintain the servers, database software, and so on. Instead, these data hide out on non-networked personal computers, but could be easily uploaded onto the web.

Focus on making web-based software that will allow each institution to customize its own database and the way it can be used and accessed. This would be similar to websites like MySpace and Facebook, which allow users to customize their webpage. A national web-based system that allows each institution to create and customize its own database fields, data structure and images would be of great benefit to many museum collections and institutions. Furthermore, give users the ability to block data from public access if they choose to. Undoubtedly many smaller institutions and museums would participate since they would have a feeling of self-control over how their data would be viewed and displayed. Shoehorning everyone’s data into a single database would alienate people and institutions. Whether you have been using FilePro, Microsoft Access, or Specify, you could upload your data onto your institution’s website and start using the website, without having to invest time or hire a database software programmer.

I would also recommend working with various U.S. government agencies: USFS, BLM, NPS, and USGS. This would give some legality to the process. Many museums are very nervous in today’s climate about releasing data that may conflict with various governmental laws, or data that could be used for illegal collection of endangered plants/animals/fossils and at archeological sites. Having the U.S. government endorse a national database/imaging system, and give guidelines on what can be released to the public, would be beneficial.

This initiative would build on programs the NSF has developed and supported on both research and database fronts. Several people have mentioned the importance of standardization with GBIF; what people may not know is that the U.S.’s support for GBIF has come through budgets at NSF and that as NSF funding gets cut, things like support for GBIF could get cut also. I would hope that a stable national (and international) biological database along the lines of GenBank might evolve from this, but this could only happen with a solid long-term US government commitment. The existing portals (e.g., ORNIS, etc.), while providing important steps forward with respect to biological diversity databases, do not yet take full advantage of the potential that digitization provides. The proposed program would be a logical next step that would help support and ultimately promote and secure the value of biological collections, large and small, for the future.

I hate to be among the naysayers, but I think this plan addresses the easy and romantic part of managing our biological data resource while neglecting the pressing issues of its physical conservation. The utility of mobilizing the vast store of data residing in the biological collections of this nation is manifest, but this report is critically unbalanced in emphasizing the value of, and the strategies for, doing so without addressing the critical need to preserve the physical resource upon which it is based. I am struck here by the parallels between this proposal and the current debate between journalists and the outlets that disseminate their work for free.

The point is that national support for conservation of our biological database has been shamefully inadequate and capricious, at best, for decades, and the initiative as outlined in this proposal would seem to do little to address this pressing need. I, for one, would advocate that the systematics community withhold support for this proposal until a balanced and commensurate share of funding is devoted to the physical conservation and augmentation of the object collections upon which it would be based. I note with dismay that conservation is not among the proposed objectives and is only mentioned in passing and as footnotes in the second paragraph of the “Scope of Digitization” section. To say that this can be addressed in a parallel initiative is delusional: once the users have the data they think they need at this point in our scientific development, the real and continuing source of those data will be left to deteriorate in obscurity!

I strongly support this project. I can relate some positive experiences. Over about 10 years we have been developing the Southwestern Ecological Information Network (SEINet) that now hosts 1.1 million specimen records on-line. We (various PIs) have been fortunate in getting NSF money to set up some software (mainly database structures, mapping tools, search engines to start with) and money for databasing the three largest herbaria in Arizona. Thanks to a very dedicated and diplomatic programmer/organizer we were able to add in smaller collections from Arizona and from adjacent states. Enthusiasm grew and the New York Botanical Garden and others joined, so now we host specimens from around the Southwest and we would like to add more. Luckily herbarium specimens are usually identified correctly or nearly so. We spend time checking to see that this is true with our own collections. Mistakes often occur, but having specimens in a database can help us find misidentifications or georeferencing errors.
The benefits of databases of specimens, especially when they are georeferenced, are great. I was in Utah last summer and was able to make a checklist of the species for the Wasatch Mountains that helped me recognize what I was seeing. That was mainly based on specimens held at herbaria outside of Arizona but in SEINet. I can in a few minutes make species lists for most parts of the Southwest (AZ, NM, NV, UT and parts of northern Mexico, CA and CO). We have images of many specimens and living plants linked to names so the checklists can be illustrated.
My goal is to image every specimen and have it linked to a database record. We are now proceeding with this goal with our Latin American specimens and I hope we will have funds to get all the previously databased specimens imaged and linked also.
One of our programmers is developing a databasing program (SALIX, Semi Automatic Label Extraction program) that uses OCR on images of labels and parses the data to a database. I have a few workers that use it for several hours a week and are helping us perfect it. SALIX requires an operator to be working with it and it is not especially fast yet, but all the users prefer it to typing in data and we get images of sheets and labels as we go along.
We take two photographs of each specimen with 14 and 15 megapixel cameras. In each photo a barcode with an HRN (human readable number) is present. We have a program that finds the HRN and renames the image. So once that specimen is in the database the images can be linked to the record rather easily.
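A minimal sketch of the renaming step described above, in Python. It assumes the HRN has already been read from the barcode by some other tool; the HRN pattern, the view tags, and the function names are hypothetical and are not taken from the SEINet software.

```python
import re
from pathlib import Path

# Hypothetical HRN format: a short institution prefix followed by digits.
HRN_PATTERN = re.compile(r"^[A-Z]{2,6}\d{4,10}$")


def target_name(hrn: str, original: Path, view: str) -> str:
    """Build a file name from the HRN, a view tag, and the original suffix."""
    if not HRN_PATTERN.match(hrn):
        raise ValueError(f"unexpected HRN format: {hrn!r}")
    return f"{hrn}_{view}{original.suffix.lower()}"


def rename_image(original: Path, hrn: str, view: str) -> Path:
    """Rename one image on disk, refusing to overwrite an existing file."""
    new_path = original.with_name(target_name(hrn, original, view))
    if new_path.exists():
        raise FileExistsError(new_path)
    original.rename(new_path)
    return new_path
```

Because the new file name starts with the HRN, linking an image to its database record reduces to matching the name prefix against the record's barcode value.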
I would like to have something like SEINet for the whole country, and similar websites exist. Because handling huge amounts of data is difficult, I think regional nodes that can easily exchange data may be best.
One problem that is not foreseen by some is that once you have your databases and software they require attention. They are very useful and make our specimens more accessible but they also give us new jobs: information curation. For instance, every new specimen that comes in cannot simply be filed, but must be databased (and perhaps photographed) as well.

Question: At what degree of accuracy of identification is it proposed that data be captured? Specimen identified based on a recent taxonomic revision; specimen tentatively identified based on presumed association with type specimen [modern revision of taxon lacking]; specimen unidentified at the specific level; specimen unidentified at the generic level.

My background is systematic entomology, particularly the superfamily Gelechioidea [Lepidoptera]. I know that a significant percent of specimens in most collections are incorrectly identified or identified to a relatively high taxonomic level.

I would think that to keep maximum momentum and productivity for the project, identification would need to be captured at whatever level(s) of accuracy is(are) available for each specimen. This will be variable in any collection, and it is precisely the lower-accuracy specimens that need to be easily available digitally for specialists to either (1) ID more accurately if possible from the digital info, or (2) put on a “hit list” for examining the physical specimens. Annotations on specimens should always be captured and easily searched/viewed. (And yes, I agree that the digital data will be next to worthless if we can’t at the same time preserve their physical sources!)

Some feedback from a meeting with EMu users- collection managers and core collections staff that use the database for providing data online and handling most collection tasks like loans, info requests, cataloging etc. The group was positive about the idea and scope of a potential digitization project such as this, and had a number of good suggestions and important perspectives to keep in mind.

1. International collections as part of scientific objectives (third tier of effort) will be important and should not be excluded.

2. Collection management –in addition to higher level data mining and web tools, the basic collection management tools of loans, invoices, cataloguing etc that are part of EMu and Specify should not be forgotten or created over again- just need to improve the tools we have.

3. Imaging is less important than other data for many groups of organisms.

4. Collections have been understaffed for decades- will this project help or make that fact even harder? Impact on current collections staff needs to be factored in- need to maintain collections care, loans, etc without additional burden.

5. Training issues- not just anyone can digitize the collections, so training of taxonomists and collection professionals is key.

6. New web tools are less important than upgrading and fixing existing tools.

This is an important step in the right direction. In order for this to have the desired result, however, a couple of points must be considered:

The proposed end use is the availability of collections data on the web. Use of these data by the scientific community is manifold, but of even higher importance is their use beyond, by the public, by applied research, etc. This requires substantial quality control. The goal must not be quantity but quality. A large part of currently available collection data on the web are useless, especially in understudied groups, because of geographic gaps of collections, wrong taxonomic identifications, and lack of georeference data. Therefore, in addition to the digitizing effort, there must be a strong component of quality control, which can be implemented in different ways:

1. Building a common thesaurus of locality data that can be shared between all institutions to streamline the effort of georeferencing and locality data cleaning. Ideally, all collections should be georeferenced and should have links to standardized locality data points. This would also help cross-databasing since many collections are shared between different institutions.

2. Update taxonomy and nomenclature on digitized records. This can be done by involving specialists in the digitizing projects or by sharing digitized images with the scientific community and providing a feedback platform. All collections plus images could be available through a central portal which is accessible to taxonomic specialists who can view the images of groups of interest and leave taxonomic feedback that is submitted to the curators and collection managers responsible for the collection. Example: A specialist of the lichen genus Physcia would go to the portal, retrieve collection data from different institutions of that genus, view the images and IDs, and leave individual comments on each that are emailed back to the corresponding collections managers. This would greatly help to improve the quality of taxonomic identifications, although for many collections, it will only cover a fraction of what is available. One strategy to help this process is georeferencing, so that the specialist can see immediately if there are records that fall outside the expected distribution range of a species; these would have priority to be checked.

3. Many collections are available as duplicated exsiccati works that have been distributed among a large number of herbaria. The label data are usually available in electronic or otherwise published form. It would be a highly valuable effort to have a subgroup working on creating a central database on exsiccati records, by accumulating already digitized sets and adding further sets. This effort alone will create hundreds of thousands of high quality data records.
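The range check suggested in point 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: expected ranges are simplified to latitude/longitude bounding boxes, and the species name, the box values, and the record fields are placeholders rather than real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    catalog_number: str
    species: str
    lat: float
    lon: float


# Placeholder ranges as (min_lat, max_lat, min_lon, max_lon); not real data.
EXPECTED_RANGE = {
    "Species A": (25.0, 50.0, -125.0, -65.0),
}


def outside_expected_range(records, ranges):
    """Return records whose coordinates fall outside their species' box."""
    flagged = []
    for r in records:
        box = ranges.get(r.species)
        if box is None:
            continue  # no range on file, so there is nothing to compare against
        min_lat, max_lat, min_lon, max_lon = box
        if not (min_lat <= r.lat <= max_lat and min_lon <= r.lon <= max_lon):
            flagged.append(r)
    return flagged
```

Flagged records are candidates for a specialist's attention, not confirmed errors: an out-of-range point may be a misidentification, a georeferencing mistake, or a genuine range extension.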

In order to promote the project, I would like to see some example cases created, e.g. selected species with their collection records and what can be done with such records. One example could be the mapping of records over time to document changes in distribution that could be linked to land use changes, deforestation, climate changes, changes in environmental pollution etc. Another example could be the mapping of biodiversity hotspots by overlaying species numbers of different organisms in predefined grids.
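As a rough illustration of the second example case, the Python sketch below counts distinct species per grid cell from georeferenced records. The one-degree cell size and the (species, latitude, longitude) record layout are assumptions made for the sketch.

```python
import math
from collections import defaultdict


def grid_cell(lat, lon, cell_deg=1.0):
    """Map a coordinate to the integer index of its grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def species_richness(records, cell_deg=1.0):
    """Count distinct species per cell from (species, lat, lon) tuples."""
    species_by_cell = defaultdict(set)
    for species, lat, lon in records:
        species_by_cell[grid_cell(lat, lon, cell_deg)].add(species)
    return {cell: len(names) for cell, names in species_by_cell.items()}
```

Cells with high counts are candidate hotspots, although raw counts also reflect collecting effort, so any real analysis would need to account for sampling bias.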

As a member of the group that put together this draft, some quick thoughts and comments.

1. We did have meeting participants representing fossil collections and I am sure the plan is to include fossil collection digitization and mobilization.

2. My understanding is that such an effort as is being proposed would only have a positive impact on physical collections improvement. I do not think funding for this effort would impact already available resources for such work. Just the opposite in fact – it would help provide multiple possible funding streams for different kinds of essential work. Some programs might focus more on physical collections care, and some on digitization and mobilization.

3. Again, just my view, but the goal here is not to legislate database choices or to really impact local curatorial decision making at all. The big challenge here would be to develop publishing technology that would allow different data owners/contributors a very simple way to share their data in a common format with a larger network. The data standards for this are pretty much in place (DarwinCore) and there are lots of great approaches being developed for simplifying publication/transmission. There are some great models out there right now (GBIF, VertNet, etc).

4. There is a big need to grow our training and capacity to disseminate knowledge about mobilization/publication and I am very hopeful that could be as important a process as any other here. The draft strategy was an attempt to meet that challenge head on with national and regional infrastructure to help transfer knowledge and technology to all institutions.
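To make the "common format" idea in point 3 concrete, here is a minimal Python sketch that maps a local database export onto Darwin Core term names and writes CSV. The Darwin Core terms on the right are real; the local column names on the left are hypothetical, and real publishing pipelines do considerably more than this.

```python
import csv
import io

# Hypothetical local column names mapped to Darwin Core terms.
FIELD_MAP = {
    "accession": "catalogNumber",
    "taxon": "scientificName",
    "collector": "recordedBy",
    "date": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
}


def to_darwin_core(row):
    """Rename local fields to Darwin Core terms; missing fields become ''."""
    return {dwc: row.get(local, "") for local, dwc in FIELD_MAP.items()}


def write_dwc_csv(rows):
    """Serialize local rows as CSV text with Darwin Core column headers."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(FIELD_MAP.values()), lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(to_darwin_core(row))
    return buf.getvalue()
```

The point of the design is that each institution keeps its own database and field names; only the small mapping table has to be maintained locally.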

As you know, the taxonomic/collections-based community is already badly underfunded and overworked in relation to what is generally hoped for in biodiversity documentation, including not only extant collections but also the staggering numbers of unknown undetected life forms. I recognize the value that could be added to existing collections by digitization in general. The question is, what budget category will be looted to make this possible, assuming that the overall pie does not increase dramatically? Second, the people who can do the work are in short supply, except for the experts on vertebrates and Lepidoptera. Such people are bottom-dwellers in the impact factor race, and probably won’t want to spend a lot of time on digitization as another activity that does not accrue money and academic glory. That leaves their students, who in the funding and hiring climate of the last 20 years would be better off going into genomics and/or molecular bioinformatics.

Now, it would be nice if all this was not a problem- which means someone must throw many millions $$$ per year at it- admittedly chump change in Federal terms- so I will now assume that it is not a problem in a happy vision of the future.

Next question: technical details. Many soft bodied invertebrates require dissection. I work on earthworms, and there is no way an earthworm can be identified from a photo, even a good photo or set of really good ones, with a few notable exceptions that are in any case known only to a very few people. When I look for important external characters, sometimes it takes two hands to make the thing visible under 10-20X magnification. Sometimes it has to be seen from an oblique angle, and at several Z-scale settings. Internal characters are no easier to deal with and are typically essential in documentation of the species. So in short, it takes a person like me, or a well-trained technician, to do this digitization. The global total of earthworm species described last year was about 36. The majority of them were authored or co-authored by me. Revisionary work uncovering cryptic diversity is just starting- so it was one step forward, two back in 2009. If there is not a money problem, there is still a staffing problem. For many inverts there are not enough trained people to make a dent in the digitization pile, and there won’t be unless people can see a career in biodiversity study, not just a few months or years as a research assistant.

This is a welcome and exciting initiative statement. I concur with many
of the comments others have posted on the website, and my ideas overlap with them.
I add comments here just to emphasize what the others already say:

Basically I think the main foci of biological digitization should
prioritize type specimens and their literature for all published
biological names as far as types are preserved in the collections in
the U.S (ideally in the world but the project is focused on the United
States, right?). Second, but less importantly, the focus should also
be more on microorganisms than organisms we can easily “see”. The
diversity of microscopic organisms is underestimated, little known,
and little described. Visualizing them hopefully will speed up the
species-level and community-level description and accurate estimation
of the global biodiversity including the cryptic and inaccessible
habitats.

When funding is asked for digitizing and mobilizing biological
collection data, it is important to consider resources required for
the maintenance and long-term preservation of information generated in
the same manner that physical collections management entails, such as
training of digital collection managers, digital collections storage
space, digital collections management ethics and rights management;
long-term digital preservation, etc. Libraries are ahead of the
natural history museum community for sure in this respect.

As the Digital Collections Manager of the Herbarium at the Royal Botanic Gardens, Kew, I am keeping a keen eye on the discussion generated by this draft strategy.

As I understand it, this strategy is directed squarely at collections held in the US. (Is that correct? I’m not sure if Mark Westneat’s first suggestion is that collections outside the US should also be considered for inclusion?). However, any outcome will obviously be of extreme significance to collections elsewhere, in terms of data access and the development of tools, standards etc. Like many others who have posted here, I’m interested to find out how this initiative will build on infrastructures (e.g. GBIF), tools and standards that already exist or are in development.

Robert Luecking’s suggestion for the development of example cases as a means of promoting the specimen digitisation effort strikes me as something that would be particularly valuable. There are still plenty of people who need convincing of the value of this kind of effort (as is evident from reading the posts above). I would expect a broad portfolio of specimen-based analysis examples to be a valuable aid to fund-raising not only for additional digitisation effort (globally) but also for the maintenance and enhancement of the physical collections themselves.

I think this is a long-overdue excellent idea that will have wide repercussions. Given the size of many botanical collections, digitization would be an immense help in bringing together the unusual specimen not otherwise found by simple loan requests. My concern is the implementation. There are currently not enough curatorial people to handle this along with their other numerous responsibilities. Further sometimes (based on my past experience) those who have well-digitized collections or are well versed in the techniques of georeferencing leave the rest of us floundering because we no longer speak the same language. I cringe at the thought of “well-go-to-this-website-and-it-explains-everything” approach – I don’t have the time to wade through a new language. What kind of real support will this massive project provide the small institution as well as the large-but-poorly-staffed institution?

This project would be of great value to multiple communities and many different kinds of users with disparate goals, from schoolchildren seeking to learn about species found in their geographic area to scholars investigating biodiversity. I would like to suggest that the group also consider as specimens the artifacts made from biological materials held by museums, universities, individuals and others. For example, there are indigenous artifacts made from plant materials in some collections that are linked to herbarium collections and journals containing field observations with not only biological data on the plant but also the process by which the artifact was made. The economic botany community has been seeking support for inventorying, linking and digitizing the collections of useful plants and products which all too often are in a state of endangerment (for example, a collection of thousands of plant medicines and food crops from the late 1800’s on), with limited success to date. These collections provide interesting and important data on the species and cultivars people utilized in the past through the present. For example, a recent project studied the chemical composition of one of these ancient biological specimens including evaluating the stability of individual compounds following a century of curation, the results of which have application to the present concept of shelf life. It is important that these collections are more widely known and utilized, and digitization would help accomplish this. On another topic, digitization would be a wonderful way to back up some of the small collections in remote parts of the world, the “microherbaria” for example, that are precariously housed, vulnerable at many levels and underfunded, but which contain important historic and contemporary material on both native and introduced species, key to understanding biogeography of these organisms and answering many other questions. Please include in Tiers 1 and 2 the parts of the U.S. that are territories and/or have Compact of Free Association status (thus eligible for Federal funding), for example in the Caribbean and Pacific regions. Having materials digitized in areas outside the 50 states would facilitate biodiversity studies in habitats that are often costly to visit and thus understudied.

This proposal focuses on a central issue that most of us involved in natural history collections wrestle with on a daily basis: How can we best digitize our holdings, link them within a broader network, and make them accessible to researchers, students, and the broader public? The responses have been thoughtful and informed; reading through them has been valuable and instructive.

Echoing some of the comments, I tend to agree with Randall Schuh’s point that, “grandiose schemes will be less successful than efforts focused on individual collections with eventual aggregation of data.” There is a certain value in a network and diversity of collections, a value that centralization sometimes tends to overlook. Finding a way to foster digitization by those who know their collections best, while developing the platforms that can integrate those data (something akin to GBIF) might offer the best of both approaches. Rich Ree’s concern about “data quality” is well taken. Again, by working with those who know their collections best, some of these concerns would be addressed. In the case of the vascular plant collections with which I work, we often have an empirical sense about the reliability of the data. I readily concur with the point made by my colleague, Mariko Kageyama, concerning “digital curation”. Turnover in technology is occurring so fast that personnel specialized in IT is essential for the tasks proposed – both for implementation and for preservation in perpetuity.

I also sympathize with those who have expressed concern about such an initiative drawing funding away from the physical conservation of our collections, as well as the essential taxonomic work that needs to continue. I think Rob Guralnick’s suggestion that this proposal may foster more support for improvement of physical collections is as likely as not. This is probably true for basic taxonomic and systematic research as well. None of us can (or want to) do all these tasks – modern collections call for some combination of research scientists, field workers, collection managers, and information technology maestros. At any rate, the need to make our data digitally accessible is imperative if we are to be relevant in the 21st Century.

I think this is an important initiative and is worthy of enthusiastic community support. Clearly access to specimen data is a critical issue and there is an urgent need for new tools that promote the use of these data in novel ways. This is an area where investment by NSF could have a major impact on how we use natural history collections and how widely they are used. However, I think there are a couple of important caveats.

First, not all collections are easy to digitize. For example, I can testify from painful experience that it is a lot harder to capture meaningful specimen images from vertebrate fossils than it is from herbarium sheets. There are limits to what can be achieved through technology, because automation and standardization are tricky.

Second, and a much bigger issue, is what to do about the significant backlogs of uncataloged material in many institutions. This is another area where a strategy predicated on technology probably won’t work, because the major bottleneck in cataloging is staffing; people to identify, prepare, label, and place specimens in collections – if you’ve never thought about how time-consuming it is to label a fossil, for example, take a quick look here – http://tinyurl.com/yelqwrq .

Backlog cataloging is time-consuming and represents a long-term commitment on the part of the institution; 5-10 years plus in the case of the larger collections, assuming at least 1-2 FTE efforts. The FT in FTE is important because most existing collections staff are required to perform many different tasks (processing loans, dealing with visitors, answering enquiries, assisting with public and educational programs, etc), all of which are mission-critical for the institutions concerned, and most of which show an instant return on a short timescale (unlike the sustained effort required for cataloging).

This is not to say that I don’t think NSF should tackle this problem – cataloging backlogs are a major barrier to collections access. But I think it would be important for any collections receiving funding under this initiative to demonstrate sustained institutional support for cataloging in the long-term, possibly (and I realize I’ll probably get hate mail for this) through a requirement for matching funding.

Thank you! This is fantastic. I feel this is enormously important. I
am currently a graduate student at an institution with an impressive
entomology collection numbering in the thousands that has never been
databased, beyond the most rudimentary lists detailing selective parts
of the collection. And this is only entomology! There are also
general vertebrate and herpetology collections here – which I imagine
are in the same sad state of affairs. Further, when I volunteered as
an undergrad at my alma mater, its entomology museum numbered in the
tens if not hundreds of thousands, again, & 95% not databased; after my
undergrad I worked for a provincial museum in Canada with an enormous
collection, which was in a similar state. From talking with curators
and professors this seems to be the norm in collections, largely due to
lack of will & paucity of funds. As someone who wishes to work as a
curator in an invertebrate collection down the road, I applaud your
efforts. Accessibility is the key to having these enormous data
warehouses being used by the scientific community! I’ve said it many
times myself, and I’m a mere spring chicken. So: thank you!

Having pushed for creation of the US Virtual Herbarium (USVH) project, which has as its goal digitizing all specimens in all US herbaria, I am obviously very supportive of this initiative. A few points that I think merit more emphasis:
1) We need to incorporate an educational initiative that is directed at first year (or even high school) students; graduate school is too late. Biodiversity Informatics 101. Why? We need more students and graduates who are aware of the power of collections information and the enormous benefits that come from observing international standards in recording collection information; we need students and graduates who can envision new uses of the information provided; we need more students and graduates with education that spans what, in many larger institutions will be multiple departments and even colleges. We need US students and graduates, not just those in other countries, to be able to use the information.
As an aside, a niece in Canada took a GIS course. They worked with US data – there was so little available – for free – from Canada at the time. Result? She learned a lot about US geography – and nothing about Canadian geography. Bad move Canada. This program is for all specimens, not just US specimens – but we need US students learning to build and use it.
2) Emphasize the extent to which involving all US natural history collections will benefit communities throughout the country. Moreover, it will create employment opportunities. Most will be temporary, student positions, but our students need opportunities that enable them to expand their education while earning money. Perhaps they will not become systematists. Let’s hope they graduate with an understanding of the role of systematics and a supportive attitude. There will probably also be some new professional positions. Equally usefully, the initiative may persuade some institutions to place greater emphasis on collection-based research when announcing new openings.
3) How to organize? There needs to be flexibility – possibly focusing on how specimens are stored/housed more than (or as much as) their taxonomic group. There will be groups where the value of imaging is questionable. Remember though that the imaging will capture the label (I presume) – and perhaps field notes. This is an area where those working in a particular group should make recommendations as to the most effective procedures. Overall, I prefer a collaborative approach to creation of a single center.
Yes, it would be wonderful if we could train a bunch more systematists – but that is going to take time – and I bet there have never been enough. This initiative will make more students aware of the discipline, open up different kinds of opportunities to engage in it, and increase our opportunities for making contact with potential students early on (think BI 101).
Let’s do all we can to support this initiative. My thanks to all those involved in developing it.
Mary Barkworth

With all due respect friends, I feel like I’m watching a crowd of lemmings follow each other over the cliff. The benefits of digitizing collections-based information are real (though often wildly exaggerated), but digitization loses much of its value if the collections it is based on are not continuously expanded, improved, and curated properly. Except in a few well-funded enclaves, our natural history collections are in trouble – big trouble – that stems from decades of underfunding, mismanagement, staff cutbacks, and the continual devaluation of collections -based research by the so-called illuminati that govern science funding in this society. Digitization is of limited value if we have an inadequate specimen base, if the specimens are poorly identified, poorly curated, or likely to perish in the next few years at the whim of some highly-placed ignoramus that needs space for another genomics lab. As an entomologist, I can testify how serious these problems already are in many places, despite the heroic efforts of curators and staff alike. I can also testify to how surprisingly thin our specimen base is, even in many supposedly well-known groups.

The customary response of the digitizing faithful to such comments is that digitizing museum collections will demonstrate the importance of collections and that eventually this will attract new funding for them. This is the silly old economic trickle-down theory in a new guise. I have yet to see the slightest evidence that this is true. My experience (not inconsiderable) is that digitization projects, no matter how well funded or well-intended, suck up scarce resources badly needed by the specimens themselves. This is especially true of the time and energy of curators and staff. Almost invariably, the tail ends up wagging the dog, and it is the collections that suffer.

The present document is all about exploiting collections rather than helping them. I urge casting it into a digital paper-shredder and creating a proposal that emphasizes linking every digitization project with a substantive collections improvement or expansion project. We desperately need to shift the priorities here and to treat collections as resources that need conservation, improvement, and care, rather than as cows to be milked.

This is an addendum to my previous comments. I took another look at the document today to see if my previous remarks were not too harsh. I’m sorry to report that they were not nearly critical enough. Let me try to say why:

1) The document treats collections as static resources that are sitting out there underutilized – ready to be “mined” for the benefit of digital use. Collections (and indeed specimens themselves) vary enormously in quality, and thus so does their value for the digitization being proposed. A lot of the stuff sitting in museums today is of nearly microscopic utility compared to modern, well-collected material – but it sure takes up a vast amount of space!

2) The document also reflects an implicit misunderstanding of the value of images. Specimens are not simply raw material for images. Value resides in specimens, not images, and images have scientific value only because they are linked to actual specimens. Plus, even the best images have only a limited utility compared to actual specimens. To overreactors out there, I’m not saying that images are useless, I’m saying that they are secondary sources of information, and they need to be treated and valued only as such. Our primary concern should always be focussed on the specimens, from which all digital image collections derive their value. There is no sign of this understanding in the document.

3) There is no indication in the document that anyone involved had a sense that the job of collections remains undone. Our collections may be vast, but our coverage of biodiversity remains poor. In general, we need to do a lot more collecting, accessioning, and taxonomic research (vertebrates and a few other showpiece groups excepted). Because the proposed “national digital collections resource” is only as good as its specimen base, it seems to me that concern for improving our specimen base is critical to the ultimate value of the digitizing effort.

4) I can already hear the databasers and the web developers say, “But we are not talking about these issues – they are a separate subject.” To that I say, “Don’t kid yourselves.” As a collections person, active collector, and taxonomic researcher, I am besieged by people who want identifications, alcohol specimens for DNA sequencing, interpretations of data, real distributional data, ecological information, and even alcohol extracts of specimens for bioprospecting! A national digital collections resource might help with some of that, but the effort to get the data and images online would take me “offline” for years – a devil’s bargain I’m not willing to make. The same is true of many other organism people. To get what you want, you will have to engage with collections people and organism people and take their concerns seriously. Without such real engagement all we will have here is another cosmic plan imposed by another out-of-touch, self-proclaimed managerial elite that will just impede and annoy the few troops remaining on the front lines in the struggle to understand and conserve the world’s biodiversity.

Your points are very well taken. What would be most helpful from you now is a follow-up with some specific and constructive ideas for recasting the proposal in ways that address your very valid concerns, while still moving us forward on both the physical and digital fronts. Whether or not you choose to accept this challenge, digitization is not going to go away, and is going to continue adding very real value and visibility to the physical collections, such as they are. So let’s turn it into an opportunity to strengthen the physical foundations while we’re at it.

This is a worthwhile initiative, which I heartily endorse. It promises to coordinate dissemination of information critical to improving the efficiency of taxonomic and biodiversity research. Thus, it will help systematists, conservationists, and naturalists. As well, it will be a valuable educational tool, though more could be said about this point. I see tremendous benefits for all branches of natural history, but I will mention how this project would help two kinds of projects, speaking from direct experience.

Those who work on Floras, like me, will benefit tremendously from realization of this project. No single herbarium has all the specimens necessary to complete a Flora, especially when the Flora is of a large region. Having the nation’s herbarium collections digitized will allow study of many more specimens than is presently possible, making the product much more accurate and useful. Also, this project will enable loan requests and visits to other herbaria to be highly targeted, conserving limited resources.

The revisionary systematist in me is also excited by the prospects of this project. Most of my revisionary work is on sedges (Cyperaceae), a species-rich, cosmopolitan group whose members are almost uniformly difficult to identify and undercollected. High-quality images of specimens and databased locality data will enable sedge systematists to answer many basic questions for which we still lack answers. I’m sure the same is true of many other poorly known groups.

I have two specific recommendations. First, to address the concerns of several others as well as my own about the correctness of the identifications (sedge specimens have a high rate of misidentification in herbaria, after all), be sure to include a dynamic annotation feature. That way, those who wish to enter determinations in the future will be able to do so. Second, reach consensus on standards for restricting access to locality data for vulnerable species and those of serious conservation concern. I do not suggest restricting access to legitimate researchers, but I do not want to make it easy for poachers!

Rob Naczi
Curator of North American Botany
The New York Botanical Garden
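
The second recommendation above, restricting public access to precise locality data for vulnerable species, could be sketched roughly as follows. This is only an illustration under stated assumptions: the record fields, the `SENSITIVE_TAXA` list, the placeholder species name, and the `public_view` function are all hypothetical, and the actual rules would depend on the community standard the comment asks for.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical list of taxa flagged as vulnerable. The name below is a
# made-up placeholder; a real list would come from an agreed standard.
SENSITIVE_TAXA = {"Exemplum rarum"}


@dataclass(frozen=True)
class SpecimenRecord:
    catalog_number: str
    scientific_name: str
    latitude: Optional[float]
    longitude: Optional[float]
    locality: str


def public_view(record: SpecimenRecord,
                is_vetted_researcher: bool = False,
                decimals: int = 1) -> SpecimenRecord:
    """Return the record as it should be shown to this user.

    Vetted researchers and non-sensitive taxa get the full record.
    Otherwise coordinates are rounded to `decimals` places and the
    verbatim locality text is withheld.
    """
    if is_vetted_researcher or record.scientific_name not in SENSITIVE_TAXA:
        return record

    def coarsen(value: Optional[float]) -> Optional[float]:
        return None if value is None else round(value, decimals)

    return replace(
        record,
        latitude=coarsen(record.latitude),
        longitude=coarsen(record.longitude),
        locality="Locality withheld (sensitive taxon)",
    )


specimen = SpecimenRecord(
    catalog_number="0001",
    scientific_name="Exemplum rarum",
    latitude=35.123456,
    longitude=-78.98765,
    locality="Roadside seep, 2 km N of town",
)
masked = public_view(specimen)
full = public_view(specimen, is_vetted_researcher=True)
```

Rounding rather than deleting the coordinates keeps the record usable for coarse distribution maps while hiding the exact site; whether that trade-off is acceptable is exactly the kind of question a consensus standard would have to settle.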

My comments on the Strategic Initiative are personal, but are influenced by nearly 40 years of working with collections and my experience in organizations including active participation in ASPT and SPNHC. I have broken my comments into several sections:

—- support for the proposed model, suggestions for revision

The coordinated approach that is suggested is absolutely critical if we actually want to accomplish the task at hand. Coordination rather than each collection competing against all peers as in current NSF competitions may be difficult to instill in light of the desire for everyone to get a piece of the often-coveted “overhead” that comes with many grants. Such coordination might suggest a model more like the Mellon Foundation has used in funding the Global Plants Initiative – make a good case for the need via a short proposal and the “wish” is often granted.

There is one critical factor that must be addressed. It must become easier for interested collections personnel to both become aware of, and stay current with, the advances in methodology and collaborations in the field. The herbarium community has embarked on efforts to help their colleagues and hopefully other groups are also investigating this area.

—- ideas regarding the three-tiered approach

Tier 1 – One aspect that is critical here is making technical support easily available. Many collections are heavy on taxonomic experience, light on IT. Unfortunately, this is not a problem only for small collections; some large collections also face IT challenges.

Tier 2 – Working within regions seems clearly to be the most efficient organizational option to get the most “bang for the buck”. This model is being used as a basis in initiatives like the US Virtual Herbarium and encourages cooperation, such as cooperative georeferencing as employed in the MANUS project.

Tier 3 – Investigator/clade-driven collaborations focus on an area that I don’t think has been used much so far. I often think about the number of data sets that are created by monographers that never get back to the owners of the collections.
—- priorities for collection digitization

The sheer size of the collections in the US argues for the need to “start somewhere.” I suggest that we need to focus on the collections that will give us the most information about the US ecosystems first. It seems logical to group our knowledge into regional centers that could then be the primary source of info about their areas – much like the intent of the NBII regions.

This is especially critical in the case of botany collections where duplicates often abound, sometimes as many as 70% of the specimens in a herbarium were not collected by personnel at that institution. Herbaria should concentrate on the collections where they hold unique and “first sets” and should be responsible for their capture. Since ancillary information, like collection books, is often held where the primary set of specimens is housed, that herbarium should have the best chance to produce the highest quality data record.

—- ways to maximize collaboration across institutions and federal agencies, and at the international level.

One of the challenges that I see facing many US institutions is, if one wants to digitize “everything”, where are the resources going to be found to digitize, much less georeference, all of our Old World collections? For some collections, those collections represent a large fraction of their holdings. Is a funding agency in the US going to care that herbarium XX has the best collection of Sumatran orchids outside of Europe? Working cooperatively with each other and international repositories seems to offer the best chance for getting these specimens digitized.

Another thing that we need to be aware of, and be ready to support, are international efforts aimed at similar promotions of collection databasing. One example which may be close to fruition is SciColl, Scientific Collections International, a project proposed under the OECD Global Science Forum.

The most pressing need in this area involves continued efforts to establish more common understanding and appreciation of what is best for each constituency. I have found that some of the efforts undertaken by NBII-GBIF and National Park Service have helped in this arena. While both academics and “agencies” use biological collections, they use them for different reasons, and each may not necessarily appreciate those differences and the reasons that lead to them. We need to continue work here – skepticism of the “other camp” is certainly still a problem!

I think that the effort being outlined and developed is very ambitious and will be tremendously useful. It is not without its problems to be solved, many of which have been pointed out by others in this blog. But the more specimen information that is electronically accessible, the easier it will be for us as systematists to do all of our work and in the future this will continue to be ever more useful beyond the systematics community.

I am not worried about the discussion of whether this data is held in institutional databases or with a single community-wide model as it will probably be both. As a community, we are already going in this direction, working together to make more specimen data available, and better support from NSF will only increase our ability to get there.

Several writers have expressed concerns about possibly misidentified specimens, and again I don’t see that as an issue that we should regard as excessively daunting. Misidentifications are more likely to be corrected as more people see the collections, and making them electronically accessible will certainly help address that issue.

So while there are certainly hurdles to be crossed and problems to be solved, I think as a community we are already working towards the goals outlined here and that we can work together to make this a successful venture and help us all move toward the goal of making the collections that we all care for more readily accessible to our own systematics community and also other groups that could benefit from access to specimen data.

James S. Miller
Dean & Vice President for Science
The New York Botanical Garden

Very important–as mentioned, this inventory needs to be as broad as possible in scope. That includes living as well as non-living collections both zoological and botanical (i.e. zoological and aquaria parks; botanical gardens; and inventories from managed areas–national and state parks, private collections, and commercial). Although inventory of invertebrates and plants will be the biggest challenge there are ways to accommodate this as well. All species need to be documented in managed natural areas from the aquatic, to marsh and upland to montane. Information needs to be filed digitally as well as non-digital (i.e. narrative) information where appropriate. My assumption is that all references in print will not be included since these should be accessible from other sources (i.e. Libraries, etc.). An inventory of (a) all collection holders will need to be assembled and (b) an agreed-on list of data to be filed both what and where. Go for it!

I have one comment. The following phrase appears in the draft document:

“This initiative would not directly support the development of new collections or collection improvement through enhanced infrastructure, curation or management.”

I think that I understand what the authors mean (in a strict sense, or “directly” as it is put here), but curation and management will be critically improved by this concept. I’m thinking especially of “push-pull” techniques of annotating duplicate specimens, updating nomenclature, reducing the number of physical loans (or numbers of specimens per loan), reducing direct requests for data, etc. Perhaps a sentence of clarification could be added.

Also, if politicians are likely to be involved, it would be a good idea to point out how other countries (e.g., Australia) have done it, both to show that it can be done and to shame us into keeping up!

I agree with many (most) of the comments posted here and am particularly sensitive to the criticisms expressed. I confess that I have not read every comment in detail, so if the comments I make are redundant, my apologies.
Images: I frequently receive photographs of specimens for identification. I find that it is virtually impossible to confidently identify species, even those that are familiar to me, from photographs alone. There would have to be some protocol for photography of each taxon to be sure that all diagnostic characters are included in sufficient detail. If this effort is to be made, it should focus first on type specimens and be overseen by a taxonomist familiar with the group. A secondary category would be specimens too rare or valuable to send on loan and that would otherwise require a collection visit for examination.
Much of the most valuable information (at least in herpetology) is already available in the online catalogues that exist, and more collections need to be made available including this information. The most important information is the location of accessible specimens, the collection and locality data, and the means of preservation. Many models for this sort of digitized catalogue already exist.
One essential concern I did not see in the previous comments is that tissue used in molecular taxonomic studies needs to be correlated with reference specimens, and if specimen reidentification occurs, these changes need to be included simultaneously in tissue catalogues. Specimen identification is highly subjective under the best of conditions and strictly molecular taxonomists need to be aware of this.
Finally, having once been a curator, I have some idea of the costs, training, and labor that such a project would involve. Most of the hands-on work will fall to the collections managers, most of whom are already fully tasked with the maintenance of collections. Before a proposal is submitted a realistic assessment of needs and costs must come from these people, who are the hearts and souls of contemporary collections.
Steven C. Anderson
Professor Emeritus
Department of Biological Sciences,
University of the Pacific,
Stockton, CA

There are enormous differences in how to make useful images of different specimens, sometimes within a clade; in the number of collections of different groups; and in the number of specialists in different groups. There must be flexibility in approach – and the project must sell itself to the public.
1) Start with organisms that are usually identified by examination without dissection. Within that group, start with those organisms where standards for sharing digital information are well developed and there is a group with a functional plan for implementing a digitization plan throughout the country.
For organisms with many different sizes but still usually identified without dissection, are there ways to maximize sharing of equipment? Perhaps crossing clades/collections on the basis of size? Sharing is probably best done regionally – which would help build collaborations.
For organisms requiring dissection to get useful information (such as earthworms?), imaging and even capture of collection information should be set aside until those involved can see a way in which either activity will be beneficial. Their concerns must be addressed even if it means being a little less ambitious in the immediate goals.
2) Look for groups for which there are many collections scattered throughout the country. There are two reasons for this. First, we need to educate students (and ourselves) as to the kind of information that will dramatically increase the value of collection information; reaching many collections will maximize this impact. Second, we need to develop an appreciation of the value of the initiative by multiple groups; absent widespread support, this project will not get the funding it needs.
Use state level collaborations to discover how many different natural history collections there are. These could tie into state educational and agricultural interests. Do not conduct yet another survey. Get people to ask but ask only for essential information and offer something in return – a map showing where collections are located, their taxonomic purpose, role, and size. It is not much, but it would be simple and the information would be useful.
3) Organization: Taxonomic groups (which may not be clades) seem the best way to go because the project could build on commonality in methods of curation and disciplinary interactions. This will also facilitate development of educational components.
4) IT, yes, there are lots of bits and pieces that would help. Ed Gilbert argued at the US Virtual Herbarium meeting for adopting a modular approach. Such an approach has already substantially reduced the time required for data capture from herbarium specimens.
5) Maximizing collaboration: in vascular plants this seems to work well at the regional level. So long as we all adhere to international data standards, sharing will be easy. Minimizing redundant work is good, but opinions on how to do this vary. I do not find widespread support for the idea that it would be most efficient for each herbarium to start with the specimens that originated from its institution. International Collaboration. Let’s first get doing things within this country – observing international standards in so doing.
6) Avoid Mission Creep. The proposed project is far larger than anything biologists have previously undertaken – and will require major organization. Attempting to bring in living collections, both public and private, would vastly increase its complexity.
7) Priorities: Start with groups that a) will be easy to digitize because of how they are stored; b) have an organization with a plan in place or under active development – and support of their curators; c) have international standards for the information required; d) have many collections scattered throughout the country.

I recommend starting with vascular plants. Yes, I am prejudiced but: most specimens are flat and of a standard size; there are well established standards for the data associated with such specimens; there is widespread support among vascular plant curators for digitization and an organization in development to accomplish exactly what this initiative has in mind; there are about 650 collections scattered throughout the country and a structure for maintaining information about them; there are easily accessible authority files for names and institutions for organisms in herbaria.

Having looked over all the commentary to date, and in considerable detail, I feel I have to concur with most of the observations made by Leo Hickey, Randall Schuh and Stefan Cover. I also feel this approach has all the ponderous, monolithic feel of those predecessors of which I am aware, and that the concept’s development has too little to do with the overall needs of collections and collecting institutions.

A good deal of what I see going on around the museum and research fields supports the idea that web-based solutions have the most potential for success. A number of you know that I’m not a research scientist or a full-time collections manager, and that I spend a great deal of my time at the point of interaction between collections and the general public. Increasingly, web sites are where museums and the collections they hold meet their broadest audience, and the monitoring of the uses of these websites (in the best of cases) leads to the shaping of the sites and how the data – text and images – is gathered to go into them. Even a small discovery in the collections, obtained during inventory, condition assessments, work by a visiting researcher, etc., can have immediate benefits once it is shared with the world via the web. Large scale, one-size-fits-all approaches can muffle, even stifle such interaction and its benefits.

While building support for biological and natural history collections within the scientific community seems an endless task, the general public appears to me to have a fairly automatic connection to and appreciation for the significance of collections. The extent to which legislators and bureaucrats are unaware of the public’s interest in collections is a reflection, I feel, of the limited extent to which many collections conduct public outreach. In my view, to say that effective outreach will be an important spin-off of this sort of digitization effort is to place “the cart before the horse”.

I review grants for the Federal Institute of Museum and Library Services, which funds digitization projects. In evaluating these proposals, we are expected to ascertain if an audience is identified for the project, if the project plan is adequate and achievable, and if the project outcomes are measurable. In this case, I am once again impressed with the technical acumen of my colleagues in planning these kinds of projects, but I feel strongly there needs to be greater consideration given to audience and outcomes.

We write to support the initiative to digitize natural history collections in the United States and to offer some suggestions based on our experience with the herbarium at the University of Connecticut (CONN). We began entering information on plant specimens into a database 10 years ago and had one-third of the collection databased a year ago when the pace of the work increased dramatically, thanks to a three-year NSF grant. We hope to have the entire collection databased and online within the next two years; it probably will take us one additional year to finish making digital images of the collection. All data and images are being made available online, through our web site, through a multisite hosted by the Royal Botanic Garden Edinburgh and through GBIF. One of us (RSC) also is active in the regional herbarium node, the Consortium of Northeastern Herbaria, which is considering ways to help small herbaria database their collections and to establish a regional node that would serve data for all herbaria in the region. Based on these experiences, we have these observations and recommendations:

The most important part of digitization involves entering information on specimens into a database. The work is not difficult, but it requires great care. The work takes time – a lot of time. The single greatest obstacle to digitization, therefore, is money. When money becomes available, people can be hired to do the work. There is no question about the importance of digitization and the potential benefits of making specimen information available online; the only question concerns how quickly it can be done, and that will be determined by the amount of money that is made available to hire people to enter the data. Critically related to this issue is appropriate supervision and quality control, to ensure that digitized database records accurately reflect the original specimen information.

Having a single database platform available (and free) to all collections (Specify, for instance) would enormously simplify the work of linking individual collections in regional and eventually national collections databases.

Provision must be made in the national plan to provide technical expertise and support to collections. The smaller the collection, the more help will be needed. Technical advice, including step-by-step directions, will be needed on how to set up a database, how to enter data, how to georeference specimen records, how to find information on collectors. Detailed instructions will be needed on setting up web pages, backing up the data, connecting computers in networks, establishing connections to GBIF, uploading data so they can be viewed by the public. Adequate, step-by-step instructions are available for none of these processes, and it cannot be assumed that in-house technical support exists. It certainly does not here at UConn, and smaller institutions would be even more limited.

Particular attention should be paid to the smallest collections. Traditional models for digitization, in which people at the institutions do the work, cannot be used with small collections. The staff responsible for small collections (if there are any) do not have the resources and will not have them. Yet small collections collectively house a large number of specimens. For instance, the Consortium of Northeastern Herbaria, which includes the eastern Canadian provinces, represents more than 50 herbaria with fewer than 10,000 specimens. We need new ideas on how to get these collections digitized.

Donald H. Les, director

Robert S. Capers, plant collections manager

George Safford Torrey Herbarium (CONN)
Department of Ecology and Evolutionary Biology
University of Connecticut

A recent workshop sponsored by NSF to discuss the creation of the US Virtual Herbarium (USVH) was held at the Missouri Botanical Garden from February 23-25, 2010. The 65 participants included curators, researchers, collection managers, information technologists, informatics specialists, and stakeholders, who together focused on issues concerning digitizing and mobilizing specimen-based botanical information housed in herbaria throughout the United States. During this meeting the Strategic Plan for the digitization of all US biological collections was presented for comment. The USVH workshop participants have reviewed the details of this plan and stand firmly in support of it. Further, we believe that our complementary effort highlights the fact that the botanical collections community is well organized to mobilize the over 600 herbaria in the US and we are prepared to help lead in this national effort. Given the number and distribution of herbaria across the US, we believe we can provide significant organizational support to the formation of a national effort aimed at digitizing all biological collections.

Herbaria have a long history of collaboration with other natural history collections. Recent technological advances and changes in perspective of the community have created an opportunity for a fundamental shift in collections-based science. In leading this shift in our science, the herbarium community is generating workflows and tools that are directly transferable to other domains, and we can provide a model system that is extensible to other collections within our respective institutions, as well as to other natural history collections regionally and nationally. The herbarium community is also unique within the biological sciences in the organization of our collections and the ease with which we can obtain digital images from our typically two-dimensional specimens. In addition, many collections across the country are already partially digitized, giving us a ready dataset for establishing proof-of-concept to decision-makers and government officials. There are also several portals that have established the value of such datasets to such individuals as well as to educators and the general public.

Therefore, we urge the authors of the Strategic Plan to consider our maturing infrastructure as a springboard for building an efficient and effective nationwide strategy to accomplish the tasks at hand. We are in complete agreement with the immediate need to digitize collections and make them available as broadly as possible to the scientific community and the public. Near-term investments in advances in digital photography, computation, higher speed networks, digital storage and Internet integrated workflow tools will enable the efficient complete retrospective digitization of the herbarium collections. Simultaneously, these advances will allow all newly collected specimens to be digitally documented in the field and the data made available in near real-time, leading to much more efficient metadata generation and use. We believe the Strategic Plan should encourage these investments as the availability of high quality images and metadata will allow stakeholders such as land managers, agricultural specialists, customs officials, and many others to quickly make more informed decisions, and in turn would be of tremendous benefit to our society.

The strategic planning document is in direct alignment with the first broad goal of the USVH which is the digital capture of all specimen-based information, including images, of the more than 70 million known specimen objects [B. Thiers pers. comm, Index Herbariorum data, Feb. 2010]. The botanical collections community has been working toward achieving this goal since 2004 when, as an outcome of an NSF-funded workshop we set a target of completely digitizing herbarium collections by the year 2020 [Rabeler & Macklin, 2006].

The botanical collections community has two national groups that focus on collection-based resources and initiatives: the Systematics Collections Committee within the American Society of Plant Taxonomists (ASPT) and the Society of Herbarium Curators (SHC). The USVH project is currently working to use these groups to provide information to the community, and also to use regional groups to facilitate resource management and outreach. In the last decade the botanical collections community has formed 13 regional organizations, which vary in both governance structure and level of organization and functionality. These organizations facilitate inclusion of smaller herbaria and provide capacity building in the form of increased regional coordination, communication, and sharing of expertise and resources. Two of our most successful such consortia are the Consortium of California Herbaria (http://ucjeps.berkeley.edu/consortium/) and the Southeast Regional Network of Expertise and Collections (SERNEC – http://www.sernec.org/). The USVH is currently working to build on these established structures; we believe that this model could benefit the mobilization of the other domains.

Thus, the botanical collections community has already accomplished the second tier of effort cited in the strategic planning document. However, for the over 600 herbaria in the United States that curate from 250 to 7.2 million specimens [B. Thiers pers. comm, Index Herbariorum data, Feb. 2010], with widely varying levels of curatorial expertise and resources, full digitization remains a daunting task without a large influx of new resources. At present, around 30-40% of herbaria are digitizing collections, but there are significant impediments to further progress. The variation in size and institutional support among herbaria leads to challenges in resource allocation, some of which can be alleviated by resource sharing but some will require significant financial assistance in order to complete the digitization of all US herbaria in an efficient and timely manner. Several of the largest collections face the prospect of missing the proposed goal, lacking the physical resources to tackle the mountain of specimens they are curating. Many of the smaller collections are bereft of adequate IT support, but can use student labor to assist in digitizing efforts, if provided with the financial resources to train and manage these students. Current funding programs available for digitization force our community to proceed toward our 2020 goal in an inefficient piecemeal fashion. An enhanced regional and national approach to effective and efficient resource allocation would significantly improve our ability to reach our goals.

The USVH project is focused strongly on a second goal of improving efficiency of data capture and enhancing the quality of this data through the development of tools [e.g. HERBIS – Heidorn and Wei, 2008], standards, and best practices, that can then be implemented into established workflows [e.g. Apiary Project http://www.apiaryproject.org/]. This goal depends not only on a solid and sustainable cyberinfrastructure but also the need to include support for the taxonomic expertise associated with collections. We believe that it is fundamentally important for data we mobilize to be of significant and verifiable quality, in order to be fit for scientific research within and beyond our community, and for education and other applications. Fortunately, our community has a long history of contributing toward the development of standards and best practices for specimen data capture and dissemination, but it is essential that support for curators and associated taxonomic expertise be a component of the Strategic Plan and subsequent requests for funding.

We believe it is critical to recognize that the curatorial and taxonomic expertise associated with collections is at least as valuable as the collections themselves. In order to make use of this expertise in the digital age, we need to provide an infrastructure that allows for efficient and effective annotation and correction of the digital records. Botanical collections differ somewhat when compared to other biological collections in having many specimens represented by duplicates distributed from one institution to another regional, national, or international herbarium. Many herbaria have as much as 70 percent of their holdings consisting of specimens duplicated at least once (Rabeler & Macklin, 2006). This presents an advantage as we advise herbaria to prioritize their data capture, focusing on those specimens that were collected at their own institution where practical. The records can then be shared with all other herbaria containing a duplicate of that collection assuming that an intelligent network exists to allow discovery (Macklin et al., 2009), further increasing both efficiency and quality. If we can couple this data acquisition strategy with a suitable infrastructure for digital annotation and communication, then individual curators and experts can contribute their expertise efficiently as well. Intimately tied to this goal is the need for continued and expanded educational training of the next generation of taxonomists and biodiversity informaticians who will be knowledgeable in the use of modern tools for specimen data capture, analysis, and exploration.
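
As a rough illustration of the duplicate-sharing strategy described above, the sketch below groups records that appear to be duplicates of the same gathering and copies captured fields to the herbaria that have not yet transcribed them. It is only a minimal sketch under stated assumptions: the record layout, the field names, the matching key (collector, collector number, date) and the herbarium codes are hypothetical, and the "intelligent network" cited in the text would need far more robust matching than this.

```python
from collections import defaultdict


def duplicate_key(record):
    """Build a match key from collector name, collector number and date.

    Normalization is deliberately simple (assumed, not a standard):
    lowercase, periods removed, whitespace collapsed.
    """
    collector = " ".join(record["collector"].lower().replace(".", " ").split())
    number = record["collector_number"].strip().lower()
    return (collector, number, record["date"])


def group_duplicates(records):
    """Return only those groups that contain more than one record."""
    groups = defaultdict(list)
    for record in records:
        groups[duplicate_key(record)].append(record)
    return {key: group for key, group in groups.items() if len(group) > 1}


def share_captured_data(records, fields=("locality", "habitat")):
    """Copy captured fields to duplicates that lack them (edits in place).

    Each copied value is tagged with the herbarium it came from so that
    a curator can review it later. Returns (herbarium, field) pairs.
    """
    shared = []
    for group in group_duplicates(records).values():
        for field in fields:
            donors = [r for r in group if r.get(field)]
            if not donors:
                continue
            donor = donors[0]
            for record in group:
                if not record.get(field):
                    record[field] = donor[field]
                    record.setdefault("copied_from", {})[field] = donor["herbarium"]
                    shared.append((record["herbarium"], field))
    return shared


# Hypothetical records from three herbaria; HERB_A holds the first set.
records = [
    {"herbarium": "HERB_A", "collector": "J. Smith", "collector_number": "1234",
     "date": "1975-06-01", "locality": "Bog margin, 3 mi W of village"},
    {"herbarium": "HERB_B", "collector": "j smith", "collector_number": " 1234",
     "date": "1975-06-01", "locality": ""},
    {"herbarium": "HERB_C", "collector": "A. Jones", "collector_number": "7",
     "date": "1980-01-01", "locality": ""},
]
shared = share_captured_data(records)
```

Recording where each copied value came from matters here because, as several commenters note, identifications and transcriptions need to stay open to annotation and correction.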

The third goal of the USVH, which is closely associated with tier three of the draft Strategic Plan, is to present the data to the scientific and educational communities, along with the general public, in ways that promote the data’s use. This will necessitate determining what the user requirements are in order to help prioritize what we provide; seeking feedback beyond our traditional fields will be essential. Our specimen-based data are currently sought by data aggregators such as the Global Biodiversity Information Facility (GBIF), USDA Plants, the Encyclopedia of Life (EOL), and GenBank. It is also apparent that we will need to advertise and market our resources to expand our audiences through the creation of new web presentations and portals. Providing a cyberinfrastructure that makes the data more widely available should also benefit our community directly through annotation by users (Morris et al., 2009). These annotations may include new determinations, georeferences, corrections, additions, and/or citations, which iteratively enrich the quality and knowledge about herbarium specimens, and thus the data that we can provide.

The USVH workshop participants are now actively pursuing a number of tasks related to our goals. We would be happy to share our insights with the strategic planning group at any time. Further, our USVH efforts, along with a solid infrastructure already built, maintained, and being expanded among botanical collections, will enable us to respond rapidly to this initiative if it comes to fruition and to play a major role in its development.

Thank you for the opportunity to comment on the initiative.

Sincerely,
James Macklin, Ph.D.
Harvard University Herbaria, on behalf of the US Virtual Herbarium Steering Committee and the 65 workshop participants.

At Black Hills State University (BHSC) we are pleased to support the development of a digital U.S. biological collection resource. We are currently developing a database of vascular plants for the northern Great Plains thanks to support from the National Science Foundation (2006-2010 and 2009-2012) that incorporates data from 24 herbaria of various sizes. The herbaria include collections from public and private large and small universities as well as from federal facilities (e.g. USFS and NPS). Early utilization of the database includes surveys by land managers and research on the flora that would have been impossible without access to the digital collection.

Our experiences to date indicate that the following must be addressed in the proposal to develop the digital database of U.S. biological collections:

1. Demonstrated sustained institutional support of physical collections is vital. How would this project really affect the maintenance of physical collections? While project funding could be conditional upon physical support of collections by the home institution, critical data would probably be lost if it is not soon incorporated into a digital form before physical collections deteriorate in facilities with no active maintenance. Physical specimens are paramount to natural history studies. Could we encourage institutions with inactive collections to either make them active or donate them to an institution that would properly care for them?

2. The long-term maintenance of the electronic data needs to be addressed prior to the start of a digitization project. The stability of a dataset is critical and challenges include changing personnel (both botanical and in information technology), political will, and available fiscal resources. Will the results of the project still be viable in 20 years if the funding were to cease in a decade? There should be a plan for continued access to the data regardless of the types of computers/software available in the future.

3. “Microherbaria” as mentioned by Michael Balick are sometimes threatened by inattention or fiscal problems. It is important to incorporate these data into the system to allow biologists access to the specimens. Sometimes these collections contain specimens that are valuable as range extensions, state records or first or last date of collection. Even if the collections have not been updated recently, they retain enormous value. There are certainly critics of databasing small herbaria, but few suggestions provide a better or realistic course of action to mobilize these data and make them available.

4. Biodiversity Informatics 101 as described by Mary Barkworth is exceedingly important. This type of outreach will help create a pool of students and later graduates who will value natural history collections and the information that can be gleaned from them.

5. Regional collaborations are important in making efficient use of resources, providing technical support, as well as producing collaborations for investigation-driven uses of data.

6. Technical support should be readily available (as described by Richard Rabeler). The lack of support from Specify over the last several months has hindered much progress at our institution, and we suspect for other users of Specify. Will the interface be friendly to students, biologists, land managers and other users?

At my second reading of the document, and after seeing the numerous comments here, I do have some trouble with the apparent contradiction in the following two sentences: “The present focus of this strategy is on the digitization and mobilization of existing collection data. This initiative would not directly support the development of new collections or collection improvement through enhanced infrastructure, curation or management.” It is a contradiction in two respects: (1) clearly the intent is not to simply accept all specimen IDs uncritically, yet this part of the process requires the efforts of taxonomists – and since identification is part of curation, it means that curation (in at least this sense) MUST be supported; and (2) mobilization of data *about* specimens certainly does qualify as something that enhances collection management.

If this initiative does not explicitly support the involvement of taxonomists, then it cannot claim to contain any reliable form of quality control. The value in making 30 million specimen records available online is GREATLY diminished if 20 million of the identifications are outdated, incorrect or incomplete (e.g., to family-level only). This entire endeavor has the identifications of the specimens at its foundation, and unless there is to be an appropriate level of funding to ensure that the identifications are sound, I would find it very hard to support.

While the plan as currently evolving seeks to establish a digital infrastructure for systematics collections, there is a valuable ancillary consideration. More than a few ecological legacy data sets exist solely in notebook form, and often in the personalized shorthand of the originating specialist. Or, they exist digitally on media (or in formats) no longer supported by current technology. Like data archived for specimens in collections, these ecological data sets form a foundation for evaluating distributions of organisms and micro-habitat changes that bear upon those.

One such example, presently being moved to database format by the Kansas Biological Survey (KBS), is the extensive (>50,000 records) set of snake field-observation records compiled by Henry Fitch from 1948-ca 2001. Though these data form the basis of Fitch’s many publications, they also represent a major historical record of herpetofaunal succession over time within a geographically small part of northeast Kansas that is ecotonal between Tall Grass Prairie and deciduous woodland in potential natural vegetation. Future data mining of these records will allow detailed comparison as anthropogenic alteration of habitat continues.

Since this effort is largely being conducted by one specialist, working as a volunteer, progress has been slow though meticulous. About 50% of the records have been captured. Data files are maintained in database (FileMaker Pro), spreadsheet (Excel) and comma-delimited text formats. Records are keyed to a georeferenced (largely) list of over 700 local sites within a ca 2 square mile area of KBS tracts. A metadata file accompanies the records, and the eventual goal is to make these data and associated ARC-GIS shape files available to future specialists.

The Fitch data are by no means the only example of legacy data that, though unrelated to a systematics collection per se, provide detailed historical infrastructure for ecological research. To date, funding to advance capture of these legacy data is unavailable, or only minimally available. The issue is a focus of the Global Biodiversity Information Facility (http://www.gbif.org), though again, funding is at present minimal.

I feel the national plan would be strengthened by inclusion of such ecological data before more of it is irretrievably lost. Though it could be argued that since these data are represented by publications they are of lesser significance than data related to systematics collections, that argument lacks substance since detailed micro-habitat data over prolonged time forms a valuable resource for understanding plant and animal distribution and speciation.

My feedback is mainly to stress the value of georeferencing the collections correctly. I was at a seminar in the Dominican Republic for Latin American herbaria, and that was one of the things we stressed with them. It is important to provide help and allocate resources to that: not only information on what we have in terms of collections, but also how they can be used for conservation.

Databasing collections is rather easy and there are different options available. Almost all are easy to use, and data transfer is easy.

A similar project to this one was done for Central America recently so getting some feedback from them will be useful too. I did collaborate a bit with that. People to contact could be from STRI at Panama and INBIO.

Just some comments, but will be glad to collaborate and contribute more.

I have read with great interest the information sent to me by Dr. Barbara Thiers about developing digitized national biological collections resources. This approach would greatly enhance research capabilities by members of the systematics community nationally and globally. If we could make this resource a reality, it would begin to lead us to a well informed understanding of the planet’s biodiversity, assuming contributions from all levels of herbaria and collections (that is large institutions and the more modest regional based collections mostly at colleges and universities across the US). The interconnected resource base would be useful to systematists and other researchers across the globe.

Funding at some level to help with this effort of digitizing the collections and the information associated with them is essential. We have just begun digitizing and barcoding our plant collections; the fungal and other cryptogamic collections will follow in the next several years.

Dr. Timothy J. Baroni
SUNY Distinguished Professor
Department of Biological Sciences
State University of New York – College at Cortland

On behalf of the University and Jepson Herbaria, UC Berkeley, and the Consortium of California Herbaria, I would like to endorse this document and the approach it takes. Making the data in natural history collections available broadly in electronic form will serve science and society in many ways, and result in an enhanced profile for collections on the national scene.

One caveat: let’s be sure that federal support for the curation and housing of the physical specimens themselves is increased. State support for collections has waned due to a series of budget crises. We need virtual collections, but we also need the real things! I would like to see a sentence added to the document affirming this.

We started a database project here at the High Plains Herbarium several years ago, but ran out of money at the 17% completion level. We are also working with the Black Hills State University Herbarium to add our data to their project on the Missouri Plateau, an NSF-funded project.

We want to continue our work here as funding permits because we deem it to be our way to contribute to the overall effort, which we highly applaud. To put it another way, ‘Go Colleagues Go, and other pleasantries’.

I strongly support the development of a national strategic plan for the digitization of natural history collections. Digitization has been a strong focus of work at my museum for many years because of its clear and significant benefits to the ability of researchers to use our collections and for the efficient management of our collections. Our participation in ORNIS has been of enormous benefit because it has made our collection easier to find and because it has made our database available online. I am pleased to see that the national plan will apparently build upon the work of the VertNet group of projects. I particularly like the aspect of these projects that allows each institution to use the database software that they prefer (or which they can afford or which their institution supports or which they have invested large amounts of time in learning and implementing). I absolutely think that trying to force collections to use a particular software package for their databases will not work.

In the paragraph on “Significance of Collections Digitization” I was disappointed to see so much emphasis on conservation-related uses of collections and very little emphasis on more traditional uses of collections that are still of great importance. For example, the role of collections digitization in: taxonomy (without named species it is hard to study biodiversity), ecology, and evolution (only by understanding the present and the past can we hope to predict what will happen in the future), and identification of gaps in our knowledge about biodiversity. While conservation-related uses of collections may be a current “hot” topic that may sell this initiative to funders, I think failing to recognize the vital importance of more traditional collections uses sends the wrong message. Instead this is an opportunity to educate funders about why these traditional uses are still of vital relevance today.

Overall in the proposed plan I find the emphasis on images puzzling. It is unclear what will be imaged and why (realizing, of course, that this is just a preliminary document). I (and colleagues) are having difficulty in imagining circumstances under which images of specimens would be helpful for research or management purposes. For example, standards of research for many (most?) taxa require that researchers examine specimens directly, not just view images. Many aspects of research can only be accomplished by direct inspection and measurement of specimens. Images of type specimens might be the one exception, since many museums don’t lend these specimens. The ability to view images of types might assist a researcher in determining which types they need to examine for their research. Identification of specimens from photos might be another possible use of images, but in many cases direct inspection will still be needed. For example, color is notoriously difficult to capture and reproduce but of vital importance (at least in birds). Perhaps this issue of images is taxon-related. If generation of identification tools is the goal of images, I’d like to see additional taxon based input on the value of such an effort. The skills and equipment for such an effort are pretty specialized (if it is to be done well). This would be a good example of where establishing a taxon-based collaboration (tier 3) would be relevant. I’d like a clearer explanation of what is intended with images and how it will benefit users of collections data.

On the other hand, I see great value in getting items like field notebooks and data cards digitized and online and centrally available. I don’t know how many times I have been struggling to read a specimen label and wished that the field notebook for that collector were available. I also think that these paper objects will be fairly easy to get scanned (for example, many newer copy machines now also have the ability to scan into pdf files at the rate of about 2 seconds per page). Field notebooks also may contain additional data of interest in the form of observations.

One area that needs attention is the challenges institutions face in getting their collections online. They may have databases but may not have the resources or technical expertise to get them online. This was one of the best parts of Ornis for my institution- an expert elsewhere figured out how to make this happen, funded the server we needed, and then worked with our IT people to make it happen. We were not in a position to do this ourselves. I’m sure there are other collections in this same position (with no Ornis-like effort to help them). The maintenance and routine replacement of this equipment and technology is also a key concern- there are quite a number of institutions (like mine) that have come online through the VertNet projects. There need to be ways for them to continue to receive technical assistance to deal with problems as they arise and access to resources to pay to replace servers as needed. As more institutions come online this issue will only get larger.

I would like to see more emphasis in the plan on finding ways to accomplish basic databasing, by which I mean entry of basic information for each specimen into a database. I estimate that it will take at least 10 years before the collections at my institution are fully databased. My institution is struggling to maintain a basic level of collections staff and we have always relied quite heavily on volunteers to do our databasing. The fewer staff we have the less time we have to train and supervise volunteers to do databasing and to deal with the inevitable issues that arise during the process. Until all collections are databased and online we won’t have a complete picture of world biodiversity. The current situation seems to be quite taxonomically biased, with vertebrate collections generally better databased than invertebrate collections (at least in my limited experience- survey data should confirm this). Given that the vast majority of the world’s biodiversity is invertebrate, this is a serious problem.

Related to basic databasing is the issue of detecting and fixing errors in databases. Serving up error-filled collections data to users who are not aware of this issue will not serve anyone’s purposes. Every database has errors and I would guess most collections have not been able to invest much effort in finding and correcting these errors. Also, most database systems aren’t really set up to find errors. I’m sure, however, that smart computer people out there could easily devise ways to check for errors in key database fields (identification, location, date). Again, people in the collections field understand the problem of errors in databases and will know to check questionable data against the original specimen label, but I am doubtful that non-traditional users understand this. As more collections data becomes available online this will become a bigger problem.
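As a rough illustration of the kind of automated check on key fields suggested above, here is a minimal Python sketch. The record layout and field names (`scientific_name`, `latitude`, `longitude`, `collection_date`) are assumptions made for the example and do not come from any particular collections database; a real system would add checks against taxonomic authority files and gazetteers.

```python
from datetime import date

def find_suspect_fields(record, today=None):
    """Return a list of (field, reason) pairs for values that look wrong.

    `record` is a plain dict using hypothetical field names.
    """
    today = today or date.today()
    problems = []

    # Identification: an empty name cannot be checked against anything.
    name = (record.get("scientific_name") or "").strip()
    if not name:
        problems.append(("scientific_name", "missing"))

    # Location: latitude and longitude must fall inside their valid ranges.
    lat, lon = record.get("latitude"), record.get("longitude")
    if lat is not None and not -90 <= lat <= 90:
        problems.append(("latitude", "out of range"))
    if lon is not None and not -180 <= lon <= 180:
        problems.append(("longitude", "out of range"))

    # Date: a collection date in the future points to a data-entry error.
    collected = record.get("collection_date")
    if collected is not None and collected > today:
        problems.append(("collection_date", "in the future"))

    return problems
```

Checks like these only flag records for a person to compare against the original specimen label; they do not decide what the correct value is.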

I’m struggling to reconcile the proposed objectives with the 3 tier approach and understand what the priorities of the plan will be. I’d like to suggest a set of priorities, as follows, which perhaps starts at a more fundamental level and places more emphasis on the actual data and less emphasis on the end user:

Proposed Priorities

– Get all existing databases online and integrated (funding for equipment, technical expertise) – I think making this jump is a big problem for some collections, and if solved should produce a big jump in the amount of data available online. Pulling them together (as the VertNet projects have) is also fairly simple but highly useful

– Get collections georeferenced (with associated tool development) – this might run in parallel to #2; some collections are ready for this and some collections databasing for the first time might prefer to georeference at the same time.

– Work on adding value to databases (additional data, images, etc., with input from the collections community and other users)

– Work on tools for end-users (analysis, data mining) – to a certain extent I think that if the data is available end-users will develop these tools themselves. Perhaps the way to start here is to set up a way for tools to be shared.

The proposed three tier system is interesting, although there is not enough information for me to really understand what is proposed. Tier 1 seems pretty clear and straight-forward (get everyone online and get us all to follow the same standards). Tiers 2 & 3, although they sound good, aren’t so clear in how they would work and to some extent seem somewhat opposite in intent. It isn’t clear how regional collaborations are necessarily the best way to help individual collections get digitized. Most larger collections that I am familiar with aren’t composed of locally acquired specimens and nearby collections don’t necessarily have similar geographic or taxonomic strengths, making it unclear how they would be able to help each other. This is all about the digital world, where it is just as easy for a museum to work together with another institution across the country as in the same town. It needs to be clearer what these collaborative groups are supposed to add to the process to help accomplish the broader goals of the plan. I think tier 3 might work better- unite (virtually) collections with similar needs so that they can share effort (e.g. georeferencing, taxonomic issues, stage of database development, training needs).

One area where I think regional efforts might be useful is related to georeferencing. We (and I assume every other collection) constantly have trouble figuring out the localities of our specimens. This stems from bad handwriting, bad spelling, names that are no longer used, variable translation of foreign names, labels written in languages other than English, political entities that no longer exist, etc. What about developing a set of tools and resources to help with this problem? It might include tools to search the national “pool” for similar localities, online resources that are helpful (rather than each collection spending the time to find these by themselves), and maybe a place to post problem localities where regional experts can look at them and possibly resolve them. Regional experts could be notified when a problem has been posted for their area (and I’ll even volunteer to be the Delaware expert). Given the large number of non-U.S. specimens in U.S. museums this would ideally be an international effort. This could also be tied to a tool to “harvest” georeferencing data from localities in the national “pool”. (Question- when entering a new locality into a database is it better to use a tool like BioGeomancer to georeference it or is it better to access the national locality pool and copy the georeference found there?) What about reconstructing the itineraries of major collectors and expeditions from their specimens and field notebooks? Due to sales and exchanges specimens from a single expedition or collector may now be in multiple collections- pulling these all back together virtually might be quite helpful in improving data capture efficiency and data accuracy. The more tools there are to help with this the faster databasing will go and the more accurate the databases will be.
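To make the idea of searching a shared locality “pool” concrete, here is a minimal Python sketch using the standard library's `difflib` for approximate text matching. The function name `suggest_georeferences`, the shape of the pool (a dict from locality text to a latitude/longitude pair), and the similarity cutoff are all assumptions for illustration; this is not how BioGeomancer or any existing portal works.

```python
import difflib

def suggest_georeferences(locality, pool, cutoff=0.8):
    """Return (pooled locality, (lat, lon)) pairs whose text resembles `locality`.

    `pool` is a hypothetical dict mapping locality strings that have already
    been georeferenced to their coordinates.
    """
    # Compare case-insensitively, keeping a map back to the original spelling.
    by_lower = {name.lower(): name for name in pool}
    close = difflib.get_close_matches(
        locality.lower(), list(by_lower), n=3, cutoff=cutoff
    )
    return [(by_lower[match], pool[by_lower[match]]) for match in close]
```

A lookup of this kind can only propose candidates for a label with abbreviations or spelling variants; whether a pooled georeference really applies to the specimen in hand still needs a person's judgment, which is where regional experts would come in.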

Funding is obviously a huge issue. I think the first principle needs to be that any funding going towards digitization efforts won’t reduce funding already available for collections support (mostly IMLS and NSF-BRC). There is no point in building a national database if the specimens themselves are not properly housed and cared for. Data can’t be more important than specimen care. In this tough economic climate it needs to be understood that state educational institutions or non-profits, both sectors that have seen enormous funding cuts, fund most collections operations. There needs to be a realistic understanding of the ability of collections to contribute to digitization efforts with their current resources and focus on ways to improve resources. This is presumably the most important factor limiting current digitization efforts.

I think there are a number of opportunities that might be explored to draw on resources outside the collections community to assist with some of this work. I think it might be useful to have a way to collect ideas that individual collections or institutions have worked out successfully so that others might try them. For example, my museum has been successful at developing a summer internship program and we are also working with an academic department at a nearby university to get assistance with technical projects. Explore links with academic departments in the GIS, data management, geography, etc. fields and try to get professors or graduate students interested in developing useful tools.

Another area where I think a national group could make a difference is in developing online training tools or other tools that would improve the speed, efficiency and accuracy of digitization efforts. A big one for us would be training in how to use automated georeferencing tools and how to apply georeferencing best practices. Development and training on error checking tools would also be helpful.

Another initiative that I would like to see is an effort to bring very small collections into the online world. For example, I know of three bird collections, each <1000 specimens, in our local area. Two of the three have databases of their specimens. Not major collections by any means, but still relevant to our understanding of biodiversity. Across the country there may be hundreds of these very small collections. These databases (and collections) are pretty static, so the same solution for getting major collections online might not work as well for these small collections, but this does not seem to be a technically difficult issue. I think the VertNet group may have given this some thought. This might be an area where regional efforts to find and recruit these collections would be the best way to work. These also tend to be collections that are generally not under the care of anyone with much collections experience and thus might appreciate access to best practices for databasing.

In summary, I think the document “Developing a digital U.S. biological collections national resource: First steps towards a strategic plan” is an excellent first step in developing this long-overdue plan. As this process moves forward I would urge that the following questions be kept in mind as various parts of the plan are developed:
– How does this idea improve access to biological collections data?
– How does this idea contribute to growth of collections databases?
– How does this idea improve the quality & accuracy of collections databases?
– How does this idea help meet the data needs of end users?
– What resources do collections need in order to implement this idea?

I look forward to updates as this plan is developed and opportunities for further comment.

Submitted by:
Jean Woods, Ph.D.
Director of Collections
Curator of Birds (and all other vertebrates)
Delaware Museum of Natural History

On behalf of The Executive Committee of the Phycological Society of America, I am writing to express support for the Digitization of Biological Collections effort.

The Phycological Society of America (www.psaalgae.org) is a scientific society founded in 1946 with the mission of promoting research and teaching in all fields of Phycology. Members number approximately 900 worldwide, and the society publishes *The Journal of Phycology* through Wiley-Blackwell (http://www.jphycol.org/). The Executive Committee of the PSA endorses the goals of the plan to digitize biological collections. Collections of algae are diverse and include conventional herbarium sheets, permanent slides of microscopic taxa, and living cultures. Although living material is not strictly within the purview of the proposed digitization effort, these collections are a vital resource for phycological researchers, in particular those who focus on microscopic taxa, and many are vouchered in natural history collections.

The three tiered approach outlined in the February 7, 2010 web document (https://digbiocol.wordpress.com/) is a logical and effective beginning to the development of a strategic plan. We anticipate that PSA members will want to participate in digitization efforts so that the wide variety of algae represented in collections are adequately represented in the eventual product of this effort. Participation may include taxonomic expertise, collaborations including staff of collections holding algal specimens, and input into database construction and other technological aspects of the work.

PSA will provide information on the plan to its members through its Newsletter and its Annual Scientific Meeting. The contact person for this effort is Dr Richard McCourt (Academy of Natural Sciences of Philadelphia).

We support the concept of a national strategic plan for U.S. biological collections provided it is done in a collaborative manner, and provided that it is affordable in time and money for small collections like our herbarium to be integrated.

Access to a centralized source for images of all the specimens in the US is something that will be very valuable to various kinds of research, such as research on changing plant communities. There is a need to ensure that the focus on preservation of the actual collections is not lost, as enthusiasm about digital collections increases. Our concern is that funding for herbarium collection maintenance will continue to have low priority. Digital images are very useful for many things, but collections-based research often requires examination of the original specimens.

Part of the strategic plan should include standardized practices for ensuring quality control of the digitization so that the underlying specimens are correctly represented.

We should build on progress already made with data standards such as Darwin Core, single database platforms such as Specify, and established unified web portals such as GBIF. Small herbaria like ours will need technical support and step-by-step instructions to participate, but we welcome the prospect of our historic specialized collection data being made more accessible to a wider community.

I am the curator of non-vascular plants at the UWSP herbarium. Although I understand and agree with the importance of this effort, digitizing images of minute bryophytes (liverworts) would be a huge task. We are extremely busy at UWSP; the curatorship is just a side duty, and the teaching load is very heavy here. I am not sure how to approach this effort. I could not ask students to handle our tiny specimens without supervision (we do not have graduate students here). I suppose we would have to hire a qualified person, full time, for this purpose (this would need to be added to your proposal budget). Also, many of the specimens may be misidentified or in need of updating (a slow process). What do you suggest?

Virginia Freire
Associate Professor
Department of Biology
University of Wisconsin – Stevens Point

Having responded to this proposed initiative personally, I’m now writing this response on behalf of the Society for the Preservation of Natural History Collections (SPNHC). SPNHC has an international membership of over 600 professionals working in the field of collections care, including curators, collection managers, conservators, and registrars.

Our Society enthusiastically supports any initiative that increases access to collections and that promotes novel uses of specimens and specimen data. Capturing the mass of currently inaccessible data in natural history collections will present massive challenges. It is essential that the process be set up and managed efficiently. Technology must be applied effectively where it can be used, in order to free up curatorial staff time to tackle areas where technology is ineffective.

Many of the responses to date have focused on user demand. Stakeholder perspectives are essential for the success of this initiative. However, it is equally important to consider the potential challenges from the perspective of the data providers. An important first step is to understand what information is available in collections – not just the core metadata that form the basis of current cataloging efforts, but also the mass of associated data, such as field notes, maps, photographs, specimen annotations and other resources that are generally excluded from on-line catalogs. Many potential users may be unaware of the existence of these resources.

Equally critical are the documentation of collections workflow and the establishment of best practices. Workflow differs radically between different types of collections; a strategy for automating data capture based on specimen imaging, for example, may work well for herbarium specimens but will be ineffective for fluid-preserved materials or fossils. Before effective tools can be developed to speed the process of capturing data, it is necessary to understand what processes are currently in use. While this may seem obvious, published synopses of this information do not yet exist.

Fortunately, our membership is well-placed to begin this effort. Through our existing activities in the area of best practices, we are already engaged in documenting collections operations. We would be happy to take on this role under the umbrella of the proposed digitization initiative.

There is no pot of money, but a lot of knowledge and experience in digitizing herbaria (particularly the vascular plants). Contact Patrick Sweeney (his email is at http://sweetgum.nybg.org/ih/person.php?irn=128959) or James Macklin (his email is at http://sweetgum.nybg.org/ih/person.php?irn=133144) to connect with the Consortium of Northeastern Herbaria and find out what is going on in your region. Also, if you will be attending Botany 2010 in Rhode Island, plan to spend Sunday afternoon finding out about the US Virtual Herbarium project where you will be able to talk with people interested in digitizing herbaria.
Folks. I would reply directly to James if I could, but I do not know his email address.