Sharing paleodata (Part 1): What databases are out there?

Science depends on the ability to make observations, repeat experiments, test hypotheses, and share knowledge. When a new study comes out, other researchers evaluate an author’s arguments based on the data they present and the analyses they perform. This is why most journals require that at least some of the original data be included as part of the manuscript. Without access to the original data, science loses that critical requirement of repeatability.

Original data can be a description: “This is what the anatomy looks like, and here is a picture showing that feature”. Other times, it’s a table of measurements or a map of locations. Original data can also take the form of a chart or graph that summarizes the individual data points and reports trends for the group. Depending on the study, the number of data points you used might be 20, or 20,000. You might have to write new software to run a specific analysis. Printed in a journal, the spreadsheet or code involved could take up hundreds of pages. In these cases it’s impractical to publish all of the original data or code in printed format. But you still need that data and code for the work to be repeatable.

My particular line of research involves the examination of bone tissues in order to study the growth patterns of living and fossil animals. The primary data I use are microscopic bone tissue characters (the number or shape of bone cells, patterns or orientation of blood vessels) that vary in appearance around the slide. It’s hard to reproduce images in a print journal at the size and resolution others need to verify my observations, much less enough to capture variation around the slide. To make my observations verifiable or repeatable, you need an enormous image – ideally, the whole slide, at high resolution and decent magnification (say, 4x or 5x). Even if the file sizes were reasonable (they’re not – I create a lot of gigapixel images), a printed version that allowed you to see all the bone cells and blood vessels would be more like a map in size than a page of a journal.

It is hard to see the histological details of an image this size. Go to MorphoBank to see it large enough to examine histological details. Image (c) 2012 Sarah Werning / The Sam Noble Oklahoma Museum of Natural History.

How to solve this? Some researchers publish data to their own websites. For example, Drew Lee (Midwestern University) posts high-resolution histology images on his Paleohistology Repository. Larry Witmer’s lab (Ohio University) posts 3D visualizations of their CT data on their website, along with a bunch of downloadable interactive projects on animal anatomy. Emma Schachner (University of Utah) posts supplemental images, posters, and other ancillary images to her website, www.theropoda.com. These are pretty spectacular, but not every researcher has the time to maintain a data site, and eventually the researcher who owns the site won’t be around to maintain it. So while they fill the gap now, this is not an ideal long-term solution.

Most academic journals allow you to publish online supplementary materials, including ginormous spreadsheets or software code (ginormous images are sometimes problematic – some journals limit uploaded files to 10MB). So while the data is usually out there, even in this form it’s not always accessible. You might need to subscribe to the journal to access it, or the file could get lost over time (I know the latter seems ridiculous, but unfortunately, it’s not uncommon for older papers, or when journals change publishers).

Even if the data is easy to access, comparing it in a meaningful way could require pulling together hundreds or thousands of sources. In these cases, a sort of universal clearinghouse for data would be ideal. These exist, for certain types of data. For example, almost all journals published in the US (and elsewhere) require anyone whose data includes a new genetic sequence to upload it to GenBank before publication. GenBank is a database maintained by the National Center for Biotechnology Information (NCBI), a subdivision of the National Institutes of Health (NIH). GenBank shares and synchronizes data with two similar databases (one in Europe and one in Japan) as part of the International Nucleotide Sequence Database Collaboration (INSDC). GenBank assigns a unique accession number to each sequence, and these are reported in the paper when it is published.
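This accession-number system is what makes GenBank data so easy to reuse: any sequence in a paper can be pulled down programmatically through NCBI’s public E-utilities service. Here’s a minimal sketch in Python; the accession number used (U49845, a published yeast sequence NCBI itself uses in examples) is purely illustrative.

```python
# Sketch: retrieving a published GenBank sequence by accession number
# via NCBI's E-utilities "efetch" endpoint. The accession U49845 is an
# illustrative example, not one of the datasets discussed in this post.
from urllib.request import urlopen
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def genbank_fasta_url(accession: str) -> str:
    """Build an efetch URL that returns the record in FASTA format."""
    params = urlencode({
        "db": "nucleotide",   # GenBank nucleotide database
        "id": accession,      # the accession number reported in the paper
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"{EFETCH}?{params}"

def fetch_fasta(accession: str) -> str:
    """Download the FASTA record for one accession (network required)."""
    with urlopen(genbank_fasta_url(accession)) as resp:
        return resp.read().decode("utf-8")

# Example (requires internet access):
# print(fetch_fasta("U49845").splitlines()[0])  # the FASTA header line
```

Anyone reading the paper can go from the printed accession number straight to the raw data, which is exactly the kind of repeatability the clearinghouse model buys you.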

There is no standard repository for measurements or morphological data, but I outline some options below. UCMP 77305, Isoodon femur. (c) 2013 Sarah Werning

The advantage of hosting your raw data on publicly accessible databases or data clearinghouses is clear: researchers around the globe have access to your sequences and can discover them easily, even if they weren’t familiar with your original paper before running a search. If you receive federal funding (say, from the NSF), it’s now also required. Every grant proposal submitted to NSF requires a Data Management Plan, which explains how the grantee will share “the primary data, samples, physical collections and other supporting materials created or gathered in the course of work” at “no more than incremental cost and within a reasonable time”.

There’s no universal designated clearinghouse for bone measurements or pictures of fossil histology (yet), but a number of databases and repositories have sprung up in the last ten years that host different types of data. A few of these (for example, Dryad) have become the repository of choice for multiple journals, the first steps toward the type of data clearinghouses I described above.

Because paleontology is so interdisciplinary and draws on so many types of information, our data can be found in a number of repositories (in addition to supplemental information hosted on journals’ sites). There are so many options for a paleontologist who wants to share data that it might be hard to keep them straight. Over my next few blog posts, I’ll be discussing some of the common data repositories used by paleontologists. For each of these, I’ll give pertinent information, such as funding sources/costs and site features/limitations, discuss my thoughts on usability, and highlight some recent paleo papers that take advantage of these databases. My hope is to assemble a comparative guide for paleodata repositories over the next month or so. The first of these will go up on Wednesday and discuss Dryad (mentioned above). Other sites I will discuss in the near future include Figshare, MorphoBank, and Morphbank: Biological Imaging.

Today, I’ll leave you with a list of some of the data repositories I’ve used and encountered in my paleontological research, along with a brief description of the type of data they host. Please post good suggestions for additions to this list in the comments.

SARAH’s REPOSITORY REPOSITORY

A comprehensive list of data repositories can be found at Databib. However, not all of these are data repositories in the sense that you can use them to host the data for a given manuscript (some of them just provide data summaries). Particularly useful to me is this list of Biological Science Databases: http://databib.org/index_subjects.php#Bio

The remainder of these sites are ones you can use to host your own data (some restrictions may apply).

About Sarah Werning

NSF Postdoctoral Research Fellow at Stony Brook University. I study how bone tissue, growth, and metabolism evolve at macroevolutionary time scales. I have an inordinate fondness for reptiles.
(Any opinions, findings, and conclusions or recommendations expressed in my posts are mine and do not necessarily reflect the views of the National Science Foundation.)

Looking at Uhen et al. (2013), the paper seems to report a mix of research database types. Some of them (MIOMAP, FAUNMAP, Neotoma, and the Paleobiology Database) are primarily data portals to which users can upload data, mainly published faunal lists and occurrences (unpublished lists/occurrences can be uploaded as well). Users can also download data for meta-analyses. These are all important databases, and ones that enable new and interesting types of research, but they are not the type of data repository I am describing in my post above. The ones I describe in today’s post are all repositories mainly for new primary data.

I really like the layout of the Uhen et al. (2013) paper; it’s similar to the layout I’ll be using for this series.

I didn’t know until today that the Paleobiology Database (PBDB) could be used to enter new data, so I’ll amend the post in a few minutes to include it in the repository repository.

I think I’m confused by what your criteria for inclusion are here (and maybe a bit peeved that you didn’t include Neotoma!). When I look at something like biomesh or AMCED, I don’t see a real difference between them and a database like Neotoma.

How are pollen counts from a core any different from raw data? For example, AMCED seems to have a standardized crystal reporting format, so someone must be standardizing the data following submission. The same goes for Neotoma. I’ve scanned hundreds of pollen count sheets and submitted them; Neotoma just puts that data into a new file format that allows it to be used. I’ve also submitted raw count data as a spreadsheet file, and it goes up too. Presumably many of these databases you list do the same thing (maybe not Figshare, which really is a raw data repository).

Anyway, regardless of its inclusion on this list (and I snuck in a link to it anyway) Neotoma has been a fantastic resource for paleoecologists, both as a repository (we’ve used it as such in a recent NSF Data Management Plan) and as a source for data (in my own papers and in others’). The standardization of the data format is one of the key features that has made it so useful, and now with an API it will hopefully be of more use to the community.
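To give a sense of what that API makes possible, here’s a rough sketch of querying Neotoma’s sites endpoint from Python. This assumes Neotoma’s v2.0 REST interface and its JSON response envelope (a top-level "data" list); the endpoint path and field names are my reading of their documentation, so treat this as illustrative rather than authoritative.

```python
# Hedged sketch: querying the Neotoma Paleoecology Database API for
# sites by name. Endpoint path and response structure are assumptions
# based on Neotoma's published v2.0 API, not guaranteed by this post.
import json
from urllib.request import urlopen
from urllib.parse import urlencode

API = "https://api.neotomadb.org/v2.0/data/sites"

def sites_url(sitename: str, limit: int = 5) -> str:
    """Build a query URL for sites whose name matches `sitename`."""
    return f"{API}?{urlencode({'sitename': sitename, 'limit': limit})}"

def fetch_sites(sitename: str) -> list:
    """Fetch matching site records (network required)."""
    with urlopen(sites_url(sitename)) as resp:
        payload = json.load(resp)
    return payload.get("data", [])

# Example (requires internet access):
# for site in fetch_sites("Marion Lake"):
#     print(site)
```

The same standardized format the counts go into on the way in is what makes programmatic access like this useful on the way out.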

My criteria were threefold: (1) that the repositories hosted primary data (not merely portals to other sites); (2) that they were a place the authors of a manuscript could post data before publication so that (3) they could be referred to in the resulting manuscript with an accession number or unique DOI.

I admit I don’t use Neotoma, and hadn’t realized from viewing the site or reading the Uhen et al. (2013) paper that you could do these things (they refer to it as “a relational database comprising a number of constituent ‘databases’”, and I should have read beyond that). I am wrong about this, and am happy to add Neotoma to the list above.

When I cover Neotoma, I’d much rather have a frequent user of Neotoma blog about it than me. Would you be interested in doing that? The format will be easy to replicate.