05/14/2018

Using Big Data to Ask Big Questions: Why Are There So Many (Damn) Daisies?

For those who have had Saturday afternoons in the field, happily keying out wildflowers, rudely stymied by yet another Damn Yellow Comp, the question “why are there so many daisies?!” may have sprung to mind (possibly not quite so politely). One answer, that the environment plays a primary role in driving the generation and distribution of plant diversity and community composition, will have occurred to many and is hardly a novel concept. Yet, owing to a lack of data at sufficiently broad scales, our ability to identify continent-wide patterns and fundamental relationships between ecology and diversity has until recently been limited to mostly descriptive work or studies of small numbers of taxa. Within the last decade, however, advances in computing power and storage, coupled with museum digitization initiatives, have begun to open up huge repositories of data that researchers can use to start tackling questions on a grander scale.

Almost one in ten flowering plants on the North American continent is from the daisy family (Compositae), with unusually high diversity in the southwestern United States and northern Mexico; however, the exact origins of this diversity remain unclear. It has been proposed that a cooling and drying trend since the mid-Eocene has allowed smaller herbaceous plants to thrive at the expense of woodier species, with more recent shuffling and mosaicism of communities during glacial ups-and-downs over the last 25,000 years exposing a highly heterogeneous landscape with plenty of previously unoccupied niches. The Compositae are particularly adept at colonizing a vast array of niches, including those generally considered environmentally challenging and apparently inhospitable to many other plant groups, and during this time underwent several large and relatively rapid radiations. As such, they make a perfect group for using Big Data to test whether extremes in particular environmental factors may be responsible for driving species diversity at large scales. My colleagues and I have chosen 14 tribes within the Compositae with predominantly North American distributions to study, allowing us to compare and contrast patterns across lineages.

The greatest challenge in harnessing large, agglomerated datasets is assessing and dealing with (often poor) data quality. The time consumed cleaning data is almost always underestimated, and while there are off-the-shelf statistical analysis packages that aim to tie many common tasks together in a relatively accessible way (see the various packages for R, a programming language for statistical computing and graphics), these never cover all contingencies, because every dataset and question brings its own unique issues.

For our work we engineered a data acquisition, cleaning, and analysis pipeline using a variety of computer programs, including R, Excel (still one of the most useful tools for screening, merging, and wrangling data), OpenRefine, ArcMap, and Biodiverse. Collection records were collated from the three largest publicly available databases for North American biological specimen data: GBIF, iDigBio, and BISON. While much is shared between these repositories, they are not entirely overlapping, and the quality of data can vary markedly between them. In fact, the decision to use all three despite high redundancy allowed us to: a) scrape all possible data available at the time; and b) cross-reference between supposedly duplicate data from all three servers to identify issues or errors. Curation of the data behind the scenes can be rather opaque, and this approach allowed us to identify and compare different taxonomies between databases (or even within a database), including outing an over-zealous GBIF synonymization algorithm that was merging unrelated taxa sharing the same generic first letter, species name, and author initial (e.g. Helianthus atrorubens L. with Hebeclinium atrorubens Lemaire).
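The kind of collision described above is easy to reproduce. The sketch below is not GBIF's actual algorithm, which is only outlined here; it is a hypothetical illustration (the function names `abbreviated_key` and `find_collisions` are invented) of how a matching key built from the first letter of the genus, the species epithet, and the author's initial maps two different genera onto each other. It assumes names arrive as simple "Genus epithet Author" strings.

```python
from collections import defaultdict


def abbreviated_key(name):
    """Build a lossy key: genus initial, species epithet, author initial.

    Assumes a plain 'Genus epithet Author' string; real names are messier.
    """
    genus, epithet, author = name.split(maxsplit=2)
    return (genus[0].upper(), epithet.lower(), author[0].upper())


def find_collisions(names):
    """Return groups of names that share a key but belong to different genera."""
    groups = defaultdict(list)
    for name in names:
        groups[abbreviated_key(name)].append(name)
    return [
        sorted(group)
        for group in groups.values()
        if len({n.split()[0] for n in group}) > 1
    ]


names = [
    "Helianthus atrorubens L.",
    "Hebeclinium atrorubens Lemaire",
    "Helianthus annuus L.",
]
collisions = find_collisions(names)
print(collisions)
```

Running a check like this against each database's accepted names is one way to shortlist suspicious merges for manual inspection.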

An initial data-dump of close to 2 million records was reduced by three quarters before analysis. Many removed records were straight duplicates; however, a surprising number were para-duplicates – records with small inconsequential differences such as differently rounded geo-coordinates, present or absent collector initials, etc. The remaining records were scrubbed for weeds, garden-grown collections, gross georeferencing errors, and guesstimated georeferences (if raw GBIF data is to be believed, the White House is home to upwards of 200 species of Compositae alone, because it is a commonly used centroid for records with no better locality information than “the district”). The removal of synonyms both between and within databases also took a lot of manual inspection and reference-hunting.
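A minimal sketch of how para-duplicates and centroid records of this kind might be caught is given below. It is not the pipeline we used (that ran across R, Excel, and OpenRefine); the field names, the two-decimal rounding, the tolerance, and the sample records are all assumptions made up for illustration, and the White House coordinates are approximate.

```python
import re

# Approximate coordinates of the White House, used here as a suspect centroid.
WHITE_HOUSE = (38.8977, -77.0365)


def normalize_collector(collector):
    """Lowercase the collector string and drop single-letter initials."""
    tokens = re.findall(r"[A-Za-z]+", collector)
    return " ".join(t.lower() for t in tokens if len(t) > 1)


def record_key(record, decimals=2):
    """Key under which trivially different records count as the same record."""
    return (
        record["species"].strip().lower(),
        round(record["lat"], decimals),
        round(record["lon"], decimals),
        normalize_collector(record["collector"]),
        record["date"],
    )


def collapse_para_duplicates(records, decimals=2):
    """Keep the first record seen for each key, preserving input order."""
    seen = {}
    for record in records:
        seen.setdefault(record_key(record, decimals), record)
    return list(seen.values())


def near_centroid(record, centroid, tolerance=0.001):
    """True if the record sits within `tolerance` degrees of a known centroid."""
    return (
        abs(record["lat"] - centroid[0]) <= tolerance
        and abs(record["lon"] - centroid[1]) <= tolerance
    )


# Invented example records, not real specimen data.
records = [
    {"species": "Gerea canescens", "lat": 36.4621, "lon": -116.8668,
     "collector": "R. Edwards", "date": "2016-03-12"},
    {"species": "Gerea canescens", "lat": 36.46, "lon": -116.87,
     "collector": "Edwards", "date": "2016-03-12"},
    {"species": "Helianthus annuus", "lat": 38.8977, "lon": -77.0365,
     "collector": "Unknown", "date": "1901-07-01"},
]

unique = collapse_para_duplicates(records)
flagged = [r for r in unique if near_centroid(r, WHITE_HOUSE)]
print(len(unique), [r["species"] for r in flagged])
```

Rounding is a blunt instrument: two coordinates that straddle a rounding boundary will still be treated as different, so a key like this narrows the search for para-duplicates but does not replace manual checks.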

A final list of close to 500,000 records for over 3,000 species was curated. Values for 187 soil, geochemistry, topography, and climate variables were extracted for each point, with correlation analyses reducing the set under consideration to 50. A metaphylogeny was constructed using a GenBank and an Open Tree of Life backbone, with unplaced taxa grafted on according to expert opinion.
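One common way to thin a set of environmental variables by correlation is a greedy filter: keep a variable only if it is not strongly correlated with any variable already kept. The sketch below illustrates that idea under stated assumptions; the 0.8 cutoff, the variable names, and the values are invented for the example and are not the criteria or data from our analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def prune_correlated(variables, threshold=0.8):
    """Greedily keep variables whose |r| with every kept variable is below threshold.

    `variables` maps a name to its values at the same set of points.
    Earlier entries win ties, so the order of the dict encodes priority.
    """
    kept = []
    for name, values in variables.items():
        if all(abs(pearson(values, variables[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Invented values at five hypothetical collection points.
variables = {
    "mean_annual_temp": [10.0, 12.0, 15.0, 18.0, 21.0],
    "max_summer_temp": [22.0, 24.5, 27.0, 30.5, 33.0],
    "soil_ph": [7.1, 5.9, 6.8, 6.2, 7.4],
}
kept = prune_correlated(variables)
print(kept)
```

Because the filter is order-dependent, listing the variables in order of biological interest decides which member of a correlated pair survives.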

Armed with this data we can finally address the questions: Where are centers of diversity for North American Compositae? Are these similar across lineages? What environmental variables are correlated with increased or decreased diversity, and how do these differ across lineages? Does diversity appear to have a predictable response to certain variables through space? Are particular variables associated with more diverse clades?

Integrating multiple lines of Big Data to answer large-scale questions about the role of the environment in species diversification: Locality records for North American radiations of Compositae (Asteraceae) (a) are used to calculate diversity metrics (b) and are combined with soil, geochemistry, climate, and topographical data (c) to determine regionally significant environmental variables that correlate with this diversity. How diversity changes across gradients in variable strength can also be modeled and placed in a phylogenetic context (d). Figure by R. Edwards from Funk (in press) and used with permission of the author.

As expected, we find high diversity and endemism across tribes in the California Floristic Province and extending down through northwestern and central Mexico. Interestingly (but probably not surprisingly), rarefaction curves (a technique used to assess species richness from the results of sampling) show that tribes tend to be under-sampled towards the margins of their distributions, suggesting that collectors head for areas with the greatest species diversity and neglect to sample scattered fringe species as exhaustively. Also, intuitively, we find that areas of high diversity across all tribes are associated with increased soil quartz content, as well as temperature and rainfall variables typical of the hot/dry seasonal habitats they favor. More intriguingly, each tribe also presents environmental variables that are uniquely correlated with diversity, such as soil pH in the tribe Heliantheae, slope of the terrain in the tribe Bahieae, and soil water content in the tribe Chaenactideae.
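For readers unfamiliar with rarefaction, the idea can be sketched with the standard individual-based formula: the expected number of species in a random subsample of n records drawn without replacement from N records is the sum, over species, of the probability that the species appears at least once. This is a generic textbook illustration, not necessarily the exact procedure we ran, and the record counts are invented.

```python
from math import comb


def rarefied_richness(counts, n):
    """Expected species count in a random subsample of n records.

    `counts` holds the number of records per species. Each species
    contributes 1 - C(N - N_i, n) / C(N, n), the chance it is drawn
    at least once when sampling without replacement.
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of records")
    denominator = comb(total, n)
    return sum(1 - comb(total - c, n) / denominator for c in counts)


# Invented record counts for five species in one grid cell (100 records).
counts = [50, 30, 15, 4, 1]
curve = [(n, rarefied_richness(counts, n)) for n in (1, 10, 25, 50, 100)]
for n, expected in curve:
    print(n, round(expected, 2))
```

A curve that is still climbing steeply at the full sample size suggests that more collecting would turn up additional species, which is the pattern seen towards the margins of tribal distributions.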

Using another statistical technique, Generalized Dissimilarity Models, we can ask how diversity responds as values for these variables increase or decrease: diversity peaks in Heliantheae at a pH of a little above 7, rapidly diminishes above a certain slope pitch in Bahieae, and tails off in Chaenactideae as soil saturation increases. These results, replicated across tribes and for each of our environmental variables, give us an insight into the characteristics of the niches being exploited by different groups. Work to place this in a phylogenetic context – tracing environmental tolerances onto a phylogeny and determining which are correlated with increases or decreases in speciation rates – is ongoing.

While the Compositae have long been associated with harsh, dry, sandy environments, there do appear to be signals of adaptation to different discrete sets of variable extremes within these areas, allowing multiple independent and sometimes overlapping radiations. Clearly, morphological and physiological characteristics of the group (including the frustratingly similar and generalist nature of many a damn yellow one) are well suited to exploiting a wide range of niches, and the next step will be to integrate phenotypic data into evolutionary models to tease out which characteristics allow species to adapt to particular environments. Similarly, replication of these methods across other diverse groups, using more Big Data, is needed to draw broader conclusions as to the drivers of diversity at the level of communities and species assemblages. Not that knowing any of this makes me feel less inclined to throw my field guide at the next impossible-to-key-out daisy I encounter, but it does make me feel like my frustration is somehow an acknowledgement of the impressive resilience and adaptability of this group of plants.

The diversity of soils and climate gradients associated with the southwest US and northern Mexico provides a mosaic of habitats for adaptable plant groups. Here a Gerea canescens (Heliantheae tribe) overlooks the geochemically and topographically varied rim of Death Valley, Nevada. (photo by R. Edwards)

The “us” behind this project is a working group tasked with characterizing extreme environments in North America and the role that these play in driving plant diversification: Robert Edwards (US); Vicki Funk (US); Elisabeth Bui (CSIRO); Marty Goldhaber (USGS); Joe Miller (NSF); Jennifer Cartwright (USGS); Chase Mason (UCF); Jim Thompson (UWV/USGS); Pam Soltis (UF); Brian Anacker (CU/City of Boulder); Ian Pearse (USGS); and Travis Nauman (USGS). It is funded by the USGS Powell Center for Analysis and Synthesis.