Genealogical Properties of Subsamples in Highly Fecund Populations

Abstract

We consider some genealogical properties of nested samples. The complete sample is assumed to have been drawn from a natural population characterised by high fecundity and sweepstakes reproduction (abbreviated HFSR). The random gene genealogies of the samples are—due to our assumption of HFSR—modelled by coalescent processes which admit multiple mergers of ancestral lineages looking back in time. Among the genealogical properties we consider are the probability that the most recent common ancestor is shared between the complete sample and the subsample nested within the complete sample; we also compare the lengths of ‘internal’ branches of nested genealogies between different coalescent processes. The results indicate how ‘informative’ a subsample is about the properties of the larger complete sample, how much information is gained by increasing the sample size, and how the ‘informativeness’ of the subsample varies between different coalescent processes.

Keywords

Coalescent; High fecundity; Nested samples; Multiple mergers; Time to most recent common ancestor

Acknowledgements

We thank Alison Etheridge for many and very valuable comments and suggestions, especially regarding Theorem 1. BE was funded by DFG grant STE 325/17-1 to Wolfgang Stephan through Priority Programme SPP1819: Rapid Evolutionary Adaptation. FF was funded by DFG grant FR 3633/2-1 through Priority Program 1590: Probabilistic Structures in Evolution.

Supplementary material

A1 Population Models

In this section we provide a brief overview of the population models behind the coalescent processes we consider, and why we think they are interesting. A detailed description of the coalescent processes is given in Sect. A2.

Reproduction and inheritance are universal mechanisms among biological populations. Reproduction refers to the generation of offspring, and inheritance refers to the transmission of information necessary for viability and reproduction. Mendel’s laws on independent segregation of chromosomes into gametes describe the transmission of information from a parent to an offspring in a diploid population. For our purposes, however, it suffices to think of haploid populations where one can think of an individual as a single gene copy. By tracing gene copies as they are passed on from one generation to the next, one automatically stores two sets of information. On the one hand, one stores how frequencies of genetic types change going forwards in time; on the other hand, one keeps track of the ancestral, or genealogical, relations among the different copies. This duality has been successfully exploited, for example, in modelling selection [34, 35]. To model genetic variation in natural populations one requires a mathematically tractable model of how genetic information is passed from parents to offspring. In the Wright–Fisher model offspring choose their parents independently and uniformly at random. Suppose we are tracing the ancestry of \(n \ge 2\) gene copies in a haploid Wright–Fisher population of N gene copies in total. For any pair, the chance that they have a common ancestor in the previous generation is 1 / N. Informally, we trace the genealogy of our gene copies for \({\mathscr {O}}(N)\) generations until we see the first merger, i.e. when at least two gene copies (or their ancestral lines) find a common ancestor. If n is small relative to N, when a merger occurs, with probability \(1-{\mathscr {O}}( 1/N)\) it involves just two ancestral lineages. 
This means that if we measure time in units of N generations, and assume N is very large, the random ancestral relations of our sampled gene copies can be described by a continuous-time Markov chain in which each pair of ancestral lines merges at rate 1 and no other mergers are possible. We have, in an informal way, arrived at the Kingman-coalescent [56, 57, 58]. One can derive the Kingman-coalescent not just from the Wright–Fisher model but from any population model which satisfies certain assumptions on the offspring distribution [61, 64, 71]. These assumptions mainly dictate that higher moments of the offspring number distribution are small relative to (an appropriate power of) the population size. The Kingman-coalescent, and its various extensions, are used almost universally as the ‘null model’ for a gene genealogy in population genetics. The Kingman-coalescent is a remarkably good model for populations characterised by low fecundity, i.e. whose individuals have small numbers of offspring relative to the population size.
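
The pairwise-merger description above can be illustrated with a minimal Python sketch. It assumes only what is stated in the text (each pair of lineages merges at rate 1, so k lineages wait an Exp(k(k-1)/2) time before the next merger); the function names `expected_tmrca` and `simulate_tmrca` are hypothetical and chosen for illustration.

```python
import random
from fractions import Fraction


def expected_tmrca(n):
    """Exact expected time to the MRCA of n lineages: sum over k of 1 / C(k, 2)."""
    return sum(Fraction(2, k * (k - 1)) for k in range(2, n + 1))


def simulate_tmrca(n, rng):
    """One draw of the tree height: with k lineages, wait Exp(C(k, 2)), then merge a pair."""
    t = 0.0
    for k in range(n, 1, -1):
        t += rng.expovariate(k * (k - 1) / 2)
    return t


rng = random.Random(1)
n = 10
mean = sum(simulate_tmrca(n, rng) for _ in range(20000)) / 20000
# The exact value is 2 * (1 - 1/n); the simulated mean should be close to it.
print(float(expected_tmrca(n)), mean)
```

Time is measured in units of N generations here, matching the scaling used in the text.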

The classical Kingman-coalescent is derived from a population model in which the population size is constant between generations. Extensions to stochastically varying population size, in which the population size does not vary ‘too much’ between generations, have been made [54]; the result is a time-changed Kingman-coalescent. Probably the most commonly applied model of deterministically changing population size is the model of exponential population growth (see e.g. [25, 30, 41]). In each generation the population size is multiplied by a factor \((1+\beta /N)\), where \(\beta > 0\). Therefore, the population size in generation k going forward in time is given by \(N_k = N(1 + \beta / N)^k\) where N is taken as the ‘initial’ population size. It follows that, for large N, the population size \(\lfloor Nt\rfloor \) generations ago is approximately \(Ne^{-\beta t}\), since \((1+\beta /N)^{-\lfloor Nt\rfloor } \rightarrow e^{-\beta t}\) as \(N \rightarrow \infty \). [30] show that exponential population growth can be distinguished from multiple-merger coalescents (in which at least three ancestral lineages can merge simultaneously), derived from population models of high fecundity and sweepstakes reproduction, using population genetic data from a single locus, provided that sample size and number of mutations (segregating sites) are not too small.
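
As a numerical check of the growth formula, the following Python sketch compares the discrete-generation population size with its exponential approximation. The parameter values are arbitrary illustrations and the function name `size_generations_ago` is hypothetical.

```python
import math


def size_generations_ago(N, beta, g):
    """Population size g generations before the present, when the size grows
    by a factor (1 + beta / N) per generation and the present size is N."""
    return N * (1 + beta / N) ** (-g)


# Illustrative values only; they are not taken from the paper.
N, beta, t = 10**6, 2.0, 0.5
exact = size_generations_ago(N, beta, math.floor(N * t))
approx = N * math.exp(-beta * t)
print(exact, approx)
```

For these values the two numbers agree to a relative error of order 1/N, which is the sense in which the exponential form holds.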

A diverse group of natural populations, including some marine organisms [46], fungi [1, 51, 79], and viruses [83] are highly fecund. By way of example, individual Atlantic codfish [60, 68] and Pacific oysters [59] can lay millions of eggs. This high fecundity counteracts the high mortality rate among the larvae (juveniles) of these populations (Type III survivorship). The term ‘sweepstakes reproduction’ has been proposed to describe the reproduction mode of highly fecund populations with Type III survivorship [45]. Population models which admit high fecundity and sweepstakes reproduction (HFSR) through skewed or heavy-tailed offspring number distributions have been developed [31, 53, 64, 65, 73, 78]. In the haploid model of [78], each individual independently contributes a random number X of juveniles where \((C, \alpha > 0)\)

\[
{\mathbb {P}}(X \ge k) \sim C k^{-\alpha }, \quad k \rightarrow \infty , \tag{A28}
\]

and \(x_n \sim y_n\) means \(x_n/y_n \rightarrow 1\) as \(n\rightarrow \infty \). The constant \(C > 0\) is a normalising constant, and the constant \(\alpha \) determines the skewness of the distribution. The next generation of individuals is then formed by sampling (uniformly without replacement) from the pool of juveniles. In the case \(\alpha < 2\) the random ancestral relations of gene copies can be described by specific forms of multiple-merger coalescent processes [72]. We remark that the fate of the juveniles need not be correlated to generate multiple mergers in the genealogies; the heavy-tailed distribution of juveniles means that occasionally one ‘lucky’ individual contributes a huge number of juveniles while all others contribute only a small number of juveniles. Uniform sampling without replacement from the pool of juveniles means that the lucky individual leaves significantly more descendants in the next generation than anyone else, and this is what generates multiple mergers of ancestral lines.
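
One generation of such a model can be sketched in Python as follows. This is an illustration under stated assumptions, not the construction of [78]: the juvenile law is taken to be the specific heavy-tailed choice P(X >= k) = k^(-alpha) for k = 1, 2, ... (so C = 1), and the names `juvenile_count` and `offspring_numbers` are hypothetical.

```python
import bisect
import itertools
import random


def juvenile_count(alpha, rng):
    """Draw X with P(X >= k) = k**(-alpha), k = 1, 2, ... (an assumed example law)."""
    u = 1.0 - rng.random()            # uniform on (0, 1], avoids u == 0
    return int(u ** (-1.0 / alpha))   # floor of a Pareto-type variable, always >= 1


def offspring_numbers(N, alpha, rng):
    """Number of surviving offspring of each of the N parents after one generation."""
    juveniles = [juvenile_count(alpha, rng) for _ in range(N)]
    cumulative = list(itertools.accumulate(juveniles))
    # Sample N juveniles uniformly without replacement from the whole pool.
    survivors = rng.sample(range(cumulative[-1]), N)
    counts = [0] * N
    for j in survivors:
        # Parent p owns the juvenile indices in [cumulative[p-1], cumulative[p]).
        counts[bisect.bisect_right(cumulative, j)] += 1
    return counts


rng = random.Random(0)
counts = offspring_numbers(500, 1.5, rng)
print(max(counts), sum(counts))
```

Repeating this for several generations typically shows an occasional parent with a share of the next generation far above 1/N, which is the mechanism behind multiple mergers described above.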

Coalescent processes derived from population models of HFSR (see (A28) for an example) admit multiple mergers of ancestral lineages [24, 63, 65, 70, 71, 72, 76]. Mathematically, we consider exchangeable n-coalescent processes, which are Markovian processes \((\varPi _t^{(n)})_{t\ge 0}\) on the set of partitions of \([n] := \{1,2,\ldots , n\}\) whose transitions are mergers of partition blocks (a ‘block’ is a subset of [n], see Sect. A2) with rates specified in Sect. A2. The blocks of \(\varPi _t^{(n)}\) show which individuals in [n] share a common ancestor at time t measured from the time of sampling. Thus, the blocks of \(\varPi _t^{(n)}\) can be interpreted as ancestral lineages. The specific structure of the transition rates allows one to treat a multiple-merger n-coalescent as the restriction of an exchangeable Markovian process \((\varPi _t)_{t\ge 0}\) on the set of partitions of \({\mathbb {N}}\), which is called a multiple-merger coalescent (abbreviated MMC) process. MMC processes are referred to as \(\varLambda \)-coalescents (\(\varLambda \) a finite measure on [0, 1]) [24, 70, 71] if any number of ancestral lineages can merge at any given time, but only one such merger occurs at a time. By way of an example, if \(1 \le \alpha < 2\) in (A28) one obtains a so-called Beta\((2-\alpha ,\alpha )\)-coalescent [72] (Beta-coalescent, see Eq. (A35)). Processes which admit two or more simultaneous (multiple) mergers are referred to as \(\varXi \)-coalescents (\(\varXi \) a finite measure on the infinite simplex \(\varDelta \)) [64, 65, 76]. See Sect. A2 for details. Specific examples of these MMC processes have been shown to give a better fit to genetic data sampled from Atlantic cod [2, 12, 16, 18, 19] and Japanese sardines [67] than the classical Kingman-coalescent. See e.g. [29] for an overview of inference methods for MMC processes. 
[46] review the evidence for sweepstakes reproduction among marine populations and conclude ‘that it plays a major role in shaping marine biodiversity’.

MMC models also arise in contexts other than high fecundity. [17] show that repeated strong bottlenecks in a Wright–Fisher population lead to time-changed Kingman-coalescents which look like \(\varXi \)-coalescents. [27, 28] show that the genealogy of a locus subjected to repeated beneficial mutations is well approximated by a \(\varXi \)-coalescent. [75] provides rigorous justification of the claims of [22, 66] that the genealogy of a population subject to repeated beneficial mutations can be described by the Beta-coalescent with \(\alpha = 1\) (also referred to as the Bolthausen–Sznitman coalescent [20]). These examples show that MMC processes are relevant for biology. We refer the interested reader to e.g. [5, 9, 10, 13, 25, 33] for a more detailed background on coalescent theory.

A2 Coalescent Processes

To keep our presentation self-contained, we now give a precise definition of the coalescent processes we need. We follow the description of [19]. A coalescent process \(\varPi \) is a continuous-time Markov chain on the partitions of \({\mathbb {N}}\). Let \(\varPi ^{(n)}\) denote the restriction to [n], and write \({\mathscr {P}}_n\) for the space of partitions of [n]. A partition \(\pi = \{\pi _1, \ldots , \pi _{\#\pi } \} \in {\mathscr {P}}_n\) has \(\#\pi \) blocks, which are disjoint subsets of [n]. We assume the blocks \(\pi _i\) are ordered by their smallest element; therefore we always have \(1 \in \pi _1\). In general a merging event can involve r distinct groups of blocks merging simultaneously. We write \(\underline{k} = (k_1, \ldots , k_r)\) where \(k_i \ge 2\) denotes the number of blocks merging in group i. Here \(r \in [\lfloor \#\pi / 2 \rfloor ]\), \(k_1 + \cdots + k_r \in [\#\pi ]_2\) and \(i_1^{(a)},\ldots , i_{k_a}^{(a)}\) will denote the indices of the blocks in the \(a\hbox {th}\) group. By \(\pi ^\prime \prec _{ \#\pi , \underline{k}} \pi \) we denote a transition from \(\pi \) to \(\pi ^\prime = A\cup B\) where

\[
A = \Big \{ \pi _\ell : \ell \in [\#\pi ] \setminus \bigcup _{a=1}^{r} \{ i_1^{(a)},\ldots , i_{k_a}^{(a)} \} \Big \}, \qquad B = \Big \{ \bigcup _{j=1}^{k_a} \pi _{i_j^{(a)}} : a \in [r] \Big \}. \tag{A29}
\]

In (A29), set A (possibly empty) contains the blocks not involved in a merger, and B lists the blocks resulting from each of the r mergers. By \(\pi ^\prime \prec _{ \#\pi , k} \pi \) we denote the transition in a \(\varLambda \)-coalescent where \(k \in [\#\pi ]_2\) blocks merge in a single merger and \(\pi ^\prime \) is given as in (A29) with \(r = 1\); i.e. only one group of blocks merges in each transition. By \(\pi ^\prime \prec _{\#\pi } \pi \) we denote a transition in the Kingman-coalescent where \(r = 1\) and 2 blocks merge in each transition.
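
The transition described above (unmerged blocks kept in A, each merging group collapsed into one block of B) can be sketched in Python. The representation of a partition as a list of frozensets and the function name `merge_groups` are assumptions made for this illustration; block indices are 0-based in the code.

```python
def merge_groups(partition, groups):
    """Apply one (possibly simultaneous) merger to a partition.

    partition: list of disjoint frozensets, ordered by smallest element.
    groups: list of index lists; each list names >= 2 blocks that merge together,
            and no block index appears in more than one group.
    """
    merging = [i for g in groups for i in g]
    assert all(len(g) >= 2 for g in groups), "each group merges at least two blocks"
    assert len(merging) == len(set(merging)), "groups must be disjoint"
    # A: blocks not involved in any merger.
    A = [b for i, b in enumerate(partition) if i not in set(merging)]
    # B: one new block per group, the union of the blocks in that group.
    B = [frozenset().union(*(partition[i] for i in g)) for g in groups]
    # Restore the ordering by smallest element.
    return sorted(A + B, key=min)


singletons = [frozenset({i}) for i in range(1, 7)]
new = merge_groups(singletons, [[0, 2], [1, 3, 4]])
print([sorted(b) for b in new])  # [[1, 3], [2, 4, 5], [6]]
```

A transition of a \(\varLambda \)-coalescent corresponds to passing a single group, and a Kingman transition to a single group of size two.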

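The total rate of k-mergers in a \(\varLambda \)-coalescent is given by \( \lambda _k(n) = \binom{n}{k} \lambda _{n,k}\) for \(2 \le k \le n\), where \(\lambda _{n,k}\) is the rate at which any given group of k out of n blocks merges. The total rate of mergers given \(n \ge 2\) active blocks is

\[
\lambda (n) = \sum _{k=2}^{n} \lambda _k(n).
\]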

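An important example of a \(\varLambda \)-coalescent is the Beta\((2 - \alpha , \alpha )\)-coalescent [78], where the \(\varLambda \) measure is associated with the beta density (with \(B(\cdot ,\cdot )\) denoting the beta function)

\[
\varLambda (dx) = \frac{1}{B(2-\alpha ,\alpha )}\, x^{1-\alpha }(1-x)^{\alpha -1}\,dx, \quad 0< x < 1. \tag{A35}
\]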

For \(\alpha = 1\) the Beta\((2 - \alpha ,\alpha )\)-coalescent is the Bolthausen–Sznitman coalescent [20, 39]. The Beta-coalescent is well studied; there are connections to superprocesses, continuous-state branching processes (CSBP) and continuous stable random trees, as described e.g. in [7, 14].
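
The merger rates of the Beta-coalescent can be evaluated numerically with a short Python sketch. It assumes the usual \(\varLambda \)-coalescent rate \(\lambda _{n,k} = \int _0^1 x^{k-2}(1-x)^{n-k}\varLambda (dx)\) with \(\varLambda \) the Beta\((2-\alpha ,\alpha )\) probability distribution, which gives \(\lambda _{n,k} = B(k-\alpha , n-k+\alpha )/B(2-\alpha ,\alpha )\); the function names are hypothetical.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b) for a, b > 0."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def lambda_nk(n, k, alpha):
    """Rate at which a given group of k out of n blocks merges (0 < alpha < 2)."""
    return math.exp(log_beta(k - alpha, n - k + alpha) - log_beta(2 - alpha, alpha))


def total_rate(n, alpha):
    """Total merger rate with n active blocks: sum over k of C(n, k) * lambda_nk."""
    return sum(math.comb(n, k) * lambda_nk(n, k, alpha) for k in range(2, n + 1))


# For alpha = 1 (Bolthausen-Sznitman) the total rate with n blocks equals n - 1.
print(total_rate(5, 1.0))
print([round(math.comb(5, k) * lambda_nk(5, k, 1.5), 4) for k in range(2, 6)])
```

Working with log-gamma values avoids overflow for larger n; `math.comb` requires Python 3.8 or later.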

A3 Goldschmidt and Martin’s Construction of the Bolthausen–Sznitman n-coalescent

From [39], we recall the construction of the Bolthausen–Sznitman n-coalescent by cutting the edges of a random recursive tree. Let \({\mathbb {T}}_n\) be a random recursive tree with n nodes. We can construct \({\mathbb {T}}_n\) sequentially as follows:

(i) Start with a node labelled 1 (the root) and no edges.

(ii) If \(i<n\) nodes are present, add a node labelled \(i+1\) and one edge connecting it to a node in [i] picked uniformly at random.

(iii) Stop when n nodes are present.
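
The sequential construction in steps (i)–(iii) can be sketched in Python as follows. Storing the tree as a dictionary that maps each non-root node to its parent is a representation chosen for this illustration, and the function name `random_recursive_tree` is hypothetical.

```python
import random


def random_recursive_tree(n, rng):
    """Return {node: parent} for nodes 2..n of a random recursive tree rooted at 1."""
    parent = {}
    for i in range(1, n):
        # Node i + 1 attaches to a uniformly chosen node among 1, ..., i.
        parent[i + 1] = rng.randint(1, i)
    return parent


tree = random_recursive_tree(8, random.Random(0))
print(tree)
```

Each key corresponds to one edge of the tree, so the dictionary has n - 1 entries.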

The object \({\mathbb {T}}_n\) is a labelled tree in which each node has a single label. We consider a realisation of \({\mathbb {T}}_n\) and transform this tree over time into labelled trees with fewer nodes, whose nodes accumulate multiple labels.

(i) Each edge of \({\mathbb {T}}_n\) is linked to an exponential clock. The clocks are i.i.d. Exp(1)-distributed.

(ii) We wait for the first clock to ring. At this time, we cut/remove the edge whose clock rang first. The tree is thus split into two trees, one of which includes the node with label 1. We denote this tree by \({\mathbb {T}}^{(1)}\), the other tree by \({\mathbb {T}}^{(2)}\). Let \(e_1\) be the node of \({\mathbb {T}}^{(1)}\) that was connected to the removed edge.

(iii) All labels of \({\mathbb {T}}^{(2)}\) are added to the set of labels of \(e_1\). Remove \({\mathbb {T}}^{(2)}\) including its clocks.

(iv) Repeat from (ii), using \({\mathbb {T}}^{(1)}\) labelled as in (iii) with the (remaining) clocks from (i). Stop when \({\mathbb {T}}^{(1)}\) in step (iii) consists of only a single node and no edges.

(v) For any time t, the label sets at the nodes of \({\mathbb {T}}^{(1)}\) (of \({\mathbb {T}}_n\) before the first clock has rung) give a partition \(\varPi ^{(n)}_t\) of [n]. The process \((\varPi ^{(n)}_t)_{t\ge 0}\) is a Bolthausen–Sznitman n-coalescent (set \(\varPi ^{(n)}_t=[n]\) if t is later than the time at which we stopped the cutting procedure).

Figure 7 shows an illustration of steps (i)–(iii) for a realisation of \({\mathbb {T}}_5\).
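
The cutting procedure in steps (i)–(v) can be sketched in Python. This is an illustration rather than the implementation of [39]: all clock times are drawn in advance and processed in increasing order, which is assumed equivalent to waiting for the clocks because a clock removed with a cut subtree is simply skipped. The function name `bolthausen_sznitman` and the data layout are hypothetical.

```python
import random


def bolthausen_sznitman(n, rng):
    """Return the jump times and partitions [(t, partition), ...] of one realisation."""
    # Random recursive tree: node v >= 2 attaches to a uniform node in 1..v-1.
    parent = {v: rng.randint(1, v - 1) for v in range(2, n + 1)}
    children = {v: [] for v in range(1, n + 1)}
    for v, p in parent.items():
        children[p].append(v)
    labels = {v: {v} for v in range(1, n + 1)}   # label sets of nodes still in the tree
    # One Exp(1) clock per edge; the edge (parent[v], v) is identified by v.
    clocks = sorted((rng.expovariate(1.0), v) for v in parent)

    def snapshot():
        return sorted(sorted(block) for block in labels.values())

    history = [(0.0, snapshot())]
    for t, v in clocks:
        if v not in labels:          # edge already removed with an earlier subtree
            continue
        # Collect all labels in the subtree rooted at v (the tree T^(2)) and remove it.
        absorbed, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u in labels:
                absorbed |= labels.pop(u)
                stack.extend(children[u])
        labels[parent[v]] |= absorbed   # e_1 = parent[v] receives the labels
        history.append((t, snapshot()))
    return history


for t, partition in bolthausen_sznitman(5, random.Random(3)):
    print(round(t, 3), partition)
```

Each recorded partition lists its blocks ordered by smallest element, and the final entry is the single block containing all of 1, ..., n.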

Basu, A., Majumder, P.P.: A comparison of two popular statistical methods for estimating the time to most recent common ancestor (tmrca) from a sample of DNA sequences. J. Genet. 82(1–2), 7–12 (2003)

Griswold, C.K., Baker, A.J.: Time to the most recent common ancestor and divergence times of populations of common chaffinches (Fringilla coelebs) in Europe and North Africa: insights into Pleistocene refugia and current levels of migration. Evolution 56(1), 143–153 (2002)

Spouge, J.L.: Within a sample from a population, the distribution of the number of descendants of a subsample’s most recent common ancestor. Theor. Popul. Biol. 92, 51–54 (2014)