Multispecies coalescent process
The multispecies coalescent process is a stochastic process model that describes the genealogical relationships for a sample of DNA sequences taken from several species.
A gene tree is a binary tree that describes the evolutionary relationships between a sample of sequences for a non-recombining locus. A species tree describes the evolutionary relationships between a set of species, assuming tree-like evolution. However, several processes, notably the persistence of ancestral polymorphism across speciation events, can lead to discordance between species trees and gene trees. The multispecies coalescent model provides a framework for inferring species phylogenies while accounting for ancestral polymorphism and conflict between gene trees and the species tree. The process is also called the censored coalescent.[1]
Multispecies Coalescent
This section discusses the probability density of gene trees under the multispecies coalescent model and its use for parameter estimation from multi-locus sequence data.
Assumptions
The species phylogeny is assumed to be known. Species are assumed to be completely isolated after divergence, with no migration, hybridization, or introgression. Recombination within a locus is assumed to be absent, so that all the sites within the locus share the same gene tree (topology and coalescent times).
Data and Model Parameters
The model and its implementation can be applied to any species tree. As an example, the species tree of the great apes is considered: human (H), chimpanzee (C), gorilla (G) and orangutan (O). The topology of the species tree, (((H, C), G), O), is assumed known and fixed in the analysis (Figure 1).[1] Let $D = \{D_i\}$ be the entire data set, where $D_i$ represents the sequence alignment at locus $i$, with $i = 1, 2, \ldots, L$ for a total of $L$ loci.
The population size of a current species is considered only if more than one individual is sampled from that species at some loci.
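The parameters in the model for the example of Figure 1 include the three divergence times $\tau_{HC}$, $\tau_{HCG}$ and $\tau_{HCGO}$, and the population size parameters $\theta_H$ for humans; $\theta_C$ for chimpanzees; and $\theta_{HC}$, $\theta_{HCG}$ and $\theta_{HCGO}$ for the three ancestral species.
The divergence times ($\tau$'s) are measured by the expected number of mutations per site from the ancestral node in the species tree to the present time (Figure 1 of Rannala and Yang, 2003).
Therefore, the parameters are $\Theta = \{\theta_H, \theta_C, \theta_{HC}, \theta_{HCG}, \theta_{HCGO}, \tau_{HC}, \tau_{HCG}, \tau_{HCGO}\}$.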
Likelihood-based inference
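The gene genealogy $G_i$ at each locus $i$ is represented by the tree topology $T_i$ and the coalescent times $t_i$, so that $G_i = \{T_i, t_i\}$. Given the parameters $\Theta$, the probability distribution of $G_i$ is specified by the coalescent process under the model and is written $f(G_i \mid \Theta)$.
The probability of the data $D_i$ given the gene tree and coalescent times (and thus branch lengths) at the locus, $f(D_i \mid G_i)$, is Felsenstein's phylogenetic likelihood.[2] Writing $G = \{G_i\}$ for the genealogies at all loci, the assumption of independent evolution across the loci gives
$$f(G \mid \Theta) = \prod_{i=1}^{L} f(G_i \mid \Theta), \qquad f(D \mid G) = \prod_{i=1}^{L} f(D_i \mid G_i).$$
Bayesian inference is based on the joint conditional distribution
$$f(\Theta, G \mid D) = \frac{f(\Theta)\, f(G \mid \Theta)\, f(D \mid G)}{f(D)},$$
where $f(\Theta)$ is the prior on the parameters and $f(D)$ is the normalizing constant.
Then, the posterior distribution of $\Theta$ is given by
$$f(\Theta \mid D) = \int f(\Theta, G \mid D)\, dG,$$
where the integration represents summation over all possible gene tree topologies and integration over the coalescent times at each locus.[3]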
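As a minimal sketch of how the unnormalized joint density could be evaluated, the function below adds the log prior of the parameters to the per-locus log coalescent density and log phylogenetic likelihood. The three callables `log_prior`, `log_coalescent` and `log_phylo` are hypothetical placeholders for $\log f(\Theta)$, $\log f(G_i \mid \Theta)$ and $\log f(D_i \mid G_i)$; they are assumptions for illustration and not part of any particular program.

```python
def log_joint(theta, genealogies, alignments, log_prior, log_coalescent, log_phylo):
    """Unnormalized log f(Theta, G | D).

    Computes log f(Theta) + sum_i [ log f(G_i | Theta) + log f(D_i | G_i) ],
    which relies on the loci evolving independently.
    All three log-density callables are supplied by the caller (hypothetical).
    """
    if len(genealogies) != len(alignments):
        raise ValueError("need exactly one genealogy per locus")
    total = log_prior(theta)
    for g_i, d_i in zip(genealogies, alignments):
        total += log_coalescent(g_i, theta) + log_phylo(d_i, g_i)
    return total
```

A sampling scheme that proposes new values of the parameters and genealogies would only need ratios of this quantity, so the normalizing constant never has to be computed.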
Distribution of gene genealogy derived from censored coalescent process
The distribution $f(G_i \mid \Theta)$ is derived directly in this section. Two sequences from different species can coalesce only in populations that are ancestral to both species. For example, sequences H and G can coalesce in populations HCG or HCGO, but not in populations H or HC. The coalescent process differs from population to population, so each population is treated separately.
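The constraint on where two sequences may coalesce can be sketched in a few lines. The example below encodes the great ape species tree as a child-to-parent map (the dictionary layout and function names are illustrative assumptions) and lists the populations shared by the ancestral paths of two sampled species.

```python
# Species tree (((H, C), G), O) as a child -> parent map of populations.
# Population labels follow the text; the encoding itself is an assumption.
PARENT = {"H": "HC", "C": "HC", "HC": "HCG", "G": "HCG", "HCG": "HCGO", "O": "HCGO"}

def ancestors(species):
    """Populations a lineage sampled from `species` passes through, tip to root."""
    path = [species]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def coalescence_populations(a, b):
    """Populations in which a sequence from species a and one from species b may coalesce."""
    path_b = set(ancestors(b))
    return [p for p in ancestors(a) if p in path_b]

print(coalescence_populations("H", "G"))  # ['HCG', 'HCGO']
```

For two sequences sampled from the same species the list starts with that species itself, since they may already coalesce in the modern population.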
For each population, the genealogy is traced backward in time until the end of the population at time $\tau$, and the number of lineages entering the population, $m$, and the number of lineages leaving it, $n$, are recorded. For example, the values of $m$ and $n$ for population H are given in Table 1.[1] This process is called a censored coalescent process because the coalescent process for one population may be terminated before all lineages that entered the population have coalesced. If $n > 1$, the population consists of $n$ disconnected subtrees or lineages.
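With one time unit defined as the time taken to accumulate one mutation per site, any two lineages coalesce at the rate $2/\theta$. The waiting time $t_j$ until the next coalescent event, which reduces the number of lineages from $j$ to $j - 1$, has the exponential density
$$f(t_j) = \frac{j(j-1)}{2}\,\frac{2}{\theta}\,\exp\left\{-\frac{j(j-1)}{2}\,\frac{2}{\theta}\,t_j\right\}, \qquad j = m, m-1, \ldots, n+1.$$
If $n > 1$, one must also account for the probability that no coalescent event occurs between the last one and the end of the population at time $\tau$, i.e. during the time interval $\tau - (t_m + t_{m-1} + \cdots + t_{n+1})$. This probability is
$$\exp\left\{-\frac{n(n-1)}{\theta}\left[\tau - (t_m + t_{m-1} + \cdots + t_{n+1})\right]\right\}$$
and is 1 if $n = 1$.
(Note: the probability of no events over a time interval $t$ for a Poisson process with rate $\lambda$ is $e^{-\lambda t}$. Here the coalescent rate when there are $n$ lineages is $\frac{n(n-1)}{2}\cdot\frac{2}{\theta} = \frac{n(n-1)}{\theta}$.)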
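The censoring can be illustrated with a small simulation for a single population. This is a sketch under stated assumptions: `duration` denotes the length of the population measured from the time the lineages enter it, and the function name and arguments are hypothetical.

```python
import math
import random

def simulate_censored_coalescent(m, theta, duration, rng):
    """Trace m lineages backward through one population.

    Returns (waiting_times, n): the waiting times of the coalescent events
    that occur before the population ends, and the number n of lineages
    that leave the population.
    """
    times, elapsed, j = [], 0.0, m
    while j > 1:
        rate = j * (j - 1) / theta      # j(j-1)/2 pairs, each coalescing at rate 2/theta
        t = rng.expovariate(rate)       # exponential waiting time t_j
        if elapsed + t > duration:      # censored: the population ends first
            break
        times.append(t)
        elapsed += t
        j -= 1
    return times, j

# Usage: with m = 2 the chance of leaving with n = 2 should be close to exp(-2 * duration / theta).
rng = random.Random(1)
theta, duration, reps = 0.01, 0.005, 20000
uncoalesced = sum(simulate_censored_coalescent(2, theta, duration, rng)[1] == 2 for _ in range(reps))
print(uncoalesced / reps, math.exp(-2 * duration / theta))
```

Passing `math.inf` as `duration` corresponds to the root population, in which the process always runs until a single lineage remains.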
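In addition, to derive the probability of a particular gene tree topology in the population, note that if a coalescent event occurs in a sample of $j$ lineages, the probability that a particular pair of lineages coalesces is $1/\binom{j}{2} = \frac{2}{j(j-1)}$.
Multiplying these probabilities together, the factor $\frac{j(j-1)}{2}$ in each waiting-time density cancels against the pair probability, and the joint probability distribution of the gene tree topology in the population and its coalescent times $t_m, t_{m-1}, \ldots, t_{n+1}$ is
$$\prod_{j=n+1}^{m}\left[\frac{2}{\theta}\exp\left\{-\frac{j(j-1)}{\theta}\,t_j\right\}\right] \times \exp\left\{-\frac{n(n-1)}{\theta}\left[\tau - (t_m + t_{m-1} + \cdots + t_{n+1})\right]\right\}.$$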
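The probability of the gene tree and coalescent times for the locus is the product of such probabilities across all the populations. Therefore, for the gene genealogy of Figure 1,[1][4] $f(G_i \mid \Theta)$ is the product of one such term for each population in the species tree, each evaluated with that population's own $\theta$, $\tau$, $m$ and $n$.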
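The per-population density and its product across populations can be sketched as follows, working on the log scale. The function and argument names are assumptions, `duration` is the length of the population measured from the time the lineages enter it, and the numerical values in the usage example are hypothetical rather than taken from Figure 1.

```python
import math

def log_density_population(times, n, theta, duration=math.inf):
    """Log density of the topology and waiting times within one population.

    times:    waiting times t_m, ..., t_{n+1}, so m = n + len(times) lineages enter
    n:        number of lineages leaving the population
    duration: length of the population (math.inf for the root, where n must be 1)
    """
    m = n + len(times)
    logp = 0.0
    for j, t in zip(range(m, n, -1), times):
        # each coalescence contributes (2/theta) * exp(-j(j-1) t_j / theta)
        logp += math.log(2.0 / theta) - j * (j - 1) * t / theta
    if n > 1:
        # no coalescence among the n remaining lineages until the population ends
        logp -= n * (n - 1) * (duration - sum(times)) / theta
    return logp

def log_density_locus(populations):
    """Sum of the per-population log densities, i.e. the log of their product."""
    return sum(log_density_population(**p) for p in populations)

# Hypothetical two-species example: two sequences from A, one from B.
populations = [
    {"times": [], "n": 2, "theta": 0.002, "duration": 0.001},  # tip A: no coalescence
    {"times": [], "n": 1, "theta": 0.002, "duration": 0.001},  # tip B: single lineage, factor 1
    {"times": [0.0005, 0.001], "n": 1, "theta": 0.004},        # root AB: 3 -> 2 -> 1 lineages
]
print(log_density_locus(populations))
```

A population with a single sampled lineage contributes a factor of 1, which is consistent with population size parameters being needed for a modern species only when more than one sequence is sampled from it.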
References
- Rannala B, Yang Z (August 2003). "Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci". Genetics. 164 (4): 1645–56. PMC 1462670. PMID 12930768.
- Felsenstein J (1981). "Evolutionary trees from DNA sequences: A maximum likelihood approach". Journal of Molecular Evolution. 17 (6): 368–376. doi:10.1007/BF01734359. PMID 7288891.
- Xu B, Yang Z (December 2016). "Challenges in Species Tree Estimation Under the Multispecies Coalescent Model". Genetics. 204 (4): 1353–1368. doi:10.1534/genetics.116.190173. PMC 5161269. PMID 27927902.
- Yang Z (2014). Molecular Evolution: A Statistical Approach (First ed.). Oxford: Oxford University Press. Chapter 9. ISBN 9780199602605. OCLC 869346345.