Multispecies coalescent process

The multispecies coalescent process is a stochastic process model that describes the genealogical relationships for a sample of DNA sequences taken from several species.

A gene tree is a binary graph that describes the evolutionary relationships between a sample of sequences for a non-recombining locus. A species tree describes the evolutionary relationships between a set of species, assuming tree-like evolution. However, several processes can lead to discordance between species trees and gene trees. The multispecies coalescent model provides a framework for inferring species phylogenies while accounting for ancestral polymorphism and gene tree-species tree conflict. The process is also called the censored coalescent.[1]

Multispecies Coalescent

The probability density of the gene trees under the multispecies coalescent model is discussed along with its use for parameter estimation using multi-locus sequence data.

Assumptions

The species phylogeny is assumed to be known. Complete isolation after species divergence, with no migration, hybridization, or introgression is also assumed. We assume no recombination so that all the sites within the locus share the same gene tree (topology and coalescent times).

Data and Model Parameters

The model and implementation of this method can be applied to any species tree. As an example, the species tree of the great apes, human (H), chimpanzee (C), gorilla (G) and orangutan (O), is considered. The topology of the species tree, (((HC)G)O), is assumed known and fixed in the analysis (Figure 1).[1] Let $D=\{D_{i}\}$ be the entire data set, where ${D_{i}}$ represents the sequence alignment at locus $i$ , with $i=1,2,\ldots ,L$ for a total of $L$ loci.

The population size of a current species is considered only if more than one individual is sampled from that species at some loci.

The parameters in the model for the example of Figure 1 include the three divergence times $\tau _{HC}$ , $\tau _{HCG}$ and $\tau _{HCGO}$ and population size parameters $\theta _{H}$ for humans; $\theta _{C}$ for chimpanzees; and $\theta _{HC}$ , $\theta _{HCG}$ and $\theta _{HCGO}$ for the three ancestral species.

The divergence times ( $\tau$ 's) are measured by the expected number of mutations per site from the ancestral node in the species tree to the present time (Figure 1 of Rannala and Yang, 2003).

Therefore, the parameters are $\Theta =\{\theta _{H},\theta _{C},\theta _{HC},\theta _{HCG},\theta _{HCGO},\tau _{HC},\tau _{HCG},\tau _{HCGO}\}$ .

Likelihood-based inference

The gene genealogy $G_{i}$ at each locus $i$ is represented by the tree topology $T_{i}$ and the coalescent times $t_{i}$ . Given parameters $\Theta$ , the probability distribution of $G_{i}=\{T_{i},t_{i}\}$ is specified by the coalescent process under the model, and the joint distribution over all loci is given by

f(G\mid \Theta )=\prod _{i}f(G_{i}\mid \Theta )=\prod _{i}f(T_{i},t_{i}\mid \Theta )

The probability of data $D_{i}$ given the gene tree and coalescent times (and thus branch lengths) at the locus, $f(D_{i}\mid G_{i})$ , is Felsenstein's phylogenetic likelihood.[2] Due to the assumption of independent evolution across the loci,

f(D\mid G)=\prod _{i}f(D_{i}\mid G_{i})

Bayesian inference is based on the joint conditional distribution

f(\Theta ,G\mid D)\propto f(D\mid G)f(G\mid \Theta )f(\Theta ).

Then, the posterior distribution of $\Theta$ is given by

f(\Theta \mid D)=\int f(\Theta ,G\mid D)dG,

where the integration represents summation over all possible gene tree topologies and integration over the coalescent times at each locus.[3]
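The factorisation above can be sketched in code. The following is a minimal illustration, not the implementation of any published program: the function name and the three callables (`log_prior`, `log_coalescent_density`, `log_phylo_likelihood`) are hypothetical placeholders for $f(\Theta )$ , $f(G_{i}\mid \Theta )$ and $f(D_{i}\mid G_{i})$ . It evaluates the logarithm of the unnormalised joint density $f(D\mid G)f(G\mid \Theta )f(\Theta )$ for one choice of parameters and genealogies, which is the quantity a sampler would need in order to approximate the integral over $G$ .

```python
def log_unnormalized_posterior(theta, genealogies, alignments,
                               log_prior, log_coalescent_density,
                               log_phylo_likelihood):
    """Return log[f(D|G) f(G|Theta) f(Theta)] for fixed Theta and G.

    All three callables are hypothetical stand-ins supplied by the caller:
      log_prior(theta)                 -> log f(Theta)
      log_coalescent_density(G_i, th)  -> log f(G_i | Theta)
      log_phylo_likelihood(D_i, G_i)   -> log f(D_i | G_i)
    """
    if len(genealogies) != len(alignments):
        raise ValueError("need exactly one genealogy per locus")
    total = log_prior(theta)
    # Loci are assumed independent, so their log densities add up.
    for G_i, D_i in zip(genealogies, alignments):
        total += log_coalescent_density(G_i, theta)
        total += log_phylo_likelihood(D_i, G_i)
    return total
```

Working on the log scale avoids numerical underflow when the product runs over many loci.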

Distribution of gene genealogy derived from censored coalescent process

The joint distribution $f(T_{i},t_{i}\mid \Theta )$ is derived directly in this section. Two sequences from different species can coalesce only in populations that are ancestral to both species. For example, sequences H and G can coalesce in populations HCG or HCGO, but not in populations H or HC. The coalescent processes in different populations are different.

For each population, the genealogy is traced backward in time, until the end of the population at time $\tau$ , and the number of lineages entering the population $(m)$ and the number of lineages leaving it $(n)$ are recorded. For example, $m=3,n=2,$ and $\tau =\tau _{HC}$ , for population H (Table 1).[1] This process is called a censored coalescent process because the coalescent process for one population may be terminated before all lineages that entered the population have coalesced. If $n>1$ , the genealogy in the population consists of $n$ disconnected subtrees or lineages.

With one time unit defined as the time taken to accumulate one mutation per site, any two lineages coalesce at the rate ${\frac {2}{\theta }}$ . The waiting time $t_{j}$ until the next coalescent event, which reduces the number of lineages from $j$ to $j-1$ , has the exponential density

f(t_{j})={\frac {j(j-1)}{2}}{\frac {2}{\theta }}\exp\{-{\frac {j(j-1)}{2}}{\frac {2}{\theta }}t_{j}\},\quad j=m,m-1,\ldots ,n+1
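As a small numerical illustration of this density, the sketch below evaluates it for a given number of lineages $j$ and population size parameter $\theta$ . The function names are chosen here for illustration only. The total rate is ${\binom {j}{2}}\cdot {\frac {2}{\theta }}={\frac {j(j-1)}{\theta }}$ , so the mean waiting time is $\theta /(j(j-1))$ .

```python
import math


def coalescent_waiting_density(t, j, theta):
    """Exponential density of the waiting time t_j while j lineages remain.

    Time is measured in expected mutations per site, so each pair of
    lineages coalesces at rate 2/theta.
    """
    if j < 2 or theta <= 0 or t < 0:
        raise ValueError("require j >= 2, theta > 0 and t >= 0")
    rate = j * (j - 1) / theta  # C(j, 2) pairs, each at rate 2/theta
    return rate * math.exp(-rate * t)


def mean_waiting_time(j, theta):
    """Mean of the exponential waiting time: the reciprocal of the rate."""
    if j < 2 or theta <= 0:
        raise ValueError("require j >= 2 and theta > 0")
    return theta / (j * (j - 1))
```

With more lineages the rate grows quadratically, so recent coalescent events tend to be separated by much shorter waiting times than the final ones.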

If $n>1$ , we must also account for the probability that no coalescent event occurs between the last coalescent event and the end of the population at time $\tau$ , i.e. during the time interval $\tau -(t_{m}+t_{m-1}+\ldots +t_{n+1})$ . This probability is $\exp\{-{\frac {n(n-1)}{\theta }}[\tau -(t_{m}+t_{m-1}+\ldots +t_{n+1})]\}$ and is 1 if $n=1$ .

(Note: One should recall that the probability of no events over time interval $t$ for a Poisson process with rate $\lambda$ is $e^{-\lambda t}$ . Here the coalescent rate when there are $n$ lineages is $\lambda ={\frac {n(n-1)}{\theta }}$ .)
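The censored process within a single population can also be simulated directly. The sketch below is illustrative only (the function name and argument order are assumptions, not taken from the cited sources): it draws exponential waiting times at rate $j(j-1)/\theta$ and stops either when one lineage remains or when the next event would fall beyond the end of the population at time $\tau$ , returning the waiting times together with the number of lineages $n$ that leave the population.

```python
import math
import random


def simulate_censored_coalescent(m, theta, tau, rng):
    """Simulate coalescence of m lineages in a population that ends at tau.

    Returns (waits, n): waits holds t_m, t_{m-1}, ..., t_{n+1} in that
    order, and n is the number of lineages leaving the population.
    """
    if m < 1 or theta <= 0 or tau < 0:
        raise ValueError("require m >= 1, theta > 0 and tau >= 0")
    waits = []
    elapsed = 0.0
    j = m
    while j > 1:
        rate = j * (j - 1) / theta
        t = rng.expovariate(rate)
        if elapsed + t >= tau:
            break  # censored: the population ends before the next event
        waits.append(t)
        elapsed += t
        j -= 1
    return waits, j


# Usage: theta and tau are arbitrary illustrative values.
waits, n = simulate_censored_coalescent(3, 0.005, 0.004, random.Random(42))
```

Passing `math.inf` as `tau` mimics the root population, in which all lineages eventually coalesce.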

In addition, to derive the probability of a particular gene tree topology in the population, if a coalescent event occurs in a sample of $j$ lineages, the probability that a particular pair of lineages coalesce is $1/{\binom {j}{2}}=2/j(j-1),\quad j=m,m-1,\ldots ,n+1$ .
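As a worked step, multiplying the waiting-time density for $j$ lineages by the probability of a particular pair coalescing cancels the factor $j(j-1)$ , which is why each coalescent event contributes a factor ${\frac {2}{\theta }}$ regardless of $j$ . The second line instantiates this for $j=3$ .

```latex
\frac{j(j-1)}{\theta}\exp\Big\{-\frac{j(j-1)}{\theta}t_{j}\Big\}\times\frac{2}{j(j-1)}
  =\frac{2}{\theta}\exp\Big\{-\frac{j(j-1)}{\theta}t_{j}\Big\}

\frac{6}{\theta}\exp\Big\{-\frac{6}{\theta}t_{3}\Big\}\times\frac{1}{3}
  =\frac{2}{\theta}\exp\Big\{-\frac{6}{\theta}t_{3}\Big\}
```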

Multiplying these probabilities together, the joint probability distribution of the gene tree topology in the population and its coalescent times $t_{m},t_{m-1},\ldots ,t_{n+1}$ is

\prod _{j=n+1}^{m}{\Big [}{\frac {2}{\theta }}\exp {\Big \{}-{\frac {j(j-1)}{\theta }}t_{j}{\Big \}}{\Big ]}\exp {\Big \{}-{\frac {n(n-1)}{\theta }}(\tau -(t_{m}+t_{m-1}+\ldots +t_{n+1})){\Big \}}.
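The joint density for one population translates into a short function. This is a sketch under stated assumptions rather than a reference implementation: the function name is hypothetical, the waiting times are expected in the order $t_{m},t_{m-1},\ldots ,t_{n+1}$ , and $n$ is inferred as $m$ minus the number of coalescent events. The result is returned on the log scale.

```python
import math


def censored_coalescent_log_density(waits, m, theta, tau):
    """Log joint density of topology and coalescent times in one population.

    waits: waiting times t_m, t_{m-1}, ..., t_{n+1} (may be empty)
    m:     number of lineages entering the population
    tau:   time at which the population ends (math.inf for the root)
    """
    n = m - len(waits)
    if n < 1 or theta <= 0 or any(t < 0 for t in waits):
        raise ValueError("invalid lineage counts, theta or waiting times")
    elapsed = sum(waits)
    if elapsed > tau:
        raise ValueError("coalescent events extend past the population end")
    log_f = 0.0
    j = m
    for t in waits:
        # Each event contributes (2/theta) * exp(-j(j-1) t_j / theta).
        log_f += math.log(2.0 / theta) - j * (j - 1) / theta * t
        j -= 1
    if n > 1:
        # Probability of no further coalescence before the population ends.
        log_f -= n * (n - 1) / theta * (tau - elapsed)
    return log_f
```

The `n > 1` guard reflects the fact that the no-coalescence factor equals 1 when a single lineage remains, and it also avoids multiplying zero by an infinite duration in the root population.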

The probability of the gene tree and coalescent times for the locus is the product of such probabilities across all the populations. Therefore, for the gene genealogy of Figure 1,[1][4] we have

${\begin{aligned}f(G_{i}\mid \Theta )&=[2/\theta _{H}\exp\{-6t_{3}^{(H)}/\theta _{H}\}\exp\{-2(\tau _{HC}-t_{3}^{(H)})/\theta _{H}\}]\\&{}\times [2/\theta _{C}\exp\{-2t_{2}^{(C)}/\theta _{C}\}]\\&{}\times [2/\theta _{HC}\exp\{-6t_{3}^{(HC)}/\theta _{HC}\}]\times [2/\theta _{HC}\exp\{-2t_{2}^{(HC)}/\theta _{HC}\}]\\&{}\times [\exp\{-2(\tau _{HCGO}-\tau _{HCG})/\theta _{HCG}\}]\\&{}\times [2/\theta _{HCGO}\exp\{-6t_{3}^{(HCGO)}/\theta _{HCGO}\}]\times [2/\theta _{HCGO}\exp\{-2t_{2}^{(HCGO)}/\theta _{HCGO}\}]\end{aligned}}$

Population HCG contributes only a no-coalescence factor: two lineages enter it (one from population HC and the gorilla sequence) and both leave it without coalescing, over a duration of $\tau _{HCGO}-\tau _{HCG}$ .
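The product over populations can be checked numerically. The sketch below uses made-up parameter values and waiting times (every number is an assumption for illustration, as are the helper and variable names) and describes each population by the number of lineages entering it, its waiting times, and its duration. It assumes the lineage counts implied by the example: three lineages enter H and two leave, two enter C, three enter HC, two lineages pass through HCG without coalescing over a duration of $\tau _{HCGO}-\tau _{HCG}$ , and three enter the root population HCGO.

```python
import math


def population_log_density(waits, m, theta, duration):
    """Log density for one population; waits are t_m, ..., t_{n+1}."""
    n = m - len(waits)
    log_f = 0.0
    for j, t in zip(range(m, n, -1), waits):
        log_f += math.log(2.0 / theta) - j * (j - 1) * t / theta
    if n > 1:  # lineages that survive to the end of the population
        log_f -= n * (n - 1) * (duration - sum(waits)) / theta
    return log_f


# Hypothetical parameter values, chosen only for illustration.
theta = {"H": 0.005, "C": 0.004, "HC": 0.006, "HCG": 0.007, "HCGO": 0.008}
tau = {"HC": 0.005, "HCG": 0.007, "HCGO": 0.013}

# (population, lineages entering m, waiting times, duration)
populations = [
    ("H", 3, [0.002], tau["HC"]),
    ("C", 2, [0.003], tau["HC"]),
    ("HC", 3, [0.0005, 0.001], tau["HCG"] - tau["HC"]),
    ("HCG", 2, [], tau["HCGO"] - tau["HCG"]),
    ("HCGO", 3, [0.004, 0.006], math.inf),
]

# log f(G_i | Theta) is the sum of the per-population log densities.
log_f_G = sum(
    population_log_density(waits, m, theta[name], duration)
    for name, m, waits, duration in populations
)
print(log_f_G)
```

Summing logs rather than multiplying densities keeps the calculation stable when the per-population terms are very large or very small.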

References

[1] Rannala B, Yang Z (August 2003). "Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci". Genetics. 164 (4): 1645–56. PMC 1462670. PMID 12930768.
[2] Felsenstein J (1981). "Evolutionary trees from DNA sequences: A maximum likelihood approach". Journal of Molecular Evolution. 17 (6): 368–376. doi:10.1007/BF01734359. PMID 7288891.
[3] Xu B, Yang Z (December 2016). "Challenges in Species Tree Estimation Under the Multispecies Coalescent Model". Genetics. 204 (4): 1353–1368. doi:10.1534/genetics.116.190173. PMC 5161269. PMID 27927902.
[4] Yang Z (2014). Molecular evolution: a statistical approach (First ed.). Oxford: Oxford University Press. Chapter 9. ISBN 9780199602605. OCLC 869346345.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
