Department of Botany and Plant Sciences, University of California at Riverside
Abstract |
---|
Introduction |
---|
McDade (1990, 1992) carried out artificial hybridization experiments among nine Central American species of the Aphelandra pulcherrima complex and obtained 17 hybrid populations. She did extensive cladistic analysis, concluding that a hybrid will be placed by phylogenetic analysis as a basal lineage to the clade that includes its most derived parent. The phylogeny of 24 inbred strains of mice inferred by Atchley and Fitch (1991, 1993) includes several strains with hybrid origins. Atchley and Fitch found that a hybrid strain is always placed close to one of its parental strains. However, the analytical tools in current use cannot generate reticulate diagrams that accurately depict a hybrid history (McDade 1995). If an analysis includes hybrids, no matter where the hybrids are placed, a cladistic method produces only divergently branching phylogenetic patterns and thus can never give the correct phylogeny. When traditional cladistic methods are applied anyway, they can give confusing and conflicting results. One typical result is that a large set of very different phylogenies will appear to be equally good (Hein 1990).
After carefully examining the cladistic behavior of hybrids, Funk (1985) provided some guidelines and methods for identifying possible hybrids in a cladistic study. Morefield (in Rieseberg and Morefield 1995) has developed a computer program, RETICLAD, that can identify hybrids based on the expectation that they will combine the characters of their parents. RETICLAD basically represents a quantification of Funk's (1985) approach to recognizing hybrids. Rieseberg and Ellstrand (1993) provide a number of examples in which the program seems to work well. However, the RETICLAD program only tests reticulate events between terminal branches; hybridization between internal branches cannot be analyzed (Rieseberg and Morefield 1995).
In this paper, I focus on some theoretical aspects of reticulate evolution and develop a method using a simple reticulate phylogeny of four (or five) taxa (including one hybrid) as an example. I demonstrate the phylogenetic method under the pure drift model, and then extend the method to fit a mutation model. A least-squares method is developed to reconstruct the reticulate phylogeny using gene frequency data. The efficacy of the method under the drift model is verified via Monte Carlo simulation experiments. Finally, the joint effect of genetic drift and mutation on analysis of reticulate phylogeny is discussed.
The Pure Drift Model |
---|
Reduction of Heterozygosity
In contrast to mutation, genetic drift consumes heterozygosity as the population evolves. Consider a finite population with an effective population size of Ne that was isolated from an infinite random mating population t generations ago. The heterozygosity of the current population, denoted by Ht, is
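$$H_t = H_0\left(1 - \frac{1}{2N_e}\right)^{t},$$
where $H_0$ is the heterozygosity of the base population.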
Population Divergence
Consider a population (B) that was isolated from a base population (A) some $t_{AB}$ generations ago. Assume that B was split into two lineages, of which one underwent $t_{BX}$ generations of drift leading to terminal taxon X and the other drifted for $t_{BY}$ generations leading to terminal taxon Y (see fig. 1). We can observe the gene frequencies of X and Y; can we infer the lengths of the three branches based on molecular data collected from X and Y? First, from the reduction in heterozygosity in X we can infer $t_{AB} + t_{BX}$ based on $H_X = H_A\left(1 - \frac{1}{2N_e}\right)^{t_{AB}+t_{BX}}$, where $H_X$ and $H_A$ are the heterozygosities of population X and its remote ancestor A, respectively ($H_A$ is usually denoted by $H_0$). Likewise, $t_{AB} + t_{BY}$ can be inferred from $H_Y = H_A\left(1 - \frac{1}{2N_e}\right)^{t_{AB}+t_{BY}}$, where $H_Y$ is the heterozygosity in Y. Second, if we know the heterozygosity of the internal node, $H_B$, we can infer $t_{AB}$ using $H_B = H_A\left(1 - \frac{1}{2N_e}\right)^{t_{AB}}$. If $N_e$ and $H_A$ are known, the above equations can be used to estimate the lengths of the three branches.
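As a numerical illustration, the following minimal sketch assumes the standard pure-drift decay, H_t = H_0(1 - 1/(2Ne))^t, computes the expected heterozygosity after a given number of generations, and then inverts the relation to recover a branch length. The function names and the numbers used are hypothetical, and Ne and H_A are taken as known, as the text requires.

```python
import math


def heterozygosity_after_drift(h0: float, ne: float, t: float) -> float:
    """Expected heterozygosity after t generations of pure drift."""
    return h0 * (1.0 - 1.0 / (2.0 * ne)) ** t


def generations_from_heterozygosity(h: float, h0: float, ne: float) -> float:
    """Invert the decay relation to obtain the number of generations of drift."""
    return math.log(h / h0) / math.log(1.0 - 1.0 / (2.0 * ne))


# Hypothetical values: H_A = 0.5, Ne = 100, and t_AB + t_BX = 60 generations.
h_x = heterozygosity_after_drift(0.5, 100, 60)
t_ab_plus_t_bx = generations_from_heterozygosity(h_x, 0.5, 100)
```

Applying the same inversion to H_Y and H_B would give t_AB + t_BY and t_AB, from which the three branch lengths follow by subtraction.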
The heterozygosity of a terminal taxon can be measured by the genetic distance of the taxon with itself,
$$D_{XX} = 1 - \sum_{i=1}^{n} x_i^2,$$
where n is the number of allelic states of a particular locus, and $x_i$ is the frequency of the ith allele for population X. Under the assumption that X is in Hardy-Weinberg equilibrium, the expectation of $D_{XX}$ equals $H_X$. The heterozygosity of a terminal taxon can thus be calculated from sampled molecular data. However, the heterozygosity of an internal node cannot be calculated in this way. We have previously shown that the heterozygosity of the population at the internal node (e.g., B) can be estimated by (Xu, Atchley, and Fitch 1994)
$$D_{XY} = 1 - \sum_{i=1}^{n} x_i y_i,$$
where $y_i$ is the frequency of the ith allele for population Y. Xu, Atchley, and Fitch (1994) proved that $E(D_{XY}) = H_A(1 - \theta_{XY})$, where $\theta_{XY}$ is the coancestry coefficient between X and Y. Because $\theta_{XY}$ approximately equals the average inbreeding coefficient of node B, we have
$$E(D_{XY}) \approx H_B = H_A\left(1 - \frac{1}{2N_e}\right)^{t_{AB}}.$$
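The within- and between-population distances can be computed directly from allele frequencies. The sketch below assumes the common gene-diversity form, D_XY = 1 - sum of x_i * y_i, for a single locus; the function name and the frequencies are hypothetical, and in practice the values would be averaged over many loci.

```python
import numpy as np


def gene_diversity_distance(x, y) -> float:
    """D_XY = 1 - sum_i x_i * y_i for allele-frequency vectors at one locus."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return 1.0 - float(np.dot(x, y))


# Hypothetical allele frequencies at one biallelic locus.
x = [0.7, 0.3]
y = [0.4, 0.6]
d_xx = gene_diversity_distance(x, x)  # within X: 1 - (0.49 + 0.09)
d_xy = gene_diversity_distance(x, y)  # between X and Y: 1 - (0.28 + 0.18)
```

Passing the same vector twice gives the within-population value, which is the Hardy-Weinberg heterozygosity of that population.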
Reticulation Under Drift
Figure 2 shows that after lineage divergence, hybridization occurred at point C in the lineage leading to X and at point D in the lineage leading to Y. The hybrid is denoted by node E, and the terminal node of the hybrid lineage is denoted by Z. Here, we assume that the hybrid contains equal contributions from the two progenitors. We also assume that the hybrid breeds true and shows no change in fitness compared with its progenitors. The hybrid lineage evolves independently of its parents (without backcrossing). Some of these assumptions may be relaxed (see Discussion).
|
![]() |
![]() |
|
![]() |
As shown earlier, HB, HC, and HD can be estimated by DXY, DXX', and DYY', respectively. The heterozygosity of node E can be estimated as
![]() |
The heterozygosity in subsequent generations will decline at the usual rate, i.e., $H_Z = H_E\left(1 - \frac{1}{2N_e}\right)^{t_{EZ}}$.
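One way to see why the hybrid node combines three heterozygosities: under the equal-contribution assumption, two alleles drawn from E both trace back to C with probability 1/4, both to D with probability 1/4, and one to each parent with probability 1/2. The sketch below encodes that reasoning. It is an illustration under these stated assumptions, not necessarily the estimator used in the paper, and the function name and input values are hypothetical.

```python
def hybrid_node_heterozygosity(h_c: float, h_d: float, h_b: float) -> float:
    """Heterozygosity at hybrid node E under equal parental contributions.

    h_c, h_d: heterozygosities of the parental nodes C and D.
    h_b: heterozygosity of their common ancestor B, taken here as the
         probability that one allele from C and one allele from D differ.
    """
    return 0.25 * h_c + 0.25 * h_d + 0.5 * h_b


# Hypothetical node heterozygosities.
h_e = hybrid_node_heterozygosity(h_c=0.40, h_d=0.36, h_b=0.45)
```

With these inputs the hybrid node is more heterozygous than either parent, which is the expected consequence of pooling two diverged gene pools.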
Expectation of Genetic Distance
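The genetic distances between nonhybrid species X and Y have the following expectations: $E(D_{XX}) = H_X = H_A\left(1 - \frac{1}{2N_e}\right)^{t_{AB}+t_{BC}+t_{CX}}$, $E(D_{YY}) = H_Y = H_A\left(1 - \frac{1}{2N_e}\right)^{t_{AB}+t_{BD}+t_{DY}}$, and $E(D_{XY}) = H_B = H_A\left(1 - \frac{1}{2N_e}\right)^{t_{AB}}$. The genetic distance of the hybrid lineage with itself has an expectation of: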
Note that by employing the approximation
![]() |
![]() |
The approximation is quite good even if Ne is small, as shown in figure 4.
and
Inferring Reticulate Phylogenies
The method is demonstrated using a hypothetical reticulate phylogeny with four taxa (see fig. 5). This phylogeny has a total of 8 branches (including the root) and 10 data points, allowing a least-squares method to be used to estimate the lengths of the branches.
![]() |
![]() |
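Define $\mathbf{y} = [y_{XX}\ y_{WW}\ y_{ZZ}\ y_{YY}\ y_{XW}\ y_{XZ}\ y_{XY}\ y_{WZ}\ y_{WY}\ y_{ZY}]^T$ as a vector of data, $\mathbf{t} = [t_{AB}\ t_{BC}\ t_{BD}\ t_{CF}\ t_{FX}\ t_{FW}\ t_{DY}\ t_{EZ}]^T$ as the lengths of branches of the phylogeny, and $\boldsymbol{\varepsilon} = [\varepsilon_{XX}\ \varepsilon_{WW}\ \varepsilon_{ZZ}\ \varepsilon_{YY}\ \varepsilon_{XW}\ \varepsilon_{XZ}\ \varepsilon_{XY}\ \varepsilon_{WZ}\ \varepsilon_{WY}\ \varepsilon_{ZY}]^T$ as a vector of residual errors. We have the following linear model:
$$\mathbf{y} = \mathbf{X}\mathbf{t} + \boldsymbol{\varepsilon},$$
where $\mathbf{X}$ is the design matrix determined by the phylogeny.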
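The least-squares estimate of t is
$$\hat{\mathbf{t}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y},$$
and the fit of the phylogeny is measured by the mean squared error,
$$\mathrm{MSE} = \frac{(\mathbf{y} - \mathbf{X}\hat{\mathbf{t}})^T(\mathbf{y} - \mathbf{X}\hat{\mathbf{t}})}{\mathrm{df}},$$
where degrees of freedom (df) = 10 - 8 = 2. Note that one degree of freedom has been lost compared with a regular bifurcating tree.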
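The least-squares step can be sketched numerically as follows. The design matrix here is a hypothetical one, written down under the assumption that the hybrid Z draws half of its alleles from C and half from D and that the resulting mixture can be linearized with fractional coefficients (1/2 and 1/4); it may not match the matrix used in the paper. The branch lengths are made-up values, and the data are generated without noise so that the fit can be checked.

```python
import numpy as np

# Hypothetical design matrix (an assumption, not taken from the paper).
# Each entry is the share of a branch counted in the expected y for a pair.
# Columns: tAB, tBC, tBD, tCF, tFX, tFW, tDY, tEZ
X = np.array([
    [1, 1.00, 0.00, 1, 1, 0, 0, 0],  # y_XX
    [1, 1.00, 0.00, 1, 0, 1, 0, 0],  # y_WW
    [1, 0.25, 0.25, 0, 0, 0, 0, 1],  # y_ZZ
    [1, 0.00, 1.00, 0, 0, 0, 1, 0],  # y_YY
    [1, 1.00, 0.00, 1, 0, 0, 0, 0],  # y_XW
    [1, 0.50, 0.00, 0, 0, 0, 0, 0],  # y_XZ
    [1, 0.00, 0.00, 0, 0, 0, 0, 0],  # y_XY
    [1, 0.50, 0.00, 0, 0, 0, 0, 0],  # y_WZ
    [1, 0.00, 0.00, 0, 0, 0, 0, 0],  # y_WY
    [1, 0.00, 0.50, 0, 0, 0, 0, 0],  # y_ZY
])

# Made-up branch lengths (in generations) and noise-free data.
t_true = np.array([20, 10, 15, 5, 8, 8, 12, 10], dtype=float)
y = X @ t_true

# Least-squares estimate and mean squared error with df = 10 - 8 = 2.
t_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ t_hat
df = X.shape[0] - X.shape[1]
mse = float(residual @ residual) / df
```

With real data the residuals would not vanish, and the MSE would serve as the criterion for comparing candidate phylogenies.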
The total number of possible reticulate phylogenies may be huge for a large number of taxa. If it is known a priori that some taxa are hybrids and the hybridization has occurred at most once in any given lineage, the total number of reticulate phylogenies may be significantly reduced such that it is possible to search for the best phylogeny (with the least MSE). Consider, for example, the four-taxon reticulate evolution model. If we know that taxon Z has a hybrid origin but we do not know where the hybridization occurred, then the total number of reticulate phylogenies is 12 (see fig. 6 ). If we know that one of the four taxa is a hybrid but we do not know which one, then the total number of possible phylogenies will increase to 4 x 12 = 48. Theoretically, all possible phylogenies must be evaluated, and the inferred phylogeny is the one that has the minimum MSE.
The Mutation Model |
---|
Consider the simple phylogeny given in figure 1. The index of gene alike between X and Y is
![]() |
In contrast to the drift model, in which yXY is proportional to the time before X and Y split, the yXY under the mutation model is proportional to the time after X and Y diverged.
I now examine the gene alike indices in the reticulate phylogeny (fig. 2). The internal node of the hybrid can be decomposed into $E = \delta C + (1 - \delta)D$, where $\delta$ is an indicator variable defined as $\delta = 1$ if an allele sampled from Z comes from C and $\delta = 0$ if the allele comes from D. With equal contributions from C and D, we have $\Pr(\delta = 1) = 1/2$. Therefore,
This approximation is made under the assumption that u + v < 1. Similarly,
Performing log transformation, we have
and
Having expressed yij as a linear function of the lengths of branches, we are ready to evaluate a given phylogeny.
Consider again the reticulate phylogeny given in figure 5. We will view this as an unrooted tree so that A is treated as a regular terminal taxon rather than as a root. There are 5 taxa and 10 pairwise measurements of gene alike indices. The number of branches involved in the phylogeny is eight, leaving 10 - 8 = 2 degrees of freedom. Define $\mathbf{y} = [y_{AX}\ y_{AW}\ y_{AZ}\ y_{AY}\ y_{XW}\ y_{XZ}\ y_{XY}\ y_{WZ}\ y_{WY}\ y_{ZY}]^T$ as the data vector, $\mathbf{t} = [t_{AB}\ t_{BC}\ t_{BD}\ t_{CF}\ t_{FX}\ t_{FW}\ t_{DY}\ t_{EZ}]^T$ as the vector of parameters (lengths of branches), and $\boldsymbol{\varepsilon} = [\varepsilon_{AX}\ \varepsilon_{AW}\ \varepsilon_{AZ}\ \varepsilon_{AY}\ \varepsilon_{XW}\ \varepsilon_{XZ}\ \varepsilon_{XY}\ \varepsilon_{WZ}\ \varepsilon_{WY}\ \varepsilon_{ZY}]^T$ as the residual errors. The phylogeny is evaluated by fitting the data to the same linear model shown in equation (16) but with X defined differently:
The same least-squares method is applied here to evaluate the tree and estimate the lengths of the branches.
Numerical Studies |
---|
Let D be the distance matrix with the taxa arranged in the order of {X W Z Y}, i.e.,
Under this setting, the effective population size is Ne = 100. Let HA = 0.50; then, the expectation of D can be obtained using equations (11)-(13):
Note that the differences between the exact and the approximated D are negligible.
The reticulate phylogeny (fig. 5 ) was then simulated under 100 independent biallelic, equally frequent loci. The genetic distances calculated from a single run of the simulation were
which are reasonably close to E(D).
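A Monte Carlo run of this kind can be sketched with a Wright-Fisher model in which 2Ne gene copies are resampled binomially each generation. The example below is a simplified illustration on the two-taxon phylogeny of figure 1 rather than the full reticulate phylogeny; the branch lengths, the random seed, and the function names are hypothetical, and the distance is assumed to take the form 1 - sum of x_i * y_i averaged over loci.

```python
import numpy as np


def drift(freqs, ne, generations, rng):
    """Wright-Fisher drift: resample 2*Ne gene copies at every locus each generation."""
    p = np.array(freqs, dtype=float)
    for _ in range(generations):
        p = rng.binomial(2 * ne, p) / (2 * ne)
    return p


def mean_distance(p, q):
    """Average over biallelic loci of 1 - sum_i x_i * y_i."""
    return float(np.mean(1.0 - (p * q + (1.0 - p) * (1.0 - q))))


rng = np.random.default_rng(1)
ne, n_loci = 100, 100
a = np.full(n_loci, 0.5)    # ancestor A: two equally frequent alleles, H_A = 0.5
b = drift(a, ne, 20, rng)   # branch A -> B (hypothetical length)
x = drift(b, ne, 30, rng)   # branch B -> X (hypothetical length)
y = drift(b, ne, 30, rng)   # branch B -> Y (hypothetical length)

d_xx = mean_distance(x, x)
d_yy = mean_distance(y, y)
d_xy = mean_distance(x, y)
```

Repeating the run with more loci should bring the simulated distances closer to their expectations, because the sampling variance of each average shrinks as loci are added.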
The genetic distances are converted into y using equation (14) under HA = 0.50 and Ne = 100. The data are then fitted to the linear model with X chosen under the true phylogeny, given in equation (17) (see fig. 5). The lengths of branches are estimated using equation (18), with results also shown in figure 5 (the values with decimal points). In general, the estimated lengths of branches are reasonably close to the true values, although some sampling errors have been observed. The MSE value of this phylogeny is 11.05 generations².
The data (y) are then fitted to each of the remaining 11 reticulate phylogenies (fig. 6) and the 15 bifurcating trees (fig. 7). The MSEs of these trees are given in table 1, showing that the true reticulate phylogeny (tree 2) does have the least MSE.
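The search over candidate phylogenies amounts to fitting the same data to each candidate design matrix and keeping the one with the smallest MSE. The sketch below shows that loop on a deliberately tiny, made-up example with two candidate matrices; the function names and matrices are hypothetical. It also includes the standard count of rooted, labeled bifurcating trees, (2n - 3)!!, which gives 15 for four taxa.

```python
import numpy as np


def mse_of_fit(design, y):
    """Least-squares fit of y = design @ t + e; returns residual SS / df (df must be > 0)."""
    t_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ t_hat
    df = design.shape[0] - np.linalg.matrix_rank(design)
    return float(resid @ resid) / df


def best_phylogeny(candidates, y):
    """Return the name of the candidate design matrix with the smallest MSE."""
    return min(candidates, key=lambda name: mse_of_fit(candidates[name], y))


def n_rooted_bifurcating_trees(n_taxa):
    """(2n - 3)!! rooted, labeled bifurcating trees for n >= 2 taxa."""
    count = 1
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


# Toy example: data generated under "tree_a" should select "tree_a".
candidates = {
    "tree_a": np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    "tree_b": np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]),
}
y = candidates["tree_a"] @ np.array([2.0, 3.0])
chosen = best_phylogeny(candidates, y)
n_trees = n_rooted_bifurcating_trees(4)
```

An exhaustive search of this kind is only practical when the candidate set is small, as it is for the 12 + 15 = 27 phylogenies considered here.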
The three topologies simulated represent three different cases: hybridization between closely related taxa, ancient hybridization, and new hybridization between distantly related taxa. The first simulated topology is reticulate phylogeny 1 (see fig. 6 ), where Z is a hybrid between X and W, which are sister taxa (recently diverged). The frequency of being chosen as the inferred phylogeny is given in table 2 for each of the 12 + 15 = 27 phylogenies considered. When the number of loci is small, the true phylogeny has a low frequency of being inferred compared with bifurcating trees 6 and 10. Note that the relationship between the hybrid and its progenitors is ((X, Z), W) in tree 6, while the relationship is (X, (Z, W)) in tree 10. As the number of loci increases, the frequency of the true phylogeny begins to dominate.
Discussion |
---|
The drift model and the mutation model are mutually exclusive. In situations in which the phylogeny was shaped by the joint forces of drift and mutation, neither model will work, because the genetic distance measured this way is a function of both t's (before and after the split of the two populations). In theory, the two models can be combined to infer phylogenies when both drift and mutation are important. Unfortunately, derivation of the drift-mutation model is complicated. Consider the phylogeny given in figure 1. The within-population gene alike indices are (Cockerham 1984)
and the between-population index is
![]() |
Note that Q* - qXY is a function of both tAB (the time before the populations split) and tBX + tBY (the time after the populations split). The complexity comes from the fact that the log of Q* - qXY is not a linear function of the t's unless
![]() |
which is still complicated because the branch lengths before and after the populations split are expressed in different scales. Alternative methods, such as parsimony and maximum likelihood, may be preferable for combining the two models; this possibility deserves further investigation.
Molecular data at the sequence level have been used to detect horizontal gene transfer, e.g., recombination within a sequence (Hein 1990, 1993; Hudson 1990; Bollyky et al. 1996; Grassly and Holmes 1997). It is not clear how useful sequence data are for inferring species hybridization. Under certain circumstances, nuclear DNA polymorphism in restriction endonucleases may be used to infer reticulate phylogenies. One of the basic assumptions when using restriction data is that the sites must be independent. This assumption may hold when the restriction sites are located far apart on the genome such that the sites freely recombine in the hybrid lineage. Alternatively, if the hybrid lineage is formed by the hybridization of a large number of individuals from each parental lineage, given a sufficient number of generations in random mating within the hybrid lineage, the sites may behave as independent, even if they are located close together. Nei and Li (1979) developed the mathematical model for studying population divergence in terms of restriction endonucleases. The proportion of sites shared by lineages X and Y, denoted by SXY, is expected to decline as X and Y further diverge. Nei and Li (1979) showed that
![]() |
where m is the number of different restriction enzymes used. The distance between X and Y is finally expressed as a linear function of the times after they diverged. The distance involving a hybrid lineage can be similarly expressed, e.g.,
The same least-squares method can be used to evaluate a reticulate phylogeny (see the mutation model in gene frequency).
Recombination, a form of reticulation at the gene level, generates the same problems as hybridization. Methods exist which try to diagnose recombination by looking at the compatibility of the "phylogenetic partition" supported by the polymorphic sites along the sequence (Drouin and Dover 1990), by looking at changes in the most parsimonious topology along sequences (Hein 1990, 1993), by using a maximum chi-square test (Maynard Smith 1992), or by using the maximum-likelihood approach to detect the specific region showing "anomalous" evolutionary patterns (Grassly and Holmes 1997). However, no general methods exist which allow the placement of a putative hybrid in the appropriate clade. Ritland and Eckenwalder (1992) developed a method to estimate both the time since hybridization and the admixture proportion. Although their treatment does not allow the evaluation of alternative topologies when the progenitors are unknown, it does allow the placement of the hybrid in the correct position relative to the two progenitors if the progenitors are known. The above theoretical works have enhanced our understanding of reticulate evolution, but they may only represent a small proportion of the work required to complete a more general approach.
To obtain the number of loci required for this method to work accurately, dominant markers such as AFLPs may have to be employed. The method proposed can handle dominant markers provided that the gene frequency within a population can be estimated using the Hardy-Weinberg law; i.e., the frequency of the recessive allele is estimated by the square root of the frequency of the recessive homozygotes. The number of allelic states per dominant locus is considered to be two. The efficiency using dominant markers would be slightly less than that observed in the biallelic codominant system (see the simulation studies section) because the gene frequency is not given, but estimated from the genotypic frequencies.
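The Hardy-Weinberg estimate for a dominant marker is simple to compute. In the minimal sketch below, the function name and the counts are hypothetical; the recessive-allele frequency is taken as the square root of the observed frequency of recessive homozygotes (for AFLPs, typically the band-absent individuals).

```python
import math


def allele_freqs_from_dominant_marker(n_recessive: int, n_total: int):
    """Hardy-Weinberg estimate of (recessive, dominant) allele frequencies."""
    q = math.sqrt(n_recessive / n_total)
    return q, 1.0 - q


# Hypothetical sample: 16 recessive homozygotes among 100 individuals.
q, p = allele_freqs_from_dominant_marker(n_recessive=16, n_total=100)
```

The resulting pair (p, q) can then be used as the two allele frequencies of a biallelic locus when computing genetic distances.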
The pure drift model may be of interest in its own right. Inbred strains of laboratory animals are valuable model organisms for studies in evolutionary biology, particularly at the molecular level (see Atchley and Fitch 1991, 1993; Fitch and Atchley 1985, 1989). The phylogeny of inbred strains is most likely driven by genetic drift, not by mutation. First, most of the inbred strains of mice could have arisen from just a few mice (Atchley and Fitch 1993). Second, most of the inbred strains are derived by systematic brother × sister mating, which represents the maximum effect of genetic drift in laboratory animals. Third, the evolutionary history of these organisms is too short for a significant mutational input (most laboratory strains of rats and mice have been inbred for less than 200 generations). However, many inbred mouse and rat strains were originally produced from hybridization between genetically divergent strains (Atchley and Fitch 1993). For instance, the SEC strain of mice was derived from hybrids between NB and BALB/c (Festing 1989), and the BS strain of rats was derived from hybrids between NZ and a wild rat (Hedrich 1990). With the theory presented in this paper and the data from inbred strains of animals with known hybrid origins, the genetic aspects of reticulation and its impact on phylogenetic inference could be studied in detail. Furthermore, the model may be readily applied to evolutionary studies of domesticated animals and agricultural cultivars.
For generality, suppose that backcrosses occurred a few times immediately after the initial hybridization event, such that the hybrid taxon ultimately inherits a proportion p of genes from parental taxon X and a proportion 1 - p of genes from parental taxon Y (see fig. 2 ). In this case, the expected genetic distances become
Estimation of branch lengths under this unidirectional gene flow scenario is still possible if p is known. Otherwise, data from more taxa are required to estimate branch lengths and p simultaneously.
A final caveat concerns the assumption of constant effective population size along all segments of the phylogeny. This assumption is not realistic, especially for the hybridized lineage. In the early stage of hybridization, the hybrid population must have experienced a sort of bottleneck and selection. There may be much reorganization of the genome and linkage disequilibrium among genes or chromosomal blocks, as nicely demonstrated in an empirical case (Rieseberg, Vanfossen, and Desrochers 1995). The robustness of the model to these effects needs to be further studied. Nonetheless, we can slightly relax the assumption of constant Ne by assuming that Ne is constant within a segment but can vary across segments. In this case, the estimated branch length for each segment is the number of generations divided by twice the effective population size corresponding to that period of time. A similar argument also holds for the assumption of constant mutation rate v across loci and across alleles within loci.
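The rescaling described above can be illustrated with a short sketch in which each segment has its own number of generations and its own effective population size. The function name and the segment values are hypothetical.

```python
def scaled_branch_length(generations: float, ne: float) -> float:
    """Branch length in drift units: generations divided by twice Ne."""
    return generations / (2.0 * ne)


# Hypothetical segments as (generations, Ne) pairs, e.g. a bottlenecked middle segment.
segments = [(40, 100), (10, 20), (50, 250)]
scaled = [scaled_branch_length(t, ne) for t, ne in segments]
total = sum(scaled)
```

In this example the short bottlenecked segment (10 generations at Ne = 20) contributes more drift than the 50 generations spent at Ne = 250, which is why the scaled lengths, not the raw generation counts, are what the method estimates.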
Acknowledgements |
---|
Footnotes |
---|
1 Keywords: genetic drift, heterozygosity, hybridization, mutation, phylogeny, reticulation
2 Address for correspondence and reprints: Shizhong Xu, Department of Botany and Plant Sciences, University of California, Riverside, California 92521. E-mail: xu{at}genetics.ucr.edu
Literature Cited |
---|
Atchley, W. R., and W. M. Fitch. 1991. Gene trees and the origins of inbred strains of mice. Science 254:554-558.
Atchley, W. R., and W. M. Fitch. 1993. Genetic affinities of inbred mouse strains of uncertain origin. Mol. Biol. Evol. 10:1150-1169.
Bollyky, P. L., A. Rambaut, P. H. Harvey, and E. C. Holmes. 1996. Recombination between sequences of hepatitis B virus from different genotypes. J. Mol. Evol. 42:97-102.
Cavalli-Sforza, L. L., and A. W. F. Edwards. 1967. Phylogenetic analysis: models and estimation procedures. Am. J. Hum. Genet. 19:233-257.
Cockerham, C. C. 1984. Drift and mutation with a finite number of allelic states. Proc. Natl. Acad. Sci. USA 81:530-534.
Drouin, G., and G. A. Dover. 1990. Independent gene evolution in the potato actin gene family demonstrated by phylogenetic procedure for resolving gene conversions and the phylogeny of angiosperm actin genes. J. Mol. Evol. 31:132-150.
Festing, M. F. W. 1989. Inbred strains of mice. Pp. 636-648 in M. F. Lyon and A. G. Searle, eds. Genetic variants and inbred strains of mice. Oxford University Press, New York.
Fitch, W. M., and W. R. Atchley. 1985. Evolution in inbred strains of mice appears rapid. Science 228:1169-1175.
Fitch, W. M., and W. R. Atchley. 1989. Divergence in inbred strains of mice: a comparison of three different types of data. Pp. 203-216 in C. Patterson, ed. Molecules and morphology in evolution: conflict or compromise? Cambridge University Press, London.
Funk, V. A. 1985. Phylogenetic patterns and hybridization. Ann. Mo. Bot. Gard. 72:681-715.
Grassly, N. C., and E. C. Holmes. 1997. A likelihood method for the detection of selection and recombination using nucleotide sequences. Mol. Biol. Evol. 14:239-247.
Hedrich, H. J. 1990. Genetic monitoring of inbred strains of rats. Gustav Fischer Verlag, Stuttgart.
Hein, J. 1990. Reconstructing evolution of sequences subject to recombination using parsimony. Math. Biosci. 98:185-200.
Hein, J. 1993. A heuristic method to reconstruct the history of sequences subject to recombination. J. Mol. Evol. 36:396-405.
Hudson, R. R. 1990. Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7:1-44.
McDade, L. 1990. Hybrids and phylogenetic systematics. I. Patterns of character expression in hybrids and their implications for cladistic analysis. Evolution 44:1685-1700.
McDade, L. 1992. Hybrids and phylogenetic systematics. II. The impact of hybrids on cladistic analysis. Evolution 46:1329-1346.
McDade, L. 1995. Hybridization and phylogenetics. Pp. 305-331 in P. C. Hoch and A. G. Stephenson, eds. Experimental and molecular approaches to plant biosystematics. Monographs in Systematic Botany from the Missouri Botanical Garden.
Maynard Smith, J. 1992. Analyzing the mosaic structure of genes. J. Mol. Evol. 34:126-129.
Nei, M., and W.-H. Li. 1979. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc. Natl. Acad. Sci. USA 76:5269-5273.
Rieseberg, L. H. 1991. Homoploid reticulate evolution in Helianthus (Asteraceae): evidence from ribosomal genes. Am. J. Bot. 78:1218-1237.
Rieseberg, L. H., and N. C. Ellstrand. 1993. What can molecular and morphological markers tell us about plant hybridization? Crit. Rev. Plant Sci. 12:213-241.
Rieseberg, L. H., and J. D. Morefield. 1995. Character expression, phylogenetic reconstruction, and the detection of reticulate evolution. Pp. 333-353 in P. C. Hoch and A. G. Stephenson, eds. Experimental and molecular approaches to plant biosystematics. Monographs in Systematic Botany from the Missouri Botanical Garden.
Rieseberg, L. H., C. Vanfossen, and A. M. Desrochers. 1995. Hybrid speciation accompanied by genomic reorganization in wild sunflowers. Nature 375:313-316.
Rieseberg, L. H., J. Whitton, and C. R. Linder. 1996. Molecular marker incongruence in plant hybrid zones and phylogenetic trees. Acta Bot. Neerl. 45:143-262.
Ritland, K., and J. E. Eckenwalder. 1992. Polymorphism, hybridization, and variable evolutionary rate in molecular phylogenies. Pp. 404-429 in D. E. Soltis, P. S. Soltis, and J. J. Doyle, eds. Molecular systematics of plants. Chapman and Hall, New York.
Spence, J. R. 1990. Introgressive hybridization in Heteroptera: the example of Limnoporus Stal (Gerridae) species in western Canada. Can. J. Zool. 68:1770-1782.
Sytsma, K. J. 1990. DNA and morphology: inference of plant phylogeny. TREE 5:104-110.
Xu, S., and W. R. Atchley. 1995. Heterozygosity of F2 from two segregating populations. J. Hered. 86:477-480.
Xu, S., W. R. Atchley, and W. M. Fitch. 1994. Phylogenetic inference under the pure drift model. Mol. Biol. Evol. 11:949-960.