Phylogenetic Analysis Under Reticulate Evolution

Shizhong Xu

Department of Botany and Plant Sciences, University of California at Riverside


    Abstract
 
The usual assumption that species have evolved from a common ancestor by a simple branching process—where each branch is genetically isolated—has been challenged by the observation of frequent hybridization between species in natural populations. In fact, most plant species are thought to have hybrid origins. This reticulate pattern of species evolution has posed problems in the definition of speciation and in phylogenetic reconstruction, especially when molecular data are used. As a result, hybridization has been largely treated as an evolutionary accident or statistical error in phylogenetic analysis. In this paper, I explicitly incorporate hybridization as an evolutionary occurrence and then conduct phylogenetic reconstruction. I first examine the reticulate evolution under a pure drift model, and then extend the theory to fit a mutation model. A least-squares method is developed for reconstructing a reticulate phylogeny using gene frequency data. The efficacy of the method under the pure drift model is verified via Monte Carlo simulations.


    Introduction
 
Species are often assumed to have evolved from a common ancestor by a complete process of branching, followed by complete genetic isolation (Cavalli-Sforza and Edwards 1967). A period of hybridization shortly after speciation is not thought to weaken this assumption, provided the period is short relative to the periods between successive speciation events. However, species may hybridize long after speciation (Ritland and Eckenwalder 1992), which may pose problems in phylogenetic reconstruction, especially when molecular data are used (Spence 1990). In fact, many plant species are thought to have hybrid origins. The discovery of cytoplasmic introgression and the nonconcordance between the rDNA and cpDNA phylogenies of several plant groups reflect past hybridization and subsequent introgression (Sytsma 1990; Rieseberg 1991; Rieseberg, Whitton, and Linder 1996).

McDade (1990, 1992) carried out artificial hybridization experiments among nine Central American species of the Aphelandra pulcherrima complex and obtained 17 hybrid populations. Her extensive cladistic analysis led to the conclusion that a hybrid will be placed by phylogenetic analysis as a basal lineage to the clade that includes its most derived parent. The phylogeny of 24 inbred strains of mice inferred by Atchley and Fitch (1991, 1993) includes several strains with hybrid origins. Atchley and Fitch found that a hybrid strain is always placed close to one of its parental strains. However, the analytical tools in current use cannot generate reticulate diagrams that accurately depict a hybrid history (McDade 1995). If an analysis includes hybrids, no matter where the hybrids are placed, a cladistic method produces only divergently branching phylogenetic patterns and thus can never give the correct phylogeny. When traditional cladistic methods are applied anyway, they can give confusing and conflicting results. One typical result is that a large set of very different phylogenies will appear to be equally good (Hein 1990).

After carefully examining the cladistic behavior of hybrids, Funk (1985) provided some guidelines and methods for identifying possible hybrids in a cladistic study. Morefield (in Rieseberg and Morefield 1995) has developed a computer program, RETICLAD, that can identify hybrids based on the expectation that they will combine the characters of their parents. RETICLAD basically represents a quantification of Funk's (1985) approach to recognizing hybrids. Rieseberg and Ellstrand (1993) provide a number of examples in which the program seems to work well. However, the RETICLAD program only tests reticulate events between terminal branches; hybridization between internal branches cannot be analyzed (Rieseberg and Morefield 1995).

In this paper, I focus on some theoretical aspects of reticulate evolution and develop a method using a simple reticulate phylogeny of four (or five) taxa (including one hybrid) as an example. I demonstrate the phylogenetic method under the pure drift model, and then extend the method to fit a mutation model. A least-squares method is developed to reconstruct the reticulate phylogeny using gene frequency data. The efficacy of the method under the drift model is verified via Monte Carlo simulation experiments. Finally, the joint effect of genetic drift and mutation on analysis of reticulate phylogeny is discussed.


    The Pure Drift Model
 
In the absence of selection, mutation, and migration, the genetic composition of a finite population will change over time in a random fashion. The heterozygosity within a single population, however, is a monotonically decreasing function of the time since the population was isolated from the base population. We have demonstrated that heterozygosities can be used to infer phylogenetic relationships among populations (Xu, Atchley, and Fitch 1994).

Reduction of Heterozygosity
In contrast to mutation, genetic drift consumes heterozygosity as the population evolves. Consider a finite population with an effective population size of $N_e$ that was isolated from an infinite random mating population $t$ generations ago. The heterozygosity of the current population, denoted by $H_t$, is

$$H_t = H_0\gamma^{t} \tag{1}$$

where $H_0$ is the heterozygosity of the base population and $\gamma = 1 - 1/(2N_e)$ is the factor by which heterozygosity is reduced each generation. Equation (1) implies that if the loss of heterozygosity is known, one can infer the time since the population was isolated from the base population.
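
As a small numerical illustration of equation (1), the sketch below computes the expected heterozygosity after a given number of generations and then inverts the relationship to recover the time since isolation. The function names and the example values ($H_0 = 0.5$, $N_e = 100$, $t = 50$) are chosen here for illustration only.

```python
import math


def drift_factor(ne):
    """Per-generation factor gamma = 1 - 1/(2*Ne)."""
    return 1.0 - 1.0 / (2.0 * ne)


def heterozygosity_after(h0, ne, t):
    """Expected heterozygosity after t generations of pure drift: H_t = H_0 * gamma**t."""
    return h0 * drift_factor(ne) ** t


def generations_since_isolation(ht, h0, ne):
    """Invert H_t = H_0 * gamma**t to recover t from the loss of heterozygosity."""
    return math.log(ht / h0) / math.log(drift_factor(ne))


# Illustrative values (assumptions, not taken from a data set).
h50 = heterozygosity_after(0.5, 100, 50)
t_back = generations_since_isolation(h50, 0.5, 100)
print(round(h50, 4), round(t_back, 6))
```

The inversion requires that both $H_0$ and $N_e$ be known, which is the same condition stated in the text.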

Population Divergence
Consider a population (B) that was isolated from a base population (A) some $t_{AB}$ generations ago. Assume that B was split into two lineages, of which one underwent $t_{BX}$ generations of drift leading to terminal taxon X and the other drifted for $t_{BY}$ generations leading to terminal taxon Y (see fig. 1). We can observe the gene frequencies of X and Y; can we infer the lengths of the three branches based on molecular data collected from X and Y? First, from the reduction in heterozygosity in X we can infer $t_{AB} + t_{BX}$ based on $H_X = H_A\gamma^{t_{AB}+t_{BX}}$, where $H_X$ and $H_A$ are the heterozygosities of population X and its remote ancestor A, respectively ($H_A$ is usually denoted by $H_0$). Likewise, $t_{AB} + t_{BY}$ can be inferred from $H_Y = H_A\gamma^{t_{AB}+t_{BY}}$, where $H_Y$ is the heterozygosity in Y. Second, if we know the heterozygosity of the internal node, $H_B$, we can infer $t_{AB}$ using $H_B = H_A\gamma^{t_{AB}}$. If $N_e$ and $H_A$ are known, the above equations can be used to estimate the lengths of the three branches.



Fig. 1.—The simple rooted phylogeny used as an example in the text, where A is the base population, B is an internal node, X and Y are two terminal populations, and tij is the length of the branch between nodes i and j.

 
Heterozygosity and Genetic Distance
The estimated heterozygosity of terminal node X is given by

$$D_{XX} = 1 - \sum_{i=1}^{n} x_i^2 \tag{2}$$

where $n$ is the number of allelic states of a particular locus, and $x_i$ is the frequency of the $i$th allele for population X. Under the assumption that X is in Hardy-Weinberg equilibrium, the expectation of $D_{XX}$ equals $H_X$. The heterozygosity of a terminal taxon can be calculated by sampling molecular data. However, heterozygosity of an internal node cannot be so calculated. We have previously shown that heterozygosity of the population at the internal node (e.g., B) can be estimated by (Xu, Atchley, and Fitch 1994)

$$D_{XY} = 1 - \sum_{i=1}^{n} x_i y_i \tag{3}$$

where $y_i$ is the frequency of the $i$th allele for population Y. Xu, Atchley, and Fitch (1994) proved that $E(D_{XY}) = H_A(1 - \theta_{XY})$, where $\theta_{XY}$ is the coancestry coefficient between X and Y. Because $\theta_{XY}$ approximately equals the average inbreeding coefficient of node B, we have

$$E(D_{XY}) \approx H_A(1 - F_B) = H_B \tag{4}$$

where $F_B$ is the inbreeding coefficient of node B. Although equation (3) is strictly an estimate of the heterozygosity of the population two generations after the split from node B, we follow the convention adopted by Xu, Atchley, and Fitch (1994) and treat it as an estimate of node B itself. For population sizes normally observed in natural populations, the reduction of heterozygosity over two generations is negligible, and thus $H_B$ can be approximated by $D_{XY}$. With multiple loci, the overall estimated heterozygosity $D_{XY}$ takes the average of all locus-specific genetic distances. Note that under the pure drift model, $E(D_{XY}) \approx H_B$ is a function of $t_{AB}$ but not a function of $t_{BX}$ and $t_{BY}$ (see fig. 1).
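
The distance calculation described above can be sketched in a few lines. The code assumes the forms $D_{XX} = 1 - \sum_i x_i^2$ and $D_{XY} = 1 - \sum_i x_i y_i$, which follow from the gene alike indices defined later in the paper, and averages over loci as the text describes. The function names and the allele frequencies are hypothetical.

```python
def genetic_distance(x, y):
    """D_XY = 1 - sum_i x_i * y_i for one locus; passing x twice gives D_XX."""
    if len(x) != len(y):
        raise ValueError("both populations need the same allelic states")
    return 1.0 - sum(a * b for a, b in zip(x, y))


def mean_distance(loci_x, loci_y):
    """Average the locus-specific distances over all loci."""
    per_locus = [genetic_distance(x, y) for x, y in zip(loci_x, loci_y)]
    return sum(per_locus) / len(per_locus)


# Hypothetical allele frequencies at two biallelic loci.
pop_x = [[0.5, 0.5], [1.0, 0.0]]
pop_y = [[0.5, 0.5], [0.0, 1.0]]

d_xx = mean_distance(pop_x, pop_x)  # estimates H_X
d_xy = mean_distance(pop_x, pop_y)  # estimates H_B, the internal node
print(d_xx, d_xy)
```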

Reticulation Under Drift
Figure 2 shows that after lineage divergence, hybridization occurred at point C in the lineage leading to X and at point D in the lineage leading to Y. The hybrid is denoted by node E, and the terminal node of the hybrid lineage is denoted by Z. Here, we assume that the hybrid contains equal contributions from the two progenitors. We also assume that the hybrid breeds true and shows no change in fitness compared with its progenitors. The hybrid lineage evolves independently of its parents (without backcrossing). Some of these assumptions may be relaxed (see Discussion).



Fig. 2.—The simple rooted reticulate phylogeny used as an example in the text, where a group of individuals from node C of lineage X hybridized with a group of individuals from node D of lineage Y to form a hybridized lineage Z.

 
It should be noted that hybridization between lineages X and Y in the past does not affect the genetic distance between X and Y (assuming no horizontal gene transfer). This allows the heterozygosity of node B to still be estimated by DXY. Figure 3 gives the reticulate phylogeny where we assume two hypothetical lineages diverged at the time when hybridization occurred. One lineage was isolated from node C and designated taxon X', while the other lineage was isolated from node D and designated taxon Y'. Recall that the genetic distance between X and Z is defined as the probability that a random gene sampled from Z and a random gene from X have different allelic states. The random gene drawn from Z has a probability of 1/2 of coming from lineage Y and a probability of 1/2 of coming from lineage X. If the sampled gene comes from lineage Y, then DXZ = DXY; otherwise, DXZ = DXX', where DXX' is the genetic distance between X and the hypothetical taxon X'. Therefore, DXZ = (DXX' + DXY)/2. Similarly, DYZ = (DYY' + DXY)/2. From these two equations, we obtain

$$D_{XX'} = 2D_{XZ} - D_{XY} \tag{5}$$

and

$$D_{YY'} = 2D_{YZ} - D_{XY} \tag{6}$$

These two genetic distances are important because DXX' and DYY' reflect the heterozygosities of internal nodes C and D, respectively, which are required for estimating tBC and tBD.



Fig. 3.—The reticulate phylogeny used as an example in the text, where X' and Y' are two hypothetical taxa

 
Recovery of Heterozygosity in the Hybrid Lineage
The hybrid population will regain some heterozygosity compared with its parents. Denote the first generation hybrid by F1 and the second generation by F2 (node E in fig. 3). The expected heterozygosity of F1 is the probability that a random gene drawn from node C and a random gene from node D have different allelic states. As discussed earlier, this probability is the heterozygosity of the population two generations after node B. Therefore, the heterozygosity of the F1 hybrids is immediately recovered to the level at the time that the split just occurred. However, much of the regained heterozygosity will be lost in the next generation (F2). The amount of loss depends on $H_C$ and $H_D$. Xu and Atchley (1995) showed that the expected heterozygosity in F2 (node E) is

$$H_E = \tfrac{1}{2}H_B + \tfrac{1}{4}(H_C + H_D) \tag{7}$$

provided that the effective population size is not too small, e.g., $N_e > 50$. If the hybridization occurs long after divergence between C and D, $H_C$ and $H_D$ will be small, leading to $H_E = H_B/2$; i.e., half of the heterozygosity will be lost. On the other hand, if hybridization occurs shortly after divergence between C and D, then $H_C$ and $H_D$ will be close to $H_B$, which leads to $H_E = H_B$; i.e., there is no loss in heterozygosity.

As shown earlier, $H_B$, $H_C$, and $H_D$ can be estimated by $D_{XY}$, $D_{XX'}$, and $D_{YY'}$, respectively. The heterozygosity of node E can be estimated as

$$\hat{H}_E = \tfrac{1}{2}D_{XY} + \tfrac{1}{4}(D_{XX'} + D_{YY'}) \tag{8}$$

The heterozygosity in subsequent generations will decline at the usual rate, i.e., $H_Z = H_E\gamma^{t_{EZ}}$.
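
The two steps above, recovering the distances to the hypothetical taxa X' and Y' and then estimating the heterozygosity of the hybrid node, can be sketched as follows. The code assumes $D_{XX'} = 2D_{XZ} - D_{XY}$ and $D_{YY'} = 2D_{YZ} - D_{XY}$ (obtained by solving the two averaging relations in the text) and an F2 heterozygosity of the form $H_E = H_B/2 + (H_C + H_D)/4$, which reproduces both limiting cases described above. The distance values are hypothetical.

```python
def hypothetical_parent_distances(d_xy, d_xz, d_yz):
    """Solve D_XZ = (D_XX' + D_XY)/2 and D_YZ = (D_YY' + D_XY)/2 for D_XX', D_YY'."""
    d_xx_prime = 2.0 * d_xz - d_xy
    d_yy_prime = 2.0 * d_yz - d_xy
    return d_xx_prime, d_yy_prime


def hybrid_node_heterozygosity(d_xy, d_xz, d_yz):
    """Estimate H_E as D_XY/2 + (D_XX' + D_YY')/4 (assumed form, see lead-in)."""
    d_xx_prime, d_yy_prime = hypothetical_parent_distances(d_xy, d_xz, d_yz)
    return 0.5 * d_xy + 0.25 * (d_xx_prime + d_yy_prime)


# Hypothetical observed distances among X, Y, and the hybrid Z.
d_xy, d_xz, d_yz = 0.40, 0.35, 0.36
h_c, h_d = hypothetical_parent_distances(d_xy, d_xz, d_yz)
h_e = hybrid_node_heterozygosity(d_xy, d_xz, d_yz)
print(round(h_c, 3), round(h_d, 3), round(h_e, 3))
```

Under these assumed forms the estimate simplifies algebraically to $(D_{XZ} + D_{YZ})/2$, so the hybrid node's heterozygosity depends only on the two distances between the hybrid and its parental lineages.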

Expectation of Genetic Distance
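The genetic distances between nonhybrid species X and Y have the following expectations: $E(D_{XX}) = H_X = H_A\gamma^{t_{AB}+t_{BC}+t_{CX}}$, $E(D_{YY}) = H_Y = H_A\gamma^{t_{AB}+t_{BD}+t_{DY}}$, and $E(D_{XY}) = H_B = H_A\gamma^{t_{AB}}$. The genetic distance of the hybrid lineage with itself has an expectation of

$$E(D_{ZZ}) = H_Z = H_A\left(\tfrac{1}{2}\gamma^{t_{AB}} + \tfrac{1}{4}\gamma^{t_{AB}+t_{BC}} + \tfrac{1}{4}\gamma^{t_{AB}+t_{BD}}\right)\gamma^{t_{EZ}} \tag{9}$$
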
Note that by employing the approximation

for small $d$, the terms in the parentheses of the above equation can be approximated by $\gamma^{t_{AB}+(t_{BC}+t_{BD})/4}$ for large $N_e$. Therefore,

$$E(D_{ZZ}) \approx H_A\gamma^{t_{AB}+(t_{BC}+t_{BD})/4+t_{EZ}} \tag{11}$$

The approximation is quite good even if $N_e$ is small, as shown in figure 4.
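
The comparison plotted in figure 4 can be reproduced numerically with a short sketch. It assumes the exact expectation $H_A(\gamma^{t_{AB}}/2 + \gamma^{t_{AB}+t_{BC}}/4 + \gamma^{t_{AB}+t_{BD}}/4)\gamma^{t_{EZ}}$ and the approximation $H_A\gamma^{t_{AB}+(t_{BC}+t_{BD})/4+t_{EZ}}$, using the settings in the figure caption ($H_A = 0.5$ and all four branches equal to 50 generations). The function names are illustrative.

```python
def hz_exact(ha, ne, t_ab, t_bc, t_bd, t_ez):
    """Exact expected heterozygosity of the hybrid terminal taxon Z (assumed form)."""
    g = 1.0 - 1.0 / (2.0 * ne)
    mix = 0.5 * g ** t_ab + 0.25 * g ** (t_ab + t_bc) + 0.25 * g ** (t_ab + t_bd)
    return ha * mix * g ** t_ez


def hz_approx(ha, ne, t_ab, t_bc, t_bd, t_ez):
    """Approximation: a single power of gamma with exponent t_AB + (t_BC + t_BD)/4 + t_EZ."""
    g = 1.0 - 1.0 / (2.0 * ne)
    return ha * g ** (t_ab + (t_bc + t_bd) / 4.0 + t_ez)


# Settings from the caption of figure 4; the grid of Ne values is an assumption.
for ne in (10, 25, 50, 100, 500):
    exact = hz_exact(0.5, ne, 50, 50, 50, 50)
    approx = hz_approx(0.5, ne, 50, 50, 50, 50)
    print(ne, round(exact, 4), round(approx, 4))
```

Because a weighted arithmetic mean is never smaller than the corresponding weighted geometric mean, the approximation in this form can only underestimate the exact value.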



Fig. 4.—Comparison of the approximate (dotted line) with the exact (solid line) heterozygosity of the hybrid terminal taxon Z as a function of Ne under HA = 0.5 and tAB = tBC = tBD = tEZ = 50 generations

 
Expectations of the genetic distances of the hybrid lineage from its parental species are


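$$E(D_{XZ}) = \tfrac{1}{2}(H_B + H_C) = \tfrac{1}{2}H_A\gamma^{t_{AB}}\left(1 + \gamma^{t_{BC}}\right) \approx H_A\gamma^{t_{AB}+t_{BC}/2} \tag{12}$$

and

$$E(D_{YZ}) = \tfrac{1}{2}(H_B + H_D) = \tfrac{1}{2}H_A\gamma^{t_{AB}}\left(1 + \gamma^{t_{BD}}\right) \approx H_A\gamma^{t_{AB}+t_{BD}/2} \tag{13}$$
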
Inferring Reticulate Phylogenies
The method is demonstrated using a hypothetical reticulate phylogeny with four taxa (see fig. 5). This phylogeny has a total of 8 branches (including the root) and 10 data points, allowing a least-squares method to be used to estimate the lengths of the branches.



Fig. 5.—The reticulate phylogeny discussed in the text showing both the actual lengths of branches (integer values in parentheses) and the estimated branch lengths (values with floating points)

 
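First, define

$$y_{ij} = \frac{\ln(D_{ij}) - \ln(H_A)}{\ln\gamma} \tag{14}$$

where $i$ and $j$ represent the terminal taxa. When $N_e$ is large, $\ln\gamma \approx -1/(2N_e)$, leading to

$$y_{ij} \approx -2N_e\left[\ln(D_{ij}) - \ln(H_A)\right] \tag{15}$$
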
Note that HA can be set at an arbitrary value without affecting the lengths of the tree branches except the length of the root (tAB). The value of Ne is normally unknown, but the choice of Ne does not affect the relative relationships of the estimated branch lengths. Because HA and Ne are common to all yij, they can be ignored in the data analysis without affecting the estimation of relative branch lengths. Therefore, we take yij = -ln(Dij), which can be interpreted as a measure of genetic similarity between taxa i and j. This similarity reflects the time during which taxa i and j share the same evolutionary pathways.

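Define $y = [y_{XX}\; y_{WW}\; y_{ZZ}\; y_{YY}\; y_{XW}\; y_{XZ}\; y_{XY}\; y_{WZ}\; y_{WY}\; y_{ZY}]^T$ as a vector of data, $t = [t_{AB}\; t_{BC}\; t_{BD}\; t_{CF}\; t_{FX}\; t_{FW}\; t_{DY}\; t_{EZ}]^T$ as the lengths of branches of the phylogeny, and $\varepsilon = [\varepsilon_{XX}\; \varepsilon_{WW}\; \varepsilon_{ZZ}\; \varepsilon_{YY}\; \varepsilon_{XW}\; \varepsilon_{XZ}\; \varepsilon_{XY}\; \varepsilon_{WZ}\; \varepsilon_{WY}\; \varepsilon_{ZY}]^T$ as a vector of residual errors. We have the following linear model:

$$y = Xt + \varepsilon \tag{16}$$

where $X$ is the design matrix determined by the tree topology

$$X = \begin{bmatrix}
1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 1/4 & 1/4 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 1/2 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1/2 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1/2 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} \tag{17}$$

The least-squares estimate of $t$ is

$$\hat{t} = (X^TX)^{-1}X^Ty \tag{18}$$

with a mean squared error of

$$\mathrm{MSE} = \frac{(y - X\hat{t})^T(y - X\hat{t})}{df} \tag{19}$$

where degrees of freedom (df) = 10 - 8 = 2. Note that one degree of freedom has been lost compared with a regular bifurcating tree.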
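
The least-squares fit can be sketched in plain Python. The design matrix below is an assumption: it was derived here from the drift expectations for the topology of figure 5 (coefficients of 1/4 and 1/2 on the branches leading to the hybridization points), with rows ordered as in $y$ and columns as in $t$. The normal equations are solved by Gaussian elimination, and the data are generated without error from the branch lengths used in the Numerical Studies section, so the fit should return those lengths.

```python
# Columns: tAB, tBC, tBD, tCF, tFX, tFW, tDY, tEZ (assumed design, see lead-in).
DESIGN = [
    [1, 1.00, 0.00, 1, 1, 0, 0, 0],  # yXX
    [1, 1.00, 0.00, 1, 0, 1, 0, 0],  # yWW
    [1, 0.25, 0.25, 0, 0, 0, 0, 1],  # yZZ
    [1, 0.00, 1.00, 0, 0, 0, 1, 0],  # yYY
    [1, 1.00, 0.00, 1, 0, 0, 0, 0],  # yXW
    [1, 0.50, 0.00, 0, 0, 0, 0, 0],  # yXZ
    [1, 0.00, 0.00, 0, 0, 0, 0, 0],  # yXY
    [1, 0.50, 0.00, 0, 0, 0, 0, 0],  # yWZ
    [1, 0.00, 0.00, 0, 0, 0, 0, 0],  # yWY
    [1, 0.00, 0.50, 0, 0, 0, 0, 0],  # yZY
]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def least_squares(design, y):
    """Return (t_hat, MSE) with t_hat = (X'X)^-1 X'y and MSE = SSE / (rows - cols)."""
    rows, cols = len(design), len(design[0])
    xtx = [[sum(design[k][i] * design[k][j] for k in range(rows))
            for j in range(cols)] for i in range(cols)]
    xty = [sum(design[k][i] * y[k] for k in range(rows)) for i in range(cols)]
    t_hat = solve(xtx, xty)
    fitted = [sum(design[k][j] * t_hat[j] for j in range(cols)) for k in range(rows)]
    sse = sum((y[k] - fitted[k]) ** 2 for k in range(rows))
    return t_hat, sse / (rows - cols)


true_t = [50, 50, 50, 50, 50, 50, 100, 100]
y = [sum(c * t for c, t in zip(row, true_t)) for row in DESIGN]  # error-free data
t_hat, mse = least_squares(DESIGN, y)
print([round(v, 3) for v in t_hat], round(mse, 6))
```

Evaluating a different candidate phylogeny only requires swapping in the design matrix for that topology; the fitting routine is unchanged.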

The total number of possible reticulate phylogenies may be huge for a large number of taxa. If it is known a priori that some taxa are hybrids and the hybridization has occurred at most once in any given lineage, the total number of reticulate phylogenies may be significantly reduced such that it is possible to search for the best phylogeny (with the least MSE). Consider, for example, the four-taxon reticulate evolution model. If we know that taxon Z has a hybrid origin but we do not know where the hybridization occurred, then the total number of reticulate phylogenies is 12 (see fig. 6). If we know that one of the four taxa is a hybrid but we do not know which one, then the total number of possible phylogenies will increase to 4 × 12 = 48. Theoretically, all possible phylogenies must be evaluated, and the inferred phylogeny is the one that has the minimum MSE.



Fig. 6.—The 12 possible rooted reticulate phylogenies for four taxa with Z as the hybrid.

 

    The Mutation Model
 
The mutation model was developed according to the theory formulated by Cockerham (1984), whereby the number of allelic states per locus ($n$) is assumed to be finite. For simplicity, I also assumed an equal mutation rate ($v$) of any allele to any other specific allele so that the overall mutation rate is $u = (n - 1)v$. Cockerham's (1984) gene alike index is used for the derivation. The gene alike index within population X is defined as the probability of a random pair of alleles being alike (identical by state) and is estimated by $Q_X = 1 - D_{XX} = \sum_{i=1}^{n} x_i^2$. Similarly, the gene alike index between populations X and Y is defined as $q_{XY} = 1 - D_{XY} = \sum_{i=1}^{n} x_i y_i$. The phylogeny is assumed to have started with an equilibrium gene alike value in the ancestral population before the taxa diverged. This assumption implies that the gene alike index within each population and each internal node is a constant; i.e., $Q_i = Q^*$ for all $i$, so that only $q_{XY}$ is informative for phylogenetic inference. Cockerham (1984) provided the equilibrium gene alike value $Q^* \approx (1 + 4N_e v)/(1 + 4N_e n v)$. The assumption of equilibrium $Q_X$ does not imply constant allelic frequencies over time. In fact, the allelic frequencies must change from time to time so that the gene alike index between populations also varies over time.

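Consider the simple phylogeny given in figure 1. The index of gene alike between X and Y is

$$q_{XY} = q^* - (q^* - Q^*)\alpha^{t_{BX}+t_{BY}} \tag{20}$$

where $\alpha = 1 - u - v$, and $q^* \approx 1/n$ is the equilibrium value of $q_{XY}$ (Cockerham 1984). Rearranging equation (20), we have $(q^* - q_{XY}) = (q^* - Q^*)\alpha^{t_{BX}+t_{BY}}$, which leads to

$$y_{XY} = \frac{1}{\ln\alpha}\ln\left(\frac{q^* - q_{XY}}{q^* - Q^*}\right) = t_{BX} + t_{BY} \tag{21}$$
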
In contrast to the drift model, in which yXY is proportional to the time before X and Y split, the yXY under the mutation model is proportional to the time after X and Y diverged.
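
The transformation under the mutation model can be checked numerically. The sketch assumes the relation $(q^* - q_{XY}) = (q^* - Q^*)\alpha^{t_{BX}+t_{BY}}$ stated above, with $Q^* \approx (1 + 4N_e v)/(1 + 4N_e n v)$, $q^* \approx 1/n$, and $\alpha = 1 - u - v$. The function names and parameter values ($N_e = 1000$, $n = 4$, $v = 10^{-5}$) are hypothetical.

```python
import math


def equilibrium_indices(ne, n, v):
    """Return (Q*, q*): equilibrium gene alike values within and between populations."""
    big_q = (1.0 + 4.0 * ne * v) / (1.0 + 4.0 * ne * n * v)
    return big_q, 1.0 / n


def between_index(t_total, ne, n, v):
    """q_XY after a total of t_total = t_BX + t_BY generations of divergence."""
    alpha = 1.0 - (n - 1) * v - v  # alpha = 1 - u - v with u = (n - 1) v
    big_q, q_star = equilibrium_indices(ne, n, v)
    return q_star - (q_star - big_q) * alpha ** t_total


def divergence_time(q_xy, ne, n, v):
    """y_XY = ln[(q* - q_XY) / (q* - Q*)] / ln(alpha), which recovers t_BX + t_BY."""
    alpha = 1.0 - (n - 1) * v - v
    big_q, q_star = equilibrium_indices(ne, n, v)
    return math.log((q_star - q_xy) / (q_star - big_q)) / math.log(alpha)


q_xy = between_index(200, 1000, 4, 1e-5)
y_xy = divergence_time(q_xy, 1000, 4, 1e-5)
print(round(q_xy, 4), round(y_xy, 4))
```

The ratio inside the logarithm is positive even though both differences are negative, which is why the transform is written as a single log of a ratio.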

I now examine the gene alike indices in the reticulate phylogeny (fig. 2). The internal node of the hybrid can be decomposed into $E = \delta C + (1 - \delta)D$, where $\delta$ is an indicator variable defined as $\delta = 1$ if an allele sampled from Z comes from C and $\delta = 0$ if the allele comes from D. With equal contributions from C and D, we have $\Pr(\delta = 1) = 1/2$. Therefore,


This approximation is made under the assumption that u + v < 1. Similarly,


Performing log transformation, we have


and


Having expressed yij as a linear function of the lengths of branches, we are ready to evaluate a given phylogeny.

Consider again the reticulate phylogeny given in figure 5. We will view this as an unrooted tree so that A is treated as a regular terminal taxon rather than as a root. There are 5 taxa and 10 pairwise measurements of gene alike indices. The number of branches involved in the phylogeny is eight, leaving 10 - 8 = 2 degrees of freedom. Define $y = [y_{AX}\; y_{AW}\; y_{AZ}\; y_{AY}\; y_{XW}\; y_{XZ}\; y_{XY}\; y_{WZ}\; y_{WY}\; y_{ZY}]^T$ as the data vector, $t = [t_{AB}\; t_{BC}\; t_{BD}\; t_{CF}\; t_{FX}\; t_{FW}\; t_{DY}\; t_{EZ}]^T$ as the vector of parameters (lengths of branches), and $\varepsilon = [\varepsilon_{AX}\; \varepsilon_{AW}\; \varepsilon_{AZ}\; \varepsilon_{AY}\; \varepsilon_{XW}\; \varepsilon_{XZ}\; \varepsilon_{XY}\; \varepsilon_{WZ}\; \varepsilon_{WY}\; \varepsilon_{ZY}]^T$ as the residual errors. The phylogeny is evaluated by fitting the data to the same linear model shown in equation (16) but with $X$ defined differently:


The same least-squares method is applied here to evaluate the tree and estimate the lengths of the branches.


    Numerical Studies
 
The reticulate phylogeny given in figure 5 was simulated under the pure drift model. First, a base population (node A) was simulated, from which 50 males and 50 females were randomly sampled, forming parents of the next generation. The population then underwent 50 generations of random mating and random selection, creating node B, which was then split into two lineages. One lineage diverged to node C after 50 generations of drift, and the other lineage diverged to node D. Node C continued drifting for another 50 generations into node F, which was itself split into two populations, X and W; each underwent 50 generations of drift. Node D continued drifting for 100 generations, leading to Y. Beyond the normal bifurcating process, a hybrid population (node E) was formed via hybridization between 50 males from node C and 50 females from node D. The hybrids then underwent random drift for 100 generations, leading to the terminal taxon Z.
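
A drift simulation of this kind can be sketched compactly. The version below is a simplification and not the procedure used in the paper: it uses a monoecious Wright-Fisher model with $2N_e$ gene copies instead of separate sexes, a three-taxon topology (X, Y, and the hybrid Z, as in fig. 2) with hypothetical branch lengths, and it forms the hybrid by pooling the allele frequencies of nodes C and D equally.

```python
import random


def drift(freq, ne, generations, rng):
    """Wright-Fisher drift: each generation, 2*ne gene copies are resampled at random."""
    copies = 2 * ne
    for _ in range(generations):
        count = sum(1 for _ in range(copies) if rng.random() < freq)
        freq = count / copies
    return freq


def distance(p, q):
    """Biallelic genetic distance D = 1 - [p*q + (1 - p)*(1 - q)]."""
    return 1.0 - (p * q + (1.0 - p) * (1.0 - q))


def simulate_distances(n_loci, ne=100, seed=1):
    """Average pairwise distances over independent biallelic loci starting at 0.5."""
    rng = random.Random(seed)
    totals = {"XX": 0.0, "YY": 0.0, "ZZ": 0.0, "XY": 0.0, "XZ": 0.0, "YZ": 0.0}
    for _ in range(n_loci):
        b = drift(0.5, ne, 50, rng)                # A -> B
        c = drift(b, ne, 50, rng)                  # B -> C
        d = drift(b, ne, 50, rng)                  # B -> D
        x = drift(c, ne, 100, rng)                 # C -> X
        y = drift(d, ne, 100, rng)                 # D -> Y
        z = drift((c + d) / 2.0, ne, 100, rng)     # hybrid node E -> Z
        freqs = {"X": x, "Y": y, "Z": z}
        for pair in totals:
            totals[pair] += distance(freqs[pair[0]], freqs[pair[1]])
    return {pair: total / n_loci for pair, total in totals.items()}


distances = simulate_distances(n_loci=20)
for pair, value in sorted(distances.items()):
    print(pair, round(value, 3))
```

With only 20 loci the simulated distances scatter noticeably around their expectations, which is the sampling error that the later analysis of the number of loci addresses.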

Let D be the distance matrix with the taxa arranged in the order of {X W Z Y}, i.e.,


Under this setting, the effective population size is Ne = 100. Let HA = 0.50; then, the expectation of D can be obtained using equations (11)–(13):


Note that the differences between the exact and the approximated D are negligible.
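
The size of the difference between exact and approximate expectations can be examined with a short calculation. The sketch uses the simulated branch lengths ($t_{AB} = t_{BC} = t_{BD} = t_{CF} = t_{FX} = t_{FW} = 50$, $t_{DY} = t_{EZ} = 100$), $N_e = 100$, and $H_A = 0.5$. The formulas for the entries involving the hybrid Z are assumptions derived here from the drift model: exact values average the heterozygosities of the relevant internal nodes, and approximate values collapse them into a single power of $\gamma$.

```python
NE, HA = 100, 0.5
G = 1.0 - 1.0 / (2.0 * NE)
T_AB, T_BC, T_BD, T_CF, T_FX, T_DY, T_EZ = 50, 50, 50, 50, 50, 100, 100


def h(t):
    """Heterozygosity H_A * gamma**t of a node t generations from the base population."""
    return HA * G ** t


exact = {
    "XX": h(T_AB + T_BC + T_CF + T_FX),  # W has the same path length as X
    "YY": h(T_AB + T_BD + T_DY),
    "XW": h(T_AB + T_BC + T_CF),         # node F
    "XY": h(T_AB),                       # node B (same for W and Y)
    "XZ": 0.5 * (h(T_AB) + h(T_AB + T_BC)),
    "ZY": 0.5 * (h(T_AB) + h(T_AB + T_BD)),
    "ZZ": (0.5 * h(T_AB) + 0.25 * h(T_AB + T_BC) + 0.25 * h(T_AB + T_BD)) * G ** T_EZ,
}

approx = dict(exact)
approx["XZ"] = h(T_AB + T_BC / 2.0)
approx["ZY"] = h(T_AB + T_BD / 2.0)
approx["ZZ"] = h(T_AB + (T_BC + T_BD) / 4.0 + T_EZ)

for pair in sorted(exact):
    print(pair, round(exact[pair], 4), round(approx[pair], 4))
```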

The reticulate phylogeny (fig. 5) was then simulated with 100 independent biallelic loci, each starting with equally frequent alleles. The genetic distances calculated from a single run of the simulation were


which are reasonably close to E(D).

The genetic distances are converted into y using equation (14) under HA = 0.50 and Ne = 100. The data are then fitted to the linear model with X chosen under the true phylogeny, given in equation (17) (see fig. 5). The lengths of branches are estimated using equation (18), with results also shown in figure 5 (values with floating points). In general, the estimated lengths of branches are reasonably close to the true values, although some sampling errors have been observed. The MSE value of this phylogeny is 11.05 generations².

The data (y) are then fitted to each of the remaining 11 reticulate phylogenies (fig. 6) and the 15 bifurcating trees (fig. 7). The MSEs of these trees are given in table 1, showing that the true reticulate phylogeny (tree 2) does have the least MSE.



Fig. 7.—The 15 possible rooted bifurcating trees for four taxa.

 

Table 1 Mean Squared Errors (MSEs) of the 12 Possible Reticulate Phylogenies with Taxon Z Being the Hybrid (see fig. 6) and the 15 Possible Bifurcating Trees (see fig. 7) Calculated from a Single Simulation with 100 Independent Loci

 
Repeated simulations were also conducted under the same setting, with the number of loci (L) varied over the following levels: L = 10, 20, 30, 50, 75, 100, 150, 200. The simulated molecular data were analyzed under each of the 12 possible reticulate phylogenies (fig. 6) and the 15 possible bifurcating trees (fig. 7). The inferred phylogeny is chosen as the one that has the minimum MSE among the 27 phylogenies considered. The simulation was then repeated 100 times for each level of L.
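
The selection rule and the tallying of results over replicates can be sketched as follows. The MSE values and tree labels are made up for illustration; in the actual analysis each replicate would carry one MSE for each of the 27 candidate phylogenies.

```python
from collections import Counter


def infer_phylogeny(mse_by_tree):
    """Pick the candidate phylogeny with the minimum MSE."""
    return min(mse_by_tree, key=mse_by_tree.get)


def selection_frequencies(replicates):
    """Proportion of replicates in which each phylogeny was the one inferred."""
    winners = Counter(infer_phylogeny(r) for r in replicates)
    return {tree: count / len(replicates) for tree, count in winners.items()}


# Hypothetical MSEs for three candidate trees in four simulated replicates.
replicates = [
    {"reticulate 2": 11.0, "bifurcating 3": 25.0, "bifurcating 13": 31.0},
    {"reticulate 2": 40.0, "bifurcating 3": 18.0, "bifurcating 13": 22.0},
    {"reticulate 2": 9.0, "bifurcating 3": 30.0, "bifurcating 13": 12.0},
    {"reticulate 2": 14.0, "bifurcating 3": 16.0, "bifurcating 13": 27.0},
]
frequencies = selection_frequencies(replicates)
print(frequencies)
```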

The three topologies simulated represent three different cases: hybridization between closely related taxa, ancient hybridization, and new hybridization between distantly related taxa. The first simulated topology is reticulate phylogeny 1 (see fig. 6), where Z is a hybrid between X and W, which are sister taxa (recently diverged). The frequency of being chosen as the inferred phylogeny is given in table 2 for each of the 12 + 15 = 27 phylogenies considered. When the number of loci is small, the true phylogeny has a low frequency of being inferred compared with bifurcating trees 6 and 10. Note that the relationship between the hybrid and its progenitors is ((X, Z), W) in tree 6, while the relationship is (X, (Z, W)) in tree 10. As the number of loci increases, the frequency of the true phylogeny begins to dominate.


Table 2 Simulation Results from a Reticulate Phylogeny in Which the Hybrid Species is Derived from Recently Diverged Sister Taxa

 
The second topology simulated is reticulate phylogeny 2 (see fig. 6), where Z is a hybrid between clade (X, W) and Y, which are distantly related. The hybridization occurred shortly after the two lineages diverged (old hybridization). The frequency of being chosen as the inferred phylogeny is given in table 3 for each possible phylogeny. Again, when the number of loci is small, the true phylogeny has a low frequency of being inferred compared with bifurcating trees 3 and 13. The relationship between the hybrid and its progenitors is (((X, W), Z), Y) in tree 3, while the relationship is ((X, W), (Z, Y)) in tree 13. Both retain the true clade (X, W). As the number of loci increases, so does the frequency of the true phylogeny.


Table 3 Simulation Results from a Reticulate Phylogeny in which the Hybrid Species is Derived from Distantly Related Taxa but Arose Only Shortly After These Lineages Diverged

 
The third topology simulated is reticulate phylogeny 3 (see fig. 6), where Z is a hybrid between W and Y, which are distantly related. The hybridization, however, occurred long after the two lineages diverged (new hybridization). The frequency of being chosen as the inferred phylogeny is given in table 4 for each possible phylogeny considered. As the number of loci increases, the frequency of being chosen as the inferred phylogeny increases sharply for the true phylogeny (phylogeny 3). In general, as the number of loci increases, the true reticulate phylogeny acquires an increasingly large probability of being inferred.


Table 4 Simulation Results from a Reticulate Phylogeny in which the Hybrid Species is of Recent Origin Between Distantly Related Taxa

 

    Discussion
 
I have developed a method to analyze reticulate phylogenies under the pure drift model and subsequently extended the method to fit the mutation model. The drift model is best used in short-term evolution, in which mutation has not played an important role. The mutation model, however, is applicable for long-term evolution, in which the within-population heterozygosity of any internal node has reached an equilibrium. Both models use genetic distances (or similarities) between taxa in gene frequency. However, the genetic distance between populations under the drift model is a function of the time (t) during which the two populations share the same evolutionary pathways, whereas the genetic distance under the mutation model is a function of the time (t) since the two populations have diverged. While the latter is easily understood, the former appears to be counterintuitive. Under the pure drift model, the genetic distance between two populations reflects the heterozygosity of the immediate common ancestor and contains no information about the heterozygosity of either population. The heterozygosity of the immediate common ancestor, however, is only a function of the time (t) since the common ancestor was isolated from the base population (the root of the phylogeny). The segments between the base population and the immediate common ancestor are the common pathways shared by the two populations before their divergence.

The drift model and the mutation model are mutually exclusive. In situations in which the phylogeny was shaped by the joint force of drift and mutation, neither model will work, because the genetic distance measured this way is a function of both t's (before and after the split of the two populations). In theory, the two models can be combined to infer phylogenies when both drift and mutation are important. Unfortunately, derivation of the drift-mutation model is complicated. Consider the phylogeny given in figure 1. The within-population gene alike indices are (Cockerham 1984)


and the between-population index is

which can be re-expressed as


Note that Q* - qXY is a function of both tAB (the time before the populations split) and tBX + tBY (the time after the populations split). The complexity comes from the fact that the log of Q* - qXY is not a linear function of the t's unless

Unfortunately, equation (30) does not hold in general. Even if it holds, we can see that


which is still complicated because the branch lengths before and after the populations split are expressed in different scales. Alternative methods, such as parsimony and maximum likelihood, may be preferable for combining the two models; this possibility deserves further investigation.

Molecular data at the sequence level have been used to detect horizontal gene transfer, e.g., recombination within a sequence (Hein 1990, 1993; Hudson 1990; Bollyky et al. 1996; Grassly and Holmes 1997). It is not clear how useful it is to infer species hybridization using sequence data. Under certain circumstances, nuclear DNA polymorphism in restriction endonucleases may be used to infer reticulate phylogenies. One of the basic assumptions when using restriction data is that the sites must be independent. This assumption may hold when the restriction sites are located far apart on the genome such that the sites freely recombine in the hybrid lineage. Alternatively, if the hybrid lineage is formed by the hybridization of a large number of individuals from each parental lineage, given a sufficient number of generations in random mating within the hybrid lineage, the sites may behave as independent, even if they are located close together. Nei and Li (1979) developed the mathematical model for studying population divergence in terms of restriction endonucleases. The proportion of sites shared by lineages X and Y, denoted by $S_{XY}$, is expected to decline as X and Y further diverge. Nei and Li (1979) showed that

$$S_{XY} = e^{-r\lambda(t_{BX}+t_{BY})}$$

where $r$ is the number of nucleotides in the restriction site (e.g., $r = 6$ for EcoRI, which recognizes GAATTC), $\lambda$ is the rate of nucleotide substitution per unit time (year or generation), and $t_{BX} + t_{BY}$ is the total amount of time since X and Y diverged (see fig. 1). When multiple restriction enzymes are used, the sites from the same r-valued restriction enzymes should be combined. Data from different r-valued restriction enzymes should be properly weighted before being used for analysis. Define $S_{X_iY_i}$ as the proportion of shared sites between X and Y for an enzyme with $r_i$ nucleotides in the restriction site. The following combination of data is suggested:


where m is the number of different restriction enzymes used. The distance between X and Y is finally expressed as a linear function of the times after they diverged. The distance involving a hybrid lineage can be similarly expressed, e.g.,


The same least-squares method can be used to evaluate a reticulate phylogeny (see the mutation model in gene frequency).
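
For a single enzyme class, the conversion from shared restriction sites to a distance that is linear in divergence time can be sketched as below. It assumes the Nei and Li (1979) relation in the form $S_{XY} = e^{-r\lambda(t_{BX}+t_{BY})}$, so that $-\ln(S_{XY})/r = \lambda(t_{BX}+t_{BY})$. The function names and numerical values are hypothetical, and the weighting across enzymes with different $r$ is not attempted here.

```python
import math


def shared_site_proportion(r, lam, t_total):
    """Expected proportion of shared sites after t_total = t_BX + t_BY units of time."""
    return math.exp(-r * lam * t_total)


def scaled_divergence(s_xy, r):
    """Return lam * (t_BX + t_BY), a distance linear in time since divergence."""
    return -math.log(s_xy) / r


# Hypothetical six-cutter (r = 6), substitution rate, and total divergence time.
s_xy = shared_site_proportion(6, 1e-8, 2_000_000)
d_xy = scaled_divergence(s_xy, 6)
print(round(s_xy, 4), round(d_xy, 6))
```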

Recombination, a form of reticulation at the gene level, generates the same problems as hybridization. Methods exist which try to diagnose recombination by looking at the compatibility of the "phylogenetic partition" supported by the polymorphic sites along the sequence (Drouin and Dover 1990), by looking at changes in the most parsimonious topology along sequences (Hein 1990, 1993), by using a maximum chi-square test (Maynard Smith 1992), or by using the maximum-likelihood approach to detect the specific region showing "anomalous" evolutionary patterns (Grassly and Holmes 1997). However, no general methods exist which allow the placement of a putative hybrid in the appropriate clade. Ritland and Eckenwalder (1992) developed a method to estimate both the time since hybridization and the admixture proportion. Although their treatment does not allow the evaluation of alternative topologies when the progenitors are unknown, it does allow the placement of the hybrid in the correct position relative to the two progenitors if the progenitors are known. The above theoretical works have enhanced our understanding of reticulate evolution, but they may only represent a small proportion of the work required to complete a more general approach.

To obtain the number of loci required for this method to work accurately, dominant markers such as AFLPs may have to be employed. The method proposed can handle dominant markers provided that the gene frequency within a population can be estimated using the Hardy-Weinberg law; i.e., the frequency of the recessive allele is estimated by the square root of the frequency of the recessive homozygotes. The number of allelic states per dominant locus is considered to be two. The efficiency using dominant markers would be slightly less than that observed in the biallelic codominant system (see the simulation studies section) because the gene frequency is not given, but estimated from the genotypic frequencies.
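The Hardy-Weinberg estimate described above can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes that individuals lacking the dominant phenotype (for an AFLP locus, individuals without the band) are the recessive homozygotes.

```python
import math


def recessive_allele_frequency(n_recessive, n_total):
    """Estimate q as the square root of the recessive-homozygote frequency."""
    if n_total <= 0:
        raise ValueError("sample size must be positive")
    if not 0 <= n_recessive <= n_total:
        raise ValueError("recessive count must lie between 0 and n_total")
    return math.sqrt(n_recessive / n_total)


def dominant_marker_frequencies(n_recessive, n_total):
    """Return (dominant, recessive) allele frequencies for a two-allele locus."""
    q = recessive_allele_frequency(n_recessive, n_total)
    return 1.0 - q, q
```

For example, if 16 of 100 sampled individuals are recessive homozygotes, the estimated recessive allele frequency is the square root of 0.16, i.e., 0.4, and the dominant allele frequency is 0.6. These estimated frequencies would then take the place of the directly observed frequencies used for codominant markers.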

The pure drift model may be of interest in its own right. Inbred strains of laboratory animals are valuable model organisms for studies in evolutionary biology, particularly at the molecular level (see Atchley and Fitch 1991, 1993; Fitch and Atchley 1985, 1989). The phylogeny of inbred strains is most likely driven by genetic drift, not by mutation. First, most of the inbred strains of mice could have arisen from just a few mice (Atchley and Fitch 1993). Second, most of the inbred strains are derived by systematic brother × sister mating, which represents the maximum effect of genetic drift in laboratory animals. Third, the evolutionary history of these organisms is too short for a significant mutational input (most laboratory strains of rats and mice have been inbred for less than 200 generations). However, many inbred mouse and rat strains were originally produced from hybridization between genetically divergent strains (Atchley and Fitch 1993). For instance, the SEC strain of mice was derived from hybrids between NB and BALB/c (Festing 1989), and the BS strain of rats was derived from hybrids between NZ and a wild rat (Hedrich 1990). With the theory presented in this paper and the data from inbred strains of animals with known hybrid origins, the genetic aspects of reticulation and its impact on phylogenetic inference could be studied in detail. Furthermore, the model may be readily applied to evolutionary studies of domesticated animals and agricultural cultivars.

For generality, suppose that backcrosses occurred a few times immediately after the initial hybridization event, such that the hybrid taxon ultimately inherits a proportion p of genes from parental taxon X and a proportion 1 - p of genes from parental taxon Y (see fig. 2). In this case, the expected genetic distances become


Estimation of branch lengths under this unidirectional gene flow scenario is still possible if p is known. Otherwise, data from more taxa are required to estimate branch lengths and p simultaneously.

A final caveat concerns the assumption of constant effective population size along all segments of the phylogeny. This assumption is not realistic, especially for the hybridized lineage. In the early stage of hybridization, the hybrid population must have experienced a bottleneck and selection. There may be substantial reorganization of the genome and linkage disequilibrium among genes or chromosomal blocks, as demonstrated in an empirical case (Rieseberg, Vanfossen, and Desrochers 1995). The robustness of the model to these effects needs to be further studied. Nonetheless, we can slightly relax the assumption of constant Ne by assuming that Ne is constant within a segment but can vary across segments. In this case, the estimated branch length for each segment is the number of generations divided by twice the effective population size corresponding to that period of time. A similar argument also holds for the assumption of constant mutation rate v across loci and across alleles within loci.
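To illustrate the relaxed assumption, the following Python sketch computes the branch length of each segment as the number of generations divided by twice that segment's effective population size. The function names and the numerical values are hypothetical, and summing segments along a path assumes that branch lengths in these units are additive.

```python
def branch_length(generations, effective_size):
    """Branch length of one segment in drift units: t / (2 * Ne)."""
    if generations < 0:
        raise ValueError("generations must be non-negative")
    if effective_size <= 0:
        raise ValueError("effective population size must be positive")
    return generations / (2.0 * effective_size)


def path_length(segments):
    """Sum branch lengths over (generations, effective_size) segments.

    Additivity across consecutive segments is an assumption of this sketch.
    """
    return sum(branch_length(t, ne) for t, ne in segments)


# Hypothetical hybrid lineage: a 10-generation bottleneck at Ne = 5,
# followed by 200 generations at Ne = 1000.
hybrid_lineage = [(10, 5), (200, 1000)]
lengths = [branch_length(t, ne) for t, ne in hybrid_lineage]
total = path_length(hybrid_lineage)
```

With these illustrative values, the bottleneck segment has length 1.0 and the later segment has length 0.1, so a brief bottleneck can dominate the estimated branch length even though it spans few generations.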


    Acknowledgements
 
I thank Damian D. Gessler for justifying the approximations in equations (11)–(13) and for his help in smoothing the presentation. I am grateful to Norman Ellstrand, Richard Whitkus, and Claus Vogl for comments on an earlier version of the manuscript. This work was partly supported by National Institutes of Health grant GM55321.


    Footnotes
 
Yun-Xin Fu, Reviewing Editor

1 Keywords: genetic drift, heterozygosity, hybridization, mutation, phylogeny, reticulation

2 Address for correspondence and reprints: Shizhong Xu, Department of Botany and Plant Sciences, University of California, Riverside, California 92521. E-mail: xu@genetics.ucr.edu


    Literature Cited
 

    Atchley, W. R., and W. M. Fitch. 1991. Gene trees and the origins of inbred strains of mice. Science 254:554–558.

    ———. 1993. Genetic affinities of inbred mouse strains of uncertain origin. Mol. Biol. Evol. 10:1150–1169.

    Bollyky, P. L., A. Rambaut, P. H. Harvey, and E. C. Holmes. 1996. Recombination between sequences of hepatitis B virus from different genotypes. J. Mol. Evol. 42:97–102.

    Cavalli-Sforza, L. L., and A. W. F. Edwards. 1967. Phylogenetic analysis: models and estimation procedures. Am. J. Hum. Genet. 19:233–257.

    Cockerham, C. C. 1984. Drift and mutation with a finite number of allelic states. Proc. Natl. Acad. Sci. USA 81:530–534.

    Drouin, G., and G. A. Dover. 1990. Independent gene evolution in the potato actin gene family demonstrated by phylogenetic procedure for resolving gene conversions and the phylogeny of angiosperm actin genes. J. Mol. Evol. 31:132–150.

    Festing, M. F. W. 1989. Inbred strains of mice. Pp. 636–648 in M. F. Lyon and A. G. Searle, eds. Genetic variants and inbred strains of mice. Oxford University Press, New York.

    Fitch, W. M., and W. R. Atchley. 1985. Evolution in inbred strains of mice appears rapid. Science 228:1169–1175.

    ———. 1989. Divergence in inbred strains of mice: a comparison of three different types of data. Pp. 203–216 in C. Patterson, ed. Molecules and morphology in evolution: conflict or compromise? Cambridge University Press, London.

    Funk, V. A. 1985. Phylogenetic patterns and hybridization. Ann. Mo. Bot. Gard. 72:681–715.

    Grassly, N. C., and E. C. Holmes. 1997. A likelihood method for the detection of selection and recombination using nucleotide sequences. Mol. Biol. Evol. 14:239–247.

    Hedrich, H. J. 1990. Genetic monitoring of inbred strains of rats. Gustav Fischer Verlag, Stuttgart.

    Hein, J. 1990. Reconstructing evolution of sequences subject to recombination using parsimony. Math. Biosci. 98:185–200.

    ———. 1993. A heuristic method to reconstruct the history of sequences subject to recombination. J. Mol. Evol. 36:396–405.

    Hudson, R. R. 1990. Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7:1–44.

    McDade, L. 1990. Hybrids and phylogenetic systematics. I. Patterns of character expression in hybrids and their implications for cladistic analysis. Evolution 44:1685–1700.

    ———. 1992. Hybrids and phylogenetic systematics. II. The impact of hybrids on cladistic analysis. Evolution 46:1329–1346.

    ———. 1995. Hybridization and phylogenetics. Pp. 305–331 in P. C. Hoch and A. G. Stephenson, eds. Experimental and molecular approaches to plant biosystematics. Monographs in Systematic Botany from the Missouri Botanical Garden.

    Maynard Smith, J. 1992. Analyzing the mosaic structure of genes. J. Mol. Evol. 34:126–129.

    Nei, M., and W.-H. Li. 1979. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc. Natl. Acad. Sci. USA 76:5269–5273.

    Rieseberg, L. H. 1991. Homoploid reticulate evolution in Helianthus (Asteraceae): evidence from ribosomal genes. Am. J. Bot. 78:1218–1237.

    Rieseberg, L. H., and N. C. Ellstrand. 1993. What can molecular and morphological markers tell us about plant hybridization? Crit. Rev. Plant Sci. 12:213–241.

    Rieseberg, L. H., and J. D. Morefield. 1995. Character expression, phylogenetic reconstruction, and the detection of reticulate evolution. Pp. 333–353 in P. C. Hoch and A. G. Stephenson, eds. Experimental and molecular approaches to plant biosystematics. Monographs in Systematic Botany from the Missouri Botanical Garden.

    Rieseberg, L. H., C. Vanfossen, and A. M. Desrochers. 1995. Hybrid speciation accompanied by genomic reorganization in wild sunflowers. Nature 375:313–316.

    Rieseberg, L. H., J. Whitton, and C. R. Linder. 1996. Molecular marker incongruence in plant hybrid zones and phylogenetic trees. Acta Bot. Neerl. 45:143–262.

    Ritland, K., and J. E. Eckenwalder. 1992. Polymorphism, hybridization, and variable evolutionary rate in molecular phylogenies. Pp. 404–429 in D. E. Soltis, P. S. Soltis, and J. J. Doyle, eds. Molecular systematics of plants. Chapman and Hall, New York.

    Spence, J. R. 1990. Introgressive hybridization in Heteroptera: the example of Limnoporus Stal (Gerridae) species in western Canada. Can. J. Zool. 68:1770–1782.

    Sytsma, K. J. 1990. DNA and morphology: inference of plant phylogeny. TREE 5:104–110.

    Xu, S., and W. R. Atchley. 1995. Heterozygosity of F2 from two segregating populations. J. Hered. 86:477–480.

    Xu, S., W. R. Atchley, and W. M. Fitch. 1994. Phylogenetic inference under the pure drift model. Mol. Biol. Evol. 11:949–960.

Accepted for publication February 10, 2000.