Department of Biochemistry and Biophysics and Stockholm Bioinformatics Center, Stockholm University, Stockholm, Sweden
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Coding-sequence nucleotide substitution is an important mechanism driving the differential adaptive evolution of species and has been studied most notably in hemoglobin, ribonuclease, and lysozyme (Messier and Stewart 1997
; reviewed in Golding and Dean 1998
). The most common approach for identifying branches of phylogenetic trees undergoing adaptive evolution is to measure the ratio of nonsynonymous (Ka) to synonymous (Ks) nucleotide substitution rates for the branches. Nonsynonymous mutations are mutations to DNA that result in amino acid changes in the encoded protein and are ultimately selected for or against based on their effect on the organism's ability to reproduce. Synonymous mutations are mutations to DNA that do not change the encoded amino acid and reflect the opportunity to explore sequence space in the absence of selection. Ka/Ks is a measure of accepted mutations normalized for opportunity. Because extant proteins have been selected for function for millions of years, high Ka/Ks is generally reflective of changed function and adaptive evolution (or at least reduced functional constraint).
While theoretically positive selection occurs when Ka >> Ks, the few instances of observation of functional change along branches of phylogenetic trees indicate that it frequently occurs at Ka/Ks < 1 (Crandall et al. 1999
; Almgren 2001
; unpublished observations). This is because substitution rates measured along branches of phylogenetic trees can average multiple evolutionary periods and because protein-folding selective constraints are different on different residues. Therefore, proteins can undergo adaptive evolution where a fraction of amino acids have Ka/Ks ratios of >>1 for a fraction of the period reflected in a branch, and this can result in Ka/Ks ratios of <1 (see Gould and Eldredge [1993]
for a possible evolutionary mechanism). Furthermore, the relaxation of selective constraints (neutral evolution) that is specific to a subset of branches on a tree can also result in changes of protein function and is important to detect when correlating molecular events with organismal phenotypes. Such changes of protein function can be adaptive at the organismal level. More benchmarking of Ka/Ks ratios with real protein families where function is known needs to be done.
Many methods are available to calculate the Ka/Ks ratio, ranging from the simpler method of Nei and Gojobori (1986)
to the more computationally intensive, highly parameterized maximum-likelihood method of Yang and Nielsen (2000)
. In a comparison of methods (Ina 1995
), the PBL method of Pamilo, Bianchi, and Li (Li, Wu, and Luo 1985
; Pamilo and Bianchi 1993
) performed well and remains one of the more popular methods. This method has recently been modified by Benner, Trabesinger, and Schreiber (1998)
using an implementation of ancestral sequence reconstruction and is hereinafter referred to as the PBLSB method. This allows one to compare Ka/Ks ratios along branches of a phylogenetic tree instead of using a simple pairwise comparison of sequences.
The value of the PBLSB method lies in its ability to pinpoint events to specific branches of a phylogenetic tree and its computational speed, allowing exhaustive analysis of genomes and large data sets (see Liberles et al. 2001
). While likelihood-based methods frequently perform better than parsimony-based methods (Zhang and Nei 1997
), they are computationally slow for all but small data sets.
In studies performed with simulated data sets (e.g., Koshi and Goldstein 1996
; Zhang and Nei 1997
; unpublished data), ancestral sequence reconstruction is not always accurate (not even when maximum-likelihood methods are used), especially as the branch lengths under study increase. While current research with simulated data attempts to delineate when such sequences can be trusted and also to improve methodology, it is valuable to test when currently available simple methods detect changes of protein function for real data sets. The performance of different evolutionary methods with different degrees of parameterization on real data sets was analyzed here to test the importance of these methods in influencing the measurement of evolutionary events. The following new (methods 210) and classical (method 1) methods were tested in both a recent and a more ancient evolutionary case study: different methods for reconstructing ancestral sequences, different theoretical and empirical substitution matrices (methods 2 and 810), parameterization for codon bias (method 7) and GC bias (methods 5 and 6), and two new measures of adaptive evolution (methods 11 and 12).
![]() |
Materials and Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The Master Catalog
All multiple-sequence alignments and phylogenetic trees were obtained from the Master Catalog (Benner et al. 2000
), a database of alignments and trees for each independently evolving protein family found in GenBank. Access can be obtained through EraGen Biosciences (Alachua, Fla.) at http://www.eragen.com.
The methods differed in their treatment of the parameters discussed below. Methods 24 represented modifications of the Fitch algorithm to calculate ancestral sequences, while methods 510 represented modifications of the PBL method to calculate Ka and Ks. Methods 11 and 12 are alternatives to the Ka/Ks ratio.
Reconstruction (methods 1 and 2)
Two methods for reconstructing ancestral sequences were utilized. Both were based on the Fitch (1971)
algorithm. While in its simplest form, the Fitch algorithm is not as accurate as maximum-likelihood reconstruction (Zhang and Nei 1997
), it is much faster and more accessible for large data sets. The DNA reconstruction uses a Fitch reconstruction of only DNA sequences as performed by Benner, Trabesinger, and Schreiber (1998)
. A new parsimony approach is introduced here where DNA and protein parsimonies are calculated simultaneously using the Fitch algorithm (DNA+protein). Then, for ambiguous positions, the intersection of the translated codon and amino acid reconstructions is taken as the reconstructed ancestral sequence. The DNA reconstruction is used when no amino acids are found in common. Probabilities are assigned to codons instead of to nucleotides (as is done by Benner, Trabesinger, and Schreiber 1998
) starting from the root.
Branch Length Weighting (methods 24)
When weighting is set to none, probabilities of ancestral sequences are assigned using the Fitch algorithm. Ignoring branch lengths is thought to be one of the main reasons that maximum likelihood outperforms the Fitch algorithm (Zhang and Nei 1997
). Therefore, one of the modifications of the Fitch algorithm tested here involves reweighted parsimonious ambiguities based on branch length distances. Initial unweighted probabilities are calculated using the Fitch algorithm. These probabilities are then reweighted from the three connecting branches according to the probabilities of sequences at each neighboring node (using the initial unweighted probabilities) divided by the branch length distance measured in either point accepted mutations (PAM) or neutral evolutionary distance (NED) (a distance based on an approach to equilibrium calculation of synonymous twofold-degenerate pyrimidine transition codon substitutions, as in Peltier et al. 2000
). PAM distances are calculated according to standard methods.
Substitution Matrix (methods 2 and 810)
A discrete form of the Grantham matrix is used in the PBLSB method to weight mutational path probabilities when multiple substitutions occur along a branch (method 2) (Grantham 1974
; Li, Wu, and Luo 1985
; Pamilo and Bianchi 1993
). Here, the discrete Grantham matrix and three additional substitution matrices were utilized. A continuous variation of the Grantham matrix is introduced (method 8), where instead of rounding to the nearest value of 50, substitution likelihoods are divided by the Grantham value. Another discrete, but simpler theoretical matrix, the Zhang matrix (method 10) is used in a fashion analogous to that of the discrete Grantham matrix (Zhang 2000
). Finally, an empirical substitution matrix, the Taylor-Jones matrix, was tested as well (method 9) (Taylor and Jones 1993
).
GC Content Correction (methods 5 and 6)
Ks may be affected by a selective pressure on GC content in third positions making synonymous third-position substitution nonneutral (a selectionist model of GC content differences proposes that mutations occur and are then subject to selective pressures on local GC content to remove a subset of these mutations). These selective pressures cause an underestimation of Ks, which was corrected using two measures of GC selective pressure. The selective pressure of GC content bias can be inferred from the GC content (Nei 1987
). Correction for global GC content (method 5) uses the value determined from third positions of all coding sequences known in a given genome (http://www.kazusa.or.jp/codon/). Correction for local GC content (method 6) uses the value of GC content calculated from third positions in the gene of interest.
Codon Bias Correction (method 7)
Correction for codon bias (the nonrandom utilization of different codons encoding the same amino acid) is performed by multiplying the substitution probabilities in the PBL method in each case by the normalized codon usage value (CB), where
|
PAM/NED Ratio (method 11)
The PAM/NED ratio is calculated simply by dividing the PAM distance of a branch by the NED distance along the same branch.
Sequence Space Assessment Statistic (method 12)
The sequence space assessment (SSA) statistic is a measure of the difference between the number of amino acid sites that undergo mutation along a branch (pairwise between branch termini) and the number of those that are found to be variant throughout the sequence in at least one protein in the tree, normalized for the number of taxa. The normalization equation is taken from Tajima's D statistic, for which pairwise variation is compared with the number of polymorphic sites in an interbreeding population. Here,
|
Robustness and NED tests
The same robustness test and the same NED test used in the Master Catalog (Benner et al. 2000
) were applied here. The robustness test eliminates very short branch lengths with fractional mutations. The NED test eliminates long branches with equilibrated third-position NED codons. The NED test involves a test of saturation or equilibration of twofold-degenerate pyrimidine transition codons along a branch (Smith and Smith 1996
; Peltier et al. 2000
; Liberles et al. 2001
). Due to the inaccuracies of ancestral sequence reconstruction, along-branch NED distances may be underestimated on extremely long branches, as indicated by preliminary simulations (including some of those in fig. 2
) (unpublished data). The implications of this are discussed below.
|
![]() |
Results and Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The Benner application of the Fitch method utilizes nucleotide sequences as the basis for the reconstruction of ancestral sequences. A new method is presented here that simultaneously reconstructs nucleotide and amino acid sequences. The nucleotide sequences are translated in ambiguous positions, and the intersection of the two sets of sequences from the two simultaneous reconstructions is adopted. If there is no intersection, then the nucleotide-sequence-based reconstruction is used. Probabilities are assigned from the root, with equal probabilities assigned to codons instead of to nucleotides. This approach is expected to more accurately measure evolutionary intermediates that are tolerated by natural selection and are more parsimonious with respect to protein sequence. Variations of the Fitch algorithm were also tested that use various measures of branch length distances based on the initially assigned Fitch probabilities. The absence of branch length weighting in the original implementation of the Fitch algorithm may account for its poor performance (Zhang and Nei 1997
). PAM distance as a measure of protein distance and NED as a measure of synonymous nucleotide distance were used for branch length weighting. Branch length weighting is expected to remove some of the biases associated with parsimony reconstructions.
Second, the PBL method for calculating Ka/Ks ratios was examined. The usefulness of this ratio hinges on Ks being a truly neutral measure of biological evolution (which Ks may not be in many cases). When Ks is subject to selective pressures, it no longer accurately reflects the amount of substitution available to a gene. Examination of codon usage tables in mammals indicates that not all codons are used equally, and this may result in a bias in nucleotide substitution (http://www.kazusa.or.jp/codon/). Normalized nucleotide substitution matrices were multiplied by the path probability in the PBL method as a correction for Ka. Ks was also corrected directly for selective pressure due to codon bias.
GC content bias may also have an effect on the determination of Ks. GC percentage globally within a genome may be dictated by the physiological temperature of an organism. Locally, the isochore structure of a gene may exert a selective pressure on Ks (Matassi, Sharp, and Gautier 1999
). Here, the selective force of GC content bias (Nei 1987
), measured either globally throughout the genome or locally within the homologous gene set, was used to correct Ks.
Codon bias and GC content bias are well known, both theoretically and from simulations, to influence the calculation of Ka/Ks ratios (see, e.g., Smith 1994
). The prevalence of these effects on genes in various species is not known. Their effects can be ascertained only if the calculation of Ka/Ks ratios is influenced by parameterization to correct for such factors.
Finally, two additional measures of adaptive evolution were explored. Because Ka does not measure the relative neutrality of a mutation, PAM was examined as a measure of protein distance. Furthermore, because NED has been reported to be more clocklike than Ks, it was used to normalize PAM distances (see, e.g., Peltier et al. 2000
). Thus, the PAM/NED ratio was explored as an alternative to the Ka/Ks ratio.
Analogous to Ks, the number of amino acid sites found to vary throughout a clade normalized for sample size can be a measure of the number of sites at which mutation can be tolerated by the physical chemistry (e.g., some interior residues are crucial for proper folding of the protein) of the specific protein structure. The number of those sites that vary along a specific branch, like Ka, shows how much of this potential for mutation is utilized. Using a mathematical framework similar to that developed for comparisons of polymorphism versus pairwise or along-branch divergence, the SSA statistic is also presented as a measure of adaptive evolution, where neutral evolution is expected to have a value of zero (Tajima 1989
).
These measures of adaptive evolution were compared on a recent example of high Ka/Ks rate ratios in primates, that of the leptin protein family, and on a more ancient protein family that spans eubacterial and eukaryotic life, that of the deoxyribonucleoside kinases. The value of a recent example involving closely related species is that genes in these cases frequently share genomic localizations, and closely related species frequently have similar GC contents and codon preferences. This case is ideal for testing parameters correcting such genomic variables that are likely to be stable over the tree. Alternatively, the value of a more ancient and broad example is that the more extensive amino acid substitution is ideal for testing models that account for this differentially.
Leptin has recently become a gene of pharmaceutical interest, having a role in obesity (Friedman and Halaas 1998
). Two previous evolutionary analyses have been performed on the leptin protein in mammals (Benner, Trabesinger, and Schreiber 1998
; Benner et al. 2000
). Both studies demonstrated episodes of rapid sequence evolution between primates and rodents, in which the leptin gene has been implicated in obesity. However, the two studies utilized different methods for assigning ancestral sequences and different methods for calculating the Ka/Ks ratio. From the different methods, different branches representing different periods of evolutionary history were identified as adaptive in the two studies. In the first study, adaptive branches were identified in lineages leading to Hominidae, orangutans, and gorillas and a potentially adaptive event in the lineage leading to rhesus monkeys (Benner, Trabesinger, and Schreiber 1998
). In the second study, adaptive events were assigned to the lineages leading to orangutans, rhesus monkeys, and Hominidae (Benner et al. 2000
). (It should be noted that fig. 2 in Benner et al. [2000]
differs from the representation of the leptin phylogenetic tree in the database described in the paper). This was the starting point for the examination of different methodologies presented in table 1
, based on the phylogenetic tree in figure 1
. This tree was not calculated here but is taken from an accepted biological tree for these species (Arnason, Gullberg, and Graur 1996
).
|
|
Correcting parsimony by branch length does make a difference in the branches leading to rhesus monkeys and Hominidae (methods 1 and 2 vs. methods 3 and 4). This appears to be the classical problem of a long branch dominating a parsimony reconstruction in methods 1 and 2, which is corrected in methods 3 and 4. As NED is more neutral than PAM, method 4 is the preferred method for this data set.
Overall, the branches in the tree that appear to be adaptive using these methods are the Hominidae and, possibly, the rhesus monkey branches, but not the orangutan branch. The branch leading to orangutans was not considered adaptive in the Liberles et al. (2001)
study with the application of a robustness test to eliminate overly short branch lengths despite the overestimation of Ka/Ks using only the method that was applied in that study. The adaptive branch in the lineage leading to Hominidae has been used to explain differences in the behavior of leptin between mice and humans and its applicability as a drug candidate (Benner, Trabesinger, and Schreiber 1998
; Benner et al. 2000
).
A second, more ancient, protein family of evolutionary interest is the dexoyribonucleotide kinase superfamily C-terminal independently evolving unit (Benner et al. 2000
; Almgren 2001
). Here, both the gene function and the nucleotide specificity are undergoing adaptive evolution (see Almgren 2001
for details). In more ancient evolutionary events, the substitution model can affect the calculation of Ka significantly. In table 2
, the five occurrences with Ka/Ks ratios of >0.6 from the PBLSB method were considered along with two control points that were least likely to be adaptive, representing the branches separating human and mouse deoxycytosine kinase. Some branches that obviously reflect changes of protein function are not detected with the 0.6 cutoff. There are several possible reasons for the failure to detect these change-of-function events (e.g., long-branch averaging, neutral evolution). A corresponding tree for this gene family derived from the Master Catalog (Benner et al. 2000
) can be found in figure 2
.
|
This advantage of calculating NED distances along a branch is dependent on the validity of reconstruction along sufficiently short branches throughout the tree. This is not the case for all of the branches in figure 2 . Some changes of protein function are likely at high PAM distances, even when Ka/Ks is overestimated due to saturation. Detecting such changes of function is not informative about the evolutionary mechanism (selection vs. neutrality). Further simulation can indicate exactly when ancestral sequence reconstruction can be trusted in the calculation of NED distances or Ks rates for such molecular-evolution studies. However, the problem of inaccurate ancestral sequence reconstruction along long branches will diminish as more genes and genomes are sequenced and gene trees become better articulated.
The PAM/NED ratio (method 11) seems to follow a pattern similar to that of the Ka/Ks ratios. Branch 3 from table 2 most obviously contains a change of protein function. This is best discriminated by SSA (method 12). Branch 4 may or may not represent a change of function because it is not obvious what the enzymatic specificity was at either end of the branch. The same ambiguity about functions also characterizes branch 1. During the time represented in branches 2 and 5, major changes of function of the protein are less clear and, furthermore, may average multiple events (adaptive, nonadaptive, and conservative) over long branch lengths. Values for these four branches (branches 1, 2, 4, and 5) are intermediate by the SSA measure but not negatively discriminated from branch 1 by Ka/Ks and PAM/NED. In all cases, values for branches 15 are still significantly higher than those for control branches 6 and 7, which are clearly undergoing negative selection. The SSA statistic is interesting as a different parameter that is more closely tied to the specific protein structure and fold. The PAM/NED ratio (or a Grantham/NED ratio) considers more information about the chemical degree of change (it is based on a substitutional model of evolution) per biological time, and PAM/NED ratios appear to correlate with Ka/Ks throughout the tree. As more examples of adaptivity are documented in the literature, it will be interesting to benchmark SSA, PAM/NED ratios, and Ka/Ks ratios against real evolutionary events. Interestingly, Ka/Ks ratios appear to correlate at some level with some putatively adaptive events in Eubacteria. Given the large degrees of codon bias and GC bias that make Ks nonneutral in Eubacteria, this is surprising, although the effect may be similar to that of Ks underestimation on long branches where Ka is sufficiently large to accurately reflect changes of protein function.
Bioinformatics and functional genomics approaches to detecting the genes responsible for the adaptive evolution of species are important tools as more metazoan (and mammalian) genomes are sequenced. The evaluation of tools described here is necessary for application of the appropriate tools to uncover the evolutionary history of species.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
1 Keywords: nonsynonymous nucleotide substitution
synonymous nucleotide substitution
PAM distance
adaptive evolution
reconstructed ancestral sequences
2 Address for correspondence and reprints: David A. Liberles, Department of Biochemistry and Biophysics and Stockholm Bioinformatics Center, Stockholm University, 106 91 Stockholm, Sweden. liberles{at}sbc.su.se
.
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Almgren M. A. E., 2001 Genomic approaches to understanding protein function M.Sc. thesis, Stockholm University, Stockholm, Sweden
Arnason U., X. Xu, A. Gullberg, D. Graur, 1996 The "Phoca Standard": an external molecular reference for calibrating recent evolutionary divergences J. Mol. Evol 43:41-45[ISI][Medline]
Benner S. A., S. G. Chamberlin, D. A. Liberles, S. Govindarajan, L. Knecht, 2000 Functional inferences from reconstructed evolutionary biology involving rectified databasesan evolutionarily grounded approach to functional genomics Res. Microbiol 151:97-106[ISI][Medline]
Benner S. A., N. Trabesinger, D. Schreiber, 1998 Post genomic science: converting primary structure into physiological function Adv. Enzyme Regul 38:155-180[ISI][Medline]
Chelvanayagam G., A. Eggenschwiler, L. Knecht, G. H. Gonnet, S. A. Benner, 1997 An analysis of simultaneous variation in protein structures Protein Eng 10:307-316[Abstract]
Crandall K. A., C. R. Kelsey, H. Imamichi, H. C. Lane, N. P. Salzman, 1999 Parallel evolution of drug resistance in HIV: failure of nonsynonymous/synonymous substitution rate ratio to detect selection Mol. Biol. Evol 16:372-382[Abstract]
Fitch W. M., 1971 Toward defining the course of evolution: minimum change for a specific tree topology Syst. Zool 20:406-416[ISI]
Friedman J. M., J. L. Halaas, 1998 Leptin and the regulation of body weight in mammals Nature 395:763-770[ISI][Medline]
Golding G. B., A. M. Dean, 1998 The structural basis of molecular adaptation Mol. Biol. Evol 15:355-369[Abstract]
Gould S. J., N. Eldredge, 1993 Punctuated equilibrium comes of age Nature 366:223-227[ISI][Medline]
Grantham R., 1974 Amino acid difference formula to help explain protein evolution Science 185:862-864[ISI][Medline]
Ina Y., 1995 New methods for estimating the numbers of synonymous and nonsynonymous substitutions J. Mol. Evol 40:190-226[ISI][Medline]
Knight R. D., S. J. Freeland, L. F. Landweber, 2001 A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes Genome Biol 2:research0010.1-research0010.13
Koshi J. M., R. A. Goldstein, 1996 Probabilistic reconstruction of ancestral protein sequences J. Mol. Evol 42:313-320[ISI][Medline]
Li W. H., C. I. Wu, C. C. Luo, 1985 A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes Mol. Biol. Evol 2:150-174[Abstract]
Liberles D. A., D. R. Schreiber, S. Govindarajan, S. G. Chamberlin, S. A. Benner, 2001 The adaptive evolution database (TAED) Genome Biol 2:research0028.1-0028.6
Matassi G., P. M. Sharp, C. Gautier, 1999 Chromosomal location effects on gene sequence evolution in mammals Curr. Biol 9:786-791[ISI][Medline]
Messier W., C. B. Stewart, 1997 Episodic adaptive evolution of primate lysozymes Nature 385:151-154[ISI][Medline]
Miyamoto M. M., W. M. Fitch, 1995 Testing the covarion hypothesis of molecular evolution Mol. Biol. Evol 12:503-513[Abstract]
Nei M., 1987 Molecular evolutionary genetics Columbia University Press, New York
Nei M., T. Gojobori, 1986 Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions Mol. Biol. Evol 3:418-426[Abstract]
Pamilo P., N. O. Bianchi, 1993 Evolution of the Zfx and Zfy genes: rates and interdependence between the genes Mol. Biol. Evol 10:271-281[Abstract]
Peltier M. R., L. C. Raley, D. A. Liberles, S. A. Benner, P. J. Hansen, 2000 Evolutionary history of the uterine serpins J. Exp. Zool 288:165-174[ISI][Medline]
Pollack D. D., J. A. Eisen, N. A. Doggett, M. P. Cummings, 2000 A case for evolutionary genomics and the comprehensive examination of sequence biodiversity Mol. Biol. Evol 17:1776-1788
Smith J. M., 1994 Estimating selection by comparing synonymous and substitutional changes J. Mol. Evol 39:123-128[ISI][Medline]
Smith J. M., N. H. Smith, 1996 Synonymous nucleotide divergence: what is "saturation"? Genetics 142:1033-1036
Tajima F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism Genetics 123:585-595
Taylor W. R., D. T. Jones, 1993 Deriving an amino acid distance matrix J. Theor. Biol 164:65-83[ISI][Medline]
Yang Z., R. Nielsen, 2000 Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models Mol. Biol. Evol 17:32-43
Zhang J., 2000 Rates of conservative and radical nonsynonymous nucleotide substitutions in mammalian nuclear genes J. Mol. Evol 50:56-68[ISI][Medline]
Zhang J., M. Nei, 1997 Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methods J. Mol. Evol 44:S139-S146[ISI][Medline]