*Institut National de la Santé et de la Recherche Médicale U436, Université Paris 7, Paris, France; and
INSERM U535, Batiment INSERM Grégory Pincus, Kremlin Bicêtre, France
Abstract |
---|
Introduction |
---|
Materials and Methods |
---|
The ancestral sequences were reconstructed by both parsimony (P) and maximum likelihood (ML) procedures. For the P reconstruction, we used the accelerated transformation option (ACCTRAN) and the delayed transformation option (DELTRAN). These options allow one to assign unambiguous, although uncertain, ancestral amino acids at each site of each internal node, including the root. These assignments were constrained in order to avoid reconstructed codons that do not correspond to any amino acid. Branch lengths were estimated as the minimum number of nucleotide changes occurring along each branch; these estimates can differ between ACCTRAN and DELTRAN. For the ML reconstruction, we used PAML to re-estimate the branch lengths and to estimate the ancestral states.
Simulation of Protein Sequence Evolution Along a Phylogenetic Tree
The simulation of the evolution of sequences along a phylogenetic tree was performed with a modified version of PSeq-Gen (Rambaut and Grassly 1997). The original program allows one to simulate the evolution of protein sequences along a given phylogenetic tree, using models based on Dayhoff's (PAM) or Jones-Taylor-Thornton's (JTT) substitution matrices. It takes as input a phylogenetic tree, generates a random ancestor sequence, and makes it evolve along the tree according to the evolutionary model selected. The number of substitutions expected along each branch of the tree is a function of its length. A scaling procedure allows easy variation of the lengths of the whole tree (i.e., to simulate different substitution rates) without specifying a new input file. Finally, site heterogeneity of substitution based on a Gamma shape distribution is implemented.
In our modified version, the root sequence inferred from the phylogenetic reconstruction can be imposed in different manners: it can be either directly input (as the inferred root sequence) or built as a hybrid sequence of the taxon sequences (at each site, the character of one randomly selected taxon sequence is taken). Such a procedure was preferred to a random assignment from a JTT or PAM distribution, since we have to take into account site heterogeneity (see below). In the latter case, prior to the simulation, the root sequence then undergoes a stabilization process through iterative evolution along the tree, so that the ancestral sequence can be considered at the equilibrium under the evolutionary model (JTT or PAM).
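The hybrid root construction described above can be illustrated with a minimal Python sketch. It is not the modified PSeq-Gen code: the function name `hybrid_root`, the use of plain strings for aligned, gap-free sequences, and the example sequences are assumptions made for illustration, and the subsequent stabilization step is not shown.

```python
import random


def hybrid_root(taxon_seqs, rng):
    """Build a hybrid root: at each site, copy the character of one
    randomly selected taxon sequence (hypothetical helper)."""
    length = len(taxon_seqs[0])
    if any(len(seq) != length for seq in taxon_seqs):
        raise ValueError("aligned sequences must have the same length")
    # One independent random draw of a taxon per site.
    return "".join(rng.choice(taxon_seqs)[site] for site in range(length))


# Toy alignment of three taxa (made-up data).
root = hybrid_root(["ACDE", "ACDF", "GCDE"], random.Random(0))
```

Because every character is drawn from the observed column, the hybrid root keeps site-specific composition, which a random draw from the JTT or PAM equilibrium frequencies would not.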
In order to simulate evolution that was as consistent as possible with the data of the aligned protein sequences, site heterogeneity of substitution was not randomly chosen, but deduced from the phylogenetic reconstruction. The substitution heterogeneity rates were taken, for each site, as the ratio of the number of substitutions observed along the tree for that site to the mean substitution rate observed along the tree over all sites. Thus, the heterogeneity substitution coefficients vary around 1. The effectiveness of such estimation was carefully checked by comparing the expected and observed heterogeneity rates with series of simulations.
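As a small illustration of the ratio described above, the following Python sketch computes per-site heterogeneity coefficients from substitution counts. The function name `site_rate_coefficients` and the example counts are hypothetical; how substitutions are counted on the reconstructed tree is assumed to have been done beforehand.

```python
def site_rate_coefficients(subs_per_site):
    """Per-site coefficient = substitutions observed at the site divided by
    the mean number of substitutions per site (hypothetical helper)."""
    if not subs_per_site:
        raise ValueError("no sites given")
    mean = sum(subs_per_site) / len(subs_per_site)
    if mean == 0:
        raise ValueError("no substitution observed along the tree")
    return [count / mean for count in subs_per_site]


# Made-up counts of substitutions along the tree for four sites.
counts = [0, 2, 4, 2]
coeffs = site_rate_coefficients(counts)
```

By construction the coefficients average to 1, which is why they vary around 1.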
Since the quality of the simulations could strongly condition our results, we expressed branch lengths in terms of substitutions observed between two reconstructed ancestral sequences or between reconstructed and extant sequences. A posteriori control of each branch length, as well as of the overall tree length, is possible by assuming that each branch length follows a Poisson law and rejecting simulations for which lengths deviate significantly from the target. The same kind of control, using a chi-square test, makes it possible to check that the overall rate heterogeneity per site does not deviate too much from observation. Moreover, we also checked that the mean tree length over series of simulations is close to the observed tree length. Thus, we expected the simulations to fit the inferred phylogeny as closely as possible.
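One way to picture the a posteriori Poisson control on branch lengths is the Python sketch below. The text does not give the exact rejection rule, so the two-sided test at level `alpha`, the function names, and the example values are all assumptions; the chi-square control of rate heterogeneity is not shown.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean), by direct summation."""
    if k < 0:
        return 0.0
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return min(total, 1.0)


def branch_is_acceptable(simulated, target, alpha=0.05):
    """Accept a simulated branch length (number of substitutions) unless it
    lies in either tail of Poisson(target); two-sided rule is an assumption."""
    lower_tail = poisson_cdf(simulated, target)            # P(X <= simulated)
    upper_tail = 1.0 - poisson_cdf(simulated - 1, target)  # P(X >= simulated)
    return min(lower_tail, upper_tail) >= alpha / 2


def simulation_is_acceptable(simulated_lengths, target_lengths, alpha=0.05):
    """Reject the whole simulation if any branch deviates significantly."""
    return all(
        branch_is_acceptable(s, t, alpha)
        for s, t in zip(simulated_lengths, target_lengths)
    )


# Made-up example: three branches with target lengths 10, 4 and 7.
ok = simulation_is_acceptable([9, 5, 7], [10.0, 4.0, 7.0])
too_long = simulation_is_acceptable([25, 5, 7], [10.0, 4.0, 7.0])
```

The same check can be applied to the overall tree length by passing the summed simulated and target lengths as a single pair.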
Finally, since the role of gaps is unclear in the context of detecting cosubstitutions, sites for which at least one gap was present in the alignment were not considered.
Criteria for Detecting Significant Cosubstituting Pairs of Sites
In this study, we call "cosubstitution" the occurrence of substitutions at two different sites on the same branch of the tree. For two given sites i and j, one can count the number of cosubstitutions occurring among all the branches of the tree. This number will be referred to as CMoij. Two different criteria have been used to detect pairs of sites for which a significant number of cosubstitutions is observed. Note that neither criterion is aimed at detecting whether one site generally affects the substitution probabilities at the other.
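The count CMoij can be sketched in a few lines of Python. The representation of each branch as a (parent sequence, child sequence) pair of equal-length, gap-free strings, the function names, and the toy sequences are assumptions made for illustration; this is not the authors' program.

```python
from collections import Counter
from itertools import combinations


def substituted_sites(parent_seq, child_seq):
    """Indices of sites whose character differs between the two ends of a branch."""
    return [k for k, (a, b) in enumerate(zip(parent_seq, child_seq)) if a != b]


def count_cosubstitutions(branches):
    """Return a Counter mapping (i, j), i < j, to the number of branches on
    which both site i and site j undergo a substitution."""
    counts = Counter()
    for parent_seq, child_seq in branches:
        changed = substituted_sites(parent_seq, child_seq)
        for i, j in combinations(changed, 2):
            counts[(i, j)] += 1
    return counts


# Toy tree with three branches (made-up sequences).
branches = [("ACDE", "GCDF"), ("GCDF", "GCEF"), ("ACDE", "TCDK")]
cmo = count_cosubstitutions(branches)
```

In this toy example, sites 0 and 3 change together on two branches, while site 2 changes alone on one branch and therefore contributes to no pair.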
Finally, having performed the tests, we used a standard binomial procedure to assess the significance of the number $N_{obs}$ of positive tests observed among the $N_t$ tests: the number is significant if $N_{obs} > N_{limit}$ (excess) or $N_{obs} < N_{limit}$ (lack), with $N_{limit} = N_t\alpha \pm 1.64\sqrt{N_t\alpha(1-\alpha)}$, where $\alpha = 0.05$ is the type I error of each individual test. In the present case, $N_t$ is equal to $N_{noGap}(N_{noGap}-1)/2$, where $N_{noGap}$ is the number of sites for which no gap is present in the alignment. For example, for UCE, $N_t = 133 \times 132/2 = 8{,}778$; to check the significance of an excess of positive tests we use $N_{max} = 8{,}778 \times 0.05 + 1.64\sqrt{8{,}778 \times 0.05 \times 0.95} = 472.38$; to check the significance of a lack of positive tests we use $N_{min} = 8{,}778 \times 0.05 - 1.64\sqrt{8{,}778 \times 0.05 \times 0.95} = 405.42$. Each test is unilateral, with a type I error of 0.05.
Classes of Amino Acids
As an alternative to the set of 20 amino acids, we used a partition into seven classes derived from an analysis of the similarity of the profiles of amino acid contacts in proteins (unpublished data). The classes, separated by slashes, are as follows: Ala, Ile, Leu, Met, Phe, Val / Gly, Pro, Trp, Tyr / Asn, Gln, Ser, Thr / His / Arg, Lys / Asp, Glu / Cys. The simulations were still performed using the 20 amino acids even when the partition into seven classes was used. However, only substitutions resulting in a change of class were considered. To assess the biological significance of the results obtained for such a 7-class partition, we generated 7-class partitions of the 20 amino acids by randomly reassigning the amino acids to the 7 classes. A partition into four classes was also used to distinguish polar residues: Arg, Lys / Asp, Glu / Asn, Gln, Ser, Thr / all others.
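A minimal Python sketch of the 7-class partition and of the random control partitions follows. The class memberships are those listed in the text; the function names are hypothetical, and keeping the class sizes fixed when reshuffling is an assumption, since the text does not say how the random reassignment was constrained.

```python
import random

# Seven classes as listed in the text (three-letter residue codes).
SEVEN_CLASSES = [
    ["Ala", "Ile", "Leu", "Met", "Phe", "Val"],
    ["Gly", "Pro", "Trp", "Tyr"],
    ["Asn", "Gln", "Ser", "Thr"],
    ["His"],
    ["Arg", "Lys"],
    ["Asp", "Glu"],
    ["Cys"],
]


def class_index(partition):
    """Map each residue to the index of its class."""
    return {aa: k for k, group in enumerate(partition) for aa in group}


def changes_class(index, before, after):
    """True if a substitution moves the site from one class to another."""
    return index[before] != index[after]


def random_partition(partition, rng):
    """Control partition: shuffle residues across classes while keeping the
    class sizes (size preservation is an assumption)."""
    residues = [aa for group in partition for aa in group]
    rng.shuffle(residues)
    shuffled, start = [], 0
    for group in partition:
        shuffled.append(residues[start:start + len(group)])
        start += len(group)
    return shuffled


index = class_index(SEVEN_CLASSES)
control = random_partition(SEVEN_CLASSES, random.Random(0))
```

With this mapping, a substitution such as Ala to Val is ignored because it stays within a class, whereas Lys to Glu is counted.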
Results and Discussion |
---|
Influence of the Root Sequence
The simulations performed to detect cosubstitutions could be affected by the fact that the inferred sequence at the root might not be at equilibrium under the JTT or the PAM model, which would influence the rate of substitution along the tree. Hence, for all the simulations, we used a procedure to equilibrate the root sequence before starting the simulations (see Materials and Methods). In fact, as shown in table 2 (column 4 vs. column 2), when this procedure is not used (i.e., the reconstructed root sequence is directly used as the starting sequence of the simulation), the results are not affected.
For further analyses based on 7- and 4-aa classes, we present only results based on the JTT model after equilibration of the root sequences. For parsimony, only DELTRAN results are presented.
Detection of Cosubstitution Using a Limited Number of Amino Acid Classes
As concluded above, no significant cosubstitutions were found when using the previous model, in which any amino acid at any site in the sequence can undergo the cosubstitution process with any other amino acid at any other site. The results were the same whatever the model of evolution (PAM or JTT) and the procedure used to infer ancestral states and branch lengths (P or ML). We even found that the numbers of observed cosubstitutions were significantly smaller than expected for all proteins except MYO and IL2. This suggests that cosubstitutions that could occur in real proteins are somewhat constrained compared with the evolutionary models employed in this study. Moreover, the truly significant cosubstitution events (if any) are more likely to be swamped by a large number of noisy events when 20 classes are taken into account instead of a reduced, biologically more reasonable number of classes. Therefore, we reduced the number of classes from 20 to 7, mostly corresponding to the classical partition of amino acids according to their physicochemical properties; the results were drastically affected. The number of detected pairs was found to be larger than that obtained with the model including 20 amino acid classes (table 2: column 9 vs. column 2 for the P procedure; column 7 vs. column 11 for the ML procedure), and now appeared to be significant (P = 0.05). Some proteins showed numbers of observed cosubstituting sites that were significantly larger than expected under independent and random evolution of sites: excesses of at least 144 and 103 cosubstituting pairs were observed for HMP and ADK, respectively. Hence, this suggests that cosubstitutions occur in real proteins but that they are constrained in a way that is more correctly described by the model using seven classes. Since in such an analysis we ignore substitutions occurring within amino acid classes, this suggests that significant cosubstitutions occurring in real proteins would mostly preserve a balance of the overall physicochemical properties associated with the protein. One can wonder, however, whether such a result is merely a consequence of reducing the number of classes, independent of any biological significance. To test this possibility, we performed control simulations by randomly reassigning the amino acids to the 7 classes. We obtained nonsignificant amounts of coevolving pairs in all cases. Hence, it is the nature of the amino acids within the classes that seems to be responsible for significant detections.
Finally, as electrostatic effects are often considered important, we classified the amino acids into four groups, defined in terms of their electrostatic charge/polarizability. Compared with the 7-class partition, this results in merging the 4 classes corresponding to nonpolar amino acids into a single class. Using such a partition, a significant amount of cosubstituting pairs was found for all proteins except UCE and BBP. The results obtained for BBP were congruent with previous results, since we never observed significant detection for this data set in this study. For UCE, the use of 7 classes led to incongruent results, depending on the model and the reconstruction employed. The present results using 4 classes suggest that cosubstitutions detected using 7 classes could mostly involve nonpolar/charged residues. For other proteins, compensation of electrostatic properties seems to be part of the cosubstitution process. Such results support the conclusions of Pollock, Taylor, and Goldman (1999), who used a two-state model to perform detection and could detect the presence of correlated substitutions for myoglobin.
Comparison Between the PAM and JTT Models
Some discrepancies can be noticed between the results obtained using the JTT model of substitutions and those obtained using the PAM model when the P procedure was used (the ML procedure routinely used the JTT model). The PAM model led to a smaller number of detected pairs. This difference did not modify the conclusion of a lack of cosubstitution as long as the 20 distinct amino acids were used (table 2: column 2 vs. column 3). When only 7 categories of amino acids were taken into account, the JTT and PAM models led either to consistent and significant detected pairs (PER, HMP, BLAC, MYO, IPP, ADK, API), to consistent and nonsignificant detected pairs (BBP and IL2), or to inconsistent conclusions. In the last case, five data sets showed significant cosubstitutions only with the JTT procedure (UCE, PAZ, ANV, RECA, DRN), and one only with PAM (TN1R). The fact that the JTT model seemed to statistically detect more cosubstitution events than the PAM model could be explained by the smaller heterogeneity between the substitution probabilities within the PAM matrix compared with those within the JTT one, which produces a much larger variance of the simulations with PAM than with JTT and affects the value of the threshold of detection. Indeed, analyzing the distributions obtained from the simulations, we observed a wider distribution of the number of cosubstitutions at pairs of sites using PAM. However, it is also possible that part of the difference in the results lies in the nature of the substitutions induced by PAM compared with those induced by JTT.
The ML procedure with the JTT model gave results slightly different from those given by the P procedure with the JTT or the PAM model of substitution. Only BBP, RECA, TN1R, and API were not significant. To assess how the use of the PAM model could affect the ML procedure, we performed for some cases a reconstruction of the ancestral states using ML/PAM. For PAZ and UCE, this led to a detection of 59 pairs instead of 78 (column 11 in table 2) and 455 instead of 541, respectively. This led to disagreements of the same magnitude as those obtained for the parsimony data.
Differences Between Data Sets
The differences observed between data sets concerning the detection of a significant number of cosubstituting pairs could be related to some distinctive features of the data selected for the analysis. No clear relationship can be extrapolated with the PAM index. The 6 data sets for which significant amounts of cosubstituting pairs were detected independent of the method employed all had PAM indices of less than 90, and the data set associated with the largest PAM index (BBP) was also the one for which all approaches led to nonsignificant detection. However, for PAZ, also with a large PAM index, some approaches led to significant detection. Furthermore, significant detection did not seem to be related to the number of sequences in the data sets.
Conclusions |
---|
Acknowledgements |
---|
Footnotes |
---|
1 Keywords: correlated substitutions, phylogeny, sequence alignment
2 Address for correspondence and reprints: Pierre Tufféry, Institut National de la Santé et de la Recherche Médicale U436, Université Paris 7, case 7113, 2 place Jussieu, 75251 Paris, France. E-mail: tuffery{at}urbb.jussieu.fr
Literature Cited |
---|
Altschuh, D., A. M. Lesk, A. C. Bloomer, and A. C. Klug. 1987. Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus. J. Mol. Biol. 193:693-707.
Chelvanayagam, G., A. Eggenschwiler, L. Knecht, G. H. Gonnet, and S. A. Benner. 1997. An analysis of simultaneous variation in protein structures. Protein Eng. 10:307-316.
Cunningham, C. W., K. E. Omland, and T. H. Oakley. 1998. Reconstructing ancestral character states: a critical reappraisal. TREE 13:361-366.
Felsenstein, J. 1993. PHYLIP (phylogeny inference package). Version 3.5. Distributed by the author, Department of Genetics, University of Washington, Seattle.
Gobel, U., C. Sander, R. Schneider, and A. Valencia. 1994. Correlated mutations and residue contact prediction. Proteins 18:309-317.
Koshi, J., and R. A. Goldstein. 1996. Probabilistic reconstruction of ancestral protein sequences. J. Mol. Evol. 42:313-320.
Maddison, W. P. 1990. A method for testing the correlated evolution of two binary characters: are gains or losses concentrated on certain branches of phylogenetic tree? Evolution 44:539-557.
Neher, E. 1994. How frequent are correlated changes in families of protein sequences? Proc. Natl. Acad. Sci. USA 91:98-102.
Olmea, A., and A. Valencia. 1997. Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Fold. Des. 2:S25-S32.
Pagel, M. 1994. Detecting correlated evolution on phylogenies: a general method for comparative analysis of discrete characters. Proc. R. Soc. Lond. B Biol. Sci. 255:37-45.
Pazos, F., M. Helmer-Citterich, G. Ausiello, and A. Valencia. 1997. Correlated mutations contain information about protein-protein interaction. J. Mol. Biol. 271:511-523.
Pollock, D. D., and W. R. Taylor. 1997. Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. Protein Eng. 10:647-657.
Pollock, D. D., W. R. Taylor, and N. Goldman. 1999. Coevolving protein residues: maximum likelihood identification and relationship to structure. J. Mol. Biol. 287:187-198.
Rambaut, A., and N. Grassly. 1997. PSeq-Gen: an application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees. Cabios 13:559-560.
Shindyalov, N., N. A. Kolchanov, and C. Sander. 1994. Can three dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng. 7:349-358.
Swofford, D. 1993. PAUP: phylogenetic analysis using parsimony. Version 3.1.1. Illinois Natural History Survey, Champaign.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
Tufféry, P., M. Durand, and P. Darlu. 1999. How possible is the detection of correlated mutations? Theor. Chem. Acc. 101:9-15.
Yang, Z., S. Kumar, and M. Nei. 1995. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:1641-1650.
Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Cabios 13:1555-1556.
Zhang, J., and M. Nei. 1997. Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methods. Mol. Biol. Evol. 44:S139-S146.