Exploring a Phylogenetic Approach for the Detection of Correlated Substitutions in Proteins

Pierre Tufféry* and Pierre Darlu†

*Institut National de la Santé et de la Recherche Médicale U436, Université Paris 7, Paris, France; and
†INSERM U535, Batiment INSERM Grégory Pincus, Kremlin Bicêtre, France


    Abstract
The remarkable conservation of protein structure, compared with that of sequences, suggests that in the course of evolution, residue substitutions which tend to destabilize a particular structure must be compensated by other substitutions that confer greater stability on that structure. Several approaches have been designed to detect correlated changes in a set of homologous sequences. However, most of them do not take into account the phylogeny of the sequences, and it has been shown that their detection power is weak. It remains unclear whether coevolution could be a general process at the level of amino acids of proteins. In the present study, we analyze the phylogenetic reconstruction of 15 sets of homologous proteins to assess, under different conditions, whether a significant amount of coevolving sites can be detected. Two criteria are used to detect significantly cosubstituting sites. One criterion corresponds to that of Shindyalov, Kolchanov, and Sander. The second one is based on intensive simulations of evolution of protein sequences along a phylogeny to estimate the significance of the number of observed cosubstitutions for pairs of sites. Our results show an important sensitivity of the detection of cosubstituting sites to the model used for the phylogenetic reconstruction. Not considering the uncertainty associated with the reconstructed data might lead to detecting numerous false-positive pairs of sites. Finally, significant amounts of coevolving pairs could be found only when substitutions affecting the physicochemical properties of the amino acids were considered. Such results suggest evidence of a cosubstitution mechanism in protein evolution. However, the identification of nonambiguous coevolving sites is still unresolved.


    Introduction
Recent years have seen increasing interest in approaches intended to detect correlated evolution between sites of protein sequences, with the aim of revealing whether the compensation of substitutions is related to spatial proximity. Several studies have set up criteria based on the analysis of aligned protein sequences (Altschuh et al. 1987; Neher 1994; Gobel et al. 1994; Olmea and Valencia 1997; Pazos et al. 1997; Chelvanayagam et al. 1997). However, it is now well established that knowledge of the phylogenetic relationships between sequences is required to have any chance of separating the residues actually participating in the compensatory process from the irrelevant compensations due to random correlations, and it has been explicitly demonstrated for most of the criteria used in the above studies that they cannot perform detection without substantial random noise (Pollock and Taylor 1997; Tufféry, Durand, and Darlu 1999). In 1994, Shindyalov, Kolchanov, and Sander pioneered the investigation of the amount of coevolutionary information that could be deduced from a phylogenetic reconstruction of the protein sequences. Their goal was to assess the degree of relationship between coevolution of protein residues and spatial proximity. However, although their criteria for identifying correlated substitutions seem to be convenient enough for routine analysis, their study relied on a method of tree reconstruction (UPGMA) which was too simplistic to yield any convincing results. Also, the statistics associated with their approach were not well assessed. At the same time, Pagel (1994), extending Maddison's (1990) work, developed a maximum-likelihood (ML) method to assess the significance of correlated changes for variables taking only two states, with the model of changes along the branches of the tree following a Markov process.
While not specifically designed for protein sequence coevolution analysis, this method has recently been adapted to it by Pollock, Taylor, and Goldman (1999), still within a two-state framework, with the power of the method being evaluated by simulations. The authors showed for myoglobin that some coevolution signal seems significantly larger than what can be expected from random evolution. However, this result has to be confirmed for more than a single protein. Also, the two-state model requires a simplified classification of amino acids with similar properties (i.e., charge or volume). This simplified model needs to be extended to the most detailed description of the amino acid types in order to get an accurate analysis of putative coevolution signals. However, the implementation of the ML method with more than two states could be particularly difficult. In the present study, we compare two independent approaches that take into account 20 amino acid states and apply them to 15 protein data sets ranging from 11 to 44 sequences per data set in order to assess whether some significant cosubstitution signal can be detected.


    Materials and Methods
Selection of Sequences and Alignments
Sets of homologous protein sequences were obtained by using the PSearch facility provided at EBI (http://www2.ebi.ac.uk/services.html). This service uses BLAST to retrieve, from various databases, the sequences related to a given query sequence. The sets of sequences obtained were then realigned using CLUSTAL W (Thompson, Higgins, and Gibson 1994), and a visual check of the resulting alignment was performed. Fourteen alignments were selected, and their characteristics are described in table 1. The myoglobin sequences already used by Neher (1994) and Pollock, Taylor, and Goldman (1999) were added to this collection.


Table 1 Descriptions of the Families of Sequences Used

 
Phylogenetic Reconstruction
Given a collection of aligned sequences, the most parsimonious trees were found by using the heuristic searches implemented in PAUP, version 3.1.1 (Swofford 1993). In these analyses, amino acid sites were equally weighted, with changes being weighted by the minimum number of nucleotide substitutions needed to convert one amino acid to another (Felsenstein 1993). Sequences were selected to get a fully resolved topology, with the essential condition that at least two amino acid changes occur on each branch so that correlated substitutions could be expected. Since a rooted tree is needed for the simulations, the midpoint method was used to place the root. We also checked, for some cases, that the tree structures obtained by the parsimony (P) method were unmodified when an ML approach was used (PAML; Yang 1997).

The ancestral sequences were reconstructed by both P and ML procedures. For the P reconstruction, we used the accelerated transformation option (ACCTRAN) and the delayed transformation option (DELTRAN). These options allow one to assign unambiguous, although uncertain, ancestral amino acids at each site of each internal node, including the root. Branch lengths are estimated by the minimum number of nucleotide changes occurring along them. These estimates can be different for ACCTRAN and DELTRAN. These assignments were constrained in order to avoid reconstructed codons not corresponding to any amino acid. For the ML reconstruction, we used PAML to re-estimate the branch lengths and to estimate the ancestral states.

Simulation of Protein Sequence Evolution Along a Phylogenetic Tree
The simulation of the evolution of sequences along a phylogenetic tree was performed with a modified version of PSeq-Gen (Rambaut and Grassly 1997). The original program allows one to simulate the evolution of protein sequences along a given phylogenetic tree, using models based on Dayhoff's (PAM) or Jones-Taylor-Thornton's (JTT) substitution matrices. It takes as input a phylogenetic tree, generates a random ancestor sequence, and makes it evolve along the tree according to the evolutionary model selected. The number of substitutions expected along each branch of the tree is a function of its length. A scaling procedure allows easy variation of the lengths of the whole tree (i.e., to simulate different substitution rates) without specifying a new input file. Finally, site heterogeneity of substitution based on a Gamma shape distribution is implemented.

In our modified version, the root sequence inferred from the phylogenetic reconstruction can be imposed in different manners: it can be either directly input (as the inferred root sequence) or built as a hybrid sequence of the taxon sequences (at each site, the character of one randomly selected taxon sequence is taken). Such a procedure was preferred to a random assignment from a JTT or PAM distribution, since we have to take into account site heterogeneity (see below). In the latter case, prior to the simulation, the root sequence then undergoes a stabilization process through iterative evolution along the tree, so that the ancestral sequence can be considered at the equilibrium under the evolutionary model (JTT or PAM).
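
The hybrid root construction described above can be sketched as follows. This is only an illustration under stated assumptions, not the modified PSeq-Gen code: the function name hybrid_root and its arguments are hypothetical, and the subsequent stabilization step (iterative evolution along the tree) is not reproduced.

```python
import random


def hybrid_root(taxon_sequences, rng=None):
    """Build a root sequence by taking, at each aligned site, the
    character of one randomly selected taxon sequence.

    taxon_sequences: list of aligned sequences (equal-length strings).
    rng: optional random.Random instance, for reproducibility.
    """
    rng = rng or random.Random()
    lengths = {len(seq) for seq in taxon_sequences}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    n_sites = lengths.pop()
    # One independent random draw of a taxon per site.
    return "".join(rng.choice(taxon_sequences)[site] for site in range(n_sites))
```

Because each site is drawn from the observed column, invariant sites stay invariant in the hybrid root, which is one way such a construction retains information on site heterogeneity.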

In order to simulate evolution that was as consistent as possible with the data of the aligned protein sequences, site heterogeneity of substitution was not randomly chosen, but deduced from the phylogenetic reconstruction. The substitution heterogeneity rates were taken, for each site, as the ratio of the number of substitutions observed along the tree for that site to the mean substitution rate observed along the tree over all sites. Thus, the heterogeneity substitution coefficients vary around 1. The effectiveness of such estimation was carefully checked by comparing the expected and observed heterogeneity rates with series of simulations.
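
As a minimal sketch of the ratio just described, the per-site heterogeneity coefficients can be computed from the substitution counts observed along the reconstructed tree. The function name and the input format (one count per site) are assumptions made for illustration.

```python
def relative_site_rates(substitutions_per_site):
    """Return, for each site, the number of substitutions observed along
    the tree divided by the mean number of substitutions over all sites.

    The resulting coefficients average to 1 by construction.
    """
    total = sum(substitutions_per_site)
    if total == 0:
        raise ValueError("no substitution observed along the tree")
    mean = total / len(substitutions_per_site)
    return [count / mean for count in substitutions_per_site]


# Four sites with 2, 4, 0, and 2 substitutions along the tree.
rates = relative_site_rates([2, 4, 0, 2])
```

An invariant site receives a coefficient of 0 and therefore never changes in the simulations, whereas a site with twice the mean number of substitutions receives a coefficient of 2.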

Since the quality of the simulations could condition our results in a major way, we expressed branch lengths in terms of substitutions observed between two reconstructed ancestral sequences or between reconstructed and extant sequences. A posteriori control on each branch length, as well as on the overall tree length, is possible, assuming that each branch length is described by a Poisson law and rejecting simulations for which lengths deviate significantly from the target. It is possible with the same kind of control to check that the overall rate heterogeneity per site does not deviate too much from observation using a chi square test. Moreover, we also checked that the mean tree length over series of simulations is close to the observed tree length. Thus, we expected that the simulation fits, at best, the inferred phylogeny.
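
The a posteriori control on branch lengths might look like the sketch below. The exact rejection rule is not given in the text, so the following is an assumption: a simulation is accepted when every simulated branch length lies within n_sd standard deviations of its target, using the Poisson property that the standard deviation equals the square root of the mean. The function name and the parameter n_sd are hypothetical.

```python
import math


def branch_lengths_acceptable(target_lengths, simulated_lengths, n_sd=1.0):
    """Accept a simulation if each simulated branch length (number of
    substitutions) is within n_sd Poisson standard deviations of the
    target length for that branch.

    For a Poisson law with mean t, the standard deviation is sqrt(t).
    """
    if len(target_lengths) != len(simulated_lengths):
        raise ValueError("one simulated length is needed per branch")
    for target, simulated in zip(target_lengths, simulated_lengths):
        if abs(simulated - target) > n_sd * math.sqrt(target):
            return False
    return True
```

With targets of 4 and 9 substitutions, simulated lengths of 5 and 7 pass at 1 SD, whereas lengths of 7 and 9 pass only when the tolerance is widened to 2 SD.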

Finally, since the role of gaps is unclear in the context of detecting cosubstitutions, sites for which at least one gap was present in the alignment were not considered.
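
The exclusion of gapped sites amounts to a simple column filter, sketched here. The gap characters ("-" and ".") and the function name are assumptions; the alignment is taken to be a list of equal-length strings.

```python
def ungapped_sites(alignment, gap_chars="-."):
    """Return the indices of alignment columns in which no sequence
    carries a gap character."""
    n_sites = len(alignment[0])
    return [
        site
        for site in range(n_sites)
        if all(seq[site] not in gap_chars for seq in alignment)
    ]


# Columns 1 and 2 each contain a gap, so only columns 0 and 3 are kept.
kept = ungapped_sites(["AC-E", "ACDE", "A-DE"])
```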

Criteria for Detecting Significant Cosubstituting Pairs of Sites
In this study, we call a "cosubstitution" the event in which two different sites each undergo a substitution on the same branch of the tree. For two given sites i and j, one can count the number of cosubstitutions occurring among all the branches of the tree. This number will be referred to as CMoij. Two different criteria have been used to detect pairs of sites for which a significant amount of cosubstitutions is observed. Note that neither criterion is aimed at detecting whether one site generally affects the substitution probabilities at the other.
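
Counting CMoij from a reconstructed tree can be sketched as follows. The representation of the tree as a list of (parent sequence, child sequence) pairs, one per branch, is an assumption made for illustration; the sequences at internal nodes are the reconstructed ancestral states.

```python
def count_cosubstitutions(branches, i, j):
    """Count the branches on which both site i and site j change.

    branches: iterable of (parent_sequence, child_sequence) pairs, one
    per branch of the tree, with aligned sequences of equal length.
    """
    count = 0
    for parent, child in branches:
        if parent[i] != child[i] and parent[j] != child[j]:
            count += 1
    return count


# Four branches of a toy tree; sites 0 and 2 change together on the
# first and last branches only.
toy_branches = [("ACD", "GCE"), ("GCE", "GCE"), ("ACD", "ACE"), ("ACD", "TCA")]
cm_obs = count_cosubstitutions(toy_branches, 0, 2)
```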

  1. The probabilistic criterion (S94)
    This criterion was proposed by Shindyalov, Kolchanov, and Sander (1994). It consists in estimating from the data the probability that for a given pair of sites, the number of simultaneous substitutions expected under independent evolution is larger than that observed in the reconstructed phylogenetic tree given the reconstructed ancestral states at each site. The estimation of this number is based on a simple probabilistic model that implicitly takes into account site mutation heterogeneity as well as branch lengths as deduced from the reconstruction (see Shindyalov, Kolchanov, and Sander [1994] for details). We implemented both the exact approach and the approximate approach of Shindyalov, Kolchanov, and Sander (1994) and checked that their concordance was satisfactory. For computational purposes, we used the approximate approach.
  2. The simulation-based criterion (SIM)
    To estimate the probability that the number of simultaneous substitutions observed at two sites i and j, CMoij, can be obtained "by chance" during the evolution of the protein, we performed series of simulations leading to random amino acid changes along the reconstructed tree, following either the PAM or the JTT model. After each simulation, we counted the number of simultaneous substitutions obtained, CMsij. We thus built, for a large number of simulations and for each pair, the law Lsij of the expected number of randomly generated simultaneous substitutions for i and j. While dependent on the model used for the simulations, such a criterion has the advantage of being independent of any hypothesis concerning the putative cosubstitution process. Given a set of simulations, the significance of CMoij can be assessed from the number of times the CMsij value is equal to or larger than CMoij. This is equivalent, for a type I error of α, to determining from Lsij the threshold εij that delimits a fraction α of the law. εij is stable if the laws are built from a large enough number of simulations. In the present study, we carried out series of 20,000 simulations. For a set of sequences of N sites, we performed Nt = N × (N - 1)/2 tests. To perform a detection at the type I error of α on Nt tests, one should, in theory, make a series of individual tests at the type I error of α/Nt. However, this approach leads to a number of required simulations that is prohibitive and was not used here. Instead, we determined a type I error value α* such that the number of significant tests N* observed for a series of simulations was αNt.
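
The SIM criterion for one pair of sites can be sketched as below, assuming the simulated counts CMsij have already been collected in a list. Both function names are hypothetical, and the threshold search is a plain illustration of "the threshold that delimits a fraction α of the law", not the authors' implementation.

```python
def empirical_p_value(cm_observed, cm_simulated):
    """Fraction of simulations whose cosubstitution count is equal to
    or larger than the observed count."""
    n_at_least = sum(1 for c in cm_simulated if c >= cm_observed)
    return n_at_least / len(cm_simulated)


def detection_threshold(cm_simulated, alpha=0.05):
    """Smallest integer threshold eps such that at most a fraction
    alpha of the simulated counts is equal to or larger than eps."""
    n = len(cm_simulated)
    lowest, highest = min(cm_simulated), max(cm_simulated)
    # At eps = highest + 1 the fraction is 0, so the loop always returns.
    for eps in range(lowest, highest + 2):
        if sum(1 for c in cm_simulated if c >= eps) / n <= alpha:
            return eps


# Toy law built from 100 simulations (the study used 20,000).
law = [0] * 90 + [1] * 7 + [2] * 3
```

With this toy law, an observed count of 1 is reached or exceeded in 10% of the simulations, so it would not be significant at α = 0.05, whereas an observed count of 2 would be.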

Finally, having performed the tests, we used a standard binomial procedure to assess the significance of the number Nobs of positive tests observed among the Nt tests: the number is significant if Nobs ≥ Nmax (excess of positive tests) or Nobs ≤ Nmin (lack of positive tests), with Nmax and Nmin given by Ntα ± 1.64 sqrt(Ntα(1 - α)). In the present case, Nt is equal to NnoGap × (NnoGap - 1)/2, where NnoGap is the number of sites for which no gap is present in the alignment. For example, for UCE, Nt = (133 × 132/2) = 8,778. To check the significance of an excess of positive tests, we use Nmax = 8,778 × 0.05 + 1.64 sqrt(8,778 × 0.05 × 0.95) = 472.38; to check the significance of a lack of positive tests, we use Nmin = 8,778 × 0.05 - 1.64 sqrt(8,778 × 0.05 × 0.95) = 405.42. Each test is unilateral, with a type I error of 0.05.
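
The UCE example can be reproduced with a few lines, given here as a sketch of the normal approximation to the binomial law; the function name is an assumption.

```python
import math


def binomial_limits(n_tests, alpha=0.05, z=1.64):
    """Lower and upper limits for the number of positive tests expected
    at random: n_tests * alpha -/+ z * sqrt(n_tests * alpha * (1 - alpha))."""
    expected = n_tests * alpha
    spread = z * math.sqrt(n_tests * alpha * (1 - alpha))
    return expected - spread, expected + spread


n_sites = 133                           # ungapped sites for UCE
n_tests = n_sites * (n_sites - 1) // 2  # 8,778 pairs of sites
n_min, n_max = binomial_limits(n_tests)
```

For UCE this gives limits of approximately 405.4 and 472.4 around an expectation of 438.9 positive tests.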

Classes of Amino Acids
As an alternative to the set of 20 amino acids, we used a partition into seven classes derived from an analysis of the similarity of the profiles of amino acid contacts in proteins (unpublished data). The classes are as follows: (Ala, Ile, Leu, Met, Phe, Val), (Gly, Pro, Trp, Tyr), (Asn, Gln, Ser, Thr), (His), (Arg, Lys), (Asp, Glu), and (Cys). The simulations were still performed using the 20 amino acids even when the partition into seven classes was used. However, only substitutions resulting in a change of class were considered. To assess the biological significance of the results obtained for such a 7-class partition, we generated 7-class partitions of the 20 amino acids obtained by randomly reassigning the amino acids to the 7 classes. A partition into four classes was also used to distinguish polar residues: (Arg, Lys), (Asp, Glu), (Asn, Gln, Ser, Thr), and all others.
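
The two partitions, the class-change test, and a randomized control partition can be sketched as follows. All names are hypothetical. In particular, the text does not say whether the random reassignment preserved the class sizes; keeping the original sizes is an assumption of this sketch.

```python
import random

AMINO_ACIDS = ["Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
               "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val"]

SEVEN_CLASSES = [
    ["Ala", "Ile", "Leu", "Met", "Phe", "Val"],
    ["Gly", "Pro", "Trp", "Tyr"],
    ["Asn", "Gln", "Ser", "Thr"],
    ["His"],
    ["Arg", "Lys"],
    ["Asp", "Glu"],
    ["Cys"],
]

_POLAR = [["Arg", "Lys"], ["Asp", "Glu"], ["Asn", "Gln", "Ser", "Thr"]]
_ASSIGNED = {aa for group in _POLAR for aa in group}
# The fourth class gathers all remaining amino acids.
FOUR_CLASSES = _POLAR + [[aa for aa in AMINO_ACIDS if aa not in _ASSIGNED]]


def class_lookup(partition):
    """Map each amino acid to the index of its class."""
    return {aa: k for k, group in enumerate(partition) for aa in group}


def changes_class(lookup, aa_from, aa_to):
    """True if the substitution moves the residue to another class."""
    return lookup[aa_from] != lookup[aa_to]


def random_partition(partition, rng):
    """Control partition: shuffle the amino acids among the classes,
    keeping the original class sizes (an assumption of this sketch)."""
    residues = [aa for group in partition for aa in group]
    rng.shuffle(residues)
    shuffled, start = [], 0
    for group in partition:
        shuffled.append(residues[start:start + len(group)])
        start += len(group)
    return shuffled
```

Under the 7-class partition, an Ala to Val substitution is ignored because both residues belong to the same class, whereas an Ala to Gly substitution is counted.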


    Results and Discussion
Detection of Cosubstituting Pairs Using 20 Amino Acid States
Comparison Between S94 and SIM
Table 2 gives the numbers of cosubstituting pairs detected using both the SIM and the S94 criteria for 15 families of proteins. Surprisingly, S94 (columns 5 and 8 for P and ML, respectively) detected an amount of coevolving sites which was much larger than that detected by SIM (columns 2 and 7 for P and ML, respectively) and significantly (type I error, P = 0.05) larger than the number expected at random, while those obtained with SIM were significantly smaller. However, when the SIM criterion was used and the simulations preserved exact branch lengths and exact numbers of mutations per site along the tree (column 6), the results were very similar to those given by S94 (column 5). Hence, a major difference in the results obtained by S94 and SIM lies in the fact that branch lengths and relative mutation rates were maintained constant under the S94 procedure, whereas they were allowed to fluctuate under the SIM procedure. For a pair of sites, the introduction of variable branch lengths leads necessarily to a wider distribution of the number of cosubstitutions, an increased value of the threshold of detection for a given type I error, and, consequently, a decrease in the efficiency of detection. Results presented in table 2 were obtained using simulations allowing 1 SD in the branch lengths. We checked the influence of the amount of variability in the branch lengths by introducing a smaller variance in the branch lengths (0.1 SD). Even for such small variation, we observed an important decrease in the amount of detected pairs. For example, for UCE and PER, we detected 313 and 1,892 pairs, i.e., deviations no different according to binomial sampling (column 2). Control simulations performed using 2 SD do not lead to significant modifications of the results presented. For example, simulations allowing deviations of 2 SD (column 9) led to 557 pairs being detected for UCE and 3,015 pairs being detected for PER. 
Since the phylogenetic reconstruction is performed with a certain amount of uncertainty in the estimation of the ancestral amino acid and hence the estimation of the branch lengths, the results obtained with SIM, taking into account some variability on the branch lengths and relative mutation rates at each site, should be more likely than those obtained with S94. For the case in which no variance is taken into account, the present results suggest that numerous pairs close to the limit of detection could be falsely considered positives. Thus, the S94 criterion was no longer considered in the present work. We present only results for which some variance on branch lengths was allowed during the simulations (1 SD).


Table 2 Numbers of Coevolving Pairs Detected for Different Sets of Sequences and Different Reconstructions of the Ancestral States

 
Influence of Methods for Inferring Ancestral States
The methods used to infer the ancestral amino acids at the interior nodes of an evolutionary tree still pose difficult problems. Each of the currently available methods, ML and P, has its own flaws (Cunningham, Omland, and Oakley 1998). In any case, the accuracy of the inference cannot be fully satisfactory. It depends on various factors, such as the reliability of the assumed substitution model, the degree of divergence between sequences, the number of sequences used, the branch length of the tree, the depth of the nodes, and the correctness of the tree (Yang, Kumar, and Nei 1995; Zhang and Nei 1997). These authors demonstrated that the accuracy of the ML method is slightly better than that of the parsimony reconstruction in most situations. For example, the accuracy at a site is 0.50 by P for parsimony-informative sites, as compared with 0.63 by the ML method, in the case of cytochrome b. For sites which are highly polymorphic, the probabilities of the various possible combinations of reconstructed amino acid ancestors may be barely distinguishable, leading to some degree of arbitrariness when choosing a given combination, even though one combination has the best probability. Koshi and Goldstein (1996) showed that even with structure-dependent substitution matrices being taken into account to infer the ancestral states and the structure of the tree being known, the proportion of incorrect inferences in the ancestral sequence can be large when the evolutionary distance is large. For instance, when the PAM values ranged between 30 and 120, the proportion of incorrect inferences was between 8% and 40%. The PAM values of the data analyzed in this work were usually among the largest (table 1). Therefore, we did not expect a very large difference in accuracy using P instead of ML. Indeed, the method used for reconstructing ancestral states seemed to be less influential on the results than the statistical procedure used to test the cosubstitutions.
The numbers of detected pairs are roughly identical with either the P (ACCTRAN, column 1 of table 2; DELTRAN, column 2 of table 2) or the ML (column 7 of table 2) method being used to infer the ancestral sequences. The correlation between these three different estimations is higher than 0.98 (n = 15). We can conclude that both P and ML estimation of the ancestral states lead to similar results.

Influence of the Root Sequence
The simulation performed to detect cosubstitutions could be affected by the fact that the inferred sequence at the root might not be at equilibrium under the JTT or the PAM model, influencing the rate of substitution along the tree. Hence, for all the simulations, we used a procedure to equilibrate the root sequence before starting the simulations (see Materials and Methods). In fact, as shown in table 2 (column 4 vs. column 2), when this procedure is not used (i.e., the reconstructed root sequence is directly used as the starting sequence of the simulation), the results are not affected.

For further analyses based on 7- and 4-aa classes, we present only results based on the JTT model after equilibration of the root sequences. For parsimony, only DELTRAN results are presented.

Detection of Cosubstitution Using a Limited Number of Amino Acid Classes
As we have concluded, no significant cosubstitutions were found when using the previous model, which accepts that any amino acid at any site in the sequence can undergo the cosubstitution process with any other amino acid at any other site. The results were the same whatever the model of evolution (PAM or JTT) and the procedure used to infer ancestral states and branch lengths (P, ML). We even found that the numbers of observed cosubstitutions were significantly less than expected for all proteins except MYO and IL2. This suggests that cosubstitutions that could occur in real proteins are somewhat constrained compared with the evolutionary models employed in this study. Moreover, the really significant cosubstitution events (if any) are more likely to be swamped by a large amount of noisy events when 20 classes are taken into account instead of a more reasonable (biologically speaking) reduced number of classes. Therefore, we reduced the number from 20 to 7 classes, mostly corresponding to the classical partition of amino acids according to their physicochemical properties; the results were drastically affected. The number of detected pairs was found to be larger than that obtained with the model including 20 amino acid classes (table 2: column 9 vs. column 2 for the P procedure; column 7 vs. column 11 for the ML procedure), and now appeared to be significant (P = 0.05). Some proteins showed numbers of observed cosubstituting sites which could be significantly larger than expected under independent and random evolution of sites: excesses of at least 144 and 103 cosubstituting pairs were observed for HMP and ADK, respectively. Hence, this suggests that cosubstitutions occur in real proteins but that they are constrained in a way which is more correctly described by the model using seven classes.
Since in such an analysis we ignore substitutions occurring within amino acid classes, this suggests that significant cosubstitutions occurring in real proteins would mostly preserve a balance of the overall physicochemical properties associated with the protein. One can wonder, however, whether such a result is not a consequence of reducing the number of classes, independent of any biological significance. To test this possibility, we performed control simulations by randomly reassigning the amino acids to the 7 classes. We obtained nonsignificant amounts of coevolving pairs in all cases. Hence, it is the nature of the amino acids within the classes that seems to be responsible for significant detections.

Finally, as electrostatic effects are often considered important, we classified the amino acids into four groups, defined in terms of their electrostatic charge/polarizability. Compared with the 7-class partition, this results in merging 4 classes corresponding to nonpolar amino acids into a single class. Using such a partition, a significant amount of cosubstituting pairs was found for all proteins except UCE and BBP. The results obtained for BBP were congruent with previous results, since we never observed significant detection for this data set in this study. For UCE, the use of 7 classes led to incongruent results, depending on the model and the reconstruction employed. The present results using 4 classes suggest that cosubstitutions detected using 7 classes could mostly involve nonpolar/charged residues. For other proteins, compensation of electrostatic properties seems to be part of the cosubstitution process. Such results support the conclusions of Pollock, Taylor, and Goldman (1999), who used a two-state model to perform detection and could detect the presence of correlated substitutions for myoglobin.

Comparison Between the PAM and JTT Models
Some discrepancies can be noticed between the results obtained using the JTT model of substitutions and those obtained using the PAM model when the P procedure was used (the ML procedure routinely used the JTT model). The PAM model led to a smaller number of detected pairs. This difference did not modify the conclusion of a lack of cosubstitution as long as the 20 distinct amino acids were used (table 2: column 2 vs. column 3). When only 7 categories of amino acids were taken into account, the JTT and PAM models led either to consistent and significant detected pairs (PER, HMP, BLAC, MYO, IPP, ADK, API), to consistent and nonsignificant detected pairs (BBP and IL2), or to inconsistent conclusions. In the last case, five data sets showed significant cosubstitutions only with the JTT procedure (UCE, PAZ, ANV, RECA, DRN), and one only with PAM (TN1R). The fact that the JTT model seemed to statistically detect more cosubstitution events than the PAM model could be explained by the smaller heterogeneity between the substitution probabilities within the PAM matrix compared with those within the JTT one, which produces a much larger variance of the simulations with PAM than with JTT and affects the value of the threshold of detection. Indeed, analyzing the laws issued from the simulations, we observed a wider distribution of the number of cosubstitutions at pairs of sites using PAM. However, it is also possible that, in part, the difference in the results lies in the nature of the substitutions induced by PAM compared with that induced by JTT.

The ML procedure with the JTT model gave results slightly different from those given by the P procedure with the JTT or the PAM model of substitution. Only BBP, RECA, TN1R, and API were not significant. To assess how the use of the PAM model could affect the ML procedure, we performed for some cases a reconstruction of the ancestral states using ML/PAM. For PAZ and UCE, this led to a detection of 59 pairs instead of 78 (column 11 in table 2) and 455 instead of 541, respectively. This led to disagreements of the same magnitude as those obtained for the parsimony data.

Differences Between Data Sets
The differences observed between data sets concerning the detection of a significant number of cosubstituting pairs could be related to some distinctive features of the data selected for the analysis. No clear relationship can be extrapolated with the PAM index. The 6 data sets for which significant amounts of cosubstituting pairs were detected independent of the method employed all had PAM indices of less than 90, and the data set associated with the largest PAM index (BBP) was also the one for which all approaches led to nonsignificant detection. However, for PAZ, also with a large PAM index, some approaches led to significant detection. Furthermore, significant detection did not seem to be related to the number of sequences in the data sets.


    Conclusions
The present results show that the search for cosubstituting pairs depends on various parameters of the detection process. In particular, the significance of the detection is largely affected by the variability associated with the phylogenetic reconstruction. Despite the large influence of the model retained to describe the pattern of amino acid substitution and the implemented procedure used to statistically detect nonrandom cosubstitution pairs, the results, considering seven or four classes of amino acids, support the evidence of some general process that uses cosubstitutions as a compensatory mechanism related to the preservation of the physicochemical properties of the proteins. Indeed, for 6 data sets, significant numbers of cosubstituting pairs were detected independent of the parameterization of the search. However, the unambiguous identification of the coevolving sites remains a matter for future work. This is a condition for further progress in the understanding of how coevolution could be related to protein structure and function.


    Acknowledgements
We thank anonymous referees for many useful suggestions and S. Hazout for helpful discussions.


    Footnotes
 
Manolo Gouy, Reviewing Editor

1 Keywords: correlated substitutions, phylogeny, sequence alignment

2 Address for correspondence and reprints: Pierre Tufféry, Institut National de la Santé et de la Recherche Médicale U436, Université Paris 7, case 7113, 2 place Jussieu, 75251 Paris, France. E-mail: tuffery@urbb.jussieu.fr


    Literature Cited

    Altschuh, D., A. M. Lesk, A. C. Bloomer, and A. C. Klug. 1987. Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus. J. Mol. Biol. 193:693–707.

    Chelvanayagam, G., A. Eggenschwiler, L. Knecht, G. H. Gonnet, and S. A. Benner. 1997. An analysis of simultaneous variation in protein structures. Protein Eng. 10:307–316.

    Cunningham, C. W., K. E. Omland, and T. H. Oakley. 1998. Reconstructing ancestral character states: a critical reappraisal. TREE 13:361–366.

    Felsenstein, J. 1993. PHYLIP (phylogeny inference package). Version 3.5. Distributed by the author, Department of Genetics, University of Washington, Seattle.

    Gobel, U., C. Sander, R. Schneider, and A. Valencia. 1994. Correlated mutations and residue contact prediction. Proteins 18:309–317.

    Koshi, J., and R. A. Goldstein. 1996. Probabilistic reconstruction of ancestral protein sequences. J. Mol. Evol. 42:313–320.

    Maddison, W. P. 1990. A method for testing the correlated evolution of two binary characters: are gains or losses concentrated on certain branches of phylogenetic tree? Evolution 44:539–557.

    Neher, E. 1994. How frequent are correlated changes in families of protein sequences? Proc. Natl. Acad. Sci. USA 91:98–102.

    Olmea, A., and A. Valencia. 1997. Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Fold. Des. 2:S25–S32.

    Pagel, M. 1994. Detecting correlated evolution on phylogenies: a general method for comparative analysis of discrete characters. Proc. R. Soc. Lond. B Biol. Sci. 255:37–45.

    Pazos, F., M. Helmer-Citterich, G. Ausiello, and A. Valencia. 1997. Correlated mutations contain information about protein-protein interaction. J. Mol. Biol. 271:511–523.

    Pollock, D. D., and W. R. Taylor. 1997. Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. Protein Eng. 10:647–657.

    Pollock, D. D., W. R. Taylor, and N. Goldman. 1999. Coevolving protein residues: maximum likelihood identification and relationship to structure. J. Mol. Biol. 287:187–198.

    Rambaut, A., and N. Grassly. 1997. PSeq-Gen: an application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees. Cabios 13:559–560.

    Shindyalov, N., N. A. Kolchanov, and C. Sander. 1994. Can three dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng. 7:349–358.

    Swofford, D. 1993. PAUP. phylogenetic analysis using parsimony. Version 3.1.1. Illinois Natural History Survey, Champaign.

    Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.

    Tufféry, P., M. Durand, and P. Darlu. 1999. How possible is the detection of correlated mutations? Theor. Chem. Acc. 101:9–15.

    Yang, Z., S. Kumar, and M. Nei. 1995. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:1641–1650.

    Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Cabios 13:1555–1556.

    Zhang, J., and M. Nei. 1997. Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methods. Mol. Biol. Evol. 44:S139–S146.

Accepted for publication July 20, 2000.