Detection and reduction of evolutionary noise in correlated mutation analysis

Orly Noivirt1, Miriam Eisenstein2 and Amnon Horovitz1,3

Departments of 1Structural Biology and 2Chemical Research Support, Weizmann Institute of Science, Rehovot 76100, Israel

3 To whom correspondence should be addressed. E-mail: amnon.horovitz@weizmann.ac.il


    Abstract
 
Direct or indirect inter-residue interactions in proteins are often reflected by mutations at one site that compensate for mutations at another site. Various bioinformatic methods have been developed for detecting such correlated mutations in order to obtain information about intra- and inter-protein interactions. Here, we show by carrying out a correlated mutation analysis for non-interacting proteins that the signal due to inter-residue interactions is of similar magnitude to the ‘noise’ that arises from other evolutionary processes related to common ancestry. A new method for detecting correlated mutations is presented that reduces this evolutionary noise by taking into account evolutionary distances in the protein family. It is shown that this method yields better signal-to-noise ratios and can, therefore, better resolve correlated mutations that reflect true inter-residue interactions.

Keywords: bioinformatics/co-evolving residues/coordinated mutations/evolution/tree-based shuffling


    Introduction
 
Mutations that perturb protein structure at one site are often compensated for by mutations at other sites. Such coordinated mutations in proteins are thought to occur because there is a stronger selective pressure to maintain protein structure and function than sequence. Suppressor mutations revealed through genetic studies are examples of the operation of such compensatory mechanisms. It has often been assumed that compensatory mutations occur at positions near the site of perturbation (Altschuh et al., 1987). This motivated the development of bioinformatic methods for detecting correlated mutations as a source of distance information in protein structure prediction (Göbel et al., 1994; Neher, 1994; Ortiz et al., 1999). Correlated mutations may, however, also occur at distant positions, thus reflecting long-range interactions in proteins (Horovitz et al., 1994; Lockless and Ranganathan, 1999; Kass and Horovitz, 2002; Fleishman et al., 2004).

Bioinformatic methods for detecting correlated mutations consist of two main steps: (i) alignment of homologous sequences and (ii) identification of pairs of columns in the alignment in which there is a statistically significant tendency for mutations in one column to be accompanied by corresponding and usually different mutations in the other column. The results of such an analysis are found to depend on the way in which both steps are carried out since all the methods for detecting correlated mutations are sensitive, but to different extents, to the degree of sequence conservation in the alignment (Fodor and Aldrich, 2004). A second key problem that all the methods share is distinguishing signal from noise. In other words, it is necessary to differentiate between correlated mutations that reflect short- or long-range inter-residue interactions (because of selective pressure to maintain protein structure and/or function) and those that reflect other evolutionary processes related to common ancestry (Pollock et al., 1999; Larson et al., 2000; Wollenberg and Atchley, 2000) such as changes in codon usage or amino acid frequencies (referred to in this paper as ‘noise’). The need to take into account common ancestry in correlated mutation analysis has been recognized before. Shindyalov et al. (1994) incorporated the evolutionary tree structure in their formalism but without implementing it in their algorithm to filter out evolutionary noise. Larson et al. (2000) sought to eliminate artifactual covariations by discarding those found to arise from subsets of sequences with an average pairwise identity higher than the median of the distribution. In their study, separate sequence diversity thresholds were determined empirically for each system.
A different approach (Wollenberg and Atchley, 2000) to this problem was to compare the distribution of an inter-site mutual information statistic for an alignment of naturally occurring sequences with the distribution of this statistic for artificial sequence data generated using the parametric bootstrap from a random ancestral sequence, a given substitution matrix and the same tree. Correlated mutations in the set of artificial sequences can arise solely from common ancestry. Hence such a comparison enabled the probability that a pair of covarying sites with a certain value of the mutual information statistic did not result from common ancestry to be determined.

In this paper, we first provide an unambiguous demonstration that the level of noise due to common ancestry is of the same magnitude as that of the signal due to inter-residue interactions. We then describe a new method for detecting correlated mutations that improves the signal-to-noise ratio by determining for each individual pair of covarying sites in an alignment the likelihood that they did not result from common ancestry. Our method is similar in spirit to the method of Wollenberg and Atchley (2000) but with two key differences: (i) the amino acid composition of each position in the alignment is conserved in the artificial sequences; and (ii) the likelihood that a pair of covarying sites did not result from common ancestry is determined individually for every pair of sites in the alignment. Finally, we report results that show that the fraction of correlated mutations that reflect direct inter-residue interactions is enriched when this new method is used, thereby indicating that the signal has a physical basis that the noise lacks.


    Methods
 
Construction of sequence data sets

The SWISS-PROT database (Bairoch and Apweiler, 2000; Boeckmann et al., 2003) was searched for large families of proteins and an initial multiple sequence alignment (MSA) of each family was then carried out with version 1.82 of the CLUSTAL W program (Thompson et al., 1994) using its default parameters. Sequences found to have a low average pairwise sequence identity with all the other sequences in the alignment were eliminated and the remaining sequences were then realigned. This process was repeated until the average pairwise identity of all the sequences in the alignment was >45%. Correlated mutation analysis was carried out only on such MSAs that contained ≥50 sequences. Analysis of correlated mutations between families (in order to detect noise) was carried out by concatenation of sequences in the corresponding MSAs that are from the same organism. One such alignment, for example, consists of the sequence of adenylate kinase from organism X concatenated to the sequence of carbamoyl-phosphate synthase from the same organism X, the sequence of adenylate kinase from organism Y concatenated to the sequence of carbamoyl-phosphate synthase from the same organism Y, etc. We created 16 such alignments (see Table I) of sequences of different artificial chimeras that contain between 40 and 50 sequences each. All the alignments can be provided upon request.
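The two bookkeeping steps described above (measuring average pairwise identity and concatenating sequences from the same organism) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the dictionary input format keyed by organism, and the choice to compute identity only over columns where neither sequence has a gap are all assumptions.

```python
from itertools import combinations

GAP = "-"


def pairwise_identity(seq_a, seq_b):
    """Fraction of identical residues over columns where neither sequence has a gap.

    Assumption: the paper does not state how gaps enter the identity calculation.
    """
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != GAP and b != GAP]
    if not compared:
        return 0.0
    return sum(a == b for a, b in compared) / len(compared)


def average_pairwise_identity(seqs):
    """Mean identity over all pairs of aligned sequences (needs at least two)."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


def concatenate_by_organism(msa_a, msa_b):
    """Join two aligned families organism by organism.

    msa_a, msa_b: dicts mapping an organism name to its aligned sequence
    (hypothetical input format). Organisms missing from either family are dropped.
    """
    shared = sorted(set(msa_a) & set(msa_b))
    return {org: msa_a[org] + msa_b[org] for org in shared}
```

Dropping organisms that are absent from either family is one plausible reason why the concatenated alignments are smaller than the single-family alignments from which they are built.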


 
Table I. List of 16 non-interacting pairs of proteins used to test our noise reduction method

 
Detection of correlated mutations using random shuffling (i.e. without noise reduction)

Frequencies of all amino acids at all positions in the MSA containing N sequences were calculated. The expected number of sequences, N_EX, that contain amino acid A at position i and amino acid B at position j assuming no coupling between the two positions is given by N·f_A,i·f_B,j, where f_A,i and f_B,j are the frequencies of A and B at positions i and j, respectively. Each column j was shuffled randomly up to 2000 times (while leaving all other columns in the alignment including column i intact) and the observed number of sequences, N_OBS, that contain amino acid A at position i and amino acid B at position j in each shuffle was determined. A χ²(i, j) value for column i and each shuffle of column j was then calculated, as follows:

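$$\chi^2(i,j) = \sum_{n} \frac{\left(N_{n,\mathrm{OBS}} - N_{n,\mathrm{EX}}\right)^2}{N_{n,\mathrm{EX}}} \qquad (1)$$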
where n = kl is the number of different amino acid pairs which may be found at positions i and j given that k and l different kinds of amino acids are found at these two positions, respectively. The shuffles were carried out in three rounds and, to save computer time, a pair of positions was not analyzed further if the value of χ²(i, j) obtained without shuffling was not sufficiently high after the first or second round relative to the values obtained with shuffling. The number of shuffles in the first and second rounds and the choice of cutoff values had little effect on the final results. In the implementation here, a pair of positions was not analyzed further if the value of χ²(i, j) obtained without shuffling was found to be in the bottom 90% of all the values obtained after the first 100 shuffles. The procedure was then repeated for 900 additional random shuffles and the pair of positions was not considered further if the value of χ²(i, j) obtained without shuffling was now found to be in the bottom 95% of all the values. Finally, 1000 additional random shuffles were carried out and a P-value was assigned to columns i and j based on the value of χ²(i, j) obtained without shuffling relative to all the values of χ²(i, j) obtained with shuffling. Positions at which 10% or more of the sequences have a gap were discarded. Positions with a lower percentage of gaps were analyzed but sequences with a gap at position i and/or j (either before or after the shuffling) were excluded from the statistics of that pair of positions. This procedure was carried out in turn for each pair of positions i and j in the alignment. It should be noted that, regardless of whether the random or the tree-based (see below) shuffling method is used, no information is obtained about coupling between two positions if one or both of them are fully conserved since χ²(i, j) = 0.
Such positions in an MSA can, therefore, be eliminated before carrying out the correlated mutation analysis. Given a P-value of 0.005, we calculate that ~3% of the coupled positions that are detected using the random shuffling method are due to multiplicity (i.e. false discovery rate).
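The random shuffling test described above can be sketched in a few lines. This is a simplified illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, amino acid frequencies are computed only over sequences with no gap at either position, the three-round early stopping is omitted, and the P-value is taken to be the fraction of shuffles whose χ² is at least as large as the unshuffled value (the text does not spell out this definition).

```python
import random
from collections import Counter

GAP = "-"


def chi2_pair(col_i, col_j):
    """Chi-squared statistic of Equation 1 for two alignment columns."""
    # Sequences with a gap at either position are excluded for this pair.
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a != GAP and b != GAP]
    n_seqs = len(pairs)
    if n_seqs == 0:
        return 0.0
    count_i = Counter(a for a, _ in pairs)
    count_j = Counter(b for _, b in pairs)
    observed = Counter(pairs)
    chi2 = 0.0
    for a, n_a in count_i.items():
        for b, n_b in count_j.items():
            expected = n_a * n_b / n_seqs  # N * f_A,i * f_B,j
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    return chi2


def random_shuffle_p_value(col_i, col_j, n_shuffles=2000, seed=0):
    """Empirical P-value from random shuffles of column j (column i is left intact)."""
    rng = random.Random(seed)
    unshuffled = chi2_pair(col_i, col_j)
    shuffled = list(col_j)
    at_least_as_high = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if chi2_pair(col_i, shuffled) >= unshuffled:
            at_least_as_high += 1
    return at_least_as_high / n_shuffles
```

Because shuffling only permutes column j, the amino acid composition of both columns, and hence every expected count, is the same in each shuffle; only the observed pair counts change. A fully conserved column gives χ² = 0 in this sketch, in line with the remark above.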

Detection of correlated mutations using tree-based shuffling (i.e. with noise reduction)

This method is similar to that described above except that each shuffle for column j was generated by (i) randomly selecting 20% of the sequences in the alignment and (ii) carrying out pairwise permutations between each of the selected sequences and another sequence in the alignment chosen with a probability based on the evolutionary distance between them, as follows:

$$P(a,b) = \frac{1/r_{a,b}}{\sum_{i} 1/r_{a,i}} \qquad (2)$$
where P(a, b) is the probability that a randomly selected sequence a is shuffled with sequence b, r_a,b is the evolutionary distance between sequences a and b and Σ_i 1/r_a,i is the sum of reciprocals of the evolutionary distances between sequence a and all the other sequences in the alignment (the probability function P(a, b) = exp(−r_a,b)/Σ_i exp(−r_a,i) was also tested and found to yield similar results). Evolutionary distances between all the sequences in the MSA were calculated using the Tree-Puzzle software (Schmidt et al., 2002) and its default parameters but with the uniform rate heterogeneity model and the Dayhoff substitution matrix (Dayhoff et al., 1978). These distances are in units of expected fraction of amino acids changed such that 1 unit corresponds to 100 PAM, where 1 PAM is 1% amino acids changed (the relationship between PAM and % amino acids changed is not linear) (Dayhoff et al., 1978). Use of other amino acid substitution matrices and/or rate heterogeneity models was found to have little effect on the results of the correlated mutation analysis obtained with this method. The tree-based shuffles were carried out in several rounds as in the case of the random shuffles described above. Here, this was implemented by initially calculating a χ²(i, j) value (Equation 1) for each column i and 500 shuffles of column j. Pairs of positions for which the value of χ²(i, j) obtained without shuffling was found to be in the bottom 80% of all the values obtained with shuffling were not analyzed further. The procedure was then repeated twice, each time for 500 additional shuffles and pairs of positions were not considered further if the value of χ²(i, j) obtained for them without shuffling was found to be in the bottom 90% after the second repeat or bottom 93% after the third repeat.
Finally, 500 additional random shuffles were carried out and a P-value was assigned to columns i and j based on the value of χ²(i, j) obtained without shuffling relative to all the values of χ²(i, j) obtained with shuffling. Here, too, the number of shuffles in each round and the choice of cutoff values had little effect on the final results. Given a P-value of 0.005, <3% of the coupled positions that are detected using the tree-based shuffling method are due to multiplicity (i.e. false discovery rate).
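One tree-based shuffle of column j, as described above, might look like the sketch below. Several details are assumptions and not taken from the paper: the function names, the reading of a 'pairwise permutation' as swapping the column-j residues of the two sequences, the use of a full symmetric distance matrix as input, and the hypothetical `min_dist` floor that guards against division by zero when two sequences are identical.

```python
import random


def swap_probabilities(dist, a, min_dist=1e-6):
    """Equation 2: P(a, b) is proportional to 1 / r_a,b for every b other than a.

    dist: square matrix (list of lists) of evolutionary distances.
    min_dist: hypothetical floor for zero distances (an assumption).
    """
    weights = [0.0 if b == a else 1.0 / max(r, min_dist)
               for b, r in enumerate(dist[a])]
    total = sum(weights)
    return [w / total for w in weights]


def tree_based_shuffle(col_j, dist, fraction=0.2, rng=random):
    """Return one shuffled copy of column j.

    (i) 20% of the sequences are selected at random; (ii) each selected
    sequence swaps its residue with a partner drawn according to Equation 2.
    """
    shuffled = list(col_j)
    n = len(shuffled)
    n_selected = max(1, round(fraction * n))
    for a in rng.sample(range(n), n_selected):
        probs = swap_probabilities(dist, a)
        b = rng.choices(range(n), weights=probs, k=1)[0]
        shuffled[a], shuffled[b] = shuffled[b], shuffled[a]
    return shuffled
```

As with random shuffling, swaps only move residues within the column, so its amino acid composition is conserved. Because close relatives are the most likely swap partners and often carry the same residue, many swaps leave the column unchanged when the mutation pattern follows the tree.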

Analysis of correlated mutations in different elements of secondary structure

Information on the location of α-helices, β-strands and unstructured segments (that comprise ≥10 amino acids) in the sequences of the proteins in Table II (available as Supplementary data at PEDS online) with known three-dimensional structure was extracted from the Protein Data Bank (PDB). The PDB codes of these proteins are 1ad2, 1e4y:A, 1k7w:A, 1il2:A, 1fx0:A, 1kmh:B, 1ocz:A, 1e9i:A, 1aon:O, 4hhb:AB, 1fmt:A, 3pfk, 3pgk, 1set:A, 1m6j:A, 1oel:A and 1h1t:A. Correlated mutations found in protein families containing these proteins were then classified into those that are in α-helices, β-strands and unstructured segments (if both positions in the pairwise correlation are in the same secondary structure element) or other.
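The classification rule above, together with the normalized counts by sequence separation used later in the secondary structure analysis, can be sketched as follows. The function names and the input format, a list of (start, end, kind) segments with inclusive residue numbers, are assumptions made for illustration only.

```python
from collections import Counter


def element_of(pos, segments):
    """Return (segment index, kind) for the segment containing pos, or None."""
    for index, (start, end, kind) in enumerate(segments):
        if start <= pos <= end:
            return index, kind
    return None


def classify_pair(i, j, segments):
    """A pair gets a segment's kind only if both positions lie in that same segment."""
    elem_i = element_of(i, segments)
    elem_j = element_of(j, segments)
    if elem_i is not None and elem_i == elem_j:
        return elem_i[1]
    return "other"


def separation_profile(pairs, segments):
    """Fraction of coupled pairs at each sequence separation |i - j|, per class."""
    counts = {}
    for i, j in pairs:
        kind = classify_pair(i, j, segments)
        counts.setdefault(kind, Counter())[abs(i - j)] += 1
    return {
        kind: {sep: c / sum(hist.values()) for sep, c in hist.items()}
        for kind, hist in counts.items()
    }
```

Each class is normalized by its own total, so the fractions within one class sum to 1 and classes with very different numbers of coupled pairs can be compared on one scale.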

Analysis of distances between positions with covarying residues

Inter-residue distances between Cβ atoms (Zemla et al., 1997) (or Cα in the case of glycine) were calculated between all pairs of positions found to have covarying residues using the old method with random shuffling or the new method with the tree-based shuffling. In addition, distances between all pairs of residues in the proteins analyzed were calculated as a control (except for positions that were excluded from the correlated mutation analysis owing to high gap content or conservation). Histograms of these distances were created only for the non-homo-oligomeric proteins in Table II (Supplementary data), namely ribosomal 50S L1 protein (1ad2), adenylate kinase (1e4y:A), methionyl-tRNA formyl transferase (1fmt:A) and phosphoglycerate kinase (3pgk), since we wanted to avoid cases where it is not clear whether the distances that should be measured are within subunits, between subunits or both.


    Results and discussion
 
Here, we describe a method that increases the likelihood that correlated mutations which are detected reflect inter-residue interactions and not common ancestry. Assessing the performance of such a method is not trivial. For example, it is sometimes assumed that correlated mutations at positions that are distant from each other in space reflect common ancestry and not inter-residue interactions, but that may not be the case (Horovitz et al., 1994; Lockless and Ranganathan, 1999; Kass and Horovitz, 2002). In addition, correlated mutations at positions that are close to each other in space may reflect common ancestry and not inter-residue interactions as often assumed. MSAs of different concatenated non-interacting proteins were, therefore, generated in which correlated mutations between the two proteins are due to common ancestry only. The new method was tested by determining to what extent it can better detect correlated mutations within each of the proteins that are due to inter-residue interactions and common ancestry (signal and noise, respectively) as compared with correlated mutations between the proteins that are due to common ancestry only. In addition, the new method was tested by determining whether it leads to enrichment of the fraction of correlated mutations that reflect direct inter-residue interactions in the tertiary structure and in different secondary structure elements of proteins as compared with the old method.

Magnitude of evolutionary noise in correlated mutation analysis

The method of Kass and Horovitz (2002) (see also http://bioportal.weizmann.ac.il/cmutatd/) for detecting correlated mutations is based on the chi-squared test, which requires that results for a pair of positions i, j are discarded if (i) any N_n,EX is <1 or (ii) >20% of the N_n,EX are <5. Here, these two conditions were replaced by determining a P-value for χ²(i, j) from the χ²(i, j) value obtained without shuffling relative to the 2000 χ²(i, j) values obtained with shuffling column j with respect to column i. This statistical test, which is based on a distribution of χ²(i, j) values generated for every pair of positions, circumvents the need to assume that the chi-squared distribution holds, which is often not the case when one of the above two conditions is not met. This method was applied to 16 pairs of concatenated non-interacting proteins (Table I) in order to evaluate the extent of evolutionary noise in correlated mutation analysis. The criterion for deciding that there is no (direct or indirect) interaction between two proteins, α and β, was that thorough searching of PubMed did not reveal one (these pairs are used only to test the methods and, therefore, it is not crucial if it transpires in the future that a few of them are indeed interacting). The sequences of α_i and β_i from i organisms were aligned and concatenated and a search for correlated mutations within proteins α and β and between proteins α and β was then carried out. Surprisingly, little difference was found between the number or density (the number of correlated mutations found divided by the total number of possible correlated pairs of positions) of correlated mutations found within proteins and between proteins for all 16 pairs examined, despite the fact that proteins α and β do not interact (Figure 1a).
In contrast, shuffling of the sequences of β_i with respect to α_i (so that now α_i is no longer concatenated to β_i and instead is concatenated, for example, to β_j from another organism) eliminated almost completely correlated mutations between α and β but not within α and β (Figure 1b). These results show, therefore, that the signal in correlated mutation analysis due to inter-residue interactions is of the same magnitude as, or weaker than, the noise due to other evolutionary processes.



 
Fig. 1. Correlated mutation analysis of concatenated carbamoyl-phosphate synthase and adenylate kinase sequences without noise reduction. Concatenated sequences of carbamoyl-phosphate synthase (small chain) and adenylate kinase from the same organism (a) and after shuffling of the adenylate kinase sequences (b) were subjected to correlated mutation analysis using the method of Kass and Horovitz (2002) with random shuffling. Coupled positions with a P-value ≤0.005 found within carbamoyl-phosphate synthase (small chain) (I) or adenylate kinase (II) and between the two proteins (III) are shown as black dots. Sequence numbering corresponds to the sequence of adenylate kinase concatenated to the C-terminus of carbamoyl-phosphate synthase (small chain) after eliminating positions in both proteins that are conserved or with gaps in >10% of the sequences.

 
Reduction of evolutionary noise in correlated mutation analysis by tree-based shuffling

The method of Kass and Horovitz (2002) with random shuffling as described above was modified in order to improve the signal (due to inter-residue interactions)-to-noise (due to other evolutionary processes) ratio. In this new method, the probability for shuffling two sequences is not random but inversely proportional to their evolutionary distance. Consider, for example, the two evolutionary trees shown in Figure 2. In the case of the left tree, the pattern of coordinated mutations is correlated with the tree, i.e. the pairs AB and CD of amino acids are found only in the top and bottom branches of the tree, respectively. In the case of the right tree, however, such a correlation is absent since each of the pairs AB and CD is found in both branches of the tree. The contribution of evolutionary noise to the results of correlated mutation analysis is, therefore, expected to be larger for the pattern of coordinated mutations that is correlated with the evolutionary tree (Figure 2a). In the case of the left tree, shuffling of sequences with a probability that is inversely proportional to their evolutionary distance will cause little change in column j since pairs that are close in distance are also identical. The χ²(i, j) value obtained without shuffling column j relative to column i will, therefore, be relatively similar to the χ²(i, j) values obtained with shuffling and a high P-value will be assigned to that pair of positions. In the case of the right tree, however, shuffling of sequences with a probability that is inversely proportional to their evolutionary distance will cause a large change in column j since pairs that are close in distance are not necessarily identical. Hence the χ²(i, j) value obtained without shuffling column j relative to column i will tend to differ from the χ²(i, j) values obtained with shuffling and it is more likely that a low P-value will be assigned to that pair of positions.
The new method, therefore, assigns a higher score to patterns of coordinated mutations with a weak correlation to the evolutionary tree (Figure 2b) whereas other methods (see, for example, Lockless and Ranganathan, 1999; Kass and Horovitz, 2002; Dima and Thirumalai, 2004) assign the same score to both patterns.



 
Fig. 2. Examples of evolutionary trees for which a correlation between patterns of coordinated mutations is present (a) or absent (b). In the left tree, the amino acids A at position i and B at position j found in the top branch are replaced by the amino acids C and D, respectively, in the bottom branch. In the tree on the right, each of the pairs AB and CD is found in both branches.

 
The new method was tested on the 16 different pairs of concatenated non-interacting proteins α and β (Table I). The probability for shuffling two sequences was determined from a distance matrix for the aligned and concatenated sequences of α_i and β_i from i organisms. It should be mentioned that in such distance matrices, the influence of the two proteins may not be equal and depends on their lengths. It may be seen in Figure 3 (panel III) that application of the new method to the pair of proteins carbamoyl-phosphate synthase and adenylate kinase, for example, leads to a dramatic decrease in the number of inter-protein correlated mutations. The number of intra-protein correlated mutations is also found to decrease for both proteins (Figure 3, panels I and II). Importantly, the ratio between the density of intra-protein correlated mutations (due to signal and noise) and the density of inter-protein correlated mutations (due to noise only) for this pair of proteins changes from ~1.4 (0.180/0.126) when using the method of Kass and Horovitz (2002) with random shuffling to ~5.2 (0.00113/0.00022) when using the new method (in the case of both methods a P-value ≤0.005 was used). Hence an approximately 4-fold improvement in the signal-to-noise ratio was obtained in this case. This estimate can, however, be somewhat misleading since the intra-protein correlated mutations were counted for the two proteins together. A separate calculation for carbamoyl-phosphate synthase shows that the ratio between the density of intra-protein correlated mutations and the density of inter-protein correlated mutations changes from ~1.5 (0.185/0.126) when using the method of Kass and Horovitz (2002) with random shuffling to ~6.2 (0.00135/0.00022) when using the new method.
In contrast, the separate calculation for adenylate kinase shows that the ratio between the density of intra-protein correlated mutations and the density of inter-protein correlated mutations changes from ~1.3 (0.164/0.126) when using the method of Kass and Horovitz (2002) with random shuffling to ~1.2 (0.00026/0.00022) when using the new method. The separate calculations for the two proteins suggest, therefore, that the signal due to inter-residue interactions is absent (or very weak) in the case of adenylate kinase but clearly present in the case of carbamoyl-phosphate synthase.
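The signal-to-noise arithmetic above can be re-derived from the densities quoted in the text. The short sketch below does so; the variable and function names are hypothetical, and because the quoted densities are rounded, the recomputed ratios may differ slightly from the values given in the text.

```python
# (intra-protein density, inter-protein density) at P <= 0.005, as quoted in the text.
DENSITIES = {
    "both proteins": {"random": (0.180, 0.126), "tree": (0.00113, 0.00022)},
    "carbamoyl-phosphate synthase": {"random": (0.185, 0.126), "tree": (0.00135, 0.00022)},
    "adenylate kinase": {"random": (0.164, 0.126), "tree": (0.00026, 0.00022)},
}


def signal_to_noise(intra, inter):
    """Ratio of intra-protein density (signal plus noise) to inter-protein density (noise)."""
    return intra / inter


def fold_improvement(entry):
    """Ratio obtained with tree-based shuffling divided by the ratio with random shuffling."""
    return signal_to_noise(*entry["tree"]) / signal_to_noise(*entry["random"])


for name, entry in DENSITIES.items():
    old = signal_to_noise(*entry["random"])
    new = signal_to_noise(*entry["tree"])
    print(f"{name}: {old:.2f} -> {new:.2f} ({fold_improvement(entry):.2f}-fold)")
```

With these rounded inputs, the improvement is somewhat above 4-fold for carbamoyl-phosphate synthase alone, somewhat below 4-fold for the two proteins taken together, and slightly below 1 for adenylate kinase, which is consistent with the interpretation given above.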



 
Fig. 3. Correlated mutation analysis of concatenated carbamoyl-phosphate synthase and adenylate kinase sequences with noise reduction. Sequences of carbamoyl-phosphate synthase (small chain) and adenylate kinase from the same organism were concatenated and subjected to a correlated mutations analysis using the new method described in this paper. Coupled positions with a P-value ≤0.005 found within carbamoyl-phosphate synthase (small chain) (I) or adenylate kinase (II) and between the two proteins (III) are shown as black dots. Sequence numbering corresponds to the sequence of adenylate kinase concatenated to the C-terminus of carbamoyl-phosphate synthase (small chain) after eliminating positions in both proteins that are conserved or with gaps in >10% of the sequences.

 
A comparison of the results of correlated mutations analysis of the 16 pairs of non-interacting proteins (Table I) using the evolutionary tree-based shuffling method (the new method) and the method of Kass and Horovitz (2002) with random shuffling (the old method) is shown in Figure 4. The ratio between the density of intra-protein correlated mutations and the density of inter-protein correlated mutations obtained with the new method is plotted against the same ratio obtained with the old method for each pair of proteins. It may be seen that the results are consistently above the line with a slope of unity, indicating that the new method leads to noise reduction. In three out of the 16 pairs, however, the improvement in the signal-to-noise ratio was minimal. In one of these cases, seryl-tRNA synthetase and the small chain of carbamoyl-phosphate synthase, the lack of improvement may be due to a surprising sequence similarity between the two proteins with a probability of 0.0646 of having arisen by chance (Pearson, 1996). We do not yet have an explanation for the lack of improvement in the case of the two pairs enolase–ribosomal protein L1 and S-adenosylmethionine synthetase–phosphoribosyl pyrophosphate synthetase. In general, however, there appears to be an inverse correlation (r = 0.63) between the value of the tree-correlation coefficient and the extent of improvement in the signal-to-noise ratio using our new method (Table I). This might initially seem surprising since a low value of the tree-correlation coefficient indicates that the sequences of α are already shuffled to some extent with respect to those of β and, therefore, the scope for improvement in the signal-to-noise ratio using the new method is limited in advance. However, the tree-based shuffling is not efficient for one or both of the proteins when the value of the tree-correlation coefficient is low and, therefore, the evolutionary noise in the intra-protein correlations is high.
Hence the ratio of inter-protein noise reduction to intra-protein noise reduction is likely to be higher for α,β pairs with a small tree-correlation coefficient.



 
Fig. 4. Comparison of the results of correlated mutations analysis of non-interacting protein pairs using the random and the evolutionary tree-based shuffling methods. The ratio between the density of intra-protein correlated mutations and the density of inter-protein correlated mutations obtained with the tree-based shuffling method (the new method) is plotted against the same ratio obtained with random shuffling (the old method) for each of the 16 pairs of non-interacting proteins in Table II (Supplementary data). Results above and below the black line with a slope of unity indicate noise reduction and enhancement, respectively. The errors due to the shuffling were calculated by repeating each calculation three times. All the positions reported to be coupled have a P-value ≤0.005.

 
Enhancement of signal due to residue–residue interactions in correlated mutation analysis

The results described above show that application of the new evolutionary tree-based shuffling method to concatenated non-interacting proteins leads to a reduction of the evolutionary noise that is reflected in inter-protein correlated mutations. This implies that the new method should also lead to enrichment of intra-protein correlated mutations that reflect residue–residue interactions. Direct evidence for this conclusion was obtained by analyzing the separation in sequence of correlated positions found in different types of secondary structure elements (Figure 5) and the distance distribution of correlated positions (Figure 6) using the new method in comparison with the old method. The correlated mutations in α-helices obtained with the new method show two clear peaks corresponding to respective separations in sequence of four and eight residues (Figure 5a) that are poorly resolved in the case of the results obtained with the old method (Figure 5b). Similarly, the results for correlated mutations in β-strands obtained with the new method show a clear peak that corresponds to a separation in sequence of two residues (Figure 5a) that is poorly resolved in the case of the results obtained with the old method (Figure 5b). Such separations in sequence are expected for direct inter-residue interactions in α-helices and β-strands. A control for these results is the finding that no major peak is observed in the case of the other secondary structure elements (turns, loops and unstructured segments) using both the new and the old methods. In addition, it may be seen in Figure 6 that, in the case of the four non-homo-oligomeric proteins examined, the fraction of pairs of positions separated by a short distance is enriched significantly when considering pairs of correlated positions revealed by the new method relative to pairs of correlated positions revealed by the old method or all pairs of positions in the proteins.
Taken together, the results in Figures 5 and 6, therefore, indicate that a certain fraction of correlated mutations reflects direct inter-residue interactions and that this fraction is enriched significantly when using the new method. Detection of such inter-residue interactions can assist, for example, in protein structure predictions and in revealing protein–protein interactions.



 
Fig. 5. Analysis of correlated mutations in different elements of secondary structure. The number of correlated mutations with a P-value ≤0.01 in α-helices, β-strands and all other secondary structural elements (turns, loops and unstructured segments) found using the method of Kass and Horovitz (2002) with evolutionary tree-based shuffling (a) or random shuffling (b) is shown as a function of the distance in sequence separating the coupled positions. The numbers were normalized relative to the total number of correlated mutations in each of the secondary structural elements. For further details, see Methods.

 


Fig. 6. Distributions of spatial distances between positions with covarying residues. Histograms of inter-residue distances between Cβ atoms (Zemla et al., 1997) (or Cα in the case of glycine) are shown for all pairs of positions found to have covarying residues with a P-value <0.01 using the old method with random shuffling (in gray) or the new method with tree-based shuffling (in black). Also shown are histograms of all the pairwise distances between residues in the proteins analyzed (in white). Histograms of these distances were created only for the monomeric proteins: ribosomal 50S L1 protein (1ad2), adenylate kinase (1e4y:A), methionyl-tRNA formyl transferase (1fmt:A) and phosphoglycerate kinase (3pgk). See Methods for further details.
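The distance measure named in the legend to Fig. 6 (Cβ–Cβ, with Cα substituted for glycine) can be sketched as follows. The residue representation used here, a dictionary holding a three-letter residue name and a mapping from atom names to coordinates, is an assumption made for illustration; the study's own parsing of the structure files is not described in this section.

```python
import math


def representative_atom(residue):
    """Return C-beta coordinates, or C-alpha for glycine (no C-beta)."""
    atom_name = "CA" if residue["name"] == "GLY" else "CB"
    return residue["atoms"][atom_name]


def residue_distance(res_a, res_b):
    """Euclidean distance between the representative atoms of two residues."""
    return math.dist(representative_atom(res_a), representative_atom(res_b))


# Hypothetical residues with made-up coordinates (illustrative only).
ala = {"name": "ALA", "atoms": {"CA": (1.0, 0.0, 0.0), "CB": (0.0, 0.0, 0.0)}}
gly = {"name": "GLY", "atoms": {"CA": (3.0, 4.0, 0.0)}}

print(residue_distance(ala, gly))
```

Histograms like those in Fig. 6 would then be built from these distances for the correlated pairs and, separately, for all pairs of positions.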

 
Conclusions

Correlated mutation analysis of concatenated non-interacting proteins using the method of Kass and Horovitz (2002) with random shuffling revealed many inter-protein coordinated mutations that reflect evolutionary noise and not direct or indirect inter-residue interactions. Replacing the random shuffling step with evolutionary tree-based shuffling (which may also be useful in other, unrelated applications) was found to reduce evolutionary noise and, thus, increase the signal-to-noise ratio. This new method is similar in spirit to the method of Wollenberg and Atchley (2000) but has two important advantages: (i) the amino acid composition of each position in the alignment is conserved in the artificial sequences and (ii) the likelihood that a pair of covarying sites did not result from common ancestry is determined individually for every pair of sites in the alignment. Evidence for the increase in signal when the new method is applied is provided by the observation that the fraction of correlated mutations that reflect direct inter-residue interactions in the tertiary structure and in different structural elements of proteins is enriched. Hence, correlated mutations at distant positions revealed by the new method are also more likely to have physical significance and to reflect long-range energetic coupling in proteins.
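The two properties listed above, conservation of the amino acid composition at each position and a significance estimate computed separately for every pair of sites, can be illustrated with a minimal sketch of a shuffling test. This sketch uses plain random shuffling of one alignment column, not the tree-based shuffling of the new method, whose constraints are defined in Methods and are not reproduced here. The statistic `covariation_score` is a placeholder chosen for illustration and is not the statistic of Kass and Horovitz (2002); all function names and the example columns are hypothetical.

```python
import random
from collections import Counter


def shuffle_column(column, rng):
    """Permute one alignment column; its amino acid composition is unchanged."""
    shuffled = list(column)
    rng.shuffle(shuffled)
    return shuffled


def covariation_score(col_a, col_b):
    """Placeholder statistic (assumption): sum of squared joint counts."""
    return sum(n * n for n in Counter(zip(col_a, col_b)).values())


def empirical_p_value(col_a, col_b, n_shuffles=1000, seed=0):
    """Fraction of shuffles scoring at least as high as the observed pair."""
    rng = random.Random(seed)
    observed = covariation_score(col_a, col_b)
    hits = sum(
        covariation_score(col_a, shuffle_column(col_b, rng)) >= observed
        for _ in range(n_shuffles)
    )
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


# Two hypothetical, perfectly covarying columns (illustrative only).
col_a = list("AAAACCCC")
col_b = list("DDDDEEEE")
p = empirical_p_value(col_a, col_b)
print(f"empirical P-value: {p:.3f}")
```

Because each pair of columns gets its own null distribution, the P-value is specific to that pair; a tree-based variant would additionally restrict which sequences may exchange residues according to their evolutionary distances.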


    Acknowledgements
This work was supported by The Israel Science Foundation. A.H. is an incumbent of the Carl and Dorothy Bennett Professorial Chair in Biochemistry. We thank Dr Ilan and Dafna Tsafrir and Professor Eytan Domany for many useful discussions and Professor Ron Unger for critical reading of the manuscript.


    References
Altschuh,D., Lesk,A.M., Bloomer,A.C. and Klug,A. (1987) J. Mol. Biol., 193, 693–707.

Bairoch,A. and Apweiler,R. (2000) Nucleic Acids Res., 28, 45–48.

Boeckmann,B. et al. (2003) Nucleic Acids Res., 31, 365–370.

Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, DC, pp. 345–358.

Dima,R.I. and Thirumalai,D. (2004) Bioinformatics, 20, 2345–2354.

Fleishman,S.J., Yifrach,O. and Ben-Tal,N. (2004) J. Mol. Biol., 340, 307–318.

Fodor,A.A. and Aldrich,R.W. (2004) Proteins, 56, 211–221.

Göbel,U., Sander,C., Schneider,R. and Valencia,A. (1994) Proteins, 18, 309–317.

Goh,C.S., Bogan,A.A., Joachimiak,M., Walther,D. and Cohen,F.E. (2000) J. Mol. Biol., 299, 283–293.

Horovitz,A., Bochkareva,E.S., Yifrach,O. and Girshovich,A.S. (1994) J. Mol. Biol., 238, 133–138.

Kass,I. and Horovitz,A. (2002) Proteins, 48, 611–617.

Larson,S.M., Di Nardo,A.A. and Davidson,A.R. (2000) J. Mol. Biol., 303, 433–446.

Lockless,S.W. and Ranganathan,R. (1999) Science, 286, 295–299.

Neher,E. (1994) Proc. Natl Acad. Sci. USA, 91, 98–102.

Ortiz,A.R., Kolinski,A., Rotkiewicz,P., Ilkowski,B. and Skolnick,J. (1999) Proteins, Suppl. 3, 177–185.

Pearson,W.R. (1996) Methods Enzymol., 266, 227–258.

Pollock,D.D., Taylor,W.R. and Goldman,N. (1999) J. Mol. Biol., 287, 187–198.

Schmidt,H.A., Strimmer,K., Vingron,M. and von Haeseler,A. (2002) Bioinformatics, 18, 502–504.

Shindyalov,I.N., Kolchanov,N.A. and Sander,C. (1994) Protein Eng., 7, 349–358.

Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 4673–4680.

Wollenberg,K.R. and Atchley,W.R. (2000) Proc. Natl Acad. Sci. USA, 97, 3288–3291.

Zemla,A., Venclovas,C., Reinhardt,A., Fidelis,K. and Hubbard,T.J. (1997) Proteins, Suppl. 1, 140–150.

Received April 19, 2005; accepted April 21, 2005.

Edited by Valerie Daggett




