Departments of 1Structural Biology and 2Chemical Research Support, Weizmann Institute of Science, Rehovot 76100, Israel
3 To whom correspondence should be addressed. E-mail: amnon.horovitz{at}weizmann.ac.il
Abstract |
---|
Keywords: bioinformatics/co-evolving residues/coordinated mutations/evolution/tree-based shuffling
Introduction |
---|
Bioinformatic methods for detecting correlated mutations consist of two main steps: (i) alignment of homologous sequences and (ii) identification of pairs of columns in the alignment in which there is a statistically significant tendency for mutations in one column to be accompanied by corresponding and usually different mutations in the other column. The results of such an analysis are found to depend on the way in which both steps are carried out since all the methods for detecting correlated mutations are sensitive, but to different extents, to the degree of sequence conservation in the alignment (Fodor and Aldrich, 2004). A second key problem that all the methods share is distinguishing signal from noise. In other words, it is necessary to differentiate between correlated mutations that reflect short- or long-range inter-residue interactions (because of selective pressure to maintain protein structure and/or function) and those that reflect other evolutionary processes related to common ancestry (Pollock et al., 1999; Larson et al., 2000; Wollenberg and Atchley, 2000) such as changes in codon usage or amino acid frequencies (referred to in this paper as noise). The need to take into account common ancestry in correlated mutation analysis has been recognized before. Shindyalov et al. (1994) incorporated the evolutionary tree structure in their formalism but without implementing it in their algorithm to filter out evolutionary noise. Larson et al. (2000) sought to eliminate artifactual covariations by discarding those found to arise from subsets of sequences with an average pairwise identity higher than the median of the distribution. In their study, separate sequence diversity thresholds were determined empirically for each system. A different approach to this problem (Wollenberg and Atchley, 2000) was to compare the distribution of an inter-site mutual information statistic for an alignment of naturally occurring sequences with the distribution of this statistic for artificial sequence data generated using the parametric bootstrap from a random ancestral sequence, a given substitution matrix and the same tree. Correlated mutations in the set of artificial sequences can arise solely from common ancestry. Hence such a comparison enabled the probability that a pair of covarying sites with a certain value of the mutual information statistic did not result from common ancestry to be determined.
In this paper, we first provide an unambiguous demonstration that the level of noise due to common ancestry is of the same magnitude as that of the signal due to inter-residue interactions. We then describe a new method for detecting correlated mutations that improves the signal-to-noise ratio by determining for each individual pair of covarying sites in an alignment the likelihood that they did not result from common ancestry. Our method is similar in spirit to the method of Wollenberg and Atchley (2000) but with two key differences: (i) the amino acid composition of each position in the alignment is conserved in the artificial sequences; and (ii) the likelihood that a pair of covarying sites did not result from common ancestry is determined individually for every pair of sites in the alignment. Finally, we report results that show that the fraction of correlated mutations that reflect direct inter-residue interactions is enriched when this new method is used, thereby indicating that the signal has a physical basis that the noise lacks.
Methods |
---|
The SWISS-PROT database (Bairoch and Apweiler, 2000; Boeckmann et al., 2003) was searched for large families of proteins and an initial multiple sequence alignment (MSA) of each family was then carried out with version 1.82 of the CLUSTAL W program (Thompson et al., 1994) using its default parameters. Sequences found to have a low average pairwise sequence identity with all the other sequences in the alignment were eliminated and the remaining sequences were then realigned. This process was repeated until the average pairwise identity of all the sequences in the alignment was >45%. Correlated mutation analysis was carried out only on such MSAs that contained ≥50 sequences. Analysis of correlated mutations between families (in order to detect noise) was carried out by concatenation of sequences in the corresponding MSAs that are from the same organism. One such alignment, for example, consists of the sequence of adenylate kinase from organism X concatenated to the sequence of carbamoyl-phosphate synthase from the same organism X, the sequence of adenylate kinase from organism Y concatenated to the sequence of carbamoyl-phosphate synthase from the same organism Y, etc. We created 16 such alignments (see Table I) of sequences of different artificial chimeras that contain between 40 and 50 sequences each. All the alignments can be provided upon request.
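The concatenation step described above can be sketched as follows. This is only an illustration under stated assumptions: the dictionary layout (organism name mapped to an aligned sequence), the function name `concatenate_by_organism` and the toy sequences are all hypothetical and are not taken from the alignments used in this work.

```python
def concatenate_by_organism(msa_a, msa_b):
    """Build chimeric sequences from two aligned protein families.

    msa_a, msa_b: dicts mapping an organism name to its aligned sequence
    (a hypothetical layout chosen for this sketch). Only organisms present
    in both alignments yield a chimera.
    """
    shared = sorted(set(msa_a) & set(msa_b))
    return {organism: msa_a[organism] + msa_b[organism] for organism in shared}


# Toy, made-up aligned fragments standing in for two protein families.
family_a = {"organism X": "MRI-L", "organism Y": "MNLVL", "organism Z": "MEEKL"}
family_b = {"organism X": "MPKR", "organism Y": "MPKR", "organism W": "MATK"}

chimeras = concatenate_by_organism(family_a, family_b)
```

Organisms that appear in only one family are dropped, which is one plausible reason why a chimeric alignment can contain fewer sequences than either parent alignment.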
Frequencies of all amino acids at all positions in the MSA containing N sequences were calculated. The expected number of sequences, $N_{\mathrm{EX}}$, that contain amino acid A at position i and amino acid B at position j, assuming no coupling between the two positions, is given by $N f_{A,i} f_{B,j}$, where $f_{A,i}$ and $f_{B,j}$ are the frequencies of A and B at positions i and j, respectively. Each column j was shuffled randomly up to 2000 times (while leaving all other columns in the alignment, including column i, intact) and the observed number of sequences, $N_{\mathrm{OBS}}$, that contain amino acid A at position i and amino acid B at position j in each shuffle was determined. A $\chi^2(i,j)$ value for column i and each shuffle of column j was then calculated, as follows:
$$\chi^2(i,j) = \sum_n \frac{(N_{n,\mathrm{OBS}} - N_{n,\mathrm{EX}})^2}{N_{n,\mathrm{EX}}} \qquad (1)$$
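A minimal Python sketch of the statistic and the random shuffling described above is given here for illustration. It assumes that the sum runs over every pair of amino acids present in the two columns; the function names `chi2_pair` and `shuffled_chi2` and the representation of columns as strings or lists are hypothetical choices, not the implementation used in this work.

```python
import random
from collections import Counter


def chi2_pair(col_i, col_j):
    """Chi-squared statistic for two alignment columns of equal length."""
    n = len(col_i)
    counts_i = Counter(col_i)
    counts_j = Counter(col_j)
    observed = Counter(zip(col_i, col_j))
    total = 0.0
    for a, count_a in counts_i.items():
        for b, count_b in counts_j.items():
            # Expected count with no coupling: N * f_A,i * f_B,j
            n_ex = count_a * count_b / n
            n_obs = observed.get((a, b), 0)
            total += (n_obs - n_ex) ** 2 / n_ex
    return total


def shuffled_chi2(col_i, col_j, n_shuffles=2000, rng=None):
    """Chi-squared values after randomly permuting column j only."""
    rng = rng or random.Random()
    values = []
    for _ in range(n_shuffles):
        permuted = list(col_j)
        rng.shuffle(permuted)  # column i and all other columns stay intact
        values.append(chi2_pair(col_i, permuted))
    return values
```

Because shuffling only permutes the entries of column j, the amino acid composition of each column, and hence every expected count, is unchanged from shuffle to shuffle.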
Detection of correlated mutations using tree-based shuffling (i.e. with noise reduction)
This method is similar to that described above except that each shuffle for column j was generated by (i) randomly selecting 20% of the sequences in the alignment and (ii) carrying out pairwise permutations between each of the selected sequences and another sequence in the alignment chosen with a probability based on the evolutionary distance between them, as follows:
![]() | (2) |
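The tree-based shuffle can be sketched as below. The exact form of the selection probability is not reproduced here, so this sketch assumes a weight of 1/d for a partner at evolutionary distance d, normalised over all other sequences, in line with the statement elsewhere in the text that the probability is inversely proportional to evolutionary distance. The function name `tree_based_shuffle` and the distance-matrix layout are hypothetical, and all off-diagonal distances are assumed to be positive.

```python
import random


def tree_based_shuffle(col_j, dist, fraction=0.2, rng=None):
    """Return one distance-weighted shuffle of alignment column j.

    col_j: sequence of residues, one per aligned sequence.
    dist:  square matrix (list of lists) of positive pairwise evolutionary
           distances; dist[k][m] is the distance between sequences k and m.
    """
    rng = rng or random.Random()
    n = len(col_j)
    shuffled = list(col_j)
    n_selected = max(1, round(fraction * n))
    # (i) randomly select a fraction of the sequences
    for k in rng.sample(range(n), n_selected):
        partners = [m for m in range(n) if m != k]
        # Assumed weighting: closer sequences are more likely partners.
        weights = [1.0 / dist[k][m] for m in partners]
        partner = rng.choices(partners, weights=weights, k=1)[0]
        # (ii) pairwise permutation of the two residues in column j
        shuffled[k], shuffled[partner] = shuffled[partner], shuffled[k]
    return shuffled
```

Since every step is a swap within column j, the amino acid composition of the column is conserved, as in the random shuffling.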
Analysis of correlated mutations in different elements of secondary structure
Information on the location of α-helices, β-strands and unstructured segments (that comprise ≥10 amino acids) in the sequences of the proteins in Table II (available as Supplementary data at PEDS online) with known three-dimensional structure was extracted from the Protein Data Bank (PDB). The PDB codes of these proteins are 1ad2, 1e4y:A, 1k7w:A, 1il2:A, 1fx0:A, 1kmh:B, 1ocz:A, 1e9i:A, 1aon:O, 4hhb:AB, 1fmt:A, 3pfk, 3pgk, 1set:A, 1m6j:A, 1oel:A and 1h1t:A. Correlated mutations found in protein families containing these proteins were then classified into those that are in α-helices, β-strands and unstructured segments (if both positions in the pairwise correlation are in the same secondary structure element) or other.
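The classification rule above (both positions must fall within the same secondary structure element, otherwise the pair is labelled "other") can be illustrated with a short sketch. The tuple layout `(start, end, kind)` with inclusive residue numbers and the function name `classify_pair` are assumptions made for this example only.

```python
def classify_pair(i, j, elements):
    """Classify a correlated pair of positions by secondary structure.

    elements: list of (start, end, kind) tuples with inclusive residue
    ranges, where kind is "helix", "strand" or "unstructured"
    (a hypothetical layout for this sketch).
    """
    for start, end, kind in elements:
        if start <= i <= end and start <= j <= end:
            return kind  # both positions lie in the same element
    return "other"


# Made-up element boundaries for illustration.
elements = [(5, 20, "helix"), (25, 31, "strand"), (40, 55, "unstructured")]
```

With these made-up boundaries, positions 7 and 11 would be classified as a helix pair, whereas positions 7 and 27 span two different elements and fall into "other".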
Analysis of distances between positions with covarying residues
Inter-residue distances between Cβ atoms (Zemla et al., 1997) (or Cα in the case of glycine) were calculated between all pairs of positions found to have covarying residues using the old method with random shuffling or the new method with the tree-based shuffling. In addition, distances between all pairs of residues in the proteins analyzed were calculated as a control (except for positions that were excluded from the correlated mutation analysis owing to high gap content or conservation). Histograms of these distances were created only for the non-homo-oligomeric proteins in Table II (Supplementary data): ribosomal 50S L1 protein (1ad2), adenylate kinase (1e4y:A), methionyl-tRNA formyl transferase (1fmt:A) and phosphoglycerate kinase (3pgk), since we wanted to avoid cases where it is not clear whether the distances that should be measured are within subunits, between subunits or both.
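The distance measure described above can be sketched as follows, assuming that atomic coordinates have already been read from a PDB entry into a simple dictionary. The residue layout and the function names `representative_atom` and `inter_residue_distance` are hypothetical; parsing of the PDB file itself is not shown.

```python
import math


def representative_atom(residue):
    """Return the C-beta coordinates, or C-alpha for glycine.

    residue: dict with a three-letter "name" and an "atoms" dict mapping
    PDB atom names to (x, y, z) tuples (a hypothetical layout).
    """
    atom_name = "CA" if residue["name"] == "GLY" else "CB"
    return residue["atoms"][atom_name]


def inter_residue_distance(res_a, res_b):
    """Euclidean distance between the representative atoms of two residues."""
    return math.dist(representative_atom(res_a), representative_atom(res_b))


# Made-up coordinates for illustration.
ala = {"name": "ALA", "atoms": {"CA": (0.0, 0.0, 0.0), "CB": (1.5, 0.0, 0.0)}}
gly = {"name": "GLY", "atoms": {"CA": (1.5, 4.0, 0.0)}}
```

`math.dist` requires Python 3.8 or later.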
Results and discussion |
---|
Magnitude of evolutionary noise in correlated mutation analysis
The method of Kass and Horovitz (2002) (see also http://bioportal.weizmann.ac.il/cmutatd/) for detecting correlated mutations is based on the chi-squared test, which requires that results for a pair of positions i,j are discarded if (i) any $N_{n,\mathrm{EX}}$ is <1 or (ii) >20% of the $N_{n,\mathrm{EX}}$ are <5. Here, these two conditions were replaced by determining a P-value for $\chi^2(i,j)$ from the $\chi^2(i,j)$ value obtained without shuffling relative to the 2000 $\chi^2(i,j)$ values obtained with shuffling column j with respect to column i. This statistical test, which is based on a distribution of $\chi^2(i,j)$ values generated for every pair of positions, circumvents the need to assume that the chi-squared distribution holds, as is often not the case when one of the above two conditions is not met. This method was applied to 16 pairs of concatenated non-interacting proteins (Table I) in order to evaluate the extent of evolutionary noise in correlated mutation analysis. The criterion for deciding that there is no (direct or indirect) interaction between two proteins, α and β, was that thorough searching of PubMed did not reveal one (these pairs are used only to test the methods and, therefore, it is not crucial if it transpires in the future that a few of them are indeed interacting). The sequences of $\alpha_i$ and $\beta_i$ from i organisms were aligned and concatenated and a search for correlated mutations within proteins α and β and between proteins α and β was then carried out. Surprisingly, little difference was found between the number or density (the number of correlated mutations found divided by the total number of possible correlated pairs of positions) of correlated mutations found within proteins and between proteins for all 16 pairs examined, despite the fact that proteins α and β do not interact (Figure 1a). In contrast, shuffling of the sequences of $\beta_i$ with respect to $\alpha_i$ (so that now $\alpha_i$ is no longer concatenated to $\beta_i$ and instead is concatenated, for example, to $\beta_j$ from another organism) almost completely eliminated correlated mutations between α and β but not within α and β (Figure 1b). These results show, therefore, that the signal in correlated mutation analysis due to inter-residue interactions is of the same magnitude as, or weaker than, the noise due to other evolutionary processes.
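The shuffling-based P-value described above can be sketched in a few lines. The text does not spell out the exact definition, so this sketch assumes the P-value is the fraction of shuffled chi-squared values that are at least as large as the value obtained without shuffling; the function name `empirical_p_value` is hypothetical.

```python
def empirical_p_value(observed, shuffled_values):
    """Fraction of shuffled statistics that reach or exceed the observed one.

    observed:        statistic for the unshuffled pair of columns.
    shuffled_values: non-empty list of statistics from the shuffles
                     (2000 per pair in the text).
    """
    at_least_as_large = sum(1 for value in shuffled_values if value >= observed)
    return at_least_as_large / len(shuffled_values)
```

Under this assumed definition, a pair whose unshuffled value is rarely matched by the shuffles receives a low P-value, without any appeal to the theoretical chi-squared distribution.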
The method of Kass and Horovitz (2002) with random shuffling as described above was modified in order to improve the signal (due to inter-residue interactions)-to-noise (due to other evolutionary processes) ratio. In this new method, the probability for shuffling two sequences is not random but inversely proportional to their evolutionary distance. Consider, for example, the two evolutionary trees shown in Figure 2. In the case of the left tree, the pattern of coordinated mutations is correlated with the tree, i.e. the pairs AB and CD of amino acids are found only in the top and bottom branches of the tree, respectively. In the case of the right tree, however, such a correlation is absent since each of the pairs AB and CD is found in both branches of the tree. The contribution of evolutionary noise to the results of correlated mutation analysis is, therefore, expected to be larger for the pattern of coordinated mutations that is correlated with the evolutionary tree (Figure 2a). In the case of the left tree, shuffling of sequences with a probability that is inversely proportional to their evolutionary distance will cause little change in column j since pairs that are close in distance are also identical. The $\chi^2(i,j)$ value obtained without shuffling column j relative to column i will, therefore, be relatively similar to the $\chi^2(i,j)$ values obtained with shuffling and a high P-value will be assigned to that pair of positions. In the case of the right tree, however, shuffling of sequences with a probability that is inversely proportional to their evolutionary distance will cause a large change in column j since pairs that are close in distance are not necessarily identical. Hence the $\chi^2(i,j)$ value obtained without shuffling column j relative to column i will tend to differ from the $\chi^2(i,j)$ values obtained with shuffling and it is more likely that a low P-value will be assigned to that pair of positions. The new method, therefore, assigns a higher score to patterns of coordinated mutations with a weak correlation to the evolutionary tree (Figure 2b) whereas other methods (see, for example, Lockless and Ranganathan, 1999; Kass and Horovitz, 2002; Dima and Thirumalai, 2004) assign the same score to both patterns.
The results described above show that application of the new evolutionary tree-based shuffling method to concatenated non-interacting proteins leads to a reduction of the evolutionary noise that is reflected in inter-protein correlated mutations. This implies that the new method should also lead to enrichment of intra-protein correlated mutations that reflect residue-residue interactions. Direct evidence for this conclusion was obtained by analyzing the separation in sequence of correlated positions found in different types of secondary structure elements (Figure 5) and the distance distribution of correlated positions (Figure 6) using the new method in comparison with the old method. The correlated mutations in α-helices obtained with the new method show two clear peaks corresponding to respective separations in sequence of four and eight residues (Figure 5a) that are poorly resolved in the case of the results obtained with the old method (Figure 5b). Similarly, the results for correlated mutations in β-strands obtained with the new method show a clear peak that corresponds to a separation in sequence of two residues (Figure 5a) that is poorly resolved in the case of the results obtained with the old method (Figure 5b). Such separations in sequence are expected for direct inter-residue interactions in α-helices and β-strands. A control for these results is the finding that no major peak is observed in the case of the other secondary structure elements (turns, loops and unstructured segments) using both the new and the old methods. In addition, it may be seen in Figure 6 that, in the case of the four non-homo-oligomeric proteins examined, the fraction of pairs of positions separated by a short distance is enriched significantly when considering pairs of correlated positions revealed by the new method relative to pairs of correlated positions revealed by the old method or all pairs of positions in the proteins. Taken together, the results in Figures 5 and 6, therefore, indicate that a certain fraction of correlated mutations reflects direct inter-residue interactions and that this fraction is enriched significantly when using the new method. Detection of such inter-residue interactions can assist, for example, in protein structure predictions and in revealing protein-protein interactions.
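The separation-in-sequence analysis referred to above amounts to a histogram of |i - j| over the correlated pairs found in one type of secondary structure element. A minimal sketch is given here; the function name `separation_histogram` and the list of pairs are made up for illustration and are not results from this work.

```python
from collections import Counter


def separation_histogram(pairs):
    """Count correlated pairs by their separation in sequence, |i - j|."""
    return Counter(abs(i - j) for i, j in pairs)


# Made-up helix pairs (residue numbers), for illustration only.
helix_pairs = [(10, 14), (12, 16), (20, 28), (33, 37), (41, 44)]
hist = separation_histogram(helix_pairs)
```

Peaks in such a histogram at separations of four and eight residues would correspond to the helical periodicity discussed in the text.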
Correlated mutation analysis of concatenated non-interacting proteins using the method of Kass and Horovitz (2002) with random shuffling revealed many inter-protein coordinated mutations that reflect evolutionary noise and not direct or indirect inter-residue interactions. Replacing the random shuffling step with evolutionary tree-based shuffling (which may also be useful in other non-related applications) was found to reduce evolutionary noise and, thus, increase the signal-to-noise ratio. This new method is similar in spirit to the method of Wollenberg and Atchley (2000) but has two important advantages: (i) the amino acid composition of each position in the alignment is conserved in the artificial sequences and (ii) the likelihood that a pair of covarying sites did not result from common ancestry is determined individually for every pair of sites in the alignment. Evidence for the increase in signal when the new method described here is applied is provided by the observation that the fraction of correlated mutations that reflect direct inter-residue interactions in the tertiary structure and different structural elements of proteins is enriched. Hence correlated mutations at distant positions revealed by the new method are also more likely to have physical significance and reflect long-range energetic coupling in proteins.
References |
---|
Bairoch,A. and Apweiler,R. (2000) Nucleic Acids Res., 28, 45-48.
Boeckmann,B. et al. (2003) Nucleic Acids Res., 31, 365-370.
Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, DC, pp. 345-358.
Dima,R.I. and Thirumalai,D. (2004) Bioinformatics, 20, 2345-2354.
Fleishman,S.J., Yifrach,O. and Ben-Tal,N. (2004) J. Mol. Biol., 340, 307-318.
Fodor,A.A. and Aldrich,R.W. (2004) Proteins, 56, 211-221.
Göbel,U., Sander,C., Schneider,R. and Valencia,A. (1994) Proteins, 18, 309-317.
Goh,C.S., Bogan,A.A., Joachimiak,M., Walther,D. and Cohen,F.E. (2000) J. Mol. Biol., 299, 283-293.
Horovitz,A., Bochkareva,E.S., Yifrach,O. and Girshovich,A.S. (1994) J. Mol. Biol., 238, 133-138.
Kass,I. and Horovitz,A. (2002) Proteins, 48, 611-617.
Larson,S.M., Di Nardo,A.A. and Davidson,A.R. (2000) J. Mol. Biol., 303, 433-446.
Lockless,S.W. and Ranganathan,R. (1999) Science, 286, 295-299.
Neher,E. (1994) Proc. Natl Acad. Sci. USA, 91, 98-102.
Ortiz,A.R., Kolinski,A., Rotkiewicz,P., Ilkowski,B. and Skolnick,J. (1999) Proteins, Suppl. 3, 177-185.
Pearson,W.R. (1996) Methods Enzymol., 266, 227-258.
Pollock,D.D., Taylor,W.R. and Goldman,N. (1999) J. Mol. Biol., 287, 187-198.
Schmidt,H.A., Strimmer,K., Vingron,M. and von Haeseler,A. (2002) Bioinformatics, 18, 502-504.
Shindyalov,I.N., Kolchanov,N.A. and Sander,C. (1994) Protein Eng., 7, 349-358.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 4673-4680.
Wollenberg,K.R. and Atchley,W.R. (2000) Proc. Natl Acad. Sci. USA, 97, 3288-3291.
Zemla,A., Venclovas,C., Reinhardt,A., Fidelis,K. and Hubbard,T.J. (1997) Proteins, Suppl. 1, 140-150.
Received April 19, 2005; accepted April 21, 2005.
Edited by Valerie Daggett