Center of Applied Molecular Engineering, Institute for Chemistry and Biochemistry, University of Salzburg, Jakob-Haringerstrasse 3, A-5020 Salzburg, Austria
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: alignment accuracy/protein evolution/sequence alignment/structure alignment/substitution matrix
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The success and reliability of sequence comparisons depend on, among other ingredients, the substitution matrix used (Henikoff and Henikoff, 1993; Vogt et al., 1995
). Popular matrices such as PAM, BLOSUM and the GONNET matrix are based on sequence alignments (Dayhoff et al., 1978
; Gonnet et al., 1992
; Henikoff and Henikoff, 1992
). The accuracy of these alignments is important for the quality of the matrices. This in particular is the case for large evolutionary distances, where sequence alignments become less reliable.
Structure alignments are generated independently of sequence similarity. They are reliable even in the case of distant evolutionary relationships. Several structure-based matrices have been published (Risler et al., 1988; Johnson and Overington, 1993
; Naor et al., 1996
; Russell et al., 1997
). An important issue in structure comparison is the definition of structural equivalence of residue pairs. Here, structural equivalence is defined by close C
and Cß distances, corresponding to residues that occupy similar positions in the structures and resemble each other in their side-chain orientation.
A data set of superimposed protein pairs (Domingues et al., 2000) is used for the derivation of amino acid substitution matrices. Since the goal is to exploit structural information rather than sequence information, the pairs used in this analysis were chosen to have high structural but low sequence similarity. From this dataset a substitution matrix is derived. To investigate the extent to which phylogenetic distances within the data set influence results, a second matrix is calculated by excluding pairs with no or unclear evolutionary relationships. The alignment accuracy of sequence alignments created using both matrices is compared with results from 12 other previously published matrices.
Structure comparison can have multiple solutions (Boutonnet et al., 1995; Feng and Sippl, 1996
; Godzik, 1996
). It is difficult to judge which of the alternative solutions is the most relevant in terms of evolution. The question arises of which one should be used for the compilation of substitution matrices. Here we use the alignment having the largest number of equivalent residues to derive the matrices. However, in performance tests alignment accuracy is measured by comparing sequence alignments with all alternative solutions.
In the following sections we describe how the amino acid substitution matrices are derived and discuss differences between matrices derived from homologous pairs and matrices derived from homologous as well as analogous pairs. We apply the matrices in a performance test.
![]() |
Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
We need a data set for two purposes: to derive amino acid substitution matrices and to evaluate alignment accuracy. For the first task, sequence similarity among two proteins A and B that are structurally related has to be low. For the second task it is important that there is no sequence similarity of one pair AB to any other pair CD in the set. Otherwise, although in the jack-knife test the pair AB is removed, there could still be a pair CD related to AB, resulting in in a statistical bias. A data set that correlates with this criterion was published recently (Domingues et al., 2000). In summary, the procedure to derive the data set is as follows: (i) starting from the PDB (Berman et al., 2000
), a set of proteins is prepared, so that any proteins A and B taken from the set have <30% sequence identity; (ii) these proteins are grouped into structurally similar subsets using PROSUP (Feng and Sippl, 1996
); and (iii) from these groups representative pairs of superimposed structurally related proteins are selected after visual inspection.
After this procedure, a data set of 122 protein pairs (Table I) is obtained having the following characteristics: (i) only 19 structurally related pairs have significant sequence similarity within the pair that can be detected using SSEARCH (Smith and Waterman, 1981
; Pearson, 1991
), but ID <30%. In terms of the statistics this means that there is a strong relation in terms of equivalent position and a very low if not negligible correlation in terms of sequence identity; (ii) there is no detectable sequence similarity between proteins that belong to one pair AB and to those from another pair CD; this ensures that there is no statistical bias in the jack-knife test.
|
When two protein structures are aligned, frequently alternative structure alignments are obtained. The one with the highest number of structurally equivalent residues is used to calculate amino acid substitutions. If the C and Cß atoms of an amino acid pair Ai, Bj are separated by less than 5 Å, they are considered to occupy equivalent positions. The frequency of occurrence fij of these equivalent positions corresponds to the observed substitution frequency of amino acid pair Ai, Bj.
To calculate the matrix the formalism of Henikoff is used (Henikoff and Henikoff, 1992).
The relative frequency qij of occurrence of an amino acid pair Ai, Bj is
![]() | (1) |
where fij is the observed substitution frequency of pair Ai, Bj. The frequency of occurrence of amino acid Ai in a pair Ai, Bj is
![]() | (2) |
and the expected frequency eij for a substitution of a pair Ai, Bj is then
![]() |
![]() |
Finally, the logarithm of the odds matrix is calculated by
![]() | (3) |
The relative entropy H of a matrix, also called the average mutual information per residue, is calculated according to Altschul (1991):
![]() | (4) |
The matrices
The data set of 122 protein pairs yields 13 908 structurally equivalent amino acid pairs. The values sij are derived from these substitutions according to Equation 3, resulting in the Structure Derived Matrix (SDM, Table II
).
|
A second matrix, called the Homologous Structure Derived Matrix (HSDM, Table III), is calculated from the subset of 77 proteins, which are classified as homologous by CATH (9947 amino acid pairs).
|
Performance test
The 122 protein pairs used to derive SDM and HSDM are applied in a jack-knife test, to estimate the accuracy of alignments obtained from these matrices. Hence, for a pair of proteins being aligned, unique SDM and HSDM matrices are generated where substitution counts from this particular pair are excluded from the statistics. In order to test the performance of different substitution matrices, an implementation of the Needleman and Wunsch algorithm is used to construct the sequence alignments (Needleman and Wunsch, 1970; Gotoh, 1982
), where end gaps are not penalized. Alignment accuracy is measured in terms of residue pairs that are placed at equivalent structure positions and are also aligned in the sequence alignment. The accuracy is expressed as a percentage of the length of the structure alignment (Vogt et al., 1995
).
Structure alignments are calculated using the program PROSUP (Feng and Sippl, 1996). Each alternative is considered as a possible solution for a structure alignment in the evaluation. The structure alignment (Lx) which yields the highest number of correctly aligned residues, when compared with the sequence alignment, is used to assess alignment accuracy.
The alignment accuracy is estimated in the same way for 12 previously reported matrices (Dayhoff et al., 1978; Risler et al., 1988
; Gonnet et al., 1992
; Henikoff and Henikoff, 1992
; Johnson and Overington, 1993
; Naor et al., 1996
; Russell et al., 1997
). For each matrix the optimum combination of gap opening and gap extension penalties is determined as the one that corresponds to the maximum average alignment accuracy obtained from the pairs. Optimum gap opening penalties are tested for each matrix in the range from 0 and 20 and extension penalties between 0 and 10.
To investigate if there are differences between local and global alignment algorithms, parameter optimization was done for a few matrices using a local alignment algorithm. The results are similar to those from global alignment, not penalizing end gaps (data not shown).
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The relative frequencies of the amino acids in the data set are similar to others reported in the literature (Table IV; Johnson and Overington, 1993
). The clustering of the amino acids in the dendrogram tree is similar for the structure-derived matrices (Figure 1
). Two groups are obtained, one consisting of hydrophobic amino acids and the other containing polar and charged amino acids. The most pronounced differences from the tree obtained for the GONNET matrix are the position of tryptophan and cysteine. In the GONNET matrix these two are grouped separately from the other amino acids, whereas they belong to the hydrophobic branch in the HSDM dendrogram.
|
|
Matrix performance test
Gap penalties are optimized for several matrices in order to obtain the maximum average alignment accuracy (Table V). The top-ranking matrices are the HSDM and SDM. On average they align ~34 and 33% of the alignment correctly. A similar performance is shown by the GONNET matrix (Gonnet et al., 1992
) and the NAOR matrix (Naor et al., 1996
), with ~33 and 31% correctly aligned. On average there is no clear gap between the top-ranking matrices, but for single protein pairs there are several cases where one matrix yields a much better performance than the other (Figure 2
). The correlation coefficient between GONNET and HSDM is 0.87.
|
|
The results above refer to the parameter set that gives the best alignment performance on average over the whole set of protein pairs. Another question is what the alignment accuracy is that can be obtained by optimizing parameters for each pair individually. Here, the average performance increases ~15% for the HSDM, SDM and GONNET matrices. For some protein pairs a large difference in alignment accuracy can be observed.
The structure alignment of the pair hisactophilin and fibroblast growth factor has several small gaps (see Figure 3). The average best-performing gap penalties are too high to allow the opening of the multiple gaps. Using the optimized parameters for this case the opening penalty is much lower and a better alignment is achieved.
|
The pairs of proteins used for testing can be classified either as homologous or as analogous. It is interesting how good analogous pairs can be aligned, as no evidence of an evolutionary relationship can be found for these cases. Here we find that there is a clear difference in the distribution of alignment accuracy between these two groups (Figure 4). In almost half of the analogous structures not a single amino acid pair can be aligned correctly. This only happens in <10% of the homologous proteins. Differences in alignment accuracy between analogous and homologous proteins have also been observed by Russell et al. (1998).
|
|
Multiple solutions for structure alignments can be found in 88% of the 122 protein structure alignments. For 63% more than four solutions exist. To estimate the similarity of these alternative alignments, they are grouped into different clusters (Lackner, Koppensteiner, Sippl, Domingues, in preparation). Half of the pairs show more than one cluster, one third more than two clusters.
Owing to this finding, it is necessary to consider all the alternative solutions L1 ... Ln when measuring the accuracy of a sequence alignment. The one that shares the highest number of correctly aligned residues with the sequence alignment is Lx. In 20% of the alignments, Lx is different from L1, the alternative with the highest number of equivalent residues. Using only L1 in the performance test, the average performance decreases for all of the matrices by 34%.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Two amino acid substitution matrices are derived. One is calculated from the whole data set of protein structure alignments, SDM. The second, HSDM, is computed after proteins of unclear evolutionary relationship are excluded from the data set. The information content of HSDM is higher, although the total number of observed substitutions is 30% smaller than those of SDM. Also, the average alignment accuracy increases (Table 5). A similar clustering of substitution values can be observed in both matrices (Figure 1
).
These structure-derived substitution matrices, and also previously published matrices, are applied in sequence comparisons of sequences at the border of detectable similarity. The accuracy of the sequence alignments is evaluated by comparison with structure superimpositions. The best average alignment accuracy is observed with the new matrices HSDM and SDM, although the difference from the other top-ranking matrices GONNET and NAOR is small (Table 5).
A closer investigation of the top-ranking matrices shows that the GONNET matrix performs fairly well in the alignment of distantly related sequences. It was based on exhaustive matching of an entire protein sequence database, resulting in 1.7x106 sub-sequence matches. In contrast to this large data set, only 77 protein structure comparisons are used to derive HSDM. This demonstrates that structure alignments provide a good source to derive substitution matrices. The good performance of the NAOR matrix was not expected by its authors (Naor et al., 1996). Our data show that it is suitable for sequence alignments of distantly related sequences.
For several pairs the alignment accuracy using GONNET and HSDM is completely different (Figure 2). Also in several cases no correct sequence alignment is obtained when average optimum gap penalties are used. As an example, it is shown how the alignment accuracy could be improved enormously by applying different gap penalties (Figure 3
). It has been reported previously that the choice of gap penalty and substitution matrix used considerably affects the sequence comparison results (Henikoff and Henikoff, 1993
; Johnson and Overington, 1993
; Pearson, 1995
; Vogt et al., 1995
). The results obtained in this study reiterate one of the basic problems of sequence alignment: if gap penalties are kept constant over all of the sequence, the biologically most meaningful alignment often cannot be found. It would be interesting to investigate the alignment quality of algorithms that do not require gap penalties (Morgenstern et al., 1996
) or use a position-dependent gap penalty (Flöckner et al., 1997
; Sanchez and Sali, 1997
).
Our approach provided us with a high-quality data source to observe amino acid substitutions. The matrices derived here are among the best-performing ones, although they are based on a fairly small number of amino acid replacements. Further improvements could be achieved by collecting data from additional homologous structure alignments. Finding a way to estimate which of the alternative structure alignments is the biologically most meaningful might also improve the quality of the matrices.
The matrices can be downloaded from our web site at http://www.came.sbg.ac.at.
![]() |
Notes |
---|
![]() |
Acknowledgments |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) J. Mol. Biol., 215, 403410.[ISI][Medline]
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 33893402.
Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) Nucleic Acids Res., 28, 235242.
Boutonnet,N.S., Rooman,M.J., Ochagavia,M.E., Richelle,J. and Wodak,S.J. (1995) Protein Eng., 8, 647662.[Abstract]
Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) In Atlas of Protein Sequence and Structure, vol. 5, suppl. National Biomedical Research Foundation, Washington, DC, ed. Dayhoff,M.O., pp. 345352.
Domingues,F.S., Lackner,P., Andreeva,A. and Sippl,M.J. (2000) J. Mol. Biol., 297, 10031013.[ISI][Medline]
Felsenstein,J. (1985) Evolution, 39, 783791.[ISI]
Feng,Z.K. and Sippl,M.J. (1996) Folding Des., 1, 123132.[ISI][Medline]
Flöckner,H., Domingues,F.S. and Sippl,M.J. (1997) Proteins, Suppl 1, 129133.
Godzik,A. (1996) Protein Eng., 5, 13251338.
Gonnet,G.H., Cohen,M.A. and Benner,S.A. (1992) Science, 256, 14331445.
Gotoh,O. (1982) J. Mol. Biol., 162, 705708.[ISI][Medline]
Henikoff,S. and Henikoff,J.G. (1992) Proc. Natl Acad. Sci. USA, 89, 1091510919.[Abstract]
Henikoff,S. and Henikoff,J.G. (1993) J. Mol. Biol., 233, 716738.[ISI][Medline]
Johnson,M.S. and Overington,J.P. (1993) J. Mol. Biol., 233, 716738.[ISI][Medline]
Kyrpides,N.S. (1999) Bioinformatics, 15, 773774.
Morgenstern,B., Dress,A. and Werner,T. (1996) Proc. Natl Acad. Sci. USA, 29, 1209812103.
Naor,D., Fischer,D., Jernigan,R.L., Wolfson,H.J. and Nussinov,R. (1996) J. Mol. Biol., 256, 924938.[ISI][Medline]
Needleman,S.B. and Wunsch,C.D. (1970) J. Mol. Biol., 48, 443453.[ISI][Medline]
Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 10931108.[ISI][Medline]
Pearson,W.R. (1991) Genomics, 11, 635650.[ISI][Medline]
Pearson,W.R. (1995) Protein Sci., 4, 11451160.
Pearson,W.R. and Lipman,D.J. (1988) Proc. Natl Acad. Sci. USA, 85, 24442448.[Abstract]
Risler,J.L., Delorme,M.O., Delacroix,H. and Henault,A. (1988) J. Mol. Biol., 204, 10191029.[ISI][Medline]
Russell,R.B., Saqi,M.A.S., Sayle,R.A., Bates,P.A. and Sternberg,M.J.E. (1997) J. Mol. Biol., 269, 423439.[ISI][Medline]
Russell,R.B., Saqi,M.A.S., Bates,P.A., Sayle,R.A. and Sternberg,M.J.E. (1998) Protein Eng., 11, 19.[Abstract]
Sanchez,R. and Sali,A. (1997) Proteins, Suppl 1, 5058.
Scharf,M., Schneider,R., Casari,G., Bork,P., Valencia,A., Ouzounis,C. and Sander,C. (1994) Proceedings 2nd International Conference on Intelligent Systems for Molecular Biology (ISMB), 2, 348353.
Smith,T.F. and Waterman,M.S. (1981) J. Mol. Biol., 147, 195197.[ISI][Medline]
Teichmann,S.A., Chothia,C. and Gerstein,M. (1999) Curr. Opin. Struct. Biol., 9, 390399.[ISI][Medline]
Vogt,G., Etzold,T. and Argos,P. (1995) J. Mol. Biol., 249, 819831.
Received February 25, 2000; revised June 20, 2000; accepted June 26, 2000.