Structure-derived substitution matrices for alignment of distantly related sequences

Andreas Prlic, Francisco S. Domingues and Manfred J. Sippl1

Center of Applied Molecular Engineering, Institute for Chemistry and Biochemistry, University of Salzburg, Jakob-Haringerstrasse 3, A-5020 Salzburg, Austria


    Abstract
 Top
 Abstract
 Introduction
 Methods
 Results
 Discussion
 References
 
Sequence alignment is a standard method to infer evolutionary, structural, and functional relationships among sequences. The quality of alignments depends on the substitution matrix used. Here we derive matrices based on superimpositions from protein pairs of similar structure, but of low or no sequence similarity. In a performance test the matrices are compared with 12 other previously published matrices. It is found that the structure-derived matrices are applicable for comparisons of distantly related sequences. We investigate the influence of evolutionary relationships of protein pairs on the alignment accuracy.

Keywords: alignment accuracy/protein evolution/sequence alignment/structure alignment/substitution matrix


    Introduction
 Top
 Abstract
 Introduction
 Methods
 Results
 Discussion
 References
 
By the end of 1999, 26 genomes have been fully sequenced and more than 100 additional genome projects are in progress, causing sequence databases to explode (Kyrpides, 1999Go). For a newly determined sequence, structural, functional and other biologically relevant information can be inferred if evolutionarily related sequences are found (Scharf et al., 1994Go; Teichmann et al., 1999Go). Sequence alignment is a standard method to search for such relationships (Pearson and Lipman, 1988Go; Altschul et al., 1990Go, 1997Go).

The success and reliability of sequence comparisons depend on, among other ingredients, the substitution matrix used (Henikoff and Henikoff, 1993Go; Vogt et al., 1995Go). Popular matrices such as PAM, BLOSUM and the GONNET matrix are based on sequence alignments (Dayhoff et al., 1978Go; Gonnet et al., 1992Go; Henikoff and Henikoff, 1992Go). The accuracy of these alignments is important for the quality of the matrices. This in particular is the case for large evolutionary distances, where sequence alignments become less reliable.

Structure alignments are generated independently of sequence similarity. They are reliable even in the case of distant evolutionary relationships. Several structure-based matrices have been published (Risler et al., 1988Go; Johnson and Overington, 1993Go; Naor et al., 1996Go; Russell et al., 1997Go). An important issue in structure comparison is the definition of structural equivalence of residue pairs. Here, structural equivalence is defined by close C{alpha} and Cß distances, corresponding to residues that occupy similar positions in the structures and resemble each other in their side-chain orientation.

A data set of superimposed protein pairs (Domingues et al., 2000Go) is used for the derivation of amino acid substitution matrices. Since the goal is to exploit structural information rather than sequence information, the pairs used in this analysis were chosen to have high structural but low sequence similarity. From this dataset a substitution matrix is derived. To investigate the extent to which phylogenetic distances within the data set influence results, a second matrix is calculated by excluding pairs with no or unclear evolutionary relationships. The alignment accuracy of sequence alignments created using both matrices is compared with results from 12 other previously published matrices.

Structure comparison can have multiple solutions (Boutonnet et al., 1995Go; Feng and Sippl, 1996Go; Godzik, 1996Go). It is difficult to judge which of the alternative solutions is the most relevant in terms of evolution. The question arises of which one should be used for the compilation of substitution matrices. Here we use the alignment having the largest number of equivalent residues to derive the matrices. However, in performance tests alignment accuracy is measured by comparing sequence alignments with all alternative solutions.

In the following sections we describe how the amino acid substitution matrices are derived and discuss differences between matrices derived from homologous pairs and matrices derived from homologous as well as analogous pairs. We apply the matrices in a performance test.


    Methods
 Top
 Abstract
 Introduction
 Methods
 Results
 Discussion
 References
 
The data set of protein pairs

We need a data set for two purposes: to derive amino acid substitution matrices and to evaluate alignment accuracy. For the first task, sequence similarity among two proteins A and B that are structurally related has to be low. For the second task it is important that there is no sequence similarity of one pair A–B to any other pair C–D in the set. Otherwise, although in the jack-knife test the pair A–B is removed, there could still be a pair C–D related to A–B, resulting in in a statistical bias. A data set that correlates with this criterion was published recently (Domingues et al., 2000Go). In summary, the procedure to derive the data set is as follows: (i) starting from the PDB (Berman et al., 2000Go), a set of proteins is prepared, so that any proteins A and B taken from the set have <30% sequence identity; (ii) these proteins are grouped into structurally similar subsets using PROSUP (Feng and Sippl, 1996Go); and (iii) from these groups representative pairs of superimposed structurally related proteins are selected after visual inspection.

After this procedure, a data set of 122 protein pairs (Table IGo) is obtained having the following characteristics: (i) only 19 structurally related pairs have significant sequence similarity within the pair that can be detected using SSEARCH (Smith and Waterman, 1981Go; Pearson, 1991Go), but ID <30%. In terms of the statistics this means that there is a strong relation in terms of equivalent position and a very low if not negligible correlation in terms of sequence identity; (ii) there is no detectable sequence similarity between proteins that belong to one pair A–B and to those from another pair C–D; this ensures that there is no statistical bias in the jack-knife test.


View this table:
[in this window]
[in a new window]
 
Table I. Set of protein pairs used in this study. The name of a protein is given by its PDB code (Berman et al., 2000), its chain identifier and the model number
 
Compilation of substitution matrices

When two protein structures are aligned, frequently alternative structure alignments are obtained. The one with the highest number of structurally equivalent residues is used to calculate amino acid substitutions. If the C{alpha} and Cß atoms of an amino acid pair Ai, Bj are separated by less than 5 Å, they are considered to occupy equivalent positions. The frequency of occurrence fij of these equivalent positions corresponds to the observed substitution frequency of amino acid pair Ai, Bj.

To calculate the matrix the formalism of Henikoff is used (Henikoff and Henikoff, 1992Go).

The relative frequency qij of occurrence of an amino acid pair Ai, Bj is


(1)

where fij is the observed substitution frequency of pair Ai, Bj. The frequency of occurrence of amino acid Ai in a pair Ai, Bj is


(2)

and the expected frequency eij for a substitution of a pair Ai, Bj is then



Finally, the logarithm of the odds matrix is calculated by


(3)

The relative entropy H of a matrix, also called the average mutual information per residue, is calculated according to Altschul (1991):


(4)

The matrices

The data set of 122 protein pairs yields 13 908 structurally equivalent amino acid pairs. The values sij are derived from these substitutions according to Equation 3Go, resulting in the Structure Derived Matrix (SDM, Table IIGo).


View this table:
[in this window]
[in a new window]
 
Table II. Lower left diagonal, observed amino acid substitutions over all of the data set of structure alignments; upper right diagonal, the structure-derived substitution matrix (SDM) derived from these observed frequencies (values in the matrix are scaled by a factor of 2)
 
CATH classifies structures according to different level of structural similarity and evolutionary relationship (Orengo et al., 1997Go). Proteins that share the same H-level (`H' for homology) are assumed to be evolutionarily related and share a similar fold. Such proteins are called `homologous'. Proteins that share the same fold, but for which no evidence for an evolutionary relationship can be found, are classified in different H-levels. These are called `analogous'.

A second matrix, called the Homologous Structure Derived Matrix (HSDM, Table IIIGo), is calculated from the subset of 77 proteins, which are classified as homologous by CATH (9947 amino acid pairs).


View this table:
[in this window]
[in a new window]
 
Table III. Lower left diagonal, observed amino acid substitutions over the subset of homologous pairs; upper right diagonal, the substitution matrix derived from homologous structural pairs (HSDM) (values in the matrix are scaled by a factor of 5)
 
Using PHYLIP (Felsenstein, 1985Go), a phylogenetic inference package, a tree diagram is derived to display the relationships among amino acids.

Performance test

The 122 protein pairs used to derive SDM and HSDM are applied in a jack-knife test, to estimate the accuracy of alignments obtained from these matrices. Hence, for a pair of proteins being aligned, unique SDM and HSDM matrices are generated where substitution counts from this particular pair are excluded from the statistics. In order to test the performance of different substitution matrices, an implementation of the Needleman and Wunsch algorithm is used to construct the sequence alignments (Needleman and Wunsch, 1970Go; Gotoh, 1982Go), where end gaps are not penalized. Alignment accuracy is measured in terms of residue pairs that are placed at equivalent structure positions and are also aligned in the sequence alignment. The accuracy is expressed as a percentage of the length of the structure alignment (Vogt et al., 1995Go).

Structure alignments are calculated using the program PROSUP (Feng and Sippl, 1996Go). Each alternative is considered as a possible solution for a structure alignment in the evaluation. The structure alignment (Lx) which yields the highest number of correctly aligned residues, when compared with the sequence alignment, is used to assess alignment accuracy.

The alignment accuracy is estimated in the same way for 12 previously reported matrices (Dayhoff et al., 1978Go; Risler et al., 1988Go; Gonnet et al., 1992Go; Henikoff and Henikoff, 1992Go; Johnson and Overington, 1993Go; Naor et al., 1996Go; Russell et al., 1997Go). For each matrix the optimum combination of gap opening and gap extension penalties is determined as the one that corresponds to the maximum average alignment accuracy obtained from the pairs. Optimum gap opening penalties are tested for each matrix in the range from 0 and 20 and extension penalties between 0 and 10.

To investigate if there are differences between local and global alignment algorithms, parameter optimization was done for a few matrices using a local alignment algorithm. The results are similar to those from global alignment, not penalizing end gaps (data not shown).


    Results
 Top
 Abstract
 Introduction
 Methods
 Results
 Discussion
 References
 
The substitution matrices

The relative frequencies of the amino acids in the data set are similar to others reported in the literature (Table IVGo; Johnson and Overington, 1993Go). The clustering of the amino acids in the dendrogram tree is similar for the structure-derived matrices (Figure 1Go). Two groups are obtained, one consisting of hydrophobic amino acids and the other containing polar and charged amino acids. The most pronounced differences from the tree obtained for the GONNET matrix are the position of tryptophan and cysteine. In the GONNET matrix these two are grouped separately from the other amino acids, whereas they belong to the hydrophobic branch in the HSDM dendrogram.


View this table:
[in this window]
[in a new window]
 
Table IV. Frequency of occurrence (Equation 2Go) of amino acids in the substitutions used to calculate SDM compared with the values published by Johnson and Overington (1993)
 


View larger version (17K):
[in this window]
[in a new window]
 
Fig. 1. Tree diagrams for (a) Structure-Derived Matrix (SDM), (b) HSDM and (c) GONNET matrix.

 
The relative entropy (Equation 4Go) of the HSDM of 0.28 bit is higher than the relative entropy of the SDM 0.22 bit. The matrix derived from homologous pairs appears to have a higher information content than matrices derived from homologous and analogous pairs.

Matrix performance test

Gap penalties are optimized for several matrices in order to obtain the maximum average alignment accuracy (Table VGo). The top-ranking matrices are the HSDM and SDM. On average they align ~34 and 33% of the alignment correctly. A similar performance is shown by the GONNET matrix (Gonnet et al., 1992Go) and the NAOR matrix (Naor et al., 1996Go), with ~33 and 31% correctly aligned. On average there is no clear gap between the top-ranking matrices, but for single protein pairs there are several cases where one matrix yields a much better performance than the other (Figure 2Go). The correlation coefficient between GONNET and HSDM is 0.87.


View this table:
[in this window]
[in a new window]
 
Table V. Results of parameter optimizationa
 


View larger version (25K):
[in this window]
[in a new window]
 
Fig. 2. Alignment accuracy of HSDM and GONNET matrices for different sequence pairs. Alignments are calculated using optimum gap penalties on average over all of the data set. There is some correlation between the two matrices, but in some cases they perform differently.

 
The alignment accuracy results collected here are comparable to those published by Vogt et al. (1995). In their study, the top-ranking matrices showed similar averages, with the GONNET matrix performing best.

The results above refer to the parameter set that gives the best alignment performance on average over the whole set of protein pairs. Another question is what the alignment accuracy is that can be obtained by optimizing parameters for each pair individually. Here, the average performance increases ~15% for the HSDM, SDM and GONNET matrices. For some protein pairs a large difference in alignment accuracy can be observed.

The structure alignment of the pair hisactophilin and fibroblast growth factor has several small gaps (see Figure 3Go). The average best-performing gap penalties are too high to allow the opening of the multiple gaps. Using the optimized parameters for this case the opening penalty is much lower and a better alignment is achieved.



View larger version (37K):
[in this window]
[in a new window]
 
Fig. 3. Structures 1hce and 4fgf. Using the average optimum gap penalties over the whole data set, the GONNET matrix aligns 65% of the alignment correctly, whereas HSDM does not correctly align any residue. Optimizing the alignment accuracy for this pair, HSDM achieves 74% (open penalty 8.0, extension 5.0) and GONNET 72% (open penalty 11.0, extension 0.4). Black residues correspond to structurally equivalent and correctly aligned in sequence alignment, using HSDM and optimum parameters for this pair and dark gray to residues structurally equivalent and not correctly aligned.

 
Evolutionary relationship of sequences

The pairs of proteins used for testing can be classified either as homologous or as analogous. It is interesting how good analogous pairs can be aligned, as no evidence of an evolutionary relationship can be found for these cases. Here we find that there is a clear difference in the distribution of alignment accuracy between these two groups (Figure 4Go). In almost half of the analogous structures not a single amino acid pair can be aligned correctly. This only happens in <10% of the homologous proteins. Differences in alignment accuracy between analogous and homologous proteins have also been observed by Russell et al. (1998).



View larger version (17K):
[in this window]
[in a new window]
 
Fig. 4. Alignment accuracy of analogous and homologous structural pairs. Almost 50% of the analogous structural pairs show a performance <5%. Homologous structural pairs show an even distribution over all ranges of performance.

 
These results show that it is more problematic to align analogous proteins. We investigated how the performance changes after analogous pairs are excluded from the test set (Table 6Go). The average alignment accuracy increases by ~10%. The average optimum gap penalties do not change much. Also, the ranking of the matrices remains the same.


View this table:
[in this window]
[in a new window]
 
Table VI. Average performance of the top-ranking matrices excluding analogous structures from the test seta
 
Alternative alignments

Multiple solutions for structure alignments can be found in 88% of the 122 protein structure alignments. For 63% more than four solutions exist. To estimate the similarity of these alternative alignments, they are grouped into different clusters (Lackner, Koppensteiner, Sippl, Domingues, in preparation). Half of the pairs show more than one cluster, one third more than two clusters.

Owing to this finding, it is necessary to consider all the alternative solutions L1 ... Ln when measuring the accuracy of a sequence alignment. The one that shares the highest number of correctly aligned residues with the sequence alignment is Lx. In 20% of the alignments, Lx is different from L1, the alternative with the highest number of equivalent residues. Using only L1 in the performance test, the average performance decreases for all of the matrices by 3–4%.


    Discussion
 Top
 Abstract
 Introduction
 Methods
 Results
 Discussion
 References
 
The goal of this study was to investigate the possibility of deriving substitution matrices suitable for sequence alignment of distantly related sequences. We derive matrices based on a data set of 122 structure alignments. The superimpositions are produced considering the C{alpha} distances as well as the orientation of the side chains. Matrices are derived from structurally equivalent residues of homologous or analogous protein pairs.

Two amino acid substitution matrices are derived. One is calculated from the whole data set of protein structure alignments, SDM. The second, HSDM, is computed after proteins of unclear evolutionary relationship are excluded from the data set. The information content of HSDM is higher, although the total number of observed substitutions is 30% smaller than those of SDM. Also, the average alignment accuracy increases (Table 5Go). A similar clustering of substitution values can be observed in both matrices (Figure 1Go).

These structure-derived substitution matrices, and also previously published matrices, are applied in sequence comparisons of sequences at the border of detectable similarity. The accuracy of the sequence alignments is evaluated by comparison with structure superimpositions. The best average alignment accuracy is observed with the new matrices HSDM and SDM, although the difference from the other top-ranking matrices GONNET and NAOR is small (Table 5Go).

A closer investigation of the top-ranking matrices shows that the GONNET matrix performs fairly well in the alignment of distantly related sequences. It was based on exhaustive matching of an entire protein sequence database, resulting in 1.7x106 sub-sequence matches. In contrast to this large data set, only 77 protein structure comparisons are used to derive HSDM. This demonstrates that structure alignments provide a good source to derive substitution matrices. The good performance of the NAOR matrix was not expected by its authors (Naor et al., 1996Go). Our data show that it is suitable for sequence alignments of distantly related sequences.

For several pairs the alignment accuracy using GONNET and HSDM is completely different (Figure 2Go). Also in several cases no correct sequence alignment is obtained when average optimum gap penalties are used. As an example, it is shown how the alignment accuracy could be improved enormously by applying different gap penalties (Figure 3Go). It has been reported previously that the choice of gap penalty and substitution matrix used considerably affects the sequence comparison results (Henikoff and Henikoff, 1993Go; Johnson and Overington, 1993Go; Pearson, 1995Go; Vogt et al., 1995Go). The results obtained in this study reiterate one of the basic problems of sequence alignment: if gap penalties are kept constant over all of the sequence, the biologically most meaningful alignment often cannot be found. It would be interesting to investigate the alignment quality of algorithms that do not require gap penalties (Morgenstern et al., 1996Go) or use a position-dependent gap penalty (Flöckner et al., 1997Go; Sanchez and Sali, 1997Go).

Our approach provided us with a high-quality data source to observe amino acid substitutions. The matrices derived here are among the best-performing ones, although they are based on a fairly small number of amino acid replacements. Further improvements could be achieved by collecting data from additional homologous structure alignments. Finding a way to estimate which of the alternative structure alignments is the biologically most meaningful might also improve the quality of the matrices.

The matrices can be downloaded from our web site at http://www.came.sbg.ac.at.


    Notes
 
1 To whom correspondence should be addressed. E-mail: sippl{at}came.sbg.ac.at Back


    Acknowledgments
 
We thank Hannes Flöckner and Markus Jaritz for valuable discussions. Thanks are due to Antonina Andreeva, Nora Benhabiles and Peter Lackner for carefully reading the manuscript. This work was supported by grant number P13710-MOB from the Fonds zur Förderung der wissenschaftlichen Forschung (Austria).


    References
 Top
 Abstract
 Introduction
 Methods
 Results
 Discussion
 References
 
Altschul,S.F. (1991) J. Mol. Biol., 219, 555–565.[ISI][Medline]

Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) J. Mol. Biol., 215, 403–410.[ISI][Medline]

Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.[Abstract/Free Full Text]

Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) Nucleic Acids Res., 28, 235–242.[Abstract/Free Full Text]

Boutonnet,N.S., Rooman,M.J., Ochagavia,M.E., Richelle,J. and Wodak,S.J. (1995) Protein Eng., 8, 647–662.[Abstract]

Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) In Atlas of Protein Sequence and Structure, vol. 5, suppl. National Biomedical Research Foundation, Washington, DC, ed. Dayhoff,M.O., pp. 345–352.

Domingues,F.S., Lackner,P., Andreeva,A. and Sippl,M.J. (2000) J. Mol. Biol., 297, 1003–1013.[ISI][Medline]

Felsenstein,J. (1985) Evolution, 39, 783–791.[ISI]

Feng,Z.K. and Sippl,M.J. (1996) Folding Des., 1, 123–132.[ISI][Medline]

Flöckner,H., Domingues,F.S. and Sippl,M.J. (1997) Proteins, Suppl 1, 129–133.

Godzik,A. (1996) Protein Eng., 5, 1325–1338.

Gonnet,G.H., Cohen,M.A. and Benner,S.A. (1992) Science, 256, 1433–1445.

Gotoh,O. (1982) J. Mol. Biol., 162, 705–708.[ISI][Medline]

Henikoff,S. and Henikoff,J.G. (1992) Proc. Natl Acad. Sci. USA, 89, 10915–10919.[Abstract]

Henikoff,S. and Henikoff,J.G. (1993) J. Mol. Biol., 233, 716–738.[ISI][Medline]

Johnson,M.S. and Overington,J.P. (1993) J. Mol. Biol., 233, 716–738.[ISI][Medline]

Kyrpides,N.S. (1999) Bioinformatics, 15, 773–774.[Abstract/Free Full Text]

Morgenstern,B., Dress,A. and Werner,T. (1996) Proc. Natl Acad. Sci. USA, 29, 12098–12103.

Naor,D., Fischer,D., Jernigan,R.L., Wolfson,H.J. and Nussinov,R. (1996) J. Mol. Biol., 256, 924–938.[ISI][Medline]

Needleman,S.B. and Wunsch,C.D. (1970) J. Mol. Biol., 48, 443–453.[ISI][Medline]

Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 1093–1108.[ISI][Medline]

Pearson,W.R. (1991) Genomics, 11, 635–650.[ISI][Medline]

Pearson,W.R. (1995) Protein Sci., 4, 1145–1160.[Abstract/Free Full Text]

Pearson,W.R. and Lipman,D.J. (1988) Proc. Natl Acad. Sci. USA, 85, 2444–2448.[Abstract]

Risler,J.L., Delorme,M.O., Delacroix,H. and Henault,A. (1988) J. Mol. Biol., 204, 1019–1029.[ISI][Medline]

Russell,R.B., Saqi,M.A.S., Sayle,R.A., Bates,P.A. and Sternberg,M.J.E. (1997) J. Mol. Biol., 269, 423–439.[ISI][Medline]

Russell,R.B., Saqi,M.A.S., Bates,P.A., Sayle,R.A. and Sternberg,M.J.E. (1998) Protein Eng., 11, 1–9.[Abstract]

Sanchez,R. and Sali,A. (1997) Proteins, Suppl 1, 50–58.

Scharf,M., Schneider,R., Casari,G., Bork,P., Valencia,A., Ouzounis,C. and Sander,C. (1994) Proceedings 2nd International Conference on Intelligent Systems for Molecular Biology (ISMB), 2, 348–353.

Smith,T.F. and Waterman,M.S. (1981) J. Mol. Biol., 147, 195–197.[ISI][Medline]

Teichmann,S.A., Chothia,C. and Gerstein,M. (1999) Curr. Opin. Struct. Biol., 9, 390–399.[ISI][Medline]

Vogt,G., Etzold,T. and Argos,P. (1995) J. Mol. Biol., 249, 819–831.

Received February 25, 2000; revised June 20, 2000; accepted June 26, 2000.