Site-Specific Amino Acid Replacement Matrices from Structurally Constrained Protein Evolution Simulations

María Silvina Fornasari, Gustavo Parisi and Julian Echave

Universidad Nacional de Quilmes, Saenz Peña 180, B1876BXD Bernal, Argentina

Evolutionary amino acid replacement rates depend on local structural environment (Overington et al. 1992Citation ; Koshi and Goldstein 1995Citation ). Recent models of protein evolution aim to take such site heterogeneity into account by using site-specific amino acid replacement matrices (Lio and Goldman 1998Citation ; Thorne 2000Citation ). This is a difficult task, mainly because of a lack-of-data problem: too many sequences would be needed to fit the large number of parameters required to specify the replacement matrix for each site. Two strategies have been used to tackle this problem. A first type of model assumes that protein sites can be classified into a limited number of structural classes. Then, class-specific replacement matrices are obtained by analyzing large databases of sequences (Thorne, Goldman, and Jones 1996Citation ; Lio et al. 1998Citation ). These models are promising, provided the assumption of universality of structural classes is a reasonable one. However, such replacement matrices cannot be fitted to the actual sequence family under analysis. A second strategy is to develop models that depend on a reduced number of parameters (Halpern and Bruno 1998Citation ; Koshi and Goldstein 1998Citation ). These models have the advantage that they can, in principle, be fitted to the analyzed family. On the other hand, it is possible that the high reduction in the number of parameters oversimplifies such models to be generally applicable. Here we report a third strategy, which allows the determination of site-specific replacement matrices adapted to the sequence family under study, using just one member of the known structure. The lack-of-data problem is overcome by using large amounts of simulated sequence data generated using a structurally constrained protein evolution (SCPE) model developed recently (Parisi and Echave 2001Citation ). We apply the method to a case taken as an example and compare it with other models of protein evolution.

We will deal with Markov models that assume that sequence sites evolve independently of each other. Even though this assumption is unrealistic, it is most often indispensable if the model is to be used for phylogenetic inference purposes. For a given site, an independent-sites Markov model is completely characterized by a 20 x 20 matrix Q composed by the instantaneous time-independent replacement rates Qij between amino acids i and j. In general, Q is not a symmetric matrix: Qij != Qji. However, it represents a reversible process if it satisfies

where S is a symmetric matrix and {pi}j is the equilibrium frequency of amino acid j. Using equation (1) , any reversible model can be adapted to the analyzed sequence family by treating the equilibrium frequencies as free parameters to be estimated from the data under study. This is usually indicated by adding +F at the end of the model name. A more detailed description of independent-sites Markov models can be found in recent reviews (Lio and Goldman 1998Citation ; Yang, Nielsen, and Hasegawa 1998Citation ).

Probabilistic models can be compared using likelihood-based statistical tests (Goldman 1993Citation ; Posada 2001Citation ; Whelan, Lio, and Goldman 2001Citation ). Let Pr(s|T,Q) be the probability of observing a certain sequence data set s = {s1,s2,...,sN}, given a tree T and a model Q. The maximum likelihood of the model, given the data is defined as the result of maximizing such a probability with respect to T (tree topology and branch lengths) and Q (free model parameters): L = maxT maxQPr(s|T,Q). Apart from the maximum likelihood, for model comparison, it is important to consider the number of free parameters. In this work, the relative support given by the data to a model is measured by the Bayesian information criterion (BIC), specified by BIC = - 2 ln L + npln ns, where np is the number of free parameters of the model evaluated and ns is the sequence length (Schwarz 1974Citation ; Posada 2001Citation ). The smaller the BIC, the better the fit of the model to the data.

As stated previously, our purpose is to obtain evolutionary replacement matrices using the SCPE model. This model is based on introducing random mutations at the gene level and selecting sequences against too much structural variation. Without making a priori assumptions regarding site heterogeneity in substitution patterns, SCPE naturally predicts site-specific amino acid frequency distributions (Parisi and Echave 2001Citation ). Not being a site-independent model, SCPE cannot be used for maximum likelihood phylogenetic inference. However, it can be employed to generate enough simulated sequence data to determine site-specific substitution matrices. These matrices, in turn, can be used to build an independent-sites model of evolution, that we shall call independent-sites–structurally constrained protein evolution (IS-SCPE).

The IS-SCPE site-specific replacement matrices are obtained in a straightforward manner by counting substitutions in SCPE simulations. To start, we need one member of the family under study whose structure is known. From this starting point, we generate simulated lineages: independent SCPE runs. We perform several runs of several mutational steps each to generate a database of accepted replacements. Next, we assume that sequence sites can be classified into c = 1,2,...,Nclass site classes, where 1 <= Nclass <= Nsites. Then, for each class we set up a matrix of counts: for i != j, Nijc is half the number of mutational steps that result in either i -> j or j -> i amino acid replacements at site class c, and Niic is the number of mutational steps for which amino acid i remains constant (i -> i replacement). Finally, for each class, the replacement matrix is obtained using


The set of Qc for all site classes constitutes the IS-SCPE model.

As an illustrative example, IS-SCPE is applied to the left-handed parallel ß helix (LßH) domain of UDP-N-acetylglucosamine acyltransferases (LpxA). From a search of SWISS-PROT (Bairoch and Apweiler 2000Citation ) 25 LpxA sequences were found: LPXA_AQUAE, LPXA_BURAB, LPXA_CAMJE, LPXA_CHLMU, LPXA_CHLPN, LPXA_CHLTR, LPXA_CHRVI, LPXA_CYACA, LPXA_ECOLI, LPXA_HAEIN, LPXA_HELPJ, LPXA_HELPY, LPXA_NEIMA, LPXA_NEIMB, LPXA_PASMU, LPXA_PROMI, LPXA_PSEAE, LPXA_RICPR, LPXA_RICRI, LPXA_SALTY, LPXA_SYNY3, LPXA_VIBCH, LPXA_XYLFA, LPXA_YEREN, and Q9AQK4. The structure of the LpxA of Escherichia coli is known (Raetz and Roderick 1995Citation ) (PDB code 1lxa). It shows a left-handed ß helix (LßH) domain (sites 1–177) that displays a characteristic hexapeptide motif (Vaara 1992Citation ; Vuorio et al. 1994Citation ; Raetz and Roderick 1995Citation ). Because of this, most sites can be classified into one of the six site classes, according to their position in the hexapeptides. These classes are designated i - 2, i - 1, i, i + 1, i + 2, i + 3. We aligned the 25 LpxA sequences using CLUSTAL W (Thompson, Higgins, and Gibson 1994Citation ). The columns of the multiple alignment that correspond to the active site (Wyckoff and Raetz 1999Citation ), loops, and the first and last incomplete coils of the LßH fold were not further considered because they are subject to different constraints than the rest of the hexapeptides. Each of the remaining 116 sites was classified into one of the six hexapeptide classes. With this alignment, we calculated a standard JTT distance matrix (Jones, Taylor, and Thornton 1992Citation ) and obtained a phylogenetic tree, using module FITCH (Fitch and Margoliash 1967Citation ) of PHYLIP 3.57c (Felsenstein 1993Citation ). This tree was used for model comparison, taking advantage of the observation that fixing the tree does not affect the model selection procedure (Posada 2001Citation ). Such model comparison is shown in table 1 , discussed subsequently. Robustness with respect to tree topology was verified by repeating the calculations on two other trees, obtained using the Neighbor-Joining method and the average linking clustering method, respectively, as implemented in the NEIGHBOR module of PHYLIP 3.57c (data not shown) (Felsenstein 1993Citation ). All maximum likelihood calculations were performed with PAML (Yang 1997Citation ).


View this table:
[in this window]
[in a new window]
 
Table 1 Model Comparison: IS-SCPE Versus JTT

 
Using LPXA_ECOLI as the starting point, we performed 50 independent SCPE runs of 15,000 mutational steps each. Then we obtained the IS-SCPE replacement matrices for each of the six hexapeptide structural classes, as described previously. These matrices are shown in figure 1 . It is clear from the figure that, as expected, these matrices are strongly site specific. Note that they are particularly sparse for site classes i - 2 and i, which are the most conserved sites that characterize the hexapeptide motif (Raetz and Roderick 1995Citation ).



View larger version (29K):
[in this window]
[in a new window]
 
Fig. 1.—Site-specific replacement matrices obtained using SCPE simulations for the six hexapeptide structural classes of LpxA. Classes are identified by their position in the hexapeptide motif relative to site i of each hexapeptide: i - 2, i - 1, i, i + 1, i + 2, i + 3. This designation is indicated at the top, left corner of each panel

 
It is of interest at this point to see whether the IS-SCPE model is able to reproduce reasonably well the evolution of the LpxA family. To do this, we performed a model comparison, shown in table 1 . The second column of this table lists the models compared. JTT, JTT+{Gamma}, JTT+F, and JTT+F+{Gamma} are global models based on using the standard JTT replacement matrix (Jones, Taylor, and Thornton 1992Citation ) for all site classes. +F indicates that the replacement matrix is modified such that it has equilibrium frequencies in agreement with the analyzed data (19 free parameters: 20 frequencies that must sum up to 1) (Lio and Goldman 1998Citation ). +{Gamma} points out that a gamma distribution of replacement rates is used to describe site heterogeneity in replacement rates (one extra parameter needed to define the gamma distribution) (Yang 1993Citation ). The next two models use the site-specific IS-SCPE replacement matrices. We take into account the possibly different overall relative rates of the six classes (designated by +Rc in table 1 . Five free parameters: six rates minus an arbitrary time unit). Also, we consider rate heterogeneity within each class by using a class-dependent gamma distribution (one extra free parameter per class: +{Gamma}c). Finally, the last two models of table 1 , JTT+Fc, use site-specific JTT matrices, obtained by adapting JTT to each site class using the class-specific amino acid distributions estimated from the analyzed data. As for IS-SCPE, we consider interclass rate variation (+Rc) and intraclass rate variation (+{Gamma}c).

The last two columns of table 1 show the logarithm of the maximum likelihood and the BIC values for the different models compared. First of all we note, from comparing the column of BIC values, that IS-SCPE fits the observed sequences better than any of the other models considered. The best fit is obtained with IS-SCPE+Rc+{Gamma}c: the model that consists of using site-specific replacement matrices obtained from SCPE simulations as described in this work and taking into account interclass (+Rc) and intraclass (+{Gamma}c) rate heterogeneity. IS-SCPE gives better results than the global JTT because it takes into account site-specific replacement patterns. Regarding the site-specific JTT models (JTT+Fc models), the maximum likelihood values seem to show that they are better than the global JTT models (see column 4 of table 1 ). However, the BIC values (column 5 of table 1 ) demonstrate that two global models—JTT+{Gamma} and JTT+{Gamma}+F—provide a better fit than their site-specific counterparts JTT+Fc and JTT+Fc+{Gamma}c. IS-SCPE is better than global JTT, whereas JTT+Fc is worse because JTT+Fc has 19 x 6 = 114 more parameters than IS-SCPE: the class-specific equilibrium frequencies (notice the nplnns term of the BIC statistic). In general, JTT+Fc (or any model that treats class frequencies as free parameters) will have 19 x nclasses more parameters than its IS-SCPE counterpart. Therefore, for a general case of a protein that is not as regular as LpxA, for which a larger number of classes may be needed, the gap between IS-SCPE and JTT+Fc models will rapidly increase in favor of IS-SCPE. Moreover, to mimic the site-specific substitution pattern, the JTT+Fc models depend on a reliable estimation of the 19 independent amino acid frequencies for each site class. Therefore, one faces again the lack-of-data problem: such models will be applicable only to large families or a small number of structural classes (or both) to which most sites belong, such as the present case. In contrast, the IS-SCPE replacement matrices are obtained using only one reference protein, being thus applicable independent of the size of the protein family under study or the number of site classes needed.

To summarize, we have shown that a whole set of site-dependent evolutionary amino acid replacement matrices can be obtained using just one member of known structure of the protein family under study. The method overcomes the lack-of-data problem by using large sequence data sets simulated using the SCPE model of protein evolution. The replacement matrices obtained constitute an IS-SCPE probabilistic model of the evolution of the family under study. As an illustration, we apply the IS-SCPE model to the LßH domain of the LpxA family, for which we show that it performs better than JTT, or even JTT corrected to take into account site-specific amino acid distributions.

Acknowledgements

We wish to acknowledge fruitful discussions with Pietro Lio and Nick Goldman, with whom we are currently working on a rigorous and detailed statistical assessment of IS-SCPE models using likelihood-ratio tests. This work was supported by the Universidad Nacional de Quilmes and the Fundación Antorchas. J.E. is a researcher of CONICET and a Guggenheim Fellow.

Footnotes

William Taylor, Reviewing Editor

Keywords: molecular evolution protein evolution simulation model substitution matrix Back

Abbreviations: SCPE, structurally constrained protein evolution; IS-SCPE, independent-sites–structurally constrained protein evolution. Back

Address for correspondence and reprints: Julian Echave, Universidad Nacional de Quilmes, Saenz Peña 180, B1876BXD Bernal, Argentina. je{at}unq.edu.ar . Back

References

    Bairoch A., R. Apweiler, 2000 The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 Nucleic Acids Res 28:45-48[Abstract/Free Full Text]

    Felsenstein J., 1993 PHYLIP (phylogeny inference package) Distributed by the author, Department of Genetics, University of Washington, Seattle

    Fitch W. M., E. Margoliash, 1967 Construction of phylogenetic trees Science 155:279-284[ISI][Medline]

    Goldman N., 1993 Statistical tests of models of DNA substitution J. Mol. Evol 36:182-198[ISI][Medline]

    Halpern A. L., W. J. Bruno, 1998 Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies Mol. Biol. Evol 15:910-917[Abstract]

    Jones D. T., W. R. Taylor, J. M. Thornton, 1992 The rapid generation of mutation data matrices from protein sequences Comput. Appl. Biosci 8:275-282[Abstract]

    Koshi J. M., R. A. Goldstein, 1995 Context-dependent optimal substitution matrices Protein Eng 8:641-645[Abstract]

    ———. 1998 Models of natural mutations including site heterogeneity Proteins Struct. Funct. Genet 32:289-295[ISI][Medline]

    Lio P., N. Goldman, 1998 Models of molecular evolution and phylogeny Genome Res 8:1233-1244[Abstract/Free Full Text]

    Lio P., N. Goldman, J. L. Thorne, D. T. Jones, 1998 PASSML: combining evolutionary inference and protein secondary structure prediction Bioinformatics 14:726-733[Abstract]

    Overington J., D. Donnelly, M. S. Johnson, A. Sali, T. L. Blundell, 1992 Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds Protein Sci 1:216-226[Abstract/Free Full Text]

    Parisi G., J. Echave, 2001 Structural constraints and emergence of sequence patterns in protein evolution Mol. Biol. Evol 18:750-756[Abstract/Free Full Text]

    Posada D., 2001 The effect of branch length variation on the selection of models of molecular evolution J. Mol. Evol 52:434-444[ISI][Medline]

    Raetz C. R., S. L. Roderick, 1995 A left-handed parallel beta helix in the structure of UDP-N-acetylglucosamine acyltransferase Science 270:997-1000[Abstract]

    Schwarz G., 1974 Estimating the dimension of a model Ann. Stat 6:461-464

    Thompson J. D., D. G. Higgins, T. J. Gibson, 1994 CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice Nucleic Acids Res 22:4673-4680[Abstract]

    Thorne J. L., 2000 Models of protein sequence evolution and their applications Curr. Opin. Genet. Dev 10:602-605[ISI][Medline]

    Thorne J. L., N. Goldman, D. T. Jones, 1996 Combining protein evolution and secondary structure Mol. Biol. Evol 13:666-673[Abstract]

    Vaara M., 1992 Eight bacterial proteins, including UDP-N-acetylglucosamine acyltransferase (LpxA) and three other transferases of Escherichia coli, consist of a six-residue periodicity theme FEMS Microbiol. Lett 76:249-254[Medline]

    Vuorio R., T. Harkonen, M. Tolvanen, M. Vaara, 1994 The novel hexapeptide motif found in the acyltransferases LpxA and LpxD of lipid A biosynthesis is conserved in various bacteria FEBS Lett 337:289-292[ISI][Medline]

    Whelan S., P. Lio, N. Goldman, 2001 Molecular phylogenetics: state-of-the-art methods for looking into the past Trends Genet 17:262-272[ISI][Medline]

    Wyckoff T. J. O., C. R. H. Raetz, 1999 The active site of Escherichia coli UDP-N-acetylglucosamine acyltransferase, chemical modification and site-directed mutagenesis J. Biol. Chem 274:27047-27055[Abstract/Free Full Text]

    Yang Z., 1993 Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites Mol. Biol. Evol 10:1396-1401[Abstract]

    ———. 1997 PAML: a program package for phylogenetic analysis by maximum likelihood Comput. Appl. Biosci 13:555-556[Medline]

    Yang Z. H., R. Nielsen, M. Hasegawa, 1998 Models of amino acid substitution and applications to mitochondrial protein evolution Mol. Biol. Evol 15:1600-1611[Abstract/Free Full Text]

Accepted for publication October 19, 2001.