Universidad Nacional de Quilmes, Saenz Peña 180, B1876BXD Bernal, Argentina
Evolutionary amino acid replacement rates depend on local structural environment (Overington et al. 1992
; Koshi and Goldstein 1995
). Recent models of protein evolution aim to take such site heterogeneity into account by using site-specific amino acid replacement matrices (Lio and Goldman 1998
; Thorne 2000
). This is a difficult task, mainly because of a lack-of-data problem: too many sequences would be needed to fit the large number of parameters required to specify the replacement matrix for each site. Two strategies have been used to tackle this problem. A first type of model assumes that protein sites can be classified into a limited number of structural classes. Then, class-specific replacement matrices are obtained by analyzing large databases of sequences (Thorne, Goldman, and Jones 1996
; Lio et al. 1998
). These models are promising, provided the assumption of universality of structural classes is a reasonable one. However, such replacement matrices cannot be fitted to the actual sequence family under analysis. A second strategy is to develop models that depend on a reduced number of parameters (Halpern and Bruno 1998
; Koshi and Goldstein 1998
). These models have the advantage that they can, in principle, be fitted to the analyzed family. On the other hand, it is possible that the high reduction in the number of parameters oversimplifies such models to be generally applicable. Here we report a third strategy, which allows the determination of site-specific replacement matrices adapted to the sequence family under study, using just one member of the known structure. The lack-of-data problem is overcome by using large amounts of simulated sequence data generated using a structurally constrained protein evolution (SCPE) model developed recently (Parisi and Echave 2001
). We apply the method to a case taken as an example and compare it with other models of protein evolution.
We will deal with Markov models that assume that sequence sites evolve independently of each other. Even though this assumption is unrealistic, it is most often indispensable if the model is to be used for phylogenetic inference purposes. For a given site, an independent-sites Markov model is completely characterized by a 20 x 20 matrix Q composed by the instantaneous time-independent replacement rates Qij between amino acids i and j. In general, Q is not a symmetric matrix: Qij Qji. However, it represents a reversible process if it satisfies
![]() |
Probabilistic models can be compared using likelihood-based statistical tests (Goldman 1993
; Posada 2001
; Whelan, Lio, and Goldman 2001
). Let Pr(s|T,Q) be the probability of observing a certain sequence data set s = {s1,s2,...,sN}, given a tree T and a model Q. The maximum likelihood of the model, given the data is defined as the result of maximizing such a probability with respect to T (tree topology and branch lengths) and Q (free model parameters): L = maxT maxQPr(s|T,Q). Apart from the maximum likelihood, for model comparison, it is important to consider the number of free parameters. In this work, the relative support given by the data to a model is measured by the Bayesian information criterion (BIC), specified by BIC = - 2 ln L + npln ns, where np is the number of free parameters of the model evaluated and ns is the sequence length (Schwarz 1974
; Posada 2001
). The smaller the BIC, the better the fit of the model to the data.
As stated previously, our purpose is to obtain evolutionary replacement matrices using the SCPE model. This model is based on introducing random mutations at the gene level and selecting sequences against too much structural variation. Without making a priori assumptions regarding site heterogeneity in substitution patterns, SCPE naturally predicts site-specific amino acid frequency distributions (Parisi and Echave 2001
). Not being a site-independent model, SCPE cannot be used for maximum likelihood phylogenetic inference. However, it can be employed to generate enough simulated sequence data to determine site-specific substitution matrices. These matrices, in turn, can be used to build an independent-sites model of evolution, that we shall call independent-sitesstructurally constrained protein evolution (IS-SCPE).
The IS-SCPE site-specific replacement matrices are obtained in a straightforward manner by counting substitutions in SCPE simulations. To start, we need one member of the family under study whose structure is known. From this starting point, we generate simulated lineages: independent SCPE runs. We perform several runs of several mutational steps each to generate a database of accepted replacements. Next, we assume that sequence sites can be classified into c = 1,2,...,Nclass site classes, where 1 Nclass
Nsites. Then, for each class we set up a matrix of counts: for i
j, Nijc is half the number of mutational steps that result in either i
j or j
i amino acid replacements at site class c, and Niic is the number of mutational steps for which amino acid i remains constant (i
i replacement). Finally, for each class, the replacement matrix is obtained using
|
The set of Qc for all site classes constitutes the IS-SCPE model.
As an illustrative example, IS-SCPE is applied to the left-handed parallel ß helix (LßH) domain of UDP-N-acetylglucosamine acyltransferases (LpxA). From a search of SWISS-PROT (Bairoch and Apweiler 2000
) 25 LpxA sequences were found: LPXA_AQUAE, LPXA_BURAB, LPXA_CAMJE, LPXA_CHLMU, LPXA_CHLPN, LPXA_CHLTR, LPXA_CHRVI, LPXA_CYACA, LPXA_ECOLI, LPXA_HAEIN, LPXA_HELPJ, LPXA_HELPY, LPXA_NEIMA, LPXA_NEIMB, LPXA_PASMU, LPXA_PROMI, LPXA_PSEAE, LPXA_RICPR, LPXA_RICRI, LPXA_SALTY, LPXA_SYNY3, LPXA_VIBCH, LPXA_XYLFA, LPXA_YEREN, and Q9AQK4. The structure of the LpxA of Escherichia coli is known (Raetz and Roderick 1995
) (PDB code 1lxa). It shows a left-handed ß helix (LßH) domain (sites 1177) that displays a characteristic hexapeptide motif (Vaara 1992
; Vuorio et al. 1994
; Raetz and Roderick 1995
). Because of this, most sites can be classified into one of the six site classes, according to their position in the hexapeptides. These classes are designated i - 2, i - 1, i, i + 1, i + 2, i + 3. We aligned the 25 LpxA sequences using CLUSTAL W (Thompson, Higgins, and Gibson 1994
). The columns of the multiple alignment that correspond to the active site (Wyckoff and Raetz 1999
), loops, and the first and last incomplete coils of the LßH fold were not further considered because they are subject to different constraints than the rest of the hexapeptides. Each of the remaining 116 sites was classified into one of the six hexapeptide classes. With this alignment, we calculated a standard JTT distance matrix (Jones, Taylor, and Thornton 1992
) and obtained a phylogenetic tree, using module FITCH (Fitch and Margoliash 1967
) of PHYLIP 3.57c (Felsenstein 1993
). This tree was used for model comparison, taking advantage of the observation that fixing the tree does not affect the model selection procedure (Posada 2001
). Such model comparison is shown in table 1
, discussed subsequently. Robustness with respect to tree topology was verified by repeating the calculations on two other trees, obtained using the Neighbor-Joining method and the average linking clustering method, respectively, as implemented in the NEIGHBOR module of PHYLIP 3.57c (data not shown) (Felsenstein 1993
). All maximum likelihood calculations were performed with PAML (Yang 1997
).
|
|
The last two columns of table 1
show the logarithm of the maximum likelihood and the BIC values for the different models compared. First of all we note, from comparing the column of BIC values, that IS-SCPE fits the observed sequences better than any of the other models considered. The best fit is obtained with IS-SCPE+Rc+c: the model that consists of using site-specific replacement matrices obtained from SCPE simulations as described in this work and taking into account interclass (+Rc) and intraclass (+
c) rate heterogeneity. IS-SCPE gives better results than the global JTT because it takes into account site-specific replacement patterns. Regarding the site-specific JTT models (JTT+Fc models), the maximum likelihood values seem to show that they are better than the global JTT models (see column 4 of table 1
). However, the BIC values (column 5 of table 1
) demonstrate that two global modelsJTT+
and JTT+
+Fprovide a better fit than their site-specific counterparts JTT+Fc and JTT+Fc+
c. IS-SCPE is better than global JTT, whereas JTT+Fc is worse because JTT+Fc has 19 x 6 = 114 more parameters than IS-SCPE: the class-specific equilibrium frequencies (notice the nplnns term of the BIC statistic). In general, JTT+Fc (or any model that treats class frequencies as free parameters) will have 19 x nclasses more parameters than its IS-SCPE counterpart. Therefore, for a general case of a protein that is not as regular as LpxA, for which a larger number of classes may be needed, the gap between IS-SCPE and JTT+Fc models will rapidly increase in favor of IS-SCPE. Moreover, to mimic the site-specific substitution pattern, the JTT+Fc models depend on a reliable estimation of the 19 independent amino acid frequencies for each site class. Therefore, one faces again the lack-of-data problem: such models will be applicable only to large families or a small number of structural classes (or both) to which most sites belong, such as the present case. In contrast, the IS-SCPE replacement matrices are obtained using only one reference protein, being thus applicable independent of the size of the protein family under study or the number of site classes needed.
To summarize, we have shown that a whole set of site-dependent evolutionary amino acid replacement matrices can be obtained using just one member of known structure of the protein family under study. The method overcomes the lack-of-data problem by using large sequence data sets simulated using the SCPE model of protein evolution. The replacement matrices obtained constitute an IS-SCPE probabilistic model of the evolution of the family under study. As an illustration, we apply the IS-SCPE model to the LßH domain of the LpxA family, for which we show that it performs better than JTT, or even JTT corrected to take into account site-specific amino acid distributions.
Acknowledgements
We wish to acknowledge fruitful discussions with Pietro Lio and Nick Goldman, with whom we are currently working on a rigorous and detailed statistical assessment of IS-SCPE models using likelihood-ratio tests. This work was supported by the Universidad Nacional de Quilmes and the Fundación Antorchas. J.E. is a researcher of CONICET and a Guggenheim Fellow.
Footnotes
William Taylor, Reviewing Editor
Keywords: molecular evolution
protein evolution
simulation
model
substitution matrix
Abbreviations: SCPE, structurally constrained protein evolution; IS-SCPE, independent-sitesstructurally constrained protein evolution.
Address for correspondence and reprints: Julian Echave, Universidad Nacional de Quilmes, Saenz Peña 180, B1876BXD Bernal, Argentina. je{at}unq.edu.ar
.
References
Bairoch A., R. Apweiler, 2000 The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 Nucleic Acids Res 28:45-48
Felsenstein J., 1993 PHYLIP (phylogeny inference package) Distributed by the author, Department of Genetics, University of Washington, Seattle
Fitch W. M., E. Margoliash, 1967 Construction of phylogenetic trees Science 155:279-284[ISI][Medline]
Goldman N., 1993 Statistical tests of models of DNA substitution J. Mol. Evol 36:182-198[ISI][Medline]
Halpern A. L., W. J. Bruno, 1998 Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies Mol. Biol. Evol 15:910-917[Abstract]
Jones D. T., W. R. Taylor, J. M. Thornton, 1992 The rapid generation of mutation data matrices from protein sequences Comput. Appl. Biosci 8:275-282[Abstract]
Koshi J. M., R. A. Goldstein, 1995 Context-dependent optimal substitution matrices Protein Eng 8:641-645[Abstract]
. 1998 Models of natural mutations including site heterogeneity Proteins Struct. Funct. Genet 32:289-295[ISI][Medline]
Lio P., N. Goldman, 1998 Models of molecular evolution and phylogeny Genome Res 8:1233-1244
Lio P., N. Goldman, J. L. Thorne, D. T. Jones, 1998 PASSML: combining evolutionary inference and protein secondary structure prediction Bioinformatics 14:726-733[Abstract]
Overington J., D. Donnelly, M. S. Johnson, A. Sali, T. L. Blundell, 1992 Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds Protein Sci 1:216-226
Parisi G., J. Echave, 2001 Structural constraints and emergence of sequence patterns in protein evolution Mol. Biol. Evol 18:750-756
Posada D., 2001 The effect of branch length variation on the selection of models of molecular evolution J. Mol. Evol 52:434-444[ISI][Medline]
Raetz C. R., S. L. Roderick, 1995 A left-handed parallel beta helix in the structure of UDP-N-acetylglucosamine acyltransferase Science 270:997-1000[Abstract]
Schwarz G., 1974 Estimating the dimension of a model Ann. Stat 6:461-464
Thompson J. D., D. G. Higgins, T. J. Gibson, 1994 CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice Nucleic Acids Res 22:4673-4680[Abstract]
Thorne J. L., 2000 Models of protein sequence evolution and their applications Curr. Opin. Genet. Dev 10:602-605[ISI][Medline]
Thorne J. L., N. Goldman, D. T. Jones, 1996 Combining protein evolution and secondary structure Mol. Biol. Evol 13:666-673[Abstract]
Vaara M., 1992 Eight bacterial proteins, including UDP-N-acetylglucosamine acyltransferase (LpxA) and three other transferases of Escherichia coli, consist of a six-residue periodicity theme FEMS Microbiol. Lett 76:249-254[Medline]
Vuorio R., T. Harkonen, M. Tolvanen, M. Vaara, 1994 The novel hexapeptide motif found in the acyltransferases LpxA and LpxD of lipid A biosynthesis is conserved in various bacteria FEBS Lett 337:289-292[ISI][Medline]
Whelan S., P. Lio, N. Goldman, 2001 Molecular phylogenetics: state-of-the-art methods for looking into the past Trends Genet 17:262-272[ISI][Medline]
Wyckoff T. J. O., C. R. H. Raetz, 1999 The active site of Escherichia coli UDP-N-acetylglucosamine acyltransferase, chemical modification and site-directed mutagenesis J. Biol. Chem 274:27047-27055
Yang Z., 1993 Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites Mol. Biol. Evol 10:1396-1401[Abstract]
. 1997 PAML: a program package for phylogenetic analysis by maximum likelihood Comput. Appl. Biosci 13:555-556[Medline]
Yang Z. H., R. Nielsen, M. Hasegawa, 1998 Models of amino acid substitution and applications to mitochondrial protein evolution Mol. Biol. Evol 15:1600-1611