Bioinformatics Laboratory, HKU-Pasteur Research Center, Hong Kong;
Department of Microbiology, University of Hong Kong;
Institute of Environmental Protection, Hunan University, Changsha, People's Republic of China
Abstract |
---|
Introduction |
---|
These two approaches assume that amino acid substitutions at different amino acid sites are independent of each other. In other words, the amino acid substitution occurring at site i does not depend on what amino acid is found at sites i - 1, i + 1, or any other site. This assumption is problematic for the following reason. The normal functioning of a protein depends on its three-dimensional conformation, which, at the microscopic scale, depends on the angles of the peptide chain, especially the angles of the N-Cα (φ) and the Cα-C (ψ) bonds. According to the Ramachandran plot (Ramachandran and Sasisekharan 1968) and subsequent empirical studies (Morris et al. 1992), only particular combinations of these two angles can give rise to, or maintain, the basic secondary structures such as α-helices or β-sheets. In other words, only particular combinations of amino acids can cooperate to form particular secondary structures. For example, a stretch of Glu, Ala, or Met tends to form an α-helix, but the insertion of Gly or Pro would tend to break the α-helix (Chou and Fasman 1978a). This implies that an amino acid substitution at site i may depend on what the neighboring amino acids are.
Different amino acids have different preferences either for or against being in certain secondary structures. For example, Ala and Glu are good α-helix formers, whereas some others such as Gly and Pro tend to disrupt the α-helix structure. Similarly, Ile and Val are good β-sheet formers, whereas Glu and Pro are poor ones (Chou and Fasman 1974a, 1978b; Branden and Tooze 1998). This kind of empirical evidence has led to the derivation of the Chou-Fasman conformational parameters that can be used to predict secondary structures of protein molecules (Chou and Fasman 1978a). A corollary of this is that α-helix formers should be found more frequently as neighbors, as should β-sheet formers.
The existing conformational parameters (Chou and Fasman 1978a) were based on a small data set and should be revised. Proteins with known structures have now accumulated to about 15,000 in the PDB database (Berman et al. 2000). One of the objectives of this paper is to obtain an updated estimate of the propensity of the 20 amino acids occurring in the three major secondary structures, i.e., helices, sheets, and turns. This will not only complement a previous study on interactions of non-neighbor amino acids (Singh and Thornton 1992), but will also help us to better interpret the neighbor preference of amino acids.
A study on neighboring amino acids can also shed light on amino acid dissimilarities. Two indices of amino acid dissimilarities have been proposed (Grantham 1974; Miyata, Miyazawa, and Yasunaga 1979), with Grantham's distance based on the volume, the polarity, and the chemical property of the side chain, and Miyata's distance based on the first two of these properties. Amino acids can differ in many ways, and Sneath (1966) has indeed listed 134 properties. Of the 10 properties studied in detail, all exhibit a significant relationship with substitution rates (Xia and Li 1998), suggesting that they are all important properties related to the normal functioning of proteins. It is difficult to agree upon which amino acid properties should be used to construct an index of amino acid dissimilarities, and the choice of three properties to build Grantham's distance and two properties to build Miyata's distance is, to a large extent, arbitrary.
The arbitrary choice of amino acid properties and a potentially flawed formulation of the amino acid dissimilarities may be responsible for some of the old controversies between Kimura (1983, p. 159) and Gillespie (1991, p. 43). Kimura, being a neutralist, argued that the most frequent nonsynonymous substitutions were those involving similar amino acids and that the substitution rate would decrease monotonically with increasing dissimilarity between the amino acids involved (fig. 7.1 in Kimura 1983). This is of course what one would expect from the neutral theory of molecular evolution, in which positive selection plays a negligible role in molecular evolution and purifying (negative) selection eliminates those mutations with major effects. Gillespie, on the other hand, argued that the most frequent nonsynonymous substitutions were not between the chemically most similar amino acids, but instead were between amino acids with a Miyata's distance near 1 (fig. 1.12 in Gillespie 1991). It is difficult to appreciate or interpret the latter finding, and we are inclined to think that the finding may be an artifact of an inappropriate formulation of amino acid dissimilarities. For example, those amino acid pairs with a Miyata's distance near 1 may actually be more similar to each other than Miyata's distances would lead us to believe. The peak of substitutions at Miyata's distance near 1 may disappear when better indices are formulated.
This paper has three objectives. The first is to estimate the propensity of the 20 amino acids occurring in the three major categories of secondary structures, i.e., helices, sheets, and turns, by using the large number of proteins now available with known structures. The second is to document the genomic pattern of neighbor preference for the 20 amino acids by taking advantage of the huge amount of available protein data and interpret the neighbor preference with reference to protein secondary structures. The third is to incorporate the differences in neighbor preference between amino acids into a new formulation of amino acid dissimilarity index.
Materials and Methods |
---|
The propensity of an amino acid occurring in one of the three structure categories is calculated as follows. Let NTot be the total number of amino acids in the three structure categories; Ni (where i = 1, 2, ..., 20 corresponding to the 20 amino acids) be the number of amino acid i found in all three structure categories; Nh, Ns, and Nt be the number of amino acids found in helices, sheets, and turns, respectively; and Nh,i, Ns,i, and Nt,i be the number of amino acid i found in helices, sheets, and turns, respectively. If amino acid i has no preference among the three secondary structures, then the expected values of Nh,i, Ns,i, and Nt,i are, respectively,

$$E(N_{h,i}) = \frac{N_i N_h}{N_{Tot}}, \qquad E(N_{s,i}) = \frac{N_i N_s}{N_{Tot}}, \qquad E(N_{t,i}) = \frac{N_i N_t}{N_{Tot}}$$
The propensity of amino acid i occurring in helices is defined as
|
Ph,i measures how strongly amino acid i is associated with helices and is independent of sample size. Ps,i and Pt,i are calculated in the same way. We retrieved only 7,342 proteins rather than all proteins in the PDB database because the Ph,i, Ps,i, and Pt,i values stabilize after analyzing just 3,000 protein structures. Using more data will not change the Ph,i, Ps,i, and Pt,i values.
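The bookkeeping behind these quantities can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the study: the input format (a list of amino acid and structure pairs), the function names, and the structure labels 'h', 's', and 't' are all assumptions. The expected counts assume that, under no preference, amino acid i is distributed among the structures in proportion to their sizes, i.e., Ni × Nh / NTot. The observed-to-expected ratio in the second function is one common way of expressing a propensity, in the style of the Chou-Fasman parameters, and is likewise an assumption rather than the definition used here.

```python
from collections import Counter

def expected_counts(residues):
    """Tally observed counts and compute expected counts under no preference.

    residues: list of (amino_acid, structure) pairs, where structure is a
    hypothetical label such as 'h' (helix), 's' (sheet), or 't' (turn).
    Returns (observed, expected), both keyed by (structure, amino_acid).
    """
    n_aa = Counter()      # N_i: count of each amino acid over all structures
    n_struct = Counter()  # N_h, N_s, N_t: residues in each structure
    n_obs = Counter()     # N_{h,i}, N_{s,i}, N_{t,i}: observed counts
    for aa, struct in residues:
        n_aa[aa] += 1
        n_struct[struct] += 1
        n_obs[(struct, aa)] += 1
    n_tot = sum(n_aa.values())  # N_Tot
    # Assumed expectation: N_i * N_structure / N_Tot
    expected = {(s, a): n_aa[a] * n_struct[s] / n_tot
                for s in n_struct for a in n_aa}
    return n_obs, expected

def observed_over_expected(n_obs, expected):
    """Assumed propensity-like ratio: observed count / expected count."""
    return {key: n_obs[key] / e for key, e in expected.items()}
```

A ratio above 1 would indicate that the amino acid occurs in that structure more often than expected, and a ratio below 1 less often. Because it is a ratio of counts, it does not grow with the size of the data set.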
Neighbor Preference in Amino Acids
A total of 25,467 protein-coding sequences (CDS) from human (Homo sapiens), 11,490 CDS from mouse (Mus musculus), and 15,028 CDS from Escherichia coli were retrieved and translated into protein sequences by using the ACNUC retrieval system (Gouy et al. 1985). We excluded from further analysis 719 human CDS, 169 mouse CDS, and 20 E. coli CDS that contain embedded stop codons. These sequences are likely pseudogenes and are irrelevant to this study.
Some genes have been sequenced and deposited in GenBank multiple times, and this may bias the result in the way that the observed pattern of neighbor preference may not reflect the genomic pattern but instead may reflect the pattern of those over-represented genes. For this reason, we have also analyzed all 4,289 CDS in the complete genome of E. coli K-12. The E. coli genomic data set will be referred to as E. coliG hereafter.
With 20 amino acids, there are 400 possible amino acid doublets (i.e., neighbors). Let Nij (where i and j = 1, 2, ..., 20 corresponding to the 20 amino acids) be the number of amino acid pairs, with amino acid j following amino acid i. For example, NAla,Arg is the number of Ala-Arg pairs in all sequences; NArg,Ala is the number of Arg-Ala pairs in all sequences, and so on. The counting is from the N-terminal to the C-terminal of the amino acid sequences. The first methionine is not counted. Data extraction is done with DAMBE (Xia 2000).
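The doublet counting described above can be illustrated with a short Python sketch. It is not the DAMBE implementation; the function name and the one-letter sequence input are assumptions, and "the first methionine is not counted" is interpreted here as dropping a leading M before pairs are formed.

```python
from collections import Counter

def count_doublets(sequences, skip_initiator=True):
    """Count ordered neighbor pairs N_ij over protein sequences.

    sequences: iterable of one-letter amino acid strings, N- to C-terminal.
    Returns a Counter keyed by (i, j), meaning amino acid j follows i.
    """
    counts = Counter()
    for seq in sequences:
        # Assumed reading of the rule: drop the initiator Met if present.
        if skip_initiator and seq.startswith("M"):
            seq = seq[1:]
        # Pair each residue with its immediate C-terminal neighbor.
        for i, j in zip(seq, seq[1:]):
            counts[(i, j)] += 1
    return counts
```

For example, `count_doublets(["MAARA"])` tallies one Ala-Ala, one Ala-Arg, and one Arg-Ala pair, and no pair involving the initiator Met.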
For amino acid i, all 20 Nij values, with j = 1, 2, ..., 20 corresponding to the 20 amino acids, make up a profile of neighbor preference for amino acids found after amino acid i, and all 20 Nji values make up another profile for amino acids found before amino acid i along the amino acid sequence. The former set will be referred to hereafter as the Profilea of amino acid i, with the subscript a meaning after. The latter set of 20 Nji values will be referred to hereafter as the Profileb of amino acid i, with the subscript b meaning before.
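The Nij values clearly depend on amino acid usage. If amino acid j is very abundant, then Nij and Nji will be large, too. If amino acid i does not have any neighbor preference, then the expected value for Nij is

$$E(N_{ij}) = N_i P_j$$

where Ni is the total number of doublets beginning with amino acid i and Pj is the overall frequency of amino acid j.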
Whether the 20 Nij values for amino acid i deviate significantly from the expectation of random association can be tested by a chi-square goodness-of-fit test with

$$\chi^2 = \sum_{j=1}^{20} \frac{[N_{ij} - E(N_{ij})]^2}{E(N_{ij})}$$

The degrees of freedom associated with χ² are 19 rather than 18 because Pj is not calculated from the 20 Nij values.
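As a rough illustration of the test, the following Python sketch computes the goodness-of-fit statistic for one amino acid. The function name and the dictionary inputs are hypothetical, and the expected count is assumed to be the number of doublets beginning with amino acid i multiplied by the overall frequency Pj of amino acid j.

```python
def chi_square_neighbor(row_counts, freqs):
    """Chi-square statistic for the neighbor profile of one amino acid i.

    row_counts: dict mapping amino acid j to the observed count N_ij.
    freqs: dict mapping amino acid j to its overall frequency P_j,
           estimated independently of row_counts (all values > 0).
    """
    n_i = sum(row_counts.values())  # total doublets beginning with i
    chi2 = 0.0
    for j, p_j in freqs.items():
        expected = n_i * p_j        # assumed expectation under no preference
        observed = row_counts.get(j, 0)
        chi2 += (observed - expected) ** 2 / expected
    return chi2
```

With all 20 amino acids in `freqs`, the returned value would be compared against a chi-square distribution with 19 degrees of freedom.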
The strength of the neighbor preference for amino acid i (SPi) can be simply measured by
|
Note that we should not use χ² directly to measure the strength of preference because the χ² value depends on the sample size, i.e., a more abundant amino acid tends to yield a larger χ² value than a less abundant amino acid, everything else being equal. In contrast, SPi is independent of sample size and can therefore facilitate comparisons among amino acids. As SPi can only take positive values and therefore cannot indicate which amino acid is favored or disfavored by amino acid i, we also use the following index (Iij) to measure the preference of amino acid i for amino acid j:
|
Apparently, Iij will be positive if amino acid i has amino acid j as its neighbor more frequently than expected, and negative if amino acid i has amino acid j as its neighbor less frequently than expected.
Nij may differ from Nji, i.e., amino acid i may have different preferences for amino acids that go before it and those that go after it. This difference, or similarity, between these two profiles can be measured by the Pearson correlation coefficient between the 20 Nij values and the 20 Nji values (where j = 1, 2, ..., 20). Note that such correlation coefficients measure only the similarity between Profilea and Profileb. They do not measure the strength of preferences. For example, if there is no preference at all, then Profilea and Profileb will both be expected to approach the relative abundance of the 20 amino acids, and will have a correlation coefficient near 1 given the large data set.
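A minimal Python sketch of this comparison follows. The function names are hypothetical, and the doublet counts are assumed to be stored in a dictionary keyed by ordered pairs (i, j), with j following i.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def profile_correlation(counts, i, alphabet):
    """Correlate the 'after' and 'before' neighbor profiles of amino acid i.

    counts: dict keyed by (first, second) holding doublet counts.
    alphabet: sequence of amino acid symbols fixing the profile order.
    """
    profile_a = [counts.get((i, j), 0) for j in alphabet]  # N_ij: j after i
    profile_b = [counts.get((j, i), 0) for j in alphabet]  # N_ji: j before i
    return pearson(profile_a, profile_b)
```

As the text notes, a coefficient near 1 says only that the two profiles have the same shape. It would also arise if both profiles simply mirrored overall amino acid usage.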
If two amino acids, x and y, have similar neighbor preference, then Nxj and Nyj will be highly correlated, and we can use the correlation coefficient to measure similarity in neighbor preference between the two amino acids. Alternatively, we can treat the 20 Nij values as allele frequencies for one locus, and calculate a pair-wise genetic distance between amino acids by using genetic distances based on allele frequencies (e.g., Cavalli-Sforza and Edwards 1967; Nei 1972; Reynolds, Weir, and Cockerham 1983). The amino acid distance based on similarity in neighbor preference will be referred to hereafter as Dnp, with np standing for neighbor preference.
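To make the allele-frequency analogy concrete, the sketch below applies Nei's (1972) standard genetic distance, D = -ln[Jxy / sqrt(Jx Jy)], to two neighbor profiles treated as a single locus with 20 alleles. The function name is hypothetical, and converting raw counts to frequencies by dividing by the profile total is an assumption about how the profiles are normalized.

```python
import math

def nei_distance(profile_x, profile_y):
    """Nei's standard genetic distance between two neighbor profiles.

    profile_x, profile_y: equal-length lists of non-negative counts
    (e.g., the 20 N_xj and N_yj values), each with a positive total.
    """
    total_x, total_y = sum(profile_x), sum(profile_y)
    freq_x = [v / total_x for v in profile_x]  # counts -> "allele" frequencies
    freq_y = [v / total_y for v in profile_y]
    j_xy = sum(a * b for a, b in zip(freq_x, freq_y))
    j_x = sum(a * a for a in freq_x)
    j_y = sum(b * b for b in freq_y)
    identity = j_xy / math.sqrt(j_x * j_y)     # Nei's normalized identity
    if identity == 0:
        return math.inf                        # profiles share no neighbors
    return -math.log(identity)
```

Two amino acids with proportional profiles get a distance of zero, and the distance grows as their neighbor preferences diverge, which makes it directly comparable in direction to other dissimilarity indices.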
To test whether Dnp is related to the rate of amino acid substitutions, we compiled substitution data from two sets of protein-coding sequences. One set consists of 58 presumably orthologous genes from the human, the mouse, and the cow, and the other is made of the 13 protein-coding genes from each of the 19 completely sequenced mitochondrial sequences used in Xia (1998). The ancestral sequences were reconstructed using the CODEML program in the PAML package (Yang 2000), with jones.dat for the nuclear genes and mtmam.dat for the mitochondrial genes. Pair-wise comparisons were made between neighboring nodes along the tree. The tree for the first data set with only three operational taxonomic units (OTUs) is simply a trifurcating tree with one internal node, and the tree for the second data set is the same as in Xia and Li (1998). The number of substitutions involving amino acids i and j is designated as NSij.
We expect NSij to be large between similar amino acids and small between different amino acids. However, NSij values depend not only on the amino acid dissimilarities, but also on the frequencies of the amino acids involved. For example, NSij (where i, j = 1, 2, ..., 20 corresponding to the 20 amino acids and i ≠ j) will necessarily be zero if the sequences contain no amino acid i or amino acid j. Thus, NSij should be adjusted for amino acid frequencies before it is used to evaluate indices of amino acid distances.
Let Pi (where i = 1, 2, ..., 20 corresponding to the 20 amino acids) be the frequency of amino acid i in the set of amino acid sequences, and Ns be the total number of amino acid substitutions. The expected value of NSi,j, when amino acids replace each other randomly, is
|
Another method for evaluating the relative performance of different amino acid distances is to apply them in a likelihood-based phylogenetic analysis (Yang, Nielsen, and Hasegawa 1998). The best distance should generate larger likelihood values than other distances. For this purpose, we have used the 13 protein-coding genes from six OTUs, with two chimpanzees (GenBank LOCUS names: CHPMTB and CHPMTE), one gorilla (GGMTG), one human (HSMITG), one orangutan (ORAMTD), and one gibbon (HLMITCSEQ).
Results and Discussion |
---|
The three aromatic amino acids (Tyr, Phe, and Trp) are clustered together and tend to occur in sheets (table 1). Aromaticity can affect the rate of amino acid substitutions (Xia and Li 1998), and our observation that they are all sheet-formers suggests that the replacement of a helix-forming amino acid (all of which are nonaromatic) by one of these aromatic amino acids may destabilize the secondary structure. Consequently, purifying selection should act against such replacements. The similarity in sheet-forming among these amino acids represents a new dimension of similarity that is ignored by previous formulations of amino acid distances.
Amino Acid Usage
Amino acid usage for the human and mouse sequences is very similar (table 2), with a Pearson correlation coefficient equal to 0.999, suggesting that amino acid usage is conserved among distantly related mammalian species. The correlation coefficient is 0.903 between E. coli and human and 0.894 between E. coli and mouse. The correlation coefficient between E. coli and E. coliG is 0.9991, suggesting that, at the amino acid usage level, the potential bias caused by differential representation of genes in GenBank is not obvious. The amino acid usage from the PDB database is closer to E. coli than to the two mammalian species, with the correlation coefficient being 0.9607, 0.9519, 0.9005, and 0.8920, respectively, for E. coliG, E. coli, human, and mouse.
Aside from self-preference, different amino acids also exhibit association with, and repulsion from, other amino acids. A subset of these association and repulsion patterns, with Iij values either greater than 0.2 or less than -0.2, is shown in table 7. All of these associations and repulsions can be easily explained with reference to figure 1. In general, amino acids with a high propensity for occurring in the same secondary structure are associated, and those with a high propensity for occurring in different secondary structures repel each other.
Amino Acid Distance Based on Neighbor Preference
We have so far focused only on the neighbor preference of individual amino acids, but have not yet studied the similarity in neighbor preference between amino acids. We could measure the similarity in neighbor preference between amino acids x and y by calculating the Pearson correlation coefficient between the Nij values for x and the Nij values for y. However, a correlation coefficient measures similarity, which makes it inconvenient for comparison with indices such as Grantham's and Miyata's distances that measure dissimilarity between amino acids. An alternative measure of amino acid dissimilarities in neighbor preference is to treat the profile for each amino acid as one locus with 20 alleles, i.e., 20 Nij values. We can then calculate a genetic distance by using available formulations of genetic distances (e.g., Cavalli-Sforza and Edwards 1967; Nei 1972; Reynolds, Weir, and Cockerham 1983). In this study, we used Nei's method and the E. coliG data to obtain Dnp, with the subscript np standing for neighbor preference.
The reason for deriving Dnp values from the E. coliG data is that modern proteins tend to make repetitive use of the same amino acids, whereas ancient proteins (e.g., E. coli proteins) do not (Nishizawa and Nishizawa 1999; Nishizawa, Nishizawa, and Kim 1999). Thus, the local repetitiveness may be a derived character caused by factors such as replication slippage. The resulting repetitiveness may distort the similarity in neighbor preference between amino acids. For this reason, we used the ancient proteins in E. coli instead.
To test whether Dnp is related to the rate of amino acid substitutions, we compiled substitution data from two sets of sequences. One set consists of 58 presumably orthologous protein-coding genes from the human, the mouse, and the cow, and the other is made of the 13 protein-coding genes from each of the 19 completely sequenced mitochondrial sequences used in Xia (1998). The number of substitutions involving each amino acid pair, obtained by comparing neighboring nodes along a phylogenetic tree, is partially shown in table 8 (NSNuc and NSMT). For example, there are 14 amino acid substitutions involving Arg and Ala for the nuclear genes (table 8).
A multiple regression (table 9) shows that all three amino acid dissimilarities are negatively correlated with RNuc and RMT. The model accounts for 43.35% of the total variation in RNuc and 37.76% of the total variation in RMT. The nonsignificant P value for Miyata's distance suggests that this distance adds little to the model once Grantham's distance and Dnp are already included.
Acknowledgements |
---|
Footnotes |
---|
Keywords: protein structure, neighbor preference, amino acid, amino acid distance, phylogenetics
Address for correspondence and reprints: Xuhua Xia, Bioinformatics Laboratory, HKU-Pasteur Research Center, Dexter H.C. Man Building, 8 Sassoon Road, Pokfulam, Hong Kong. xxia{at}hkusua.hku.hk.
References |
---|
Berman H. M., J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, P. E. Bourne, 2000 The protein data bank Nucleic Acids Res 28:235-242
Branden C., J. Tooze, 1998 Introduction to protein structure Garland Publishing, Inc., New York
Cavalli-Sforza L. L., A. W. F. Edwards, 1967 Phylogenetic analysis: models and estimation procedures Evolution 32:550-570
Chou P. Y., G. D. Fasman, 1974a Conformational parameters for amino acids in helical, beta-sheet, and random coil regions calculated from proteins Biochemistry 13:211-222
Chou P. Y., G. D. Fasman, 1974b Prediction of protein conformation Biochemistry 13:222-245
Chou P. Y., G. D. Fasman, 1978a Empirical predictions of protein conformation Annu. Rev. Biochem 47:251-276
Chou P. Y., G. D. Fasman, 1978b Prediction of the secondary structure of proteins from their amino acid sequence Adv. Enzymol. Relat. Areas Mol. Biol 47:45-148
Clarke B., 1970 Selective constraints on amino-acid substitutions during the evolution of proteins Nature 228:159-160
Creighton T. E., 1993 Proteins: structure and molecular properties Freeman, New York
Dayhoff M. O., R. M. Schwartz, B. C. Orcutt, 1978 A model of evolutionary change in protein Pp. 345-352 in M. O. Dayhoff, ed. Atlas of protein sequence and structure. Natl. Biomed. Res. Found., Silver Spring, Md
Dayhoff M. O., W. C. Barker, 1972 Mechanisms and molecular evolution: examples Pp. 41-45 in M. O. Dayhoff, ed. Atlas of protein sequence and structure. Natl. Biomed. Res. Found., Washington, D.C
Dayhoff M. O., W. C. Barker, L. T. Hunt, 1983 Establishing homologies in protein sequences Methods Enzymol 91:524-545
Epstein C. J., 1967 Non-randomness of amino-acid changes in the evolution of homologous proteins Nature 215:355-359
Fasman G. D., P. Y. Chou, 1974 Prediction of protein conformation: consequences and aspirations Pp. 114-125 in E. R. Blout, F. A. Bovey, M. Goodman, and N. Latan, eds. Peptides, polypeptides and proteins. Wiley, New York
Gillespie J. H., 1991 The causes of molecular evolution Oxford University Press, Oxford
Goldman N., Z. Yang, 1994 A codon-based model of nucleotide substitution for protein-coding DNA sequences Mol. Biol. Evol 11:725-736
Gouy M., C. Gautier, M. Attimonelli, C. Larave, G. DiPaola, 1985 ACNUC - a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage Comput. Appl. Biosci 1:167-172
Grantham R., 1974 Amino acid difference formula to help explain protein evolution Science 185:862-864
Kimura M., 1983 The neutral theory of molecular evolution Cambridge University Press, Cambridge, United Kingdom
Miyata T., S. Miyazawa, T. Yasunaga, 1979 Two types of amino acid substitutions in protein evolution J. Mol. Evol 12:219-236
Morris A. L., M. W. Macarthur, E. G. Hutchinson, J. M. Thornton, 1992 Stereochemical quality of protein structure coordinates Proteins 12:345-364
Nei M., 1972 Genetic distance between populations Am. Nat 106:283-292
Nishizawa M., K. Nishizawa, 1999 Local-scale repetitiveness in amino acid use in eukaryote protein sequences: a genomic factor in protein evolution Proteins 37:284-292
Nishizawa K., M. Nishizawa, K. S. Kim, 1999 Tendency for local repetitiveness in amino acid usages in modern proteins J. Mol. Biol 294:937-953
Ramachandran G. N., V. Sasisekharan, 1968 Conformation of polypeptides and proteins Adv. Protein Chem 23:284-438
Reynolds J. B., B. S. Weir, C. C. Cockerham, 1983 Estimation of the coancestry coefficient: basis for a short-term genetic distance Genetics 105:767-779
Schulz G. E., R. H. Schirmer, 1979 Principles of protein structure Springer, New York
Singh J., J. M. Thornton, 1992 Atlas of protein side-chain interactions IRL Press, Oxford
Sneath P. H. A., 1966 Relations between chemical structure & biological activity in peptides J. Theor. Biol 12:157-195
Thornton J. M., 1992 Protein structures: the end point of the folding pathway Pp. 59-82 in T. E. Creighton, ed. Protein folding. Freeman, New York
Xia X., 1998 The rate heterogeneity of nonsynonymous substitutions in mammalian mitochondrial genes Mol. Biol. Evol 15:336-344
Xia X., 2000 DAMBE (software package for data analysis in molecular biology and evolution) Version 4.0 Department of Ecology and Biodiversity, University of Hong Kong, Hong Kong.
Xia X., W.-H. Li, 1998 What amino acid properties affect protein evolution? J. Mol. Evol 47:557-564
Yang Z., 2000 PAML (phylogenetic analysis by maximum likelihood) University College, London
Yang Z., S. Kumar, M. Nei, 1995 A new method of inference of ancestral nucleotide and amino acid sequences Genetics 141:1641-1650
Yang Z., R. Nielsen, M. Hasegawa, 1998 Models of amino acid substitution and applications to mitochondrial protein evolution Mol. Biol. Evol 15:1600-1611
Zuckerkandl E., L. Pauling, 1965 Evolutionary divergence and convergence in proteins Pp. 97-166 in V. Bryson and H. J. Vogel, eds. Evolving genes and proteins. Academic Press, New York.