Protein Structure, Neighbor Effect, and a New Index of Amino Acid Dissimilarities

Xuhua Xia and Zheng Xie

Bioinformatics Laboratory, HKU-Pasteur Research Center, Hong Kong;
Department of Microbiology, University of Hong Kong;
Institute of Environmental Protection, Hunan University, Changsha, People's Republic of China


    Abstract
 
Amino acids interact with each other, especially with neighboring amino acids, to generate protein structures. We studied the pattern of association and repulsion of amino acids based on 24,748 protein-coding genes from human, 11,321 from mouse, and 15,028 from Escherichia coli, and documented the pattern of neighbor preference of amino acids. All amino acids have different preferences for neighbors. We have also analyzed 7,342 proteins with known secondary structure and estimated the propensity of the 20 amino acids occurring in three of the major secondary structures, i.e., helices, sheets, and turns. Much of the neighbor preference can be explained by the propensity of the amino acids in forming different secondary structures, but there are also a number of intriguing association and repulsion patterns. The similarity in neighbor preference among amino acids is significantly correlated with the number of amino acid substitutions in both mitochondrial and nuclear genes, with amino acids having similar sets of neighbors replacing each other more frequently than those having very different sets of neighbors. This similarity in neighbor preference is incorporated into a new index of amino acid dissimilarities that can predict nonsynonymous codon substitutions better than the two existing indices of amino acid dissimilarities, i.e., Grantham's and Miyata's distances.


    Introduction
 
The genetic variation of protein-coding genes represents a major component in genetic biodiversity, and much effort has been spent in understanding how proteins evolve and diversify by amino acid substitutions. Two approaches have been taken to study the pattern of amino acid substitutions. The first is empirical (Dayhoff and Barker 1972; Dayhoff, Schwartz, and Orcutt 1978; Dayhoff, Barker, and Hunt 1983), based on a comparative analysis of amino acid sequences. The second is parametric, initiated by studies on the relationship between amino acid dissimilarities and substitution patterns (Zuckerkandl and Pauling 1965; Sneath 1966; Epstein 1967; Clarke 1970; Grantham 1974; Miyata, Miyazawa, and Yasunaga 1979; Kimura 1983; Xia and Li 1998), with the objective of building a realistic model of amino acid substitutions. Some of these findings have been incorporated into codon-based or amino acid–based models for phylogenetic analysis (Goldman and Yang 1994; Yang, Kumar, and Nei 1995; Yang, Nielsen, and Hasegawa 1998).

These two approaches assume that amino acid substitutions at different amino acid sites are independent of each other. In other words, the amino acid substitution occurring at site i is unrelated to what amino acid is found at sites i - 1, i + 1, or any other site. This assumption is problematic for the following reason. The normal functioning of a protein depends on its three-dimensional conformation, which, at the micro scale, depends on the angles of the peptide chain, especially the angles of the N–Cα (φ) and the Cα–C (ψ) bonds. According to the Ramachandran plot (Ramachandran and Sasisekharan 1968) and subsequent empirical studies (Morris et al. 1992), only particular combinations of these two angles can give rise to, or maintain, the basic secondary structures such as α-helices or β-sheets. In other words, only particular combinations of amino acids can cooperate to form particular secondary structures. For example, a stretch of Glu, Ala, or Met tends to form an α-helix, but the insertion of Gly or Pro would tend to break the α-helix (Chou and Fasman 1978a). This implies that an amino acid substitution at site i may depend on what the neighboring amino acids are.

Different amino acids have different preferences either for or against being in certain secondary structures. For example, Ala and Glu are good α-helix formers, whereas some others such as Gly and Pro tend to disrupt the α-helix structure. Similarly, Ile and Val are good, whereas Glu and Pro are poor β-sheet formers (Chou and Fasman 1974a, 1978b; Branden and Tooze 1998). This kind of empirical evidence has led to the derivation of the Chou-Fasman conformational parameters that can be used to predict secondary structures of protein molecules (Chou and Fasman 1978a). A corollary of this is that α-helix formers should be found more frequently as neighbors, as should β-sheet formers.

The existing conformational parameters (Chou and Fasman 1978a) were based on a small data set and should be revised. Proteins with known structures have now accumulated to about 15,000 in the PDB database (Berman et al. 2000). One of the objectives of this paper is to obtain an updated estimate of the propensity of the 20 amino acids occurring in the three major secondary structures, i.e., helices, sheets, and turns. This will not only complement a previous study on interactions of non-neighbor amino acids (Singh and Thornton 1992), but will also help us to better interpret the neighbor preference of amino acids.

A study on neighboring amino acids can also shed light on amino acid dissimilarities. Two indices of amino acid dissimilarities have been proposed (Grantham 1974; Miyata, Miyazawa, and Yasunaga 1979), with Grantham's distance based on the volume, the polarity, and the chemical property of the side chain, and Miyata's distance based on the first two amino acid properties. Amino acids can differ in many ways, and Sneath (1966) has indeed listed 134 properties. Of the 10 properties studied in detail, all exhibit a significant relationship with substitution rates (Xia and Li 1998), suggesting that they are all important properties related to the normal functioning of proteins. It is difficult to agree upon which amino acid properties should be used to construct an index of amino acid dissimilarities, and the choice of three properties to build Grantham's distance and two properties to build Miyata's distance is, to a large extent, arbitrary.

The arbitrary choice of amino acid properties and potentially false formulation of the amino acid dissimilarities may be responsible for some of the old controversies between Kimura (1983, p. 159) and Gillespie (1991, p. 43). Kimura, being a neutralist, argued that the most frequent nonsynonymous substitutions were those involving similar amino acids and the substitution rate would decrease monotonically with increasing dissimilarity between the amino acids involved (fig. 7.1 in Kimura 1983). This is of course what one would expect from the neutral theory of molecular evolution, in which positive selection plays a negligible role in molecular evolution and purifying (negative) selection eliminates those mutations with major effects. Gillespie, on the other hand, argued that the most frequent nonsynonymous substitutions were not between the chemically most similar amino acids, but instead were between amino acids with a Miyata's distance near 1 (fig. 1.12 in Gillespie 1991). It is difficult to appreciate or interpret the latter finding, and we are inclined to think that the finding may be an artifact of an inappropriate formulation of amino acid dissimilarities. For example, those amino acid pairs with a Miyata's distance near 1 may actually be more similar to each other than Miyata's distances would lead us to believe. The peak of substitutions at Miyata's distance near 1 may disappear when better indices are formulated.



 
Fig. 1.—Dendrogram of the 20 amino acids based on their propensity to occur in helices, sheets, and turns

 
An entirely different approach to study amino acid dissimilarities is to look at whether two amino acids have similar sets of neighbors. We know that an amino acid in a protein needs to interact with neighbors in certain ways to maintain the normal functional structure of the protein. If an amino acid has no preference for its neighbors, then the probability of having one particular amino acid as its neighbor is simply the proportion of the amino acid among all 20 amino acids. The deviation from this random expectation represents the degree of preference for its neighbors. If two amino acids have strong but identical preferences for the same set of amino acids as their neighbors, then we can say that the two amino acids are functionally equivalent, no matter how they differ in their amino acid properties. This would seem to be a more objective way of obtaining amino acid dissimilarities, objective in the sense that we do not need to choose arbitrarily two or three out of many amino acid properties to build a dissimilarity index.

This paper has three objectives. The first is to estimate the propensity of the 20 amino acids occurring in the three major categories of secondary structures, i.e., helices, sheets, and turns, by using the large number of proteins now available with known structures. The second is to document the genomic pattern of neighbor preference for the 20 amino acids by taking advantage of the huge amount of available protein data and interpret the neighbor preference with reference to protein secondary structures. The third is to incorporate the differences in neighbor preference between amino acids into a new formulation of amino acid dissimilarity index.


    Materials and Methods
 
Propensity of Amino Acids Occurring in Helices, Sheets, and Turns
We retrieved 7,342 proteins with known structures from the PDB database (Berman et al. 2000), extracted helices, sheets, and turns according to the PDB Format Description, Version 2.2, and counted the frequency distribution of amino acids in each of the three structure categories. A total of 935 files did not conform to the format description and were discarded.

The propensity of an amino acid occurring in one of the three structure categories is calculated as follows. Let NTot be the total number of amino acids in the three structure categories; Ni (where i = 1, 2, ..., 20 corresponding to the 20 amino acids) be the number of amino acid i found in all three structure categories; Nh, Ns, and Nt be the number of amino acids found in helices, sheets, and turns, respectively; and Nh,i, Ns,i, and Nt,i be the number of amino acid i in helices, sheets, and turns, respectively. If each amino acid is equally likely to occur in the three secondary structures, then the expected numbers of Nh,i, Ns,i, and Nt,i are, respectively,

\[ E(N_{h,i}) = \frac{N_i N_h}{N_{Tot}}, \quad E(N_{s,i}) = \frac{N_i N_s}{N_{Tot}}, \quad E(N_{t,i}) = \frac{N_i N_t}{N_{Tot}} \tag{1} \]

The propensity of amino acid i occurring in helices is defined as


Ph,i measures how strongly an amino acid is associated with one particular secondary structure and is independent of sample size. Ps,i and Pt,i are calculated in the same way. We retrieved only 7,342 proteins instead of all proteins in the PDB database because the Ph,i, Ps,i, and Pt,i values had already stabilized after just 3,000 protein structures were analyzed; adding more data does not change them.
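
The calculation of expected counts described above can be sketched in a few lines of Python. This is a minimal illustration on hypothetical toy counts, not the code used in the study. The expected count follows the description in the text (Ni × Nh / NTot). The propensity itself is computed here as a relative deviation, (observed - expected) / expected, which is only an assumed form chosen for illustration because it is independent of sample size; the function names and the toy data are likewise assumptions.

```python
def expected_counts(counts):
    """Expected counts if each residue were equally likely in every structure.

    counts maps residue -> (n_helix, n_sheet, n_turn).
    Returns residue -> (E_helix, E_sheet, E_turn), with E = N_i * N_struct / N_tot.
    """
    n_struct = [sum(c[k] for c in counts.values()) for k in range(3)]
    n_tot = sum(n_struct)
    return {aa: tuple(sum(c) * n_struct[k] / n_tot for k in range(3))
            for aa, c in counts.items()}


def relative_deviation(counts):
    """ASSUMED propensity form: (observed - expected) / expected."""
    exp = expected_counts(counts)
    return {aa: tuple((c[k] - exp[aa][k]) / exp[aa][k] for k in range(3))
            for aa, c in counts.items()}


# Hypothetical toy data with three residues only (not values from the paper).
toy = {"Ala": (60, 20, 20), "Val": (20, 60, 20), "Gly": (20, 20, 60)}
for residue, values in relative_deviation(toy).items():
    print(residue, [round(v, 3) for v in values])
```

With this assumed form, a positive value marks a residue that is over-represented in a structure and a negative value marks one that is under-represented.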

Neighbor Preference in Amino Acids
A total of 25,467 protein-coding sequences (CDS) from human (Homo sapiens), 11,490 CDS from mouse (Mus musculus), and 15,028 CDS from Escherichia coli were retrieved and translated into protein sequences by using the ACNUC retrieval system (Gouy et al. 1985). We excluded from further analysis 719 human CDS, 169 mouse CDS, and 20 E. coli CDS, which contain embedded stop codons. These sequences are likely pseudogenes and are irrelevant to this study.

Some genes have been sequenced and deposited in GenBank multiple times, and this may bias the result: the observed pattern of neighbor preference may reflect the pattern of those over-represented genes rather than the genomic pattern. For this reason, we have also analyzed all 4,289 CDS in the complete genome of E. coli K-12. The E. coli genomic data set will be referred to as E. coliG hereafter.

With 20 amino acids, there are 400 possible amino acid doublets (i.e., neighbors). Let Nij (where i and j = 1, 2, ..., 20 corresponding to the 20 amino acids) be the number of amino acid pairs, with amino acid j following amino acid i. For example, NAla,Arg is the number of Ala-Arg pairs in all sequences; NArg,Ala is the number of Arg-Ala pairs in all sequences, and so on. The counting is from the N-terminal to the C-terminal of the amino acid sequences. The first methionine is not counted. Data extraction is done with DAMBE (Xia 2000).
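
The doublet counting just described can be sketched as follows. This is a minimal illustration, not DAMBE's implementation: it uses one-letter residue codes, the function name `count_doublets` is hypothetical, and it interprets "the first methionine is not counted" as dropping a leading M before counting, which is an assumption.

```python
from collections import Counter


def count_doublets(proteins):
    """Count ordered neighbor pairs (i, j), with j immediately following i.

    proteins is an iterable of one-letter amino acid strings, read from the
    N-terminal to the C-terminal. A leading 'M' is skipped (assumed reading
    of "the first methionine is not counted").
    """
    counts = Counter()
    for seq in proteins:
        body = seq[1:] if seq.startswith("M") else seq
        counts.update(zip(body, body[1:]))
    return counts


# Hypothetical toy sequences.
n = count_doublets(["MAAR", "MARA"])
print(n[("A", "R")], n[("R", "A")], n[("A", "A")])
```

The resulting table is not symmetric in general: `n[("A", "R")]` counts Ala followed by Arg, whereas `n[("R", "A")]` counts Arg followed by Ala.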

For amino acid i, all 20 Nij values, with j = 1, 2, ..., 20 corresponding to the 20 amino acids, make up a profile of neighbor preference for amino acids found after amino acid i, and all 20 Nji values make up another profile for amino acids found before amino acid i along the amino acid sequence. The former set will be referred to hereafter as the Profilea of amino acid i, with the subscript a meaning after. The latter set of 20 Nji values will be referred to hereafter as the Profileb of amino acid i, with the subscript b meaning before.

The Nij values evidently depend on amino acid usage. If amino acid j is very abundant, then Nij and Nji will be large, too. If amino acid i does not have any neighbor preference, then the expected value for Nij is

\[ E(N_{ij}) = P_j \sum_{k=1}^{20} N_{ik} \tag{3} \]

where Pj is the frequency of amino acid j. For example, if PAla = 0.1, and the sum of the 20 NGly,j values is 10,000 (i.e., Gly has 10,000 downstream neighbors), then the expected value of NGly,Ala is 1,000 (= 0.1 × 10,000). Given our reasoning (see Introduction) we expect certain amino acids to be neighbors more often than expected from random association. For example, good α-helix formers should be more likely to be neighbors, as should β-sheet formers.

Whether the 20 Nij values for amino acid i deviate significantly from the expectation of random association can be tested by a chi-square goodness-of-fit test with

\[ \chi^2 = \sum_{j=1}^{20} \frac{[N_{ij} - E(N_{ij})]^2}{E(N_{ij})} \tag{4} \]

The number of degrees of freedom associated with χ² is 19 rather than 18 because Pj is not calculated from the 20 Nij values.
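
As a hedged illustration of the goodness-of-fit test, the sketch below computes the expected counts and the chi-square statistic for one profile. It uses a hypothetical three-letter alphabet so that the arithmetic can be checked by hand (the paper uses all 20 amino acids, giving 19 degrees of freedom), and the frequencies and counts are invented for the example, borrowing only PAla = 0.1 and a row total of 10,000 from the text.

```python
def expected_row(row, freq):
    """E(N_ij) = P_j * (sum of the row), for one amino acid i."""
    total = sum(row.values())
    return {j: freq[j] * total for j in row}


def chi_square(row, freq):
    """Return the goodness-of-fit statistic and its degrees of freedom.

    The degrees of freedom are (number of categories - 1) because the
    frequencies in freq are not estimated from this row.
    """
    exp = expected_row(row, freq)
    stat = sum((row[j] - exp[j]) ** 2 / exp[j] for j in row)
    return stat, len(row) - 1


# Hypothetical toy data: downstream neighbors of one amino acid.
freq = {"Ala": 0.1, "Gly": 0.3, "Val": 0.6}
row = {"Ala": 1200, "Gly": 2800, "Val": 6000}
stat, df = chi_square(row, freq)
print(round(stat, 3), df)
```

Because the statistic grows with the row total, it is suitable for testing significance but not for comparing the strength of preference between abundant and rare amino acids.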

The strength of the neighbor preference for amino acid i (SPi) can be simply measured by


Note that we should not use χ² directly to measure the strength of preference because the χ² value depends on the sample size, i.e., a more abundant amino acid tends to yield a larger χ² value than a less abundant amino acid, everything else being equal. In contrast, SPi is independent of sample size and can therefore facilitate comparisons among amino acids. As SPi can only take positive values and therefore cannot indicate which amino acid is favored or disfavored by amino acid i, we also use the following index (Iij) to measure the preference of amino acid i for amino acid j:


Apparently, Iij will be positive if amino acid i has amino acid j as its neighbor more frequently than expected, and negative if amino acid i has amino acid j as its neighbor less frequently than expected.
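
A signed index with the behavior just described can be sketched as below. The form used here, (Nij - E(Nij)) / E(Nij), is an assumption made for illustration: it is positive when a pair is observed more often than expected and negative when it is observed less often, but it should not be read as the exact definition used in the paper. The function name and the toy data are hypothetical.

```python
def preference_index(row, freq):
    """ASSUMED signed index: (observed - expected) / expected for each j.

    row maps neighbor j -> N_ij for one amino acid i;
    freq maps j -> P_j, the overall frequency of amino acid j.
    """
    total = sum(row.values())
    return {j: (n - freq[j] * total) / (freq[j] * total)
            for j, n in row.items()}


# Hypothetical toy data on a three-letter alphabet.
freq = {"Ala": 0.1, "Gly": 0.3, "Val": 0.6}
row = {"Ala": 1200, "Gly": 2800, "Val": 6000}
for j, value in preference_index(row, freq).items():
    print(j, round(value, 4))
```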

Nij may differ from Nji, i.e., amino acid i may have different preferences for amino acids that go before it and those that go after it. This difference, or similarity, between these two profiles can be measured by the Pearson correlation coefficient between the 20 Nij values and the 20 Nji values (where j = 1, 2, ..., 20). Note that such correlation coefficients measure only the similarity between Profilea and Profileb. They do not measure the strength of preferences. For example, if there is no preference at all, then Profilea and Profileb will both be expected to approach the relative abundance of the 20 amino acids, and will have a correlation coefficient near 1 given the large data set.
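
The comparison of the two profiles can be illustrated with a small Pearson correlation routine. This is a sketch under stated assumptions: the doublet table is a plain dictionary keyed by ordered pairs of one-letter codes, and the names `pearson` and `profile_correlation` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def profile_correlation(counts, aa, alphabet):
    """Correlate the 'after' profile (N_ij) with the 'before' profile (N_ji)."""
    after = [counts.get((aa, j), 0) for j in alphabet]
    before = [counts.get((j, aa), 0) for j in alphabet]
    return pearson(after, before)


# Hypothetical toy doublet counts on a three-letter alphabet.
counts = {("A", "A"): 5, ("A", "R"): 3, ("A", "G"): 1,
          ("R", "A"): 3, ("G", "A"): 1}
print(profile_correlation(counts, "A", "ARG"))
```

As the text notes, a coefficient near 1 says only that the two profiles are alike; it carries no information about how strong the preference is.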

If two amino acids, x and y, have similar neighbor preference, then Nxj and Nyj will be highly correlated, and we can use the correlation coefficient to measure similarity in neighbor preference between the two amino acids. Alternatively, we can treat the 20 Nij values as allele frequencies for one locus, and calculate a pair-wise genetic distance between amino acids by using genetic distances based on allele frequencies (e.g., Cavalli-Sforza and Edwards 1967; Nei 1972; Reynolds, Weir, and Cockerham 1983). The amino acid distance based on similarity in neighbor preference will be referred to hereafter as Dnp, with np standing for neighbor preference.

To test whether Dnp is related to the rate of amino acid substitutions, we compiled substitution data from two sets of protein-coding sequences. One set consists of 58 presumably orthologous genes from the human, the mouse, and the cow, and the other is made of the 13 protein-coding genes from each of the 19 completely sequenced mitochondrial sequences used in Xia (1998). The ancestral sequences were reconstructed using the CODEML program in the PAML package (Yang 2000), with jones.dat for the nuclear genes and mtmam.dat for the mitochondrial genes. Pair-wise comparisons were made between neighboring nodes along the tree. The tree for the first data set with only three operational taxonomic units (OTUs) is simply a trifurcating tree with one internal node, and the tree for the second data set is the same as in Xia and Li (1998). The number of substitutions involving amino acids i and j is designated as NSij.

We expect NSij to be large between similar amino acids and small between different amino acids. However, NSij values depend not only on the amino acid dissimilarities, but also on the frequencies of the amino acids involved. For example, NSi,j (where i, j = 1, 2, ..., 20 corresponding to the 20 amino acids and i ≠ j) will necessarily be zero if the sequences contain no amino acid i or amino acid j. Thus, NSij should be adjusted for amino acid frequencies before it is used to evaluate indices of amino acid distances.

Let Pi (where i = 1, 2, ..., 20 corresponding to the 20 amino acids) be the frequency of amino acid i in the set of amino acid sequences, and Ns be the total number of amino acid substitutions. The expected value of NSi,j, when amino acids replace each other randomly, is


The quantity

can then be taken as a measure of substitution rate for evaluating the indices of amino acid dissimilarities. Whether one amino acid distance is better than others depends on whether it can predict the Rij values better than others.
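
The idea of adjusting substitution counts for amino acid frequencies can be sketched as follows. Both formulas in this block are assumptions made for illustration and should not be taken as the paper's equations: the expected count for an unordered pair is taken as Ns × 2PiPj / (1 - ΣPk²), so that the expectations over all pairs with i ≠ j sum to Ns, and the rate measure is taken as observed divided by expected. Function names and data are hypothetical.

```python
from itertools import combinations


def expected_substitutions(freq, n_subs):
    """ASSUMED expectation under random replacement, for unordered pairs.

    E(NS_ij) = n_subs * 2 * P_i * P_j / (1 - sum_k P_k**2), i != j.
    """
    norm = 1.0 - sum(p * p for p in freq.values())
    return {(i, j): n_subs * 2.0 * freq[i] * freq[j] / norm
            for i, j in combinations(sorted(freq), 2)}


def relative_rates(observed, freq):
    """ASSUMED rate measure: observed / expected for each unordered pair.

    observed maps alphabetically sorted pairs (i, j) -> substitution count.
    """
    n_subs = sum(observed.values())
    exp = expected_substitutions(freq, n_subs)
    return {pair: observed.get(pair, 0) / exp[pair] for pair in exp}


# Hypothetical toy data on a three-letter alphabet.
freq = {"A": 0.5, "B": 0.3, "C": 0.2}
observed = {("A", "B"): 30, ("A", "C"): 10, ("B", "C"): 22}
print(relative_rates(observed, freq))
```

Under these assumptions a value above 1 marks a pair that is exchanged more often than its frequencies alone would predict.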

Another method for evaluating the relative performance of different amino acid distances is to apply them in a likelihood-based phylogenetic analysis (Yang, Nielsen, and Hasegawa 1998). The best distance should generate larger likelihood values than other distances. For this purpose, we have used the 13 protein-coding genes from six OTUs, with two chimpanzees (GenBank LOCUS names: CHPMTB and CHPMTE), one gorilla (GGMTG), one human (HSMITG), one orangutan (ORAMTD), and one gibbon (HLMITCSEQ).


    Results and Discussion
 
Propensity of Amino Acids to Occur in Helices, Sheets, and Turns
Different amino acids have strong association with particular secondary structures, with Ala and Glu found most frequently in helices, Val, Cys, and Ile found most frequently in sheets, and Gly and Pro found most frequently in turns (table 1). A dendrogram of amino acids, based on the average linkage clustering method on the Ph, Ps, and Pt values, grouped the helix-forming amino acids in one cluster, the sheet-forming amino acids in another, and the turn-forming amino acids (Gly, Pro) in the third (fig. 1).


 
Table 1 Frequency Distribution of Amino Acids in Helices, Sheets, and Turns, Together with Calculated Propensity (P) of the Amino Acids to Occur in These Secondary Structures. Ph and Ps Are Strongly and Negatively Correlated

 
The Ph and Ps values are positively correlated with the helix- and sheet-forming propensities (Chou and Fasman 1978a), with r being 0.5742 and 0.7229, respectively. Pt is correlated with the turn-forming propensity (Fasman and Chou 1974), with r = 0.7977. It has long been known that Pro and Gly often occur together in reverse turns (Schulz and Schirmer 1979, p. 111; Thornton 1992; Creighton 1993, pp. 225–226). For example, Pro occurs frequently in such turns, typically at position i + 1. The turn requires a residue with a positive φ angle, and Gly, having no side chain to constrain the angle, is one of the few amino acids that can take such a conformation.

The three aromatic amino acids (Tyr, Phe, and Trp) are clustered together and tend to occur in sheets (table 1). Aromaticity can affect the rate of amino acid substitutions (Xia and Li 1998), and our observation that they are all sheet-formers suggests that the replacement of a helix-forming amino acid (all of which are nonaromatic) by one of these aromatic amino acids may destabilize the secondary structure. Consequently, purifying selection should act against such replacements. The similarity in sheet-forming among these amino acids represents a new dimension of similarity that is ignored by previous formulations of amino acid distances.

Amino Acid Usage
Amino acid usage is very similar for the human and mouse sequences (table 2), with a Pearson correlation coefficient of 0.999, suggesting that amino acid usage is conserved among distantly related mammalian species. The correlation coefficient is 0.903 between E. coli and human and 0.894 between E. coli and mouse. The correlation coefficient between E. coli and E. coliG is 0.9991, suggesting that, at the amino acid usage level, the potential bias caused by differential representation of genes in GenBank is not obvious. The amino acid usage from the PDB database is closer to E. coli than to the two mammalian species, with the correlation coefficient being 0.9607, 0.9519, 0.9005, and 0.8920, respectively, for E. coliG, E. coli, human, and mouse.


 
Table 2 Amino Acid Usage (%) Based on 6,407 Proteins in the PDB Database (from PDB), on the 4,289 CDS in the E. coli K-12 Genome (E. coliG), and on the Sequences in GenBank for the Three Species

 
Neighbor Preference in Amino Acids
The Nij and Nji values are generally very similar (table 3 and fig. 2), and this holds for all three species studied. However, for each of the 20 amino acids, the Nij values deviate highly significantly from E(Nij) in all three species, with P < 0.0001 for every individual χ² test; one such test, for alanine, is illustrated in table 4. Different amino acids exhibit different degrees of neighbor preference, with Glu and Pro consistently having strong preference in all three species (table 5). Glu is the best α-helix former and Pro is the ultimate α-helix breaker (Chou and Fasman 1974a, 1974b, 1978a). It is understandable that they should occur mostly in particular combinations of amino acids. The amino acids with the least preference are Leu and Val, which happen to be two of the three most typical amino acids (Sneath 1966). The typicalness of an amino acid in Sneath (1966) is measured by the average differences between the amino acid and all the other amino acids, with the most typical amino acid having the smallest mean difference. The weak preference of these two amino acids suggests that they are general-purpose amino acids that can perhaps be put anywhere in protein molecules.


 
Table 3 Neighbor Preference Data (Nij values in rows and Nji values in columns) from E. coliG. Each Row Represents one Profile of Neighbor Preference. For Example, the Row Headed by Ala Is the Profilea of Alanine

 


 
Fig. 2.—Nij and Nji values are similar, as illustrated here with human data. The same pattern is also true for the mouse and E. coli

 

 
Table 4 χ²-Test of Goodness-of-Fit for Profileb of Alanine, Based on E. coliG. The Column Headed by χ² Shows the Individual Terms of the χ² Statistic, i.e., (Nij - E[Nij])²/E(Nij), and the Last Column, IAla,j, Is Calculated According to Equation (6)

 

 
Table 5 Strength of Neighbor Preference, i.e., SPi Values in Equation (5). Values Greater than 0.14 Are in Bold

 
One particular neighbor preference that is consistent in the three species is the preference of each amino acid for its own kind, with the only exception being Pro in E. coli (table 6). There are several possible explanations for the self preference. One reviewer suggested that many proteins are transmembrane, and similar amino acids would cluster in the intramembrane hydrophobic and cytoplasmic hydrophilic regions. An alternative explanation is replication slippage leading to stretches of identical codons.


 
Table 6 Preference of Its Own Kind Measured by Iij Values According to Equation (6). Those Iii Values Larger than 0.4 Are in Bold. Ala Means Ala-Ala Doublet, Arg Means Arg-Arg Doublet, and So On

 
Pro-Pro doublets are common in mammalian proteins, but rare in E. coli proteins. In general, the self preference is weaker in E. coli than in the two mammalian species. This corroborates recent studies (Nishizawa and Nishizawa 1999; Nishizawa, Nishizawa, and Kim 1999) showing that modern proteins have a tendency for repetitive use of the same amino acid at a local scale, whereas this local repetitiveness is weak in ancient proteins, e.g., human homologues of E. coli proteins.

Aside from the self preference, different amino acids also exhibit association and repulsion with other amino acids. A subset of these association and repulsion patterns, with Iij values either greater than 0.2 or less than -0.2, is shown in table 7. All of these associations and repulsions can be easily explained with reference to figure 1. In general, those amino acids with a high propensity for occurring in the same secondary structure are associated and those with a high propensity for occurring in different secondary structures are repulsive.


 
Table 7 Association (defined as having an Iij > 0.2) and Repulsion (with an Iij < -0.2) Between Amino Acids. Leu and Gln Do Not Have Association with or Repulsion Against Other Amino Acids According to This Definition and Are Not Listed. Based on Human Data Only

 
One might wonder why Pro and Gly are not associated, given that both occur very frequently in reverse turns (typically made of four amino acid residues indexed i, i + 1, i + 2, and i + 3). This is because Pro is almost exclusively found in position i + 1 and Gly appears to occur only in position i + 3, i.e., they do not occur as immediate neighbors (Schulz and Schirmer 1979, p. 111; Thornton 1992; Creighton 1993, pp. 225–226).

Amino Acid Distance Based on Neighbor Preference
We have so far focused only on the neighbor preference of individual amino acids, but have not yet studied the similarity in neighbor preference between amino acids. We could measure the similarity in neighbor preference between amino acids x and y by calculating the Pearson correlation coefficient between the Nij values for x and the Nij values for y. However, the correlation coefficient measuring the similarity between amino acids is not convenient for comparison with other indices such as Grantham's and Miyata's distances that measure the dissimilarity but not the similarity between amino acids. An alternative measure of amino acid dissimilarities in neighbor preference is to treat the profile for each amino acid as one locus with 20 alleles, i.e., 20 Nij values. We can then calculate a genetic distance by using available formulations of genetic distances (e.g., Cavalli-Sforza and Edwards 1967; Nei 1972; Reynolds, Weir, and Cockerham 1983). In this study, we used Nei's method and the E. coliG data to obtain Dnp, with the subscript np standing for neighbor preference.
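
To make the allele-frequency analogy concrete, the sketch below computes Nei's (1972) standard genetic distance for a single locus, D = -ln[Σ x_k y_k / sqrt(Σ x_k² Σ y_k²)], between two neighbor-preference profiles. It is a minimal illustration rather than the code used in the study; the function name is hypothetical and the toy profiles have two categories instead of 20.

```python
from math import log, sqrt


def nei_distance(profile_x, profile_y):
    """Nei's (1972) standard genetic distance for a single locus.

    Each profile is a list of counts (e.g., the 20 N_ij values of one amino
    acid); counts are first converted to relative frequencies.
    """
    sx, sy = sum(profile_x), sum(profile_y)
    x = [v / sx for v in profile_x]
    y = [v / sy for v in profile_y]
    jxy = sum(a * b for a, b in zip(x, y))
    jx = sum(a * a for a in x)
    jy = sum(b * b for b in y)
    return -log(jxy / sqrt(jx * jy))


# Hypothetical toy profiles.
print(nei_distance([3, 1], [6, 2]))  # same proportions, so the distance is zero
print(nei_distance([3, 1], [1, 3]))
```

Because the counts are converted to proportions first, two amino acids of very different abundance but with the same neighbor proportions are at distance zero.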

The reason for deriving Dnp values from the E. coliG data is that modern proteins tend to make repetitive use of the same amino acids, whereas ancient proteins (e.g., E. coli proteins) do not (Nishizawa and Nishizawa 1999; Nishizawa, Nishizawa, and Kim 1999). Thus, the local repetitiveness may be a derived character caused by factors such as replication slippage. The resulting repetitiveness may distort the similarity in neighbor preference between amino acids. For this reason, we used the ancient proteins in E. coli instead.

To test whether Dnp is related to the rate of amino acid substitutions, we compiled substitution data from two sets of sequences. One set consists of 58 presumably orthologous protein-coding genes from the human, the mouse, and the cow, and the other is made of the 13 protein-coding genes from each of the 19 completely sequenced mitochondrial sequences used in Xia (1998). The number of substitutions involving each amino acid pair, obtained by comparing neighboring nodes along a phylogenetic tree, is partially shown in table 8 (NSNuc and NSMT). For example, there are 14 amino acid substitutions involving Arg and Ala for the nuclear genes (table 8).


 
Table 8 Substitution Data from the 58 Nuclear Protein-coding Genes from Human, Mouse, and Cow, and 13 Protein-coding Mitochondrial Genes from 19 Mammalian Species. NSNuc: Observed Number of Substitutions from the 58 Nuclear Genes; NSMT: Observed Number of Substitutions from the Mitochondrial Genes

 
Rij values were computed, according to equation (8), for both the nuclear genes and the mitochondrial genes. The resulting RNuc and RMT values, however, are highly correlated with NSNuc and NSMT, respectively (r > 0.95 for both), suggesting that the adjustment for amino acid frequencies does not matter much. We regressed RNuc and RMT separately on the three amino acid distances in table 8, after log-transforming RNuc and RMT. The transformation is done after adding a constant value to RNuc and RMT so that the minimum value of RNuc and RMT is 0.5. The reason for the log-transformation follows from the simple formulation in Kimura (1983):

where R is equivalent to RNuc and RMT, and Dij is the distance between amino acids i and j. The equation implies that ln(R) is linearly related to Dij, and hence the transformation.
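
The transformation and a single-predictor version of the regression can be sketched as follows. This is an illustration only: the paper fits a multiple regression on three distances, whereas the sketch fits one predictor by ordinary least squares, and the function names and numbers are hypothetical.

```python
from math import log


def log_transform(rates, floor=0.5):
    """Shift the rates so that their minimum equals `floor`, then take ln."""
    shift = floor - min(rates)
    return [log(r + shift) for r in rates]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical toy data: distances and substitution rates for four pairs.
distances = [0.5, 1.0, 2.0, 3.0]
rates = [4.0, 2.5, 1.0, 0.0]
slope, intercept = fit_line(distances, log_transform(rates))
print(slope, intercept)
```

A negative slope corresponds to a substitution rate that declines as the amino acid distance increases.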

A multiple regression (table 9) shows that all three amino acid dissimilarities are negatively correlated with RNuc and RMT. The model accounts for 43.35% of the total variation in RNuc and 37.76% of the total variation in RMT. The nonsignificant P value for Miyata's distance suggests that the distance does not add much to improve the model once Grantham's distance and Dnp are already in the model.


 
Table 9 Regression of the Pair-wise Amino Acid Substitution Rate on the Three Measures of Amino Acid Dissimilarities

 
The result suggests that Dnp should be incorporated into the index of amino acid dissimilarities. We took a simple approach: we rescaled Dnp to have the same mean and variance as Grantham's distance and obtained a new index as


where the subscript G indicates that Dij_G results from a combination of Dnp with Grantham's distance. Similarly, we also obtained Dij_M (table 10), where the subscript M indicates that Dij_M results from a combination of Dnp with Miyata's distance. A plot of Dij_M versus log-transformed NSNuc is shown in figure 3. The number of substitutions seems to decrease monotonically with increasing Dij_M, consistent with Kimura's (1983) observation but not with Gillespie's (1991). However, this may not reflect an improvement of Dij_M over Miyata's distance, because the monotonic decrease in the number of substitutions is also visible with Miyata's distance. Thus, the pattern observed by Gillespie, in which the substitution rate first increases and then decreases with amino acid distance, may simply be caused by a less representative data set.
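The rescaling step mentioned above, giving Dnp the same mean and variance as a reference distance, can be sketched as a linear transformation. The Python code below illustrates only that step and does not reproduce how the rescaled values are combined with Grantham's or Miyata's distance; the function name and the numerical values are hypothetical.

```python
import statistics

def rescale_to_match(values, reference):
    """Linearly rescale `values` to the mean and standard deviation of `reference`."""
    mu_v, sd_v = statistics.fmean(values), statistics.pstdev(values)
    mu_r, sd_r = statistics.fmean(reference), statistics.pstdev(reference)
    return [mu_r + (v - mu_v) * sd_r / sd_v for v in values]

# Hypothetical values for a handful of amino acid pairs (illustration only).
dnp = [0.2, 0.9, 0.4, 0.7, 0.1]
grantham = [29.0, 112.0, 58.0, 180.0, 94.0]

dnp_rescaled = rescale_to_match(dnp, grantham)
```

Because both lists have the same length, using the sample rather than the population standard deviation would give the same rescaled values, since only the ratio of the two standard deviations enters the formula.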


Table 10 Amino Acid Dissimilarities Based on Neighbor Preference from E. coli genomic DNA and Miyata's Distance

 


Fig. 3.—Frequency distribution of amino acid substitutions over Dij_M (top panel) and Miyata's distance (bottom panel). Data from 58 presumably orthologous protein-coding genes from human, mouse, and cow

 
To evaluate the performance of these two new distance indices, we used them together with Grantham's and Miyata's distances in a codon-based phylogenetic reconstruction involving the 13 protein-coding genes from six ape species. The maximum likelihood values (table 11) show that setting all amino acid distances as equal performs worst, followed by Grantham's and Miyata's distances. This result is consistent with a previous study (Yang, Nielsen, and Hasegawa 1998). Dij_G is better than all preceding distances, but Dij_M is the best of all (table 11). Because Dnp was derived solely from the neighbor preference data of the 4,289 CDSs from the genome of E. coli K-12, not from mitochondrial genes, the better performance of Dij_M with mitochondrial genes suggests that Dij_M may be generally applicable to other genes.


Table 11 Comparison of the Performance of Four Indices of Amino Acid Dissimilarity in Tree Estimation

 
In summary, amino acids have different propensities to occur in different secondary structures, and they have different neighbor preferences. Amino acids with similar neighbor preferences tend to replace each other more frequently than amino acids with different neighbor preferences. The incorporation of the neighbor preference into the index of amino acid dissimilarities can substantially improve codon-based and amino acid-based substitution models.


    Acknowledgements
This study is supported by a CRCG grant from the University of Hong Kong (10203043/27662/25400/302/01) and RGC grants from Hong Kong Research Grant Council (HKU7265/00M, HKU7212/01M) to X.X. Comments by F. X. Fu and two anonymous reviewers have significantly improved the manuscript.


    Footnotes
 
Yun-Xin Fu, Reviewing Editor

Keywords: protein structure, neighbor preference, amino acid, amino acid distance, phylogenetics

Address for correspondence and reprints: Xuhua Xia, Bioinformatics Laboratory, HKU-Pasteur Research Center, Dexter H.C. Man Building, 8 Sassoon Road, Pokfulam, Hong Kong. xxia@hkusua.hku.hk


    References

    Berman H. M., J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, P. E. Bourne, 2000 The protein data bank Nucleic Acids Res 28:235-242

    Branden C., J. Tooze, 1998 Introduction to protein structure Garland Publishing, Inc., New York

    Cavalli-Sforza L. L., A. W. F. Edwards, 1967 Phylogenetic analysis: models and estimation procedures Evolution 32:550-570

    Chou P. Y., G. D. Fasman, 1974a. Conformational parameters for amino acids in helical, beta-sheet, and random coil regions calculated from proteins Biochemistry 13:211-222

    ———. 1974b. Prediction of protein conformation Biochemistry 13:222-245

    ———. 1978a. Empirical predictions of protein conformation Annu. Rev. Biochem 47:251-276

    ———. 1978b. Prediction of the secondary structure of proteins from their amino acid sequence Adv. Enzymol. Relat. Areas Mol. Biol 47:45-148

    Clarke B., 1970 Selective constraints on amino-acid substitutions during the evolution of proteins Nature 228:159-160

    Creighton T. E., 1993 Proteins: structure and molecular properties Freeman, New York

    Dayhoff M. O., R. M. Schwartz, B. C. Orcutt, 1978 A model of evolutionary change in protein Pp. 345–352 in M. O. Dayhoff, ed. Atlas of protein sequence and structure. Natl. Biomed. Res. Found., Silver Spring, Md

    Dayhoff M. O., W. C. Barker, 1972 Mechanisms and molecular evolution: examples Pp. 41–45 in M. O. Dayhoff, ed. Atlas of protein sequence and structure. Natl. Biomed. Res. Found., Washington, D.C

    Dayhoff M. O., W. C. Barker, L. T. Hunt, 1983 Establishing homologies in protein sequences Methods Enzymol 91:524-545

    Epstein C. J., 1967 Non-randomness of amino-acid changes in the evolution of homologous proteins Nature 215:355-359

    Fasman G. D., P. Y. Chou, 1974 Prediction of protein conformation: consequences and aspirations Pp. 114–125 in E. R. Blout, F. A. Bovey, M. Goodman, and N. Latan, eds. Peptides, polypeptides and proteins. Wiley, New York

    Gillespie J. H., 1991 The causes of molecular evolution Oxford University Press, Oxford

    Goldman N., Z. Yang, 1994 A codon-based model of nucleotide substitution for protein-coding DNA sequences Mol. Biol. Evol 11:725-736

    Gouy M., C. Gautier, M. Attimonelli, C. Larave, G. DiPaola, 1985 ACNUC–a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage Comput. Appl. Biosci 1:167-172

    Grantham R., 1974 Amino acid difference formula to help explain protein evolution Science 185:862-864

    Kimura M., 1983 The neutral theory of molecular evolution Cambridge University Press, Cambridge, United Kingdom

    Miyata T., S. Miyazawa, T. Yasunaga, 1979 Two types of amino acid substitutions in protein evolution J. Mol. Evol 12:219-236

    Morris A. L., M. W. Macarthur, E. G. Hutchinson, J. M. Thornton, 1992 Stereochemical quality of protein structure coordinates Proteins 12:345-364

    Nei M., 1972 Genetic distance between populations Am. Nat 106:283-292

    Nishizawa M., K. Nishizawa, 1999 Local-scale repetitiveness in amino acid use in eukaryote protein sequences: a genomic factor in protein evolution Proteins 37:284-292

    Nishizawa K., M. Nishizawa, K. S. Kim, 1999 Tendency for local repetitiveness in amino acid usages in modern proteins J. Mol. Biol 294:937-953

    Ramachandran G. N., V. Sasisekharan, 1968 Conformation of polypeptides and proteins Adv. Protein Chem 23:284-438

    Reynolds J. B., B. S. Weir, C. C. Cockerham, 1983 Estimation of the coancestry coefficient: basis for a short-term genetic distance Genetics 105:767-779

    Schulz G. E., R. H. Schirmer, 1979 Principles of protein structure Springer, New York

    Singh J., J. M. Thornton, 1992 Atlas of protein side-chain interactions IRL Press, Oxford

    Sneath P. H. A., 1966 Relations between chemical structure & biological activity in peptides J. Theor. Biol 12:157-195

    Thornton J. M., 1992 Protein structures: the end point of the folding pathway Pp. 59–82 in T. E. Creighton, ed. Protein folding. Freeman, New York

    Xia X., 1998 The rate heterogeneity of nonsynonymous substitutions in mammalian mitochondrial genes Mol. Biol. Evol 15:336-344

    Xia X., 2000 DAMBE (software package for data analysis in molecular biology and evolution) Version 4.0 Department of Ecology and Biodiversity, University of Hong Kong, Hong Kong.

    Xia X., W.-H. Li, 1998 What amino acid properties affect protein evolution? J. Mol. Evol 47:557-564

    Yang Z., 2000 PAML (phylogenetic analysis by maximum likelihood) University College, London

    Yang Z., S. Kumar, M. Nei, 1995 A new method of inference of ancestral nucleotide and amino acid sequences Genetics 141:1641-1650

    Yang Z., R. Nielsen, M. Hasegawa, 1998 Models of amino acid substitution and applications to mitochondrial protein evolution Mol. Biol. Evol 15:1600-1611

    Zuckerkandl E., L. Pauling, 1965 Evolutionary divergence and convergence in proteins Pp. 97–166 in V. Bryson and H. J. Vogel, eds. Evolving genes and proteins. Academic Press, New York.

Accepted for publication August 27, 2001.