Departments of * Biochemistry
Zoology, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Ramat Aviv, Israel
Correspondence: E-mail: tali{at}kimura.tau.ac.il.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: molecular evolution substitution rates pseudogenes nonfunctionalization computational method galactosyltransferase
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The assumption of equal probabilities of substitution in the one- and the two-parameter models is, however, unrealistic. For instance, under this supposition, the nucleotide frequencies at equilibrium (i.e., after a sufficiently long period of time) are expected to equal each other (Nei 1984). This is in contrast with the preference for nucleotides A and T generally observed in noncoding segments (Graur and Li 1999). Other methods that are less sensitive to deviations from equilibrium frequencies have been proposed (e.g., Tajima and Nei 1984).
Nei (1987) showed that Kimura's two-parameter model might be generalized, at least in theory, by incorporating all 12 types of possible substitutions among nucleotides. Nei (1987) demonstrated that such a formulation leads to a different frequency of nucleotides at equilibrium and therefore will reflect the pattern of nucleotide substitution more closely. Nei's formulation (1987) could not be tested in practice when it was proposed because it required "a reliable nucleotide-substitution matrix based on a large amount of substitution data," which was then unavailable. Today, the availability of a large set of pseudogene sequences makes possible the determination of such a substitution matrix.
Based on the assumption that pseudogenes are not subject to functional constraints, Gojobori, Li, and Graur (1982) devised a method for calculating the relative rates of nucleotide substitutions in pseudogenes in the course of evolution. This method was used to calculate relative substitution rates in a large number of human pseudogenes (Graur and Li 1999). Here, we incorporate these substitution-rates data into a method of evolutionary-distance measurement as proposed by Nei (1987). Thus, we derive a method for pairwise Assessment of Nonfunctionalization Times (pANT) by using the alignment between a functional gene and a paralogous processed pseudogene.
Existing methods for estimating nonfunctionalization times (e.g., Li, Gojobori, and Nei 1981 and Miyata and Yasunaga 1981) rely on knowledge of three sequences: (1) a functional gene, (2) a paralogous pseudogene, and (3) an orthologous functional gene from another species as an outgroup. In calculating the nonfunctionalization times, these methods assume that all three sequences evolve at the same rate as long as they are functional. This assumption was shown to be unrealistic, both because different functional constraints may be at work and because rates are known to vary with organismic lineage.
The model we present here assumes that, because of functional constraints, the gene has undergone a negligible number of substitutions in its nonsynonymous codon positionsi.e., the first and second positionssince the divergence between the gene and the pseudogene (Nei and Kumar 2000). This assumption means, in essence, that the sequence of the functional gene may be treated as the pseudogene's forebear, and makes the use of an orthologous-gene sequence superfluous. Because pANT makes no use of a functional ortholog, we need make no assumption regarding the nucleotide-substitution rate in the orthologous species, thereby reducing inaccuracies.
![]() |
pANT |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
We use the fact that processed pseudogenes and functional genes are subject to different nucleotide-substitution patterns. Let us assume the evolutionary scenario presented in figure 1. When G is the evolutionary progenitor of the contemporary gene g and the pseudogene , we may assume that, because of functional constraints, the sequences of G and g in the nonsynonymous codon positions are the same. Because processed pseudogenes are "dead on arrival" (Graur and Li 1999), we assume that the nucleotide composition of
at the time of its nonfunctionalization was the same as that of G. Since its nonfunctionalization,
has been devoid of function and has accumulated substitutions at an accelerated rate compared to g.
|
Under our assumptions, diverged from g at the time of its nonfunctionalization. Let us mark by
the proportion of nucleotides in first and second codon positions of type i in sequence g at time t, which are identical in the sequence
. We mark by
i the proportion of nucleotides, which, according to R, are other than i at time t, and are substituted for i at time t + 1. The proportion of nucleotides i at time t that are substituted for other nucleotides at t + 1 is marked by ßi. The proportion of identical nucleotides of type i at time t + 1,
, can then be obtained:
|
Because dqi/dt =
, then
|
Thus, our treatment considers eight different parameters, i.e., and ß for each of the four nucleotides. Solving equation (2) as an ordinary differential equation, and setting
= 1 (Boyce and DiPrima 1977), we obtain,
|
Marking (i + ßi)t as di, defined as the evolutionary distance, and solving equation (3) for di, we obtain:
|
We thus obtain four different measurements for the distance d, one for each of the nucleotides. In the subsequent sections we estimate the distance d as the average of these four values:
|
The pANT software is available online at http://pant.tau.ac.il.
![]() |
Time Calibration |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
The time evolution of pseudogene sequences was simulated as spontaneous evolution of the mRNA of translation elongation factor 1 (GenBank accession number NM_001402). We used two parameters for the simulated evolution: (1) the substitution rate of human pseudogenes, i.e., 3.9 x 109 substitutions per year (Graur and Li 1999) and (2) the nucleotide-substitution matrix in Graur and Li (1999).
We designed the simulation so that each iteration corresponds to the spontaneous evolution occurring over the time span of 1 year. Thus, the number of iterations determines the nonfunctionalization time in years of the simulated pseudogene. In each iteration we determined whether each nucleotide i in the sequence undergoes mutation by sampling values from a uniform-probability distribution pi (0 1) and comparing these values to the yearly rate of substitution in pseudogenes. That is, if the condition pi < 3.9 x 109 was satisfied, then the nucleotide i was a target for substitution.
Once a nucleotide was selected to undergo mutation, we sampled another value from a uniform-probability distribution, qi (0 1), to determine the new identity of the nucleotide at position i. The substituting nucleotide was determined by the sampled value in the cumulative probability function formed by the substitution matrix R.
The simulation times ranged from 0 to 400 Myr. We repeated the simulation 10 times for each time value (i.e., we produced 10 different pseudogenes for each nonfunctionalization time). For each resultant sequence we estimated its nonfunctionalization distance, d, using pANT. For comparison, nonfunctionalization times were also estimated using the method of Li, Gojobori, and Nei (1981).
We found good correlation between the evolution times being simulated and the pANT distances (r2 = 0.98), and significant (P << 0.05) linear relation between the two variables. In fact, linearity is maintained for many time ranges. Using simple linear regression with the pANT distances as the predictor and the simulation evolutionary times as the predicted variables, we obtained a regression coefficient of 555 ± 1 Myr, which implies that each distance unit derived from pANT is equivalent to about 550 Myr (fig. 2). Considering the significant linearity and the good correlation between the pANT distance and the simulation times, we can use the regression equation from the simulation results as a reliable converter of the pANT distance units into units of time.
|
It is conceivable that biases in the substitution-rates matrix R may affect the nonfunctionalization-time estimates. The calculation of the R matrix in Graur and Li (1999) is based on 104 pseudogenes. To assess whether the choice of pseudogenes that are used for deriving the matrix R has a significant effect on our estimates, we recalculated the spontaneous-substitution matrix R for a group of 145 pseudogenes of the human ribosomal protein LA21 (Zhang, Harrison, and Gerstein 2002), which are distributed evenly among and along the chromosomes (the recalculated matrix R is available online at http://pant.tau.ac.il).
Using this recalculated substitution matrix, we conducted the evolutionary simulation of a pseudogene sequence. We then recalculated the linear-regression coefficient between the pANT results, using the substitution-rates matrix R taken from Graur and Li (1999) and the simulation results. This resulted in a linear coefficient of 544 ± 2 Myr, which is extremely close to the value we obtained by using the evolutionary simulations based on the matrix in Graur and Li (1999). This indicates that the pANT results are relatively insensitive to biases that may be present in the data sets.
![]() |
Case Study |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Based on the above phylogenetic information, the nonfunctionalization of the 1,3GT pseudogene was estimated to have occurred in the common ancestor of catarrhines, i.e.,
40 MYA (Joziasse 1991; Goodman et al. 1998; Koike et al. 2002). In contrast, Galili and Swanson (1991) used the substitution patterns in genes and pseudogenes to estimate that the nonfunctionalization event occurred after the divergence of apes and monkeys, i.e., 1725 MYA. Their estimate implies that the nonfunctionalization event occurred independently in three different lineages.
Because the human 1,3GT pseudogene is unitaryi.e., it has no functional homolog in humanwe used a homologous sequence from marmoset (GenBank accession number AF384428). The sequences were aligned using ClustalW (Thompson, Higgins, and Gibson 1994), and gaps were eliminated from further consideration. Third codon positions were also excluded from the analysis. The nonfunctionalization-time estimate using the method of Li, Gojobori, and Nei (1981) requires an orthologous sequence, for which we used the
1,3GT from rat (GenBank accession number NM_145674).
The distance obtained using pANT was 0.083; multiplying this value by the factor derived from the evolutionary simulation (555 Myr) yields an estimate of 46 Myr, in agreement with the estimate based on paleontological evidence (Joziasse 1991; Koike 2002). Similar results were obtained by using the substitution-rates matrix R derived from the ribosomal LA21 pseudogenes (Zhang, Harrison, and Gerstein 2002). Using the method of Li, Gojobori, and Nei (1981) resulted in an estimate of 12 Myr, which is significantly lower than the fossil-based estimates, and would imply multiple independent losses of 1,3GT functionality.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
pANT's reliance on only two sequences allows its application even in cases in which other methods cannot be used. An additional factor that increases the accuracy of pANT with respect to other methods is its use of a substitution-rates table that was derived specifically for a particular species (Gojobori, Li, and Graur 1982), and reflects a particular pattern of nucleotide substitution. We can therefore use it in cases, such as bacterial pseudogenes, in which the mammalian pattern of substitution is most certainly inappropriate. Furthermore, the use of a simulation of evolutionary time in order to calibrate pANT's time units sidesteps the problem of using unreliable measures of speciation times, and opens the way to assessing nonfunctionalization times in species whose phylogenetic position is unknown.
Simulations that we conducted on a model sequence showed that the number of substitutions saturates after about 250 Myr, at a level of 70% sequence identity (data not shown). Thus, below 70% identity, pANT and other methods which rely on nucleotide distances should be considered unreliable. However, considering that pseudogenes are generally identified using a 60%70% identity threshold (e.g., Zhang, Harrison, and Gerstein 2002), this limitation does not exclude a large number of potential pseudogene sequences.
The pANT estimate for the nonfunctionalization time of the 1,3GT was in agreement with estimates based on nonmolecular data. Indeed our estimate is somewhat larger than that reported by Goodman et al. (1998) in their assessment of the divergence time of platyrrhines and cattarhines (
40 Myr). However, it was shown that estimates of divergence times using secondary calibration points should be regarded as indicative of the time range of the divergence rather than its exact value (Shaul and Graur 2002). In contrast, the 15-Myr estimate of Galili and Swanson (1991) and the 12-Myr estimate obtained by applying the method of Li, Gojobori, and Nei (1981) are implausible because they would imply multiple pseudogenization events in independent lineages. Given the findings of Koike et al. (2002) to the effect that rhesus, orangutan, and human share two identical molecular defects in their orthologous
1,3GT pseudogenes (a frameshifting deletion in pseudoexon 7 and a nonsense mutation in pseudoexon 9), a date of less than 40 Myr would require us to "believe as many as six impossible things before breakfast."
The various genome-sequencing projects have revealed many processed pseudogenes (and numerous others will most certainly be discovered), for which assessment of orthology or paralogy may be difficult. With pANT, we may estimate the nonfunctionalization times of these pseudogenes by using only the pairwise alignment of the pseudogene with any functional homolog.
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
Literature Cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Boyce, W. E., and R. C. Diprima. 1977. Elementary differential equations. Wiley, New York.
Galili, U., and K. Swanson. 1991. Gene sequences suggest inactivation of -1,3-galactosyltransferase in catarrhines after the divergence of apes from monkeys. Proc. Natl. Acad. Sci. USA 88:7401-7404.[Abstract]
Gojobori, T., W. H. Li, and D. Graur. 1982. Patterns of nucleotide substitution in pseudogenes and functional genes. J. Mol. Evol. 18:360-369.[ISI][Medline]
Goodman, M., C. A. Porter, J. Czelusniak, S. L. Page, H. Schneider, J. Shoshani, G. Gunnell, and C. P. Groves. 1998. Towards a phylogenetic classification of primates based on DNA evidence complemented by fossil data. Mol. Phyl. Evol. 9:585-598.[CrossRef][ISI][Medline]
Graur, D., and W. H. Li. 1999. Fundamentals of molecular evolution. Sinauer Associates, Sunderland, Mass.
Joziasse, D. H., J. H. Shaper, E. W. Jabs, and N. L. Shaper. 1991. Characterization of an -1,3-galactosyltransferase homologue on human chromosome 12 that is organized as a processed pseudogene. J. Biol. Chem. 266:6991-6998.
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 2132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.
Kimura, M. 1980. A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.[ISI][Medline]
Koike, C., J. J. Fung, D. A. Geller, R. Kannagi, T. Libert, P. Luppi, I. Nakashima, J. Profozich, W. Rudert, S. B. Sharma, T. E. Starzl, and M. Trucco. 2002. Molecular basis of evolutionary loss of the -1,3-galactosyltransferase genes in higher primates. J. Biol. Chem. 277:10114-10120.
Li, W. H., T. Gojobori, and M. Nei. 1981. Pseudogenes as a paradigm of neutral evolution. Nature 292:237-239.[ISI][Medline]
Miyata, T., and T. Yasunaga. 1981. Rapidly evolving mouse alpha-globin-related pseudo gene and its evolutionary history. Proc. Natl. Acad. Sci. U.S.A. 78:450-453.[Abstract]
Nei, M. 1987. Molecular evolutionary genetics. Columbia University Press, New York.
Nei, M., and S. Kumar. 2000. Molecular evolution and phylogenetics. Oxford University Press, New York.
Ophir, R. 1999. The contribution of mutational factors and functional constraints to the molecular evolution of murids and humans. Doctoral thesis, Tel-Aviv University.
Shaul, S., and D. Graur. 2002. Playing chicken (Gallus gallus): methodological inconsistencies of molecular divergence date estimates due to secondary calibration points. Gene 300:59-61.[CrossRef][ISI][Medline]
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.[Abstract]
Zhang, Z., P. Harrison, and M. Gerstein. 2002. Identification and analysis of over 2000 ribosomal protein pseudogenes in the human genome. Genome Res. 12:1466-1482.
|