*Department of Ecology and Evolution, University of Chicago;
and
Falling Rain Genomics, Inc., Palo Alto, California
Abstract |
---|
Introduction |
---|
Despite these observations of the role of conserved exon sequences in the splicing reaction, an alternative view holds that the conserved flanking exon sequences are relics of recognition signals for the insertion of introns, which began soon after the rise of eukaryotes. These sites, dubbed "proto-splice sites" (Dibb and Newman 1989), provide a physical basis for a possible mechanism of intron origin and have become a central feature of the introns-late hypothesis (for a recent review, see Logsdon 1998; Logsdon, Stoltzfus, and Doolittle 1998). However, this proposal has rarely been put to the test (Long et al. 1998), although it has often been invoked in the introns-late argument (e.g., Lee, Stapleton, and Huang 1991). To help define proto-splice sites, one can, on a case-by-case basis, compare the exon sequences flanking a site occupied by an intron in one gene but lacking an intron in its homolog, provided that an adequate number of such pairs of homologous genes is available (Dibb and Newman 1989; Logsdon 1998). However, a powerful test of the proto-splice model has to be built on a statistical description of the general state of introns, because proto-splice sites were defined for hypothetical ancient eukaryotic genomes and would thus determine the positions of all introns of subsequent origin. The distribution of intron phases in eukaryotic genomes provides the first opportunity to develop such a test.
Intron phases are defined as the relative positions of an intron within or between codons (an intron is of phase 0, 1, or 2 if it is located between two intact codons or within a codon after the first or second nucleotide, respectively). Because introns are thought to be functionless in general and thus are usually viewed as neutral evolutionary units, an obvious prediction was that the distribution of the three intron phases should be random, like the distribution of point mutations in other functionless genetic elements (e.g., pseudogenes). However, when a large number of introns in the GenBank DNA sequence database were examined, making use of the great progress of genome projects in recent years, a series of unexpected distributions of intron phases were discovered.
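The phase definition above reduces to a remainder modulo 3. The following is a minimal sketch, assuming that an intron position is recorded as the number of coding nucleotides that precede it in the CDS; the function name `intron_phase` is hypothetical and is not taken from the programs used in this study.

```python
def intron_phase(cds_offset: int) -> int:
    """Return the phase (0, 1, or 2) of an intron.

    `cds_offset` is assumed to be the number of coding nucleotides
    that lie upstream of the intron in the spliced CDS.
    """
    if cds_offset < 0:
        raise ValueError("cds_offset must be non-negative")
    # 0: between two intact codons; 1: after the first nucleotide of a
    # codon; 2: after the second nucleotide of a codon.
    return cds_offset % 3


if __name__ == "__main__":
    for offset in (6, 7, 8):
        print(offset, intron_phase(offset))
```

An intron after 6 coding nucleotides lies between two intact codons (phase 0), whereas introns after 7 or 8 nucleotides split the third codon (phases 1 and 2).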
First, the proportions of the three intron phases were significantly unequal (Fedorov et al. 1992; Long, Rosenberg, and Gilbert 1995; Tomita, Shimizu, and Brutlag 1996). Phase 0 is the most abundant (approximately 50%), followed by phase 1 (approximately 30%), with phase 2 being the least abundant (approximately 20%). Although new introns have continually been added to the databases, new analyses always reveal a similar distribution (see Long and Deutsch 1999; Sakharkar et al. 2000). Second, and more interestingly, multiple introns within a gene showed a significant correlation in the association of their phases. Exons flanked by introns of the same phase significantly outnumbered those predicted from a random association of intron phases, a condition termed "symmetric exon excess" (Long, Rosenberg, and Gilbert 1995; Tomita, Shimizu, and Brutlag 1996). In this correlation, the symmetric exons flanked by phase 1 introns ((1, 1) exons) always showed a higher excess than the other two symmetric exons, (0, 0) and (2, 2), in accordance with the observation that most of the identified cases of exon shuffling involved the same (1, 1) exons (Patthy 1995). Finally, the excess of phase 0 introns and the excess of symmetric exons were also observed in ancient conserved regions (ACRs; Green et al. 1993), suggesting that the same mechanism creating the distinctive distribution of intron phases also worked in such regions of ancient genes.
These observations show that the distribution of intron locations within coding sequences is nonrandom and thus reject a simple form of the insertional hypothesis of intron origin. However, the observed phenomena might be interpreted as a result of intron insertions into nonrandomly distributed proto-splice sites. For this hypothesis, the issue is whether or not the distribution of the phases of proto-splice sites is similar to that of intron phases or, more strictly, whether or not the observed intron phase distribution is a randomly sampled subset of the total proto-splice site distribution. Long et al. (1998) tested this hypothesis by investigating whether or not the observed proportions of intron phases were consistent with the phase proportions of hypothetical proto-splice sites as predicted by dicodon distributions in humans and other organisms. No consistency was found between the distribution of the three intron phases and the phase distribution of proto-splice sites, thus negating any explanatory power of present-day proto-splice sites with regard to the nonrandomness in phase proportions.
In this paper, we extend our analysis of proto-splice sites from the first observation to the second phenomenon, intron phase correlation. We take a hypothesis-testing approach to investigate the validity of the proto-splice site model. Here, the random insertion of introns into nonrandomly distributed proto-splice sites is the null hypothesis that serves as a basis for calculating the probability of the observed intron phase correlation. We first ask if there are phase correlations among adjacent proto-splice sites. Then, we test whether or not the proto-splice sites can explain the correlation of intron phases as observed. Finally, we simulate a process of intron insertion into proto-splice sites and ask how often such an insertional process can generate the observed distribution of intron phases. We will show that the proto-splice site model cannot yield any distribution resembling the observed intron phase correlations.
Materials and Methods |
---|
Exon Database with CDS Sequence
Using a computer program similar to one we wrote to construct an exon database (Long, Rosenberg, and Gilbert 1995), we collected the entries of all intron-containing genes in GenBank release 114 to form a raw database including the DNA sequence for each locus. We then wrote a program to filter out all questionable entries, such as pseudogenes and entries with inconsistent feature tables. We then developed a final exon database in which we calculated the positions and phases of introns and the positions and phases of proto-splice sites by scanning the entire CDS for each proto-splice site. An example of the phases and positions of pseudo-introns is given in figure 1. We also calculated other statistical parameters, such as the lengths of exons and proteins and the protein sequences. In this database, we also constructed CDSs based on the information in the feature tables. The major computing challenge was that the sequences of some exons defined in one feature table are very often stored in different entries. We wrote a program to collect all of the entries that contained the sequence of a single exon into a subdatabase. Then, when we generated CDSs, the computing process automatically visited the subdatabase to fetch the exon sequence needed to form a complete CDS.
Proto-splice Sites and Pseudo-intron Phases
Following our previous study (Long et al. 1998), we chose four hypothetical sites, G | G, AG | G, AG | GT, and C/AAG | R, where "|" indicates a possible insertion site and "/" indicates two alternative states of one nucleotide site. These sites were chosen either because they were suggested by the proponents of proto-splice sites or because they had higher frequencies for real introns in the databases.
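Scanning a CDS for these four sites can be sketched as follows. This is an illustration only, not the program used in the study: the dictionary `PROTO_SPLICE_SITES`, the function `pseudo_introns`, and the example sequence are hypothetical, and R is read as a purine (A or G). Each site is expressed as a zero-width regular expression so that overlapping occurrences are all reported.

```python
import re

# Each site is split at "|" into the nucleotides left and right of the
# hypothetical insertion point. "C/AAG" means C or A, then AG; "R" is
# read here as a purine (A or G).
PROTO_SPLICE_SITES = {
    "G|G": ("G", "G"),
    "AG|G": ("AG", "G"),
    "AG|GT": ("AG", "GT"),
    "C/AAG|R": ("[CA]AG", "[AG]"),
}


def pseudo_introns(cds: str, site: str) -> list[tuple[int, int]]:
    """Return (position, phase) for every occurrence of `site` in `cds`.

    `position` is the number of coding nucleotides upstream of the
    insertion point, so the pseudo-intron phase is `position % 3`.
    """
    left, right = PROTO_SPLICE_SITES[site]
    # Lookbehind and lookahead match the empty string at the insertion
    # point itself, which lets overlapping sites be found.
    pattern = re.compile(f"(?<={left})(?={right})")
    return [(m.start(), m.start() % 3) for m in pattern.finditer(cds.upper())]


if __name__ == "__main__":
    example_cds = "ATGCAGGTAAGGCC"  # invented sequence, for illustration
    for name in PROTO_SPLICE_SITES:
        print(name, pseudo_introns(example_cds, name))
```

In the invented sequence, AG | G occurs after nucleotides 6 and 11 (phases 0 and 2), while the stricter AG | GT site occurs only after nucleotide 6.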
Distributions of Intron Phases and Pseudo-intron Phases
Both intron phases and pseudo-intron phases generate two separate distributions: the proportions of the three phases and the association of phases within genes, f(i, j), where (i, j) = (0, 0), (1, 1), (2, 2), (0, 1), (0, 2), (1, 2), (1, 0), (2, 0), or (2, 1). As shown previously (Long, Rosenberg, and Gilbert 1995; Tomita, Shimizu, and Brutlag 1996; Fedorov et al. 1998), the distribution of intron phases shows a significant bias toward phase 0 introns and an excess of symmetric exons. The frequencies of the two sets of phase associations of introns and pseudo-introns were compared using a likelihood ratio test (G-test; Sokal and Rohlf 1995),

$$G = 2 \sum_{(i,j)} O(i, j) \ln \frac{O(i, j)}{P(i, j)},$$

where O(i, j) and P(i, j) are the frequencies of the associations of introns and pseudo-introns, respectively.
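As a hedged illustration of this comparison, the sketch below computes the G statistic in its standard form, twice the sum of O ln(O/P) over the nine associations, after rescaling the pseudo-intron frequencies to the total number of observed intron associations. The function name `g_statistic` and the toy counts are invented for the example; they are not data from this study.

```python
import math


def g_statistic(observed: dict, expected: dict) -> float:
    """Likelihood ratio (G) statistic between two frequency tables.

    `observed` holds intron association counts O(i, j); `expected`
    holds pseudo-intron association counts P(i, j), which are rescaled
    here so that both tables have the same total.
    """
    total_o = sum(observed.values())
    total_p = sum(expected.values())
    g = 0.0
    for pair, o in observed.items():
        p = expected[pair] * total_o / total_p
        if o > 0:  # a zero observed class contributes nothing to the sum
            g += o * math.log(o / p)
    return 2.0 * g


if __name__ == "__main__":
    # Invented toy counts for two association classes only.
    introns = {(0, 0): 10, (1, 1): 10}
    pseudo = {(0, 0): 1, (1, 1): 3}
    print(g_statistic(introns, pseudo))
```

With nine association classes whose expectations come from an independent set of pseudo-intron counts, G would be referred to a chi-square distribution with eight degrees of freedom; that choice is an assumption of this sketch, not a statement taken from the text.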
Monte Carlo Simulation
A Monte Carlo simulation was performed as a direct statistical test of the model in which the real introns in each gene are a result of random insertion into proto-splice sites. We generated all possible pseudo-introns in each gene in the purged database, then randomly targeted these sites once for each real intron in the gene. Figure 2 gives an example of the molecular model of intron insertion into proto-splice sites in the simulation process. Having treated all the genes from the purged database, we created a comparable array of virtual genes, for which we then calculated the distribution of the simulated pseudo-intron phases and associations. One hundred thousand such arrays and distributions were generated. In each array, we measured the difference between the observed intron phase associations and the distribution predicted by the insertion model using

$$X^2 = \sum_{(i,j)} \frac{\left[O(i, j) - S(i, j)\right]^2}{S(i, j)},$$

where O(i, j) and S(i, j) are the observed and simulated frequencies of association (i, j) between two adjacent introns or pseudo-introns. The frequency of pseudo-intron association S(i, j), normalized to the total number of observed intron associations, was calculated in two ways: (1) by direct counts from simulated pseudoexon-intron structures; (2) as the product of the proportions of two pseudo-introns, i and j.
We then generated a frequency distribution of the obtained values of $X^2$ to describe the dissimilarity between all simulated results and the observed intron associations. Meanwhile, we also calculated the probability of the statistical patterns of the observed relative excess of symmetric pseudo-exons. The pattern was described by

$$R(0, 0) > 0.05 \;\wedge\; R(1, 1) > R(0, 0) + 0.05 \;\wedge\; R(1, 1) > R(2, 2) + 0.05 \;\wedge\; R(2, 2) > 0.05 \;\wedge\; F(i, j) < E(i, j)$$

for ($i \neq j$), where (1) R(i, i) = (F(i, i) - E(i, i))/E(i, i), R(i, i) is the measurement of the excess of the (i, i) type of symmetric pseudo-exons; (2) E(i, j) = F(i) x F(j) x N, E(i, j) is the expected frequency of the (i, j) pseudo-exon and F(i) and N are the observed proportion of pseudo-intron i and total internal exon number, respectively; and (3) the logic sign "$\wedge$" means that the given conditions are met simultaneously.
This is a very conservative test of symmetric phase description. The actual observed excess of symmetric exons is just a portion of the excess measured here, and the observed excess of (0, 0) exons is higher than 0.05. The differences in the excesses between (0, 0) and (1, 1) and between (0, 0) and (2, 2) are also larger than 0.05, which we used in the computing. The simulation was carried out in the UNIX environment of an Alpha workstation, where the function drand48() was used as a random number generator.
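The pattern test can be sketched as a single Boolean check on one simulated data set. The sketch assumes that the association counts F(i, j) and the phase proportions F(i) have already been tabulated; the function name `matches_excess_pattern`, the `margin` parameter, and the toy numbers are hypothetical and serve only to illustrate the five joint conditions.

```python
from itertools import product


def matches_excess_pattern(pair_counts: dict, phase_props: dict,
                           margin: float = 0.05) -> bool:
    """Check the joint excess pattern of symmetric pseudo-exons.

    `pair_counts[(i, j)]` is F(i, j); `phase_props[i]` is F(i).
    N is taken as the total of all nine association counts.
    """
    n = sum(pair_counts.values())
    expected = {(i, j): phase_props[i] * phase_props[j] * n
                for i, j in product(range(3), repeat=2)}
    # Relative excess R(i, i) of each symmetric class.
    r = {i: (pair_counts[(i, i)] - expected[(i, i)]) / expected[(i, i)]
         for i in range(3)}
    # Every asymmetric class must be deficient: F(i, j) < E(i, j), i != j.
    asymmetric_deficient = all(pair_counts[(i, j)] < expected[(i, j)]
                               for i, j in expected if i != j)
    return (r[0] > margin
            and r[1] > r[0] + margin
            and r[1] > r[2] + margin
            and r[2] > margin
            and asymmetric_deficient)


if __name__ == "__main__":
    props = {0: 0.5, 1: 0.3, 2: 0.2}  # invented proportions
    counts = {(0, 0): 275, (1, 1): 108, (2, 2): 44,
              (0, 1): 140, (1, 0): 140, (0, 2): 92,
              (2, 0): 92, (1, 2): 55, (2, 1): 54}  # invented counts
    print(matches_excess_pattern(counts, props))
```

The probability reported for the pattern is then the fraction of simulated data sets for which such a check returns true.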
Results |
---|
The proportions of the three intron phases and the frequencies of the nine associations of introns showed significantly nonrandom distributions (table 1). As in previous versions of the database, the intron proportions are unequal (48% phase 0, 28% phase 1, and 24% phase 2; G = 7,787, $P \ll 10^{-100}$). The arrangements of intron phases within genes showed strong correlation: all symmetric exons, (0, 0), (1, 1), and (2, 2), showed significant excess over a random prediction (G = 867, $P = 7.4 \times 10^{-182}$), with (1, 1) showing the highest relative excess (23%). These observations are consistent with previous reports (Long, Rosenberg, and Gilbert 1995; Tomita, Shimizu, and Brutlag 1996; Fedorov et al. 1998). Our previous investigation (Long et al. 1998), which focused on the relationship between intron phase proportions and the proto-splice sites, rejected the model that the proto-splice sites could predict the distribution of intron phase proportions. This report focuses on a similar analysis of associations of proto-splice sites as predictors of the intron phase correlation, i.e., the correlation analysis of pseudo-intron phases.
AG | G and C/AAG | R seem better candidates because all three of their symmetric pseudo-exons have excesses and all six of their asymmetric pseudo-exons have deficiencies compared with expectations. However, both sites show conflicting patterns of symmetric pseudo-exons: (2, 2) pseudo-exons show the highest excesses among the three symmetric types, 40% for AG | G sites and 36% for C/AAG | R. This differs from intron phase associations, which show the excess pattern (0, 0) < (1, 1) > (2, 2), i.e., (1, 1) exons having the highest excess, consistent with observed cases of exon shuffling (Patthy 1995).
The difference shown by the pattern analysis was further supported by a simple statistical comparison between the phase distributions of introns and pseudo-introns. All proto-splice sites showed significant differences (the smallest G, from AG | G sites, was 3,680, with $P \ll 10^{-100}$). Thus, the phase distributions generated by proto-splice sites were not the same as the distribution of intron phases. Besides the statistical comparison between the observed intron phase associations and the overall distribution of pseudo-intron phase associations, another, biologically more sensible statistical test was developed by directly simulating the process of intron insertion into proto-splice sites under the hypothesis that introns are results of insertion into preexisting proto-splice sites.
On average, the numbers of most hypothetical proto-splice sites (G | G, AG | G, and C/AAG | R) in each CDS were 4–15 times as numerous as introns (3.7/kb CDS; Deutsch and Long 1999). This allows randomization of intron positions among proto-splice sites in each gene in computer simulation processes. Each randomization of the available introns in each gene generated one set of outcomes following the simulated insertions into a portion of proto-splice sites. Then, we investigated the associations of the pseudo-intron phases in each resulting set of outcomes and compared them with the observed intron associations.
In the program written for the simulation process, each simulation experiment began with the first gene in the database. The number of introns in this gene, n, was counted before the random number generator was called to randomly assign introns to n of the m previously defined proto-splice sites (m > n). Then, a string of pseudo-intron phases and their positions was calculated from the assignment. After this process had reached the last gene of the database, we had one new "database" with a set of randomly inserted positions. The associations of pseudo-intron phases for this set of randomized genes were then calculated and compared with the observed intron phase associations in the "real" database. We repeated this simulation experiment 100,000 times and generated a frequency distribution of statistical comparisons.
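The simulation loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the original program: each gene is represented as a pair of (proto-splice site positions, number of real introns), Python's `random` module stands in for `drand48()`, and the names `simulate_once`, `simulate`, and `x_squared` are hypothetical. The dissimilarity measure is the sum over associations of (O - S)^2 / S, with S rescaled to the total of the observed associations.

```python
import math
import random
from collections import Counter


def simulate_once(genes, rng):
    """One randomized 'database': insert n introns into m sites per gene."""
    counts = Counter()
    for site_positions, n_introns in genes:
        # Choose n of the m proto-splice sites without replacement.
        chosen = sorted(rng.sample(site_positions, n_introns))
        phases = [pos % 3 for pos in chosen]
        # Adjacent pseudo-introns flank one internal pseudo-exon each.
        counts.update(zip(phases, phases[1:]))
    return counts


def simulate(genes, replicates, seed=0):
    """Generate `replicates` randomized data sets from the same genes."""
    rng = random.Random(seed)
    return [simulate_once(genes, rng) for _ in range(replicates)]


def x_squared(observed, simulated):
    """Sum of (O - S)^2 / S over the nine phase associations."""
    total_o = sum(observed.values())
    total_s = sum(simulated.values())
    x2 = 0.0
    for i in range(3):
        for j in range(3):
            o = observed.get((i, j), 0)
            s = simulated.get((i, j), 0) * total_o / total_s
            if s == 0:
                if o > 0:
                    return math.inf  # observed class never simulated
                continue
            x2 += (o - s) ** 2 / s
    return x2


if __name__ == "__main__":
    # Invented toy genes: (site positions in the CDS, number of introns).
    toy_genes = [([3, 7, 11, 15, 20, 24], 3), ([4, 9, 13, 18], 2)]
    observed = Counter({(0, 0): 2, (1, 1): 1})
    values = [x_squared(observed, s) for s in simulate(toy_genes, 1000)]
    print(sum(math.isfinite(v) for v in values), "finite values of X^2")
```

Sampling without replacement reflects the requirement that n introns occupy n distinct sites among the m available (m > n); the list of statistics collected over all replicates gives the frequency distribution that is compared with the observed intron associations.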
Figure 3a–d plots the frequency distributions of $X^2$ measures for the comparisons of all 100,000 simulated data sets. These distributions actually contain two sets of comparisons. First, for a particular $X^2$ value that contains a particular expectation from one sampling of proto-splice sites, the standard $\chi^2$ distribution can be used to test whether the observed intron phase correlation is significantly different from that expectation. Second, by examining the frequency distributions of $X^2$ values, we ask how often, in general, the observed intron phase correlation can be found in those 100,000 simulated data sets.
It should be noted that, given the huge sample size in this investigation, this test might have used too strict a criterion of fit to the hypothesis. A more conservative test was therefore conducted by directly examining particular patterns of intron phase correlation. When this very conservative test was used to calculate the probability of the observed pattern of excess symmetric exons (defined as $R(0, 0) > 0.05 \wedge R(1, 1) > R(0, 0) + 0.05 \wedge R(1, 1) > R(2, 2) + 0.05 \wedge R(2, 2) > 0.05 \wedge F(i, j) < E(i, j)$, $i \neq j$; see Materials and Methods), none of the 100,000 simulations generated an excess of symmetric pseudo-intron associations that fell into the broad range of excess patterns including the observed intron phase associations ($P < 10^{-5}$).
Discussion |
---|
We observed that the arrangement of proto-splice sites is nonrandom with respect to their phases. At first glance, the phase correlations of some proto-splice sites seem to mimic the correlation of intron phases, suggesting the possibility that the excess symmetric exons might be a consequence of intron insertions into the correlated proto-splice sites. Upon closer inspection, the sites that show phase correlations are those that have very low conservation, like any random sites in the genomes, and thus cannot be taken as vestigial candidates for insertions. The site that has the highest conservation among the tested sites, G | G, does not show any significant correlations.
The nonrandom distribution of proto-splice sites provides a unique opportunity to model a hypothetical process of intron insertion. The Monte Carlo simulation for such a model showed a significantly low probability that proto-splice sites underlie the distribution of intron phases. This is consistent with the statistical test of the difference between the distributions of intron phases and proto-splice sites. These two statistical tests rejected the model of proto-splice sites. Finally, we tried several candidates for proto-splice sites. When more sites, albeit randomly selected ones, are taken as proto-splice sites, one always can find, by chance, some sites that show some similarity to any statistical pattern. For example, we tried AT | T and TT | C, which also show some similarity.
The three introns in Xdh found by Tarrio et al. (1998) were viewed as evidence for the proto-splice site model of intron gains (Logsdon et al. 1998). Two problems weaken this argument. First, the definition of the proto-splice sites is ad hoc. Altogether, four sites were defined, CAG | G, AAG | G, GAA | A, and TCN | G (N refers to any of the four nucleotides, A, T, G, or C). The last two sites have nothing to do with known conserved sequences surrounding splice sites, although the first two sites have some similarity to the consensus sequence of splice sites. Dibb and Newman (1989) proposed that proto-splice sites existed in intron-lacking ancestral genes. This proposal may offer a specific prediction for the distribution of the insertion sites: the flanking exon sequence motif should be present in both intron-containing and intron-lacking sequences. This concept, however, also predicts some conservation in the proto-splice sites as a recognition signal for intron insertion. This criterion is violated in this case. Second, the argument that these introns were recently acquired is based on a standard but questionable approach of phylogenetic distribution. In this line of logic, once an intron appears in a small number of branches in a tree, it is viewed, by what is thought of as a parsimony approach, as a recent gain. However, considering the biological reality that intron loss may occur much more frequently than intron gain, this approach is not justified. For example, almost all introns in the genome of Saccharomyces cerevisiae have been lost by gene conversion (Fink 1987). Thus, it is not reasonable to dismiss the alternative hypothesis that many independent intron losses are more likely than a single intron insertion that may bring deleterious effects to the target genes.
Finally, one cause can be inferred for the distributions of pseudo-intron phases: the repetition of amino acid residues with particular dicodon and codon usage, which we found to contribute to the correlation of some pseudo-intron phases. For AG | G proto-splice sites, for example, a string of glutamic acids, Glu.Glu.Glu.Glu, encoded by gag.gag.gag.gag, will create two (0, 0) symmetric pseudo-exons. This scenario was supported by the biased dicodon usage of gag.gag, which is the highest among all dicodons in mammals, plants, and invertebrates (1,000-fold higher than the lowest dicodon usages) (Long et al. 1998) and thus will make a contribution to the (0, 0) symmetric exons. In fact, when we deleted all adjacent amino acid repeats, we found that the excess of symmetric exons dropped significantly. Similarly, particular nonadjacent amino acid repeats and codon usages can also contribute to particular symmetric and asymmetric pseudo-exon distributions.
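The Glu.Glu.Glu.Glu case can be checked directly. The short sketch below, which is illustrative only, locates every AG | G site in gag.gag.gag.gag with a zero-width regular expression and tabulates the phases of adjacent pseudo-introns; the variable names are hypothetical.

```python
import re
from collections import Counter

# Four consecutive glutamic acid codons, as in the text.
cds = "gaggaggaggag".upper()

# Insertion points that have AG on the left and G on the right.
sites = [m.start() for m in re.finditer(r"(?<=AG)(?=G)", cds)]
phases = [pos % 3 for pos in sites]

# Each pair of adjacent pseudo-introns flanks one pseudo-exon.
pseudo_exons = Counter(zip(phases, phases[1:]))

print(sites)         # positions of the AG | G sites within the CDS
print(phases)        # phase of each pseudo-intron
print(pseudo_exons)  # counts of (i, j) pseudo-exons
```

The three sites fall after nucleotides 3, 6, and 9, all in phase 0, so the two pseudo-exons between them are both of the (0, 0) type.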
This investigation, along with a previous companion study, showed that the actual distribution of proto-splice sites in eukaryotic genes differs significantly from the distribution of intron phases. This rejects the proto-splice site model as a null hypothesis to account for the unique distribution of intron phases. Instead, the best explanation for this distribution is that a large amount of exon shuffling, as predicted by the exon theory of genes (Gilbert 1987), created excess symmetric exons and overrepresented phase 0 introns (Long, Rosenberg, and Gilbert 1995; Long et al. 1998), because the observed patterns of symmetric and asymmetric exons are consistent with observed cases of exon shuffling (e.g., Patthy 1991, 1995, 1999).
Acknowledgements |
---|
Footnotes |
---|
1 Keywords: intron phase, proto-splice sites, exon shuffling, evolution of intron-exon structures, intron insertion
2 Address for correspondence and reprints: Manyuan Long, Department of Ecology and Evolution, University of Chicago, 1101 East 57th Street, Chicago, Illinois 60637. E-mail: mlong{at}midway.uchicago.edu
Literature Cited |
---|
Burge, C. B., T. Tuschl, and P. A. Sharp. 1999. Splicing of precursors to mRNAs by the spliceosomes. Pp. 525–560 in R. F. Gesteland, T. R. Cech, and J. F. Atkins, eds. The RNA world. 2nd edition. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.
Deutsch, M., and M. Long. 1999. Intron-exon structures of eukaryotic model organisms. Nucleic Acids Res. 15:3219–3228
Dibb, N. J., and A. J. Newman. 1989. Evidence that introns arose at proto-splice site. EMBO J. 8:2015–2022
Fedorov, A., L. Fedorova, V. Starshenko, V. Filatov, and E. Grigor'ev. 1998. Influence of exon duplication on intron and exon phase distribution. J. Mol. Evol. 46:263–271
Fink, G. R. 1987. Pseudogenes in yeast? Cell 49:5–6
Gilbert, W. 1987. The exon theory of genes. Cold Spring Harb. Symp. Quant. Biol. 52:901–905
Green, P., D. Lipman, L. Hillier, R. Waterston, D. States, and J. M. Claverie. 1993. Ancient conserved regions in new gene-sequences and the protein databases. Science 259:1711–1716
Lee, V. D., M. Stapleton, and B. Huang. 1991. Genomic structure of Chlamydomonas caltractin: evidence for intron insertion suggests a probable genealogy for the EF-hand superfamily of proteins. J. Mol. Evol. 221:175–191
Logsdon, J. M. 1998. The recent origins of spliceosomal introns revisited. Curr. Opin. Genet. Dev. 8:637–648
Logsdon, J. M., A. Stoltzfus, and W. F. Doolittle. 1998. Molecular evolution: recent cases of spliceosomal intron gain? Curr. Biol. 8:R560–R563
Long, M., S. J. deSouza, and W. Gilbert. 1997. The yeast splice site revisited: new exon consensus from genomic analysis. Cell 91:739–740
Long, M., S. J. DeSouza, C. Rosenberg, and W. Gilbert. 1998. Relationship between "proto-splice sites" and intron phases: evidence from dicodon analysis. Proc. Natl. Acad. Sci. USA 94:219–313
Long, M., and M. Deutsch. 1999. Association of intron phases with conservation at splice site sequences and evolution of spliceosomal introns. Mol. Biol. Evol. 16:1528–1534
Long, M., C. Rosenberg, and W. Gilbert. 1995. Intron phase correlations and the evolution of the intron/exon structure of genes. Proc. Natl. Acad. Sci. USA 92:12495–12499
Moore, M. J., C. C. Query, and P. A. Sharp. 1993. Splicing precursors to messenger RNAs by the spliceosome. Pp. 303–357 in R. Gesteland and J. Atkins, eds. The RNA world. 1st edition. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.
Newman, A. J., and C. Norman. 1991. Mutations in yeast U5 snRNA alter the specificity of 5' splice-site cleavage. Cell 65:115–123
Newman, A. J., and C. Norman. 1992. U5 snRNA interacts with exon sequences at 5' and 3' splice sites. Cell 68:743
Patthy, L. 1991. Modular exchange principles in proteins. Curr. Opin. Struct. Biol. 4:351–361
Patthy, L. 1995. Protein evolution by exon shuffling. Springer-Verlag, New York
Patthy, L. 1999. Genome evolution and the evolution of exon-shuffling. Gene 238:103–114
Pearson, W. R. 2000. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132:185–219
Rubin, G. M., M. D. Yandell, J. R. Wortman et al. (55 co-authors). 2000. Comparative genomics of the eukaryotes. Science 287:2204–2215
Sakharkar, M., M. Long, T. W. Tan, and S. J. de Souza. 2000. ExInt: an exon/intron database. Nucleic Acids Res. 28:191–192
Sokal, R. R., and F. J. Rohlf. 1995. Biometry. 3rd edition. Freeman, New York
Tarrio, R., F. Rodriguez-Trelles, and F. J. Ayala. 1998. New Drosophila introns originate by duplication. Proc. Natl. Acad. Sci. USA 95:1658–1662
Tomita, M., N. Shimizu, and D. L. Brutlag. 1996. Introns and reading frames: correlation between splicing sites and their codon positions. Mol. Biol. Evol. 13:1219–1223
Treisman, R., N. J. Proudfoot, M. Shander, and T. Maniatis. 1982. A single-base change at a splice site in a beta 0-thalassemic gene causes abnormal RNA splicing. Cell 29:903–911