Messenger RNA Surveillance and the Evolutionary Proliferation of Introns

Michael Lynch and Avinash Kewalramani

Department of Biology, Indiana University


    Abstract
 
The mechanisms responsible for the proliferation and subsequent stabilization of introns within the eukaryotic lineage have remained elusive. In the early stages of eukaryotic evolution, most introns may have been mildly deleterious at the time of insertion, but enough of them eventually acquired integral roles in transcript processing that few eukaryotic species can any longer survive without them. We suggest that the proliferation of spliceosomal introns was facilitated by the evolution of nonsense-mediated decay, an ancient and (in many cases) intron-dependent mechanism for eliminating aberrant mRNA molecules resulting from errors in transcription and splicing and from mutations at the DNA level. The spatial distribution of introns, as revealed by whole-genome analysis, is consistent with expectations for a model in which maximum protective coverage of a gene stochastically evolves over time.

Key Words: genome complexity • genome evolution • introns • mRNA processing • mRNA surveillance • nonsense-mediated decay • null alleles • splicing


    Introduction
 
Some of the deepest unsolved mysteries of eukaryotic genome evolution concern introns (Lynch and Richardson 2002). The presence of the spliceosome components in perhaps all of the basal lineages of eukaryotes (Logsdon 1998; Archibald et al. 2002; Nixon et al. 2002) implies that, as a group, spliceosomal introns are ancient. But why have they differentially proliferated within some lineages? Are they selectively neutral, or are they maintained by positive selection? Newly arisen introns are likely to be mildly deleterious as a consequence of the increased mutation rate to nonfunctional alleles associated with essential intron-recognition sequences (Lynch 2002). Nevertheless, assuming the presence of a functional spliceosome, for populations with effective sizes much smaller than the reciprocal of this excess mutation rate, introns might be expected to proliferate in a nearly neutral fashion, whereas in larger populations, selection should be efficient enough to prevent intron colonization. The maximum allowable population size for the initial proliferation of introns depends on the stringency of the intron-recognition requirements of a species as well as on the per-nucleotide mutation rate, but permissive long-term effective sizes as large as $10^6$ to $10^7$ individuals are not out of the question (Lynch 2002). Because such conditions are at least transiently fulfilled in many eukaryotic species, the earliest phase of intron expansion in eukaryotes may have been driven in large part by drift and mutational processes.

Although it is unlikely that many intron-containing alleles were immediately advantageous at the time of origin, introns are no longer passive players in genome evolution. Instead, what once may have been a simple by-product of small population size provided the raw material for the evolution of novel mechanisms for regulating gene expression and processing gene products. Indeed, because almost all of the major events in the production of mature mRNAs (including transcription initiation, elongation, polyadenylation, termination, 5' capping, and mRNA export and surveillance) are now highly coupled with exon definition and/or intron splicing (Maniatis and Reed 2002), it is likely that few if any of today's eukaryotes can survive without introns. A few examples will suffice to make this point.

First, a direct interaction between various splicing factors and elongation factors promotes transcription elongation (Ares et al. 1999; Fong and Zhou 2001). Second, for some intron-containing genes, splicing appears to be required for efficient mRNA export to the cytoplasm, with a functional coupling of these two processes being mediated by a protein complex deposited on spliced mRNAs (Luo and Reed 1999; Le Hir et al. 2001; Read and Hurt 2002). Among other things, this protects the cell from the accumulation of error-containing transcripts by ensuring that unspliced pre-mRNAs are retained in the nucleus. Third, the definition of the final exon, via the splicing signals at the 3' end of the upstream intron, appears to be essential for efficient polyadenylation of transcripts (Niwa, Rose, and Berget 1990; Niwa, MacDonald, and Berget 1992) and is also involved in transcriptional termination (Dye and Proudfoot 1999; McCracken, Lambermon, and Blencowe 2002). Fourth, the full spectrum of introns contained within a gene may mutually facilitate one another's removal from pre-mRNAs, with the coordinated use of splice sites for exon recognition imposing stabilizing selection for an optimal exon size (Robberson, Cote, and Berget 1990; Nesic and Maquat 1994; Berget 1995; Cooke, Hans, and Alwine 1999). Finally, as now described in detail, introns provide an additional benefit to their eukaryotic hosts by providing coordinates for the identification of premature termination codons (PTCs) contained within aberrant mRNAs.

Nonsense-Mediated Decay as a Facilitator of Intron Proliferation
Premature termination codon–containing mRNAs arise in a variety of ways, including the direct transcription of inherited mutant alleles, transcriptional or splicing errors involving otherwise functional alleles, and (in animals) the stochastic production of somatic recombinants of immune system genes. The transcriptional error rate alone is on the order of $10^{-5}$ per nucleotide (Ninio 1991; Shaw, Bonawitz, and Reines 2002), so with ~5% of random codons denoting stop and with $10^3$ to $10^4$ coding nucleotides comprising a typical gene, at least 0.05% to 0.5% of primary transcripts can be expected to contain a PTC. To a considerable extent, eukaryotes are protected from the accumulation of such transcripts by nonsense-mediated decay (NMD), an mRNA surveillance mechanism that leads to selective degradation of PTC-containing transcripts. Although much remains to be learned about NMD, substantial insight into the underlying molecular mechanisms has emerged for yeasts, Caenorhabditis elegans, and mammals (for reviews, see Hentze and Kulozik 1999; Gonzalez et al. 2001; Lykke-Andersen 2001; Mango 2001; Maquat and Carmichael 2001; Wilusz et al. 2001). It remains to be determined whether any protists are capable of NMD, and NMD is not known to occur in any prokaryote. However, the apparent presence of NMD in plants (Isshiki et al. 2001) suggests that NMD was probably present prior to the divergence of most of the major eukaryotic lineages.
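The 0.05% to 0.5% range quoted above follows from multiplying the per-nucleotide error rate, the coding length, and the fraction of random codons that denote stop. The short Python sketch below reproduces that arithmetic. The function name and the Poisson treatment of "at least one error" are our own illustrative assumptions, not part of the original analysis.

```python
import math


def ptc_fraction(error_rate: float, coding_length: int,
                 stop_fraction: float = 0.05) -> float:
    """Approximate fraction of primary transcripts carrying an error-induced PTC.

    error_rate    -- transcriptional errors per nucleotide (about 1e-5 in the text)
    coding_length -- number of coding nucleotides in the gene
    stop_fraction -- share of random codons that denote stop (about 5% in the text)
    """
    # Expected number of stop-creating errors per transcript.
    expected = error_rate * coding_length * stop_fraction
    # Probability of at least one such error, assuming errors are rare and
    # independent (Poisson approximation; an assumption of this sketch).
    return 1.0 - math.exp(-expected)


if __name__ == "__main__":
    for length in (1_000, 10_000):
        print(length, f"{100 * ptc_fraction(1e-5, length):.3f}%")
```

For rare errors the Poisson correction is negligible, so the result is essentially the simple product used in the text.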

Discriminating PTCs from correct termination codons is the major challenge for a successful NMD pathway, and fungi and animals accomplish this in quite different ways. In mammals, a series of proteins is deposited approximately 20 nucleotides 5' to every exon-exon junction at the time of splicing. These ornamented junctions then serve as markers for the true termination codon, which generally lies further downstream in the mature mRNA than the final exon-exon junction. If during translation a termination codon is detected 50 or more nucleotides upstream of this final marker, the mRNA is targeted for selective degradation (Nagy and Maquat 1998). Although intron-free mammalian genes generally appear to be NMD insensitive (Maquat and Li 2001; Brocke et al. 2002), "failsafe" sequences embedded within exons can sometimes elicit NMD in cases where there is no intron downstream of a PTC (Cheng et al. 1994; Zhang et al. 1998; Rajavel and Neufeld 2001), a situation that has parallels in yeast. Introns are extremely rare in Saccharomyces cerevisiae, which greatly reduces their utility as substrates for transcript marking. In this species, PTC recognition relies entirely on downstream sequence elements (DSEs) within coding DNA, with stop codons less than 200 bp upstream of a DSE being interpreted as premature (Ruiz-Echevarria, González, and Peltz 1998).
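The mammalian rule described above can be written as a simple positional test. The following Python sketch is an idealization under stated assumptions: positions are nucleotide coordinates in the mature mRNA, the distance is measured from the stop codon to the final exon-exon junction itself, and "failsafe" sequences are ignored. The function name and signature are hypothetical.

```python
from typing import Sequence


def triggers_nmd(stop_pos: int, junctions: Sequence[int],
                 min_gap: int = 50) -> bool:
    """Return True if a stop codon would be read as premature.

    stop_pos  -- position of the termination codon in the mature mRNA
    junctions -- positions of the exon-exon junctions in the mature mRNA
    min_gap   -- minimum distance (nt) upstream of the final junction
                 at which a stop codon elicits NMD (50 in the text)
    """
    if not junctions:
        # Intron-free transcripts are treated as NMD insensitive in this sketch.
        return False
    final_junction = max(junctions)
    return final_junction - stop_pos >= min_gap


if __name__ == "__main__":
    print(triggers_nmd(100, [200, 400]))  # stop far upstream of the last junction
    print(triggers_nmd(380, [200, 400]))  # stop within 50 nt of the last junction
```

A normal termination codon, lying downstream of the final junction, gives a negative distance and is therefore never flagged.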

It is an open question whether the mammalian intron-based or the S. cerevisiae exon-based PTC-recognition pathway more closely represents the ancestral mode of NMD. However, several observations suggest that S. cerevisiae is derived with respect to NMD and other aspects of mRNA processing. First, some spliceosomal components common to both animals and the fission yeast Schizosaccharomyces pombe are absent from S. cerevisiae, and in terms of sequence variation, splicing genes in S. pombe tend to be much more similar to those in human than to those in S. cerevisiae (Aravind et al. 2000; Käufer and Potashkin 2000). Second, two proteins deployed in the exon-junction complex (EJC) in animals and also known to be present in S. pombe, Mago and Y14, appear to be absent from S. cerevisiae (Zhao et al. 2000). At the same time, empirical work demonstrates that NMD can operate on intron-free genes in S. pombe (Mendell et al. 2000), and as noted above, the presence of introns is a nonessential element for NMD in a minority of mammalian genes, and this seems also to be true for C. elegans (Pulak and Anderson 1993) and plants (Isshiki et al. 2001). Thus, it remains a formal possibility that two NMD pathways exist within eukaryotic lineages, with mammals coming to rely predominantly on the EJC pathway and S. cerevisiae on the DSE pathway. The mode of PTC detection via the EJC pathway may also vary among taxa. For example, suggestions have been made that NMD in C. elegans requires two marked exon junctions (Mango 2001), and that the first exon junction plays a role in rice (Isshiki et al. 2001), although in neither case have the destabilizing elements yet been identified.

Because NMD frequently relies on exon junctions for orientation in identifying PTCs, and because the phenomenon either predates (or coincides with) the proliferation of introns in plants, fungi, and animals, it is likely that the evolution of NMD played a role in the colonization of introns within eukaryotes. Once a reliable intron-dependent system of NMD was in place, a positive feedback in genomic evolution may have then been initiated—the types of splicing errors that are unique to intron-containing alleles as well as the excess mutation rate for such alleles would have intensified selection for efficient NMD. This, in turn, would have relaxed the selective constraints against the further accumulation of introns and may very well have encouraged their addition and/or movement to sites that maximize the efficiency of PTC detection. The intimate spatial and temporal associations among introns, splicing, and surveillance appear to provide an optimal setting for the coevolution of transcript processing mechanisms. Thus, it is notable that (1) at least one of the elements of the EJC involved in mRNA export also functions to recruit a key factor involved in NMD (Kim, Kataoka, and Dreyfuss 2001; Le Hir et al. 2001; Lykke-Andersen, Shu, and Steitz 2001), (2) at least one of the splicing proteins also plays a role in NMD (Luo et al. 2001; Strasser and Hurt 2001), and (3) several of the proteins involved in conventional translation termination are also involved in NMD (Wang et al. 2001).

The Initial Fixation of Introns Facilitated by NMD
We first consider the probability that a newly arisen intron will become fixed in a population with an established NMD mechanism that employs exon junctions to identify PTCs. Starting with a base population fixed for an intron-free allele $A_o$, the new intron-containing allele ($A_i$) will have initial frequency $1/(2N)$, where $N$ is the size of the population, assumed to be diploid and randomly mating. Although both types of alleles incur coding-region mutations that produce a PTC at rate $\mu_c$ per gene per generation, for intron-containing alleles, a fraction $p$ of such mutations falls in locations of the gene that are subject to NMD (alleles in the class $a_i$), whereas the remaining fraction $(1 - p)$ does not (alleles in the class $a_o$) (fig. 1). Also unique to intron-containing alleles is a pathway to nonfunctionality resulting from mutations occurring in nucleotide sites critical for proper splicing (Lynch 2002). Alleles with defective splice sites in their only intron are not expected to be subject to NMD, so denoting this excess mutation rate as $\mu_i$, the total rate of mutation from the $A_i$ to the $a_o$ allele is $\mu_i + (1 - p)\mu_c$.



 
FIG. 1. Upper panel: Mutational flow from functional to nonfunctional categories of alleles. Shaded regions denote areas containing mutations to PTCs. Lower panel: Genotypic fitnesses, as described in the text

 
The success of an intron-containing allele depends on two selective consequences of its NMD capacity (fig. 1). First, a direct advantage of NMD results from the silencing of transcriptional (and/or posttranscriptional) errors that occur at the locus. Letting $\alpha$ be the advantage that would accrue to an allele that was capable of eliminating all possible PTC-containing transcripts, then $p\alpha$ approximates the expected advantage for an allele with NMD capacity $p$. Second, NMD indirectly influences the extent to which PTC-containing mutant alleles cause a reduction of fitness in the heterozygous state. Letting $\beta$ be the reduction in heterozygote fitness resulting from a PTC-containing allele not subject to NMD, then $k\beta$ is the reduction from an NMD-sensitive allele. In the case of haplo-sufficiency, $k = 0$.

In addition to the molecular-genetic features $\alpha$, $p$, and $\mu_i$, the effective population size $N$ is a key determinant of the probability of fixation of a newly arisen intron under this model, as $N$ defines the degree to which allele-frequency changes are determined by random genetic drift. The parameter $k$ plays a negligible role in the fixation process, because it only influences the fitness of heterozygotes containing rare nonfunctional mutations incurred within the lineage of intron-containing alleles en route to fixation. Denoting the net selective advantage of an intron-containing allele subject to NMD by $s = p\alpha - \mu_i$, the fixation probability is closely approximated by the usual diffusion equation (Crow and Kimura 1970),

$$u_F \simeq \frac{2s}{1 - e^{-4Ns}}. \qquad (1)$$

The validity of this relationship was verified using the stochastic simulation procedures outlined in Lynch (2002).

The scaled probability of fixation ($\Theta_F = 2Nu_F$), which is a simple function of $4Ns$, is approximately equal to 1 when $|4Ns| \ll 1$, approximately equal to $4Ns$ when $4Ns \gg 1$, and asymptotically approaches zero as $4Ns \to -\infty$. If $B$ denotes the rate of birth of new introns (per gene per unit time), then $B\Theta_F$ can be interpreted as the rate of fixation of initial introns in previously intron-free genes. These results imply that if $|s|$ is sufficiently small relative to $1/N$, the rate of fixation of the first introns in genes will correspond to the neutral expectation $B$. If, however, $|s|$ is sufficiently large relative to $1/N$, intron establishment is expected to be negligible if $s$ is negative (the advantages of NMD being outweighed by the increased mutation rate to nulls for an intron-containing allele) and to increase with increasing population size if $s$ is positive.
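The limiting behavior just described can be checked numerically. The Python sketch below assumes the standard diffusion form of the scaled fixation probability, 4Ns / (1 - exp(-4Ns)), which has the three limits stated above; the function names and the overflow guard are our own illustrative choices.

```python
import math


def scaled_fixation(x: float) -> float:
    """Scaled fixation probability (2N times u_F) as a function of x = 4Ns.

    Assumes the diffusion form x / (1 - exp(-x)).
    """
    if abs(x) < 1e-9:
        return 1.0  # neutral limit
    if x < -700.0:
        return 0.0  # avoids floating-point overflow; the true value is negligible
    # -expm1(-x) equals 1 - exp(-x), computed accurately for small x.
    return x / -math.expm1(-x)


def fixation_probability(s: float, N: float) -> float:
    """Fixation probability of a new allele with initial frequency 1/(2N)."""
    return scaled_fixation(4.0 * N * s) / (2.0 * N)


if __name__ == "__main__":
    for x in (-20.0, -1.0, 0.0, 1.0, 20.0):
        print(f"4Ns = {x:6.1f}  scaled fixation = {scaled_fixation(x):.6g}")
```

Multiplying the scaled value by a birth rate of new introns then gives the rate of establishment of first introns discussed in the text.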

Because the location of an intron dictates the selective advantage associated with NMD (through its influence on $p$), a biased spatial distribution for initially colonizing introns is expected under this model. Consider, for example, the situation in mammals, where NMD for a construct for one particular gene (triose phosphate isomerase) appears to be effective only with PTCs lying within a span of nucleotides 50 to 550 bases upstream of the intron (Zhang et al. 1998). Assuming equal efficiency of NMD throughout the entire 500 bp range, and letting $L$ be the number of nucleotides in the coding region and $I$ be the position of the intron, $p = 0$ if $I \leq 50$, $p = (I - 50)/L$ if $50 < I < 550$, and $p = 500/L$ if $I \geq 550$. Using these relationships and equation (1), along with the mutation rate $\mu_i = 10^{-6}$ and the NMD-associated selective advantage $\alpha = 10^{-5}$, the expected spatial distribution of initial colonizing introns is given in figure 2 (upper panel). For populations with sufficiently small effective sizes, the distribution is nearly uniform over the entire length of the gene, because the magnitude of random genetic drift is large relative to $s$. However, as $N$ increases, a strong bias develops toward a 3' location. For this example, the distribution is always flat for introns located beyond nucleotide 550, as all of these enjoy the maximum selective advantage associated with NMD.
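As a rough illustration of how such a distribution can be computed, the Python sketch below combines the piecewise definition of p with the scaled fixation probability and normalizes over all candidate positions. It assumes the diffusion form 4Ns / (1 - exp(-4Ns)) for the scaled fixation probability and an equal birth rate of introns at every position; all function names are hypothetical.

```python
import math


def scaled_fixation(x: float) -> float:
    """Scaled fixation probability as a function of x = 4Ns (diffusion form)."""
    if abs(x) < 1e-9:
        return 1.0
    if x < -700.0:
        return 0.0
    return x / -math.expm1(-x)


def nmd_coverage(I: int, L: int, near: int = 50, far: int = 550) -> float:
    """Fraction p of the coding region surveyed by an intron at position I."""
    if I <= near:
        return 0.0
    if I < far:
        return (I - near) / L
    return (far - near) / L


def first_intron_distribution(L: int, N: float,
                              mu_i: float = 1e-6, alpha: float = 1e-5):
    """Relative probability that the first fixed intron sits at each position 1..L."""
    weights = []
    for I in range(1, L + 1):
        s = nmd_coverage(I, L) * alpha - mu_i  # net selective advantage
        weights.append(scaled_fixation(4.0 * N * s))
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    for N in (1e3, 1e5, 1e6):
        d = first_intron_distribution(1000, N)
        print(f"N = {N:.0e}: ratio of 3'-most to 5'-most density = {d[-1] / d[0]:.2f}")
```

Changing `near` and `far` shows how a shorter or longer surveillance tract alters the predicted 3' bias.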



 
FIG. 2. Upper panel: Probability distribution for the first introns to fix within a coding region containing a total of $L = 1000$ nucleotides, for three population sizes ($N$) for the situation in which $\mu_i = 10^{-6}$ and $\alpha = 10^{-5}$, assuming an idealized vertebrate NMD model described in the text. Middle panel: Bivariate distribution for the first two introns to fix, under the same conditions as in the upper panel, with $N = 10^6$. Contour lines denote increasing densities. Note that intron 1 denotes the position of the 5'-most intron, which is not necessarily the first intron to colonize. Bottom panel: A sampling of 15 endpoints after which three introns have colonized a gene and become fixed in the population. In each case, the 5'-most intron is denoted by a solid circle, the second by an open circle, and the 3'-most by a solid triangle

 
Although these results serve to demonstrate how NMD can provide a selective environment that molds the spatial distribution of introns, it should be kept in mind that the specific pattern driven by this process depends on a number of factors. As noted above, the length of the surveillance tract may differ substantially among species, and possibly also among genes within species. For example, NMD is elicited by PTCs as far as 700 bp upstream of an intron in the human β-globin gene (Neu-Yilik et al. 2001), as far as 1,200 bp upstream in the human HSP70 gene (Maquat and Li 2001), and as far as 3,000 bp upstream in the human BRCA1 gene (Perrin-Vidoz et al. 2002). Moreover, the relationship between NMD efficiency and the distance of a PTC from a marked exon junction may be more gradual than a threshold function. In the case of the BRCA1 gene, for example, NMD efficiency appears to decrease with increasing PTC-EJC distance for the first few hundred nucleotides, followed by a plateau at the level of approximately 50% transcriptional silencing (Perrin-Vidoz et al. 2002), but the data are very noisy, suggesting the potential influence of additional sequence-specific factors. For the T-cell receptor gene in humans, the efficiency of NMD increases with the distance of the PTC from the downstream intron, at least up to 250 bp (Wang et al. 2002). The primary implication of these subtle details is that as the span for effective NMD becomes small relative to $L$, the expected spatial distribution of colonizing introns approaches uniformity.

Secondary Accumulation of Introns
For a species with an EJC-dependent system of NMD, the establishment of the first intron within a gene will modify the selective environment for subsequent intron colonization events because the first intron already covers a fraction of the total potential for NMD. The average selective advantage of a secondarily arising intron is necessarily less than or equal to that of the first, with the magnitude of reduction depending on the spatial configuration of the two introns (fig. 3). Because this same logic applies to all subsequently arising introns, the rate of colonization of introns is expected to be negatively associated with intron number, and once sufficient coverage for NMD has been acquired, all newly arisen introns will be weakly selected against (as a consequence of the enhanced mutation rate to nonfunctional alleles). Ultimately, this negative density dependence should result in a quasi–steady state number of introns per gene, and an overdispersion of intron positions is expected to result from the selective advantage of alleles with introns with minimal overlap in their regions of NMD coverage.



 
FIG. 3. Hypothetical spatial configurations of the NMD-sensitive regions associated with two introns. The uppermost allele contains a single intron whose upstream NMD-sensitive region spans the region between the two vertical dotted lines. Three hypothetical secondary introns are shown as follows: a, The second intron is far enough downstream from the first that there is no overlap in the NMD-sensitive regions, so that both introns have the same selective advantage. b, About 75% of the NMD-sensitive regions of the two introns overlap (dashed line), reducing the region specific to the second intron to the span associated with the solid horizontal line. c, The second intron is far enough upstream of the first that the 5' end of its potential NMD-sensitive span extends beyond the start of the gene

 
As in the case of the first colonizing intron, the bivariate probability distribution for the locations of the first two introns to become fixed within a gene will be a function of the effective population size ($N$), the excess mutation rate of intron-containing alleles ($\mu_i$), the total potential selective advantage of NMD ($\alpha$), the length of the gene ($L$), and the length of the NMD-sensitive region associated with an intron. Using equation (1) to estimate the probability of fixation of first introns as a function of location, and the geometric logic outlined in figure 3 to compute $p$ for the second intron (conditional on the locations of both introns), the joint probability distribution for the spatial configurations of all pairwise combinations of locations can be determined. One such example is shown in the middle panel of figure 2 for a gene containing 999 nucleotides in the coding region, and with $N = 10^6$, $\mu_i = 10^{-6}$, and $\alpha = 10^{-5}$, again for an idealized mammalian-like NMD scanning region. In this case, the bivariate distribution is far from uniform, with the most common pairwise combination having the first (5'-most) intron located at nucleotide position 500 and the second located at position 999. This configuration maximizes the NMD-associated coverage, with the first intron accounting for nucleotide positions 1 to 449 and the second for positions 450 to 949 (the final 50 nucleotides remaining uninfluenced by NMD owing to the 50 nucleotide upstream limit to coverage under this model). For situations in which the first intron is beyond nucleotide 500, the probability density of the second intron increases toward the 3' end of the gene, as this maximizes NMD coverage conditional on the position of the first intron. However, when the first intron is located near the 5' end of the gene, there is a broad uniformly distributed range of positions for the second intron; this is so because all secondary introns that are sufficiently 3' in location enjoy the full selective advantage of a 500-nucleotide NMD-associated span. These kinds of spatial analyses can be extended to increasing numbers of introns. Some random endpoints after the colonization of the first three introns are shown in the bottom panel of figure 2.
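The geometric logic for overlapping NMD-sensitive regions amounts to taking the union of the intervals surveyed by each intron. The Python sketch below is one possible way to compute this under the idealized model in which each intron surveys the span lying 50 to 550 nucleotides upstream of it. The function names and the half-open interval convention are our own assumptions.

```python
from typing import Iterable, Tuple


def covered_span(intron: int, near: int = 50, far: int = 550) -> Tuple[int, int]:
    """Half-open interval (lo, hi] of coding positions surveyed by one intron."""
    return max(0, intron - far), max(0, intron - near)


def total_coverage(introns: Iterable[int], L: int,
                   near: int = 50, far: int = 550) -> float:
    """Fraction of the coding region surveyed by at least one intron."""
    spans = sorted(covered_span(i, near, far) for i in introns)
    covered = 0
    end = 0  # right edge of the region already counted
    for lo, hi in spans:
        lo = max(lo, end)  # discard the part overlapping earlier spans
        if hi > lo:
            covered += hi - lo
            end = hi
    return covered / L


def marginal_coverage(first: int, second: int, L: int) -> float:
    """Extra coverage p contributed by a second intron, given the first."""
    return total_coverage([first, second], L) - total_coverage([first], L)


if __name__ == "__main__":
    L = 999
    print(total_coverage([500, 999], L))     # configuration discussed in the text
    print(marginal_coverage(500, 999, L))    # little overlap with the first intron
    print(marginal_coverage(500, 600, L))    # heavy overlap with the first intron
```

The marginal coverage would play the role of p in the net selective advantage of the second intron, so heavily overlapping configurations gain little.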

Even in the absence of detailed knowledge of the spatial requirements of NMD, these analyses lead to three qualitative predictions. If maximization of the span of protective coverage by NMD is a significant evolutionary force dictating the location of introns, then (1) the number of introns should scale linearly with the length of coding DNA in a gene, (2) the average positions of consecutive introns should be approximately evenly distributed over the length of a gene, and (3) introns should be overdispersed, i.e., exon sizes should be more uniform than expected under random insertion.

The Spatial Distribution of Introns
To gain information on the spatial distribution of introns, we surveyed the sequenced genomes of Homo sapiens, C. elegans, Drosophila melanogaster, and Arabidopsis thaliana. For all four species, there is a striking linear relationship between the average number of introns contained within a gene and the length of coding DNA (fig. 4). For H. sapiens, on average an intron is present for each 125 (SE = 3) nucleotides of coding DNA, whereas the average exon size for C. elegans is 180 (3) nucleotides, that for D. melanogaster is 325 (13) nucleotides, and that for A. thaliana is 101 (3) nucleotides. The average human and C. elegans gene size (in coding nucleotides) at the point of first intron colonization is ~500 bp, whereas that for D. melanogaster and A. thaliana is nearly twofold greater, ~900 bp. (We did not include intron-free genes in these analyses, as there is compelling evidence that large numbers of such annotated open reading frames are actually processed pseudogenes; Chen et al. 2002).



 
FIG. 4. Linear regressions for the mean length of coding DNA on the number of introns per gene. $r^2$ = 0.987 for D. melanogaster (11,493 sequences), 0.997 for C. elegans (17,030 sequences), 0.997 for H. sapiens (5,565 sequences), and 0.988 for A. thaliana (18,625 sequences). Where not visible, standard error bars are narrower than the width of the data points. The data for all of the analyses in this paper were obtained from gene-specific coordinates provided in the ExInt Database (http://intron.bic.nus.edu.sg/exint/exint.html) of Sakharkar et al. (2002), after eliminating potentially redundant entries in the database. Similar results were obtained when we directly downloaded sequences deposited at NCBI

 
If introns randomly colonize genes or if selection favors uniform coverage of a gene, average intron locations should follow a simple linear relationship: for a gene with $n$ introns, the $i$th intron will have expected relative location $i/(n + 1)$, measured along the coding (translated) region of the gene. The data for H. sapiens are in fairly close agreement with this expectation, whereas those for the remaining species exhibit some significant deviations (fig. 5). For both invertebrates, the first few introns tend to be more 5' in location than expected under the null model, although the average positions of introns in locations $\geq 6$ fall close to the null expectations. Data from the fission yeast S. pombe yield similar patterns (see table 2 in Wood et al. 2002): a positive linear relationship between the length of a gene's coding DNA and the mean number of introns implies an average of 124 (40) nucleotides per exon, but the relative positions of introns are shifted in the 5' direction, the more so for the more 5'-located introns. In contrast, the most 5' introns in A. thaliana tend to be biased in the 3' direction, whereas the most 3' introns tend to be biased in the 5' direction, resulting in a situation where the overall population of introns in this species tends to be concentrated toward the centers of genes.
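The null expectation of i/(n + 1) is the mean of the ith order statistic of n uniform random positions, and it can be checked with a small simulation. The Python sketch below is illustrative only; the function names, replicate count, and seed are arbitrary choices of ours.

```python
import random


def expected_positions(n: int):
    """Expected relative locations of n randomly placed introns, 5' to 3'."""
    return [i / (n + 1) for i in range(1, n + 1)]


def simulated_mean_positions(n: int, reps: int = 20000, seed: int = 1):
    """Mean relative locations of the sorted introns over many random genes."""
    rng = random.Random(seed)
    sums = [0.0] * n
    for _ in range(reps):
        # Place n introns uniformly along the coding region, then order them.
        draws = sorted(rng.random() for _ in range(n))
        for k, x in enumerate(draws):
            sums[k] += x
    return [t / reps for t in sums]


if __name__ == "__main__":
    n = 4
    for exp, sim in zip(expected_positions(n), simulated_mean_positions(n)):
        print(f"expected {exp:.3f}  simulated {sim:.3f}")
```

Observed mean positions for genes with a given intron number can then be compared directly against the first list.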



 
FIG. 5. The average relative locations of introns in three metazoans for genes with 2, 4, 6, 8, and 10 introns relative to expectations under a random distribution (dotted lines). Where visible, standard errors are given by cross hatches, but in most cases, these are much narrower than the width of the plotted points

 
Information on the degree of dispersion of introns within genes can be summarized by considering the effective number of exons within a gene, defined by $n_e = 1/\left(\sum_{i=1}^{n} e_i^2\right)$, with $n$ being the number of exons, and $e_i$ being the length of the $i$th exon (relative to the total length of coding DNA, such that the sum of the $e_i$ is equal to 1). The most extreme case of intron overdispersion is the one in which all exons are of uniform length ($1/n$), yielding $n_e = n$. At the opposite extreme, if all introns are clumped at one end of a gene, one exon approaches length 1.0, while all others approach 0.0, yielding $n_e \simeq 1$. For the null model of randomly distributed introns, we obtained the expected values of $n_e$ by computer simulation of the broken-stick model. Introns are overdispersed in all four species (i.e., exon sizes are more uniform than expected under the null model of random insertions), although the deviations for D. melanogaster are relatively small (fig. 6). Moreover, the tight linear relationship between $n_e$ and $n$ implies that the efficiency of intron coverage is nearly independent of the amount of coding DNA/gene.
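A minimal Python sketch of the effective number of exons and of a broken-stick null simulation is given below. It is not the code used for the analyses reported here; the function names, replicate count, and seed are assumptions made for illustration.

```python
import random
from typing import Sequence


def effective_exon_number(exon_lengths: Sequence[float]) -> float:
    """Effective number of exons: 1 / sum of squared relative exon lengths."""
    total = float(sum(exon_lengths))
    return 1.0 / sum((x / total) ** 2 for x in exon_lengths)


def broken_stick_mean_ne(n_exons: int, reps: int = 20000, seed: int = 1) -> float:
    """Mean effective exon number when introns are placed at random."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(reps):
        # n_exons - 1 random breakpoints divide the unit-length gene into exons.
        cuts = sorted(rng.random() for _ in range(n_exons - 1))
        edges = [0.0] + cuts + [1.0]
        lengths = [b - a for a, b in zip(edges, edges[1:])]
        acc += effective_exon_number(lengths)
    return acc / reps


if __name__ == "__main__":
    print(effective_exon_number([100, 100, 100, 100]))  # uniform exons
    print(effective_exon_number([997, 1, 1, 1]))        # introns clumped at one end
    print(broken_stick_mean_ne(5))                      # random-insertion null
```

An observed mean effective number that lies above the broken-stick value for the same exon count indicates overdispersed introns.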



 
FIG. 6. The average effective number of exons for genes with 1 to 10 introns, compared with the expectations under the null model of randomly distributed introns and under the extreme case of completely uniform exon length. Cross hatches denote standard errors

 

    Discussion
 
Although we may never know with certainty the initial series of events that led to the establishment of spliceosomal introns, the roles that introns now play in mRNA processing provide insight into the mechanisms that may be responsible for their differential proliferation and maintenance. A key issue, not previously widely appreciated, may be nonsense-mediated decay. Nonsense-mediated decay protects the genome against the production of truncated proteins resulting from a broad array of errors in replication at the DNA level and transcription at the RNA level. Although such errors are common to all genes, they are especially enriched in intron-containing alleles (as a consequence of the excess mutation rate to null alleles and the elevated rate of splicing errors). Thus, colonization of the eukaryotic genome by spliceosomal introns must have provided strong selective pressure for an efficient mechanism of NMD. However, the eventual deployment of introns themselves as orientation guides for NMD is a remarkable example of natural selection taking advantage of a potentially debilitating situation, since an EJC-based system of NMD protects the organism from the accumulation of PTC-containing transcripts resulting from splicing errors not only caused by introns themselves but also from all other sources. We surmise that once the ancestral eukaryotic genome was sufficiently populated with nuclear introns, the evolution of an intron-based system of NMD paved the way for still further colonization of introns, even providing selective pressure for their approximate locations within host genes as a means for minimizing the consequences of aberrant mRNAs. Under this scenario, runaway proliferation of introns is unlikely because once a gene is sufficiently subdivided by introns, the negative consequences of additional introns will begin to outweigh the NMD-associated advantages, putting a natural cap on further colonization.

Our whole-genome analyses demonstrating the linear increase in the mean number of introns with gene size and the overdispersed distribution of introns strongly implicate the operation of some general form of selection for relatively uniform coverage of the coding regions of genes. However, considerable interspecific differences in the spatial geometry of intron positions also imply phylogenetic variation in the internal selective pressures and/or birth/death processes driving such patterns. Among the four species analyzed herein, humans conform most closely with a model for uniform coding-region coverage by introns. Human exons are quite diminutive, with only a small minority exceeding 300 bp in length (Berget 1995). Taken at face value, the 125-bp average size of human exons might appear to be inconsistent with a model in which NMD is the sole selective force for exon size, because the few studies that have searched for an upper limit to surveillance span have found this to be in excess of 500 bp in mammalian cell lines. Given the very small number of genes that have been examined, however, most of which involve constructs with single introns, it is not yet possible to rule out the existence of significant interlocus variation in the spatial requirements of NMD. Nor can a gradient in NMD efficiency with PTC-EJC distance or an enhanced efficiency of NMD with greater numbers of introns be ruled out.

These caveats aside, it must be acknowledged that additional factors associated with intron processing may contribute to stabilizing selection for small exon size in the human genome. Relatively small long-term effective population sizes enhance the vulnerability of mammalian genomes to the accumulation of introns, perhaps beyond the density necessary for efficient NMD, and also reduce the efficiency of selection against insertions within introns (Lynch 2002). As a consequence, mammalian exons are generally dwarfed by their large (often tens of kb) intervening introns. Such conditions have apparently selected for a mammalian splicing machinery that is dependent on exon (rather than intron) recognition, with significant errors arising when exon length exceeds 300 bp or so (Berget 1995; Sterner, Carlo, and Berget 1996). Thus, even if the efficiency of NMD remains high over a surveillance tract as long as 1,000 bp in mammals, the presence of exons of this length would virtually guarantee a very high incidence of splicing errors. This suggests that selection for exons small enough to minimize the incidence of splicing errors, combined with selection for efficient mRNA surveillance capacity, may be jointly responsible for the exceptionally regular distribution of relatively small mammalian exons.

Unlike the situation in humans, the more 5'-located introns in C. elegans, D. melanogaster, and S. pombe tend to be shifted in the 5' direction relative to expectations under a model of uniform spacing. A number of selective forces might favor such locations. First, NMD will more strongly select for coverage at the 5' end of a gene if truncated transcripts of short to medium length are more harmful than those that are nearly complete. Second, 5'-located introns are likely to be more integrated with other intron-associated activities, such as gene regulation, the facilitation of transcript elongation, and the guidance of mRNA export. Third, the distribution of introns will also depend on the mechanisms of intron gain and loss if these vary over the length of a gene. For example, it has been suggested that reverse transcription followed by homologous recombination may be a common mechanism of intron loss (Fink 1986). If this is the case, however, exons at the 3' ends of genes are expected to be exceptionally long. Although the final exon of a gene does tend to be unusually long (Hawkins 1988), our data show that this is true only if the noncoding portion of the 3'-most exon is included.
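The notion of a 5' shift relative to uniform spacing can be made concrete with a small Python sketch. Under uniform spacing, n introns divide a coding region of length L into n + 1 equal exons, so the i-th intron is expected at position iL/(n + 1). The code below is only an illustration under that assumption: the function names, the scaling of deviations by gene length, and the example gene are hypothetical and are not taken from our analyses.

```python
from typing import List


def uniform_positions(coding_length: int, n_introns: int) -> List[float]:
    """Positions expected if n introns split the coding region into equal exons."""
    return [coding_length * (i + 1) / (n_introns + 1) for i in range(n_introns)]


def mean_relative_shift(coding_length: int, observed: List[int]) -> float:
    """Mean of (observed - expected) / coding_length over all introns in a gene.

    Negative values indicate introns lying 5' of the uniform-spacing
    expectation; positive values indicate a 3' shift.
    """
    if not observed:
        raise ValueError("at least one intron position is required")
    expected = uniform_positions(coding_length, len(observed))
    shifts = [(o - e) / coding_length for o, e in zip(sorted(observed), expected)]
    return sum(shifts) / len(shifts)


# Hypothetical gene: 1,200 bp of coding sequence with introns at 150, 400, and 800 bp.
# The uniform expectation is 300, 600, and 900 bp, so every intron lies 5' of it.
shift = mean_relative_shift(1200, [150, 400, 800])
print(f"mean relative shift = {shift:.3f}")
```

Averaging such a statistic over many genes, or computing it separately for the first, second, and later introns, would distinguish a shift confined to the 5'-most introns from a displacement of all introns.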

The lower number of introns per unit of coding DNA, as well as the higher level of variance in exon size, in invertebrates relative to mammals may be a consequence of weaker stabilizing selection for exon size in the former. Intron lengths in the two invertebrates are much smaller than those in humans, and it has been argued that splicing of genes with small introns relies on intron recognition (Berget 1995), which would reduce the need for a high density of relatively evenly spaced introns per unit of coding DNA. Moreover, if, as suggested above, there are additional selective pressures favoring 5'-located introns, a lower overall abundance of introns in invertebrates would tend to select for more 5' bias than in mammals, where the first introns are already close to the 5' end because of their higher density.

Other intron-dependent aspects of mRNA processing may have influential roles in determining the spatial distribution of introns. The essential role of spliceosomal introns in the production of alternatively spliced gene products, which are elicited by up to 30% of the genes in metazoan species (Graveley 2001), makes them an obvious candidate. Nevertheless, there are still only a few convincing demonstrations of the adaptive significance of alternative splicing, and a large number of the products of this process may just be inevitable consequences of an imperfect splicing system (Levine and Durbin 2001). Moreover, selection for alternative splicing provides no obvious explanation for the substantial differences in average intron density and dispersion that exist among species. Finally, Castillo-Davis et al. (2002) have shown that although there is a dramatic decline in average intron size with increasing level of gene expression in nematodes and mammals, intron number is independent of gene-expression level. The latter observation is consistent with the hypothesis that factors unassociated with gene regulation impose stabilizing selection for intron number. Thus, we suggest that species-specific patterns of intron locations may be driven primarily by a coevolutionary loop involving the physical limitations on both splicing and mRNA surveillance.


    Acknowledgements
 
We thank L. Maquat and A. Richardson for many helpful comments, M. K. Sakharkar and colleagues for making their extensive compendium of intron locations available to the general public, the many individuals responsible for the whole-genome sequences that made our empirical studies possible, and the National Institutes of Health for financial support.


    Footnotes
 
E-mail: mlynch{at}bio.indiana.edu.


    Literature Cited
 

    Aravind, L., H. Watanabe, D. J. Lipman, and E. V. Koonin. 2000. Lineage-specific loss and divergence of functionally linked genes in eukaryotes. Proc. Natl. Acad. Sci. USA 97:11319-11324.

    Archibald, J. M., C. J. O'Kelly, and W. F. Doolittle. 2002. The chaperonin genes of jakobid and jakobid-like flagellates: implications for eukaryotic evolution. Mol. Biol. Evol. 19:422-431.

    Ares, M., Jr., L. Grate, and M. H. Pauling. 1999. A handful of intron-containing genes produces the lion's share of yeast mRNA. RNA 5:1138-1139.

    Berget, S. M. 1995. Exon recognition in vertebrate splicing. J. Biol. Chem. 270:2411-2414.

    Brocke, K. S., G. Neu-Yilik, N. H. Gehring, M. W. Hentze, and A. E. Kulozik. 2002. The human intronless melanocortin 4-receptor gene is NMD insensitive. Hum. Mol. Genet. 11:331-335.

    Castillo-Davis, C. I., S. L. Mekhedov, D. L. Hartl, E. V. Koonin, and F. A. Kondrashov. 2002. Selection for short introns in highly expressed genes. Nat. Genet. 31:415-418.

    Chen, C., A. J. Gentles, J. Jurka, and S. Karlin. 2002. Genes, pseudogenes, and Alu sequence organization across human chromosomes 21 and 22. Proc. Natl. Acad. Sci. USA 99:2930-2935.

    Cheng, J., P. Belgrader, X. Zhou, and L. E. Maquat. 1994. Introns are cis-effectors of the nonsense-codon-mediated reduction in nuclear mRNA abundance. Mol. Cell. Biol. 14:6317-6325.

    Cooke, C., H. Hans, and J. C. Alwine. 1999. Utilization of splicing elements and polyadenylation signal elements in the coupling of polyadenylation and last-intron removal. Mol. Cell. Biol. 19:4971-4979.

    Crow, J. F., and M. Kimura. 1970. An introduction to population genetics theory. Harper and Row, New York.

    Dye, M. J., and N. J. Proudfoot. 1999. Terminal exon definition occurs cotranscriptionally and promotes termination of RNA polymerase II. Mol. Cell 3:371-378.

    Fink, G. R. 1986. Pseudogenes in yeast? Cell 49:5-6.

    Fong, Y. W., and Q. Zhou. 2001. Stimulatory effect of splicing factors on transcriptional elongation. Nature 414:929-933.

    Gonzalez, C. I., A. Bhattacharya, W. Wang, and S. W. Peltz. 2001. Nonsense-mediated mRNA decay in Saccharomyces cerevisiae. Gene 274:15-25.

    Graveley, B. R. 2001. Alternative splicing: increasing diversity in the proteomic world. Trends Genet. 17:100-107.

    Hawkins, J. D. 1988. A survey on intron and exon lengths. Nucleic Acids Res. 16:9893-9908.

    Hentze, M. W., and A. E. Kulozik. 1999. A perfect message: RNA surveillance and nonsense-mediated decay. Cell 96:307-310.

    Isshiki, M., Y. Yamamoto, H. Satoh, and K. Shimamoto. 2001. Nonsense-mediated decay of mutant waxy mRNA in rice. Plant Physiol. 125:1388-1395.

    Käufer, N. F., and J. Potashkin. 2000. Analysis of the splicing machinery in fission yeast: a comparison with budding yeast and mammals. Nucleic Acids Res. 28:3003-3010.

    Kim, V. N., N. Kataoka, and G. Dreyfuss. 2001. Role of the nonsense-mediated decay factor hUpf3 in the splicing-dependent exon-exon junction complex. Science 293:1832-1836.

    Le Hir, H., D. Gatfield, E. Izaurralde, and M. J. Moore. 2001. The exon-exon junction complex provides a binding platform for factors involved in mRNA export and nonsense-mediated mRNA decay. EMBO J. 20:4987-4997.

    Levine, A., and R. Durbin. 2001. A computational scan for U12-dependent introns in the human genome sequence. Nucleic Acids Res. 29:4006-4013.

    Logsdon, J. M., Jr. 1998. The recent origins of spliceosomal introns revisited. Curr. Opin. Genet. Dev. 8:637-648.

    Luo, M. J., and R. Reed. 1999. Splicing is required for rapid and efficient mRNA export in metazoans. Proc. Natl. Acad. Sci. USA 96:14937-14942.

    Luo, M. J., Z. Zhou, K. Magni, C. Christoforides, J. Rappsilber, M. Mann, and R. Reed. 2001. Pre-mRNA splicing and mRNA export linked by direct interactions between UAP56 and Aly. Nature 413:644-647.

    Lykke-Andersen, J. 2001. mRNA quality control: marking the message for life or death. Curr. Biol. 11:R88-R91.

    Lykke-Andersen, J., M. D. Shu, and J. A. Steitz. 2001. Communication of the position of exon-exon junctions to the mRNA surveillance machinery by the protein RNPS1. Science 293:1836-1839.

    Lynch, M. 2002. Intron evolution as a population-genetic process. Proc. Natl. Acad. Sci. USA 99:6118-6123.

    Lynch, M., and A. R. Richardson. 2002. The evolution of spliceosomal introns. Curr. Opin. Genet. Dev. 12:701-710.

    Mango, S. E. 2001. Stop making nonSense: the C. elegans smg genes. Trends Genet. 17:646-653.

    Maniatis, T., and R. Reed. 2002. An extensive network of coupling among gene expression machines. Nature 416:499-506.

    Maquat, L. E., and G. G. Carmichael. 2001. Quality control of mRNA function. Cell 104:173-176.

    Maquat, L. E., and X. Li. 2001. Mammalian heat shock p70 and histone H4 transcripts, which derive from naturally intronless genes, are immune to nonsense-mediated decay. RNA 7:445-456.

    McCracken, S., M. Lambermon, and B. J. Blencowe. 2002. SRm160 splicing coactivator promotes transcript 3'-end cleavage. Mol. Cell. Biol. 22:148-160.

    Mendell, J. T., S. M. Medghalchi, R. G. Lake, E. N. Noensie, and H. C. Dietz. 2000. Novel Upf2p orthologues suggest a functional link between translation initiation and nonsense surveillance complexes. Mol. Cell. Biol. 20:8944-8957.

    Nagy, E., and L. E. Maquat. 1998. A rule for termination-codon position within intron-containing genes: when nonsense affects RNA abundance. Trends Biochem. Sci. 6:198-199.

    Nesic, D., and L. E. Maquat. 1994. Upstream introns influence the efficiency of final intron removal and RNA 3'-end formation. Genes Dev. 8:363-375.

    Neu-Yilik, G., N. H. Gehring, R. Thermann, U. Frede, M. W. Hentze, and A. E. Kulozik. 2001. Splicing and 3' end formation in the definition of nonsense-mediated decay-competent human β-globin mRNPs. EMBO J. 20:532-540.

    Ninio, J. 1991. Transient mutators: a semiquantitative analysis of the influence of translation and transcription errors on mutation rates. Genetics 129:957-962.

    Niwa, M., C. C. MacDonald, and S. M. Berget. 1992. Are vertebrate exons scanned during splice-site selection? Nature 360:277-280.

    Niwa, M., S. D. Rose, and S. M. Berget. 1990. In vitro polyadenylation is stimulated by the presence of an upstream intron. Genes Dev. 4:1552-1559.

    Nixon, J. E., A. Wang, H. G. Morrison, A. G. McArthur, M. L. Sogin, B. J. Loftus, and J. Samuelson. 2002. A spliceosomal intron in Giardia lamblia. Proc. Natl. Acad. Sci. USA 99:3701-3705.

    Perrin-Vidoz, L., O. M. Sinilnikova, D. Stoppa-Lyonnet, G. M. Lenoir, and S. Mazoyer. 2002. The nonsense-mediated mRNA decay pathway triggers degradation of most BRCA1 mRNAs bearing premature termination codons. Hum. Mol. Genet. 11:2805-2814.

    Pulak, R., and P. Anderson. 1993. mRNA surveillance by the Caenorhabditis elegans smg genes. Genes Dev. 7:1885-1897.

    Rajavel, K. S., and E. F. Neufeld. 2001. Nonsense-mediated decay of human HEXA mRNA. Mol. Cell. Biol. 21:5512-5519.

    Reed, R., and E. Hurt. 2002. A conserved mRNA export machinery coupled to pre-mRNA splicing. Cell 108:523-531.

    Robberson, B. L., G. J. Cote, and S. M. Berget. 1990. Exon definition may facilitate splice site selection in RNAs with multiple exons. Mol. Cell. Biol. 10:84-94.

    Ruiz-Echevarria, M. J., C. I. González, and S. W. Peltz. 1998. Identifying the right stop: determining how the surveillance complex recognizes and degrades an aberrant mRNA. EMBO J. 17:575-589.

    Sakharkar, M., F. Passetti, J. E. de Souza, M. Long, T. W. Tan, and S. J. de Souza. 2002. ExInt: an exon/intron database. Nucleic Acids Res. 30:191-194.

    Shaw, R. J., N. D. Bonawitz, and D. Reines. 2002. Use of an in vivo reporter assay to test for transcriptional and translational fidelity in yeast. J. Biol. Chem. 277:24420-24426.

    Sterner, D. A., T. Carlo, and S. M. Berget. 1996. Architectural limits on split genes. Proc. Natl. Acad. Sci. USA 93:15081-15085.

    Strasser, K., and E. Hurt. 2001. Splicing factor Sub2p is required for nuclear mRNA export through its interaction with Yra1p. Nature 413:648-652.

    Wang, J., J. P. Gudikote, O. R. Olivas, and M. F. Wilkinson. 2002. Boundary-independent polar nonsense-mediated decay. EMBO Rep. 3:274-279.

    Wang, W., K. Czaplinski, Y. Rao, and S. W. Peltz. 2001. The role of Upf proteins in modulating the translation read-through of nonsense-containing transcripts. EMBO J. 20:880-890.

    Wilusz, C. J., W. Wang, and S. W. Peltz. 2001. Curbing the nonsense: the activation and regulation of mRNA surveillance. Genes Dev. 15:2781-2785.

    Wood, V., R. Gwilliam, and M.-A. Rajandrean, et al. (120 co-authors). 2002. The genome sequence of Schizosaccharomyces pombe. Nature 415:871-880.

    Zhang, J., X. Sun, Y. Qian, J. P. LaDuca, and L. E. Maquat. 1998. At least one intron is required for the nonsense-mediated decay of triosephosphate isomerase mRNA: a possible link between nuclear splicing and cytoplasmic translation. Mol. Cell. Biol. 18:5272-5283.

    Zhao, X. F., N. J. Nowak, T. B. Shows, and P. D. Aplan. 2000. MAGOH interacts with a novel RNA-binding protein. Genomics 63:145-148.

Accepted for publication November 22, 2002.