Do Introns Favor or Avoid Regions of Amino Acid Conservation?

Toshinori Endo, Alexei Fedorov, Sandro J. de Souza and Walter Gilbert

*The Biological Laboratories, Harvard University;
†Department of Bioinformatics, Medical Research Institute, Tokyo Medical and Dental University;
‡Laboratory of Computational Biology, Ludwig Institute for Cancer Research, São Paulo branch


    Abstract
 
Are intron positions correlated with regions of high amino acid conservation? For a set of ancient conserved proteins, with intronless prokaryotic but intron-containing eukaryotic homologs, multiple sequence alignments identified residues invariant throughout evolution. Intron positions between codons show no preferences. However, introns lying after the first base of a codon prefer conserved regions, markedly in glycines. Because glycines are in excess in conserved regions, this behavior could reflect phase-one introns entering glycine residues randomly in the ancestral sequences. Examination of intron positions within codons of evolutionarily invariable amino acids showed that roughly 50% of these introns are bordered by guanines at both 5'- and 3'-ends, 25% have a G only before the intron, and 5% have a G only after the intron, whereas about 20% are bordered by nonguanine bases.


    Introduction
 
The genes of complex eukaryotes are interrupted by introns. There are two explanations for the origins of these introns: either they were added to previously continuous genes (Logsdon Jr. 1998), as adventitious elements or by splicing between gene duplications (Venkatesh, Ning, and Brenner 1999), or they arose because the gene in question was assembled by exon shuffling (Doolittle 1978; Gilbert 1978). A combination of the two processes is also possible. The distinction between these two views is sharpest for ancient conserved proteins (ACPs) shared by prokaryotes and eukaryotes. For ACPs, an introns-late model requires that all the introns in the eukaryotic branch were added during evolution because the ancestral form of the gene would be the one that appears in the prokaryotes, and there was no exon shuffling in the eukaryotic form. Under an introns-early or mixed model, some or all of the intron positions could have been left over from exon shuffling in the progenote, the introns being lost down the prokaryotic line.

Recent arguments (de Souza et al. 1998; Roy et al. 1999) support this mixed model. In a set of ACPs of known three-dimensional structure, phase-zero intron positions, which lie between the codons, were correlated with the boundaries of certain sizes of modules, compact regions of peptide chain, whereas phase-one and -two introns, which lie after the first and second base of a codon, were not. Such a correlation with three-dimensional structure would not be expected if introns had been added to previously existing genes but would follow if the original genes had been assembled through exon shuffling. This suggests that some of the phase-zero introns were ancient but that most introns of phases one and two and many of phase zero were added to the ACP genes.

One explanation for intron structure correlation in an introns-added-late model is that the process by which introns are added to genes, presumably as transposable elements inserting into a DNA or RNA sequence, might be mutagenic and hence change the amino acid sequence of the protein product. Thus, one might expect to see a preference for introns in regions where the amino acid sequence is not critical and a dearth of introns in regions of high amino acid conservation. Such a propensity might, in turn, lead to a correlation between introns and aspects of the three-dimensional structure of the protein, if it were to be the case that the boundaries of modules corresponded to regions of low amino acid conservation and the cores of modules, to regions of high amino acid conservation.

To test this conjecture we have examined the correlation between intron positions and conserved residues. We use essentially the set of ACPs that de Souza et al. (1998) explored. For each member of that set, we took as a reference sequence the sequence corresponding to the three-dimensional structure. We identified a large number of homologs, both prokaryotic and eukaryotic, for each sequence and, by making multiple alignments, identified amino acids that were identical across all the homologs. These residues, identical in both the prokaryotic and eukaryotic homologs, define the regions of highest conservation. We then assigned intron positions in the eukaryotic homologs to the reference sequences by pairwise alignment of the relevant sequences, using a computer program that searches the database and identifies intron positions. We find that phase-zero intron positions show no preference to be flanked by codons of evolutionarily invariable amino acids. However, phase-one intron positions show a significant preference to be found within codons of evolutionarily invariable amino acids.


    Materials and Methods
 
Ancient Genes Used in the Study
We used a previously published sample of 44 ancient genes (de Souza et al. 1998; Roy et al. 1999; Fedorov et al. 2001). To define ACPs, we used a stricter criterion than that used by de Souza et al. (1998). Our criterion required that there be at least one prokaryotic-eukaryotic homologous pair with an E-value <10^-5 (FASTA) and a global identity greater than 16%. Of the 44 ancient proteins used previously (de Souza et al. 1998), only 41 fulfilled this criterion. We also excluded the ornithine carbamoyltransferase gene (1ort) from the initial sample of 44 ACPs because it is homologous to the aspartate carbamoyltransferase gene (1raa) from the same sample, leaving 40 proteins. Table 1 lists the 40 sequences and the number of homologs that were aligned with Clustal W. For each of the 40 ACPs, we took the amino acid sequence from the Protein Data Bank (PDB) file (the same sequences that were used by de Souza et al. 1998; Roy et al. 1999; Fedorov et al. 2001) as a reference sequence, except for hsp 70, where a translated GenBank sequence, Y00371, was used as the reference sequence because the homolog in PDB was much shorter than the other homologs.


Table 1 List of 40 Ancient Proteins

 
Obtaining Homologous Sequences
To obtain homologous sequences, an amino acid sequence database was created from GenBank release 110. Sequences homologous to the reference sequences were identified by FASTA, version 3.1t13 (Pearson and Lipman 1988), using the criterion of an E-value <10^-5 against this database. The sequences were then further filtered by a criterion of 16% identity for the maximum-alignable region with their reference sequences after the multiple alignment. These criteria were imposed as an iterative process. A more stringent criterion failed to identify many of the ancient homologs, whereas a less stringent criterion produced poor alignments in which no or very few conserved sites could be identified. The E-value cutoff alone is not enough to identify the homologs, because the search frequently identifies a small block with high similarity, which does not necessarily indicate a protein homologous to the query.
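The two-step filter described above can be sketched as follows. This is only an illustration under stated assumptions: the `Hit` record and its field names are hypothetical, the E-value and identity are assumed to have been parsed already from FASTA output and from the multiple alignment, and the identity test is written as "greater than 16%" to match the criterion given for defining ACPs.

```python
from dataclasses import dataclass
from typing import Iterable, List

E_VALUE_CUTOFF = 1e-5   # E-value threshold stated in the text
MIN_IDENTITY = 0.16     # 16% identity threshold stated in the text


@dataclass
class Hit:
    """Hypothetical record for one database hit against a reference sequence."""
    accession: str
    e_value: float    # FASTA E-value for the hit
    identity: float   # fraction identical over the maximum-alignable region


def filter_homologs(hits: Iterable[Hit]) -> List[Hit]:
    """Keep hits that pass both the E-value and the identity criterion."""
    return [
        h for h in hits
        if h.e_value < E_VALUE_CUTOFF and h.identity > MIN_IDENTITY
    ]
```

A hit with a very low E-value but low overall identity (a short, highly similar block) is rejected by the second test, which is the case the identity criterion is meant to catch.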

Multiple Alignment and Identification of Conserved Sites at Three Levels
Multiple alignments of the homologous sequences were performed using Clustal W Version 1.74 (Thompson, Higgins, and Gibson 1994). We marked, on the reference sequences, the sites where all the sequences have the same amino acid residue. Table 1 also gives the number of homologs that were aligned by Clustal W. Sequences that yielded very large gaps or otherwise poor alignments were identified by visual inspection and discarded. All studied multiple alignments are available from our website: http://mcb.harvard.edu/gilbert/invariant_sites.
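Marking invariant sites on a reference sequence amounts to scanning the alignment column by column. The sketch below assumes the alignment has already been read into equal-length strings with `-` as the gap character; the function name and the 0-based indexing are illustrative choices, not a description of the software actually used.

```python
from typing import List, Sequence


def invariant_sites(reference: str, homologs: Sequence[str]) -> List[int]:
    """Return 0-based indices of reference residues that are identical
    in every aligned sequence.

    `reference` and each entry of `homologs` are rows of one multiple
    alignment: equal length, with '-' marking gaps.
    """
    rows = [reference, *homologs]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned sequences must have the same length")

    sites = []
    ref_index = 0  # position in the ungapped reference sequence
    for column in zip(*rows):
        ref_residue = column[0]
        if ref_residue == "-":
            continue  # gap in the reference: no reference residue here
        if all(residue == ref_residue for residue in column):
            sites.append(ref_index)
        ref_index += 1
    return sites
```

For example, with reference row `MG-AW` and homolog rows `MGKAW` and `MG-SW`, the invariant reference positions are 0, 1, and 3.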

Mapping of Intron Positions
An intron-exon database (EID; Saxonov et al. 2000) derived from GenBank release 105 was the source of introns. This relatively old version of GenBank was chosen specifically to be very close to the previously published set of intron positions (de Souza et al. 1998; Roy et al. 1999). To map introns from homologous genes onto a reference sequence, we used our INTRONMAP program (Fedorov et al. 2001). This program precisely maps intron positions, taking into account intron phases.

Intron Phases
Intron phases are defined as the position of an intron within a codon. Phases zero, one, and two are, according to the normal definitions, introns lying before a codon or after the first or second base, respectively, but we have also used the term phase three to describe an intron that lies after a codon.
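The phase definitions reduce to simple arithmetic on the number of coding bases that precede an intron. The following sketch is illustrative only; the function names and the 0-based residue index are assumptions made for the example.

```python
from typing import Optional


def intron_phase(coding_bases_before: int) -> int:
    """Conventional phase: 0 between codons, 1 after the first base,
    2 after the second base of a codon."""
    return coding_bases_before % 3


def phase_relative_to_residue(coding_bases_before: int,
                              residue_index: int) -> Optional[int]:
    """Position of an intron relative to one residue (0-based index).

    Returns 0 (before the codon), 1 or 2 (within it), or 3 (immediately
    after it, the "phase three" usage described above). Returns None if
    the intron does not touch this codon.
    """
    offset = coding_bases_before - 3 * residue_index
    return offset if 0 <= offset <= 3 else None
```

Note that an intron in "phase three" relative to one residue is the same physical intron as a phase-zero intron relative to the next residue; the extra label only records which side of a given invariant residue the intron lies on.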


    Results and Discussion
 
Table 2 shows the number of introns found in the invariant residues in phases zero, one, and two and also lists introns in phase three, after a residue, to display any boundary effects on intron positions before or after an evolutionarily invariable site. Because the invariable amino acid has always been the same, if an intron was there from the beginning of evolution, it would always have been either before, after, or within this particular amino acid. Only 15% of all the residues are invariantly conserved. Table 3 compares the observed and expected totals. There is a large excess of introns in phase one with an extremely high χ², but there is not much deviation from randomness for introns that lie next to codons or in phase two.
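One plausible way to set up such a comparison is a one-degree-of-freedom χ² test in which the expected number of introns in invariant residues is the total number of introns of that phase multiplied by the fraction of invariant residues. The sketch below assumes this form; the exact procedure behind Table 3 may differ, and the function name is hypothetical.

```python
def chi_square_invariant(observed_in_invariant: int,
                         total_introns: int,
                         fraction_invariant: float) -> float:
    """Chi-square statistic (1 degree of freedom) comparing the observed
    number of introns in invariant residues with the number expected if
    introns were placed without regard to conservation."""
    expected_in = total_introns * fraction_invariant
    expected_out = total_introns - expected_in
    observed_out = total_introns - observed_in_invariant
    return ((observed_in_invariant - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out)


# Illustrative numbers only (NOT the values of Table 2 or Table 3):
# 200 introns of one phase, 45 of them in invariant residues, 15% invariant.
example = chi_square_invariant(45, 200, 0.15)
```

With these made-up counts the expected number is 30 and the statistic is about 8.8; an observed count equal to the expectation gives a statistic of zero.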


Table 2 Intron Positions in Identically Conserved Amino Acid Residues

 

Table 3 Deviations of the Intron Positions in Conserved Residues from Expectation

 
Figure 1 shows that the phase-one introns lie primarily in glycine codons. However, glycine itself is in striking excess among the invariant residues. Figure 2 compares the relative excesses and depletions of each of the conserved amino acids. Highly conserved residues, most essential for the protein conformation and functioning, show a preference for glycine, histidine, proline, and tryptophan, amino acids that are likely to play an important role in maintaining structure or function. Glycine is in the greatest excess, reflecting its position as the smallest amino acid playing an essential role in residue packing. For these ACPs, phase-one and phase-two intron positions are likely to be the result of the addition of introns (de Souza et al. 1998; Roy et al. 1999). A phase-one intron in glycine lies between two guanine residues. Because the residue has always been glycine at this position, such an intron would have been added between two guanine residues.



Fig. 1.—Phase-one intron positions in identically conserved amino acid residues. The numbers of phase-one introns are shown for each residue (out of a total of 95)

 


Fig. 2.—The excesses or depletions of each amino acid among the identically conserved residues compared with their background frequency in the reference sequences for the 40 proteins

 
Are all introns added into GG sequences? In these conserved regions, the invariant residues have always been the same since the progenote, so one knows something about the relevant codons. To account for the excesses and depletions shown in figure 2, we normalized the frequency of introns in invariant residues to the frequency of that residue in the reference sequences. Table 4 lists those normalized percentages. Thus, one can estimate the probabilities of the introns entering into a sequence whose codon frequencies correspond to the amino acid frequencies in the reference sequences. Consider phase one. Only the glycine codon has a GG sequence around the phase-one position: 49% of all the phase-one introns could enter glycine codons. How many phase-one introns would have entered such that they had a guanine before them or a guanine after them? Table 4 shows 78% with guanine before and 55% with guanine after. Lastly, 16% of such introns would have entered dinucleotide sequences with no guanines.
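The statement that only glycine places a GG dinucleotide around the phase-one position can be checked directly against the standard genetic code. The sketch below is a minimal illustration: it builds the standard codon table and, for each amino acid, reports whether the bases immediately before and after a phase-one intron (the first and second codon positions) are always G, never G, or undetermined because synonymous codons differ. The function and label names are chosen for the example only.

```python
from itertools import product
from typing import Tuple

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def phase_one_flanks(amino_acid: str) -> Tuple[str, str]:
    """For a phase-one intron inside a codon for `amino_acid`, classify the
    base before the intron (codon position 1) and after it (codon position 2)
    as 'G', 'not G', or 'undetermined' across all synonymous codons."""
    codons = [c for c, aa in CODON_TABLE.items() if aa == amino_acid]
    if not codons:
        raise ValueError(f"unknown amino acid: {amino_acid!r}")

    def status(position: int) -> str:
        is_g = {codon[position] == "G" for codon in codons}
        if is_g == {True}:
            return "G"
        if is_g == {False}:
            return "not G"
        return "undetermined"

    return status(0), status(1)


# Amino acids whose codons always give G on both sides of a phase-one intron.
gg_residues = sorted(aa for aa in set(CODON_TABLE.values()) - {"*"}
                     if phase_one_flanks(aa) == ("G", "G"))
```

Under the standard code, `gg_residues` contains only glycine (`G`), whereas valine, alanine, aspartate, and glutamate have a G only before the intron, and cysteine, tryptophan, and arginine have a G only after it.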


Table 4 Percentages of Introns in the Conserved Residues Normalized by the Overall Amino Acid Residue Frequencies

 
The frequencies for the phase-zero and the phase-two introns provide some confirmation of these estimates. The base immediately following a phase-zero intron is determined; so the table of frequencies implies that about 47% of phase-zero introns have a guanine after them. Phase-two introns follow a guanine about 47% to 72% of the time (because the arginine or serine codons are undetermined). Overall, these guanine compositions after the phase-zero or before the phase-two introns are consistent with the numbers deduced from the phase-one introns, about 75% with G before and about 50% with G after.

We conclude that introns do not avoid conserved regions, but, on the contrary, there is an excess of phase-one introns in the conserved regions, and there is no particular bias for phase-zero or phase-two introns. The excess of phase-one introns in the invariant residues is related to the excess of glycines in the conserved regions. If the introns had integrated randomly into glycines in the overall sequence, there would be an excess of introns almost equal to this excess of introns in the invariant residues. There is no support for the argument that introns favoring unconserved regions are the basis for the correlation with protein structure.

The data demonstrate that added introns show a strong bias for guanine residues, but not a total requirement. Although about 75% enter after a guanine residue and 50% lie between two guanines, still a small fraction, about 20%, does not enter at guanines. Because, in general, the sequences in the exons around the introns do not show such an extreme bias for guanines (Long et al. 1998), the exon sequences must be able to mutate away without affecting the splicing.


    Acknowledgements
 
T.E. was supported by the Japan Society for the Promotion of Science. S.J.d.S. was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo (São Paulo, Brazil) and the PEW-Latin American Fellows Program.


    Footnotes
 
Naruya Saitou, Reviewing Editor

Keywords: intron position exon splicing evolution conserved protein amino acid insertion

Address for correspondence and reprints: Walter Gilbert, The Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, Massachusetts 02138. gilbert@nucleus.harvard.edu.


    References
 

    de Souza S. J., M. Long, R. J. Klein, S. Roy, S. Lin, W. Gilbert, 1998 Toward a resolution of the introns early/late debate: only phase zero introns are correlated with the structure of ancient proteins Proc. Natl. Acad. Sci. USA 95:5094-5095.

    Doolittle R., 1978 Genes in pieces: were they ever together? Nature 272:581-582.

    Fedorov A., X. Cao, S. Saxonov, S. J. de Souza, S. W. Roy, W. Gilbert, 2001 Intron distribution difference for 276 ancient and 131 modern genes suggests the existence of ancient introns Proc. Natl. Acad. Sci. USA 98:13177-13182.

    Gilbert W., 1978 Why genes in pieces? Nature 271:501.

    Logsdon J. M. Jr., 1998 The recent origins of spliceosomal introns revisited Curr. Opin. Genet. Dev. 8:637-648.

    Long M., S. J. de Souza, C. Rosenberg, W. Gilbert, 1998 Relationship between "proto-splice sites" and intron phases: evidence from dicodon analysis Proc. Natl. Acad. Sci. USA 95:219-223.

    Pearson W. R., D. J. Lipman, 1988 Improved tools for biological sequence comparison Proc. Natl. Acad. Sci. USA 85:2444-2448.

    Roy S. W., M. Nosaka, S. J. de Souza, W. Gilbert, 1999 Centripetal modules and ancient introns Gene 238:85-91.

    Saxonov S., I. Daizadeh, A. Fedorov, W. Gilbert, 2000 EID: the exon-intron database—an exhaustive database of protein coding intron-containing genes Nucleic Acids Res. 28:185-190.

    Thompson J., D. Higgins, T. Gibson, 1994 CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice Nucleic Acids Res. 22:4673-4680.

    Venkatesh B., Y. Ning, S. Brenner, 1999 Late changes in spliceosomal introns define clades in vertebrate evolution Proc. Natl. Acad. Sci. USA 96:10267-10271.

Accepted for publication December 4, 2001.