* Departamento de Bioquímica y Fisiología y Genética Molecular-IBGM, Universidad de Valladolid-CSIC, Valladolid, Spain
Departamento de Genética, Universidad de Sevilla, Sevilla, Spain
Abstract
Key Words: lipocalin calycin molecular evolution gene phylogeny exon-intron intron evolution
Introduction
An extensive literature supports the conservation of exon-intron structure in clades of orthologous genes (COGs) (Rokas, Kathirithamby, and Holland 1999; Wada et al. 2002), as well as in families of paralogous genes (Krem and Di Cera 2001) and protein superfamilies (Betts et al. 2001). These findings support the use of gene features as sources for phylogenetic inference (Rokas and Holland 2000; Krem and Di Cera 2001). In a previous report Salier (2000) proposed a scenario for the evolution of the lipocalin gene family by studying the gene structure and chromosomal location of 15 lipocalins. However, this view of lipocalin evolution is very dependent on the concepts of kernel versus outlier subfamilies, and it conflicts with our proposed evolutionary history of lipocalins (Gutiérrez, Ganfornina, and Sánchez 2000). To reassess our hypothesis of lipocalin evolution, we have used gene structure features as characters to build phylogenetic trees through different tree-reconstruction methods.
In this report we present the gene structure data of three insect lipocalins that we have been studying for their role in nervous system development (Ganfornina, Sánchez, and Bastiani 1995; Sánchez, Ganfornina, and Bastiani 1995; Sánchez et al. 2000b). The position and phase of introns in a number of lipocalins are used to reconstruct a phylogeny by maximum parsimony methods and by a distance matrix built with a measure of gene structure similarity (Betts et al. 2001). We also analyze the variability in introns present in the C-termini of lipocalins belonging to different COGs, and compare lipocalin intron arrangement with tertiary structure. We test the conservation of intron arrangement within the calycins, a proposed structural superfamily (reviewed by Flower, North, and Sansom 2000). Finally, we analyze the congruence of phylogenies based on protein sequence and gene structure, and build a consensus tree to refine our hypothesis on lipocalin evolution.
Materials and Methods
Sequence Searches and Alignments
We searched for lipocalin genes whose intron-exon structure has been confirmed by a known mRNA sequence. No deduced intron-exon arrangement was included in the analysis, to avoid "noise" produced by poorly predicted splice sites and to discard pseudogenes. Using the seeding process and selection criteria previously described (Ganfornina et al. 2000; Gutiérrez, Ganfornina, and Sánchez 2000), we searched for lipocalin cDNA and EST sequences with the Blast program (Altschul et al. 1990) in the GenBank database as available on December 13, 2001. Thirty-seven of the sequences retrieved contained the complete CDS of a lipocalin and had the corresponding genomic sequence available in the databases. These genes are shown in table 1. We evaluated the presence, location, and phase of introns for these genes, and made a selection (asterisks in table 1) based on two criteria: (1) being the representative of a lipocalin COG (to avoid sampling bias), and (2) showing an intron pattern unique in the family. The selection thus accounts for the overall gene structure variation present in the lipocalin family.
Phylogenetic Analyses
Phylogenetic analyses based on protein sequences were carried out using the maximum likelihood method with the MOLPHY 2.3 software (Adachi and Hasegawa 1996) as previously reported (Ganfornina et al. 2000). Bootstrap support for tree branches was estimated using the resampling log likelihood method (Hasegawa and Kishino 1994) to calculate local bootstrap proportions (LBP).
We used the intron positions of 23 representative lipocalins as phylogenetic characters. We built three input matrices based on three intron character states: (1) the presence/absence of a given intron, (2) the intron phase, and (3) the intron position in the alignment. Two procedures were carried out. The first was a maximum parsimony analysis using the intron presence matrix as input. Characters were considered as unordered. Heuristic tree searches used the tree bisection-reconnection (TBR) method of PAUP* (Swofford 1998). A majority rule consensus tree was constructed from the most parsimonious trees found in the analysis. The second procedure started with the construction of a distance matrix based on a measure of gene structure similarity (Betts et al. 2001) that uses the presence, location, and phase matrices described above to estimate the exon-intron similarity between two of the aligned proteins.
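The three character matrices can be pictured with a small sketch. The Python below is illustrative only: the gene names, intron states, and the `structure_distance` function are hypothetical, and the distance is a plain mismatch fraction, not the gene structure similarity measure of Betts et al. (2001).

```python
from typing import Dict, List, Optional, Tuple

# (alignment position, phase) of an intron, or None when the intron is absent.
Intron = Optional[Tuple[int, int]]

INTRONS: List[str] = ["A", "B", "C", "D"]

# Invented placeholder data; these are NOT the states reported for any lipocalin.
GENES: Dict[str, Dict[str, Intron]] = {
    "geneX": {"A": (40, 0), "B": None, "C": (120, 1), "D": (150, 2)},
    "geneY": {"A": (40, 0), "B": (85, 2), "C": (121, 1), "D": None},
}


def presence_row(gene: str) -> List[int]:
    """Row of the presence/absence matrix (1 = intron present)."""
    return [0 if GENES[gene][name] is None else 1 for name in INTRONS]


def position_row(gene: str) -> List[Optional[int]]:
    """Row of the position matrix (alignment column, None if absent)."""
    return [None if GENES[gene][name] is None else GENES[gene][name][0]
            for name in INTRONS]


def phase_row(gene: str) -> List[Optional[int]]:
    """Row of the phase matrix (0, 1, or 2; None if absent)."""
    return [None if GENES[gene][name] is None else GENES[gene][name][1]
            for name in INTRONS]


def structure_distance(g1: str, g2: str, max_shift: int = 0) -> float:
    """Fraction of introns, among those present in either gene, that differ.

    Two introns match when both are present, share the same phase, and lie
    within max_shift alignment columns of each other (assumed criterion).
    """
    compared = 0
    mismatches = 0
    for name in INTRONS:
        a, b = GENES[g1][name], GENES[g2][name]
        if a is None and b is None:
            continue  # uninformative for this pair
        compared += 1
        same = (a is not None and b is not None
                and a[1] == b[1] and abs(a[0] - b[0]) <= max_shift)
        if not same:
            mismatches += 1
    return mismatches / compared if compared else 0.0
```

A pairwise table of such values over all selected genes would play the role of the distance matrix; the `max_shift` tolerance is one hypothetical way to allow for the slight positional variation of some introns.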
Results and Discussion
Before the present analysis, sampling of metazoan lipocalin genes was strongly biased toward the chordate phylum. Only one gene from arthropods was reported (Li and Riddiford 1992). Therefore, we set out to study the gene structure of other known arthropodan lipocalins.
Exon-Intron Arrangement of the Lazarillo Genes in Schistocerca and Drosophila
The gene structure of Lazarillo, a lipocalin found in the grasshopper Schistocerca americana (reviewed by Sánchez, Ganfornina, and Bastiani 2000a), would be of great value in providing insight into lipocalin evolution because of the ancestral position of orthopteroids within the arthropod lineage (Caterino, Cho, and Sperling 2000).
The ORF of lipocalin genes is interrupted by at most six introns. These introns (named A-F) are represented in a model lipocalin depicted in figure 2A. The predicted location of the six introns in the grasshopper Lazarillo gene was deduced by locating intron positions in a multiple protein sequence alignment of Lazarillo with other lipocalins of known gene structure. We then designed Lazarillo primers that would PCR amplify specific introns from genomic DNA. The primer sets are shown numbered under the lipocalin model in figure 2A. The PCR amplifications using grasshopper DNA appear in the ethidium bromide gel shown in figure 2B. Each numbered lane refers to the set of primers used. These amplifications revealed the presence of three introns in the CDS of the Lazarillo gene (fig. 2C) that corresponded to introns A, C, and D of the model lipocalin gene. Intron size was estimated by band size for introns A and C, and by complete sequencing for the short intron D. Sequencing the PCR products defined the exact location and phase of the Lazarillo introns (see table 2). These intronic sequences are deposited in GenBank (Accession Numbers: AY197702, AY197703, AY197704, and AY197705).
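Intron phase follows directly from the number of coding nucleotides that precede the intron. A minimal Python sketch of that arithmetic is given here; the function names and the values used in the usage note are illustrative and are not taken from table 2.

```python
def intron_phase(cds_nt_before_intron: int) -> int:
    """Return the phase (0, 1, or 2) of an intron.

    Phase 0: the intron lies between two codons.
    Phase 1: it splits a codon after its first nucleotide.
    Phase 2: it splits a codon after its second nucleotide.
    """
    if cds_nt_before_intron < 0:
        raise ValueError("nucleotide count must be non-negative")
    return cds_nt_before_intron % 3


def codon_index(cds_nt_before_intron: int) -> int:
    """Return the 1-based index of the codon split by the intron,
    or of the codon that immediately follows it when the phase is 0."""
    if cds_nt_before_intron < 0:
        raise ValueError("nucleotide count must be non-negative")
    return cds_nt_before_intron // 3 + 1
```

For example, an intron preceded by 7 coding nucleotides is in phase 1 and splits the third codon, so it would be placed at the third residue of the protein alignment.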
Phylogenetic Analysis of Lipocalins Based on Gene Structure
In addition to the already characterized lipocalin genes (Salier 2000) and the arthropodan Lazarillo genes reported above, we searched for other lipocalin genes whose intron-exon structure was confirmed by the knowledge of their mRNA sequence. All the lipocalin genes found are listed in table 1, with genes selected for the analysis marked with asterisks (23 representatives; see Materials and Methods).
We found a protoctist gene (from Dictyostelium discoideum, EST #C24642), a plant gene (from Arabidopsis thaliana, mRNA Acc. Number AY062789), and another Drosophila gene (Karl, EST # NM_132520). The Dictyostelium and Arabidopsis genes are of singular value for our evolutionary analysis because they are the only representatives of lipocalins from unicellular eukaryotes and plants.
Alignment of Lipocalin Gene Structures
The intronic architecture of the selected lipocalin genes was mapped onto a multiple protein sequence alignment in the context of the overall secondary structure of an archetypal lipocalin (fig. 3). Notably, there is a strong conservation of the location and phase of introns, a finding also reported in other gene structure analyses (Igarashi et al. 1992; Holzfeind and Redl 1994; Toh et al. 1996; Lindqvist et al. 1999; Salier 2000). This conservation is evident among COG members, but also among paralogous lipocalins. Some intron positions and phases are very well conserved (e.g., intron A), while others show slight variations (e.g., B and C). Some introns are present in most lipocalins (e.g., introns A and C) while others are present only in a subset of them (e.g., introns D, E, and F).
Taking into account the presence, position and phase of introns, we performed both maximum parsimony and distance-based phylogenetic reconstructions. The resulting trees (fig. 4) are rooted with the Dictyostelium lipocalin for its presence in an ancient organismal lineage (whose origin predates the arrival of metazoans), and because of the ancestral character of this protein sequence as judged by its similarity to bacterial lipocalins.
Distances Phylogeny
A distance-based phylogenetic reconstruction was carried out by computing a distance matrix with gene structure data (intron presence, location, and phase). These data were combined to produce a quantitative measure of gene structure similarity (Betts et al. 2001; see Materials and Methods). The Neighbor-Joining (NJ) tree (Saitou and Nei 1987) rooted with the Dictyostelium lipocalin is shown in figure 4B. Like the parsimony tree, the NJ tree groups most arthropodan lipocalins monophyletically with ApoD, and segregates the Drosophila NLaz and RBP as genes with unique exon-intron structures. The Arabidopsis and Dictyostelium lipocalins remain at the base of the tree, and the lipocalins belonging to clades IV-XIII form a monophyletic group, also related to the 6-intron C8GC, a1mg and ERBP. Despite displaying short branch lengths, this tree also establishes relationships among different lipocalin COGs, as can be seen in the cladogram shown in figure 4B.
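For readers unfamiliar with the NJ procedure, the sketch below shows its core loop in Python: pick the pair of nodes minimizing the Q criterion, compute the two branch lengths, replace the pair with a new internal node, and repeat. It is a generic textbook-style implementation, not the software used in this study, and the five-taxon matrix `TOY` is an invented example rather than lipocalin data.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Matrix = Dict[str, Dict[str, float]]
Join = Tuple[str, str, float, float]


def neighbor_joining(dist: Matrix) -> Tuple[List[Join], Tuple[str, str, float]]:
    """Run NJ on a symmetric distance matrix given without its diagonal.

    Returns the joins (node_i, node_j, branch_i, branch_j) in the order they
    were made, plus the final edge linking the last two remaining nodes.
    """
    d: Matrix = {a: dict(row) for a, row in dist.items()}  # work on a copy
    joins: List[Join] = []
    while len(d) > 2:
        n = len(d)
        total = {a: sum(row.values()) for a, row in d.items()}
        # Pair minimizing Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j).
        i, j = min(
            combinations(sorted(d), 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - total[p[0]] - total[p[1]],
        )
        branch_i = 0.5 * d[i][j] + (total[i] - total[j]) / (2 * (n - 2))
        branch_j = d[i][j] - branch_i
        new = f"({i},{j})"
        # Distances from the new internal node to every remaining node.
        new_row = {k: 0.5 * (d[i][k] + d[j][k] - d[i][j])
                   for k in d if k not in (i, j)}
        del d[i], d[j]
        for k, row in d.items():
            del row[i], row[j]
            row[new] = new_row[k]
        d[new] = new_row
        joins.append((i, j, branch_i, branch_j))
    a, b = sorted(d)
    return joins, (a, b, d[a][b])


# Invented toy distances between five hypothetical taxa.
TOY: Matrix = {
    "a": {"b": 5, "c": 9, "d": 9, "e": 8},
    "b": {"a": 5, "c": 10, "d": 10, "e": 9},
    "c": {"a": 9, "b": 10, "d": 8, "e": 7},
    "d": {"a": 9, "b": 10, "c": 8, "e": 3},
    "e": {"a": 8, "b": 9, "c": 7, "d": 3},
}
```

Calling `neighbor_joining(TOY)` returns the sequence of joins with their branch lengths. The resulting tree is unrooted; rooting with an outgroup, as done here with the Dictyostelium lipocalin, is a separate step.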
Gene Structure versus Protein Sequence Phylogenies
The gene structure-inferred view of lipocalin evolution shares basic topological features with the protein sequence-based phylogeny (see Gutiérrez, Ganfornina, and Sánchez 2000). Although in principle there are no reasons to expect congruence between these two trees, it is clear that in both phylogenetic reconstructions the arthropodan lipocalins are related to ApoDs, and they appear related to protoctist and plant lipocalins; RBPs form a separate group, related to some insect lipocalins; and the rest of lipocalins form a well supported monophyletic group. To further test this, we built a ML tree using the protein sequence alignment from which the gene structure matrices were derived, and rooted this tree with the Dictyostelium lipocalin (fig. 5A). We used the program RadCon (Thorley and Page 2000) to evaluate the congruence of the protein sequence ML and the gene structure NJ trees. Both source trees are well resolved: their cladistic information content, a normalized measure of how much a tree reduces uncertainty regarding phylogenetic relationships (Thorley, Wilkinson, and Charleston 1998), is 0.98 for the gene structure NJ tree, and 1.00 for the protein sequence ML tree.
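The idea behind the cladistic information content can be sketched numerically. The Python below assumes rooted trees and the formulation CIC = -log2(n_R / n_T), where n_T is the number of possible binary trees for the taxon set and n_R the number of binary trees compatible with the tree being scored, normalized by log2(n_T). Whether RadCon applies exactly this counting and normalization is an assumption, and the toy trees in the usage note are hypothetical.

```python
import math
from typing import Union

# A tree is a leaf name or a tuple of subtrees (more than two = polytomy).
Tree = Union[str, tuple]


def rooted_binary_trees(k: int) -> int:
    """Number of rooted binary trees on k leaves: (2k - 3)!!, 1 for k <= 2."""
    result = 1
    for m in range(3, 2 * k - 2, 2):
        result *= m
    return result


def count_leaves(tree: Tree) -> int:
    return 1 if isinstance(tree, str) else sum(count_leaves(c) for c in tree)


def count_resolutions(tree: Tree) -> int:
    """Number of fully resolved rooted trees compatible with this tree."""
    if isinstance(tree, str):
        return 1
    total = rooted_binary_trees(len(tree))  # ways to resolve this node
    for child in tree:
        total *= count_resolutions(child)
    return total


def normalized_cic(tree: Tree) -> float:
    """1.0 for a fully resolved tree, 0.0 for a completely unresolved one."""
    n_total = rooted_binary_trees(count_leaves(tree))
    if n_total == 1:
        return 1.0  # with one or two leaves there is nothing to resolve
    n_resolved = count_resolutions(tree)
    return 1.0 - math.log2(n_resolved) / math.log2(n_total)
```

Under these assumptions, `normalized_cic((("a", "b"), ("c", ("d", "e"))))` is 1.0 for the fully resolved toy tree, and collapsing a node, as in `(("a", "b"), ("c", "d", "e"))`, lowers the value because three binary resolutions remain possible.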
Thus, two sets of independent characters have produced the same phylogenetic relationships between extant lipocalins.
Phylogenetic Distribution of Intron Numbers Within the Lipocalin Family
Another finding revealed by the intron arrangement phylogeny is that lipocalins that have originated more recently contain more introns in their CDS. In figure 5C we mapped the number of exons onto an updated version of the ML-based lipocalin protein phylogeny (see Gutiérrez, Ganfornina, and Sánchez [2000] for clade ascription). The ancient unicellular eukaryotic and plant lipocalins are encoded by 2 exons; the arthropodan lipocalins by 4-5 exons; and the chordate lipocalins by 4-7 exons. Introns E and F are absent in nonchordate lipocalins, whereas introns A-D show much wider phylogenetic distributions.
A first look at these data might suggest an evolutionary trend to gain introns. In this hypothesis, the origin of introns A and D could be placed early in eukaryotic evolution, introns B and C originated at the base of the metazoan lineage, and introns E and F appeared later during early chordate radiation. The acquisition of introns was accompanied by diverse intron losses in different branches, giving rise to the pattern observed today.
However, we have to be critical when interpreting these observations in the context of the origin and evolution of lipocalin introns. First, the current insufficient sampling of lipocalins outside the metazoan kingdom generates uncertainty about the very assumption of homology of introns A and D in Dictyostelium and Arabidopsis, respectively. Second, the set of metazoan lipocalins available encompasses only two phyla within the kingdom; any proposal about which set of introns was present in the common ancestor of all metazoans awaits confirmation from other phyla. A scenario with a set of four ancient introns and subsequent losses in different lineages (Fedorov et al. 2001; Roy et al. 2002) would be as probable as a scenario with fewer or no ancient introns and a prevalence of intron gain at preferred "hot spots" (or proto-splice sites; Dibb and Newman 1989; Logsdon 1998).
Nevertheless, the extensive sampling of lipocalins in the chordate phylum allows us to make a stronger case for the acquisition of introns E and F during early chordate radiation. Both introns are absent in all arthropod lipocalins and in ApoD, the lipocalin COG that branches off at the base of the chordate lipocalin subtree in our two independent phylogenetic reconstructions (figs. 4B and 5C). Therefore, intron gain within the chordate lineage is the most parsimonious explanation for the current distribution of introns E and F.
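The parsimony argument can be illustrated with a small Fitch-style count of state changes for one binary character (intron present = 1, absent = 0) on a rooted binary tree. This is a sketch under stated assumptions: gains and losses are given equal cost, and the tree shape and taxon names are hypothetical placeholders, not the topology of figs. 4B or 5C.

```python
from typing import Dict, Set, Tuple, Union

# A rooted binary tree: a leaf name or a pair of subtrees.
Tree = Union[str, tuple]


def fitch(tree: Tree, states: Dict[str, int]) -> Tuple[Set[int], int]:
    """Return (candidate states at this node, minimum number of changes)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical topology: an outgroup, an arthropod, a basal chordate lineage
# lacking the intron, and two later chordate lineages that carry it.
TOY_TREE: Tree = ("outgroup", ("arthropod", ("basal_chordate",
                                             ("chordate_1", "chordate_2"))))
TOY_STATES = {"outgroup": 0, "arthropod": 0, "basal_chordate": 0,
              "chordate_1": 1, "chordate_2": 1}
```

On this toy tree, `fitch(TOY_TREE, TOY_STATES)` reconstructs the intron as absent at the root with a single change, which corresponds to one gain on the branch leading to the two later chordate lineages. Explaining the same distribution by an ancient intron would require three independent losses.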
In summary, although many questions about the origin of lipocalin introns and their subsequent evolution remain unanswered, a combination of ancient and recent introns is the most plausible scenario. Our results show that, independent of their origin, the variations in gene structure can be used to reconstruct the history of descent of lipocalin genes.
N-termini Conservation versus C-Termini Variability?
It is remarkable that the introns specific to chordate lipocalins are located in the C-termini of the proteins, whereas introns in their N-terminal region are the most conserved in the family (see fig. 3). This polarity, also noticed by other researchers (Salier 2000), is related neither to a particular distribution of lipocalin lengths nor to a C-terminal-specific protein sequence variability. Rather, we propose it might be related to a propensity for intron gain/loss in this gene region. The analysis of the 3' region of lipocalins bearing 5-6 introns reveals that several lipocalins of particular chordate lineages (e.g., PGDS, VEG, NGAL; data not shown) show introns in the 3'-UTR that are located 0-7 nucleotides away from the stop codon. These introns would be equivalent to intron F if they happened to be in the CDS. Any form of intron sliding (Stoltzfus et al. 1997), or any frameshifting mutation that moves the stop codon in the 3' direction, could include/exclude a given intron in the gene CDS, generating an apparent intron gain/loss. Moreover, a puzzling case of C-terminus variability comes from the comparison of mouse and rat ERBP (fig. 6A). Intron F of mouse ERBP is located 9 nucleotides away from the stop codon. The rat ERBP gene has an insertion that accommodates a short exon and another intron (alternatively, the mouse ERBP could have experienced an equivalent deletion). Were it not for the existence of an in-frame stop codon in the short exon present in the rat ERBP gene, we would have a unique lipocalin with 7 introns.
Gene Structure and the Three-Dimensional Structure of Lipocalins and Calycins
We mapped the location of exon boundaries in the tertiary structure of lipocalins that belong to different phylogenetic clades (IcyA, RBP, BLB, NGAL, MUP, and ERBP). Most lipocalin introns are located in the boundaries of β strands (see fig. 3). In spite of a certain variability in number and position, introns A-D seem to demarcate the lipocalin β barrel, whereas introns E and F are present in the C-terminal flexible region.
A way of testing a relationship between lipocalin intron-exon boundaries and tertiary structure would be to analyze the gene structure of proteins with a tertiary structure like that of lipocalins. A similar structure and a marginal sequence similarity have been used to propose a structural superfamily, the calycins, that relates lipocalins to proteins such as FABP, CRBP, avidin, and a group of protease inhibitors (Flower, North, and Sansom 2000). We find no gene structure similarity after aligning representatives of these proteins with the lipocalins and comparing intron positions (fig. 6B). This finding suggests that (1) we do not have compelling evidence for a relationship between intron-exon arrangement and the tertiary structure of these β-barrel-based proteins; (2) the evolutionary relationship of lipocalins with the other proposed calycins, already questioned after analyzing their protein sequence (Ganfornina et al. 2000), remains to be demonstrated; and (3) the homology of introns A-F in lipocalins, the foundation for the phylogenetic inferences that we present in this work, is a reasonable assumption: the pattern and properties of intron-exon boundaries are good markers of the lipocalin history of descent.
Concluding Remarks: Evolutionary Hypothesis for the Lipocalin Gene Family
In conclusion, gene structure is well preserved among lipocalins, and our results validate its use for the reconstruction of lipocalin evolution. The congruence of phylogenetic trees built from two independent sets of data (protein sequence and gene structure) increases the verisimilitude of both reconstructions of lipocalin history. Furthermore, in the future we can use gene structure data to assay the lipocalin nature of novel proteins whose amino acid sequence and/or protein structure show similarity to lipocalins.
Our results give support to the following hypothesis about the evolutionary history of lipocalins: Bacterial lipocalins were inherited by unicellular eukaryotes and passed on to both plants and metazoans. The primitive metazoans spread a low number of ancient lipocalins into some of their successors, the arthropods and chordates, although these proteins might have been unexploited and subsequently lost in other phyla. The primordial arthropod and chordate lipocalins were likely similar to the Lazarillo and ApoD lipocalins now present in these phyla. Alongside the chordate radiation, the ApoD-like ancestral lipocalin suffered duplications. On the one hand, it gave rise to the ancestor of RBPs, and on the other hand, to one or more ancestors of all other paralogous groups of lipocalins that diverged into the current diverse catalog of chordate lipocalins.
Acknowledgements
Footnotes
E-mail: opabinia{at}ibgm.uva.es.
Literature Cited
Adachi, J., and M. Hasegawa. 1996. MOLPHY version 2.3: programs in molecular phylogenetics based on maximum likelihood. The Institute of Statistical Mathematics, Tokyo.
Adams, E. N. 1972. Consensus techniques and the comparison of taxonomic trees. Syst. Zool. 21:390-397.
Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.
Betts, M. J., R. Guigo, P. Agarwal, and R. B. Russell. 2001. Exon structure conservation despite low sequence similarity: a relic of dramatic events in evolution? EMBO J. 20:5354-5360.
Bishop, R. E. 2000. The bacterial lipocalins. Biochim. Biophys. Acta 1482:73-83.
Caterino, M. S., S. Cho, and F. A. Sperling. 2000. The current state of insect molecular systematics: a thriving tower of Babel. Annu. Rev. Entomol. 45:1-54.
Dibb, N. J., and A. J. Newman. 1989. Evidence that introns arose at proto-splice sites. EMBO J. 8:2015-2021.
Estabrook, G. F. 1992. Evaluating undirected positional congruence of individual taxa between two estimates of the phylogenetic tree for a group of taxa. Syst. Biol. 41:172-177.
Fedorov, A., X. Cao, S. Saxonov, S. J. de Souza, S. W. Roy, and W. Gilbert. 2001. Intron distribution difference for 276 ancient and 131 modern genes suggests the existence of ancient introns. Proc. Natl. Acad. Sci. USA 98:13177-13182.
Felsenstein, J. 1993. PHYLIP (phylogeny inference package). Version 3.5c. Distributed by the author, Department of Genetics, University of Washington, Seattle.
Flower, D. R., A. C. T. North, and T. K. Attwood. 1993. Structure and sequence relationships in the lipocalins and related proteins. Protein Sci. 2:753-761.
Flower, D. R., A. C. North, and C. E. Sansom. 2000. The lipocalin protein family: structural and sequence overview. Biochim. Biophys. Acta 1482:9-24.
Ganfornina, M. D., G. Gutiérrez, M. J. Bastiani, and D. Sánchez. 2000. A phylogenetic analysis of the lipocalin protein family. Mol. Biol. Evol. 17:114-126.
Ganfornina, M. D., D. Sánchez, and M. J. Bastiani. 1995. Lazarillo, a new GPI-linked surface lipocalin, is restricted to a subset of neurons in the grasshopper embryo. Development 121:123-134.
Gutiérrez, G., M. D. Ganfornina, and D. Sánchez. 2000. Evolution of the lipocalin family as inferred from a protein sequence phylogeny. Biochim. Biophys. Acta 1482:35-45.
Hasegawa, M., and H. Kishino. 1994. Accuracies of the simple methods for estimating the bootstrap probability of a maximum likelihood tree. Mol. Biol. Evol. 11:142-145.
Holzfeind, P., and B. Redl. 1994. Structural organization of the gene encoding the human lipocalin tear prealbumin and synthesis of the recombinant protein in Escherichia coli. Gene 139:177-183.
Igarashi, M., A. Nagata, H. Toh, Y. Urade, and O. Hayaishi. 1992. Structural organization of the gene for prostaglandin D synthase in the rat brain. Proc. Natl. Acad. Sci. USA 89:5376-5380.
Krem, M. M., and E. Di Cera. 2001. Molecular markers of serine protease evolution. EMBO J. 20:3036-3045.
Li, W., and L. M. Riddiford. 1992. Two distinct genes encode two major isoelectric forms of insecticyanin in the tobacco hornworm, Manduca sexta. Eur. J. Biochem. 205:491-499.
Lindqvist, A., P. Rouet, J.-P. Salier, and B. Akerstrom. 1999. The alpha1-microglobulin/bikunin gene: characterization in mouse and evolution. Gene 234:329-336.
Logsdon, J. M., Jr. 1998. The recent origins of spliceosomal introns revisited. Curr. Opin. Genet. Dev. 8:637-648.
Ohno, S. 1999. Gene duplication and the uniqueness of vertebrate genomes circa 1970-1999. Semin. Cell Dev. Biol. 10:517-522.
Rokas, A., and P. W. Holland. 2000. Rare genomic changes as a tool for phylogenetics. Trends Ecol. Evol. 15:454-459.
Rokas, A., J. Kathirithamby, and P. W. Holland. 1999. Intron insertion as a phylogenetic character: the engrailed homeobox of Strepsiptera does not indicate affinity with Diptera. Insect Mol. Biol. 8:527-530.
Roy, S. W., A. Fedorov, and W. Gilbert. 2002. The signal of ancient introns is obscured by intron density and homolog number. Proc. Natl. Acad. Sci. USA 99:15513-15517.
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.
Salier, J.-P. 2000. Chromosomal location, exon/intron organization and evolution of lipocalin genes. Biochim. Biophys. Acta 1482:25-34.
Sánchez, D., M. D. Ganfornina, and M. J. Bastiani. 1995. Developmental expression of the lipocalin Lazarillo and its role in axonal pathfinding in the grasshopper embryo. Development 121:135-147.
Sánchez, D., M. D. Ganfornina, and M. J. Bastiani. 2000a. Lazarillo, a neuronal lipocalin in grasshoppers with a role in axon guidance. Biochim. Biophys. Acta 1482:102-109.
Sánchez, D., M. D. Ganfornina, S. Torres-Schumann, S. D. Speese, J. M. Lora, and M. J. Bastiani. 2000b. Characterization of two novel lipocalins expressed in the Drosophila embryonic nervous system. Int. J. Dev. Biol. 44:349-359.
Stoltzfus, A., J. M. Logsdon, Jr., J. D. Palmer, and W. F. Doolittle. 1997. Intron "sliding" and the diversity of intron positions. Proc. Natl. Acad. Sci. USA 94:10739-10744.
Swofford, D. L. 1998. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Mass.
Thompson, J. D., T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins. 1997. The ClustalX Windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25:4876-4882.
Thorley, J. L., and R. D. M. Page. 2000. RadCon: phylogenetic tree comparison and consensus. Bioinformatics 16:486-487.
Thorley, J. L., M. Wilkinson, and M. A. Charleston. 1998. The information content of consensus trees. Pp. 91-98 in A. Rizzi, M. Vichi, and H.-H. Bock, eds. Advances in data science and classification. Springer-Verlag, Berlin.
Toh, H., H. Kubodera, N. Nakajima, T. Sekiya, N. Eguchi, T. Tanaka, Y. Urade, and O. Hayaishi. 1996. Glutathione-independent prostaglandin D synthase as a lead molecule for designing new functional proteins. Protein Eng. 9:1067-1082.
Wada, H., M. Kobayashi, R. Sato, N. Satoh, H. Miyasaka, and Y. Shirayama. 2002. Dynamic insertion-deletion of introns in deuterostome EF-1 alpha genes. J. Mol. Evol. 54:118-128.