*Theoretical Biology and Biophysics, Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico;
Department of Biological Sciences, Louisiana State University at Baton Rouge;
Institute for Genomic Research, Gaithersburg, Maryland;
§Genomics Group, Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico;
and
||The Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Marine Biological Laboratory, Woods Hole, Massachusetts
Abstract |
---|
Introduction |
---|
The paucity of genomes available from crown eukaryotes leads to common challenges for phylogenetics, molecular evolution, and functional genomics analysis, since parameters of evolutionary models, including topology and rates, will tend to have high variances as a result. This situation is particularly unacceptable for functional genomics analysis: with the completion of sequencing of the human genome, one of the most important problems of the coming century will be to develop a complete understanding of how that sequence functions to carry out essential life processes, and a major route to that understanding will be comparative analysis of sequence biodiversity. Despite the accomplishments of genomics research and the associated development of strategies, techniques, and tools for large-scale sequencing, these advances have had very limited influence on the way molecular-based evolutionary studies are conducted. In order to reduce the variance of evolutionary model parameters and increase the accuracy of comparative analysis, we need data sets from large genomic regions with much more extensive sampling of divergent taxa than in data sets that are currently available.
To address this problem, we describe here the potential challenges of evolutionary genomics, which we define as the application of genome research strategies to comparative studies in evolution. We consider possible solutions to these challenges, outline a research design for applying evolutionary genomics to vertebrate mitochondria, and describe simulated sequence assembly experiments to verify the feasibility of our design.
A key element of our plan is to reduce costs associated with cloning and handling of materials by pooling DNA samples from different gene regions and diverse species prior to cloning. Thus, the costs specific to individual samples are limited to those that occur prior to cloning, and the costs of all subsequent steps are shared by all samples. Although breaking the direct association between sequences and samples is a counterintuitive approach, these associations can be re-created using automated assembly programs in combination with preexisting sequence information which can be used as an evolutionary reference. In cases where there is ambiguous assignment of sequences to samples, limited PCR and sequencing can be performed on the original samples to resolve these uncertainties. The simulated resampling and assembly experiments indicate that this strategy will be practical and cost-effective. Individuals from the same species can also be pooled but are unlikely to be separable into unique contiguous sequences, or haplotypes; rather, they would contribute to estimates of intraspecific variation.
Our focus is thus on large-scale sequence analysis of samples from divergent taxa, which we call sequence biodiversity (or genomic biodiversity in the case of complete organelle or nuclear genomes) analysis in order to distinguish it from genomic diversity studies that focus on intraspecific variation (usually in humans). While a great deal of biodiversity is currently being explored at deep taxonomic levels with the sequencing of complete bacterial genomes, these taxa are generally too divergent to be useful in evaluating many important evolutionary processes which occur on a much shorter timescale. Even with the current array of complete genomes, it is expected that around half of the genes in the human genome will be so diverged that they will not be obviously homologous to any genes of known function. Thus, what is clearly needed is greater focus on sequence biodiversity among taxa much more closely related to humans, that is, on the near-human evolutionary environment.
Areas of Impact |
---|
Pattern and Processes of Sequence Evolution
The study of patterns and processes of genome and sequence evolution will benefit significantly from evolutionary genomics. Currently, it is difficult to obtain suitable data sets that cover large amounts of sequence from different genes and common divergent biota. Without such data sets, one cannot do more than estimate gross differences in overall rates between sites or between subsections of phylogenetic trees, and the sensitivity of coevolutionary analysis is limited. The large increase in homologous sequences from a broad range of taxa will lead to more accurate estimation of evolutionary dynamics at individual nucleotide and amino acid positions, which is essential to detailed understanding of the forces of molecular evolution and their relationship to structure and function. These improvements in accuracy will come from the diversity of taxa directly, from the improved accuracy in topological inference, and from having abundant sequences from identical taxa for comparison. Recent progress has been made in analyzing evolutionary behavior at individual sites, both by limiting the parameters to the equilibrium amino acid frequencies (Bruno 1996) and by optimizing functions of underlying physicochemical properties (Koshi, Mindell, and Goldstein 1999; Yang 2000). These techniques are limited, however, by the paucity of data sets that are sufficiently large for the analysis of accurate site-specific information. Likewise, analysis of the interaction between sites, or coevolution, has recently progressed to incorporate a wider diversity of coevolutionary maximum-likelihood models with reliable statistics (Pollock and Taylor 1997; Pollock, Taylor, and Goldman 1999; W. J. Bruno, personal communication). In each case, accurate inferences about the patterns and processes of sequence evolution are limited by the amount of available data. Evolutionary genomics will also provide site-specific information on nucleotide and codon usage bias, insertion and deletion processes, and selective processes such as adaptation, coevolution, and functional divergence.
Functional and Structural Genomics
The increased accuracy in the estimation of evolutionary processes will enable improved correlation of natural substitutions to function and three-dimensional structure. There has been a great deal of success with attempts to make such correlations using currently available data sets (Malcolm et al. 1990; Irwin and Wilson 1991; Goldman, Thorne, and Jones 1996; Karplus et al. 1997; O'Brien, Wienberg, and Lyons 1997; Eisen 1998; Golding and Dean 1998; Clark 1999; Cort et al. 1999; D'Onofrio et al. 1999; Frishman, Goldstein, and Pollock 2000), and improved sampling of biodiversity will certainly lead to more abundant and accurate hypotheses for testing based on improved models. For example, catalytic properties of ancient ribonucleases were studied by predicting the ancestral sequences using 15 existing artiodactyl sequences (Jermann et al. 1995). Since ancestral reconstructions of this kind are known to be highly inaccurate with so few sequences (Yang, Kumar, and Nei 1995), the association of a large increase in activity with ruminant digestion, and the timing of that increase in activity, could only be helped by more accurate predictions from a larger number of ancestral sequences. The results of natural evolutionary experiments in mutagenesis and selection are inferable only indirectly, and the quality of those inferences will improve dramatically through more detailed examination of the extant products.
Functionally conserved regions of regulatory and coding regions of genes can also be efficiently identified by comparing sequences from divergent taxa, a process called evolutionary filtering (Zurawski and Clegg 1987). As with other evolutionary analyses, the power of evolutionary filtering increases rapidly with the number of related sequences examined. In the protein structural realm, it is common practice to infer the functional importance of structural regions from the sequence conservation in that region, and internalization of conserved hydrophobic regions is one of the strongest components of successful structure prediction (Livingstone and Barton 1996; Thompson and Goldstein 1996). The functional importance of structural regions can also be inferred from sequence hypervariability, as with the antigen recognition sites of HLA, plant disease resistance genes, fertilization genes, and surface antigens on viruses such as HIV and influenza. Recent work has also linked changes in evolutionary conservation and variability to functional divergence between duplicated paralogs (Zhang and Gu 1998; Gu 1999). Thus, even simple models defining differences in evolutionary rate alone, focusing on extremes of the distribution, are extremely important.
More detailed phylogenetic analyses have correlated finer structural details (such as positions of catalytic and ligand-binding sites, secondary structure features, and subunit interaction surfaces) to differences in the rate and pattern of evolutionary substitution (Goldman, Thorne, and Jones 1996, 1998; Koshi and Goldstein 1996; Thompson and Goldstein 1996, 1997; Thorne, Goldman, and Jones 1996; Koshi, Mindell, and Goldstein 1999; Dean and Golding 2000). There is every reason to believe that as the amount of evolutionary information increases dramatically, so will the accuracy in predicting structural and functional features.
The potential importance of these predictions to functional genomics cannot be overstated. When the human genome is completely annotated, it can be expected that up to half of the genes will not be homologous to any genes of known function, and that many of the genes which are homologous to known genes will be far enough diverged that their exact function will be ambiguous. Large-scale analysis of expression patterns will aid in functional prediction in an efficient manner, but expensive and time-consuming experiments will eventually need to be performed. Predictions based on evolutionary genomics will provide a means of narrowing experimental possibilities to a feasible number and will allow better prediction of the possible effects of substitutions on structure and function, even when the exact structure is unknown or the gene is only remotely related to a gene of known structure.
Quantitative and Population Genetics
The inclusion of multiple individuals of the same species can provide information on the distributions and patterns of genetic polymorphisms. The simultaneous sequencing of multiple individuals has been discussed in the context of human genome research (Weber and Myers 1997), and sample pooling strategies for mutation detection have been explored (Amos, Frazier, and Wang 2000). Quantitative trait locus mapping, marker-assisted breeding, and other research using polymorphic genetic markers (e.g., single-nucleotide polymorphisms, simple sequence repeats, and microsatellites) benefit from the increase in density of polymorphic markers that comes from evolutionary genomics. These additional markers can be incorporated in the design of genome scanning for mapping studies, resulting in increased statistical power.
Improved estimates of genetic diversity resulting from evolutionary genomics will also benefit population genetics research. The pattern and distribution of nucleotide diversity reflect the major forces acting on populations: mutation, selection, drift, and migration. Heterogeneity in patterns and processes of molecular evolution across genomes can be determined only through examination of multiple sequences from different genomic regions. Past studies based on extensive intraspecific-intragenomic sampling have already led to better characterization of the correlation of polymorphism and recombination (Begun and Aquadro 1992), which motivated the theory of background selection (Charlesworth, Morgan, and Charlesworth 1993; Charlesworth 1994; Charlesworth, Charlesworth, and Morgan 1995), provided a clearer understanding of the differential mutation rates between sex chromosomes and autosomes (Filatov et al. 2000), and provided better estimates of differential mutation rates between sexes (Huttley et al. 2000). A number of statistical measures and tests of natural selection have been developed based on the pattern and distribution of nucleotide diversity within and among genomes (reviewed in Tajima 1993; Clegg 1997; Wayne and Simonsen 1998). The ability to accurately estimate population genetics parameters and the power to discriminate among competing evolutionary hypotheses are directly related to the sampling employed in population genetics studies, with larger samples leading to greater accuracy and more power (Simonsen, Churchill, and Aquadro 1995). Therefore, the marginal cost of including additional individuals in evolutionary genomics projects can be weighed against the benefits of more accurate characterization of nucleotide diversity and, hence, more accurate estimates of population genetics parameters over multiple loci.
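As an illustration of the kind of statistic whose precision depends on sample size, the sketch below computes nucleotide diversity (the average number of pairwise differences per site) from a set of aligned sequences. The function name and the toy sequences are assumptions made for this example, not data from the study.

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    """Average pairwise differences per site for aligned, equal-length sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(seqs, 2))
    # Count mismatching columns over every pair of sequences.
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / (len(pairs) * length)

# Toy alignment (hypothetical): three individuals, eight sites.
sample = ["ACGTACGT", "ACGTACGA", "ACCTACGA"]
pi = nucleotide_diversity(sample)
print(f"nucleotide diversity = {pi:.4f}")
```

Adding individuals increases the number of pairs contributing to the average, which is one way to see why larger samples give more stable estimates.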
A Strategy for Moving from Small-Scale to Large-Scale Surveys of Sequence Biodiversity |
---|
Cloning and Sequencing Template Preparation
One factor that distinguishes a gene-based program from a genomics-based program is the size of the DNA sequence region that is examined at one time. A principal challenge is to take advantage of genomic approaches when the sequence region from any single taxon is relatively short. One way to overcome this challenge is to pool sequence regions across individuals or taxa so that the aggregate sequence corresponds to amounts well suited to genomic methods. For example, pooling 10-20 animal mitochondrial genomes, typically each about 16 kb in length, would yield an aggregate sequence comparable to a BAC, or about 0.1-0.2 times the size of many bacterial genomes. Pooling ratios should be based on estimates of the number of molecules rather than DNA concentration in order to keep the concentration of each site constant in the face of length variation (e.g., differences in intron length or hypervariable regions). Pooling samples reduces the number of times labor-intensive steps involved in cloning (fractionation, size selection, ligation, transformation, etc.) have to be performed and allows for larger numbers of templates to be processed simultaneously without the need to track individuals or taxa during the sequencing phase. Pooling samples itself presents challenges with respect to sequence assembly and assignment of a specific sequence to a specific taxon or haplotype. These challenges are readily dealt with through proper taxon sampling and direct PCR of the original samples.
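The pooling arithmetic can be sketched as follows. Pooling by number of molecules means that the mass contributed by each sample scales with its length; the taxon names, lengths, and total mass below are hypothetical values chosen only to illustrate the calculation.

```python
def aggregate_size_kb(n_genomes, genome_kb=16.0):
    """Aggregate sequence length of a pool of similarly sized genomes."""
    return n_genomes * genome_kb

def equimolar_masses(lengths_bp, total_mass_ng):
    """Mass of each sample giving equal numbers of molecules in the pool.

    For an equal molecule count, mass is proportional to sequence length.
    """
    total_len = sum(lengths_bp.values())
    return {name: total_mass_ng * length / total_len
            for name, length in lengths_bp.items()}

# Pool sizes for 10 and 20 genomes of about 16 kb each.
print(aggregate_size_kb(10), aggregate_size_kb(20))

# Hypothetical samples that differ in length.
lengths = {"taxon_a": 16000, "taxon_b": 17000, "taxon_c": 15000}
masses = equimolar_masses(lengths, total_mass_ng=480.0)
print(masses)
```

Pooling equal masses instead would over-represent every site of the shortest genome, which is the effect the molecule-based ratio is meant to avoid.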
Sequencing
The standard procedures of genomic sequencing, including large-scale automatic sequencing of random shotgun clones, are utilized in evolutionary genomic research. Specific details such as the appropriate mixture of single- and double-ended sequencing, insert size, amount of coverage, gap closure methods, and association of contigs with taxa may vary from project to project and can be optimized to minimize cost. Our experience and limited simulation studies have indicated, however, that cost savings associated with optimizing these details are small and may be offset by costs associated with tracking clones and modifying protocols already in place at genome centers. Thus, our preferred strategy, in line with current protocols at many genomic centers, is to randomly clone inserts of 2-3 kb in length and sequence them automatically at both ends to an average redundancy of sevenfold coverage of each genomic region. This strategy minimizes costs associated with steps requiring human thought and labor, particularly the choice of clone-specific primers and final gap closure.
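A rough sense of the sequencing effort implied by sevenfold coverage can be obtained from a short calculation. The sketch below uses the 183,298-bp pool and the 500-bp read length from the simulation experiment described later in the paper; the function names are assumptions for this example, and the calculation ignores failed reads and vector sequence.

```python
from math import ceil

def reads_for_coverage(total_bp, coverage, read_bp):
    """Number of reads needed to reach the target average coverage."""
    return ceil(coverage * total_bp / read_bp)

def clones_for_coverage(total_bp, coverage, read_bp):
    """Number of clones needed when each clone is read from both ends."""
    return ceil(reads_for_coverage(total_bp, coverage, read_bp) / 2)

reads = reads_for_coverage(183_298, 7, 500)
clones = clones_for_coverage(183_298, 7, 500)
print(reads, clones)
```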
A critical consideration is that of identifying and minimizing the effects of sequencing errors, since they will bias estimates of variation and potentially create problems in assembly and annotation. There are several distinct sources of sequence error, depending on the source of DNA. Simply, these can be divided into error from PCR in those cases where PCR is used to generate material for cloning, and error typically associated with DNA sequencing itself. The error associated with DNA sequencing is reduced by multiple coverage of all regions sequenced. This multiple coverage for most sequence regions, sevenfold on average, is an inherent aspect of the sequencing of random shotgun clones. Furthermore, sequencing errors generally result in lower confidence in the sequencing calls (i.e., lower quality values) compared with correctly called bases, and standard assembly programs use this quality information from redundantly sequenced bases to reduce error propagation in the assembly process (Weber and Myers 1997).
PCR error due to the misincorporation of nucleotides during polymerization is expected to be similar in distribution to sequencing error. In contrast to sequencing error, however, the quality values associated with PCR errors will be indistinguishable from correct sequence, since incorrect bases will be incorporated prior to cloning. Given a large number of initial starting templates, most misincorporated bases will be of low frequency and, thus, distinguishable from the majority of authentic variation at most sites. In projects with one individual per species, there will be little natural variation (none in the case of mitochondria in the absence of heteroplasmy), and these cases can be used to obtain an accurate estimate of PCR error in the system. This estimate can then be used to correct estimates of sequence variation (particularly important in the low-frequency range) in projects with multiple individuals per species, and to calculate the probability that any given variant is real as opposed to an artifact of PCR. For heterozygotes, both nucleotides segregating at a site should be well represented in the raw sequences and mostly distinguishable from PCR error.
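One simple way to frame the probability that a variant is an artifact is a binomial tail: if each read covering a site carries an error independently with probability p, how likely is it that k or more of n reads show the same variant by error alone? The sketch below makes that independence assumption explicitly, and the error rate used is a placeholder rather than a measured value. Errors introduced in early PCR cycles are shared among descendant molecules, so a real analysis would need a more careful model.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): k or more of n reads in error."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Assumed values: sevenfold coverage, placeholder per-base error rate of 0.001.
n_reads, error_rate = 7, 0.001
for k in (1, 2, 3):
    print(k, prob_at_least(k, n_reads, error_rate))
```

Under these assumptions a variant seen in several independent clones is very unlikely to be explained by error alone, whereas a variant seen once is not distinguishable from error.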
Postsequencing Data Processing
Evolutionary genomics requires the same vector clipping, sequence fragment assembly, and other raw sequence data processing employed in genomics research and can be achieved using available software (e.g., Phred/Phrap/Consed, Staden Package, and/or the TIGR Assembler; Bonfield, Smith, and Staden 1995; Bonfield and Staden 1995; Sutton et al. 1995; Staden 1996; Ewing and Green 1998; Ewing et al. 1998). In projects with multiple individuals per species, mutation detection software (e.g., Bonfield, Rada, and Staden 1998) may prove useful in assigning variants to species contigs. Raw data processing that requires special applications may include sequence splitting at primers to break up chimeras resulting from ligation of PCR fragments from different genomic regions or taxa. This sequence splitting can be easily accomplished using a Perl script after vector clipping and before sequence assembly.
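The splitting step might look something like the following sketch, written in Python rather than the Perl mentioned above. It assumes that an internal occurrence of a PCR primer, on either strand, marks a ligation junction; it uses exact matching only, discards the primer itself, and drops pieces shorter than a chosen minimum. The primer sequence and the threshold are hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def split_at_primers(read, primers, min_len=20):
    """Split a read wherever a primer (either strand) occurs exactly."""
    if not primers:
        return [read]
    sites = set(primers) | {reverse_complement(p) for p in primers}
    # Try longer sites first so a short primer cannot pre-empt a longer one.
    pattern = "|".join(re.escape(s) for s in sorted(sites, key=len, reverse=True))
    return [piece for piece in re.split(pattern, read) if len(piece) >= min_len]

# Hypothetical primer and chimeric read.
primer = "GGATCCAAGT"
chimera = "A" * 30 + primer + "C" * 25
print(split_at_primers(chimera, [primer]))
```

A production script would likely tolerate mismatches within the primer and carry quality values along with each piece; neither is attempted here.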
The task of sequence annotation is greatly simplified for evolutionary genomics projects compared to standard genomics projects because a great deal of information is available a priori with respect to the gene content and other features of the sequences being examined. This prior knowledge and availability of comparative data makes evolutionary genomics extremely well suited for annotation using case-based reasoning (Overton and Haas 1998), since there will already be a moderate number of homologous sequences available. Although appropriate software is not currently available, this prior knowledge can theoretically be used to guide sequence assembly, allowing for the ordering and linking of contigs with extremely short overlaps and thus reducing the amount of sequencing redundancy required for a minimal final gap closure effort. Subsequent steps, such as data preparation for GenBank submission, can be accomplished using available tools.
Evolutionary data analysis for inferring phylogenetic relationships, characterizing patterns of sequence evolution, and estimating population genetic parameters can also be accomplished using standard methods and available software. The only difference will be an increase in the numbers and lengths of sequences analyzed. Faster computers, improvements in algorithms (e.g., Lewis 1998), and parallel computing implementation of algorithms will continue to decrease the real time of analysis. With a rapid increase in evolutionary genomics data, there will be, however, a qualitative change in the analytical potential of the data that should spur development of novel analytical approaches.
The Case for an Examination of Vertebrate Mitochondrial Genomes |
---|
Mitochondrial genomes also have a strong advantage over nuclear genes in that they are unlikely to have experienced many intraspecific recombination events (but see recent controversy; Arctander 1999; Awadalla, Eyre-Walker, and Smith 1999; Merriweather and Kaestle 1999; Awadalla, Eyre-Walker, and Maynard Smith 2000; Kivisild et al. 2000). Thus, there is more likely to be a single mitochondrial phylogeny, as opposed to nuclear gene or gene segment phylogenies, which may be composites of different phylogenies. Within the vertebrates, there are few known rearrangements of genes, and none of protein-coding genes (Curole and Kocher 1999), so phylogenies are also unlikely to be confounded by inaccurate reconstruction of those events.
A detailed review concerning the effects of the current set of complete mitochondrial genomes on questions in vertebrate phylogenetics has been written (Curole and Kocher 1999), but in summary, as of March 2000, there were 69 complete vertebrate mitochondrial genomes publicly available, with slightly more than half coming from mammals (fig. 2). While many phylogenetic questions have been resolved with complete vertebrate mitochondrial genomes, many more are currently ambiguous or directly at odds with morphological and other sequence data. Thus, increasing the vertebrate mitochondrial genome data set, particularly by breaking up many of the long unbroken branches (e.g., within the rodents, bats, marsupials, snakes, lizards, turtles, and amphibia), is likely to have a large impact on confidence in the resolution of the tree structure.
Assessing the Feasibility of the Evolutionary Genomics Strategy: Experiments with Existing Mitochondrial Genomes |
---|
We tested our ability to assemble real sequences by resampling from known genomes. Sequences of 10 complete mitochondrial genomes available in GenBank were sampled following the protocol outlined above. These mitochondrial genomes were those of the human (H. sapiens; Anderson et al. 1981), the mouse (Mus musculus; Bibb et al. 1981), the cow (Bos taurus; Anderson et al. 1982b), the gorilla (Gorilla gorilla; Horai et al. 1995), the rat (Rattus norvegicus; Gadaleta et al. 1989), the finback whale (Balaenoptera physalus; Arnason, Gullberg, and Widegren 1991), the horse (Equus caballus; Xu and Arnason 1994), the domestic cat (Felis catus; Lopez, Cevario, and O'Brien 1996), the armadillo (Dasypus novemcinctus; Arnason, Gullberg, and Janke 1999), and the white rhinoceros (Ceratotherium simum; Xu and Arnason 1997), for a total of 183,298 bp.
To simulate the cloning process, insert size and raw sequence reads from both ends were sampled uniformly within a small range of length (2,000 bp ± 15% for insert length, 500 bp ± 5% for sequence read length) and a small range of relative concentrations of samples (plus or minus 15%). These distributions do not reflect the exact distributions of read lengths and insert sizes in genomic cloning, but they are roughly compatible with ranges observed (unpublished data); these parameters have been shown in previous simulations not to have a strong effect on simulated assembly results (Weber and Myers 1997). Standard genome projects do not have to deal with mixing of samples, so we do not have data on what the range of relative concentrations of samples will be, but the range used was intended to reasonably incorporate the variability that might be expected from DNA concentration estimates. The expected effect of a large underestimate of DNA concentration in a particular sample is that the sample will be sequenced at lower-than-optimal redundancy, and it will not be possible to fully assemble that genome. It is entirely feasible, however, that such a sample could be added at an appropriate concentration to a subsequent round for completion, while the first round could be terminated early on completion of the other genomes, thus only minimally affecting the overall cost of sequencing. Overestimates of DNA concentration in a single sample would have to be very large to dramatically affect the sequencing strategy and would generally simply lead to wasted sequence effort in direct proportion to the percentage of the overestimate. In extreme cases, sequencing would be terminated after assembly of the high-concentration sample, and the lower-concentration samples would then be recloned without that sample.
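The sampling scheme can be sketched for a single genome as follows. This is a minimal illustration, not the simulation code used in the study: it assumes an insert length of roughly 2,000 bp, treats the genome as circular, tracks only read positions rather than bases, and omits the variation in sample concentration. All function names and the 16,500-bp genome length are assumptions.

```python
import random

def sample_clone(genome_len, rng, insert_mean=2000, insert_tol=0.15,
                 read_mean=500, read_tol=0.05):
    """Draw one clone and return (start, length) for its two end reads."""
    insert_len = round(rng.uniform(insert_mean * (1 - insert_tol),
                                   insert_mean * (1 + insert_tol)))
    start = rng.randrange(genome_len)  # circular genome: any start is valid
    reads = []
    for end in ("forward", "reverse"):
        read_len = round(rng.uniform(read_mean * (1 - read_tol),
                                     read_mean * (1 + read_tol)))
        if end == "forward":
            read_start = start
        else:
            # The reverse read covers the last read_len bases of the insert.
            read_start = (start + insert_len - read_len) % genome_len
        reads.append((read_start, read_len))
    return reads

def simulate_coverage(genome_len, target_coverage, seed=0):
    """Sample clones until the average read depth reaches the target."""
    rng = random.Random(seed)
    depth = [0] * genome_len
    total_bases = 0
    n_clones = 0
    while total_bases < target_coverage * genome_len:
        for read_start, read_len in sample_clone(genome_len, rng):
            for i in range(read_start, read_start + read_len):
                depth[i % genome_len] += 1
            total_bases += read_len
        n_clones += 1
    return depth, n_clones

depth, n_clones = simulate_coverage(16_500, 7, seed=1)
uncovered = sum(1 for d in depth if d == 0)
print(n_clones, min(depth), max(depth), uncovered)
```

Positions with zero depth correspond to the sequence gaps that the assembly experiment reports; an actual test of assembly would also need the sampled bases and an assembler.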
In our simulation experiment, we were able to reassemble 6 of the 10 genomes (human, cow, armadillo, whale, mouse, and horse) correctly with no gaps after sampling at 7.0-fold average coverage. There were single gaps in the rat, cat, gorilla, and rhinoceros sequences of lengths 118, 56, 63, and 90 bp, respectively, for a mean gap probability of 0.4 per genome and a mean gap size of 81.75 bp. The rhinoceros sequence was divided into two contigs, the smaller of which was 1,560 bp and bounded by the gap on one side and a 13 bp overlap on the other which was not sufficiently long to join it with the larger contig. This result is in line with expectations of this experiment for random DNA sequences, and the gaps are of a sufficient size that they could easily be closed by short PCR amplification and sequencing. Thus, the assembly was not adversely affected by the relatedness of these sequences or variation in sample concentrations, nor was it dramatically affected by substantial repetitive elements in the control regions of the horse, the rhinoceros, the cat, and the armadillo.
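The text does not say which model supplied the expectation for random sequences. One common choice, used here purely as an assumption, is the Lander-Waterman approximation, in which the expected number of gaps is N·exp(−cσ) for N reads of length L at coverage c, with σ = 1 − T/L and T the minimum overlap needed to join two reads. The 20-bp minimum overlap below is also an assumed value.

```python
from math import exp

def expected_gaps(genome_bp, read_bp, coverage, min_overlap_bp):
    """Lander-Waterman expected number of gaps in a random shotgun project."""
    n_reads = coverage * genome_bp / read_bp
    sigma = 1 - min_overlap_bp / read_bp
    return n_reads * exp(-coverage * sigma)

# Pool of 183,298 bp, 500-bp reads, sevenfold coverage, assumed 20-bp overlap.
gaps = expected_gaps(183_298, 500, 7.0, 20)
print(f"expected gaps across the pool: {gaps:.2f}")
```

Under these assumptions the model predicts about three gaps across the whole pool, which is of the same order as the four gaps observed in the simulation.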
The taxa in this experiment were chosen to be representative of how each round in an evolutionary genomics-based vertebrate mitochondrial sequencing project might be chosen. All taxa were from separate genera, but some taxon pairs, particularly the human and the gorilla, are relatively close. Are these taxa in fact representative? In order to answer this, we determined the length distribution of regions of identity for mitochondrial genomes from taxon pairs over a range of genetic distances (fig. 3). These ranged from an interclass comparison of a mammal (H. sapiens) and a bird (Gallus gallus), to the closest interspecies comparison available from the 69 taxa in figure 2, which involves Equus asinus (donkey) and E. caballus. In intragenomic comparisons within the human, gorilla, cat, and mouse genomes, identical stretches have a maximum length between 14 and 17 bp, with the exception of some repetitive expansions in the control region of the cat (Felis catus; data not shown). Thus, for pairwise intergenomic comparisons (fig. 3), the stretches of length 15 or more are of the greatest interest.
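For a pair of aligned sequences, the length distribution of identical stretches can be tabulated with a short routine such as the one below. It is a sketch under the assumption that the two genomes have already been aligned, with gaps written as "-"; the text does not specify how the comparison was carried out, and the toy sequences are hypothetical.

```python
from collections import Counter

def identity_run_lengths(a, b):
    """Lengths of consecutive identical, ungapped columns in an alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    runs, current = [], 0
    for x, y in zip(a, b):
        if x == y and x != "-":
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def count_long_runs(runs, threshold=15):
    """Number of identical stretches at or above the length of interest."""
    return sum(1 for r in runs if r >= threshold)

# Toy alignment (hypothetical sequences).
runs = identity_run_lengths("ACGTACGTAC", "ACGAACGTTC")
print(runs, Counter(runs), count_long_runs(runs))
```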
Since there are roughly 4,000 species of mammals, 7,000 birds, 20,000 fish, and thousands of reptiles and amphibians, the possibilities for sampling sufficiently distinct taxa within the vertebrates are not limited; it seems feasible to avoid combining species as close as the human-gorilla pair or the horse-donkey pair, meaning that identical segments longer than 80 bp would be unlikely. Furthermore, samples from many thousands of these species are available from museums for PCR sampling, if not mitochondrial purification. A pilot project is now underway to implement the procedures we have outlined.
Evolutionary Genomics Is Cost-Effective |
---|
It is our intention that assembled sequence should be made available in a dedicated phylogeny-oriented database, in addition to deposition in GenBank, in order to give added value in the form of comparative functional annotation, phylogenetic filtering, and easier access to mitochondrial-specific features. Again, a coordinated and centralized data-processing effort will avoid wasting resources through redundant efforts in multiple laboratories processing data to a point where it can be analyzed. Another large benefit of a centralized effort will be that all sequences can be tightly linked to some form of museum-deposited voucher as a matter of course, and DNA samples can also be regularly deposited with museums. Thus, one important function of a dedicated phylogenetic database will be to link directly back to the original museum information on all taxa included. This aspect of genomic analysis is often neglected by molecular biologists (few of the current complete mitochondrial genomes cite deposition of a voucher specimen), but it is extremely important to taxonomists and conservation biologists. Such linkage to original sources may not appear central to the techniques of evolutionary genomics, but it is central to creating and maintaining the resources used and to adding the full potential value of the sequence products.
Summary |
---|
Given moderate funding, we believe that those thousands of taxa could be sequenced rapidly at roughly one-fifth the cost of conventional approaches. By comparison, the human genome involves about 130 times the sequencing effort with 10-fold coverage (ignoring preliminary mapping costs and duplication of efforts by noncooperating ventures). In order to put this into perspective, a single ABI 3700 automatic DNA sequencer running at full capacity could sequence 23 mitochondrial genomes per week and complete 2,000 genomes in 20 months. Thus, the full capacity of the DOE Joint Genome Institute could complete 2,000 vertebrate mitochondrial genomes in about a week, and Celera Corporation or the Human Genome Project could complete them in a matter of days. Considering this, we suggest that a moderate-sized program be initiated to sequence 2,000 or more complete genomes from vertebrate mitochondria to fully demonstrate the potential benefits of evolutionary genomics and genomic biodiversity. This would include perhaps 800, or one-fifth, of the mammals in order to focus more on the human evolutionary environment, and the remainder would be from birds, reptiles, amphibians, and fish. Ideally, such a sequencing effort would be conducted in conjunction with a dedicated collaborative bioinformatics program and would itself be viewed as a pilot for continuing communitywide large-scale nuclear evolutionary genomics and sequence biodiversity projects.
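The throughput figures quoted above can be checked with simple arithmetic. The sketch below assumes a constant rate of 23 genomes per week and 52/12 weeks per month, and takes the estimate of roughly 4,000 mammal species given earlier in the paper.

```python
def months_to_complete(n_genomes, genomes_per_week, weeks_per_month=52 / 12):
    """Time needed for one sequencer running at a constant weekly rate."""
    return n_genomes / genomes_per_week / weeks_per_month

months = months_to_complete(2000, 23)
mammal_fraction = 800 / 4000  # 800 of roughly 4,000 mammal species
print(f"{months:.1f} months, mammal fraction {mammal_fraction:.2f}")
```

The result, just over 20 months, agrees with the figure given in the text for a single sequencer.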
Acknowledgements |
---|
Footnotes |
---|
1 Keywords: evolutionary genomics, genomic biodiversity, functional genomics, comparative genomics, molecular evolution, phylogenetics
2 Address for correspondence and reprints: David D. Pollock, Department of Biological Sciences, Louisiana State University, Baton Rouge, Louisiana 70803. E-mail: daviddpollock{at}yahoo.com
![]() |
literature cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Abrahams, J. P., A. G. Leslie, R. Lutter, and J. E. Walker. 1994. Structure at 2.8 A resolution of F1-ATPase from bovine heart mitochondria. Nature 370:621628
Adams, M. D., S. E. Celniker, R. A. Holt et al. (195 co-authors). 2000. The genome sequence of Drosophila melanogaster. Science 287:21852195
Amos, C. I., M. L. Frazier, and W. Wang. 2000. DNA pooling in mutation detection with reference to sequence analysis. Am. J. Hum. Genet. 66:16891692[ISI][Medline]
Anderson, S., A. T. Bankier, B. G. Barrell, M. H. L. De Bruijn, A. R. Coulson, J. Drouin, I. C. Eperon, D. P. Nierlich, and B. A. Roe. 1981. Sequence and organization of the human mitochondrial genome. Nature 290:457465
Anderson, S., A. T. Bankier, B. G. Barrell, M. H. L. De Bruijn, A. R. Coulson, J. Drouin, I. C. Eperon, D. P. Nierlich, and B. A. Roe. 1982a. Comparison of the human and bovine mitochondrial genomes. In P. Slonimski, P. Borst, and G. Attardi, eds. Cold Spring Harbor monograph series. Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y
Anderson, S., M. H. L. De Bruijn, A. R. Coulson, I. C. Eperon, F. Sanger, and I. G. Young. 1982b. Complete sequence of bovine mitochondrial DNA: conserved features of the mammalian mitochondrial genome. J. Mol. Biol. 156:683718
Arctander, P. 1999. Mitochondrial recombination? Science 284:20902091
Arnason, U., A. Gullberg, and A. Janke. 1999. The mitochondrial DNA molecule of the aardvark, Orycteropus afer, and the position of the Tubulidentata in the eutherian tree. Proc. R. Soc. Lond. B Biol. Sci. 266:339345[ISI][Medline]
Arnason, U., A. Gullberg, and B. Widegren. 1991. The complete nucleotide sequence of the mitochondrial DNA of the fin whale, Balaenoptera physalus. J. Mol. Evol. 33:556568[ISI][Medline]
Arnason, U., and E. Johnsson. 1992. The complete mitochondrial DNA sequence of the harbor seal, Phoca vitulina. J. Mol. Evol. 34:493505[ISI][Medline]
Awadalla, P., A. Eyre-Walker, and J. Maynard Smith. 2000. Questioning evidence for recombination in human mitochondrial DNA: response. Science 288:1931a
Awadalla, P., A. Eyre-Walker, and J. M. Smith. 1999. Linkage disequilibrium and recombination in hominid mitochondrial DNA. Science 286:25242525
Ball, C. A., K. Dolinski, S. S. Dwight et al. (17 co-authors). 2000. Integrating functional genomic information into the Saccharomyces Genome Database. Nucleic Acids Res. 28:7780
Begun, D. J., and C. F. Aquadro. 1992. Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature 356:519520
Bibb, M. J., R. A. Van Etten, C. T. Wright, M. W. Walberg, and D. A. Clayton. 1981. Sequence and gene organization of mouse mitochondrial DNA. Cell 26:167180
Bonfield, J. K., C. Rada, and R. Staden. 1998. Automated detection of point mutations using fluorescent sequence trace subtraction. Nucleic Acids Res. 26:34043409
Bonfield, J. K., K. F. Smith, and R. Staden. 1995. A new DNA sequence assembly program. Nucleic Acids Res. 23:49924999[Abstract]
Bonfield, J. K., and R. Staden. 1995. The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Res. 23:14061410[Abstract]
Bruno, W. J. 1996. Modeling residue usage in aligned protein sequences via maximum likelihood. Mol. Biol. Evol. 13:13681374
C. ELEGANS Consortium. 1998. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282:20122018
Cao, Y., J. Adachi, A. Janke, S. Pääbo, and M. Hasegawa. 1994. Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: instability of a tree based on a single gene. J. Mol. Evol. 39:519527[ISI][Medline]
Cao, Y., A. Janke, P. J. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Paabo, and M. Hasegawa. 1998. Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders. J. Mol. Evol. 47:307322[ISI][Medline]
Chang, Y.-S., F.-L. Huang, and T.-B. Lo. 1994. The complete nucleotide sequence and gene organization of carp (Cyprinus carpio) mitochondrial genome. J. Mol. Evol. 38:138155[ISI][Medline]
Charlesworth, B. 1994. The effect of background selection on weakly-selected, linked variants. Genet. Res. 63:213227[ISI][Medline]
Charlesworth, B., M. T. Morgan, and D. Charlesworth. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:12891303
Charlesworth, D., B. Charlesworth, and M. T. Morgan. 1995. The pattern of neutral molecular variation under the background selection model. Genetics 141:16191632
Cheng, S., S.-Y. Chang, P. Gravitt, and R. Respess. 1994a. Long PCR. Nature 369:684685
Cheng, S., C. Fockler, W. M. Barnes, and R. Higuchi. 1994b. Effective amplification of long targets from cloned inserts and human genomic DNA. Proc. Natl. Acad. Sci. USA 91:56955699
Chothia, C., and A. M. Lesk. 1987. The evolution of protein structures. Cold Spring Harb. Symp. Quant. Biol. 52:399406[ISI][Medline]
Clark, M. S. 1999. Comparative genomics: the key to understanding the Human Genome Project. Bioessays 21:121130
Clegg, M. T. 1997. Plant genetic diversity and the struggle to measure selection. J. Hered. 88:17[Abstract]
Cort, J. R., E. V. Koonin, P. A. Bash, and M. A. Kennedy. 1999. A phylogenetic approach to target selection for structural genomics: solution structure of YciH. Nucleic Acids Res. 27:40184027
Cummings, M. P., S. P. Otto, and J. Wakeley. 1995. Sampling properties of DNA sequence data in phylogenetic analysis. Mol. Biol. Evol. 12:814822[Abstract]
. 1999. Genes and other samples of DNA sequence data for phylogenetic inference. Biol. Bull. 196:345350
Curole, J. P., and T. D. Kocher. 1999. Mitogenomics: digging deeper with complete mitochondrial genomes. Trends Ecol. Evol. 14:394398[ISI][Medline]
Dean, A. M., and G. B. Golding. 2000. Enzyme evolution explained (sort of). Pac. Symp. Biocomput. 5:617
Desjardins, P., and R. Morais. 1990. Sequence and gene organization of the chicken mitochondrial genome: a novel gene order in higher vertebrates. J. Mol. Biol. 212:599635[ISI][Medline]
D'Onofrio, G., K. Jabbari, H. Musto, F. Alvarez-Valin, S. Cruveiller, and G. Bernardi. 1999. Evolutionary genomics of vertebrates and its implications. Ann. N.Y. Acad. Sci. 870:8194
Dowling, T. E., C. Moritz, J. D. Palmer, and L. H. Rieseberg. 1996. Nucleic acids III: analysis of fragments and restriction sites. Sinauer, Sunderland, Mass
Eisen, J. A. 1998. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 8:163167
Ewing, B., and P. Green. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8:186194
Ewing, B., L. Hillier, M. C. Wendl, and P. Green. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175185
Filatov, D. A., F. Moneger, I. Negrutiu, and D. Charlesworth. 2000. Low variability in a Y-linked plant gene and its implications for Y-chromosome evolution. Nature 404:388390
Fleischmann, R. D., M. D. Adams, O. White et al. (40 co-authors). 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae. Science 269:496498, 507512
Frishman, D., R. J. Goldstein, and D. D. Pollock. 2000. Protein evolution and structural genomics. Pac. Symp. Biocomput. 5:35
Gadaleta, G., G. Pepe, G. De Candia, C. Quagliarello, E. Sbisa, and C. Saccone. 1989. The complete nucleotide sequence of the Rattus norvegicus mitochondrial genome: cryptic signals revealed by comparative analysis between vertebrates. J. Mol. Evol. 28:497516[ISI][Medline]
Golding, G. B., and A. M. Dean. 1998. The structural basis of molecular adaptation. Mol. Biol. Evol.15:355369
Goldman, N., J. L. Thorne, and D. T. Jones. 1996. Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. J. Mol. Biol. 263:196208[ISI][Medline]
. 1998. Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics 149:445458
Graybeal, A. 1998. Is it better to add taxa or characters to a difficult phylogenetic problem? Syst. Biol.47:917
Gu, X. 1999. Statistical methods for testing functional divergence after gene duplication. Mol. Biol. Evol. 16:16641674
Hillis, D. M. 1996. Inferring complex phylogenies. Nature 383:130131
. 1998. Taxonomic sampling, phylogenetic accuracy, and investigator bias. Syst. Biol. 47:38[ISI][Medline]
Horai, S., K. Hayasaka, R. Kondo, K. Tsugane, and N. Takahata. 1995. Recent African origin of modern humans revealed by complete sequences of hominoid mitochondrial DNAs. Proc. Natl. Acad. Sci. USA 92:532536
Huttley, G. A., I. B. Jakobsen, S. R. Wilson, and S. Easteal. 2000. How important is DNA replication for mutagenesis? Mol. Biol. Evol. 17:929937
Irwin, D. M., and A. C. Wilson. 1991. Structure and evolution of cow stomach lysozyme genes. FASEB J. 5:A1527
Iwata, S., J. W. Lee, K. Okada, J. K. Lee, M. Iwata, B. Rasmussen, T. A. Link, S. Ramaswamy, and B. K. Jap. 1998. Complete structure of the 11-subunit bovine mitochondrial cytochrome bc1 complex. Science 281:6471
Jermann, T. M., J. G. Opitz, J. Stackhouse, and S. A. Benner. 1995. Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily. Nature 374:5759
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. CABIOS 8:275282
Karplus, K., K. Sjolander, C. Barrett, M. Cline, D. Haussler, R. Hughey, L. Holm, and C. Sander. 1997. Predicting protein structure using hidden Markov models. Proteins XX(Suppl.):134139
Kim, J. 1998. Large-scale phylogenies and measuring the performance of phylogenetic estimators. Syst. Biol.47:4360
Kivisild, T., R. Villems, L. B. Jorde, M. Bamshad, S. Kumar, P. Hedrick, T. Dowling, M. Stoneking, T. J. Parsons, and J. A. Irwin. 2000. Questioning evidence for recombination in human mitochondrial DNA. Science 288:1931a
Koshi, J. M., and R. A. Goldstein. 1996. Correlating structure-dependent mutation matrices with physical-chemical properties. Pac. Symp. Biocomput. 1:488499
Koshi, J. M., D. P. Mindell, and R. A. Goldstein. 1999. Using physical-chemistry-based substitution models in phylogenetic analyses of HIV-1 subtypes. Mol. Biol. Evol. 16:173179[Abstract]
Lesk, A. M., and C. Chothia. 1980. How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol. 136:225270[ISI][Medline]
Lewis, P. O. 1998. A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data. Mol. Biol. Evol. 15:277283[Abstract]
Livingstone, C. D., and G. J. Barton. 1996. Identification of functional residues and secondary structure from protein multiple sequence alignment. Methods Enzymol. 266:497512[ISI][Medline]
Lopez, J. V., S. Cevario, and S. J. O'Brien. 1996. Complete nucleotide sequences of the domestic cat (Felis catus) mitochondrial genome and a transposed mtDNA repeat, Numt, in the nuclear genome. Genomics 33:229246
Malcolm, B. A., K. P. Wilson, B. W. Matthews, J. F. Kirsch, and A. C. Wilson. 1990. Ancestral lysozymes reconstructed neutrality tested and thermostability linked to hydrocarbon packing. Nature 345:8689
Merriweather, D. A., and F. A. Kaestle. 1999. Mitochondrial recombination? (continued). Science 285:837
Meyer, A. 1994. Shortcomings of the cytochrome b gene as a molecular marker. Trends Ecol. Evol. 9:278280[ISI]
Mindell, D. P., M. D. Sorenson, D. E. Dimcheff, M. Hasegawa, J. C. Ast, and T. Yuri. 1999. Interordinal relationships of birds and other reptiles based on whole mitochondrial genomes. Syst. Biol. 48:138152[ISI][Medline]
Miya, M., and M. Nishida. 1999. Organization of the mitochondrial genome of a deep-sea fish, Gonostoma gracile (Teleostei: Stomiiformes): first example of transfer RNA gene rearrangements in bony fishes. Mar. Biotech. 1:416426[ISI]
Nelson, W. S., P. A. Prodohl, and J. C. Avise. 1996. Development and application of long-PCR for the assay of full-length animal mitochondrial DNA. Mol. Ecol. 5:807810[ISI][Medline]
O'Brien, S. J., J. Wienberg, and L. A. Lyons. 1997. Comparative genomics: lessons from cats. Trends Genet. 13:393399[ISI][Medline]
Orengo, C. A., F. M. Pearl, J. E. Bray, A. E. Todd, A. C. Martin, L. Le Conte, and J. M. Thornton. 1999. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res. 27:275279
Otto, S. P., M. P. Cummings, and J. Wakeley. 1996. Inferring phylogenies from DNA sequence data: the effects of sampling. Pp. 103115 in P. H. Harvey, A. J. Leigh-Brown, J. Maynard-Smith, and S. Nee, eds. New uses for new phylogenies. Oxford University Press, Oxford, England
Overton, G. C., and J. Haas. 1998. Case-based reasoning driven gene annotation. Pp. 6586 in S. L. Salzberg, D. B. Searls, and S. Kasif, eds. Computational methods in molecular biology. Elsevier, Amsterdam
Poe, S., and D. L. Swofford. 1999. Taxon sampling revisited. Nature 398:299300
Pollock, D. D. 1998. Increased accuracy in analytical molecular distance estimation. Theor. Popul. Biol. 54:7890[ISI][Medline]
Pollock, D. D., and W. J. Bruno. 2000. Assessing an unknown evolutionary process: effect of increasing site-specific knowledge through taxon addition. Mol. Biol. Evol. 17:18541858
Pollock, D. D., and W. R. Taylor. 1997. Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. Protein Eng. 10:647657[Abstract]
Pollock, D. D., W. R. Taylor, and N. Goldman. 1999. Coevolving protein residues: maximum likelihood identification and relationship to structure. J. Mol. Biol. 287:187198[ISI][Medline]
Roe, B. A., D. P. Ma, R. K. Wilson, and J. F. H. Wong. 1985. The complete nucleotide sequence of the Xenopus laevis mitochondrial genome. J. Biol. Chem. 260:97599774
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406425[Abstract]
Shirakihara, Y., A. G. Leslie, J. P. Abrahams, J. E. Walker, T. Veda, Y. Sekimoto, M. Kambara, K. Saika, Y. Kagawa, and M. Yoshida. 1997. The crystal structure of the nucleotide-free alpha 3 beta 3 subcomplex of F1-ATPase from the thermophilic Bacillus PS3 is a symmetric trimer. Structure 5:825836
Simonsen, K. L., G. A. Churchill, and C. F. Aquadro. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413429
Staden, R. 1996. The Staden sequence analysis package. Mol. Biotech. 5:233241[ISI][Medline]
Sutton, G., O. White, M. Adams, and A. Kerlavage. 1995. TIGR Assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci. Tech. 1:919
Tajima, F. 1993. Simple methods for testing the molecular evolutionary clock hypothesis. Genetics 135:599607
Takeyasu, K., H. Omote, S. Nettikadan, F. Tokumasu, A. Iwamoto-Kihara, and M. Futai. 1996. Molecular imaging of Escherichia coli F0F1-ATPase in reconstituted membranes using atomic force microscopy. FEBS Lett. 392:110113[ISI][Medline]
Thompson, M. J., and R. A. Goldstein. 1996. Predicting solvent accessibility: higher accuracy using Bayesian statistics and optimized residue substitution classes. Proteins 25:3847
. 1997. Predicting protein secondary structure with probabilistic schemata of evolutionarily derived information. Protein Sci. 6:19631975
Thorne, J. L., N. Goldman, and D. T. Jones. 1996. Combining protein evolution and secondary structure. Mol. Biol. Evol. 13:666673[Abstract]
Tsukihara, T., H. Aoyama, E. Yamashita, T. Tomizaki, H. Yamaguchi, K. Shinzawa-Itoh, R. Nakashima, R. Yaono, and S. Yoshikawa. 1996. The whole structure of the 13-subunit oxidized cytochrome c oxidase at 2.8 A. Science 272:11361144
Tzeng, C.-S., C.-F. Hui, S.-C. Shen, and P. C. Huang. 1992. The complete nucleotide sequence of the Crossostoma lacustre mitochondrial genome: conservation and variations among vertebrates. Nucleic Acids Res. 20:48534858[Abstract]
Uhlin, U., G. B. Cox, and J. M. Guss. 1997. Crystal structure of the epsilon subunit of the proton-translocating ATP synthase from Escherichia coli. Structure 5:12191230
Venter, J. C., M. D. Adams, G. G. Sutton, A. R. Kerlavage, H. O. Smith, and M. Hunkapiller. 1998. Shotgun sequencing of the human genome. Science 280:15401542
Venter, J. C., H. O. Smith, and L. Hood. 1996. A new strategy for genome sequencing. Nature 381:364366
Vik, S. B., J. C. Long, T. Wada, and D. Zhang. 2000. A model for the structure of subunit a of the Escherichia coli ATP synthase and its role in proton translocation. Biochim. Biophys. Acta 1458:457466
Wallace, D. C., M. T. Lott, M. D. Brown, K. Hupoonen, and A. Torroni. 1995. Report of the committee on human mitochondrial DNA. Pp. 910954 in A. J. Cutichia, ed. Human gene mapping 1995: a compendium. Johns Hopkins University Press, Baltimore, Md
Wayne, M. L., and K. L. Simonsen. 1998. Statistical tests of neutrality in the age of weak selection. Trends Ecol. Evol. 13:236240[ISI]
Weber, J. L., and E. W. Myers. 1997. Human whole-genome shotgun sequencing. Genome Res. 7:401409
Xu, X., and U. Arnason. 1994. The complete mitochondrial DNA sequence of the horse, Equus caballus: extensive heteroplasmy of the control region. Gene 148:357362
. 1997. The complete mitochondrial DNA sequence of the white rhinoceros, Ceratotherium simum, and comparison with the mtDNA sequence of the Indian rhinoceros, Rhinoceros unicornis. Mol. Phylogenet. Evol. 7:189194
Yang, Z. 2000. Relating physicochemical properties of amino acids to variable nucleotide substitution patterns among sites. Pac. Symp. Biocomput. 5:7889
Yang, Z., S. Kumar, and M. Nei. 1995. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:16411650
Yoshikawa, S., K. Shinzawa-Itoh, and T. Tsukihara. 1998. Crystal structure of bovine heart cytochrome c oxidase at 2.8 A resolution. J Bioenerg. Biomembr. 30:714[ISI][Medline]
Yoshikawa, S., T. Tsukhihara, and K. Shinzawa-Itoh. 1996. Crystal structure of fully oxidized cytochrome c-oxidase from the bovine heart at 2.8 A resolution. Biokhimiia 61:19311940
Zardoya, R., and A. Meyer. 1996. Phylogenetic performance of mitochondrial protein-coding genes in resolving relationships among vertebrates. Mol. Biol. Evol.13:933942
Zhang, J., and X. Gu. 1998. Correlation between the substitution rate and rate variation among sites in protein evolution. Genetics 149:16151625
Zurawski, G., and M. T. Clegg. 1987. Evolution of higher-plant chloroplast DNA-encoded genes implications for structure-function and phylogenetic studies. Annu. Rev. Plant Physiol. 38:391418[ISI]