A Case for Evolutionary Genomics and the Comprehensive Examination of Sequence Biodiversity

David D. Pollock, Jonathan A. Eisen, Norman A. Doggett and Michael P. Cummings

*Theoretical Biology and Biophysics, Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico;
†Department of Biological Sciences, Louisiana State University at Baton Rouge;
‡Institute for Genomic Research, Gaithersburg, Maryland;
§Genomics Group, Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico; and
||The Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Marine Biological Laboratory, Woods Hole, Massachusetts


    Abstract
 TOP
 Abstract
 Introduction
 Areas of Impact
 A Strategy for Moving...
 The Case for an...
 Assessing the Feasibility of...
 Evolutionary Genomics Is Cost...
 Summary
 Acknowledgements
 literature cited
 
Comparative analysis is one of the most powerful methods available for understanding the diverse and complex systems found in biology, but it is often limited by a lack of comprehensive taxonomic sampling. Despite the recent development of powerful genome technologies capable of producing sequence data in large quantities (witness the recently completed first draft of the human genome), there has been relatively little change in how evolutionary studies are conducted. The application of genomic methods to evolutionary biology is a challenge, in part because gene segments from different organisms are manipulated separately, requiring individual purification, cloning, and sequencing. We suggest that a feasible approach to collecting genome-scale data sets for evolutionary biology (i.e., evolutionary genomics) may consist of combining DNA samples prior to cloning and sequencing, followed by computational reconstruction of the original sequences. This approach will allow the full benefit of automated protocols developed by genome projects to be realized; taxon sampling levels can easily increase to thousands for targeted genomes and genomic regions. Sequence diversity at this level will dramatically improve the quality and accuracy of phylogenetic inference, as well as the accuracy and resolution of comparative evolutionary studies. In particular, it will be possible to make accurate estimates of normal evolution in the context of constant structural and functional constraints (i.e., site-specific substitution probabilities), along with accurate estimates of changes in evolutionary patterns, including pairwise coevolution between sites, adaptive bursts, and changes in selective constraints. These estimates can then be used to understand and predict the effects of protein structure and function on sequence evolution and to predict unknown details of protein structure, function, and functional divergence.
In order to demonstrate the practicality of these ideas and the potential benefit for functional genomic analysis, we describe a pilot project we are conducting to simultaneously sequence large numbers of vertebrate mitochondrial genomes.


    Introduction
 
Following the first complete genome sequence from a free-living organism (Fleischmann et al. 1995), the complete genomes of a number of crown eukaryotes have been sequenced (i.e., Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and a draft of Homo sapiens [C. elegans Consortium 1998; Adams et al. 2000; Ball et al. 2000]). The sequencing of D. melanogaster was particularly notable in that it was completed via whole-genome shotgun sequencing, a method previously thought applicable only to bacterial-sized genomes (Fleischmann et al. 1995; Adams et al. 2000). The human genome is also being sequenced utilizing a total-genome shotgun approach in combination with the standard directed approach (Venter, Smith, and Hood 1996; Weber and Myers 1997; Venter et al. 1998). While the data from these genomes can provide many insights into evolution, the inherently comparative nature of evolutionary biology limits the evolutionary questions that can be addressed with only these few genomes.

The paucity of genomes available from crown eukaryotes leads to common challenges for phylogenetics, molecular evolution, and functional genomics analysis, since parameters of evolutionary models, including topology and rates, will tend to have high variances as a result. This situation is particularly unacceptable for functional genomics analysis: with the completion of sequencing of the human genome, one of the most important problems of the coming century will be to develop a complete understanding of how that sequence functions to carry out essential life processes, and a major route to that understanding will be comparative analysis of sequence biodiversity. Despite the accomplishments of genomics research and the associated development of strategies, techniques, and tools for large-scale sequencing, these advances have had very limited influence on the way molecular-based evolutionary studies are conducted. In order to reduce the variance of evolutionary model parameters and increase the accuracy of comparative analysis, we need data sets from large genomic regions with much more extensive sampling of divergent taxa than in data sets that are currently available.

To address this problem, we describe here the potential challenges of evolutionary genomics, which we define as the application of genome research strategies to comparative studies in evolution. We consider possible solutions to these challenges, outline a research design for applying evolutionary genomics to vertebrate mitochondria, and describe simulated sequence assembly experiments to verify the feasibility of our design.

A key element of our plan is to reduce costs associated with cloning and handling of materials by pooling DNA samples from different gene regions and diverse species prior to cloning. Thus, the costs specific to individual samples are limited to those that occur prior to cloning, and the costs of all subsequent steps are shared by all samples. Although breaking the direct association between sequences and samples is a counterintuitive approach, these associations can be re-created using automated assembly programs in combination with preexisting sequence information which can be used as an evolutionary reference. In cases where there is ambiguous assignment of sequences to samples, limited PCR and sequencing can be performed on the original samples to resolve these uncertainties. The simulated resampling and assembly experiments indicate that this strategy will be practical and cost-effective. Individuals from the same species can also be pooled but are unlikely to be separable into unique contiguous sequences, or haplotypes; rather, they would contribute to estimates of intraspecific variation.
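As a rough illustration of how assembled contigs from a pooled library might be reassociated with their source samples, the following Python sketch scores each contig against a set of reference sequences by shared k-mers and flags cases where the best match is not clearly better than the runner-up. It is a minimal sketch under stated assumptions, not the procedure used in this project: the function names, the k-mer containment score, and the `min_margin` threshold are all hypothetical, and exact k-mer matching is only a crude stand-in for the alignment-based comparison that divergent taxa would require.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(contig_kmers, reference_kmers):
    """Fraction of the contig's k-mers that also occur in the reference."""
    if not contig_kmers:
        return 0.0
    return len(contig_kmers & reference_kmers) / len(contig_kmers)


def assign_contig(contig, references, k=8, min_margin=0.1):
    """Assign a contig to the best-matching reference.

    references: dict mapping a sample name to a reference sequence.
    Returns (best_name, ambiguous). 'ambiguous' is True when the best score
    beats the runner-up by less than min_margin (a hypothetical threshold),
    which would mark the contig for follow-up PCR on the original samples.
    """
    contig_kmers = kmers(contig, k)
    scores = sorted(
        ((containment(contig_kmers, kmers(ref, k)), name)
         for name, ref in references.items()),
        reverse=True,
    )
    best_score, best_name = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0.0
    return best_name, (best_score - runner_up) < min_margin


# Toy data only; real references would be related mitochondrial genomes.
toy_references = {
    "taxonA": "ACGTACGTTAGCTAGCTAGGATCC",
    "taxonB": "TTTTGGGGCCCCAAAATTTTGGGG",
}
print(assign_contig("ACGTTAGCTAGC", toy_references, k=4))
```

Contigs flagged as ambiguous correspond to the cases described above in which limited PCR and sequencing of the original samples would be needed.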

Our focus is thus on large-scale sequence analysis of samples from divergent taxa, which we call sequence biodiversity (or genomic biodiversity in the case of complete organelle or nuclear genomes) analysis in order to distinguish it from genomic diversity studies that focus on intraspecific variation (usually in humans). While a great deal of biodiversity is currently being explored at deep taxonomic levels with the sequencing of complete bacterial genomes, these taxa are generally too divergent to be useful in evaluating many important evolutionary processes which occur on a much shorter timescale. Even with the current array of complete genomes, it is expected that around half of the genes in the human genome will be so diverged that they will not be obviously homologous to any genes of known function. Thus, what is clearly needed is greater focus on sequence biodiversity among taxa much more closely related to humans, that is, on the near-human evolutionary environment.


    Areas of Impact
 
The primary products of a large-scale approach to evolutionary genomics are listed in table 1. The immediate products are sequences of large regions from many genomes. While there are many areas of research that would benefit from large-scale sequencing of diverse biota, we briefly describe here the possible impact on several of the most important: phylogenetics, the study of patterns and processes of sequence evolution, functional and structural genomics, and quantitative and population genetics.


Table 1 Primary Products of Large-Scale Evolutionary Genomic Sequencing

 
Phylogenetics
Inferences concerning the historical relationships between organisms (phylogenetics) are a prerequisite for accurate comparative analyses. The accuracy of these inferences will be greatly enhanced by the data generated through evolutionary genomics in two principal ways: greater sequence information for each taxon and higher density taxon sampling. The benefits of larger amounts of sequence data per taxon for the accurate inference of phylogenetic relationships have been clearly established in studies examining the sampling properties of DNA sequence data in phylogenetic analysis (Cummings, Otto, and Wakeley 1995, 1999; Otto, Cummings, and Wakeley 1996). These studies demonstrated that accurate estimation of phylogenetic relationships required large amounts of sequence information relative to most molecular systematic studies and that significant improvements were obtained with complete mitochondrial genome sequences. The weakness of phylogenies based on single genes (Cao et al. 1994, 1998; Meyer 1994; Zardoya and Meyer 1996) and the positive effects of increased taxon sampling have also been demonstrated for both inference of taxonomic relationships and estimation of model parameters (Hillis 1996, 1998; Graybeal 1998; Kim 1998; Poe and Swofford 1999; Pollock and Bruno 2000). It has also been shown that grouping sites into clusters with similar substitution probabilities can lead to more accurate phylogenetic reconstruction (Pollock 1998), and increased taxon sampling will allow this to be done accurately in the absence of prior knowledge (Pollock and Bruno 2000).
It is therefore expected with more extensive taxonomic sampling of large genomic regions that the site-specific features of evolutionary models will become more detailed and well defined and that this will lead to improved estimates of tree topology, ancestral characters, and dates of divergence between clades.

Pattern and Processes of Sequence Evolution
The study of patterns and processes of genome and sequence evolution will benefit significantly from evolutionary genomics. Currently, it is difficult to obtain suitable data sets that cover large amounts of sequence from different genes and common divergent biota. Without such data sets, one cannot do more than estimate gross differences in overall rates between sites or between subsections of phylogenetic trees, and the sensitivity of coevolutionary analysis is limited. The large increase in homologous sequences from a broad range of taxa will lead to more accurate estimation of evolutionary dynamics at individual nucleotide and amino acid positions, which is essential to detailed understanding of the forces of molecular evolution and their relationship to structure and function. These improvements in accuracy will come from the diversity of taxa directly, from the improved accuracy in topological inference, and from having abundant sequences from identical taxa for comparison. Recent progress has been made in analyzing evolutionary behavior at individual sites, both by limiting the parameters to the equilibrium amino acid frequencies (Bruno 1996) and by optimizing functions of underlying physicochemical properties (Koshi, Mindell, and Goldstein 1999; Yang 2000). These techniques are limited, however, by the paucity of data sets that are sufficiently large for the analysis of accurate site-specific information. Likewise, analysis of the interaction between sites, or coevolution, has recently progressed to incorporate a wider diversity of coevolutionary maximum-likelihood models with reliable statistics (Pollock and Taylor 1997; Pollock, Taylor, and Goldman 1999; W. J. Bruno, personal communication). In each case, accurate inferences about the patterns and processes of sequence evolution are limited by the amount of available data.
Evolutionary genomics will also provide site-specific information on nucleotide and codon usage bias, insertion and deletion processes, and selective processes such as adaptation, coevolution, and functional divergence.

Functional and Structural Genomics
The increased accuracy in the estimation of evolutionary processes will enable improved correlation of natural substitutions to function and three-dimensional structure. There has been a great deal of success with attempts to make such correlations using currently available data sets (Malcolm et al. 1990; Irwin and Wilson 1991; Goldman, Thorne, and Jones 1996; Karplus et al. 1997; O'Brien, Wienberg, and Lyons 1997; Eisen 1998; Golding and Dean 1998; Clark 1999; Cort et al. 1999; D'Onofrio et al. 1999; Frishman, Goldstein, and Pollock 2000), and improved sampling of biodiversity will certainly lead to more abundant and accurate hypotheses for testing based on improved models. For example, catalytic properties of ancient ribonucleases were studied by predicting the ancestral sequences using 15 existing artiodactyl sequences (Jermann et al. 1995). Since ancestral reconstructions of this kind are known to be highly inaccurate with so few sequences (Yang, Kumar, and Nei 1995), the association of a large increase in activity with ruminant digestion, and the timing of that increase in activity, could only be helped by more accurate predictions from a larger number of ancestral sequences. The results of natural evolutionary experiments in mutagenesis and selection are inferable only indirectly, and the quality of those inferences will improve dramatically through more detailed examination of the extant products.

Functionally conserved segments within the regulatory and coding regions of genes can also be efficiently identified by comparing sequences from divergent taxa, a process called evolutionary filtering (Zurawski and Clegg 1987). As with other evolutionary analyses, the power of evolutionary filtering increases rapidly with the number of related sequences examined. In the protein structural realm, it is common practice to infer the functional importance of structural regions from the sequence conservation in that region, and internalization of conserved hydrophobic regions is one of the strongest components of successful structure prediction (Livingstone and Barton 1996; Thompson and Goldstein 1996). The functional importance of structural regions can also be inferred from sequence hypervariability, as with the antigen recognition sites of HLA, plant disease resistance genes, fertilization genes, and surface antigens on viruses such as HIV and influenza. Recent work has also linked changes in evolutionary conservation and variability to functional divergence between duplicated paralogs (Zhang and Gu 1998; Gu 1999). Thus, even simple models defining differences in evolutionary rate alone, focusing on extremes of the distribution, are extremely important.
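The simplest form of this idea, ranking alignment columns by how variable they are, can be sketched in a few lines of Python. The example below is illustrative only: it scores each column of a toy alignment by Shannon entropy and reports the fully conserved columns. The function names, the entropy measure, and the `max_entropy` threshold are assumptions introduced here; the cited studies use model-based methods that also account for phylogeny.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gap characters.

    Columns made up entirely of gaps return infinity so that they are never
    reported as conserved.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return float("inf")
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def conserved_sites(alignment, max_entropy=0.0):
    """Return 0-based indices of columns with entropy <= max_entropy.

    alignment: list of equal-length aligned sequences (strings).
    """
    return [i for i, col in enumerate(zip(*alignment))
            if column_entropy(col) <= max_entropy]


# Toy alignment of four sequences; only the fourth column varies.
toy_alignment = ["ACGTAC", "ACGAAC", "ACGCAC", "ACGGAC"]
print(conserved_sites(toy_alignment))
```

With only four sequences, an invariant column is weak evidence of constraint; the same score becomes far more informative as the number of sampled taxa grows, which is the point made above about the power of evolutionary filtering. Raising `max_entropy` relaxes the filter, and the highest-entropy columns are candidates for the hypervariable sites discussed above.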

More detailed phylogenetic analyses have correlated finer structural details (such as positions of catalytic and ligand-binding sites, secondary structure features, and subunit interaction surfaces) to differences in the rate and pattern of evolutionary substitution (Goldman, Thorne, and Jones 1996, 1998; Koshi and Goldstein 1996; Thompson and Goldstein 1996, 1997; Thorne, Goldman, and Jones 1996; Koshi, Mindell, and Goldstein 1999; Dean and Golding 2000). There is every reason to believe that as the amount of evolutionary information increases dramatically, so will the accuracy in predicting structural and functional features.

The potential importance of these predictions to functional genomics cannot be overstated. When the human genome is completely annotated, it can be expected that up to half of the genes will not be homologous to any genes of known function, and that many of the genes which are homologous to known genes will be far enough diverged that their exact function will be ambiguous. Large-scale analysis of expression patterns will aid in functional prediction in an efficient manner, but expensive and time-consuming experiments will eventually need to be performed. Predictions based on evolutionary genomics will provide a means of narrowing experimental possibilities to a feasible number and will allow better prediction of the possible effects of substitutions on structure and function, even when the exact structure is unknown or the gene is only remotely related to a gene of known structure.

Quantitative and Population Genetics
The inclusion of multiple individuals of the same species can provide information on the distributions and patterns of genetic polymorphisms. The simultaneous sequencing of multiple individuals has been discussed in the context of human genome research (Weber and Myers 1997), and sample pooling strategies for mutation detection have been explored (Amos, Frazier, and Wang 2000). Quantitative trait locus mapping, marker-assisted breeding, and other research using polymorphic genetic markers (e.g., single-nucleotide polymorphisms, simple sequence repeats, and microsatellites) will benefit from the increased density of polymorphic markers that evolutionary genomics provides. These additional markers can be incorporated in the design of genome scanning for mapping studies, resulting in increased statistical power.

Improved estimates of genetic diversity resulting from evolutionary genomics will also benefit population genetics research. The pattern and distribution of nucleotide diversity reflect the major forces acting on populations: mutation, selection, drift, and migration. Heterogeneity in patterns and processes of molecular evolution across genomes can be determined only through examination of multiple sequences from different genomic regions. Past studies based on extensive intraspecific-intragenomic sampling have already led to better characterization of the correlation of polymorphism and recombination (Begun and Aquadro 1992), which motivated the theory of background selection (Charlesworth, Morgan, and Charlesworth 1993; Charlesworth 1994; Charlesworth, Charlesworth, and Morgan 1995), provided a clearer understanding of the differential mutation rates between sex chromosomes and autosomes (Filatov et al. 2000), and provided better estimates of differential mutation rates between sexes (Huttley et al. 2000). A number of statistical measures and tests of natural selection have been developed based on the pattern and distribution of nucleotide diversity within and among genomes (reviewed in Tajima 1993; Clegg 1997; Wayne and Simonsen 1998). The ability to accurately estimate population genetics parameters and the power to discriminate among competing evolutionary hypotheses are directly related to the sampling employed in population genetics studies, with larger samples leading to greater accuracy and more power (Simonsen, Churchill, and Aquadro 1995). Therefore, the marginal cost of including additional individuals in evolutionary genomics projects can be weighed against the benefits of more accurate characterization of nucleotide diversity and, hence, more accurate estimates of population genetics parameters over multiple loci.


    A Strategy for Moving from Small-Scale to Large-Scale Surveys of Sequence Biodiversity
 
The current research design model in molecular-based studies of evolution can be described as a gene-based strategy characterized by low-throughput sequencing of one or a few short regions at a time. This approach is inherently limited and does not take full advantage of developments in high-throughput technologies and associated cost efficiencies. Some principal characteristics of gene-based and genome-based strategies are compared in table 2. There are a number of procedural steps in any genomic study. These can be viewed as problems or challenges to be overcome, the solutions of which should be optimized to minimize cost without sacrificing accuracy. Analogous to arguments for whole-genome shotgun sequencing (Fleischmann et al. 1995; Venter, Smith, and Hood 1996; Weber and Myers 1997; Venter et al. 1998), we describe, in general terms, some major problems associated with moving from gene-based to genome-based research, and we discuss their solutions. We briefly describe a strategy for evolutionary genomic sequencing (fig. 1) and argue that this approach is less costly and will result in more rapid acquisition of useful information than the gene-based strategies currently dominating evolutionary studies.


Table 2 Comparison of Gene-Based and Genome-Based Strategies

 


Fig. 1.—Flow chart for the proposed cloning, sequencing, and assembly procedure

 
Acquisition and Preparation of Samples
In contrast to typical single-taxon genomic projects, which generally require a single sample of a common and readily available species, evolutionary genomics requires multiple samples, often from a wide diversity of taxa, some of which may not be readily available. While DNA purification using CsCl density gradient centrifugation or differential lysis may be preferable for organelle genomes when sufficient tissue is available, PCR amplification can be applied more generally and can be used when tissue samples from particular taxa are rare or expensive to collect. PCR amplification of mitochondrial genomes or large regions of nuclear DNA in a single large piece or a few smaller pieces has already been well demonstrated (Chang, Huang, and Lo 1994; Cheng et al. 1994a, 1994b; Nelson, Prodohl, and Avise 1996; Mindell et al. 1999; Miya and Nishida 1999), and these long-PCR techniques can be applied to tissues in a wide variety of preservation states. Curators and staff at several natural history museums and zoological parks (e.g., Louisiana State University, the University of California at Berkeley, the University of New Mexico, and San Diego Zoological Park) have well-documented frozen tissue collections that are sources of samples suitable for evolutionary studies.

Cloning and Sequencing Template Preparation
One factor that distinguishes a gene-based program from a genomics-based program is the size of the DNA sequence region that is examined at one time. A principal challenge is to take advantage of genomic approaches when the sequence region from any single taxon is relatively short. One way to overcome this challenge is to pool sequence regions across individuals or taxa so that the aggregate sequence corresponds to amounts well suited to genomic methods. For example, pooling 10–20 animal mitochondrial genomes, typically each about 16 kb in length, would yield an aggregate sequence comparable to a BAC, or about 0.1–0.2 times the size of many bacterial genomes. Pooling ratios should be based on estimates of the number of molecules rather than DNA concentration in order to keep the concentration of each site constant in the face of length variation (e.g., differences in intron length or hypervariable regions). Pooling samples reduces the number of times labor-intensive steps involved in cloning (fractionation, size selection, ligation, transformation, etc.) have to be performed and allows for larger numbers of templates to be processed simultaneously without the need to track individuals or taxa during the sequencing phase. Pooling samples itself presents challenges with respect to sequence assembly and assignment of a specific sequence to a specific taxon or haplotype. These challenges are readily dealt with through proper taxon sampling and direct PCR of the original samples.
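The arithmetic behind pooling by number of molecules rather than by DNA concentration can be sketched as follows. This is an illustrative Python calculation, not a laboratory protocol: the function names and sample values are hypothetical, and the figure of roughly 650 daltons per base pair is the usual approximation for double-stranded DNA, stated here as an assumption.

```python
AVOGADRO = 6.022e23       # molecules per mole
DALTONS_PER_BP = 650.0    # assumed average mass of one double-stranded base pair


def mass_ng_for_copies(length_bp, copies):
    """Mass in nanograms of 'copies' double-stranded molecules of length_bp."""
    grams = copies * length_bp * DALTONS_PER_BP / AVOGADRO
    return grams * 1e9


def pooling_volumes_ul(samples, copies_per_sample):
    """Volume (microliters) of each sample that contributes the same number
    of molecules to the pool.

    samples: dict mapping a sample name to (length_bp, concentration_ng_per_ul).
    """
    return {
        name: mass_ng_for_copies(length_bp, copies_per_sample) / conc
        for name, (length_bp, conc) in samples.items()
    }


# Hypothetical samples: same concentration, different fragment lengths.
example = {
    "species_1": (16_000, 10.0),
    "species_2": (17_500, 10.0),
}
for name, volume in pooling_volumes_ul(example, copies_per_sample=1e9).items():
    print(f"{name}: {volume:.3f} ul")
```

At equal mass concentration, the longer molecule requires proportionally more volume to contribute the same number of copies, which is why pooling by concentration alone would underrepresent each site of the longer genomes.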

Sequencing
The standard procedures of genomic sequencing, including large-scale automatic sequencing of random shotgun clones, are utilized in evolutionary genomic research. Specific details such as the appropriate mixture of single- and double-ended sequencing, insert size, amount of coverage, gap closure methods, and association of contigs with taxa may vary from project to project and can be optimized to minimize cost. Our experience and limited simulation studies have indicated, however, that cost savings associated with optimizing these details are small and may be offset by costs associated with tracking clones and modifying protocols already in place at genome centers. Thus, our preferred strategy, in line with current protocols at many genomic centers, is to randomly clone inserts of 2–3 kb in length and sequence them automatically at both ends to an average redundancy of sevenfold coverage of each genomic region. This strategy minimizes costs associated with steps requiring human thought and labor, particularly the choice of clone-specific primers and final gap closure.
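To make the sevenfold-coverage figure concrete, the following Python sketch estimates how many reads and double-ended clones a pooled library would need, and the fraction of bases expected to remain unsequenced under the standard Poisson (Lander-Waterman) approximation. The read length of 500 bases and the pool of 20 genomes of 16 kb are assumptions chosen for illustration, and the calculation ignores unequal representation of samples in the pool.

```python
import math


def reads_needed(total_bp, coverage, read_length_bp):
    """Number of reads required to reach the target average coverage."""
    return math.ceil(coverage * total_bp / read_length_bp)


def expected_uncovered_fraction(coverage):
    """Expected fraction of bases with zero reads, assuming reads start at
    random positions (Poisson approximation)."""
    return math.exp(-coverage)


pool_bp = 20 * 16_000                     # assumed pool: 20 genomes of 16 kb
reads = reads_needed(pool_bp, coverage=7, read_length_bp=500)
clones = math.ceil(reads / 2)             # each clone is sequenced at both ends

print(f"reads: {reads}, clones: {clones}")
print(f"expected uncovered fraction: {expected_uncovered_fraction(7):.6f}")
```

Under these assumptions the pool requires 4,480 reads from 2,240 clones, and fewer than 0.1% of bases are expected to be missed, which is consistent with the small final gap closure effort described above.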

A critical consideration is that of identifying and minimizing the effects of sequencing errors, since they will bias estimates of variation and potentially create problems in assembly and annotation. There are several distinct sources of sequence error, depending on the source of DNA. Simply, these can be divided into error from PCR in those cases where PCR is used to generate material for cloning, and error typically associated with DNA sequencing itself. The error associated with DNA sequencing is reduced by multiple coverage of all regions sequenced. This multiple coverage for most sequence regions, sevenfold on average, is an inherent aspect of the sequencing of random shotgun clones. Furthermore, sequencing errors generally result in lower confidence in the sequencing calls (i.e., lower quality values) compared with correctly called bases, and standard assembly programs use this quality information from redundantly sequenced bases to reduce error propagation in the assembly process (Weber and Myers 1997).

PCR error due to the misincorporation of nucleotides during polymerization is expected to be similar in distribution to sequencing error. In contrast to sequencing error, however, the quality values associated with PCR errors will be indistinguishable from correct sequence, since incorrect bases will be incorporated prior to cloning. Given a large number of initial starting templates, most misincorporated bases will be of low frequency and, thus, distinguishable from the majority of authentic variation at most sites. In projects with one individual per species, there will be little natural variation (none in the case of mitochondria in the absence of heteroplasmy), and these cases can be used to obtain an accurate estimate of PCR error in the system. This estimate can then be used to correct estimates of sequence variation (particularly important in the low-frequency range) in projects with multiple individuals per species, and to calculate the probability that any given variant is real as opposed to an artifact of PCR. For heterozygotes, both nucleotides segregating at a site should be well represented in the raw sequences and mostly distinguishable from PCR error.
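One simple way to use such an error estimate is a binomial tail calculation: given a per-base PCR error rate, how likely is it that error alone would produce a variant seen in at least k of the n clones covering a site? The Python sketch below illustrates this under stated assumptions. It treats clones as independent draws, uses a hypothetical error rate, and gives only the probability under an error-only model, not the full probability that a variant is real, which would also require a prior on true variation.

```python
from math import comb


def prob_at_least_k_errors(n_reads, k, error_rate):
    """P(X >= k) for X ~ Binomial(n_reads, error_rate).

    Interpreted here as the chance that PCR error alone yields the same
    variant in at least k of n_reads independent clones at one site.
    """
    return sum(
        comb(n_reads, i) * error_rate**i * (1 - error_rate) ** (n_reads - i)
        for i in range(k, n_reads + 1)
    )


assumed_error_rate = 1e-3   # hypothetical; would be estimated from single-individual projects
for k in (1, 2, 3):
    p = prob_at_least_k_errors(n_reads=7, k=k, error_rate=assumed_error_rate)
    print(f"variant seen in >= {k} of 7 clones by error alone: {p:.3g}")
```

The probability drops steeply as k increases, which reflects the argument above: a heterozygous site should be well represented among the raw sequences, whereas misincorporated bases should mostly appear at low frequency.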

Postsequencing Data Processing
Evolutionary genomics requires the same vector clipping, sequence fragment assembly, and other raw sequence data processing employed in genomics research and can be achieved using available software (e.g., Phred/Phrap/Consed, Staden Package, and/or the TIGR Assembler; Bonfield, Smith, and Staden 1995; Bonfield and Staden 1995; Sutton et al. 1995; Staden 1996; Ewing and Green 1998; Ewing et al. 1998). In projects with multiple individuals per species, mutation detection software (e.g., Bonfield, Rada, and Staden 1998) may prove useful in assigning variants to species contigs. Raw data processing that requires special applications may include sequence splitting at primers to break up chimeras resulting from ligation of PCR fragments from different genomic regions or taxa. This sequence splitting can be easily accomplished using a Perl script after vector clipping and before sequence assembly.
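The primer-splitting step might look something like the following. The text above mentions a Perl script; this sketch uses Python instead and is not the authors' script. It assumes exact primer matches and a read alphabet of A, C, G, and T, and the function names are hypothetical. A primer in forward orientation is taken to mark the start of a PCR fragment, and its reverse complement the end of one, so an internal occurrence of either indicates a ligation junction.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(seq, pattern):
    """Yield every start index of pattern in seq, including overlapping hits."""
    start = seq.find(pattern)
    while start != -1:
        yield start
        start = seq.find(pattern, start + 1)


def split_at_primers(read, primers, min_length=1):
    """Split a read at internal primer sites and return the pieces.

    Cuts are placed immediately before a forward-oriented primer and
    immediately after a reverse-complemented primer. Pieces shorter than
    min_length are discarded.
    """
    cuts = set()
    for primer in primers:
        cuts.update(find_all(read, primer))
        rc = reverse_complement(primer)
        cuts.update(pos + len(rc) for pos in find_all(read, rc))
    bounds = [0] + sorted(c for c in cuts if 0 < c < len(read)) + [len(read)]
    pieces = [read[a:b] for a, b in zip(bounds, bounds[1:])]
    return [p for p in pieces if len(p) >= min_length]


# Toy chimera: one fragment ending in CCTT joined to another starting with AAGG.
print(split_at_primers("TTTTCCTTAAGGCCCC", ["AAGG"]))
```

A production version would need to tolerate sequencing errors within the primer sites, for example by allowing a small number of mismatches, which this sketch does not attempt.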

The task of sequence annotation is greatly simplified for evolutionary genomics projects compared to standard genomics projects because a great deal of information is available a priori with respect to the gene content and other features of the sequences being examined. This prior knowledge and availability of comparative data makes evolutionary genomics extremely well suited for annotation using case-based reasoning (Overton and Haas 1998), since there will already be a moderate number of homologous sequences available. Although appropriate software is not currently available, this prior knowledge can theoretically be used to guide sequence assembly, allowing for the ordering and linking of contigs with extremely short overlaps and thus reducing the amount of sequencing redundancy required for a minimal final gap closure effort. Subsequent steps, such as data preparation for GenBank submission, can be accomplished using available tools.

Evolutionary data analysis for inferring phylogenetic relationships, characterizing patterns of sequence evolution, and estimating population genetic parameters can also be accomplished using standard methods and available software. The only difference will be an increase in the numbers and lengths of sequences analyzed. Faster computers, improvements in algorithms (e.g., Lewis 1998), and parallel computing implementation of algorithms will continue to decrease the real time of analysis. With a rapid increase in evolutionary genomics data, there will be, however, a qualitative change in the analytical potential of the data that should spur development of novel analytical approaches.


    The Case for an Examination of Vertebrate Mitochondrial Genomes
 TOP
 Abstract
 Introduction
 Areas of Impact
 A Strategy for Moving...
 The Case for an...
 Assessing the Feasibility of...
 Evolutionary Genomics Is Cost...
 Summary
 Acknowledgements
 literature cited
 
In order to give a concrete demonstration of our approach and the potential benefit for functional genomic analysis, we are implementing a pilot project to simultaneously sequence large numbers of vertebrate mitochondrial genomes. Mitochondrial genes and genomes have long been a major focus in molecular evolution, and these genomes are an excellent candidate for working out the details and demonstrating the power of evolutionary genomics. They have the advantage that they are present in high concentrations in many tissues, they are reliably amplified by PCR, and they can easily be enriched by purification of the mitochondria prior to DNA extraction (e.g., Dowling et al. 1996). The vertebrate focus of our model experiment is of rare benefit in that, unlike many primarily bacterial data sets, proteins within the vertebrate mitochondria are unlikely to have diverged to such an extent that the structural context has dramatically altered over the data set (Lesk and Chothia 1980; Chothia and Lesk 1987; Orengo et al. 1999). Thus, evolutionary analysis will not be blurred by mixing sites with widely diverged evolutionary and coevolutionary dynamics. In addition, the relatively high degree of amino acid conservation will reduce the amount of ambiguous sequence alignment, thus reducing the probability of incorrectly inferring stationary evolutionary processes for nonhomologous sites.

Mitochondrial genomes also have a strong advantage over nuclear genes in that they are unlikely to have experienced many intraspecific recombination events (but see recent controversy; Arctander 1999; Awadalla, Eyre-Walker, and Smith 1999; Merriweather and Kaestle 1999; Awadalla, Eyre-Walker, and Maynard Smith 2000; Kivisild et al. 2000). Thus, there is more likely to be a single mitochondrial phylogeny, as opposed to nuclear gene or gene segment phylogenies, which may be composites of different phylogenies. Within the vertebrates, there are few known rearrangements of genes, and none of protein-coding genes (Curole and Kocher 1999), so phylogenies are also unlikely to be confounded by inaccurate reconstruction of those events.

A detailed review concerning the effects of the current set of complete mitochondrial genomes on questions in vertebrate phylogenetics has been written (Curole and Kocher 1999), but in summary, as of March 2000, there were 69 complete vertebrate mitochondrial genomes publicly available, with slightly more than half coming from mammals (fig. 2). While many phylogenetic questions have been resolved with complete vertebrate mitochondrial genomes, many more are currently ambiguous or directly at odds with morphological and other sequence data. Thus, increasing the vertebrate mitochondrial genome data set, particularly by breaking up many of the long unbroken branches (e.g., within the rodents, bats, marsupials, snakes, lizards, turtles, and amphibia), is likely to have a large impact on confidence in the resolution of the tree structure.



 
Fig. 2.—Approximate phylogenetic tree of amino acid sequences for 69 vertebrate mitochondrial genomes. Taxa are identified by their generic name, in italics, and a subset of identifiable and recognizable taxonomic clusters are labeled on the right. The tree was reconstructed using Jones-Taylor-Thornton matrix-based distances (Jones, Taylor, and Thornton 1992) and the neighbor-joining algorithm (Saitou and Nei 1987). Despite the visual rooting, the tree is an unrooted tree and is presented only as a visual approximation of the relationships among the currently available mitochondria. Bootstrap values are not given, and many of the branching relationships shown are undoubtedly incorrect.

 
The genomes of vertebrate mitochondria are also appropriate in that they contain a diversity of genes with different amounts of structural and functional information available, and with segments experiencing different structural and functional contexts and, thus, different types of selective pressure. With the cloning and sequencing of the first few mitochondrial genomes (Anderson et al. 1981, 1982a, 1982b; Bibb et al. 1981; Roe et al. 1985; Gadaleta et al. 1989; Desjardins and Morais 1990; Arnason, Gullberg, and Widegren 1991; Arnason and Johnsson 1992; Tzeng et al. 1992), it became clear that in addition to a control region, the vertebrate mitochondrial genome generally contains 12S and 16S ribosomal RNAs, 22 transfer RNAs, and 13 protein-coding genes that vary in length and average rate of evolution. These proteins are subunits of four different molecular complexes involved in oxidative phosphorylation and ATP synthesis: NADH reductase (NAD), cytochrome oxidase (CO), cytochrome bc1 (CYTB), and ATP synthase (ATP). These complexes are large and have many nuclear-encoded subunits in addition to the mitochondrion-encoded proteins. For three of them, CO, CYTB, and ATP, the structure is complete or many subunits have been crystallized, although the mitochondrion-encoded subunits of ATP (subunits a and b; ATP6 and ATP8) are the two major proteins in the complex which remain to be crystallized (Abrahams et al. 1994; Takeyasu et al. 1996; Tsukihara et al. 1996; Yoshikawa, Tsukihara, and Shinzawa-Itoh 1996; Shirakihara et al. 1997; Uhlin, Cox, and Guss 1997; Iwata et al. 1998; Yoshikawa, Shinzawa-Itoh, and Tsukihara 1998; Vik et al. 2000). These protein subunits thus represent a wide variety of evolutionary rates, structural and functional contexts (including alpha helices, beta sheets, turns, random coils, and transmembrane regions), and examples of interaction between positions within and between protein domains and subunits. Defects in these genes have also been linked to neurological diseases, aging, and cell death, and thus there is potential to accurately relate genomic biodiversity to both normal and disease-related intraspecific genomic diversity in humans (Wallace et al. 1995).


    Assessing the Feasibility of the Evolutionary Genomics Strategy: Experiments with Existing Mitochondrial Genomes
 
The potential assembly problems with our evolutionary genomics strategy are different from those of standard genomic projects in that misleading regions of identity in the sequence data arise not from repetitive elements, but from homologous regions in divergent taxa. The nature of the problem is very similar, however, in that these regions can lead to misassembling of contigs if they are not accounted for. In order to assess the feasibility of an evolutionary genomics strategy, computer simulations were conducted.

We tested our ability to assemble real sequences by resampling from known genomes. Sequences of 10 complete mitochondrial genomes available in GenBank were sampled following the protocol outlined above. These mitochondrial genomes were those of the human (H. sapiens; Anderson et al. 1981), the mouse (Mus musculus; Bibb et al. 1981), the cow (Bos taurus; Anderson et al. 1982b), the gorilla (Gorilla gorilla; Horai et al. 1995), the rat (Rattus norvegicus; Gadaleta et al. 1989), the finback whale (Balaenoptera physalus; Arnason, Gullberg, and Widegren 1991), the horse (Equus caballus; Xu and Arnason 1994), the domestic cat (Felis catus; Lopez, Cevario, and O'Brien 1996), the armadillo (Dasypus novemcinctus; Arnason, Gullberg, and Janke 1999), and the white rhinoceros (Ceratotherium simum; Xu and Arnason 1997), for a total of 183,298 bp.

To simulate the cloning process, insert size and raw sequence reads from both ends were sampled uniformly within a small range of length (2,000 bp ± 15% for insert length, 500 bp ± 5% for sequence read length) and a small range of relative concentrations of samples (± 15%). These distributions do not reflect the exact distributions of read lengths and insert sizes in genomic cloning, but they are roughly compatible with ranges observed (unpublished data); these parameters have been shown in previous simulations not to have a strong effect on simulated assembly results (Weber and Myers 1997). Standard genome projects do not have to deal with mixing of samples, so we do not have data on what the range of relative concentrations of samples will be, but the range used was intended to reasonably incorporate the variability that might be expected from DNA concentration estimates. The expected effect of a large underestimate of DNA concentration in a particular sample is that the sample will be sequenced at lower-than-optimal redundancy, and it will not be possible to fully assemble that genome. It is entirely feasible, however, that such a sample could be added at an appropriate concentration to a subsequent round for completion, while the first round could be terminated early on completion of the other genomes, thus only minimally affecting the overall cost of sequencing. Overestimates of DNA concentration in a single sample would have to be very large to dramatically affect the sequencing strategy and would generally simply lead to wasted sequence effort in direct proportion to the percentage of the overestimate. In extreme cases, sequencing would be terminated after assembly of the high-concentration sample, and the lower-concentration samples would then be recloned without that sample.
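
The sampling scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the simulation code used for the experiment: the function names, the treatment of each genome as circular, the weighting of clone origin by genome length times a concentration factor, and the default insert length of about 2,000 bp are all choices made here for the sketch.

```python
import random

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def sample_clone_reads(genomes, coverage=7.0, insert_len=2000, insert_tol=0.15,
                       read_len=500, read_tol=0.05, conc_tol=0.15, seed=0):
    """Draw paired end reads from a pool of circular genomes.

    genomes maps a sample name to its sequence. Returns a list of
    (sample_name, forward_read, reverse_read) tuples. All parameter names
    are hypothetical; the tolerances follow the ranges quoted in the text.
    """
    rng = random.Random(seed)
    names = list(genomes)
    # Doubling each sequence lets an insert wrap around the circular origin.
    doubled = {n: genomes[n] + genomes[n] for n in names}
    # Assumed model: a clone comes from a sample in proportion to its genome
    # length times a relative concentration drawn uniformly within +/- conc_tol.
    weights = [len(genomes[n]) * rng.uniform(1 - conc_tol, 1 + conc_tol)
               for n in names]
    target_bases = coverage * sum(len(genomes[n]) for n in names)
    reads, sampled = [], 0
    while sampled < target_bases:
        name = rng.choices(names, weights=weights)[0]
        n = len(genomes[name])
        ins = round(rng.uniform(insert_len * (1 - insert_tol),
                                insert_len * (1 + insert_tol)))
        ins = min(ins, n)
        start = rng.randrange(n)
        insert = doubled[name][start:start + ins]
        # One read from each end of the insert, each with its own length.
        r1 = round(rng.uniform(read_len * (1 - read_tol), read_len * (1 + read_tol)))
        r2 = round(rng.uniform(read_len * (1 - read_tol), read_len * (1 + read_tol)))
        fwd = insert[:r1]
        rev = reverse_complement(insert[-r2:])
        reads.append((name, fwd, rev))
        sampled += len(fwd) + len(rev)
    return reads
```

The sample name is kept with each read only so that an assembly can be scored afterwards; an assembler would be given the reads without it. Sequencing errors are not modeled in this sketch.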

In our simulation experiment, we were able to reassemble 6 of the 10 genomes (human, cow, armadillo, whale, mouse, and horse) correctly with no gaps after sampling at 7.0-fold average coverage. There were single gaps in the rat, cat, gorilla, and rhinoceros sequences of lengths 118, 56, 63, and 90 bp, respectively, for a mean gap probability of 0.4 per genome and a mean gap size of about 82 bp. The rhinoceros sequence was divided into two contigs, the smaller of which was 1,560 bp and bounded by the gap on one side and by a 13-bp overlap on the other, which was not sufficiently long to join it with the larger contig. This result is in line with expectations for random DNA sequences, and the gaps are small enough that they could easily be closed by short PCR amplification and sequencing. Thus, the assembly was not adversely affected by the relatedness of these sequences or variation in sample concentrations, nor was it dramatically affected by substantial repetitive elements in the control regions of the horse, the rhinoceros, the cat, and the armadillo.
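
As a rough check on the comparison with random sequence, the expected number of gaps can be estimated with the Lander-Waterman approximation. The text does not say which model the expectation was based on, so the use of this formula is an assumption, as are the genome length of 16,500 bp and the minimum detectable overlap of 20 bp; the reads are also treated as independent, which ignores the pairing of reads from the same insert.

```python
import math

def expected_gaps(genome_len, read_len, coverage, min_overlap):
    """Lander-Waterman estimate of the number of contigs, N * exp(-c * sigma).

    N is the number of reads, c the coverage, and sigma = 1 - T/L, where T is
    the minimum overlap needed to join two reads of length L. For a circular
    genome with at least one gap, the number of gaps equals the number of contigs.
    """
    n_reads = coverage * genome_len / read_len
    sigma = 1.0 - min_overlap / read_len
    return n_reads * math.exp(-coverage * sigma)

# Assumed values: 16,500-bp genome, 500-bp reads, 7-fold coverage, 20-bp overlap.
gaps_per_genome = expected_gaps(16_500, 500, 7.0, 20)
print(f"expected gaps per genome: {gaps_per_genome:.2f}")
```

Under these assumptions the estimate is about 0.28 gaps per genome, which is of the same order as the 0.4 observed in the simulation.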

The taxa in this experiment were chosen to be representative of how each round in an evolutionary genomics-based vertebrate mitochondrial sequencing project might be chosen. All taxa were from separate genera, but some taxon pairs, particularly the human and the gorilla, are relatively close. Are these taxa in fact representative? In order to answer this, we determined the length distribution of regions of identity for mitochondrial genomes from taxon pairs over a range of genetic distances (fig. 3). These ranged from an interclass comparison of a mammal (H. sapiens) and a bird (Gallus gallus), to the closest interspecies comparison available from the 69 taxa in figure 2, which involves Equus asinus (donkey) and E. caballus. In intragenomic comparisons within the human, gorilla, cat, and mouse genomes, the longest identical stretches are between 14 and 17 nt, with the exception of some repetitive expansions in the control region of the cat (Felis catus; data not shown). Thus, for pairwise intergenomic comparisons (fig. 3), the stretches of 15 nt or more are of the greatest interest.
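
One way to tabulate regions of identity between two sequences is to walk every diagonal of the pairwise comparison and record each maximal run of matching nucleotides. The sketch below is an assumption about how such a count could be made, not the procedure actually used for the figure; the function name and the `min_len` parameter are hypothetical, and the sketch ignores genome circularity, the reverse strand, and the moving average applied in the figure.

```python
from collections import Counter

def identical_segment_lengths(a, b, min_len=1):
    """Count maximal runs of identical nucleotides between sequences a and b.

    Every relative offset of the two sequences is examined, so a segment is
    found wherever it occurs in either genome. Returns a Counter mapping
    segment length to the number of segments of that length.
    """
    counts = Counter()
    # offset = i - j identifies one diagonal of the comparison matrix.
    for offset in range(-(len(b) - 1), len(a)):
        i, j = max(offset, 0), max(-offset, 0)
        run = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                run += 1
            else:
                if run >= min_len:
                    counts[run] += 1
                run = 0
            i += 1
            j += 1
        # A run that reaches the end of either sequence is also maximal.
        if run >= min_len:
            counts[run] += 1
    return counts

# Toy example: the two sequences share "ACGTA" and "TACGT" at different offsets.
print(identical_segment_lengths("ACGTACGT", "TTACGTAA", min_len=3))
```

The running time grows with the product of the two sequence lengths, which is tolerable for a pair of mitochondrial genomes but slow in pure Python; an index-based method would be a more practical choice for many pairwise comparisons.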



 
Fig. 3.—Distribution of identical segments in pairwise taxon comparisons. The number of identical segments is shown as a moving average (window size = 5 nt) for each segment length. The length of identical segments was defined as the largest possible contiguous stretch of identical nucleotides for any comparison. Comparisons were made over the entire lengths of each genome.

 
As expected, the most divergent comparison, that between the chicken and the human, tapered off most quickly, and there were no identical stretches longer than 35 nt (fig. 3). More closely related taxonomic pairs had more identical segments of all lengths, with the largest numbers coming from the intraequine comparison. No comparisons other than those of the intraequine (horse-donkey) and the great ape (human-gorilla) pairs had identical segments longer than 78 nt; the intraequine comparison had another 13 regions of identity of up to 93 nt, and 6 that were longer than 93 nt (lengths 97, 108, 116, 127, 128, and 205 nt), while the great ape comparison had only four regions of identity longer than 78 nt (lengths 85, 97, 127, and 164 nt). In principle, these distributions do not present a great challenge for existing assemblers, which are designed to deal with much longer repeat sequences. Even the longest identical segment in the closest taxon pair was somewhat less than half the length of a sequence read, and roughly one-tenth the length of the average cloned insert.

Since there are roughly 4,000 species of mammals, 7,000 birds, 20,000 fish, and thousands of reptiles and amphibians, the possibilities for sampling sufficiently distinct taxa within the vertebrates are not limited; it seems feasible to avoid combining species as close as the human-gorilla pair or the horse-donkey pair, meaning that identical segments longer than 80 bp would be unlikely. Furthermore, samples from many thousands of these species are available from museums for PCR sampling, if not mitochondrial purification. A pilot project is now underway to implement the procedures we have outlined.


    Evolutionary Genomics Is Cost-Effective
 
Any efficient method of sequencing can be used to generate the approximately 2.2 million nucleotides of raw sequence necessary in a typical experiment to simultaneously complete 20 mitochondrial genomes, and multiple laboratories worldwide could contribute to both the DNA purification and the sequencing stages. The key to cost savings is that once the cloning is done, all sequencing can be done automatically with the same two primers per clone, and there is no need to track clones. Thus, costs per individual genome are limited to sample acquisition, PCR (if applicable), and DNA purification. Assembly, filling of gaps, and verification of ambiguous organism assignment would probably be performed most efficiently at one central laboratory. Sequence can initially be assembled automatically using existing assembly programs, although this may be done more efficiently with yet-to-be-written specialty programs. A focused, large-scale evolutionary genomics effort will also avoid excess costs caused by piecemeal and sometimes redundant noncollaborative efforts in multiple laboratories.

It is our intention that assembled sequence should be made available in a dedicated phylogeny-oriented database, in addition to deposition in GenBank, in order to give added value in the form of comparative functional annotation, phylogenetic filtering, and easier access to mitochondrial-specific features. Again, a coordinated and centralized data-processing effort will avoid wasting resources through redundant efforts in multiple laboratories processing data to a point where it can be analyzed. Another large benefit of a centralized effort will be that all sequences can be tightly linked to some form of museum-deposited voucher as a matter of course, and DNA samples can also be regularly deposited with museums. Thus, one important function of a dedicated phylogenetic database will be to link directly back to the original museum information on all taxa included. This aspect of genomic analysis is often neglected by molecular biologists (few of the current complete mitochondrial genomes cite deposition of a voucher specimen), but it is extremely important to taxonomists and conservation biologists. Such linkage to original sources may not appear central to the techniques of evolutionary genomics, but it is central to creating and maintaining the resources used and to adding the full potential value of the sequence products.


    Summary
 
We have demonstrated here that the strategy we have outlined for evolutionary genomics is theoretically viable, practical, and likely to be cost-effective. We have focused on the sampling of sequence and genomic biodiversity because the potential of these data for phylogenetics, molecular evolution, and functional genomics is compelling. At a small marginal cost increase, samples from multiple individuals in each species can also be pooled, and the intraspecific variation data obtained will then produce a large benefit for population genetics analyses. Likewise, our example of the vertebrate mitochondrial genome is intended to show the feasibility of the strategy in a specific context rather than to restrict the possibilities. The work with mitochondria is relatively straightforward, and there are fewer interpretational problems because recombination and heteroplasmy are largely absent. Using the methods we have outlined here, the number of vertebrate mitochondrial genomes sequenced could easily be increased from the nearly one hundred today to thousands or more.

Given moderate funding, we believe that those thousands of taxa could be sequenced rapidly at roughly one-fifth the cost of conventional approaches. By comparison, the human genome involves about 130 times the sequencing effort with 10-fold coverage (ignoring preliminary mapping costs and duplication of efforts by noncooperating ventures). In order to put this into perspective, a single ABI 3700 automatic DNA sequencer running at full capacity could sequence 23 mitochondrial genomes per week and complete 2,000 genomes in 20 months. Thus, the full capacity of the DOE Joint Genome Institute could complete 2,000 vertebrate mitochondrial genomes in about a week, and Celera Corporation or the Human Genome Project could complete them in a matter of days. Considering this, we suggest that a moderate-sized program be initiated to sequence 2,000 or more complete genomes from vertebrate mitochondria to fully demonstrate the potential benefits of evolutionary genomics and genomic biodiversity. This would include perhaps 800, or one-fifth, of the mammals in order to focus more on the human evolutionary environment, and the remainder would be from birds, reptiles, amphibians, and fish. Ideally, such a sequencing effort would be conducted in conjunction with a dedicated collaborative bioinformatics program and would itself be viewed as a pilot for continuing communitywide large-scale nuclear evolutionary genomics and sequence biodiversity projects.
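
The comparisons in the preceding paragraph can be reproduced with simple arithmetic. The sketch below assumes an average vertebrate mitochondrial genome of 16,500 bp, the 7-fold coverage used in the simulation, a human genome of 3.0 billion bp, and an average month of 30.44 days; these values are illustrative assumptions rather than figures taken from the text.

```python
def raw_bases(n_genomes, genome_bp, coverage):
    """Total raw sequence (in bases) needed for n_genomes at a given coverage."""
    return n_genomes * genome_bp * coverage

# Assumed inputs (see the paragraph above for which are assumptions).
N_GENOMES = 2_000
MITO_BP = 16_500
MITO_COVERAGE = 7.0
HUMAN_BP = 3.0e9
HUMAN_COVERAGE = 10.0
GENOMES_PER_WEEK = 23  # single-sequencer throughput quoted in the text

mito_effort = raw_bases(N_GENOMES, MITO_BP, MITO_COVERAGE)
human_effort = raw_bases(1, HUMAN_BP, HUMAN_COVERAGE)
effort_ratio = human_effort / mito_effort

weeks_single_machine = N_GENOMES / GENOMES_PER_WEEK
months_single_machine = weeks_single_machine * 7 / 30.44

print(f"mitochondrial project: {mito_effort / 1e6:.0f} Mb of raw sequence")
print(f"human genome / mitochondrial project: {effort_ratio:.0f}x")
print(f"one sequencer: {weeks_single_machine:.0f} weeks, "
      f"about {months_single_machine:.0f} months")
```

With these inputs the human genome requires about 130 times the raw sequence of the 2,000-genome project, and a single sequencer would need roughly 87 weeks, or about 20 months, consistent with the figures quoted above.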


    Acknowledgements
 
Natural history museums at Louisiana State University and the University of New Mexico contributed invaluable cooperation in making available the details of their frozen tissue collections for our analyses. C.-B. Stewart and D. Mindell both contributed expert consultation in mtDNA analysis, and we thank S. Easteal and an anonymous reviewer for helpful and insightful comments on the manuscript. D.D.P. was funded by a Director's Fellowship from Los Alamos National Laboratory. M.P.C. was funded by grants from NASA, NSF, and the Alfred P. Sloan Foundation.


    Footnotes
 
Simon Easteal, Reviewing Editor

1 Keywords: evolutionary genomics, genomic biodiversity, functional genomics, comparative genomics, molecular evolution, phylogenetics

2 Address for correspondence and reprints: David D. Pollock, Department of Biological Sciences, Louisiana State University, Baton Rouge, Louisiana 70803. E-mail: daviddpollock{at}yahoo.com


    literature cited
 

    Abrahams, J. P., A. G. Leslie, R. Lutter, and J. E. Walker. 1994. Structure at 2.8 A resolution of F1-ATPase from bovine heart mitochondria. Nature 370:621–628

    Adams, M. D., S. E. Celniker, R. A. Holt et al. (195 co-authors). 2000. The genome sequence of Drosophila melanogaster. Science 287:2185–2195

    Amos, C. I., M. L. Frazier, and W. Wang. 2000. DNA pooling in mutation detection with reference to sequence analysis. Am. J. Hum. Genet. 66:1689–1692

    Anderson, S., A. T. Bankier, B. G. Barrell, M. H. L. De Bruijn, A. R. Coulson, J. Drouin, I. C. Eperon, D. P. Nierlich, and B. A. Roe. 1981. Sequence and organization of the human mitochondrial genome. Nature 290:457–465

    Anderson, S., A. T. Bankier, B. G. Barrell, M. H. L. De Bruijn, A. R. Coulson, J. Drouin, I. C. Eperon, D. P. Nierlich, and B. A. Roe. 1982a. Comparison of the human and bovine mitochondrial genomes. In P. Slonimski, P. Borst, and G. Attardi, eds. Cold Spring Harbor monograph series. Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y

    Anderson, S., M. H. L. De Bruijn, A. R. Coulson, I. C. Eperon, F. Sanger, and I. G. Young. 1982b. Complete sequence of bovine mitochondrial DNA: conserved features of the mammalian mitochondrial genome. J. Mol. Biol. 156:683–718

    Arctander, P. 1999. Mitochondrial recombination? Science 284:2090–2091

    Arnason, U., A. Gullberg, and A. Janke. 1999. The mitochondrial DNA molecule of the aardvark, Orycteropus afer, and the position of the Tubulidentata in the eutherian tree. Proc. R. Soc. Lond. B Biol. Sci. 266:339–345

    Arnason, U., A. Gullberg, and B. Widegren. 1991. The complete nucleotide sequence of the mitochondrial DNA of the fin whale, Balaenoptera physalus. J. Mol. Evol. 33:556–568

    Arnason, U., and E. Johnsson. 1992. The complete mitochondrial DNA sequence of the harbor seal, Phoca vitulina. J. Mol. Evol. 34:493–505

    Awadalla, P., A. Eyre-Walker, and J. Maynard Smith. 2000. Questioning evidence for recombination in human mitochondrial DNA: response. Science 288:1931a

    Awadalla, P., A. Eyre-Walker, and J. M. Smith. 1999. Linkage disequilibrium and recombination in hominid mitochondrial DNA. Science 286:2524–2525

    Ball, C. A., K. Dolinski, S. S. Dwight et al. (17 co-authors). 2000. Integrating functional genomic information into the Saccharomyces Genome Database. Nucleic Acids Res. 28:77–80

    Begun, D. J., and C. F. Aquadro. 1992. Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature 356:519–520

    Bibb, M. J., R. A. Van Etten, C. T. Wright, M. W. Walberg, and D. A. Clayton. 1981. Sequence and gene organization of mouse mitochondrial DNA. Cell 26:167–180

    Bonfield, J. K., C. Rada, and R. Staden. 1998. Automated detection of point mutations using fluorescent sequence trace subtraction. Nucleic Acids Res. 26:3404–3409

    Bonfield, J. K., K. F. Smith, and R. Staden. 1995. A new DNA sequence assembly program. Nucleic Acids Res. 23:4992–4999

    Bonfield, J. K., and R. Staden. 1995. The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Res. 23:1406–1410

    Bruno, W. J. 1996. Modeling residue usage in aligned protein sequences via maximum likelihood. Mol. Biol. Evol. 13:1368–1374

    C. ELEGANS Consortium. 1998. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282:2012–2018

    Cao, Y., J. Adachi, A. Janke, S. Pääbo, and M. Hasegawa. 1994. Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: instability of a tree based on a single gene. J. Mol. Evol. 39:519–527

    Cao, Y., A. Janke, P. J. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Pääbo, and M. Hasegawa. 1998. Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders. J. Mol. Evol. 47:307–322

    Chang, Y.-S., F.-L. Huang, and T.-B. Lo. 1994. The complete nucleotide sequence and gene organization of carp (Cyprinus carpio) mitochondrial genome. J. Mol. Evol. 38:138–155

    Charlesworth, B. 1994. The effect of background selection on weakly-selected, linked variants. Genet. Res. 63:213–227

    Charlesworth, B., M. T. Morgan, and D. Charlesworth. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303

    Charlesworth, D., B. Charlesworth, and M. T. Morgan. 1995. The pattern of neutral molecular variation under the background selection model. Genetics 141:1619–1632

    Cheng, S., S.-Y. Chang, P. Gravitt, and R. Respess. 1994a. Long PCR. Nature 369:684–685

    Cheng, S., C. Fockler, W. M. Barnes, and R. Higuchi. 1994b. Effective amplification of long targets from cloned inserts and human genomic DNA. Proc. Natl. Acad. Sci. USA 91:5695–5699

    Chothia, C., and A. M. Lesk. 1987. The evolution of protein structures. Cold Spring Harb. Symp. Quant. Biol. 52:399–406

    Clark, M. S. 1999. Comparative genomics: the key to understanding the Human Genome Project. Bioessays 21:121–130

    Clegg, M. T. 1997. Plant genetic diversity and the struggle to measure selection. J. Hered. 88:1–7

    Cort, J. R., E. V. Koonin, P. A. Bash, and M. A. Kennedy. 1999. A phylogenetic approach to target selection for structural genomics: solution structure of YciH. Nucleic Acids Res. 27:4018–4027

    Cummings, M. P., S. P. Otto, and J. Wakeley. 1995. Sampling properties of DNA sequence data in phylogenetic analysis. Mol. Biol. Evol. 12:814–822

    ———. 1999. Genes and other samples of DNA sequence data for phylogenetic inference. Biol. Bull. 196:345–350

    Curole, J. P., and T. D. Kocher. 1999. Mitogenomics: digging deeper with complete mitochondrial genomes. Trends Ecol. Evol. 14:394–398

    Dean, A. M., and G. B. Golding. 2000. Enzyme evolution explained (sort of). Pac. Symp. Biocomput. 5:6–17

    Desjardins, P., and R. Morais. 1990. Sequence and gene organization of the chicken mitochondrial genome: a novel gene order in higher vertebrates. J. Mol. Biol. 212:599–635

    D'Onofrio, G., K. Jabbari, H. Musto, F. Alvarez-Valin, S. Cruveiller, and G. Bernardi. 1999. Evolutionary genomics of vertebrates and its implications. Ann. N.Y. Acad. Sci. 870:81–94

    Dowling, T. E., C. Moritz, J. D. Palmer, and L. H. Rieseberg. 1996. Nucleic acids III: analysis of fragments and restriction sites. Sinauer, Sunderland, Mass

    Eisen, J. A. 1998. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 8:163–167

    Ewing, B., and P. Green. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8:186–194

    Ewing, B., L. Hillier, M. C. Wendl, and P. Green. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175–185

    Filatov, D. A., F. Moneger, I. Negrutiu, and D. Charlesworth. 2000. Low variability in a Y-linked plant gene and its implications for Y-chromosome evolution. Nature 404:388–390

    Fleischmann, R. D., M. D. Adams, O. White et al. (40 co-authors). 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae. Science 269:496–498, 507–512

    Frishman, D., R. J. Goldstein, and D. D. Pollock. 2000. Protein evolution and structural genomics. Pac. Symp. Biocomput. 5:3–5

    Gadaleta, G., G. Pepe, G. De Candia, C. Quagliarello, E. Sbisa, and C. Saccone. 1989. The complete nucleotide sequence of the Rattus norvegicus mitochondrial genome: cryptic signals revealed by comparative analysis between vertebrates. J. Mol. Evol. 28:497–516

    Golding, G. B., and A. M. Dean. 1998. The structural basis of molecular adaptation. Mol. Biol. Evol. 15:355–369

    Goldman, N., J. L. Thorne, and D. T. Jones. 1996. Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. J. Mol. Biol. 263:196–208

    ———. 1998. Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics 149:445–458

    Graybeal, A. 1998. Is it better to add taxa or characters to a difficult phylogenetic problem? Syst. Biol. 47:9–17

    Gu, X. 1999. Statistical methods for testing functional divergence after gene duplication. Mol. Biol. Evol. 16:1664–1674

    Hillis, D. M. 1996. Inferring complex phylogenies. Nature 383:130–131

    ———. 1998. Taxonomic sampling, phylogenetic accuracy, and investigator bias. Syst. Biol. 47:3–8

    Horai, S., K. Hayasaka, R. Kondo, K. Tsugane, and N. Takahata. 1995. Recent African origin of modern humans revealed by complete sequences of hominoid mitochondrial DNAs. Proc. Natl. Acad. Sci. USA 92:532–536

    Huttley, G. A., I. B. Jakobsen, S. R. Wilson, and S. Easteal. 2000. How important is DNA replication for mutagenesis? Mol. Biol. Evol. 17:929–937

    Irwin, D. M., and A. C. Wilson. 1991. Structure and evolution of cow stomach lysozyme genes. FASEB J. 5:A1527

    Iwata, S., J. W. Lee, K. Okada, J. K. Lee, M. Iwata, B. Rasmussen, T. A. Link, S. Ramaswamy, and B. K. Jap. 1998. Complete structure of the 11-subunit bovine mitochondrial cytochrome bc1 complex. Science 281:64–71

    Jermann, T. M., J. G. Opitz, J. Stackhouse, and S. A. Benner. 1995. Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily. Nature 374:57–59

    Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. CABIOS 8:275–282

    Karplus, K., K. Sjolander, C. Barrett, M. Cline, D. Haussler, R. Hughey, L. Holm, and C. Sander. 1997. Predicting protein structure using hidden Markov models. Proteins XX(Suppl.):134–139

    Kim, J. 1998. Large-scale phylogenies and measuring the performance of phylogenetic estimators. Syst. Biol. 47:43–60

    Kivisild, T., R. Villems, L. B. Jorde, M. Bamshad, S. Kumar, P. Hedrick, T. Dowling, M. Stoneking, T. J. Parsons, and J. A. Irwin. 2000. Questioning evidence for recombination in human mitochondrial DNA. Science 288:1931a

    Koshi, J. M., and R. A. Goldstein. 1996. Correlating structure-dependent mutation matrices with physical-chemical properties. Pac. Symp. Biocomput. 1:488–499

    Koshi, J. M., D. P. Mindell, and R. A. Goldstein. 1999. Using physical-chemistry-based substitution models in phylogenetic analyses of HIV-1 subtypes. Mol. Biol. Evol. 16:173–179

    Lesk, A. M., and C. Chothia. 1980. How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol. 136:225–270

    Lewis, P. O. 1998. A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data. Mol. Biol. Evol. 15:277–283

    Livingstone, C. D., and G. J. Barton. 1996. Identification of functional residues and secondary structure from protein multiple sequence alignment. Methods Enzymol. 266:497–512

    Lopez, J. V., S. Cevario, and S. J. O'Brien. 1996. Complete nucleotide sequences of the domestic cat (Felis catus) mitochondrial genome and a transposed mtDNA repeat, Numt, in the nuclear genome. Genomics 33:229–246

    Malcolm, B. A., K. P. Wilson, B. W. Matthews, J. F. Kirsch, and A. C. Wilson. 1990. Ancestral lysozymes reconstructed neutrality tested and thermostability linked to hydrocarbon packing. Nature 345:86–89

    Merriweather, D. A., and F. A. Kaestle. 1999. Mitochondrial recombination? (continued). Science 285:837

    Meyer, A. 1994. Shortcomings of the cytochrome b gene as a molecular marker. Trends Ecol. Evol. 9:278–280

    Mindell, D. P., M. D. Sorenson, D. E. Dimcheff, M. Hasegawa, J. C. Ast, and T. Yuri. 1999. Interordinal relationships of birds and other reptiles based on whole mitochondrial genomes. Syst. Biol. 48:138–152

    Miya, M., and M. Nishida. 1999. Organization of the mitochondrial genome of a deep-sea fish, Gonostoma gracile (Teleostei: Stomiiformes): first example of transfer RNA gene rearrangements in bony fishes. Mar. Biotech. 1:416–426

    Nelson, W. S., P. A. Prodohl, and J. C. Avise. 1996. Development and application of long-PCR for the assay of full-length animal mitochondrial DNA. Mol. Ecol. 5:807–810

    O'Brien, S. J., J. Wienberg, and L. A. Lyons. 1997. Comparative genomics: lessons from cats. Trends Genet. 13:393–399

    Orengo, C. A., F. M. Pearl, J. E. Bray, A. E. Todd, A. C. Martin, L. Le Conte, and J. M. Thornton. 1999. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res. 27:275–279

    Otto, S. P., M. P. Cummings, and J. Wakeley. 1996. Inferring phylogenies from DNA sequence data: the effects of sampling. Pp. 103–115 in P. H. Harvey, A. J. Leigh-Brown, J. Maynard-Smith, and S. Nee, eds. New uses for new phylogenies. Oxford University Press, Oxford, England

    Overton, G. C., and J. Haas. 1998. Case-based reasoning driven gene annotation. Pp. 65–86 in S. L. Salzberg, D. B. Searls, and S. Kasif, eds. Computational methods in molecular biology. Elsevier, Amsterdam

    Poe, S., and D. L. Swofford. 1999. Taxon sampling revisited. Nature 398:299–300

    Pollock, D. D. 1998. Increased accuracy in analytical molecular distance estimation. Theor. Popul. Biol. 54:78–90

    Pollock, D. D., and W. J. Bruno. 2000. Assessing an unknown evolutionary process: effect of increasing site-specific knowledge through taxon addition. Mol. Biol. Evol. 17:1854–1858

    Pollock, D. D., and W. R. Taylor. 1997. Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. Protein Eng. 10:647–657

    Pollock, D. D., W. R. Taylor, and N. Goldman. 1999. Coevolving protein residues: maximum likelihood identification and relationship to structure. J. Mol. Biol. 287:187–198

    Roe, B. A., D. P. Ma, R. K. Wilson, and J. F. H. Wong. 1985. The complete nucleotide sequence of the Xenopus laevis mitochondrial genome. J. Biol. Chem. 260:9759–9774

    Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425

    Shirakihara, Y., A. G. Leslie, J. P. Abrahams, J. E. Walker, T. Ueda, Y. Sekimoto, M. Kambara, K. Saika, Y. Kagawa, and M. Yoshida. 1997. The crystal structure of the nucleotide-free alpha 3 beta 3 subcomplex of F1-ATPase from the thermophilic Bacillus PS3 is a symmetric trimer. Structure 5:825–836

    Simonsen, K. L., G. A. Churchill, and C. F. Aquadro. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429

    Staden, R. 1996. The Staden sequence analysis package. Mol. Biotech. 5:233–241

    Sutton, G., O. White, M. Adams, and A. Kerlavage. 1995. TIGR Assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci. Tech. 1:9–19

    Tajima, F. 1993. Simple methods for testing the molecular evolutionary clock hypothesis. Genetics 135:599–607

    Takeyasu, K., H. Omote, S. Nettikadan, F. Tokumasu, A. Iwamoto-Kihara, and M. Futai. 1996. Molecular imaging of Escherichia coli F0F1-ATPase in reconstituted membranes using atomic force microscopy. FEBS Lett. 392:110–113

    Thompson, M. J., and R. A. Goldstein. 1996. Predicting solvent accessibility: higher accuracy using Bayesian statistics and optimized residue substitution classes. Proteins 25:38–47

    ———. 1997. Predicting protein secondary structure with probabilistic schemata of evolutionarily derived information. Protein Sci. 6:1963–1975

    Thorne, J. L., N. Goldman, and D. T. Jones. 1996. Combining protein evolution and secondary structure. Mol. Biol. Evol. 13:666–673

    Tsukihara, T., H. Aoyama, E. Yamashita, T. Tomizaki, H. Yamaguchi, K. Shinzawa-Itoh, R. Nakashima, R. Yaono, and S. Yoshikawa. 1996. The whole structure of the 13-subunit oxidized cytochrome c oxidase at 2.8 Å. Science 272:1136–1144

    Tzeng, C.-S., C.-F. Hui, S.-C. Shen, and P. C. Huang. 1992. The complete nucleotide sequence of the Crossostoma lacustre mitochondrial genome: conservation and variations among vertebrates. Nucleic Acids Res. 20:4853–4858

    Uhlin, U., G. B. Cox, and J. M. Guss. 1997. Crystal structure of the epsilon subunit of the proton-translocating ATP synthase from Escherichia coli. Structure 5:1219–1230

    Venter, J. C., M. D. Adams, G. G. Sutton, A. R. Kerlavage, H. O. Smith, and M. Hunkapiller. 1998. Shotgun sequencing of the human genome. Science 280:1540–1542

    Venter, J. C., H. O. Smith, and L. Hood. 1996. A new strategy for genome sequencing. Nature 381:364–366

    Vik, S. B., J. C. Long, T. Wada, and D. Zhang. 2000. A model for the structure of subunit a of the Escherichia coli ATP synthase and its role in proton translocation. Biochim. Biophys. Acta 1458:457–466

    Wallace, D. C., M. T. Lott, M. D. Brown, K. Huoponen, and A. Torroni. 1995. Report of the committee on human mitochondrial DNA. Pp. 910–954 in A. J. Cuticchia, ed. Human gene mapping 1995: a compendium. Johns Hopkins University Press, Baltimore, Md

    Wayne, M. L., and K. L. Simonsen. 1998. Statistical tests of neutrality in the age of weak selection. Trends Ecol. Evol. 13:236–240

    Weber, J. L., and E. W. Myers. 1997. Human whole-genome shotgun sequencing. Genome Res. 7:401–409

    Xu, X., and U. Arnason. 1994. The complete mitochondrial DNA sequence of the horse, Equus caballus: extensive heteroplasmy of the control region. Gene 148:357–362

    ———. 1997. The complete mitochondrial DNA sequence of the white rhinoceros, Ceratotherium simum, and comparison with the mtDNA sequence of the Indian rhinoceros, Rhinoceros unicornis. Mol. Phylogenet. Evol. 7:189–194

    Yang, Z. 2000. Relating physicochemical properties of amino acids to variable nucleotide substitution patterns among sites. Pac. Symp. Biocomput. 5:78–89

    Yang, Z., S. Kumar, and M. Nei. 1995. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:1641–1650

    Yoshikawa, S., K. Shinzawa-Itoh, and T. Tsukihara. 1998. Crystal structure of bovine heart cytochrome c oxidase at 2.8 Å resolution. J. Bioenerg. Biomembr. 30:7–14

    Yoshikawa, S., T. Tsukihara, and K. Shinzawa-Itoh. 1996. Crystal structure of fully oxidized cytochrome c-oxidase from the bovine heart at 2.8 Å resolution. Biokhimiia 61:1931–1940

    Zardoya, R., and A. Meyer. 1996. Phylogenetic performance of mitochondrial protein-coding genes in resolving relationships among vertebrates. Mol. Biol. Evol. 13:933–942

    Zhang, J., and X. Gu. 1998. Correlation between the substitution rate and rate variation among sites in protein evolution. Genetics 149:1615–1625

    Zurawski, G., and M. T. Clegg. 1987. Evolution of higher-plant chloroplast DNA-encoded genes: implications for structure-function and phylogenetic studies. Annu. Rev. Plant Physiol. 38:391–418

Accepted for publication August 15, 2000.