Bias in Phylogenetic Reconstruction of Vertebrate Rhodopsin Sequences

Belinda S. W. Chang1,3, and Dana L. Campbell2,

Department of Organismic and Evolutionary Biology, Harvard University

Abstract

Two spurious nodes were found in phylogenetic analyses of vertebrate rhodopsin sequences in comparison with well-established vertebrate relationships. These spurious reconstructions were well supported in bootstrap analyses and occurred independently of the method of phylogenetic analysis used (parsimony, distance, or likelihood). Use of this data set of vertebrate rhodopsin sequences allowed us to exploit established vertebrate relationships, as well as the considerable amount known about the molecular evolution of this gene, in order to identify important factors contributing to the spurious reconstructions. Simulation studies using parametric bootstrapping indicate that it is unlikely that the spurious nodes in the parsimony analyses are due to long branches or other topological effects. Rather, they appear to be due to base compositional bias at third positions, codon bias, and convergent evolution at nucleotide positions encoding the hydrophobic residues isoleucine, leucine, and valine. LogDet distance methods, as well as maximum-likelihood methods which allow for nonstationary changes in base composition, reduce but do not entirely eliminate support for the spurious resolutions. Inclusion of five additional rhodopsin sequences in the phylogenetic analyses largely corrected one of the spurious reconstructions while leaving the other unaffected. The additional sequences not only were more proximal to the corrected node, but were also found to have intermediate levels of base composition and codon bias as compared with neighboring sequences on the tree. This study shows that the spurious reconstructions can be corrected either by excluding third positions, as well as those encoding the amino acids Ile, Val, and Leu (which may not be ideal, as these sites can contain useful phylogenetic signal for other parts of the tree), or by the addition of sequences that reduce problems associated with convergent evolution.

Introduction

Phylogenetic analysis is a complex problem in inference. It is therefore not surprising that all existing phylogenetic methods are known to fail under some conditions and for a variety of reasons. In recent years, several issues have emerged as particularly thorny. Use of an oversimplified model of molecular evolution or strong violation of the assumptions of a model can result in convergence to an incorrect topology with greater certainty as sequence length increases (i.e., inconsistency). This type of problem is particularly relevant to parsimony analyses, especially in cases in which some branches are much longer than others, a problem which has been dubbed "long-branch attraction" (Felsenstein 1978Citation ). Phylogenetic methods based on explicit models of evolution, such as distance and maximum likelihood, tend to be less vulnerable to this type of problem, but even these are known to display inconsistency under conditions where their assumptions are strongly violated (Hillis, Huelsenbeck, and Cunningham 1994Citation ; Gaut, and Lewis 1995Citation ; Yang 1996Citation ; Huelsenbeck 1997Citation ; Sullivan and Swofford 1997Citation ; Huelsenbeck 1998Citation ). In addition, although taxon sampling has long been an important issue in phylogenetic analyses, it remains difficult to establish reasonable guidelines for sampling and to assess the effects that it may have on the accuracy of tree topologies (Hillis 1996, 1998Citation ; Poe 1998Citation ; Rannala et al. 1998Citation ).

Determining the particular conditions under which phylogenetic methods fail is critical to both understanding their limitations and developing new, improved models and algorithms better suited to the analysis of molecular data. For example, third positions have been thought to be problematic in many data sets due to the effects of base compositional bias (Saccone, Pesole, and Preparata 1989Citation ; Sidow and Wilson 1990Citation ; Sogin, Hinkle, and Lelpe 1993Citation ). This has led to the development of models that incorporate nonstationary changes in base composition for both distance and likelihood phylogenetic methods (Lockhart et al. 1994Citation ; Steel 1994Citation ; Galtier and Gouy 1995, 1998Citation ). However, in practice, it is often difficult to identify concrete examples of failure of phylogenetic methods in real data sets and to pinpoint the reasons for that failure. Another common feature of molecular data sets that may cause phylogenetic methods to fail is variation in codon bias across the tree, but examples of this in real data sets have yet to be isolated, and the challenges they pose for phylogenetic reconstruction have only just begun to be addressed (Goldman and Yang 1994Citation ; Muse 1996Citation ; Yang 1997Citation ).

Rhodopsin is an ideal genetic system for exploring issues in phylogenetic reconstruction, because it has been cloned from a variety of species, and much is known about its function and molecular evolution (Chang et al. 1995, 1996Citation ; Baylor 1996Citation ; Baylor and Burns 1998Citation ; Bowmaker 1998Citation ; Sakmar 1998Citation ; Townson et al. 1998Citation ). Rhodopsin is a single-copy nuclear gene encoding a seven-transmembrane G-protein–coupled receptor which forms the first step in the visual transduction cascade in the photoreceptors of the eye (Nathans 1992Citation ). In vertebrates, it is expressed at high levels in a single cell type, rod photoreceptor cells (Khorana 1992Citation ; Chang et al. 1996Citation ; Baylor and Burns 1998Citation ; Sakmar 1998Citation ). Rhodopsin has been found to exist in more than one copy only in rare instances, for example, in polyploid animals such as the carp, Cyprinus carpio (Larhammar and Risinger 1994Citation ). Most important for this study, phylogenetic relationships among vertebrates, for which rhodopsin sequences are available, have been well-characterized using fossil, morphological, and molecular data (Carroll 1997Citation ).

This study takes advantage of well-established vertebrate relationships to examine in detail molecular evolutionary forces which result in spurious reconstructions in a data set of vertebrate rhodopsin sequences. Once these factors have been identified, methods are explored to reduce their effects and eliminate the spurious reconstructions.

Materials and Methods

Rhodopsin sequences were obtained from the GenBank database via NCBI's website (http://www.ncbi.nlm.nih.gov/genbank/). GenBank accession numbers for all the sequences used are given in table 1 . Rhodopsin cDNA sequences were aligned using CLUSTAL W and modified by hand to allow only gaps between codons. This file was then translated to yield an equivalently aligned amino acid rhodopsin data set. Parsimony, distance, and maximum-likelihood phylogenetic analyses were performed using a beta-test version of PAUP*, version 4 (Swofford 1999Citation ). Trees were rooted using the lamprey sequence as an outgroup. In addition, many of the analyses also included four paralogous rodlike cone opsin genes (GenBank accession numbers: gekko blue, M92035; chick green, M92038; goldfish green1, L11865; goldfish green2, L11866) as outgroup sequences in order to confirm the position of the root (Chang et al. 1995Citation ). The results of these analyses confirmed the position of the lamprey as the most basally diverging vertebrate rhodopsin.


View this table:
[in this window]
[in a new window]
 
Table 1 Rhodopsin Sequences Used in this Study

 
In order to determine the best model for distance and likelihood analyses, likelihood scores were determined for five different models: JC (Jukes and Cantor 1969Citation ), K2P (Kimura 1980Citation ), F81 (Felsenstein 1981Citation ), HKY85 (Hasegawa, Kishino, and Yano 1985Citation ), and GTR (Yang 1994Citation ). Additionally, the effect of incorporating among-sites rate heterogeneity (using the {Gamma}-distribution; Yang 1993Citation ) into each of the models was examined. Likelihood ratio tests were used to compare likelihood scores obtained for pairs of nested models to determine which model best fit our sequence data (Felsenstein 1991Citation ; Yang, Goldman, and Friday 1994Citation ).

In addition to equally weighted parsimony analyses, 2:1 transversion (Tv) : transition (Ts) weighting was also used. Although other weighting schemes were explored, they produced less reliable trees (data not shown). In addition, this weighting scheme reflects the likelihood estimate of the Tv/Ts ratio (1.5). Distance bootstrap analyses were performed using the HKY85+{Gamma}, HKY85, and K2P models and the neighbor-joining algorithm.

In order to assess phylogenetic signal in the data set, 10,000 random trees were generated in PAUP* to calculate g1 statistics (Hillis and Huelsenbeck 1992Citation ). In addition, two measures of codon bias, scaled {chi}2 and effective number of codons (ENC) (Shields et al. 1988Citation ; Wright 1990Citation ), were calculated to assess codon usage in each taxon. Measures of nucleotide and codon bias were calculated using the program MEA (generously provided by its author, E. Moriyama).

To test for long-branch attraction (Huelsenbeck 1998Citation ), 100 data sets were simulated by parametric bootstrapping using the program SIMINATOR (Huelsenbeck, Hillis, and Jones 1996Citation ) with parameters estimated from the original rhodopsin sequence data set. The simulated data sets were subsequently analyzed using equally weighted maximum parsimony in PAUP*, version 4, with 100 replications of (nonparametric) bootstrapping, 10 random-addition replicates each.

Results

Phylogenetic Analyses
Phylogenetic analyses were performed on a data set of 20 vertebrate rhodopsin nucleotide sequences (table 1 ). Although this data set showed high levels of genetic variation (table 2 ) and generally performed well in reconstructing traditional relationships among vertebrates, phylogenetic analyses consistently show substantial bootstrap support for two groupings which contradict established vertebrate relationships: reptiles and amphibians form a clade (fig. 1B ), instead of the more traditional reptiles and mammals (fig. 1A ), and alligator and anolis form a clade (fig. 2B ), instead of alligator and chicken (fig. 2A ). In parsimony analysis with equal weights (table 3 ), bootstrap support was 86% for the reconstruction of amphibians as the sister group to reptiles (this node is hereinafter referred to as amph+rept) and 72% for the grouping of alligator + anolis as the sister lineage to the chicken (this node is hereinafter referred to as gator+anol). Support for these resolutions is robust to changes in the relative weightings of transversions and transitions: 91% for rept+amph and 65% for gator+anol with 2-to-1 Tv/Ts (table 3 ). Less than 5% bootstrap support was seen for more accepted resolutions of these nodes. This is in contrast to the robust support for established relationships elsewhere in the tree (fig. 3 ). Note that this data set, like many other molecular data sets, does not recover the Glires clade (rodents + rabbits), but instead places the rodents basal to a clade containing artiodactyls and other mammals. On the other hand, most morphological data recover the Glires (de Jong 1998Citation ). The Glires controversy is beyond the scope of this paper and does not influence its major observations.


View this table:
[in this window]
[in a new window]
 
Table 2 Ranges of Pairwise Nucleotide Distances (HKY85-corrected) Among Major Vertebrate Groups

 


View larger version (10K):
[in this window]
[in a new window]
 
Fig. 1.—Alternate resolutions of the reptile-amphibian-mammal trichotomy. A, Expected resolution based on well-established vertebrate relationships. B, Spurious resolution supported by the rhodopsin data set

 


View larger version (9K):
[in this window]
[in a new window]
 
Fig. 2.—Alternate resolutions of the chicken-alligator-anolis trichotomy. A, Expected resolution based on well-established relationships. B, Spurious resolution supported by the rhodopsin data set

 

View this table:
[in this window]
[in a new window]
 
Table 3 Bootstrap Support (%) for Alternative Resolutions of the Incongruent Nodes in the Original Data Set

 


View larger version (24K):
[in this window]
[in a new window]
 
Fig. 3.—Maximum-parsimony bootstrap phylogeny of the 20 rhodopsin nucleotide sequences in the original data set. One hundred bootstrap replications of 20 vertebrate rhodopsin sequences were performed, with 2-to-1 weighting of transversions to transitions

 
Another unusual aspect of this data set is that despite the substantial divergences among sequences (table 2 ), useful phylogenetic signal has been retained in third positions. Not only do third positions contribute the largest numbers of informative sites (out of 568 total informative sites, 315 were in third positions, 151 were in first positions, and 102 were in second positions), but they also contain enough signal that when analyzed alone (fig. 4 ), they recover a tree that is almost as well supported as the tree with all three codon positions included. In addition, analyses of the degree of skewness of a distribution of lengths of 10,000 randomly generated trees imply that some phylogenetic signal does reside in third positions (all positions: g1 = -0.69; third positions only: g1 = -0.69; first + second positions only: g1 = -0.80).



View larger version (21K):
[in this window]
[in a new window]
 
Fig. 4.—Maximum-parsimony bootstrap phylogeny of third positions only; 100 replications with 2-to-1 weighting of transversions to transitions

 
Although there is useful phylogenetic signal retained at third positions with respect to many nodes in the tree, this signal also appears to be underlying some of the support for the problematic reconstructions. Bootstrap support for these incongruent resolutions almost completely disappeared when third positions were excluded from parsimony analysis (<5% for rept+amph, 10% for gator+anol), an effect that is robust to changes in transversion-transition weighting (table 3 ). Furthermore, excluding third positions had the effect of increasing support for the more established chicken + alligator grouping (hereinafter chick+gator) to 72%, in contrast to the <5% bootstrap support shown when all positions were included in the analysis. Support for the reptile + mammal grouping (hereinafter rept+mamm) also increased, but not as much (16%), when third positions were excluded.

When analyzed alone, third positions showed substantial support for the spurious resolutions (68% for rept+amph, 41% for gator+anol) and no support for the well-corroborated relationships, an effect which was robust to changes in Tv/Ts weighting (table 3 and fig. 4 ). Analyses of the amino acid sequences, which should be free of the base compositional and codon bias effects particularly problematic for third-base positions and transitions, did not show any support for the incongruent relationships (table 3 ). However, the bootstrap phylogeny based on amino acids was rather poorly resolved in general (fig. 5 ).



View larger version (21K):
[in this window]
[in a new window]
 
Fig. 5.—Maximum-parsimony bootstrap phylogeny of rhodopsin amino acid sequences. Equally weighted parsimony analysis with 100 bootstrap replications

 
Distance analyses which did not incorporate nonstationary changes in base composition did not fare much better than parsimony for this data set, and also tended to recover the problematic nodes with substantial bootstrap support (71% for rept+amph and 47% for gator+anol, HKY85+{Gamma} model; table 3 ). These bootstrap values remained quite stable, even when the correction for rate heterogeneity was not included in the analysis or when models with fewer parameters were used (table 3 ).

Given the variation in base composition in this data set, especially at third positions (see table 1 ), analyses using LogDet/paralinear distance methods (Lake 1994Citation ; Lockhart et al. 1994Citation ; Steel 1994Citation ) were performed. These methods allow for nonstationary changes in base composition among sequences in a phylogeny and would be expected to perform better for data sets where this is a problem. Phylogenetic bootstrap analyses using LogDet distances did show reduced support for the problematic reconstructions (56% for rept+amph and 38% for gator+anol; table 3 and fig. 6 ). Moreover, for one of the problematic nodes, there was also slightly increased support for the correct reconstruction (38% for chick+gator; table 3 ).



View larger version (22K):
[in this window]
[in a new window]
 
Fig. 6.—Neighbor-joining bootstrap phylogeny using LogDet distances (1,000 replications)

 
Maximum-likelihood methods were also explored for this data set (fig. 7 ). Likelihood ratio tests were used to compare nested models of evolution in order to identify models that best fit our data set. These models were tested for a phylogeny of well-established vertebrate relationships (fig. 8A ). Among the models tested, among-sites rate heterogeneity was the single most important parameter resulting in significantly better likelihood scores ({chi}2 ranged from 2235.2 to 2412.8, P < 0.001 for all comparisons; table 4 ). Among the models incorporating rate heterogeneity, GTR+{Gamma} had significantly higher likelihood scores in pairwise comparisons with all other models ({chi}2 = 123.8–619.6, P < 0.001 for all comparisons) except for the HKY85+{Gamma} model ({chi}2 = 6, P = 0.2). The HKY85+{Gamma} model, when compared with nested models with fewer parameters, had significantly better likelihood scores ({chi}2 = 117.8—306.8, P < 0.001 for all comparisons). Since the GTR+{Gamma} model was not found to be significantly better than the HKY85+{Gamma} model, the HKY85+{Gamma} model was determined to be the best fit of those tested for this data set and was subsequently used in a full likelihood bootstrap analysis of the rhodopsin data set. However, maximum-likelihood phylogenetic methods under the HKY85+{Gamma} model did not perform any better than distance or parsimony methods, showing substantial support for spurious resolutions at both nodes (79% for rept+amph and 44% for gator+anol; fig. 7 and table 3 ).



View larger version (23K):
[in this window]
[in a new window]
 
Fig. 7.—Maximum-likelihood bootstrap phylogeny under the HKY+{Gamma} model (100 replications)

 


View larger version (24K):
[in this window]
[in a new window]
 
Fig. 8.—Results of simulation studies using parametric bootstrapping (see text for details concerning simulation conditions). Simulated data sets were analyzed using maximum parsimony with 100 nonparametric bootstrap replications. A, Tree topology used for simulations, with branch lengths indicated. B, Distribution of parsimony bootstrap support for the spurious reconstruction ((reptiles, amphibians), mammals) in the simulated data sets. C, Distribution of parsimony bootstrap support for the spurious reconstruction ((alligator, anolis), chicken) in the simulated data sets. These graphs represent the expected distribution of support for each of the spurious nodes under the model assumptions specified in the simulation experiment (see text). Arrows in histograms indicate the bootstrap values of the spurious nodes recovered in parsimony analysis of the real (nonsimulated) rhodopsin data set. Histogram bins represent ranges of 5% bootstrap values

 

View this table:
[in this window]
[in a new window]
 
Table 4 Likelihood Ratio Tests Comparing Nested Models

 
Since base compositional effects appeared to be important in this data set, likelihood methods which allow for nonstationary GC content were also explored (Galtier and Gouy 1998Citation ). Due to the computational intensity of this method (which allows GC content to vary across all of the branches of the tree), which made a full maximum-likelihood bootstrap analysis prohibitive, likelihood scores for alternative resolutions of the reptile-mammal-amphibian node (see fig. 1 ) were determined instead. A topology representing a well-established tree of vertebrate relationships (fig. 8A ) was found to have a lower log likelihood score (L = -9,566.12) than a second topology with the alternate resolution ((amphibians, reptiles), mammals), represented in figure 1B (L = -9,544.30). The higher likelihood score of the topology with the incongruent resolution at this node, as compared with established vertebrate relationships, indicates that even this model cannot fully account for the signal underlying the spurious reconstruction.

Finally, it has been suggested that hydrophobic amino acids may be less useful for phylogenetic reconstruction than other amino acids (Naylor and Brown 1997Citation ). To explore the effects of hydrophobic amino acids in the rhodopsin data set, nucleotide positions encoding the hydrophobic amino acids Ile, Leu, and Val were excluded in a parsimony analysis (189 nucleotide positions excluded, representing 63 amino acids). This analysis showed greatly reduced bootstrap support for the spurious resolutions (<5% for rept+amph and 33% for gator+anol, 2:1 Tv/Ts; table 3 ), indicating that positions encoding for these amino acids may underlie the spurious signal. If the spurious signal was due mainly to functional constraints on these hydrophobic amino acids, then excluding third positions should not affect the analysis. This was not the case, as the effect remained even when only third positions of the hydrophobic amino acids Ile, Leu, and Val were excluded (table 3 ).

Statistical Tests Comparing Trees
Several statistical tests were performed using the rhodopsin nucleotide data set in order to determine if phylogenies with and without the two spurious reconstructions were significantly different. The Templeton (1983)Citation test and the "winning sites" test (Prager and Wilson 1988Citation ) compare trees under the parsimony criteria, whereas the Kishino-Hasegawa test (Kishino and Hasegawa 1989Citation ) was formulated to compare trees under either likelihood or parsimony. Tests under the parsimony criteria are shown in table 5 , and tests under the likelihood criteria are shown table 6 . Each of the two spurious reconstructions (rept+amph, gator+anol) was tested separately in pairwise tests of trees with and without each spurious reconstruction. These tests confirmed the results of the phylogenetic bootstrap analysis, pinpointing third positions and nucleotides encoding Ile, Leu, and Val as the sites supporting the spurious reconstructions. Although neither spurious reconstruction (rept+amph, gator+anol) was significantly better with all nucleotide sites included, when only third positions and sites encoding Ile, Leu, or Val were considered, trees with the spurious reconstructions became significantly better than those without. This was true under both parsimony (table 5 ) and likelihood (table 6 ). Conversely, when only first and second positions, excluding those sites encoding Ile, Leu, or Val, were considered, the tree without spurious reconstructions was found to be better than either one of the trees with the spurious reconstructions. This result was significant under parsimony, but not under likelihood (tables 5 and 6 ).


View this table:
[in this window]
[in a new window]
 
Table 5 P Values for Statistical Tests of Parsimony Length Differences Between Trees With and Without Spurious Reconstructions

 

View this table:
[in this window]
[in a new window]
 
Table 6 Kishino-Hasegawa Tests Using Maximum Likelihood Under the HKY+{{Gamma}} Model

 
Simulation Studies Using Parametric Bootstrapping
Long-branch attraction has been identified as a potential reason for problematic groupings in several studies (Huelsenbeck, Hillis, and Jones 1996Citation ; Huelsenbeck 1997, 1998Citation ). One of the criteria for identifying long-branch attraction as a potential problem in phylogenetic analyses of a particular data set, according to Huelsenbeck (1997)Citation , is to show that branches of the topology are indeed long enough to attract using simulation studies. In order to test for long-branch attraction as a contributing factor to the spurious reconstructions of the rhodopsin data set, simulations were performed using parametric bootstrapping techniques. Parameters and branch lengths for the simulations were estimated using maximum likelihood on an established tree of vertebrate relationships (Carroll 1997Citation ; figure 8A ), under the HKY+{Gamma} model (maximum-likelihood-estimated parameters: K = 3.12, {alpha} = 0.33, frequency of A = 0.19, frequency of C = 0.35, frequency of G = 0.24, frequency of T = 0.23; branch lengths are given in fig. 8A ). Although a few of the resolutions of taxa present in this tree do remain somewhat controversial (e.g., the placement of the rabbits as basal to artiodactyls instead of with rodents), these are unlikely to affect the simulations with respect to the nodes in question.

Results from parsimony bootstrap analysis of the 100 simulated data sets are graphed in figure 8B and C, representing the expected null distribution of parsimony bootstrap values for each spurious reconstruction (rept+amph, gator+anol). Note that support for these spurious clades was being examined under conditions where the data were simulated from topologies reflecting the more established relationships (rept+mamm, gator+chick). The median level of bootstrap support for the incorrect rept+amph clade was 10.5% and that for the gator+anol clade was 19% in the simulated data sets. In the real rhodopsin data set, bootstrap support for both spurious resolutions was significantly higher than expected from the null distribution of simulated data sets generated by parametric bootstrapping (86% for rept+amph and 72% for gator+anol; P < 0.05 in both cases). This indicates that the level of support seen for the problematic reconstructions is higher than would be expected given the conditions of the simulations, and therefore unlikely to be due to long-branch attraction.

Base Composition and Codon Bias Measures
Since the results of the phylogenetic analyses and statistical tests comparing phylogenies implied that third positions, as well as transitions, underlie the bootstrap support of the spurious reconstructions, base composition and codon bias measures were examined for evidence of convergent evolution. First- and second-position nucleotide compositions were fairly homogeneous across all sequences. However, at third positions, reptile and amphibian rhodopsins tended to have lower %GC than other sequences (table 1 ). This pattern of convergent evolution may confound phylogenetic analyses and result in the spurious grouping, as shown by mapping the GC content on the phylogeny (fig. 9 ). Furthermore, amphibian and reptile rhodopsins are less biased in their codon usage, as shown by scaled {chi}2 and ENC codon bias measures, than are the rhodopsins of other vertebrate groups (table 1 ). Not only are there convergences in the overall degree of codon bias, but there are also convergences in the usage frequencies of specific codons that reflect the spurious groupings. This convergent pattern was evident when the codon usage frequencies were mapped on a tree. For example, convergences in the frequency of GGC, one of four codons coding for glycine, are shown mapped on the tree in figure 9 .



View larger version (26K):
[in this window]
[in a new window]
 
Fig. 9.—Phylogeny of expected vertebrate relationships showing convergences in %GC at third positions, and an example of convergence in codon bias of a sample codon of glycine. %GC is represented by the black portion of the pies in the first column, and the proportion of the total codons of Gly represented by the codon GGC is represented by the black portion of the pies in the second column

 
The results of the phylogenetic analyses and statistical tests comparing alternative phylogenies also implicated positions encoding hydrophobic residues Ile, Leu, and Val as contributing to the high bootstrap support of the spurious reconstructions. In order to further explore this effect, base composition was examined at these sites for evidence of convergent evolution. Third positions in general had already been shown to be convergent in this data set (see above, fig. 9 ); therefore, for these amino acids, attention was focused on first and second positions. Second positions did not vary, as the amino acids Ile, Leu, and Val are all encoded by the same nucleotide, T. However, at first positions, at these sites, it was found that reptile and amphibian rhodopsins tended to have more A's (32.01%) than all other sequences (28.86%). This effect was not as marked in first positions that encoded amino acids other than Ile, Leu, and Val (27.28% for amph+rept, 26.43% for all others). This pattern of convergent evolution resulting in increased numbers of A's in first positions also results in an increased proportion of Ile's in reptile and amphibian rhodopsins relative to the total numbers of Ile, Leu, and Val present (28.1% for amph+rept, 26.5% for all others).

Effect of Increased Sampling
If the spurious reconstructions seen in this data set were due to convergent evolution, perhaps better sampling across the tree could ameliorate this effect. Rhodopsin sequences from five basally diverging taxa that were recent additions to GenBank were added to the data set: sea lamprey, Conger eel, Anguilla eel, skate, and Myripristis berndti, a holocentrid marine fish (table 1 ). It is important to note that not only do these sequences represent basal species poorly sampled in the original data set, but several of them also display values of base composition at third positions and/or codon bias quite different from their closest neighbors on the tree, and are thus more likely to "break up" convergent effects.

Myripristis berndti and Anguilla rhodopsins have only 67.81% and 73.65% GC content at third positions, as compared with other fish rhodopsins, which average 80.16% (table 1 ). Similarly, skate rhodopsin has much lower %GC at third positions (70.70%) than the nearest basal lineage, lamprey rhodopsin (87.57%). The two measures of codon bias, scaled {chi}2 and ENC, also showed the M. berndti, skate, and Conger rhodopsins to be atypically low in codon bias compared with neighboring fish and lamprey sequences (table 1 ).

For this expanded data set, equally weighted parsimony analysis of all positions showed reduced bootstrap support for the spurious rept+amph clade (48%) as compared with the original data set (86% without the additional sequences) and increased support for the correct rept+mamm clade, which rose from <5% in the original data set (table 3 ) to 25% in the expanded data set (table 7 ). Unlike analyses of the original data set, in which there was virtually no difference between equal weights versus 2-to-1 Tv/Ts weighting schemes, analysis of the expanded data set was highly sensitive to differences in weighting, particularly in the resolution of the reptile-mammal-amphibian node. When Tv/Ts weighting was used, bootstrap support for the correct rept+mamm clade jumped from 25% (equal weights) to 70% (2:1 Tv/Ts; fig. 10 ). In contrast, bootstrap support for the spurious gator+anol clade remains substantial in the analysis of the expanded data set (73%), and the high degree of sensitivity to differences in Tv/Ts weighting was not seen here (table 7 ).


View this table:
[in this window]
[in a new window]
 
Table 7 Bootstrap Support (%) for the Incongruent Nodes in an Expanded Data Set with Five Additional Sequences

 


View larger version (31K):
[in this window]
[in a new window]
 
Fig. 10.—Maximum-parsimony analysis of the expanded data set with five additional sequences (100 bootstrap replications with 2-to-1 weighting of transversions to transitions)

 
In both cases, bootstrap support for the spurious resolutions disappeared entirely when third positions were excluded from parsimony analysis, regardless of Tv/Ts weighting (table 7 ). These results are similar to those of the analysis of the original data set (table 3 ). However, in contrast to the original data set, when third positions were excluded in the expanded data set, bootstrap support for the more established resolutions was increased (44% for rept+mamm and 78% for chick+gator, equal weights). When analyzed alone, third positions showed substantial support for the spurious resolutions and no support for the well-corroborated resolutions of these nodes, regardless of Tv/Ts weighting (table 7 ).

The patterns of bootstrap support in distance analyses of the expanded data set (table 7 ) remained very similar to those of the original data set (table 3 ), with very little difference in support between the models used, showing neither decreased support for spurious resolutions nor increased support for correct resolutions. Maximum-likelihood reconstructions under HKY85+{Gamma} in the expanded data set also showed results similar to those found for the original data set and did not show reduced support for the spurious nodes nor heightened support for the correct nodes in the expanded data set (table 7 ).

Statistical comparisons of trees with and without the spurious reconstructions (rept+amph, gator+anol) were consistent with the phylogenetic bootstrap analyses. A tree with the gator+anol clade was still better than one without this spurious reconstruction when only third positions and sites encoding Ile, Leu, and Val were considered. This result was significant under the parsimony criterion (table 5 ) and not quite significant under the likelihood criterion (P = 0.07; table 6 ). However, the sites which clearly supported the spurious gator+anol reconstruction in both the original and extended data sets and also supported the spurious rept+amph reconstruction in the original data set were no longer capable of distinguishing between a tree with the spurious rept+amph reconstruction and one without in the extended data set (tables 5 and 6 ). This result is again consistent with the phylogenetic bootstrap analyses, which suggest that the additional sequences aid in breaking up convergences among the sequences, but only for the spurious rept+amph reconstruction, which is more proximal to the additional sequences, leaving the spurious gator+anol reconstruction largely unaffected.

Discussion

Our results indicate that the two problematic reconstructions in the original rhodopsin data set were probably not the result of topological effects such as long-branch attraction. This is demonstrated by the persistence of these spurious nodes when maximum-likelihood methods were used and by the fact that the bootstrap support for these spurious nodes was well outside of the distribution of support obtained for each node from simulated data sets generated by parametric bootstrapping. Rather, these spurious reconstructions were most likely due to convergences in base compositional bias at third positions, in codon bias, and in positions encoding for the hydrophobic amino acids Ile, Val, and Leu, which tend to group unrelated sequences. This represents a strong violation of phylogenetic model assumptions of stationary base composition and codon frequencies across the tree, which would cause methods not directly addressing these problems to fail under these conditions.

Base compositional bias at third positions has often been found to be problematic for phylogenetic reconstruction, and several methods have been developed in an attempt to address this problem (Lockhart et al. 1994Citation ; Galtier and Gouy 1995, 1998Citation ). Although these methods did reduce support for the spurious reconstructions in the rhodopsin data set, they were not completely effective in eliminating the problematic nodes, and it seems clear that base compositional bias is not the only reason for the spurious nodes. In fact, simulation studies on a data set of bat sequences have shown that levels of base compositional bias must be extremely high (>90% AT) in order to show any evidence of spurious reconstructions (Van Den Bussche et al. 1998Citation ). Although fairly high, levels of base compositional bias are not so extreme in the rhodopsin data set.

In addition to base compositional bias, convergent effects in codon bias and in positions encoding hydrophobic amino acids also appear to be supporting the spurious reconstructions in the rhodopsin data set. Other phylogenetic studies that have also found problematic reconstructions have attributed these to various problems such as not incorporating rate heterogeneity across sites into the phylogenetic model (Takezaki and Gojobori 1999Citation ), which is clearly not the case here. However, there is growing evidence that convergent or parallel evolution at the level of nucleotides (or amino acids) is a common feature of many molecular data sets and may pose a significant challenge in attempting to reconstruct unbiased phylogenies (Naylor and Brown 1997, 1998Citation ; Cao et al. 1998Citation ; Foster and Hickey 1999Citation ; Lee 1999Citation ). In particular, nucleotide sites encoding the hydrophobic amino acids Ile, Leu, and Val have been shown in other studies to display lower retention indices than other sites (Naylor and Brown 1997Citation ), and the analyses of the rhodopsin data set presented here provide more evidence of the importance of this effect. The reasons for it still remain unclear but may be related to relaxed constraints on hydrophobic amino acids contained within transmembrane domains.

There are several ways to address these problems of bias in base composition, codon frequencies, and sites encoding hydrophobic amino acids. All of these positions could be excluded from a parsimony phylogenetic analysis. This method can be effective in principle, but in fact may not be ideal, as these positions often contain useful phylogenetic signal in addition to the spurious signal, and excluding them can result in loss of resolution in the phylogenetic reconstructions (e.g., see Campbell, Brower, and Pierce 2000). Another way of addressing this problem would be to develop more complex models of evolution which incorporate these assumptions about base composition, codon bias, and amino acid composition. However, this may require the addition of many more parameters to the model, which may become problematic.

In addition to advances in phylogenetic methodology, this problem may be effectively addressed, albeit indirectly, via better sampling of species. Note that here "better sampling" means the addition of sequences not only proximal to problematic nodes, but also intermediate in base composition and codon bias. In other words, it is not only important when considering sampling issues to "break up" long branches that can lead to the failure of methods such as parsimony, but even more important to "break up" convergences in base composition and codon bias that can cause all types of phylogenetic methods, not just parsimony, to fail. In fact, it should be noted that of all the phylogenetic methods used here, only weighted parsimony methods are able to recover the correct topology once appropriately sampled sequences are included in the analysis, and thus these methods outperform both distance and maximum-likelihood methods in this regard. This may reflect greater sensitivity of maximum-likelihood and distance methods to incorrect assumptions in the underlying models (with respect to nonstationary nucleotide and codon bias and hydrophobic sites) in comparison with parsimony methods, which sometimes may prove more robust to violations of these assumptions despite the fact that maximum-likelihood methods are known to be consistent over a larger set of conditions than are parsimony methods (Hillis, Huelsenbeck, and Cunningham 1994Citation ; Huelsenbeck 1997Citation ; Sullivan and Swofford 1997Citation ).

Acknowledgements

We thank Z. Yang, R. Honeycutt, and two anonymous reviewers for many helpful comments on the manuscript, and N. Pierce and M. Donoghue for discussion and advice. B.S.W.C. is an NSF/Alfred P. Sloan Fellow in Molecular Evolution.

Footnotes

Rodney Honeycutt, Reviewing Editor

1 Present address: Department of Molecular Biology and Biochemistry, Rockefeller University. Back

2 Present address: Department of Biology, University of Maryland, College Park. Back

3 Keywords: molecular evolution hydrophobic amino acids base compositional bias codon bias parametric bootstrapping Back

4 Address for correspondence and reprints: Belinda S. W. Chang, Rockefeller University, 1230 York Ave., Box 284, New York, New York 10021. E-mail: changb{at}rockvax.rockefeller.edu Back

literature cited

    Baylor, D. 1996. How photons start vision. Proc. Natl. Acad. Sci. USA 93:560–565.

    Baylor, D. A., and M. E. Burns. 1998. Control of rhodopsin activity in vision. Eye 12:521–525.

    Bowmaker, J. 1998. Evolution of colour vision in vertebrates. Eye 12:541–547.

    Campbell, D. L., A. V. Z. Brower, and N. E. Pierce. 2000. Molecular evolution of the Wingless gene and its implications for the phylogenetic placement of the butterfly family Riodinidae (Lepidoptera: Papilionoidea). Mol. Biol. Evol. 17:684–696.[Abstract/Free Full Text]

    Cao, Y., A. Janke, P. J. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Paabo, and M. Hasegawa. 1998. Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders. J. Mol. Evol. 47:307–322.[ISI][Medline]

    Carroll, R. L. 1997. Patterns and processes of vertebrate evolution. Cambridge University Press, Cambridge, England.

    Chang, B. S. W., D. Ayers, W. C. Smith, and N. E. Pierce. 1996. Cloning of the gene encoding honeybee long-wavelength rhodopsin: a new class of insect visual pigments. Gene 173:215–219.

    Chang, B. S. W., K. S. Crandall, J. P. Carulli, and D. L. Hartl. 1995. Opsin phylogeny and evolution: a model for blue shifts in wavelength regulation. Mol. Phylogenet. Evol. 4:31–43.[Medline]

    de Jong, W. W. 1998. Molecules remodel the mammalian tree. Trends Ecol. Evol. 13:270–275.[ISI]

    Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401–410.[ISI]

    ———. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.[ISI][Medline]

    ———. 1991. PHYLIP: phylogeny inference package. Version 3.4. University of Washington, Seattle.

    Foster, P. G., and D. A. Hickey. 1999. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J. Mol. Evol. 48:284–290.[ISI][Medline]

    Galtier, N., and M. Gouy. 1995. Inferring phylogenies from sequences of unequal base compositions. Proc. Natl. Acad. Sci. USA 92:11317–11321.

    ———. 1998. Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15:871–879.[Abstract]

    Gaut, B. S., and P. O. Lewis. 1995. Success of maximum likelihood phylogeny inference in the four-taxon case. Mol. Biol. Evol. 12:152–162.[Abstract]

    Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725–736.[Abstract/Free Full Text]

    Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:672–677.

    Hillis, D. M. 1996. Inferring complex phylogenies. Nature 383:130–131.

    ———. 1998. Taxonomic sampling, phylogenetic accuracy, and investigator bias. Syst. Biol. 47:3–8.[ISI][Medline]

    Hillis, D. M., and J. P. Huelsenbeck. 1992. Signal, noise, and reliability in molecular phylogenetic analyses. J. Hered. 83:189–195.[ISI][Medline]

    Hillis, D. M., J. P. Huelsenbeck, and C. W. Cunningham. 1994. Application and accuracy of molecular phylogenies. Science 164:671–677.

    Huelsenbeck, J. P. 1997. Is the Felsenstein zone a fly trap? Syst. Biol. 46:69–74.

    ———. 1998. Systematic bias in phylogenetic analysis: is the Strepsiptera problem solved? Syst. Biol. 47:519–537.

    Huelsenbeck, J. P., D. M. Hillis, and R. Jones. 1996. Parametric bootstrapping in molecular phylogenetics: applications and performance. Pp. 19–45 in J. D. Ferraris and S. R. Palumbi, eds. Molecular zoology. Wiley and Sons, New York.

    Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.

    Khorana, H. G. 1992. Rhodopsin, photoreceptor of the rod cell. J. Biol. Chem. 267:1–4.[Free Full Text]

    Kimura, M. 1980. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120.[ISI][Medline]

    Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29:170–179.[ISI][Medline]

    Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. USA 91:1455–1459.

    Larhammar, D., and C. Risinger. 1994. Molecular genetic aspects of tetraploidy in the common carp, Cyprinus carpio. Mol. Phylogenet. Evol. 1:59–68.

    Lee, M. S. Y. 1999. Molecular phylogenies become functional. Trends Ecol. Evol. 14:177–178.[ISI][Medline]

    Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11:605–612.[Free Full Text]

    Muse, S. V. 1996. Estimating synonymous and nonsynonymous substitution rates. Mol. Biol. Evol. 13:105–114.[Abstract]

    Nathans, J. 1992. Rhodopsin: structure, function, and genetics. Biochemistry 31:4923–4931.

    Naylor, G. J. P., and W. M. Brown. 1997. Structural biology and phylogenetic estimation. Nature 388:527–528.

    ———. 1998. Amphioxus mitochondrial DNA, chordate phylogeny, and the limits of inference based on comparisons of sequences. Syst. Biol. 47:61–76.[ISI][Medline]

    Poe, S. 1998. The effect of taxonomic sampling on accuracy of phylogeny estimation: test case of a known phylogeny. Mol. Biol. Evol. 15:1086–1090.[Free Full Text]

    Prager, E. M., and A. C. Wilson. 1988. Ancient origin of lactalbumin from lysozyme: analysis of DNA and amino acid sequences. J. Mol. Evol. 27:326–335.[ISI][Medline]

    Rannala, B., J. P. Huelsenbeck, Z. Yang, and R. Nielsen. 1998. Taxon sampling and the accuracy of large phylogenies. Syst. Biol. 47:702–710.[ISI][Medline]

    Saccone, C., G. Pesole, and G. Preparata. 1989. DNA microenvironments and the molecular clock. J. Mol. Evol. 29:407–411.[ISI][Medline]

    Sakmar, T. P. 1998. Rhodopsin: a prototypical G protein-coupled receptor. Prog. Nucleic Acid Res. Mol. Biol. 59:1–34.[ISI][Medline]

    Shields, D. C., P. M. Sharp, D. G. Higgins, and F. Wright. 1988. "Silent" sites in Drosophila genes are not neutral: evidence of selection among synonymous codons. Mol. Biol. Evol. 5:704–716.[Abstract]

    Sidow, A., and A. C. Wilson. 1990. Compositional statistics: an improvement of evolutionary parsimony and its deep branches in the tree of life. J. Mol. Evol. 31:51–68.[ISI][Medline]

    Sogin, M. L., G. Hinkle, and D. D. Lelpe. 1993. Universal tree of life. Nature 362:795.

    Steel, M. 1994. Recovering a tree from the Markov leaf colourations it generates under a Markov model. Appl. Math. Lett. 7:19–23.

    Sullivan, J., and D. L. Swofford. 1997. Are guinea pigs rodents? The importance of adequate models in molecular phylogenetics. J. Mamm. Evol. 4:77–86.

    Swofford, D. L. 1999. PAUP*, phylogenetic analysis using parsimony (*and other methods). Version 4.0. Sinauer, Sunderland, Mass.

    Takezaki, N., and T. Gojobori. 1999. Correct and incorrect vertebrate phylogenies obtained by the entire mitochondrial DNA sequences. Mol. Biol. Evol. 16:590–601.[Abstract]

    Templeton, A. R. 1983. Phylogenetic inference from restriction endonuclease cleavage site maps with particular reference to the humans and apes. Evolution 37:221–244.

    Townson, S. M., B. S. W. Chang, E. Salcedo, L. Chadwell, N. E. Pierce, and S. G. Britt. 1998. Isolation and physiological characterization of the genes encoding the blue and ultraviolet sensitive opsins of the honeybee, Apis mellifera. J. Neurosci. 18:2412–2422.[Abstract/Free Full Text]

    Van Den Bussche, R. A., R. J. Baker, J. P. Huelsenbeck, and D. M. Hillis. 1998. Base compositional bias and phylogenetic analyses: a test of the "flying DNA" hypothesis. Mol. Phylogenet. Evol. 10:408–416.[ISI][Medline]

    Wright, F. 1990. The ‘effective number of codons' used in a gene. Gene 87:23–29.

    Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396–1401.[Abstract]

    ———. 1994. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39:105–111.[ISI][Medline]

    ———. 1996. Phylogenetic analysis using parsimony and likelihood methods. J. Mol. Evol. 42:294–307.[ISI][Medline]

    ———. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.[Medline]

    Yang, Z., N. Goldman, and A. Friday. 1994. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316–324.[Abstract]

Accepted for publication April 14, 2000.