The Effect of Positive Selection on a Sexual Reproduction Gene in Thalassiosira weissflogii (Bacillariophyta): Results Obtained from Maximum-Likelihood and Parsimony-Based Methods

Ulf Sorhannus

Department of Biology and Health Services, Edinboro University of Pennsylvania

Correspondence: E-mail: usorhannus@edinboro.edu.


    Abstract
 
Maximum-likelihood–based and parsimony-based methods were used to test for potential effects of positive selection on the sexually induced gene 1 (Sig1) in Thalassiosira weissflogii. The Sig proteins are thought to play a role in mediating sperm-egg recognition during the sexual reproduction phase. The parsimony-based analyses indicated that none of the amino acid sites were influenced by positive selection. The maximum-likelihood analyses indicated that positive selection was affecting a minimum of four and a maximum of seven amino acid sites in the polypeptide derived from Sig1. It was concluded that the results obtained from the maximum-likelihood–based method are more reliable than those obtained from the parsimony-based approach. This is apparently the first study to show that reproductive proteins in unicellular eukaryotes are influenced by positive selection.

Key Words: Sexual reproduction gene • likelihood • parsimony • positive selection • synonymous substitution • nonsynonymous substitution


    Introduction
 
Genes encoding reproductive proteins seem to have evolved rapidly in taxa ranging from unicellular eukaryotes to humans (e.g., Civetta and Singh 1995; Swanson and Vacquier 1995; Tsaur and Wu 1997; Vacquier 1998; Wyckoff, Wang, and Wu 2000; Armbrust and Galindo 2001; Swanson et al. 2001). Positive selection has often been inferred to be an important evolutionary force driving the accelerated diversification in reproductive proteins (e.g., Swanson and Vacquier 1995; Tsaur and Wu 1997; Wyckoff, Wang, and Wu 2000; Swanson et al. 2001). Many sites influenced by positive selection are located in regions involved in the species-specific sperm-egg interaction (e.g., Swanson et al. 2001). Positive Darwinian selection can be demonstrated by showing that the nonsynonymous (dN) substitution rate is significantly higher than the synonymous (dS) substitution rate (i.e., dN/dS > 1). When dN/dS equals 1 or is significantly less than 1, neutral evolution and purifying selection can be inferred, respectively.

The centric diatom Thalassiosira weissflogii analyzed here forms flagellated spermatozoa and egg cells that must recognize each other when released among a multitude of other cells (i.e., vegetative cells and gametes of other species) (Armbrust and Galindo 2001). Armbrust (1999) has identified three sexually induced genes (Sig1, Sig2, and Sig3) in T. weissflogii, which are thought to play a role in sperm-egg recognition. There are at least 10 unique copies of Sig1 present in an individual (Armbrust and Galindo 2001).

In their study of the evolution of Sig1, Armbrust and Galindo (2001) conducted a gene-wide dN/dS ratio analysis and failed to detect evidence for positive selection. However, gene-wide dN/dS tests have little power compared with more recently developed likelihood and parsimony methods. Thus, maximum-likelihood–based (Yang et al. 2000) and parsimony-based (Suzuki and Gojobori 1999) analyses were carried out here.


    Data and Methods
 
A total of 40 partial Sig1 sequences (643 nucleotides), containing a portion of domains I/IV and domains II/III (Armbrust 1999), were obtained from GenBank. The accession numbers (sampling locations) were as follows: AF374501 to AF374505 (Long Island Sound, USA), AF374490 to AF374500 (Long Island Sound, USA), AF374506 to AF374510 (Skagerrak Sea, Norway), AF374521 to AF374525 (Del Mar Slough, California), AF374526 to AF374530 (North Atlantic, Portugal), AF374516 to AF374520 (King Kalakaua's Fishpond, Hawaii), AF374511 to AF374515 (Jakarta Harbor, Indonesia). Each location is represented by a clone. Sequences within each clone represent intraindividual variation. Four identical sequences and an intron located between two coding regions were removed from the data matrix before the analysis. The alignment of the remaining 36 coding sequences (558 nucleotides) (referred to as the "large data set") was unambiguous. To account for possible differences in selective pressures in different isolates, a smaller data set (referred to as the "small data set") containing 27 sequences from the Atlantic and California isolates was also analyzed. The gene tree (not shown) used in the positive selection analyses was constructed as a composite of the β-tubulin and Sig1 trees presented in Armbrust and Galindo (2001).

The software package DAMBE (version 4.0.98 [Xia 2000]) was employed to manage the data. Maximum-likelihood–based (Yang et al. 2000) and parsimony-based (Suzuki and Gojobori 1999) methods were used to detect potential effects of positive selection on the polypeptide derived from Sig1. The maximum-likelihood–based technique (Yang et al. 2000) was implemented by the CODEML program in the PAML package (version 3.11 [Yang 1997]). A set of likelihood models in CODEML allows for variable dN/dS ratios among sites (Yang et al. 2000). A likelihood ratio test was used to examine the data for positive selection, that is, for the presence of sites with dN/dS ratios significantly greater than 1. This was accomplished by comparing a null model that did not allow for variable dN/dS ratios among sites to a more general model that did. Model M0 (one-ratio) was contrasted with model M3 (discrete), and model M7 (β) was compared with model M8 (β and ω), to discover potential significant heterogeneity in dN/dS among sites (Yang et al. 2000). Since model M8 (β and ω) was prone to multiple local optima, two different initial dN/dS values, one greater than 1 and the other less than 1, were used in the analysis (Yang 2001). The run whose initial dN/dS value gave the highest likelihood was taken as the best result. A significance level of 5% was used to test for positive selection. Bayes' theorem was used to calculate the posterior probabilities (confidence probability level = 95%) that sites with dN/dS > 1 were influenced by positive selection (Yang et al. 2000).
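
The likelihood ratio test described above can be sketched as follows. This is a minimal illustration, not the CODEML implementation: the function names and the log-likelihood values are hypothetical, and the degrees of freedom (4 for M0 versus M3 with three site classes, 2 for M7 versus M8) are assumed from the parameter counts of the models in Yang et al. (2000). Twice the log-likelihood difference is compared with a chi-square distribution; for even degrees of freedom the upper tail has a closed form, so no statistics library is needed.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variate with even df.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(lnl_null, lnl_alt, df, alpha=0.05):
    """Return (test statistic, P value, significant?) for nested models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    p_value = chi2_sf_even_df(stat, df)
    return stat, p_value, p_value < alpha


# Hypothetical log-likelihoods, NOT values from this study.
stat, p_value, significant = likelihood_ratio_test(-1250.0, -1240.0, df=2)
print(f"2*delta lnL = {stat:.2f}, P = {p_value:.2e}, significant = {significant}")
```

The same call with df=4 would correspond to the M0 versus M3 comparison under the assumption stated above.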

The parsimony-based method (Suzuki and Gojobori 1999) was implemented by the computer program ADAPTSITE (Suzuki, Gojobori, and Nei 2001). This method was used to infer the ancestral codons for all the internal nodes of the gene tree. Then the total numbers of nonsynonymous (cN) and synonymous (cS) substitutions per codon site and the average numbers of nonsynonymous (sN) and synonymous (sS) sites per codon site were computed (Suzuki and Nei 2002). The null hypothesis of neutral evolution was tested under the assumption that cS and cN are binomially distributed and that the probabilities of occurrence of synonymous and nonsynonymous substitutions are sS/(sS + sN) and sN/(sS + sN), respectively (Suzuki and Nei 2002). Statistically, positive selection can be inferred when cN/sN is significantly larger than cS/sS (Suzuki and Nei 2002). A significance level of 5% was used.
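
The per-site test described above can be sketched as a one-tailed binomial calculation. This is an illustration under stated assumptions, not the ADAPTSITE code: the function names and the example counts are hypothetical, and the one-tailed form (the probability of observing at least cN nonsynonymous changes among cN + cS changes) is an assumed reading of the test.

```python
from math import comb


def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def positive_selection_pvalue(cN, cS, sN, sS):
    """One-tailed P value for an excess of nonsynonymous changes at one codon.

    cN, cS: integer counts of nonsynonymous / synonymous substitutions.
    sN, sS: average numbers of nonsynonymous / synonymous sites at the codon.
    Under neutrality a change is nonsynonymous with probability sN / (sS + sN).
    """
    n = cN + cS
    p_nonsyn = sN / (sS + sN)
    return binomial_upper_tail(n, cN, p_nonsyn)


# Hypothetical codon: 4 nonsynonymous and 0 synonymous changes (not study data).
p_value = positive_selection_pvalue(cN=4, cS=0, sN=2.2, sS=0.8)
print(f"P = {p_value:.3f}")
```

With these hypothetical site counts (sN = 2.2, sS = 0.8), a codon with no synonymous changes would need at least 10 nonsynonymous changes before the P value dropped below 5%, which illustrates why a test applied separately to each site needs many substitutions per site to reach significance.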

Recombination events can create the appearance of parallel/convergent changes in different branches of the tree. CODEML will infer parallel/convergent substitutions as independent and could, as a result, give rise to faulty conclusions about positive selection. GENECONV (version 1.81 [Sawyer 1999]) was employed, using the default settings, to detect recombination events in the data set. This method searched for unusually long identical fragments within pairs of aligned sequences or pairwise segments within the alignment characterized by uncommonly high matching scores (Sawyer 1999). To evaluate the significance of the hypothesis that similar fragments arose by recombination, 10,000 randomly permuted data sets derived from the real alignment were generated (see Sawyer 1999 for additional details about the computations). The significance level was set at 5%.


    Results and Discussion
 
The results of the recombination analyses of the two data sets indicated that two Long Island sequences (Long2 and Long14), obtained from two different locations, have undergone a significant recombination event (global P-value = 0.024) involving a fragment 437 nucleotides long. When Long14 and Long2 were removed from the data sets, the recombination analyses showed no significant recombination events between the remaining sequences. Both data sets were analyzed for positive selection without Long2 and Long14 sequences.

The likelihood ratio tests for the "large" and the "small" data sets indicated that the selection models M3 (discrete) and M8 (β and ω) fitted the data significantly better than the null models M0 (one-ratio) and M7 (β), respectively (table 1). The M3 (discrete) and M8 (β and ω) models suggested that about 8% of the sites were under positive selection in both data sets (table 1). Calculations of posterior probabilities identified four amino acid sites (4, 42, 52, and 149) under positive selection in the "large data set" and seven sites (4, 9, 42, 52, 119, 149, and 182) in the "small data set" (table 2). Most of the replacement substitutions in the sites influenced by positive selection were found in the two Long Island isolates. However, two nonsynonymous changes took place in the lineages in the Pacific Ocean, one in the California isolate and the other in the Hawaiian isolate. Replacement substitutions in amino acid site number 4 were the most widely distributed since they occurred in both the Atlantic and Pacific oceans.
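
The posterior calculation behind these site predictions can be sketched as follows. This is a minimal illustration of Bayes' theorem applied to site classes, not the CODEML code: the function names, the dN/dS values, the class proportions, and the per-class site likelihoods are all hypothetical and were chosen only for the example.

```python
def site_class_posteriors(proportions, site_likelihoods):
    """Posterior probability of each site class for one codon site.

    proportions: prior proportion of sites in each dN/dS class.
    site_likelihoods: likelihood of the site's data given each class.
    """
    joint = [p * f for p, f in zip(proportions, site_likelihoods)]
    total = sum(joint)
    if total <= 0.0:
        raise ValueError("site likelihoods must give a positive total")
    return [j / total for j in joint]


def is_positively_selected(omegas, posteriors, cutoff=0.95):
    """True if the classes with dN/dS > 1 carry at least `cutoff` posterior mass."""
    prob = sum(post for w, post in zip(omegas, posteriors) if w > 1.0)
    return prob >= cutoff


# Hypothetical three-class model and one hypothetical site (not study data).
omegas = [0.05, 0.8, 4.0]
proportions = [0.70, 0.22, 0.08]
site_likelihoods = [1e-6, 1e-5, 2e-3]

posteriors = site_class_posteriors(proportions, site_likelihoods)
print(posteriors, is_positively_selected(omegas, posteriors))
```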


 
Table 1 Likelihood Values and Parameter Estimates.

 

 
Table 2 Likelihood Ratio Test of Positive Selection in Sig1.

 
The results obtained from the parsimony-based analyses did not identify any amino acid sites influenced by positive selection. According to this method, the amino acid sites identified by the likelihood analyses were affected neither by positive nor negative selection pressures. However, there were nine codon sites (8, 10, 34, 47, 49, 76, 86, 116, and 181) that the parsimony-based method identified as being influenced by negative selection.

Simulation studies and analyses of real data by Suzuki and Nei (2001, 2002) suggested that positively selected amino acid sites are more reliably inferred by parsimony-based methods than by likelihood-based methods. Suzuki and Nei (2002) concluded that the parsimony-based method tended to be conservative, whereas the maximum-likelihood–based technique appeared to be liberal in the interpretation of the presence of positively selected sites. An analysis of the human leukocyte antigen (HLA) was taken to support their conclusion (Suzuki and Nei 2001). However, the results obtained from the likelihood analyses of the HLA data by Suzuki and Nei (2001) appear to be problematic, as simpler models had much higher likelihood values than the more general models, and multiple runs led to many different sets of parameter estimates. Yang and Swanson (2002) analyzed a similar data set of MHC alleles, and the results were all sensible. The parsimony-based method is expected to lack power for "smaller" data sets because the technique performs a separate statistical test on each amino acid site. Thus, the failure of the method to detect sites influenced by positive selection here is not surprising.

Extensive simulations performed by Anisimova, Bielawski, and Yang (2001, 2002) and a review paper by Yang (2002) have suggested that the maximum-likelihood–based approach implemented here is in general usable and robust. Predictions of positively selected sites are expected to be unreliable when sequences are very similar and the number of lineages small (e.g., tree length less than 0.12 substitutions per codon and number of lineages less than seven) (Anisimova, Bielawski, and Yang 2002). In this study, the tree length derived from the "large data set" (34 sequences) was 0.5 substitutions per codon and for the "small data set" (25 sequences), 0.3 substitutions per codon. Since the tree length and the number of sequences are clearly larger than 0.11 and six, respectively, the results are expected to be reliable. As far as the author is aware, this is the first study that has shown that reproductive proteins in unicellular eukaryotes are influenced by positive selection.


    Acknowledgements
 
I thank Ziheng Yang, Simon D. W. Frost, Pekka Pamilo, and anonymous reviewers for valuable comments and suggestions. The author is also grateful to Simon D. W. Frost for providing a Python interface to the ADAPTSITE program.


    Footnotes
 
Pekka Pamilo, Associate Editor


    Literature Cited
 

    Anisimova, M., J. P. Bielawski, and Z. Yang. 2001. The accuracy and power of likelihood ratio tests to detect positive selection at amino acid sites. Mol. Biol. Evol. 18:1585-1592.

    Anisimova, M., J. P. Bielawski, and Z. Yang. 2002. Accuracy and power of Bayes prediction of amino acid sites under positive selection. Mol. Biol. Evol. 19:950-958.

    Armbrust, E. V. 1999. Identification of a new gene family expressed during the onset of sexual reproduction in the centric diatom Thalassiosira weissflogii. Appl. Environ. Microbiol. 65:3121-3128.

    Armbrust, E. V., and H. M. Galindo. 2001. Rapid evolution of a sexual reproduction gene in centric diatoms of the genus Thalassiosira. Appl. Environ. Microbiol. 67:3501-3513.

    Civetta, A., and R. S. Singh. 1995. High divergence of reproductive tract proteins and their association with postzygotic reproductive isolation in Drosophila melanogaster and Drosophila virilis group species. J. Mol. Evol. 41:1085-1095.

    Sawyer, S. A. 1999. GENECONV: a computer package for the statistical detection of gene conversion. Distributed by the author, Department of Mathematics, Washington University in St. Louis, available at http://www.math.wustl.edu/~sawyer.

    Suzuki, Y., and T. Gojobori. 1999. A method for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 16:1315-1328.

    Suzuki, Y., T. Gojobori, and M. Nei. 2001. ADAPTSITE: detecting natural selection at single amino acid sites. Bioinformatics 17:660-661.

    Suzuki, Y., and M. Nei. 2001. Reliabilities of parsimony-based and likelihood-based methods for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 18:2179-2185.

    Suzuki, Y., and M. Nei. 2002. Simulation study of the reliability and robustness of the statistical methods for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 19:1865-1869.

    Swanson, W. J., and V. D. Vacquier. 1995. Extraordinary divergence and positive Darwinian selection in a fusagenic protein coating the acrosomal process of abalone spermatozoa. Proc. Natl. Acad. Sci. USA 92:4957-4961.

    Swanson, W. J., Z. Yang, M. F. Wolfner, and C. F. Aquadro. 2001. Positive Darwinian selection drives the evolution of several female reproductive proteins in mammals. Proc. Natl. Acad. Sci. USA 98:2509-2514.

    Tsaur, S. C., and C. I. Wu. 1997. Positive selection and the molecular evolution of a gene of male reproduction, Acp26Aa of Drosophila. Mol. Biol. Evol. 14:544-549.

    Vacquier, V. D. 1998. Evolution of gamete recognition proteins. Science 281:1995-1998.

    Wyckoff, G. J., W. Wang, and C. I. Wu. 2000. Rapid evolution of male reproductive genes in the descent of man. Nature 403:304-309.

    Xia, X. 2000. Data analysis in molecular biology and evolution. Kluwer Academic Publishers, Boston.

    Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555-556.

    Yang, Z. 2001. Phylogenetic analysis by maximum likelihood (PAML). Version 3.11. University College London.

    Yang, Z. 2002. Inference of selection from multiple species alignments. Curr. Opin. Genet. Dev. 12:688-694.

    Yang, Z., R. Nielsen, N. Goldman, and A. M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449.

    Yang, Z., and W. J. Swanson. 2002. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol. Biol. Evol. 19:49-57.

Accepted for publication April 9, 2003.