Genomic Correlates of Hyperthermostability, an Update*

Karsten Suhre and Jean-Michel Claverie‡

From the Structural and Genomic Information Laboratory, UPR 2589-CNRS, Institute of Structural Biology and Microbiology, CNRS, Marseille, France

Received for publication, February 6, 2003

    ABSTRACT

It has been shown (Cambillau, C., and Claverie, J. M. (2000) J. Biol. Chem. 275, 32383-32386) that a large difference between the proportions of charged versus polar (non-charged) amino acids (CvP-bias) is an adequate, if empirical, signature of the proteome of hyperthermophilic organisms (growth temperature >80 °C). Since that study, the number of available microbial genomes has more than doubled, raising the possibility that the simple CvP-bias rule might no longer hold. Taking advantage of the new sequence data, we re-analyzed the genomes of 9 fully sequenced thermophiles, 9 hyperthermophiles, and 53 mesophiles to identify the genomic correlates of hyperthermostability on a wider data set. Our new results confirm that the CvP-bias previously identified on a much smaller data set still holds. Moreover, we show that it is an optimal criterion, in the sense that it corresponds to the most discriminating factor between hyperthermophilic and mesophilic microorganisms in a principal component analysis. In parallel, we evaluated two other recently proposed correlates of hyperthermostability, the proteome average pI and the dinucleotide statistical index (Kawashima, T., Amano, N., Koike, H., Makino, S., Higuchi, S., Kawashima-Ohya, Y., Watanabe, K., Yamazaki, M., Kanehori, K., Kawamoto, T., Nunoshiba, T., Yamamoto, Y., Aramaki, H., Makino, K., and Suzuki, M. (2000) Proc. Natl. Acad. Sci. U. S. A. 97, 14257-14262). We show that the CvP-bias is the sole criterion able to clearly discriminate hyperthermophilic from mesophilic microorganisms on a global genomic basis.

    INTRODUCTION

Although most organisms grow at temperatures ranging between 20 and 50 °C, several archaea and a few bacteria, such as Pyrococcus and Aquifex, have been found capable of withstanding temperatures close to or higher than 100 °C. Identification of the molecular basis of the increased thermostability of the proteins of such hyperthermophilic organisms is expected to help our understanding of protein folding as well as the design of enzymes retaining their activity at high temperatures (Ref. 1 and references therein). In a previous comparative study, Cambillau and Claverie (2) found that a large difference between the proportions of charged (Asp, Glu, Lys, Arg) versus polar (non-charged) (Asn, Gln, Ser, Thr) amino acids (abbreviated as CvP-bias) was the most prominent signature of the hyperthermophilic life style at the proteome level. This global CvP-bias was reflected in the amino acid composition of the water-accessible residues computed from an analysis of the surface of 131 mesophilic versus 58 hyperthermophilic proteins.
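The CvP-bias defined above can be illustrated with a minimal Python sketch. It assumes that residues are pooled over all protein sequences of a proteome before percentages are taken, and that non-standard residue codes are ignored; the function name cvp_bias and the toy sequences are hypothetical and are not taken from the original analysis.

```python
from collections import Counter

STANDARD = "ACDEFGHIKLMNPQRSTVWY"
CHARGED = "DEKR"  # Asp, Glu, Lys, Arg
POLAR = "NQST"    # Asn, Gln, Ser, Thr (polar, non-charged)


def cvp_bias(proteins):
    """Return (charged %, polar %, CvP-bias) pooled over all sequences.

    Assumption: residues are pooled across proteins; codes outside the
    20 standard amino acids (X, B, Z, *) are ignored.
    """
    counts = Counter()
    for seq in proteins:
        counts.update(seq.upper())
    total = sum(counts[aa] for aa in STANDARD)
    if total == 0:
        raise ValueError("no standard residues found")
    charged = 100.0 * sum(counts[aa] for aa in CHARGED) / total
    polar = 100.0 * sum(counts[aa] for aa in POLAR) / total
    return charged, polar, charged - polar


# Toy input (hypothetical sequences, for illustration only).
toy_proteome = ["MKKDEERLLA", "MSTNQAAVLG"]
charged_pct, polar_pct, bias = cvp_bias(toy_proteome)
print(f"charged={charged_pct:.1f}% polar={polar_pct:.1f}% CvP-bias={bias:.1f}")
```

Averaging per-protein percentages instead of pooling residues would weight short proteins more heavily; which convention was used in the original study is not stated here.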

Given the rapidly increasing number of fully sequenced microbial genomes (more than doubled since the initial study), inferences derived in the past from correlation studies on a limited set of examples are in constant danger of being proven wrong. The genomes of seven new thermophilic and hyperthermophilic archaea and of three new thermophilic bacteria have now been deciphered. Besides these thermophilic organisms, numerous new extremophilic organisms, such as the halophilic archaeon Halobacterium sp., the halophilic bacterium Sinorhizobium meliloti, and the alkaliphilic bacterium Bacillus halodurans, have been added to the list, as well as a large number of mesophilic bacteria and archaea (Table I). With these new genomes from a wider evolutionary spectrum, any previously derived "life style" criterion is at risk of failure in the form of false positive (classifying a mesophile as a hyperthermophile) or false negative (classifying a hyperthermophile as a mesophile) predictions. In particular, the newly added extremophile sequence data allow us to investigate whether the bias in favor of charged rather than polar residues previously observed in the proteome of hyperthermophiles by Cambillau and Claverie (2) is truly specific to the adaptation to high temperatures or whether it could also be linked to other extreme environments.

This new body of sequence data also gives us an opportunity to reassess the conclusions of several other studies. By systematically comparing the genomic sequence of the archaeon Thermoplasma volcanium with seven other archaeal genomic sequences, all from organisms with a higher optimal growth temperature (OGT), Kawashima et al. (3) identified a number of strong correlations between some characteristics of genome organization and the OGT. For instance, they reported that the J2 index (computed from the dinucleotide frequency, see "Materials and Methods") increased with the OGT. They also noticed some characteristic changes in the distribution of the isoelectric points (pI) of the proteins: with increasing OGT, the fraction of the basic protein subset (pI >7) becomes larger and with it the genome-average pI.

In a different work, Kreil and Ouzounis (4) used hierarchical clustering and principal component analysis to identify the factors affecting the global amino acid composition of the predicted proteome of 6 thermophilic archaea, 2 thermophilic bacteria, 17 mesophilic bacteria, and 2 eukaryotic species. They concluded that the G + C content is indeed a dominant discriminating property but unrelated to any preference for a thermophilic lifestyle. They also noticed that thermophilic species could be identified by their global amino acid compositions alone, albeit without precisely defining the nature of this discriminating power.

In this study, we applied a similar approach to a much larger genomic data set. This allows us to show that the most discriminating factor between hyperthermophile and mesophile genomes remains the absolute difference between the frequency of charged and polar amino acid residues (CvP-bias). Furthermore, we show that a high J2 index value is a necessary, but not sufficient, criterion for the recognition of a hyperthermophilic proteome, whereas the previously proposed genome-average pI is not a satisfactory correlate of hyperthermostability.

    MATERIALS AND METHODS

A total of 71 fully sequenced genomes was analyzed (Table I), including 53 mesophiles (<50 °C) (50 bacteria, 3 archaea), 9 thermophiles (50-80 °C) (4 bacteria, 5 archaea), and 9 hyperthermophiles (>80 °C) (1 bacterium, 8 archaea). Of those genome sequences, 69 were downloaded from GenBank, and 2 unfinished genomes, Carboxydothermus hydrogenoformans and Bacillus stearothermophilus, were obtained from The Institute for Genomic Research (TIGR, Rockville, MD) and from the University of Oklahoma, Norman, OK, respectively. The proteomes of the various organisms were defined according to the open reading frame (ORF) annotation available in GenBank. For the two unfinished genomes, the proteomes were defined as the subsets of ORFs longer than 200 amino acids. Transmembrane segments were then predicted using the simple algorithm of Kyte and Doolittle (5). For each organism, the "soluble" moiety of the proteome was then defined by discarding all proteins (ORFs) containing at least two predicted transmembrane segments. The subsequent statistical analyses were all performed on these predicted soluble proteins.
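The filtering step described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the text does not give the window length or cutoff, so the sketch assumes the commonly quoted Kyte-Doolittle setting (a 19-residue window with a mean hydropathy of at least 1.6) and counts each contiguous run of qualifying windows as one segment. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale (Ref. 5).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def count_tm_segments(seq, window=19, threshold=1.6):
    """Count runs of consecutive windows whose mean hydropathy >= threshold.

    Assumptions: window=19 and threshold=1.6 are conventional values,
    not parameters reported in the text; unknown residues score 0.
    """
    seq = seq.upper()
    segments = 0
    in_segment = False
    for start in range(len(seq) - window + 1):
        mean = sum(KD.get(aa, 0.0) for aa in seq[start:start + window]) / window
        if mean >= threshold:
            if not in_segment:
                segments += 1
            in_segment = True
        else:
            in_segment = False
    return segments


def soluble_subset(proteins):
    """Discard proteins with at least two predicted transmembrane segments."""
    return [p for p in proteins if count_tm_segments(p) < 2]


# Hypothetical toy sequences: two hydrophobic stretches, one, and none.
two_tm = "L" * 25 + "D" * 25 + "L" * 25
one_tm = "D" * 20 + "L" * 25 + "D" * 20
no_tm = "D" * 50
print([count_tm_segments(s) for s in (two_tm, one_tm, no_tm)])
```

Proteins with a single predicted hydrophobic segment are kept, which matches the "at least two" rule in the text; such a segment may, for example, be a signal peptide rather than a membrane anchor.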

Following Kawashima et al. (3), the genome sequences were used to compute the J2 index as $J_2 = F_{YY} + F_{RR} - F_{YR} - F_{RY}$, where $F_{YY}$ designates the relative frequency of pyrimidine dinucleotides (TT, TC, CT, CC), $F_{RR}$ the relative frequency of purine dinucleotides (AA, AG, GA, GG), and $F_{YR}$ and $F_{RY}$ the relative frequencies of the corresponding mixed pyrimidine/purine combinations. Pure purine and pyrimidine dinucleotides were found to be more frequent than their mixed counterparts in hyperthermophilic Pyrococci (3).
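A minimal sketch of the J2 computation is given below. It assumes that overlapping dinucleotides are counted along a single strand and that dinucleotides containing an ambiguous base (such as N) are skipped; the exact counting convention of Kawashima et al. (3) is not restated here, and the function name j2_index is hypothetical.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def base_class(base):
    """Map a nucleotide to 'R' (purine), 'Y' (pyrimidine), or None."""
    if base in PURINES:
        return "R"
    if base in PYRIMIDINES:
        return "Y"
    return None


def j2_index(genome):
    """J2 = F_YY + F_RR - F_YR - F_RY over overlapping dinucleotides.

    Assumptions: single-strand, overlapping counts; dinucleotides with
    an ambiguous base are ignored.
    """
    genome = genome.upper()
    counts = {"YY": 0, "RR": 0, "YR": 0, "RY": 0}
    for first, second in zip(genome, genome[1:]):
        a, b = base_class(first), base_class(second)
        if a is None or b is None:
            continue
        counts[a + b] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous dinucleotides found")
    return (counts["YY"] + counts["RR"] - counts["YR"] - counts["RY"]) / total


# Toy sequences: a purine run, an alternating sequence, and a mixed one.
print(j2_index("AAAA"), j2_index("ACAC"), j2_index("AACC"))
```

With this definition J2 ranges from -1 (strictly alternating purine/pyrimidine) to +1 (purine-only or pyrimidine-only runs), and a sequence with no neighbor preference gives a value near 0.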

The program "iep" from the European Molecular Biology Open Software Suite (EMBOSS) was used to calculate the theoretical pI value for every soluble protein and, from those, the average pI for every genome. Principal component analysis was performed using the statistical package R. 71 genomes from meso-, thermo-, and hyperthermophilic organisms were included. When multiple genome sequences from closely related strains or species were available, only one of them was included in the analysis to avoid introducing a potential statistical bias.
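For readers without access to EMBOSS, the idea behind a theoretical pI and its proteome average can be sketched as follows. This is not the iep program: the pKa values are assumed illustrative values, the net charge follows the Henderson-Hasselbalch relation for each ionizable group, and the pI is located by bisection. All function names are hypothetical.

```python
# Assumed pKa values for illustration; not claimed to match EMBOSS iep.
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(seq, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10.0 ** (ph - N_TERM_PKA))
    charge -= 1.0 / (1.0 + 10.0 ** (C_TERM_PKA - ph))
    for aa in seq.upper():
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10.0 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10.0 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def isoelectric_point(seq, tol=1e-4):
    """Find the pH of zero net charge by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(seq, mid) > 0.0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


def mean_pi(proteins):
    """Average theoretical pI over a list of protein sequences."""
    return sum(isoelectric_point(p) for p in proteins) / len(proteins)


# Hypothetical toy peptides: one basic, one acidic.
print(isoelectric_point("KKKKAG"), isoelectric_point("DDDDAG"))
```

Bisection works here because the net charge decreases monotonically with pH, so there is a single zero crossing between pH 0 and pH 14.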

    RESULTS

Hyperthermophiles Do Exhibit a Specific Proteome Composition Signature-- The computed amino acid compositions for 9 fully sequenced hyperthermophilic and 53 mesophilic bacteria and archaea are presented in Table I and Fig. 1, a-c. The case of the 9 moderately thermophilic genomes will be discussed later. As previously observed by Cambillau and Claverie (Fig. 2, a-c in Ref. 2) but now confirmed on a much larger data set, the proteins of hyperthermophiles exhibit a strong bias for the use of charged residues at the expense of polar residues. This includes strongly reduced frequencies for the thermolabile amino acid residues Gln and Asn, as also noted by Vieille et al. (6). Among other trends, the aliphatic residue Val appears to be preferred in hyperthermophiles, whereas the tiny non-polar non-charged residue Ala is avoided.


                              
Table I
Properties of the 71 genomes analyzed here, ranked by decreasing CvP-bias (see "Materials and Methods" for details). Mesophiles (OGT <55 °C) are highlighted in blue, thermophiles (OGT <80 °C) in orange, and hyperthermophiles in red.


Fig. 1.   a, plot of the percentages of charged amino acids (Asp, Glu, Lys, Arg, blue), polar non-charged amino acids (Asn, Gln, Ser, Thr, green), and of the difference of the two (CvP-bias, red). Blue, orange, and red vertical lines identify the mesophiles, thermophiles, and hyperthermophiles, respectively. b, plot of the percentages of the various amino acids in mesophiles (blue), thermophiles (orange), and hyperthermophiles (red). c, plot of the various amino acid classes; colors as in panel b.

Principal component analysis (PCA) offers a way to analyze the data set more objectively and possibly to identify more intricate relationships between amino acid frequencies and the adaptation of proteins to high temperature. Our PCA (Fig. 2) is consistent with earlier results obtained with a much smaller data set by Kreil and Ouzounis (4): 85% of the data variance is accounted for by the first two principal components. The first component (72% of the variance) strongly correlates with the G + C content of the genomes (Fig. 3a, r² = 0.94). The second component (13% of the variance) turns out to be the sole discriminating factor between meso- and hyperthermophilic lifestyles, correlating quite well with the optimal growth temperature (OGT) (Fig. 3c, r² = 0.72). All other PCA components each account for no more than 3.7% of the variance.
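The PCA reported here was performed in R. As a language-neutral illustration of what the percentages of variance mean, the following pure-Python sketch extracts the leading components of a covariance matrix by power iteration with deflation. It assumes an unscaled (covariance-based) PCA on a genomes-by-features matrix; whether the original analysis scaled its variables is not stated, and the function name pca and the toy data are hypothetical.

```python
import math
import random


def _matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def pca(rows, n_components=2, iterations=1000, seed=0):
    """Return (components, variance fractions, scores).

    Assumption: covariance-based PCA (columns centered, not scaled),
    computed by power iteration followed by deflation.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_var = sum(cov[i][i] for i in range(d))
    if total_var == 0.0:
        raise ValueError("data have zero variance")
    rng = random.Random(seed)  # random start avoids a degenerate direction
    components, fractions = [], []
    for _component in range(n_components):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(iterations):
            w = _matvec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, _matvec(cov, v)))
        components.append(v)
        fractions.append(eigenvalue / total_var)
        # Deflation: remove the variance captured by this component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(a * b for a, b in zip(c, comp)) for comp in components]
              for c in centered]
    return components, fractions, scores


# Toy matrix (hypothetical): most variance lies along the first column.
toy = [[-2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 1.0, 0.0]]
components, fractions, scores = pca(toy)
print([round(f, 3) for f in fractions])
```

A random starting vector is used deliberately: amino acid percentages sum to a constant, so a uniform starting vector would lie in the zero-variance direction of the covariance matrix and the iteration would not converge to the leading component.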


Fig. 2.   Biplot of a principal component analysis of the amino acid composition of all studied organisms. Blue dots identify mesophiles; orange dots identify thermophiles; red dots identify hyperthermophiles. Green vectors represent the position of the different amino acids in the biplot (only vectors with significant contributions to PCA components 1 and 2 are drawn). The black vector represents $F_{CvP} = F_K + F_R + F_D + F_E - F_N - F_Q - F_S - F_T$.


Fig. 3.   Correlation between various parameters. a, correlation between G + C content and PCA component 1: r² = 0.94. b, correlation between CvP-bias and PCA component 2: r² = 0.93. c, correlation between PCA component 2 and OGT: r² = 0.72. Red and blue lines indicate the separation between the hyperthermophile with the lowest OGT and the mesophile with the highest OGT. d, correlation between CvP-bias and OGT: r² = 0.68. e, correlation between J2 and OGT: r² = 0.27. f, correlation between average proteomic pI and OGT: r² = 0.01.

Because the first PCA component reflects the G + C content, we sought a biochemical interpretation of the second component. First, we found that this component exhibits a strong correlation with the previously defined CvP-bias (Fig. 3b, r² = 0.93). Second, once projected on the first two dimensions of PCA space, the (20-dimensional) vector representing the difference between the frequencies of charged and polar residues (CHA-POL) appears almost parallel to the second PCA dimension (Fig. 2, black vector). As shown in Fig. 3d, the CvP-bias also correlates with the OGT almost as well as the second PCA component (r² = 0.68 and 0.72, respectively). Finally, this parameter also successfully discriminates all hyperthermophilic microorganisms from all mesophilic ones. In conclusion, this analysis indicates that, among all possible combinations of G + C content and amino acid frequencies, the CvP-bias is a near-optimal discriminating quantity to characterize the hyperthermophilic proteins from evolutionarily diverse microorganisms.

In contrast to the CvP-bias, other previously proposed quantities thought to characterize the hyperthermophilic life style do not fare well when confronted with this expanded proteome data set. Fig. 3, e and f, shows that both the J2 index and the average pI now fail to correctly discriminate between hyperthermophiles and mesophiles. In fact, both parameters correlate only weakly, if at all, with OGT (r² = 0.27 for J2 and r² = 0.01 for average pI). However, all hyperthermophiles have a J2 index greater than 0.06, so that a high J2 index can be considered a necessary, but not a sufficient, criterion for the identification of a hyperthermophilic genome.
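The correlation values quoted in this section can be reproduced from a table of per-genome values with a few lines of Python. The sketch below assumes that the reported statistic is the squared Pearson correlation coefficient (equivalently, the coefficient of determination of a simple linear fit); the function name r_squared and the toy numbers are hypothetical.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long sequences.

    Assumption: this is the statistic reported in Fig. 3.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("a constant sequence has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


# Toy data (hypothetical): a perfectly linear and an uncorrelated pair.
print(r_squared([1, 2, 3], [2, 4, 6]))
print(r_squared([1, 2, 3, 4], [1, -1, -1, 1]))
```

A value near 0, as reported for the average pI, means that a straight line through the data explains almost none of the variation in OGT; it does not by itself rule out a non-linear relationship.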

Moderately Thermophilic Organisms-- Because only a single moderately thermophilic organism (Methanobacterium thermoautotrophicum) had been sequenced at the time of the work of Cambillau and Claverie (2), little could be said about the properties of such organisms. Proteome data are now available for nine of them. Two (Thermotoga maritima and Thermoanaerobacter tengcongensis) have OGTs close to the threshold of 80 °C that we somewhat arbitrarily used to define hyperthermophilic organisms. It turns out that they both exhibit PCA2, CvP-bias, and J2 values that would allow their classification in continuity with previously defined hyperthermophiles.

This clear discrimination breaks down if we use an OGT of 75 °C as the hyperthermophilicity threshold. Down to this temperature, the value along the PCA2 coordinate remains a valid criterion, allowing two Sulfolobus species (OGT of 80 and 78 °C) to be classified in continuity with the other hyperthermophiles (PCA2 < -1.6 for OGT higher than 75 °C, and PCA2 > -1.3 for OGT lower than 75 °C; see Fig. 3c). However, the more straightforward and biochemically meaningful CvP-bias criterion is invalidated by three mesophiles with an OGT of 37 °C, Fusobacterium nucleatum (CvP = 9.06), Halobacterium sp. (CvP = 9.17), and Clostridium perfringens (CvP = 9.39), which exhibit larger values than the two Sulfolobus species (CvP = 8.90 and 8.80).
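The failure of the CvP-bias criterion at the 75 °C threshold can be checked directly from the values quoted above. The short sketch below flags every mesophile whose CvP-bias reaches the lowest value found among the organisms to be recognized; the function name false_positives is hypothetical, and only the CvP values given in the text are used.

```python
def false_positives(reference_values, candidates):
    """Names of candidates whose value reaches the lowest reference value."""
    cutoff = min(reference_values)
    return sorted(name for name, value in candidates.items() if value >= cutoff)


# CvP-bias values quoted in the text.
sulfolobus_cvp = [8.90, 8.80]
mesophile_cvp = {
    "Fusobacterium nucleatum": 9.06,
    "Halobacterium sp.": 9.17,
    "Clostridium perfringens": 9.39,
}

for name in false_positives(sulfolobus_cvp, mesophile_cvp):
    print(name)
```

Any single cutoff low enough to include both Sulfolobus species (CvP of 8.80 or below) therefore also admits these three mesophiles, which is why the separation holds only for the 80 °C threshold.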

    DISCUSSION

Using a much larger data set including many new thermophilic and mesophilic whole genome sequences, this follow-up study confirms the previous suggestion that the global replacement of polar residues (Asn, Gln, Ser, Thr) by charged residues (Asp, Glu, Lys, Arg) is the dominant proteome characteristic of microorganisms adapted to hyperthermophilic growth conditions (OGT >80 °C). This effect is observed for both bacteria and archaea and thus is not a simple consequence of phylogenetic relationship.

Even though the strict correspondence between the highest CvP-bias and the highest OGT breaks down below 80 °C, this property globally remains a characteristic of all thermophilic (OGT >55 °C) microorganisms, as shown in Fig. 1c, where the CvP-bias averaged over all mesothermophiles remains markedly higher than for mesophilic organisms. The influence of other adaptive strategies (e.g. to high salinity or extreme pH environments), together with the phylogenetic affinities of certain mesophiles to thermophilic organisms (7), probably contributes to weakening the CvP-bias signal, allowing a few false positives (such as Clostridium perfringens or Streptomyces coelicolor) to appear (Table I).

This study confirms that a strong CvP-bias is specifically associated with hyperthermophilic proteomes. This observation is consistent with the thermodynamic advantage resulting from the increased significance of Coulomb interactions with increasing temperature (as the dielectric constant of water decreases). The simultaneous increase of oppositely charged residues (mostly Arg, Lys, and Glu) further allows more ion pairs to be formed at the surface of hyperthermostable proteins (2). This rationale involving the stability of proteins in a high temperature aqueous environment is also supported by our observation that the proteins predicted to be associated with the membrane (and thus designed for hydrophobic environments) exhibit a much less significant CvP-bias (data not shown). Finally, the fact that a large number of diverse genomes confirms a statistical trend previously inferred from a much smaller data set argues that increased ion-pair formation is both a significant physico-chemical factor and the preferred evolutionary pathway toward thermostable soluble proteins.

    FOOTNOTES

* The costs of publication of this article were defrayed in part by the payment of page charges. The article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

‡ To whom correspondence should be addressed. Tel.: 33-4-91-16-45-48; Fax: 33-4-91-16-45-49; E-mail: Jean-Michel.Claverie@igs.cnrs-mrs.fr.

Published, JBC Papers in Press, February 24, 2003, DOI 10.1074/jbc.M301327200

    ABBREVIATIONS

The abbreviations used are: CvP, charged versus polar; OGT, optimal growth temperature; PCA, principal component analysis; G + C, guanine + cytosine.

    REFERENCES

1. Kumar, S., and Nussinov, R. (2001) Cell. Mol. Life Sci. 58, 1216-1233
2. Cambillau, C., and Claverie, J. M. (2000) J. Biol. Chem. 275, 32383-32386
3. Kawashima, T., Amano, N., Koike, H., Makino, S., Higuchi, S., Kawashima-Ohya, Y., Watanabe, K., Yamazaki, M., Kanehori, K., Kawamoto, T., Nunoshiba, T., Yamamoto, Y., Aramaki, H., Makino, K., and Suzuki, M. (2000) Proc. Natl. Acad. Sci. U. S. A. 97, 14257-14262
4. Kreil, D. P., and Ouzounis, C. A. (2001) Nucleic Acids Res. 29, 1608-1615
5. Kyte, J., and Doolittle, R. F. (1982) J. Mol. Biol. 157, 105-132
6. Vieille, C., Epting, K. L., Kelly, R. M., and Zeikus, J. G. (2001) Eur. J. Biochem. 268, 6291-6301
7. Brochier, C., and Philippe, H. (2002) Nature 417, 244


Copyright © 2003 by The American Society for Biochemistry and Molecular Biology, Inc.