Genomic Correlates of Hyperthermostability, an Update*
Karsten Suhre and Jean-Michel Claverie
From the Structural and Genomic Information Laboratory, UPR
2589-CNRS, Institute of Structural Biology and Microbiology,
CNRS, Marseille, France
Received for publication, February 6, 2003
 |
ABSTRACT |
It has been shown (Cambillau, C., and
Claverie, J. M. (2000) J. Biol. Chem. 275, 32383-32386) that a large difference between the proportions of
charged versus polar (non-charged) amino acids (CvP-bias) was an adequate, if empirical, signature of the
proteome of hyperthermophilic organisms
(Tgrowth >80 °C). Since that study, the number of available microbial genomes has more than doubled, raising the possibility that the simple CvP-bias rule might
no longer hold. Taking advantage of the new sequence data, we
re-analyzed the genomes of 9 fully sequenced thermophiles, 9 hyperthermophiles, and 53 mesophilic microorganisms to identify
the genomic correlates of hyperthermostability on a wider data set. Our
new results confirm that the CvP-bias previously identified
on a much smaller data set still holds. Moreover, we show that
it is an optimal criterion, in the sense that it corresponds to the
most discriminating factor between hyperthermophilic and
mesophilic microorganisms in a principal component analysis.
In parallel, we evaluated two other recently proposed correlates
of hyperthermostability, the proteome average pI and the dinucleotide
statistical index (Kawashima, T., Amano, N., Koike, H., Makino, S.,
Higuchi, S., Kawashima-Ohya, Y., Watanabe, K., Yamazaki, M.,
Kanehori, K., Kawamoto, T., Nunoshiba, T., Yamamoto, Y., Aramaki, H.,
Makino, K., and Suzuki, M. (2000) Proc. Natl. Acad. Sci.
97, 14257-14262). We show that the CvP-bias is the
sole criterion that is able to clearly discriminate hyperthermophilic
from mesophilic microorganisms on a global genomic basis.
 |
INTRODUCTION |
Although most organisms grow at temperatures ranging
between 20 and 50 °C, several archaea and a few bacteria, such as
Pyrococcus and Aquifex, have been found capable
of withstanding temperatures close to or higher than 100 °C.
Identification of the molecular basis of the increased thermostability
of the proteins of such hyperthermophilic organisms is expected to help
our understanding of protein folding as well as the design of enzymes
retaining their activity at high temperatures (Ref. 1 and references therein). In a previous comparative study, Cambillau and Claverie (2)
found that a large difference between the proportions of charged (Asp,
Glu, Lys, Arg) versus polar (non-charged) (Asn, Gln, Ser,
Thr) amino acids (abbreviated as
CvP-bias) was the
most prominent signature of the hyperthermophilic life style at the
proteome level. This global CvP-bias was reflected in the
amino acid composition of the water-accessible residues computed from
an analysis of the surface of 131 mesophilic versus 58 hyperthermophilic proteins.
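The CvP-bias defined above can be sketched in a few lines. This is a minimal illustration, assuming that residues are pooled over all proteins of a proteome before percentages are taken; the function name `cvp_bias` and the toy sequences are hypothetical and are not taken from the original analysis.

```python
# Minimal sketch of the CvP-bias: percentage of charged residues minus
# percentage of polar (non-charged) residues in a set of protein sequences.
# Assumption: residues are pooled over the whole proteome (not averaged
# per protein). Names and toy data are illustrative only.

CHARGED = set("DEKR")  # Asp, Glu, Lys, Arg
POLAR = set("NQST")    # Asn, Gln, Ser, Thr


def cvp_bias(proteins):
    """Return (charged %, polar %, CvP-bias) pooled over all sequences."""
    charged = polar = total = 0
    for seq in proteins:
        for aa in seq.upper():
            if not aa.isalpha():
                continue  # skip stop symbols, gaps, whitespace
            total += 1
            if aa in CHARGED:
                charged += 1
            elif aa in POLAR:
                polar += 1
    if total == 0:
        raise ValueError("no residues found")
    pct_charged = 100.0 * charged / total
    pct_polar = 100.0 * polar / total
    return pct_charged, pct_polar, pct_charged - pct_polar


# Toy "proteome": 7 charged and 5 polar residues out of 20.
toy_proteome = ["MKKEELRDVA", "MSTNQAGLSE"]
print(cvp_bias(toy_proteome))
```

A positive value means that charged residues outnumber polar ones; the criterion discussed in this article is based on how large that difference is.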
Given the rapidly increasing number of fully sequenced
microbial genomes (more than doubled since the initial study),
inferences derived in the past from correlation studies on a limited
set of examples are in constant danger of being proven wrong. Now the
genomes of seven new thermophilic and hyperthermophile archaea and of
three new thermophilic bacteria have been deciphered. Besides these
thermophilic organisms, numerous new extremophile organisms, such as
the halophilic archaeon Halobacterium sp., the halophilic bacterium
Sinorhizobium meliloti, and the alkaliphilic bacterium Bacillus halodurans have been added to the list,
as well as a large number of mesophilic bacteria and archaea (Table I).
With these new genomes from a wider evolutionary spectrum, any
previously derived "life style" criterion is at risk of failure in
the form of false positive (classifying a mesophile as
hyperthermophile) or false negative (classifying a hyperthermophile as
mesophile) predictions. In particular, the newly added extremophile
sequence data allow us to investigate whether the bias in favor of
charged rather than polar residues previously observed in the proteome of hyperthermophiles by Cambillau and Claverie (2) is truly specific to
the adaptation to high temperatures or whether it could also be linked
to other extreme environments.
This new body of sequence data also gives us an opportunity to reassess
the conclusions of several other studies. By systematically comparing
the archaeon Thermoplasma volcanium genomic sequence with
seven other genomic sequences of archaea, all exhibiting higher optimal
growth temperature (OGT), Kawashima et al. (3) identified a
number of strong correlations between some characteristics of genome
organization. For instance, they reported that the
J2 index (computed from the dinucleotide
frequency, see "Materials and Methods") was increasing together
with the OGT. They also noticed some characteristic changes in the
distribution of the isoelectric points (pI) of the proteins: with
increasing OGT, the fraction of the basic protein subset (pI >7)
becomes larger and with it the genome-average pI.
In a different work, Kreil and Ouzounis (4) used hierarchical
clustering and principal component analysis to identify the factors
affecting the global amino acid composition of the predicted proteome
of 6 thermophilic archaea, 2 thermophilic bacteria, 17 mesophilic
bacteria, and 2 eukaryotic species. They concluded that the G + C content is indeed a dominant discriminating property but unrelated to
any preference for a thermophilic lifestyle. They also noticed that
thermophilic species could be identified by their global amino acid
compositions alone, albeit without precisely defining the nature of
this discriminating power.
In this study, we applied a similar approach to a much
larger genomic data set. This allows us to show that the most
discriminating factor between hyperthermophile and
mesophile genomes remains the absolute difference between the frequency
of charged and polar amino acid residues (CvP-bias).
Furthermore, we show that a high J2 index value
is a necessary, but not sufficient, criterion for the recognition of a
hyperthermophilic proteome, whereas the previously proposed
genome-average pI is not a satisfactory correlate of hyperthermostability.
 |
MATERIALS AND METHODS |
A total of 71 fully sequenced genomes was analyzed (Table
I), including 53 mesophiles (<50 °C) (50 bacteria, 3 archaea), 9 thermophiles (50-80 °C) (4 bacteria, 5 archaea), and 9 hyperthermophiles (>80 °C) (1 bacterium, 8 archaea). Sixty-nine of those
genome sequences were downloaded from GenBank, and 2 unfinished genomes, Carboxydothermus hydrogenoformans and Bacillus stearothermophilus, were obtained from The Institute for Genomic
Research (TIGR, Rockville, MD) and from the University of Oklahoma,
Norman, OK, respectively. For the GenBank genomes, the proteomes were defined according to the available open reading frame (ORF) annotation. For the two unfinished
genomes, the proteomes were defined as the subsets of ORFs longer than
200 amino acids. Transmembrane segments were then predicted using the
simple algorithm of Kyte and Doolittle (5). For each organism, the
"soluble" moiety of the proteome was then defined by discarding all
proteins (ORFs) containing at least two predicted transmembrane
segments. The subsequent statistical analyses were all performed on
these predicted soluble proteins.
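The soluble-proteome filter described above can be sketched as follows. The hydropathy values are those of the Kyte and Doolittle scale; the window length of 19 residues and the mean-hydropathy cut-off of 1.6 are assumptions (values commonly quoted for that method), since the parameters actually used are not stated here. The function names are hypothetical.

```python
# Sketch of a Kyte-Doolittle style filter for "soluble" proteins.
# Assumptions (not stated in the text): window of 19 residues, a window
# counts as membrane-like when its mean hydropathy is >= 1.6, and each run
# of consecutive membrane-like windows is counted as one segment.

KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def count_tm_segments(seq, window=19, threshold=1.6):
    """Count runs of consecutive windows whose mean hydropathy >= threshold."""
    scores = [KD.get(aa, 0.0) for aa in seq.upper()]  # unknown residues -> 0
    segments = 0
    in_segment = False
    for i in range(len(scores) - window + 1):
        mean = sum(scores[i:i + window]) / window
        if mean >= threshold:
            if not in_segment:
                segments += 1
                in_segment = True
        else:
            in_segment = False
    return segments


def soluble_subset(proteins, min_tm_segments=2):
    """Discard proteins with at least `min_tm_segments` predicted segments."""
    return [p for p in proteins if count_tm_segments(p) < min_tm_segments]


# Toy examples: two hydrophobic stretches separated by a charged loop,
# and a hydrophilic sequence.
membrane_like = "L" * 20 + "DK" * 10 + "L" * 20
hydrophilic = "MKKEELRDVA" * 5
print(count_tm_segments(membrane_like), count_tm_segments(hydrophilic))
print(len(soluble_subset([membrane_like, hydrophilic])))
```

Sequences shorter than the window yield no windows and are therefore always kept; a production filter would need an explicit policy for such cases.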
Following Kawashima et al. (3), the genome sequences were
used to compute the J2 index as:
$$J_2 = F_{YY} + F_{RR} - F_{YR} - F_{RY}$$
where $F_{YY}$ designates the relative frequency of pyrimidine dinucleotides (TT, TC, CT, CC),
$F_{RR}$ the relative frequency of purine
dinucleotides (AA, AG, GA, GG), and $F_{YR}$ and
$F_{RY}$ the relative frequencies of the corresponding mixed pyrimidine/purine and purine/pyrimidine combinations. Pure purine and pyrimidine dinucleotides were found to be
more frequent than their mixed counterparts in hyperthermophilic Pyrococci (3).
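As an illustration of this definition, the J2 index can be computed from overlapping dinucleotides as sketched below. Counting overlapping dinucleotides on a single strand and skipping dinucleotides that contain ambiguous bases are assumptions of this sketch, and the function names are hypothetical.

```python
# Sketch of the J2 index: F_YY + F_RR - F_YR - F_RY, where Y = pyrimidine
# (C, T) and R = purine (A, G). Assumptions: overlapping dinucleotides on
# one strand; any dinucleotide with a non-ACGT symbol is ignored.


def base_class(base):
    """Map a nucleotide to 'R' (purine), 'Y' (pyrimidine) or None."""
    if base in ("A", "G"):
        return "R"
    if base in ("C", "T"):
        return "Y"
    return None


def j2_index(dna):
    counts = {"YY": 0, "RR": 0, "YR": 0, "RY": 0}
    seq = dna.upper()
    for first, second in zip(seq, seq[1:]):
        a, b = base_class(first), base_class(second)
        if a is None or b is None:
            continue  # ambiguous base, e.g. N
        counts[a + b] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid dinucleotides")
    return (counts["YY"] + counts["RR"] - counts["YR"] - counts["RY"]) / total


# "AAGGCCTT" has 6 pure (AA, AG, GG, CC, CT, TT) and 1 mixed (GC)
# dinucleotide, so J2 = (6 - 1) / 7.
print(j2_index("AAGGCCTT"))
```

With this definition J2 ranges from -1 (only mixed dinucleotides) to +1 (only pure ones).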
The program "iep" from the European Molecular Biology Open
Software Suite (EMBOSS) was used to calculate the theoretical pI value
for every soluble protein and, from those, the average pI for every
genome. Principal component analysis was performed using the
statistical package R. 71 genomes from meso-, thermo-, and hyperthermophilic organisms were included. When multiple genome sequences from closely related strains or species were available, only
one of them was included in the analysis to avoid introducing a
potential statistical bias.
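The genome-average pI can be illustrated with the sketch below. It is not the EMBOSS "iep" program: it is a generic Henderson-Hasselbalch calculation solved by bisection, and the pK values are an assumed set (EMBOSS ships its own table, which may differ). Cysteine and tyrosine are treated as ionizable, and all names are hypothetical.

```python
# Sketch of a theoretical isoelectric point (pI) and its proteome average.
# Assumptions: the pK values below are an illustrative set, not necessarily
# those used by EMBOSS iep; the net charge follows Henderson-Hasselbalch.

PK_POS = {"K": 10.8, "R": 12.5, "H": 6.5}            # protonated = +1
PK_NEG = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}   # deprotonated = -1
PK_NTERM, PK_CTERM = 8.6, 3.6


def net_charge(seq, ph):
    """Net charge of a peptide at a given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - PK_NTERM))      # N-terminal amine
    charge -= 1.0 / (1.0 + 10 ** (PK_CTERM - ph))     # C-terminal carboxyl
    for aa in seq.upper():
        if aa in PK_POS:
            charge += 1.0 / (1.0 + 10 ** (ph - PK_POS[aa]))
        elif aa in PK_NEG:
            charge -= 1.0 / (1.0 + 10 ** (PK_NEG[aa] - ph))
    return charge


def isoelectric_point(seq, tol=1e-4):
    """pH at which the net charge is zero, found by bisection on [0, 14]."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid  # still positively charged: pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


def average_pi(proteins):
    """Unweighted mean of per-protein pI values."""
    return sum(isoelectric_point(p) for p in proteins) / len(proteins)


toy = ["KKKK", "DDDD"]  # one basic and one acidic toy peptide
print([round(isoelectric_point(p), 2) for p in toy], round(average_pi(toy), 2))
```

Bisection works here because the net charge decreases monotonically with pH; the average is taken over proteins, following the description above.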
 |
RESULTS |
Hyperthermophiles Do Exhibit a Specific Proteome Composition
Signature--
The computed amino acid compositions for 9 fully
sequenced hyperthermophilic and 53 mesophilic bacteria and archaea are
presented in Table I and Fig.
1, a-c. The case of the 9 moderately thermophilic genomes will be discussed later. As previously
observed by Cambillau and Claverie (Fig. 2, a-c in Ref. 2) but now
confirmed on a much larger data set, the proteins of hyperthermophiles
exhibit a strong bias for the use of charged residues at the expense of polar residues. This includes strongly reduced frequencies for the
thermolabile amino acid residues Gln and Asn, as also noted by Vieille
et al. (6). Among other trends, the aliphatic residue Val
appears to be preferred in hyperthermophiles, whereas the tiny
non-polar non-charged residue Ala is avoided.
Table I
Properties of the 71 genomes analyzed here, ranked by decreasing
CvP bias (see "Materials and Methods" for details).
Mesophiles (OGT, <55 °C) are highlighted in blue, thermophiles (OGT,
<80 °C) in orange, and hyperthermophiles in red.
Fig. 1.
a, plot of the percentages of charged
amino acids (Asp, Glu, Lys, Arg, blue), polar
non-charged amino acids (Asn, Gln, Ser, Thr, green), and of
the difference of the two (CvP-bias, red).
Blue, orange, and red vertical
lines identify the mesophiles, thermophiles, and
hyperthermophiles, respectively. b, plot of the percentages
of the various amino acids in mesophiles (blue),
thermophiles (orange), and hyperthermophiles
(red). c, plot of the various amino acid classes;
colors as in panel b.
Principal component analysis (PCA) offers a way to analyze the
data set more objectively and possibly to identify more intricate relationships between amino acid frequencies and the adaptation of
proteins to high temperature. Our PCA (Fig.
2) is consistent with earlier results
obtained with a much smaller data set by Kreil and Ouzounis (4); 85%
of the data variance is accounted for by the first two principal
components. The first component (72% of the variance) strongly
correlates with the G + C content of the genomes (Fig.
3a, r2 = 0.94). The second component (13% of the variance) turns out to be
the sole discriminating factor between meso- and hyperthermophilic lifestyles, correlating quite well with the optimal growth temperature (Fig. 3c, OGT, r2 = 0.72).
All other PCA components each account for no more than 3.7% of the
variance.
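The kind of decomposition reported here can be sketched in plain Python. The analysis in this article was run with the statistical package R; the code below is a stand-in that extracts the leading components of a covariance matrix by power iteration with deflation. Centering without scaling, the function name, and the toy matrix are assumptions made for illustration.

```python
# Sketch of PCA on a genomes x amino-acid-frequency matrix, using power
# iteration with deflation on the covariance matrix.
# Assumptions: columns are centered but not scaled; the toy data are
# invented and carry no biological meaning.
import random


def top_components(rows, n_components=2, iterations=1000, seed=0):
    """Return (axes, explained variance fractions) for the leading PCs."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_var = sum(cov[i][i] for i in range(d))
    rng = random.Random(seed)
    axes, explained = [], []
    for _ in range(n_components):
        v = [rng.random() + 0.1 for _ in range(d)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            if norm < 1e-15:
                break  # no variance left in the deflated matrix
            v = [x / norm for x in w]
        eigval = sum(v[i] * cov[i][j] * v[j]
                     for i in range(d) for j in range(d))
        axes.append(v)
        explained.append(eigval / total_var)
        for i in range(d):  # deflation: remove the component just found
            for j in range(d):
                cov[i][j] -= eigval * v[i] * v[j]
    return axes, explained


# Toy matrix: columns 1 and 2 vary together, column 3 is small noise.
toy = [[1, 2, 0.0], [2, 4, 0.1], [3, 6, -0.1], [4, 8, 0.05]]
axes, explained = top_components(toy)
print([round(f, 4) for f in explained])
```

The explained-variance fractions play the role of the 72% and 13% figures quoted above, and each axis gives the weight of every amino acid on that component.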

Fig. 2.
Biplot of a principal component analysis of
the amino acid composition of all studied organisms. Blue
dots identify mesophiles; orange dots identify
thermophiles; red dots identify hyperthermophiles.
Green vectors represent the position of the different amino
acids in the biplot (only vectors with significant contributions to PCA
components 1 and 2 are drawn). The black vector represents
$F_{CvP} = F_K + F_R + F_D + F_E - F_N - F_Q - F_S - F_T$.
Fig. 3.
Correlation between various parameters.
a, correlation between G + C content and PCA component 1:
r2 = 0.94. b, correlation between
CvP-bias and PCA component 2: r2 = 0.93. c, correlation between PCA component 2 and OGT:
r2 = 0.72. Red and blue
lines indicate the separation between the hyperthermophile
organism with the lowest OGT and the mesophile with the highest OGT.
d, correlation between CvP-bias and OGT:
r2 = 0.68. e, correlation between
J2 and OGT: r2 = 0.27. f, correlation between average proteomic pI and OGT:
r2 = 0.01.
Because the first PCA component tracks the G + C content, we had to find the
biochemical interpretation of the second component. First, we found that this component exhibits a strong correlation with the
previously defined CvP-bias (Fig. 3b,
r2 = 0.93). Second, once projected on the first
two dimensions of PCA space, the (20-dimensional) vector representing
the difference between the frequencies of charged and polar residues
(CHA-POL) appears almost parallel to the second PCA dimension (Fig. 2,
black vector). As shown in Fig. 3d, the CvP-bias
also correlates with the OGT almost as well as the second PCA component
(r2 = 0.68 and 0.72, respectively). Finally,
this parameter also successfully discriminates all hyperthermophilic
microorganisms from all mesophilic ones. In conclusion, this
analysis indicates that, among all possible combinations of G + C
content and amino acid frequencies, the CvP-bias is a near
optimal discriminating quantity to characterize the hyperthermophilic
proteins from evolutionarily diverse microorganisms.
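The r2 values quoted throughout can be reproduced from paired measurements with a short function. This sketch assumes that r2 denotes the squared Pearson correlation coefficient, which is not stated explicitly in the text; the function name and the toy numbers are hypothetical and are not values from Table I.

```python
# Sketch: squared Pearson correlation between two variables, for example
# CvP-bias and OGT across genomes.
# Assumption: "r2" in the text is the squared Pearson coefficient.


def r_squared(xs, ys):
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with >= 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


# Invented toy series: a perfectly linear pair and an uncorrelated pair.
print(r_squared([1, 2, 3], [2, 4, 6]))
print(r_squared([1, 2, 3, 4], [1, -1, -1, 1]))
```

Because the coefficient is squared, it measures the strength of a linear relationship but not its direction.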
In contrast to the CvP-bias, other previously proposed
quantities thought to characterize the hyperthermophilic life style do
not fare well when confronted with this expanded proteome data set. Fig. 3, e and f, shows that the
J2 index, as well as the average pI, now fail to
correctly discriminate between hyperthermophiles and mesophiles. In
fact, both parameters correlate only weakly, if at all, with OGT
(r2 = 0.27 for J2 and
r2 = 0.01 for average pI). However, all
hyperthermophiles have a J2 index greater than
0.06, so that a high J2 index can be considered a necessary, but not a sufficient, criterion for the identification of a hyperthermophilic genome.
Moderately Thermophilic Organisms--
Because only a single
moderately thermophilic organism had been sequenced
(Methanobacterium thermoautotrophicum) at the time of the
work of Cambillau and Claverie (2), little could be said
about the properties of moderately thermophilic organisms. Proteome data
are now available for nine of them. Two (Thermotoga maritima
and Thermoanaerobacter tengcongensis) have OGTs close to the
threshold of 80 °C we somewhat arbitrarily used to define hyperthermophile organisms. It turns out that they both exhibit PCA2,
CvP-bias, and J2 values that could
all allow their classification in continuity with previously defined hyperthermophiles.
This clear discrimination breaks down if we use an OGT of 75 °C as
the hyperthermophilicity threshold. Down to this temperature, the value
along the PCA2 coordinate remains a valid criterion, allowing two
Sulfolobus species (OGT of 80 and 78 °C) to be classified in continuity with the other hyperthermophiles (PCA2 <
−1.6 for OGT higher than 75 °C, and PCA2 >
−1.3 for OGT lower than 75 °C;
see Fig. 3c). However, the more straightforward and
biochemically meaningful CvP-bias criterion is
invalidated by three mesophiles with an OGT of 37 °C,
Fusobacterium nucleatum (CvP = 9.06),
Halobacterium sp. (CvP = 9.17), and
Clostridium perfringens (CvP = 9.39), which
exhibit larger values than the two Sulfolobus species
(CvP = 8.90 and 8.80).
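The failure of a single CvP cut-off at the 75 °C threshold can be checked directly from the values quoted in this paragraph. The helper below is a hypothetical illustration: it looks for a cut-off lying strictly between two groups and reports that none exists when they overlap.

```python
# Sketch: does a single cut-off separate two groups of CvP-bias values?
# The numbers are the ones quoted in the text; the helper is illustrative.


def separating_threshold(high_group, low_group):
    """Return a cut-off strictly between the groups, or None if they overlap."""
    top_of_low = max(low_group)
    bottom_of_high = min(high_group)
    if bottom_of_high > top_of_low:
        return (top_of_low + bottom_of_high) / 2.0
    return None


sulfolobus = [8.90, 8.80]        # the two Sulfolobus species
mesophiles = [9.06, 9.17, 9.39]  # F. nucleatum, Halobacterium sp., C. perfringens

# The three mesophiles exceed both Sulfolobus values, so no cut-off exists.
print(separating_threshold(sulfolobus, mesophiles))
```

The same check applied to the PCA2 coordinate would succeed, which is the contrast drawn above.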
 |
DISCUSSION |
Using a much larger data set including many new thermophile and
mesophile whole genome sequences, this follow-up study confirms the
previous suggestion that the global replacement of polar residues (Asn,
Gln, Ser, Thr) by charged residues (Asp, Glu, Lys, Arg) is the dominant
proteome characteristic of microorganisms adapted to hyperthermophilic
growth conditions (OGT >80 °C). This effect is observed for both
bacteria and archaebacteria and thus is not a simple consequence of
phylogenetic relationship.
Even though the strict correspondence between the highest
CvP-bias and the highest OGT breaks down below 80 °C,
this property globally remains a characteristic of all thermophilic
(OGT >55 °C) microorganisms, as shown in Fig. 1c where
the CvP-bias averaged over all mesothermophiles remains
markedly higher than for mesophile organisms. The influence of other
adaptive strategies (e.g. to high salinity or extreme pH
environments), together with the phylogenetic affinities of certain
mesophiles to thermophile organisms (7), probably contributes to weakening
the CvP-bias signal, allowing a few false positives to sneak in
(such as Clostridium perfringens or Streptomyces
coelicolor) (Table I).
This study confirms that a strong CvP-bias is
specifically associated with hyperthermophilic proteomes. This
observation is consistent with the thermodynamic advantage resulting
from the increased significance of Coulomb interactions with
increasing temperature (as the dielectric constant of water decreases).
The simultaneous increase of oppositely charged residues (mostly Arg, Lys, and Glu) further allows for more ion pairs to be formed at the
surface of hyperthermostable proteins (2). This rationale involving the
stability of proteins in a high temperature aqueous environment is also
supported by our observation that the proteins predicted to be
associated with the membrane (and thus designed for hydrophobic
environments) exhibit a much less significant CvP-bias (data
not shown). Finally, the fact that a large number of diverse genomes
confirms a statistical trend previously inferred from a much smaller
data set argues that increased ion-pair formation is both a significant
physico-chemical factor and the preferred evolutionary pathway toward
thermostable soluble proteins.
 |
FOOTNOTES |
* The costs of publication of this article were defrayed in part by the payment of page charges. The article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
To whom correspondence should be addressed. Tel.:
33-4-91-16-45-48; Fax: 33-4-91-16-45-49; E-mail:
Jean-Michel.Claverie@igs.cnrs-mrs.fr.
Published, JBC Papers in Press, February 24, 2003, DOI 10.1074/jbc.M301327200
 |
ABBREVIATIONS |
The abbreviations used are:
CvP, charged versus polar;
OGT, optimal growth
temperature;
PCA, principal component analysis;
G + C, guanine + cytosine.
 |
REFERENCES |
1. Kumar, S., and Nussinov, R. (2001) Cell. Mol. Life Sci. 58, 1216-1233
2. Cambillau, C., and Claverie, J. M. (2000) J. Biol. Chem. 275, 32383-32386
3. Kawashima, T., Amano, N., Koike, H., Makino, S., Higuchi, S., Kawashima-Ohya, Y., Watanabe, K., Yamazaki, M., Kanehori, K., Kawamoto, T., Nunoshiba, T., Yamamoto, Y., Aramaki, H., Makino, K., and Suzuki, M. (2000) Proc. Natl. Acad. Sci. 97, 14257-14262
4. Kreil, D. P., and Ouzounis, C. A. (2001) Nucleic Acids Res. 29, 1608-1615
5. Kyte, J., and Doolittle, R. F. (1982) J. Mol. Biol. 157, 105-132
6. Vieille, C., Epting, K. L., Kelly, R. M., and Zeikus, J. G. (2001) Eur. J. Biochem. 268, 6291-6301
7. Brochier, C., and Philippe, H. (2002) Nature 417, 244
Copyright © 2003 by The American Society for Biochemistry and Molecular Biology, Inc.