Correlation between occupancy and B factor of water molecules in protein crystal structures

O. Carugo1

European Molecular Biology Laboratory, Meyerhofstrasse 1, 69012 Heidelberg, Germany and Department of General Chemistry, Pavia University, via Taramelli 12, 27100 Pavia, Italy. E-mail: carugo@embl-heidelberg.de


    Abstract
An empirical relationship between occupancy and the atomic displacement parameter of water molecules in protein crystal structures has been found by comparing a set of well refined sperm whale myoglobin crystal structures. The relationship agrees with a series of independent structural features whose impact on water occupancy can easily be predicted as well as with other known data and is independent of the protein fold. The estimation of the water occupancy in protein crystal structures may help in understanding the physico-chemical properties of the protein–solvent interface and can allow the monitoring of the accuracy of the protein crystal structure refinement.

Keywords: atomic displacement parameter/occupancy/protein hydration/protein structure/water structure


    Introduction
The importance of the protein–water interface can hardly be overemphasized, since it mediates all physiological protein behaviour. Most of what we know about protein hydration results from crystallographic studies. Despite recent advances in the treatment and refinement of water positional parameters at the protein–solvent interface (Schoenborn et al., 1995; Podjarny et al., 1997), most structural information comes from traditional refinements in which each hydration site is associated with a water oxygen atom, either manually or by more objective, automatic criteria (Lamzin and Wilson, 1993). Crystallographic results have been enriched by several theoretical studies (Parak et al., 1992; Lounnas and Pettit, 1994a,b; Ehrlich et al., 1998; Makarov et al., 1998), resulting in a general picture in which the protein–solvent interface appears organized in more than one layer and in a denser and more hydrogen-bonded network than the bulk water, despite the large mobility of each single water molecule (Thanki et al., 1988; Karplus and Faerman, 1994; Phillips and Pettit, 1995; Bryant, 1996; Gerstein and Chothia, 1996; Pettit et al., 1998; Svergun et al., 1998). It is exactly this mobility that makes it difficult to apply a point model in crystallographic refinements. Water molecules oscillating with large amplitude around a conformational energy minimum are hard to detect, as are molecules with small crystallographic occupancies caused by conformational disorder and exchange with the bulk solvent. The positional spread of each atom can be monitored by the atomic displacement parameters (a.d.p.s, often referred to as B factors), even though their aetiology may often be unclear (authentic oscillations around a mean position or various types of disorder). The occupancies are, in contrast, more difficult to determine and their influence on the a.d.p. values has not been quantified.
The occupancy monitors the presence of an atom at its mean position. It ranges from 0.0 to 1.0 and it can be lowered if the atom spends only a fraction of the time in its position during the crystallographic data collection or if it does not occupy its position in all the crystal asymmetric units of the sample studied. Occupancies can be determined only if a constraint is used. For example, if some positions are considered alternatively occupied, it is possible to constrain their total occupancy to 1.0 and refine the individual occupancy of each position. This is generally impossible in protein crystallography, since it is unclear how to define an ensemble of hydration sites which can be considered alternatively occupied. Moreover, it is unclear how to define the total occupancy for a cluster of water positions, since water molecules can fluctuate between energetically stable hydration sites and the bulk solvent, which is crystallographically silent. Consequently, the total occupancy of an ensemble of alternative water positions is not necessarily 1.0. Therefore, single water occupancies are generally not refined and a low occupancy value, which implies a low electron density, may result in an unnecessarily high a.d.p. value. As stressed a few years ago (Frey, 1993), a better picture of the protein–solvent interface would certainly be obtained by quantitatively estimating water molecule occupancies. Since a.d.p. and occupancy values of water molecules are necessarily related, a quantitative relationship between them is presented in this paper.


    Materials and methods
Twenty-six sperm whale myoglobin crystal structures determined at room temperature (room temperature was assumed if not otherwise explicitly specified), in the same space group (P2₁), with identical sequence (disregarding the first and last residues and without mutations, deletions and gaps in the electron density maps) and resolution ≤2.0 Å were found in the Protein Data Bank (Bernstein et al., 1977) (identification codes: 104m, 1hjt, 1mbc, 1mbd, 1mbi, 1mbo, 1spe, 1swm, 1vxa, 1vxb, 1vxc, 1vxd, 1vxe, 1vxf, 1vxg, 1vxh, 1yog, 1yoh, 1yoi, 2mya, 2myb, 2myc, 2myd, 2mye, 4mbn, 5mbn; details on the quality of each crystal structure, i.e. resolution, software used, data completeness and final refinement statistics, are available from the author). These structures have been independently determined by several laboratories with the use of various software packages and it is assumed that they are representative of the entire population of protein structures. Only structures determined in the P2₁ space group were considered, to eliminate crystal lattice bias. The paucity of structural data in other space groups (see Carugo and Argos, 1998) does not allow an analogous study in the case of different crystal packing interactions. Structures deposited with unrefined a.d.p.s were disregarded. The ratio of the number of protein atoms to the number of water molecules ranges between 3.2 and 11.5, reflecting the relatively high variability of this parameter (Carugo and Bordo, 1999) and suggesting that this structure ensemble is actually random. Therefore, although different treatments of the water molecules clearly depend on the crystallographer's preferences and might result in different a.d.p. values for the entire crystal structure, the sample of 26 crystal structures examined here is likely to be representative of the entire population of protein structures. Water molecules coordinated to the metal centre of the haem were disregarded.
The total number of water molecules within all 26 structures was 4411. Since the a.d.p. values in protein crystal structures vary not only because of genuine physical differences but also because of refinement strategies (Frauenfelder and Petsko, 1980; Karplus and Schulz, 1985; Ringe and Petsko, 1986; Tronrud, 1996), they were normalized as

$$\text{normalized a.d.p.} = \frac{\text{a.d.p.} - \langle\text{a.d.p.}\rangle}{s(\text{a.d.p.})} \qquad (1)$$

where <a.d.p.> and s(a.d.p.) are the mean value and the standard deviation of the distribution of the a.d.p. values within each single protein structure (Parthasarathy and Murthy, 1997; Carugo and Argos, 1998).
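The normalization described above can be sketched in a few lines of Python. This is only an illustration: the function name `normalize_adps` and the example values are hypothetical, and the choice of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

def normalize_adps(adps):
    """Return the normalized a.d.p. of every atom in one structure.

    Each value is expressed as the number of standard deviations
    separating it from the mean a.d.p. of that structure.
    """
    mu = mean(adps)
    sigma = pstdev(adps)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all a.d.p. values are identical; cannot normalize")
    return [(b - mu) / sigma for b in adps]

# Illustrative a.d.p. values in Å^2 (not taken from any deposited structure).
example = [10.0, 15.0, 20.0, 25.0, 30.0]
normalized = normalize_adps(example)
print(normalized)
```

After normalization the values of each structure have zero mean and unit standard deviation, which is what makes a.d.p.s from differently refined structures comparable.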

All equivalent Cα atoms were superposed on those of the structure 104m by the method of Kabsch and McLachan (Kabsch, 1978; McLachan, 1979) and the rotation and translation matrices were applied to the coordinates of the water molecules. The root-mean-square distances between equivalent Cα atoms varied from 0.148 to 0.507 Å. A cluster of water molecules was then built around each water molecule by grouping water molecules no more distant than a given threshold distance of identity (t.d.i.) from the central water molecule. The appropriate symmetry operations were applied to each water molecule in each structure to detect equivalences when water molecules were positioned in distant parts of the asymmetric unit. Threshold distances of identity were varied from 0.25 to 3.25 Å in steps of 0.25 Å. Care was taken to avoid redundancies, so that a cluster found several times (i.e. by considering different central water molecules with identical surrounding water molecules) was included only once in the analysis. The mean normalized a.d.p. was computed for each cluster, as was the mean number of hydrogen bonds formed by its water molecules. The hydrogen bonds, computed with HBPLUS (McDonald and Thornton, 1994), were classified into three types: those with other water molecules, those with main-chain atoms and those with side-chain atoms. The occupancy of each cluster was computed as the ratio between the number of its water molecules and the number of myoglobin structures. Hence a cluster containing 26 (or 13) water molecules has occupancy 1.0 (or 0.5).
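The clustering and cluster-occupancy steps can be illustrated with the following minimal Python sketch. It is not the program used in this work: the names `build_clusters` and `cluster_occupancy` and the coordinates are hypothetical, the water positions are assumed to be already superposed, and the symmetry operations mentioned above are omitted for brevity.

```python
from math import dist

def build_clusters(waters, tdi):
    """Group superposed water positions into clusters.

    waters: list of (x, y, z) coordinates pooled from all structures (Å).
    tdi:    threshold distance of identity (Å).
    Returns a set of frozensets of indices into `waters`.
    """
    clusters = set()
    for centre in waters:
        # All waters within tdi of the central water, the centre included.
        members = frozenset(
            j for j, pos in enumerate(waters) if dist(centre, pos) <= tdi
        )
        clusters.add(members)  # identical clusters collapse into one entry
    return clusters

def cluster_occupancy(cluster, n_structures):
    """Number of waters in the cluster divided by the number of structures."""
    return len(cluster) / n_structures

# Hypothetical pooled waters from four superposed structures (Å).
waters = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (10.0, 0.0, 0.0)]
clusters = build_clusters(waters, tdi=1.0)
occupancies = sorted(cluster_occupancy(c, n_structures=4) for c in clusters)
print(occupancies)
```

Storing each cluster as a frozenset inside a set is one simple way to guarantee that a cluster reached from several central waters is counted only once.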


    Results and discussion
The relationship between normalized a.d.p. and occupancy of a cluster can be evaluated by a least-squares fit of the function

$$\text{occupancy} = a + b \times \text{normalized a.d.p.} \qquad (2)$$

The occupancy decreases as the normalized a.d.p. increases. The correlation coefficient ranged between –0.876 and –0.968, depending on the t.d.i. value used to define the water molecule clusters. The coefficients a and b in Equation 2 and the normalized a.d.p. values associated with occupancy 0.0 and 1.0 were dependent on the t.d.i. values. Nevertheless, for t.d.i. values ≥2.75 Å, the regression coefficients were statistically independent of the t.d.i., as shown in Figure 1, where the normalized a.d.p.s associated with occupancy 0.0 and 1.0 are plotted against the t.d.i.s. Therefore, the t.d.i. value of 2.75 Å was retained in the subsequent work. It is noteworthy that this value is similar to the oxygen–oxygen distance in bulk water or ice (Mak and Zhou, 1997). By using the t.d.i. of 2.75 Å (Figure 2), it was found that

A hydration site is therefore fully occupied (or empty) if its normalized a.d.p. is –1.2 (or 7.2). Given, for example, a protein crystal structure with mean a.d.p. of 20 Å² and a standard deviation of 5 Å², a fully occupied water location would show an a.d.p. of 14 Å² while the a.d.p. of 56 Å² would be the upper limit associated with a positive occupancy. Moreover, by standard statistics (Dowdy and Wearden, 1991), it is possible to predict that the accuracy of the estimated occupancies ranges between 0.2, for normalized a.d.p.s around –1 or +7, and 0.03, for normalized a.d.p.s around +3.
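The worked example above can be reproduced with a short Python sketch. The intercept and slope used here are not the fitted coefficients of this work; they are simply those of the straight line through the two quoted end points (occupancy 1.0 at a normalized a.d.p. of –1.2 and occupancy 0.0 at 7.2), which assumes a linear relationship. The names `line_through` and `estimate_occupancy` are hypothetical.

```python
def line_through(p0, p1):
    """Intercept a and slope b of the straight line through two (x, y) points."""
    (x0, y0), (x1, y1) = p0, p1
    b = (y1 - y0) / (x1 - x0)
    a = y0 - b * x0
    return a, b

# End points quoted in the text: (normalized a.d.p., occupancy).
A, B = line_through((-1.2, 1.0), (7.2, 0.0))

def estimate_occupancy(adp, mean_adp, sd_adp):
    """Estimate a water occupancy from its a.d.p. and the structure statistics."""
    z = (adp - mean_adp) / sd_adp  # normalized a.d.p.
    return A + B * z

# Structure with mean a.d.p. 20 Å^2 and standard deviation 5 Å^2.
print(round(A, 3), round(B, 3))
print(estimate_occupancy(14.0, 20.0, 5.0))
print(estimate_occupancy(56.0, 20.0, 5.0))
```

Because the end points are rounded values, the coefficients obtained this way (about 0.857 and –0.119) should be taken as approximate.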



Fig. 1. Normalized a.d.p. values corresponding to occupancy 0 (circles) and 1 (squares) at various threshold distances of identity (Å).

 


Fig. 2. Dependence of the occupancy on the normalized a.d.p. at a threshold distance of identity of 2.75 Å.

 
The mean number of hydrogen bonds increases with the cluster occupancy nearly linearly from 1.5 (occupancy lower than 0.1) to 3.8 (occupancy higher than 0.9). The latter value nearly corresponds to the saturation of the hydrogen bond capability of a water molecule. In detail, the mean number of hydrogen bonds with other water molecules, with main-chain atoms and with side-chain atoms increases from 1.1, 0.2 and 0.2 for cluster occupancy lower than 0.1 to 2.6, 0.5 and 0.7 for cluster occupancy higher than 0.9, respectively. This agrees with the observation that the occupancy of a hydration site is strictly proportional to the total number of possible hydrogen bonds (Lounnas and Pettit, 1994a,b) and implies that highly occupied hydration sites are more probable within the first hydration layer directly facing the protein surface. This hypothesis is supported by the observation that clusters of high occupancy tend to be closer to other clusters of high occupancy and vice versa. Within 3 (or 5) Å, clusters of occupancy higher than 0.75 are surrounded by clusters of occupancy 0.52 (or 0.47) while clusters of occupancy lower than 0.25 are surrounded by clusters of occupancy 0.08 (or 0.11). This indicates that different occupancies are found in different regions of the structure. Moreover, clusters of occupancy higher than 0.75 are 3.15 (0.01) Å distant from their closest protein atom while clusters of occupancy lower than 0.25 are 4.39 (0.11) Å distant from the protein surface (Figure 3).



Fig. 3. Stereoview of the myoglobin trace surrounded by the water molecule clusters of occupancy higher than 0.75 (white spheres) and lower than 0.25 (black spheres). Each cluster is indicated by a sphere located at the mean position of the water molecules constituting it. Figure prepared with MOLSCRIPT (Kraulis, 1991).

 
Equation 3 was used to estimate the occupancies of water molecules in a set of 257 protein structures selected with PDBSELECT (Hobohm and Sander, 1994) (resolution 2 Å or better and maximum sequence identity of 25%) and in the set of 26 myoglobins. Table I shows that very few water molecules have estimated occupancies statistically lower than 0.0 or higher than 1.0. Most of the water molecules have occupancies higher than 0.5. The two distributions are closely similar. This also supports the assumption that the learning set of 26 myoglobins is representative of the entire protein crystal structure population. Moreover, the similarity between the results in the set of myoglobins and in the larger structural set suggests that Equation 3 is independent of the fold type and can therefore be applied to any type of structure. This agrees with the observation (Pettit et al., 1998) that the positional features of the water molecules mostly depend on the nearest chemical surroundings and not on other properties such as protein sequence, secondary structure or side-chain conformation. The observation that most of the crystallographically positioned water molecules have occupancies lower than 1.0 is not surprising, since a high mobility has been observed even for water molecules buried in protein internal cavities (Ernst et al., 1996; Feher et al., 1996).


Table I. Frequency (%) of the occupancy values predicted with Equations 3 and 4 in the set of 26 myoglobins and in a larger set of 257 unique protein crystal structures
 
It must be remembered that the correlation between occupancy and a.d.p. value should be applied within each protein monomer in the case of structures containing several monomers in the asymmetric unit. For example, in a dimeric protein it may happen that one monomer has an overall mean a.d.p. of 15 Å² (in B units) and the other of 30 Å² due to a different crystal packing. A well conserved water molecule located in both monomers is expected to have the same occupancy despite the fact that it will probably be associated with a higher a.d.p. value in the second monomer. The normalization of the a.d.p. values (Equation 1) within each monomer should give the water molecule the same normalized a.d.p. value in both monomers and thus should allow the computing of the same occupancy.
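A per-monomer normalization of this kind might look as follows in Python. This is a sketch under stated assumptions: every atom, water molecules included, is assumed to carry a chain label assigning it to one monomer, the population standard deviation is used, and the name `normalize_per_chain` and the a.d.p. values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalize_per_chain(atoms):
    """Normalize a.d.p. values separately within each monomer.

    atoms: list of (chain_id, adp) pairs.
    Returns the normalized a.d.p.s in the same order as the input.
    """
    by_chain = defaultdict(list)
    for chain, adp in atoms:
        by_chain[chain].append(adp)
    # Mean and standard deviation of the a.d.p. distribution of each chain.
    stats = {c: (mean(v), pstdev(v)) for c, v in by_chain.items()}
    return [(adp - stats[c][0]) / stats[c][1] for c, adp in atoms]

# Hypothetical dimer: monomer B has a.d.p.s twice as large as monomer A
# (mean 15 Å^2 in A and 30 Å^2 in B).
atoms = [("A", 10.0), ("A", 15.0), ("A", 20.0),
         ("B", 20.0), ("B", 30.0), ("B", 40.0)]
z = normalize_per_chain(atoms)
print(z)
```

In this toy case equivalent atoms of the two monomers receive the same normalized a.d.p., so the same occupancy would be estimated for a conserved water in either monomer.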

In conclusion, an empirical relationship between observed atomic displacement parameters and occupancies of water molecules has been derived by analysing known myoglobin crystal structures. The predicted occupancies agree with other structural features whose trends are clearly predictable. High occupancy hydration sites are more probable within the first protein hydration layer, cluster together and are strictly associated with the formation of many hydrogen bonds. It has also been shown that the predicted occupancies do not depend on the fold type, in agreement with the currently accepted model that protein hydration mostly depends on the local chemical features of the protein surface (Pettit et al., 1998).

These results can have a large range of applications. They allow one to monitor the progress of protein crystal structure refinement and to avoid possible overfitting due to the inclusion in the model of water molecules with too low an occupancy. Moreover, water molecules with too high an occupancy (much larger than 1.0) can be detected and should be considered as possible atoms other than water oxygens, for example metal cations or heavy anions. These results should also allow more detailed physico-chemical characterizations of the protein–bulk water interface, which plays a relevant role in most protein functions, such as recognition of cofactors, substrates and other proteins, and could be included in docking predictions (Sternberg et al., 1998). A more precise evaluation of the protein–bulk water interface is also desirable when physical and mechanical properties of the protein are of interest, since the flexibility of a protein fragment is always directly connected to the flexibility of the interface.
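As an illustration of the refinement check suggested above, the following Python sketch flags water molecules whose estimated occupancy lies well outside the range 0.0 to 1.0. It assumes a linear relationship through the two end points quoted earlier (normalized a.d.p. of –1.2 for occupancy 1.0 and 7.2 for occupancy 0.0). The margin of 0.2, the function names, the water labels and the verdict strings are all hypothetical choices, not part of the published method.

```python
Z_FULL, Z_EMPTY = -1.2, 7.2  # normalized a.d.p. at occupancy 1.0 and 0.0

def estimated_occupancy(adp, mean_adp, sd_adp):
    """Linear estimate of the occupancy from the normalized a.d.p."""
    z = (adp - mean_adp) / sd_adp
    return (Z_EMPTY - z) / (Z_EMPTY - Z_FULL)

def flag_waters(waters, mean_adp, sd_adp, margin=0.2):
    """Classify each (label, adp) water by its estimated occupancy.

    margin is an assumed tolerance, chosen to match the accuracy of about
    0.2 quoted for normalized a.d.p.s near the two extremes.
    """
    flags = {}
    for label, adp in waters:
        occ = estimated_occupancy(adp, mean_adp, sd_adp)
        if occ > 1.0 + margin:
            flags[label] = "check: possibly not a water oxygen"
        elif occ < 0.0 - margin:
            flags[label] = "check: possibly overfitted"
        else:
            flags[label] = "plausible"
    return flags

# Hypothetical waters in a structure with mean a.d.p. 20 Å^2 and s.d. 5 Å^2.
waters = [("HOH 201", 5.0), ("HOH 202", 25.0), ("HOH 203", 65.0)]
flags = flag_waters(waters, mean_adp=20.0, sd_adp=5.0)
print(flags)
```

A flag is only a prompt to inspect the electron density at that site; it is not evidence in itself that the atom has been misassigned.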


    Notes
 
1 Correspondence should be sent to the Pavia address


    References
Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) J. Mol. Biol., 112, 535–542.

Bryant,R.G. (1996) Annu. Rev. Biophys. Biol. Struct., 25, 29–53.

Carugo,O. and Argos,P. (1998) Proteins: Struct. Funct. Genet., 31, 201–213.

Carugo,O. and Bordo,D. (1999) Acta Crystallogr., D55, 479–483.

Dowdy,S. and Wearden,S. (1991) Statistics for Research. Wiley, New York, pp. 229–281.

Ehrlich,L., Reczko,M., Bohr,H. and Wade,R.C. (1998) Protein Engng, 11, 11–19.

Ernst,J.A., Clubb,R.T., Zhou,H.-X., Gronenborn,A.M. and Clore,G.M. (1996) Science, 267, 1813–1817.

Feher,V.A., Baldwin,E.P. and Dhalquist,W. (1996) Nature Struct. Biol., 3, 516–521.

Frauenfelder,H. and Petsko,G.A. (1980) Biophys. J., 32, 465–478.

Frey,M. (1993) Top. Mol. Struct. Biol., 17, 100–146.

Gerstein,M. and Chothia,C. (1996) Proc. Natl Acad. Sci. USA, 93, 10167–10172.

Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522–530.

Kabsch,W.A. (1978) Acta Crystallogr., A34, 828–828.

Karplus,P.A. and Faerman,C. (1994) Curr. Opin. Struct. Biol., 4, 770–776.

Karplus,P.A. and Schulz,G.E. (1985) Naturwissenschaften, 72, 212–213.

Kraulis,P.J. (1991) J. Appl. Crystallogr., 24, 946–950.

Lamzin,V.S. and Wilson,K.S. (1993) Acta Crystallogr., D49, 129–147.

Lounnas,V. and Pettit,B.M. (1994a) Proteins: Struct. Funct. Genet., 18, 133–147.

Lounnas,V. and Pettit,B.M. (1994b) Proteins: Struct. Funct. Genet., 18, 148–160.

Mak,T.C.W. and Zhou,G.-D. (1997) Crystallography in Modern Chemistry. Wiley, New York.

Makarov,V.A., Andrews,B.K. and Pettit,B.M. (1998) Biopolymers, 45, 469–478.

McDonald,I.K. and Thornton,J.M. (1994) J. Mol. Biol., 238, 777–793.

McLachan,A.D. (1979) J. Mol. Biol., 128, 48–67.

Parak,F., Hartmann,H., Schmidt,M., Corongiu,G. and Clementi,E. (1992) Eur. Biophys. J., 21, 313–320.

Parthasarathy,S. and Murthy,M.R.N. (1997) Protein Sci., 6, 2561–2567.

Pettit,B.M., Makarov,V.A. and Andrews,B.K. (1998) Curr. Opin. Struct. Biol., 8, 218–221.

Phillips,G.N. and Pettit,B.M. (1995) Protein Sci., 4, 149–158.

Podjarny,A.D., Howard,E.I., Urzhumetsev,A. and Grigera,J.R. (1997) Proteins: Struct. Funct. Genet., 28, 303–312.

Ringe,D. and Petsko,G.A. (1986) Methods Enzymol., 131, 389–433.

Schoenborn,B.P., Garcia,A. and Knott,R. (1995) Prog. Biophys. Mol. Biol., 64, 105–119.

Sternberg,M.J.E., Gabb,H.A. and Jackson,R.M. (1998) Curr. Opin. Struct. Biol., 8, 250–256.

Svergun,D.I., Richard,S., Koch,M.H.J., Sayers,Z., Kuprin,S. and Zaccai,G. (1998) Proc. Natl Acad. Sci. USA, 95, 2267–2272.

Thanki,N., Thornton,J.M. and Goodfellow,J.M. (1988) J. Mol. Biol., 202, 637–657.

Tronrud,D.E. (1996) J. Appl. Crystallogr., 29, 100–104.

Received May 28, 1999; revised September 23, 1999; accepted September 23, 1999.