A cross-section of the fitness landscape of dihydrofolate reductase
Takuyo Aita1,2,
Masahiro Iwakura3 and
Yuzuru Husimi2,4
1 Tsukuba Research Institute, Novartis Pharma KK, Ohkubo 8, Urawa Tsukuba, 300-2611,
2 Department of Functional Materials Science, Saitama University, 338-8570 and
3 National Institute of Bioscience and Human Technology, 11 Higahi, Tsukuba, 305-8566, Ibaraki, Japan
 |
Abstract
|
---|
In vitro molecular evolution is regarded as a hill-climbing on a fitness landscape in sequence space, where the fitness is a quantitative measure of a certain physicochemical property of a biopolymer. We analyzed a cross-section of the enzymatic activity landscape of dihydrofolate reductase (DHFR) by using a method of analysis of a fitness landscape. We limited the sequence space of interest to the five-dimensional sequence space, where the coordinate corresponds to the 1st, 16th, 20th, 42nd and 92nd site in the DHFR sequence. Thirty six mutants mapped into the limited sequence space were taken in the analysis. As a result, the cross-section is of the rough Mt Fuji type based on the mutational additivity. The ratio of the mean slope to the roughness is 2.8 and the Z-score of the original ratio against a distribution of random references is 7.0, which indicates a large statistical significance. The existence of such a cross-section was discussed in terms of the occurrence probability of sets of five sites distantly separated from each other on the DHFR 3D structure. Our results support the effectiveness of the evolution strategy which exploits the accumulation of advantageous single point mutations in such a cross-section.
Keywords: dihydrofolate reductase/fitness landscape/mutational additivity/protein sequence space/sequence-function relationship
 |
Introduction
|
---|
Exploring fitness landscapes of biopolymers and revealing statistically universal properties underlying in the landscapes are very important for making efficient evolution strategies (Voigt et al., 2000
). In evolutionary molecular engineering, the fitness is a quantitative measure of a certain physicochemical property of a biopolymer (e.g. enzymatic activity, affinity to a ligand or structural stability). We previously proposed an analysis method of a local fitness landscape for a current biopolymer (Aita et al., 2000
), that is, an analysis method of the sequencefitness relationship. The method is based on a model of a rough Mt Fuji-type landscape as the additive fitness model. We applied the model to analyze a cross-section of a local landscape for the thermostability of prolyl endopeptidase and that for the enzymatic activity of thermolysine. These cross-sections were proved to be of the rough Mt Fuji type with
values of >1.0, where
is defined as the ratio of the mean slope to the roughness on the fitness surface.
In general, it is impossible for us to explore the immense sequence space of all possible sequences of biopolymers. Experimental limitations with respect to the mutagenized sites in a wild-type sequence and the number of residue types available for each site force us to reduce the size of sequence space for exploration. Consequently, we can depict only a cross-section of a local fitness landscape, on which our interest is focused.
Iwakura et al. explored the enzymatic activity landscape of dihydrofolate reductase (DHFR, EC 1.5.1.3) with a chain length of 159 (M.Iwakura, K.Maki, J.Suzuki, C.Yamane, in preparation). Initially, they constructed a fully active Cys-free DHFR containing a double point mutation of C85A and C152S (this double point mutant was abbreviated as AS-DHFR). The enzymatic activity of AS-DHFR is the same as that of the native wild-type Escherichia coli DHFR. The AS-DHFR was the initial sequence in the following artificial evolution experiment. The saturation mutagenesis was performed at each of the five sites (1st, 16th, 20th, 42nd and 92nd sites) where methionine residues are located in the AS-DHFR sequence, with the aim of producing methionine-free mutant, which is prevented from oxidation. The enzymatic activity was evaluated spectrophotometrically at 15°C by monitoring the disappearance of NADPH and DHF at 340 nm. Several (nearly) advantageous or neutral single point mutations enhancing or keeping the activity were selected as candidates for the following optimization. Subsequently, several multiple mutants incorporating the selected mutations or wild-type residue at the five sites were constructed and evaluated for the relative activity. It is noted that the artificial evolution of the enzyme was carried out under the selection pressure in the scale of the natural logarithm of the relative enzymatic activity. We analyzed the 36 clones of DHFR mutants by using our model, regarding AS-DHFR as the wild-type in this paper.
 |
Materials and methods
|
---|
Basis of model and analysis of the fitness landscape
Picking out all of the mutagenized sites in a particular wild-type sequence, rearranging them in arbitrary order and numbering each of them from j = 1 through j = v, we define the edited sequences with the length v (where the non-mutagenized sites are not considered). Let
j be the total number of residue types available for the jth site. Then, we consider the {
1,
2, ...,
v}-valued v-dimensional sequence space of all possible sequences of
j = 1
j.
The fitness of a sequence P,
P, is composed of an additive term FP and a non-additive term
P as a small random component:
| (1) |
 | (2) |
| (3) |
where O represents the optimum sequence that corresponds to the single peak on the additive fitness landscape and
Pj represents the particular residue at the jth site in the given sequence P. Therefore,
Oj denotes the correct residue at the jth site in the sequence O. wj(
) is the site-fitness as a fitness contribution from a certain residue
at the jth site. The mean site-fitness over incorrect residues for the jth site is denoted by
where 


Oj means that the sum is taken over all the incorrect residue (

Oj) for the jth site. We define a mean slope as the mean change in fitness when an incorrect residue (

Oj is replaced by a correct residue (
Oj), and define an altitude as the difference between the highest FO and the mean fitness over all possible sequences.
Let {P1, P2, ..., PN} be a set of evaluated sequences in the {
1,
2, ...,
v}-valued v-dimensional sequence space and let {
P1,
P2, ...,
PN} be a set of fitness data for the sequences. N is the number of fitness data. We consider a procedure for fitting our model to the explored landscape [where we have 1 +
(
j - 1) unknowns: FO and wj(
)s]. The fitting procedure is performed by the least-squares method to identify a sequence of O, FO and wj (
)s which minimize the value of
(
Pi - FPi)2. The simultaneous equations for these unknowns are
 | (4) |
 | (5) |
where
(
,
') is Kronecker's delta.
After obtaining the site-fitness wj (
) for all available residues, the mean-slope, altitude and roughness of the landscape are determined, respectively, as follows:
 | (6) |
 | (7) |
 | (8) |
In addition, the
value as the ratio of the mean slope to the roughness and
value as the ratio of the altitude to the roughness are determined:
 | (9) |
 | (10) |
As the
or
value becomes larger, the surface becomes smoother in comparison with the mean slope or altitude. When
or
is small, O does not necessarily correspond to the global optimum. Therefore, we call O the pseudo-optimum.
A permutation testing is necessary to evaluate the statistical significance for the original
value, by comparison with the reference
values for randomly shuffled landscapes created by randomly shuffling the fitness values in the sequence space. In almost all cases, randomly shuffled landscapes are expected to show no correlation on their surfaces. Generating a number of them by the Monte Carlo method and analyzing them as we mentioned above, we can obtain the frequency distribution of the
values for the randomly shuffled landscapes. Then we defined the Z-score for the original
value as
 | (11) |
where
ORG denotes the original
value and E [
RSL] and SD [
RSL] denote the mean and standard deviation for the frequency distribution of the
values for the randomly shuffled landscapes, respectively.
 |
Results and discussion
|
---|
Based on the AS-DHFR sequence (C85A, C152S), Iwakura et al. constructed 35 mutants incorporating amino acid substitutions at the five sites (1st, 16th, 20th, 42nd and 92nd sites) where methionine residues are located in the AS-DHFR sequence. The enzymatic activity was evaluated spectrophotometrically at 15°C by monitoring the disappearance of NADPH and DHF at 340 nm. For normalization of activity, the measured activity was divided by the absorbance at 660 nm, which is proportional to the total amount of the enzyme. The adopted measure of enzymatic activity is approximately proportional to the specific activity of the enzyme. These 36 clones and their relative activity are listed in Table I
.
Since the enzymatic reactions of the DHFR mutants obey a MichaelisMenten mechanism with Km values in the range 110 µM and the concentration of the substrate DHF was 50 µM, the specific activity is approximately proportional to the turnover number kcat. Therefore, we define the fitness of each clone as the natural logarithm of the relative activity, to treat the fitness on the free energy (over temperature) scale (Matsuura et al., 1998
). The coordinate of the sequence space that we are considering consists of {M,A} at the 1st site (j = 1), {M,A,F,N,S,R} at the 16th site (j = 2), {M,I,L,V} at the 20th site (j = 3), {M,F,V,Y} at the 42nd site (j = 4) and {M,F,I} at the 92nd site (j = 5). This is the {2,6,4,4,3}-valued five-dimensional sequence space. We note that the 85th and 152nd sites are fixed with alanine residue and serine residue, respectively. Figure 1
illustrates the fitness landscape in this space. The pseudo-optimum O on the landscape was expected as a sequence AFLYM, while the most inferior sequence was expected as MRVFI. Each axis in the sequence space in this figure is expressed by an arrow and amino acids on each axis are arranged in increasing order of site-fitness values that we obtained through the analysis. We can see the gradient of the fitness from the lower side through the upper side. The gradient shows the feature of the (rough) Mt Fuji-type landscape. Figure 2a
is a graph showing the interconnection of the 36 clones with the corresponding additive fitnesses (
Pi) and is an another expression of the fitness landscape. We note that the presentation for experimental fitnesses (FPi) on the same graph is omitted because it is very similar to Figure 2a
. The clones which are interconnected to each other by the unity in the Hamming distance (d = 1) are grouped into the three clusters. The three clusters are interconnected by a Hamming distance of 2 (d = 2). A good point of this expression is that we can follow the fitness correlation of neighbor sequences, such as a conspicuous tendency that clones with large fitnesses are relatively close to O. Figure 2b
is an expression of the landscape of the residuals
Pi (=
Pi - FPi). We can see that the residuals are not mutually correlated between neighbor sequences. The correlation length of the residual landscape was calculated to be less than unity in the Hamming distance (Aita et al., 2000
).
The landscape properties compiled in Table II
show that the cross-section of the enzymatic activity landscape of DHFR is of the rough Mt Fuji type. The Z-score of the
value, Z
= 7.00, indicates a large statistical significance. This result supports the hypothesis that mutational effects are almost additive in current proteins (Schreiber and Fersht, 1995
; Zhang et al., 1995
; Matthew and Thomas, 1996
; Skinner and Terwilliger, 1996
; Nikolova et al., 1998
; Udaka et al., 2000
) and the effectiveness of the evolution strategy which exploit the accumulation of advantageous mutations in a cross-section of the rough Mt Fuji type (Kuchner and Arnold, 1997
; Aita and Husimi, 2000
; Voigt et al., 2000
; M.Iwakura, K.Maki, J.Suzuki, C.Yamane, in preparation).
The results of the analysis are shown in Figure 3
. The correlation between the experimental fitnesses
Pis and the additive fitnesses FPis is shown in Figure 3a
, with a correlation coefficient r = 0.95. The relation between
Pis (or FPis) and the Hamming distance from O, dPis, is shown in Figure 3b
. This figure is one of the simplest expressions of a fitness landscape, where the sequence space is projected on to a single axis representing the Hamming distance from O. We can see a descending slope from the peak O in this figure and this slope corresponds to the mean slope of the landscape (<|
j|>w = 0.70). Figure 3c
shows the relation between fitnesses of multiple mutants and those predicted by strict additivity of the component mutations. It is remarkable that the plot in Figure 3a
is much less discrepant from the diagonal line than Figure 3c
. This implies that the former plot is more effective for presentation of the mutational additivity. Figure 3d
shows the site-fitness distribution at each site (Aita and Husimi, 1996
). The third site (j = 3) has the least tolerance to amino acid substitutions.
Figure 3a and c
show different aspects of the same landscape. The former is made based on the site-fitness determined through the analysis and the latter is made based on the fitness change observed in the mutants. In the latter plot, the information of local landscapes around the wild-type sequence mainly influences the plot. The additivity in mutational effects has been considered in terms of the latter plot (Wells, 1980). However, the better correlation shown in the former [r = 0.95, P = 9.1x1019, AIC = 35.8 for Figure 3a
; r = 0.77, P = 3.9x105, AIC = 43.2 for Figure 3c
] suggests that it is better to consider the additivity in terms of the former plot. Therefore, we should take into account the fitness landscape unbiasedly to comprehend the additivity. In the case where sufficient sequence-fitness data are not available for the analysis, we cannot draw the former plot.
The concept of approximating a fitness by the summation of site-fitnesses is related to the mean-field approximation in physics. It is important to consider the relation between the site-fitness change
wj(
) and fitness change 
j(
) in a single point mutation from the outgoing residue (wild-type residue) to an incoming residue
. The discrepancy between these values decreases with increasing additivity. Figure 4
shows the correlation between site-fitness change
wj(
) and fitness change 
j(
). Although these quantities do not show quantitatively the same values, these are well correlated with each other.

View larger version (21K):
[in this window]
[in a new window]
|
Fig. 4. Relation between the site-fitness change and fitness change. Each bold letter represents the corresponding amino acid (one-letter abbreviation) substitution introduced in the wild-type sequence. The number at the top in each figure shows the mutated sites in the amino acid sequence of DHFR and the number in parentheses is the site index j.
|
|
The structural basis for the additivities in the mutational effects for DHFR may be the spatial separation of the sites. Based on the spatial coordinates from the PDB (PDB code = 1DDR), the distance between the
-carbon (ß-carbon) of the side chain of Met16 and that of Met20, the distance between Met42 and Met92 and the distance between Met1 and Met92 are 8.6 (9.1), 9.1 (9.4) and 8.7 (10.2) Å, respectively. Other pairs have a distance >10 Å. These distances seem to be long (>7 Å) enough to avoid interactions among these sites (Schreiber and Fersht,1995
). Following this, we examined the occurrence probability of sets of sites distantly separated from each other. First, we randomly chose 106 sets of n sites from among all the 159 sites in the DHFR sequence by the Monte Carlo method. Second, we counted the number of sets of n sites satisfying the condition that the distance between two
-carbons for every pair in the set is greater than dc Å. We used dc = 7 or 8. As a result, the occurrence probability of the sets is 60% (53%) for n = 5, 9% (5%) for n = 10, 0.3% (0.05%) for n = 15, 0.01% (0.0008%) for n = 18 and 0.001% (<0.000001%) for n = 20 (the values in parentheses are for dc = 8). This result suggests that the occurrence probability of cross-sections which are probably the rough Mt Fuji type decreases exponentially as the dimension n increases linearly. Also, our result for n = 5 in this paper is a frequent case (>53%). In addition, we examined the occurrence probability with respect to thermolysine (1LNF) and prolyl endopeptidase (1E5T), whose cross-sections were analyzed by Aita et al. (Aita et al., 2000
) and 22 other proteins, which are designated as representative protein in the FSSP (Fold classification based on StructureStructure alignment of Proteins) database. The occurrence probability for n = 5, 10, 15, 18 and 20 is plotted against the chain length of the proteins in Figure 5
. Following the result that the occurrence probability is dependent on the chain length, we examined the occurrence probability of sets of n points separated by more than dc Å from each other in a sphere with a volume of 142 Å3xchain length, where 142 Å3 is the mean volume for amino acids (Zamyatnin, 1975
). The results from this sphere model show good agreement with those from the simulation for each protein (Fig. 5
).
The analysis of the cross-section of the local fitness landscape reveals the corresponding landscape properties such as a mean slope, roughness and the ratio between them. In almost all practical cases, the cross-section may be constructed by defining the point of the wild-type in sequence space as the origin and defining several sites showing large fitness improvements as axes of the cross-section. Following this, we can expect that the top of the cross-section will be roughly close to the (local or global) peak of the underlying fitness landscape. The mean change in site-fitness over single point mutations from the wild-type residues to the optimal residues roughly gives the mean slope of the cross-section. Although the mean change in site-fitness is inherent in individual cases, the results of the analysis that we have carried out for other proteins so far suggest that the mean slope takes values ranging from 0.5 to 2.0 and the
values are >1.0 (Aita et al., 2000
; unpublished data). If a large number of sequence-fitness data for many proteins are accumulated and available for the analysis, we will be able to obtain the statistically universal quantities of the landscape properties and may apply the information to efficient evolution strategies.
 |
Notes
|
---|
4 To whom correspondence should be addressed. E-mail: husimi{at}fms.saitama-u.ac.jp 
 |
Acknowledgments
|
---|
This work was performed as a part of the R & D Project of the Industrial Science and Technology Frontier Program supported by NEDO (New Energy and Industrial Technology Development Organization).
 |
References
|
---|
Aita,T. and Husimi,Y. (1996) J. Theor. Biol., 182, 469485.[ISI][Medline]
Aita,T. and Husimi,Y. (2000) J. Theor. Biol., 207, 543556.[ISI][Medline]
Aita,T.,Uchiyama,H., Inaoka,T., Nakajima,M., Kokubo,T. and Husimi,Y. (2000) Biopolymers, 54, 6479.[ISI][Medline]
Kuchner,O. and Arnold,F.H. (1997) TIBTECH, 15, 523530.
Matsuura,T., Yomo,T., Trakulnaleamsai,S., Ohashi,Y., Yamamoto,K. and Urabe,I. (1998) Protein Eng., 11, 789795.[Abstract]
Matthew,M.S., and Thomas,C.T. (1996) Proc. Natl Acad. Sci. USA, 93, 1075310757.[Abstract/Free Full Text]
Nikolova,P.V., Henckel,J., Lane,D.P. and Fersht,A.R. (1998) Proc. Natl Acad. Sci. USA, 95, 1467514680.[Abstract/Free Full Text]
Schreiber,G. and Fersht,A.R. (1995) J. Mol. Biol., 248, 478486.[ISI][Medline]
Skinner,M.M. and Terwilliger,T.C. (1996) Proc. Natl Acad. Sci. USA, 93, 1075310757.[Abstract/Free Full Text]
Udaka,K., Wiesmuller,K.H., Kienle,S., Jung,G., Tamamura,H., Yamagishi,H., Okumura,K., Walden,P., Suto,T. and Kawasaki,T. (2000) Immunogenetics, 51, 816828.[ISI][Medline]
Voigt,C.A., Kauffman,S. and Wang,Z.G. (2000) Adv. Protein Chem., 55, 79160.[ISI][Medline]
Wells,J.A. (1990) Biochemistry, 29, 85098517.[ISI][Medline]
Zamyatnin,A.A. (1975) Prog. Biophys. Mol. Biol., 24, 109123.
Zhang,X., Baase,W.A., Shoichet,B.K., Wilson,K.P. and Matthews,B.W. (1995) Protein Eng., 8, 10171022.[Abstract]
Received March 2, 2001;
revised May 15, 2001;
accepted May 28, 2001.