A cross-section of the fitness landscape of dihydrofolate reductase

Takuyo Aita1,2, Masahiro Iwakura3 and Yuzuru Husimi2,4

1 Tsukuba Research Institute, Novartis Pharma KK, Ohkubo 8, Urawa Tsukuba, 300-2611, 2 Department of Functional Materials Science, Saitama University, 338-8570 and 3 National Institute of Bioscience and Human Technology, 1–1 Higahi, Tsukuba, 305-8566, Ibaraki, Japan


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
In vitro molecular evolution is regarded as a hill-climbing on a fitness landscape in sequence space, where the ‘fitness’ is a quantitative measure of a certain physicochemical property of a biopolymer. We analyzed a ‘cross-section’ of the enzymatic activity landscape of dihydrofolate reductase (DHFR) by using a method of analysis of a fitness landscape. We limited the sequence space of interest to the five-dimensional sequence space, where the coordinate corresponds to the 1st, 16th, 20th, 42nd and 92nd site in the DHFR sequence. Thirty six mutants mapped into the limited sequence space were taken in the analysis. As a result, the cross-section is of the rough Mt Fuji type based on the mutational additivity. The ratio of the mean slope to the roughness is 2.8 and the Z-score of the original ratio against a distribution of random references is 7.0, which indicates a large statistical significance. The existence of such a cross-section was discussed in terms of the occurrence probability of sets of five sites distantly separated from each other on the DHFR 3D structure. Our results support the effectiveness of the evolution strategy which exploits the accumulation of advantageous single point mutations in such a cross-section.

Keywords: dihydrofolate reductase/fitness landscape/mutational additivity/protein sequence space/sequence-function relationship


    Introduction
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Exploring fitness landscapes of biopolymers and revealing statistically universal properties underlying in the landscapes are very important for making efficient evolution strategies (Voigt et al., 2000Go). In evolutionary molecular engineering, the ‘fitness’ is a quantitative measure of a certain physicochemical property of a biopolymer (e.g. enzymatic activity, affinity to a ligand or structural stability). We previously proposed an analysis method of a local fitness landscape for a current biopolymer (Aita et al., 2000Go), that is, an analysis method of the sequence–fitness relationship. The method is based on a model of a rough Mt Fuji-type landscape as the additive fitness model. We applied the model to analyze a cross-section of a local landscape for the thermostability of prolyl endopeptidase and that for the enzymatic activity of thermolysine. These cross-sections were proved to be of the rough Mt Fuji type with {theta} values of >1.0, where {theta} is defined as the ratio of the ‘mean slope’ to the ‘roughness’ on the fitness surface.

In general, it is impossible for us to explore the immense sequence space of all possible sequences of biopolymers. Experimental limitations with respect to the mutagenized sites in a wild-type sequence and the number of residue types available for each site force us to reduce the size of sequence space for exploration. Consequently, we can depict only a ‘cross-section’ of a local fitness landscape, on which our interest is focused.

Iwakura et al. explored the enzymatic activity landscape of dihydrofolate reductase (DHFR, EC 1.5.1.3) with a chain length of 159 (M.Iwakura, K.Maki, J.Suzuki, C.Yamane, in preparation). Initially, they constructed a fully active Cys-free DHFR containing a double point mutation of C85A and C152S (this double point mutant was abbreviated as AS-DHFR). The enzymatic activity of AS-DHFR is the same as that of the native wild-type Escherichia coli DHFR. The AS-DHFR was the initial sequence in the following artificial evolution experiment. The saturation mutagenesis was performed at each of the five sites (1st, 16th, 20th, 42nd and 92nd sites) where methionine residues are located in the AS-DHFR sequence, with the aim of producing methionine-free mutant, which is prevented from oxidation. The enzymatic activity was evaluated spectrophotometrically at 15°C by monitoring the disappearance of NADPH and DHF at 340 nm. Several (nearly) advantageous or neutral single point mutations enhancing or keeping the activity were selected as candidates for the following optimization. Subsequently, several multiple mutants incorporating the selected mutations or wild-type residue at the five sites were constructed and evaluated for the relative activity. It is noted that the artificial evolution of the enzyme was carried out under the selection pressure in the scale of the natural logarithm of the relative enzymatic activity. We analyzed the 36 clones of DHFR mutants by using our model, regarding AS-DHFR as the ‘wild-type’ in this paper.


    Materials and methods
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Basis of model and analysis of the fitness landscape

Picking out all of the mutagenized sites in a particular ‘wild-type’ sequence, rearranging them in arbitrary order and numbering each of them from j = 1 through j = v, we define the edited ‘sequences’ with the length v (where the non-mutagenized sites are not considered). Let {lambda}j be the total number of residue types available for the jth site. Then, we consider the {{lambda}1, {lambda}2, ..., {lambda}v}-valued v-dimensional sequence space of all possible sequences of j = 1 {lambda}j.

The ‘fitness’ of a sequence P, P, is composed of an additive term FP and a non-additive term {omega}P as a small random component:

(1)

(2)

(3)
where O represents the optimum sequence that corresponds to the single peak on the additive fitness landscape and {alpha}Pj represents the particular residue at the jth site in the given sequence P. Therefore, {alpha}Oj denotes the ‘correct’ residue at the jth site in the sequence O. wj({alpha}) is the ‘site-fitness’ as a fitness contribution from a certain residue {alpha} at the jth site. The mean site-fitness over ‘incorrect’ residues for the jth site is denoted by

where {sum}{alpha}!={alpha}Oj means that the sum is taken over all the ‘incorrect’ residue ({alpha}!={alpha}Oj) for the jth site. We define a ‘mean slope’ as the mean change in fitness when an ‘incorrect’ residue ({alpha}!={alpha}Oj is replaced by a ‘correct’ residue ({alpha}Oj), and define an ‘altitude’ as the difference between the highest FO and the mean fitness over all possible sequences.

Let {P1, P2, ..., PN} be a set of evaluated sequences in the {{lambda}1, {lambda}2, ..., {lambda}v}-valued v-dimensional sequence space and let {P1, P2, ...,PN} be a set of fitness data for the sequences. N is the number of fitness data. We consider a procedure for fitting our model to the explored landscape [where we have 1 + ({lambda}j - 1) unknowns: FO and wj({alpha})s]. The fitting procedure is performed by the least-squares method to identify a sequence of O, FO and wj ({alpha})s which minimize the value of (Pi - FPi)2. The simultaneous equations for these unknowns are

(4)

(5)
where {delta}({alpha},{alpha}') is Kronecker's delta.

After obtaining the site-fitness wj ({alpha}) for all available residues, the ‘mean-slope’, ‘altitude’ and ‘roughness’ of the landscape are determined, respectively, as follows:

(6)

(7)

(8)

In addition, the {theta} value as the ratio of the mean slope to the roughness and {Theta} value as the ratio of the altitude to the roughness are determined:

(9)

(10)

As the {theta} or {Theta} value becomes larger, the surface becomes smoother in comparison with the mean slope or altitude. When {theta} or {Theta} is small, O does not necessarily correspond to the global optimum. Therefore, we call O the ‘pseudo-optimum’.

A permutation testing is necessary to evaluate the statistical significance for the original {theta} value, by comparison with the reference {theta} values for ‘randomly shuffled landscapes’ created by randomly shuffling the fitness values in the sequence space. In almost all cases, randomly shuffled landscapes are expected to show no correlation on their surfaces. Generating a number of them by the Monte Carlo method and analyzing them as we mentioned above, we can obtain the frequency distribution of the {theta} values for the randomly shuffled landscapes. Then we defined the Z-score for the original {theta} value as

(11)
where {theta}ORG denotes the original {theta} value and E [{theta}RSL] and SD [{theta}RSL] denote the mean and standard deviation for the frequency distribution of the {theta} values for the randomly shuffled landscapes, respectively.


    Results and discussion
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Based on the AS-DHFR sequence (C85A, C152S), Iwakura et al. constructed 35 mutants incorporating amino acid substitutions at the five sites (1st, 16th, 20th, 42nd and 92nd sites) where methionine residues are located in the AS-DHFR sequence. The enzymatic activity was evaluated spectrophotometrically at 15°C by monitoring the disappearance of NADPH and DHF at 340 nm. For normalization of activity, the measured activity was divided by the absorbance at 660 nm, which is proportional to the total amount of the enzyme. The adopted measure of enzymatic activity is approximately proportional to the specific activity of the enzyme. These 36 clones and their relative activity are listed in Table IGo.


View this table:
[in this window]
[in a new window]
 
Table I. List of mutants of dihydrofolate reductasea

 
Since the enzymatic reactions of the DHFR mutants obey a Michaelis–Menten mechanism with Km values in the range 1–10 µM and the concentration of the substrate DHF was 50 µM, the specific activity is approximately proportional to the turnover number kcat. Therefore, we define the fitness of each clone as the natural logarithm of the relative activity, to treat the fitness on the free energy (over temperature) scale (Matsuura et al., 1998Go). The coordinate of the sequence space that we are considering consists of {M,A} at the 1st site (j = 1), {M,A,F,N,S,R} at the 16th site (j = 2), {M,I,L,V} at the 20th site (j = 3), {M,F,V,Y} at the 42nd site (j = 4) and {M,F,I} at the 92nd site (j = 5). This is the {2,6,4,4,3}-valued five-dimensional sequence space. We note that the 85th and 152nd sites are fixed with alanine residue and serine residue, respectively. Figure 1Go illustrates the fitness landscape in this space. The pseudo-optimum O on the landscape was expected as a sequence ‘AFLYM’, while the most inferior sequence was expected as ‘MRVFI’. Each axis in the sequence space in this figure is expressed by an arrow and amino acids on each axis are arranged in increasing order of site-fitness values that we obtained through the analysis. We can see the gradient of the fitness from the lower side through the upper side. The gradient shows the feature of the (rough) Mt Fuji-type landscape. Figure 2aGo is a graph showing the interconnection of the 36 clones with the corresponding additive fitnesses (Pi) and is an another expression of the fitness landscape. We note that the presentation for experimental fitnesses (FPi) on the same graph is omitted because it is very similar to Figure 2aGo. The clones which are interconnected to each other by the unity in the Hamming distance (d = 1) are grouped into the three clusters. The three clusters are interconnected by a Hamming distance of 2 (d = 2). A good point of this expression is that we can follow the fitness correlation of neighbor sequences, such as a conspicuous tendency that clones with large fitnesses are relatively close to O. Figure 2bGo is an expression of the landscape of the residuals {omega}Pi (= Pi - FPi). We can see that the residuals are not mutually correlated between neighbor sequences. The correlation length of the ‘residual landscape’ was calculated to be less than unity in the Hamming distance (Aita et al., 2000Go).



View larger version (69K):
[in this window]
[in a new window]
 
Fig. 1. A cross-section of a local enzymatic activity landscape for dihydrofolate reductase. {lambda}1 = 2, {lambda}2 = 6, {lambda}3 = 4, {lambda}4 = 4, {lambda}5 = 3, {nu} = 5. The sequence space is projected on the plane of the page. The sequences denoted by letters (shown in Table IGo) are located at nodes: A denotes the AS-DHFR and B, C, ..., denote mutants. A single point mutation is represented by the corresponding vector shown as an arrow. The bold letters on arrows represent amino acids and are arranged in increasing order of site-fitness values. The fitness of a sequence is represented by the height of the bar standing at near the corresponding letter (arbitrary unit). The black bars are for experimental fitnesses (Pi) and gray bars are for additive fitnesses (FPi) obtained from the analysis.

 


View larger version (25K):
[in this window]
[in a new window]
 
Fig. 2. Graphical representation of DHFR landscape. (a) The interconnection of the 36 DHFR mutants is shown. Each letter at a vertex represents the corresponding sequence defined in Table IGo. ‘O’ is the ‘pseudo optimum’. The additive fitness FPi of each clone is expressed as a disk size. The sequences interconnected by the unity Hamming distance (d = 1) are linked with thick solid lines. The sequences included in the different clusters and interconnected by the Hamming distance of 2 (d = 2) are linked with thin solid lines. All of the sequences included in the upper left clusters (surrounded by thin dashed circle) are interconnected with the sequence ‘Q’ by d = 2. The sequences from ‘A’ to ‘V’ which are interconnected with O by a Hamming distance of 3 (d = 3) are linked with thin dotted broken lines. (b) The residual {omega}Pi (= FPiFPi) is expressed as a triangle size. The upward and downward filled triangles suggest positive residuals and negative residuals, respectively. This is regarded as the residual landscape.

 
The landscape properties compiled in Table IIGo show that the cross-section of the enzymatic activity landscape of DHFR is of the rough Mt Fuji type. The Z-score of the {theta} value, Z{theta} = 7.00, indicates a large statistical significance. This result supports the hypothesis that mutational effects are almost additive in current proteins (Schreiber and Fersht, 1995Go; Zhang et al., 1995Go; Matthew and Thomas, 1996Go; Skinner and Terwilliger, 1996Go; Nikolova et al., 1998Go; Udaka et al., 2000Go) and the effectiveness of the evolution strategy which exploit the accumulation of advantageous mutations in a cross-section of the rough Mt Fuji type (Kuchner and Arnold, 1997Go; Aita and Husimi, 2000Go; Voigt et al., 2000Go; M.Iwakura, K.Maki, J.Suzuki, C.Yamane, in preparation).


View this table:
[in this window]
[in a new window]
 
Table II. Local landscape properties for dihydrofolate reductasea
 
The results of the analysis are shown in Figure 3Go. The correlation between the experimental fitnesses Pis and the additive fitnesses FPis is shown in Figure 3aGo, with a correlation coefficient r = 0.95. The relation between Pis (or FPis) and the Hamming distance from O, dPis, is shown in Figure 3bGo. This figure is one of the simplest expressions of a fitness landscape, where the sequence space is projected on to a single axis representing the Hamming distance from O. We can see a descending slope from the peak O in this figure and this slope corresponds to the mean slope of the landscape (<|{varepsilon}j|>w = 0.70). Figure 3cGo shows the relation between fitnesses of multiple mutants and those predicted by strict additivity of the component mutations. It is remarkable that the plot in Figure 3aGo is much less discrepant from the diagonal line than Figure 3cGo. This implies that the former plot is more effective for presentation of the mutational additivity. Figure 3dGo shows the site-fitness distribution at each site (Aita and Husimi, 1996Go). The third site (j = 3) has the least tolerance to amino acid substitutions.



View larger version (24K):
[in this window]
[in a new window]
 
Fig. 3. Results of analysis of dihydrofolate reductase. (a) Correlation between experimental fitnesses (Pi) and additive fitnesses (FPi). AIC = 35.8, where AIC is the Akaike information criterion defined as AIC = –2x(natural logarithm of the maximal likelihood – the number of parameters). (b) A sketch of the landscape. The abscissa is the Hamming distance dPi from the pseudo-optimum O to each sequence Pi, where O = ‘AFLYM’. The ordinate is Pi (upper case letter) or FPi (lower case letter). The solid line represents the mean slope. (c) The changes in fitnesses for the multiple mutant [{Delta}(multi)] versus the sum for the component mutants ({Sigma}j {Delta}j). The ‘wild-type’ is AS-DHFR (‘A’). The dashed line has a slope of 1. AIC = 43.2. (d) Distribution of site-fitnesses {Delta}wj({alpha}) for individual residues at each site. The number at the top in each column is the site index j and the number in the parentheses is the site in the amino acid sequence of DHFR. Each letter represents the one-letter abbreviation of the amino acid residue.

 
Figure 3a and cGo show different aspects of the same landscape. The former is made based on the site-fitness determined through the analysis and the latter is made based on the fitness change observed in the mutants. In the latter plot, the information of local landscapes around the wild-type sequence mainly influences the plot. The ‘additivity’ in mutational effects has been considered in terms of the latter plot (Wells, 1980). However, the better correlation shown in the former [r = 0.95, P = 9.1x10–19, AIC = 35.8 for Figure 3aGo; r = 0.77, P = 3.9x10–5, AIC = 43.2 for Figure 3cGo] suggests that it is better to consider the additivity in terms of the former plot. Therefore, we should take into account the fitness landscape unbiasedly to comprehend the additivity. In the case where sufficient sequence-fitness data are not available for the analysis, we cannot draw the former plot.

The concept of approximating a fitness by the summation of site-fitnesses is related to the mean-field approximation in physics. It is important to consider the relation between the site-fitness change {Delta}wj({alpha}) and fitness change {Delta}j({alpha}) in a single point mutation from the outgoing residue (wild-type residue) to an incoming residue {alpha}. The discrepancy between these values decreases with increasing additivity. Figure 4Go shows the correlation between site-fitness change {Delta}wj({alpha}) and fitness change {Delta}j({alpha}). Although these quantities do not show quantitatively the same values, these are well correlated with each other.



View larger version (21K):
[in this window]
[in a new window]
 
Fig. 4. Relation between the site-fitness change and fitness change. Each bold letter represents the corresponding amino acid (one-letter abbreviation) substitution introduced in the wild-type sequence. The number at the top in each figure shows the mutated sites in the amino acid sequence of DHFR and the number in parentheses is the site index j.

 
The structural basis for the additivities in the mutational effects for DHFR may be the spatial separation of the sites. Based on the spatial coordinates from the PDB (PDB code = 1DDR), the distance between the {alpha}-carbon (ß-carbon) of the side chain of Met16 and that of Met20, the distance between Met42 and Met92 and the distance between Met1 and Met92 are 8.6 (9.1), 9.1 (9.4) and 8.7 (10.2) Å, respectively. Other pairs have a distance >10 Å. These distances seem to be long (>7 Å) enough to avoid interactions among these sites (Schreiber and Fersht,1995Go). Following this, we examined the occurrence probability of sets of sites distantly separated from each other. First, we randomly chose 106 sets of n sites from among all the 159 sites in the DHFR sequence by the Monte Carlo method. Second, we counted the number of sets of n sites satisfying the condition that the distance between two {alpha}-carbons for every pair in the set is greater than dc Å. We used dc = 7 or 8. As a result, the occurrence probability of the sets is 60% (53%) for n = 5, 9% (5%) for n = 10, 0.3% (0.05%) for n = 15, 0.01% (0.0008%) for n = 18 and 0.001% (<0.000001%) for n = 20 (the values in parentheses are for dc = 8). This result suggests that the occurrence probability of cross-sections which are probably the rough Mt Fuji type decreases exponentially as the dimension n increases linearly. Also, our result for n = 5 in this paper is a frequent case (>53%). In addition, we examined the occurrence probability with respect to thermolysine (1LNF) and prolyl endopeptidase (1E5T), whose cross-sections were analyzed by Aita et al. (Aita et al., 2000Go) and 22 other proteins, which are designated as ‘representative protein’ in the FSSP (Fold classification based on Structure–Structure alignment of Proteins) database. The occurrence probability for n = 5, 10, 15, 18 and 20 is plotted against the chain length of the proteins in Figure 5Go. Following the result that the occurrence probability is dependent on the chain length, we examined the occurrence probability of sets of n points separated by more than dc Å from each other in a sphere with a volume of 142 Å3xchain length, where 142 Å3 is the mean volume for amino acids (Zamyatnin, 1975Go). The results from this sphere model show good agreement with those from the simulation for each protein (Fig. 5Go).



View larger version (55K):
[in this window]
[in a new window]
 
Fig. 5. The occurrence probability of sets of n sites distantly separated from each other. Regarding DHFR (1DDR; chain length = 159), thermolysine (1LNF; chain length = 316) and prolyl endopeptidase (1E5T; chain length = 710) and 22 other proteins designated as ‘representative protein’ in the FSSP database, the occurrence probability of sets of n sites satisfying the condition that the distance between two {alpha}-carbons for every pair site in the set is greater than dc Å was calculated by the Monte Carlo method. dc = 7 (top) and 8 (bottom). The plots on each vertical line are for the corresponding protein. The plots relevant to our analyses (this work and Aita et al., 2000Go) are shown using solid symbols. The solid curves were derived theoretically from the sphere model of proteins (see text).

 
The analysis of the cross-section of the local fitness landscape reveals the corresponding landscape properties such as a mean slope, roughness and the ratio between them. In almost all practical cases, the cross-section may be constructed by defining the point of the wild-type in sequence space as the origin and defining several sites showing large fitness improvements as axes of the cross-section. Following this, we can expect that the top of the cross-section will be roughly close to the (local or global) peak of the underlying fitness landscape. The mean change in site-fitness over single point mutations from the wild-type residues to the optimal residues roughly gives the mean slope of the cross-section. Although the mean change in site-fitness is inherent in individual cases, the results of the analysis that we have carried out for other proteins so far suggest that the mean slope takes values ranging from 0.5 to 2.0 and the {theta} values are >1.0 (Aita et al., 2000Go; unpublished data). If a large number of sequence-fitness data for many proteins are accumulated and available for the analysis, we will be able to obtain the statistically universal quantities of the landscape properties and may apply the information to efficient evolution strategies.


    Notes
 
4 To whom correspondence should be addressed. E-mail: husimi{at}fms.saitama-u.ac.jp Back


    Acknowledgments
 
This work was performed as a part of the R & D Project of the Industrial Science and Technology Frontier Program supported by NEDO (New Energy and Industrial Technology Development Organization).


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results and discussion
 References
 
Aita,T. and Husimi,Y. (1996) J. Theor. Biol., 182, 469–485.[ISI][Medline]

Aita,T. and Husimi,Y. (2000) J. Theor. Biol., 207, 543–556.[ISI][Medline]

Aita,T.,Uchiyama,H., Inaoka,T., Nakajima,M., Kokubo,T. and Husimi,Y. (2000) Biopolymers, 54, 64–79.[ISI][Medline]

Kuchner,O. and Arnold,F.H. (1997) TIBTECH, 15, 523–530.

Matsuura,T., Yomo,T., Trakulnaleamsai,S., Ohashi,Y., Yamamoto,K. and Urabe,I. (1998) Protein Eng., 11, 789–795.[Abstract]

Matthew,M.S., and Thomas,C.T. (1996) Proc. Natl Acad. Sci. USA, 93, 10753–10757.[Abstract/Free Full Text]

Nikolova,P.V., Henckel,J., Lane,D.P. and Fersht,A.R. (1998) Proc. Natl Acad. Sci. USA, 95, 14675–14680.[Abstract/Free Full Text]

Schreiber,G. and Fersht,A.R. (1995) J. Mol. Biol., 248, 478–486.[ISI][Medline]

Skinner,M.M. and Terwilliger,T.C. (1996) Proc. Natl Acad. Sci. USA, 93, 10753–10757.[Abstract/Free Full Text]

Udaka,K., Wiesmuller,K.H., Kienle,S., Jung,G., Tamamura,H., Yamagishi,H., Okumura,K., Walden,P., Suto,T. and Kawasaki,T. (2000) Immunogenetics, 51, 816–828.[ISI][Medline]

Voigt,C.A., Kauffman,S. and Wang,Z.G. (2000) Adv. Protein Chem., 55, 79–160.[ISI][Medline]

Wells,J.A. (1990) Biochemistry, 29, 8509–8517.[ISI][Medline]

Zamyatnin,A.A. (1975) Prog. Biophys. Mol. Biol., 24, 109–123.

Zhang,X., Baase,W.A., Shoichet,B.K., Wilson,K.P. and Matthews,B.W. (1995) Protein Eng., 8, 1017–1022.[Abstract]

Received March 2, 2001; revised May 15, 2001; accepted May 28, 2001.