1 Faculty of Technology, Gunma University, Kiryu, Gunma 376, Japan and 3 Room B-116, Bldg 12B, MSC 5677, Laboratory of Experimental and Computational Biology, DBS, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892-5677, USA
Abstract
Keywords: empirical potentials/inverse protein folding/protein fold recognition/sequence-structure alignment/threading and inverse threading with gaps and insertions
Introduction
In order to allow gaps in sequence-structure alignments, two types of problems must be overcome. First, if deletions and insertions in sequence-structure alignments are to be allowed, then the problem of fold recognition becomes essentially the same as the inverse protein folding problem (defined as the problem of selecting from a set of sequences only those sequences that are compatible with a single given structure). One must take into account not only the conformational energies of folds but also the sequence dependences of the whole ensemble of protein conformations in order to evaluate the relative stabilities of sequences or alignments (Miyazawa and Jernigan, 1999c). Here, the stabilities of structures are assumed as a primary requirement for compatibilities between sequences and structures. The second problem is how to evaluate multi-body interactions among residues, or at least the pairwise interactions.
From the viewpoint of the inverse folding problem, Bowie et al. (1991) developed a method in which the fitness of each type of residue at a given residue position in a structure is evaluated with respect to the environment of the residue in the native structure, and then a conventional dynamic programming method (Needleman and Wunsch, 1970) is utilized to align a sequence with a given structure. The score of the alignment obtained is used to represent the compatibility of the sequence with the given structure. It has also been used to evaluate protein models (Lüthy et al., 1992). This method is based on the fact that the environment of a particular residue in a structure is more conserved than the residue itself, and is equivalent to an approximation called the 'frozen approximation' by Godzik et al. (1992), in which the residue's environment is evaluated for the native sequence rather than the trial sequence. If the 'frozen approximation' is used, a conventional dynamic programming method can be used for sequence-structure alignment. However, in principle, the assumption of the native structure environment is inappropriate for detecting structural similarities between extremely divergent proteins and especially between proteins sharing a common fold through convergent evolution, where the environments surrounding equivalent residue positions could be dissimilar (Jones and Thornton, 1993).
Nishikawa and Matsuo (1993) developed an improved evaluation function by adding hydration potentials, hydrogen bond potentials and local conformational potentials, all of which were estimated as potentials of mean force based on statistical preferences observed in known protein structures. They reported that homologous sequence pairs in a sequence database could also be discriminated on the basis of structure-sequence compatibility. In their work, sequences were aligned on the basis of sequence information only by a conventional dynamic programming method, and then 3D-1D compatibilities of protein pairs were evaluated, although 3D-1D alignments were made with the 'frozen approximation' in their later work (Matsuo et al., 1995).
In order to evaluate pairwise interactions between residues more precisely, Jones et al. (1992) used a double dynamic programming method, originally devised for structural alignments by Taylor and Orengo (1989), which is an approximate method for taking pairwise potentials into account. A search algorithm for finding exact global optimum threadings into protein core segments connected by variable loops was devised for pairwise interaction potentials (Lathrop and Smith, 1996).
These and a number of other works (Crippen, 1991; Finkelstein and Reva, 1991; Maiorov and Crippen, 1992; Sippl, 1993; Kocher et al., 1994; Matsuo and Nishikawa, 1994; Huang et al., 1995; Park and Levitt, 1996; Thomas and Dill, 1996; Park et al., 1997) indicate that simple empirical potentials without atomic details may be sufficient to determine overall folds, although some limitation to pairwise potentials is indicated (Mirny and Shakhnovich, 1996). Munson and Singh (1997) developed multi-body potentials for recognition between sequences and structures. Samudrala and Moult (1998) illustrated the importance of using a detailed atomic description for obtaining the most accurate discrimination. To enhance weak signals in each pairwise sequence-structure alignment, multiple sequence threading was also utilized (Taylor, 1997).
Here, we examine the utility of the simple potential function developed by Miyazawa and Jernigan (1999c) for identifying compatibilities between sequences and structures of proteins. The potential function consists of pairwise contact energies (Miyazawa and Jernigan, 1985, 1996, 1999a), repulsive packing potentials for residues (Miyazawa and Jernigan, 1996) and short-range potentials for secondary structures (Miyazawa and Jernigan, 1999b). These potentials were estimated from statistical preferences observed in known protein structures, but are devised to represent the actual interactions in proteins and to be able to estimate actual conformational energies for a wide range of conformations from the native to the denatured state.
Miyazawa and Jernigan (1999c) described how to modify these energy potentials to represent approximately the stabilities of proteins, both multimeric and monomeric, and also how to define a single reference state that is appropriate for both fold and sequence recognition. There, it was shown that this simple scoring function can distinguish native structures from alternate folds and also discriminate native sequences from non-native sequences, in which non-native sequence and structure pairs are generated by threading sequences into other structures in all possible ways without gaps. Here, it will be generalized by allowing deletions and insertions in sequence-structure alignments. Gap penalties will be assumed to be proportional to the number of contacts at each residue position so that gaps tend to be more frequently placed on protein surfaces than in cores.
We evaluate pairwise contact interactions between residues in a mean field approximation on the basis of the probabilities of site pairs being aligned. To obtain the self-consistent values of alignment probabilities of site pairs, an iterative method is employed. In addition to the most probable alignment, that is, the minimum energy alignment, an alignment is also made by successively assigning aligned site pairs by their alignment probabilities; see Miyazawa (1995) for this alignment method. This method, termed a probability alignment, can provide information regarding how reliable each individual aligned pair is. This feature is certainly desirable for aligning distantly related sequences and structures. No information about native amino acid sequences but only structural information is used in the present sequence-structure alignments, in order to see how well empirical energy potentials can select for compatibilities between sequences and structures.
Materials and methods
The total conformational energy of a protein is represented here as a sum over contributions from residues along the sequence as
$$E^{\mathrm{conf}} = \sum_{p} E_{p}^{\mathrm{conf}} \tag{1}$$
Each residue's contribution is further divided into two terms, for secondary structure and for tertiary structure:
$$E_{p}^{\mathrm{conf}} = E_{p}^{\mathrm{sec}} + E_{p}^{\mathrm{tert}} \tag{2}$$
where p indexes residue position.
The short-range interaction energies for secondary structures used here are those estimated (Miyazawa and Jernigan, 1999b) by a potential of mean force from the observed frequencies of secondary structures in known protein structures, which are assumed to be in an equilibrium distribution following the Boltzmann factors of their secondary structure energies. The effects of long-range interactions are taken into account only as a mean field. Because of the limited number of available protein structures, the secondary structure potential is approximated as a sum of additive contributions from neighboring residues along a sequence, with neglect of side chain-side chain interactions. Non-additive contributions are simply neglected. In addition, the effects here from neighboring residues are limited to a dependence on their amino acid types but not on their secondary structures. The conformational specification is limited to a sequential tripeptide. Thus, their secondary structure potential is approximated as a sum of the following contributions:
|
where ip is the pth residue of type ip and sp is the conformational state of the pth residue. The first term in Equation 3 represents the backbone-backbone interactions and the second term corresponds to side chain-backbone interactions either within a residue or among close residues. Altogether, side chain-backbone interactions within five consecutive backbone units on each side of a side chain are included in the short-range interactions. Here it should be noted that two-body and higher order interactions between side chains and backbones of triplets are counted only once in the estimation of each term in Equation 3 to add to the total short-range interaction. The first term es(s-1,s0,s+1) is also defined (Miyazawa and Jernigan, 1999b) to include only half of the two-body interactions between nearest neighbors in order to avoid multiple counts of nearest neighbor interactions in the estimation of the total secondary structure energy of Equation 1.
The tertiary structure energies have previously been estimated as a sum of pairwise residue-residue contact energies and repulsive residue packing energies for volume exclusion, together termed long-range interaction energies in Miyazawa and Jernigan (1996):
$$E_{p}^{\mathrm{tert}} = E_{p}^{c} + E_{p}^{r} \tag{4}$$
The contact energy Epc and the repulsive packing energy Epr of a residue at position p are defined by Equations 18, 19 and 40 in one of our previous papers (Miyazawa and Jernigan, 1996). For the contact energies (eij) for all pairs of the 20 types of residues, which are applied to residue-residue close contacts, our estimates (Miyazawa and Jernigan, 1999a) corrected for the Bethe approximation are used here. Actually, the new estimates of contact energies listed in Miyazawa and Jernigan (1999a) are divided by the factor (0.263) defined in Equation 34 in that paper and used as the values of contact energies. In other words, the intrinsic pairwise interaction energies (eij) are corrected relative to the hydrophobic energies (eir), and the hydrophobic energies are not corrected at all, in order to make the magnitude of contact energies comparable to secondary structure energies (see Table I). The repulsive packing energies for the 20 types of residues, which correspond to penalties for overly dense packing and are a function of the number of residues in contact, were previously estimated by us (Miyazawa and Jernigan, 1996) and are employed here.
|
Alignment energy for scoring of sequence-structure compatibility
The stability of native structure is assumed as a primary requirement for proteins to fold into their native structures. The probability P ({sp}|{ip}) with which a protein sequence {ip} takes a specific conformation {sp} is represented by its conformational energy relative to the free energy of the whole ensemble of protein conformations:
|
|
|
|
Here β is equal to 1/(kT) and nr is the sequence length of a protein. In Equation 5, Econf({sp}|{ip}) is the conformational energy of state {sp} of sequence {ip}, and the sum is taken over all possible conformations. Therefore, the free energy of the whole ensemble can be regarded as a zero energy state, i.e. a reference state for the energy potential representing protein stability. The free energy of the protein ensemble varies unless the protein sequence is fixed. Thus, in order to discuss the compatibilities of different protein sequences with a given fold, the change of the protein ensemble due to variable sequence must be taken into account, in addition to the conformational energy. In sequence-structure alignments, deletions in amino acid sequences are allowed, so that the change to the whole ensemble of protein conformations must be taken into account. As discussed in detail in Miyazawa and Jernigan (1999c), the second term of Equation 5 is approximated by Equation 6; in the sum of Boltzmann factors over all conformations only dominant terms, i.e. native-like compact conformations, are taken into account, and then the logarithm of the function is evaluated in a high-temperature expansion. The term representing a conformational entropy per residue for native-like structures is a constant independent of the amino acid sequence of the protein. The unweighted average of Econf({sp}|{ip}) over native-like conformations is approximated as the conformational energy expected for a typical native structure with the given amino acid composition, which depends only on amino acid composition. This approximation is justified by testing of sequence recognition in inverse threading without gaps performed in our previous work (Miyazawa and Jernigan, 1999c).
If the environments surrounding proteins are the same, the stabilities of those proteins can be compared by potential energies with proper reference states. However, in fold and sequence recognition protein conformations in a monomeric state may need to be compared with other structures in a multimeric state. In this case, entropy loss due to binding ought to be taken into account in addition to binding energies between subunits, in order to measure the stabilities of protein structures in a multimeric state. To overcome this difficulty, energy potentials are modified to measure approximately protein stabilities even for proteins in different environments. A collapse energy (err) is subtracted from the contact energies to remove the protein size dependences and in order to represent protein stabilities for monomeric and multimeric states. On the other hand, the intrinsic potential and backbone-backbone interaction potentials for secondary structures depend strongly on the types of protein structures from which they are estimated. Thus, only energy terms dependent on amino acid sequences are included in a scoring function for sequence-structure alignments; the first term in Equation 3 is ignored and only the other terms are included. These modifications for energy potentials used here have been discussed in detail in Miyazawa and Jernigan (1999c).
After all of the considerations above are included, the following quantity is taken for assessing compatibilities between protein sequences and structures:
|
where (eij - err) within parentheses means that it is the argument of the function. Because judgements on insertions and deletions in sequence-structure alignments are made for every residue, these modifications to energies are taken into account for every residue.
The energy score for secondary structures is defined as follows, by excluding the intrinsic and backbone-backbone interaction energies:
|
The reference state for the backbone-side chain interaction potentials, es(ip,sq-1,sq,sq+1), is defined by adjusting their average energies over native structures to zero energy.
For the tertiary structure energies, the reference energy corresponds to the average tertiary structure energy per residue for each type of residue in the native protein structures. That is, the following difference in the tertiary structure energy is considered:
|
|
The second and the fourth terms in Equation 12 are the average contact energy per residue of type ip and the average repulsive energy per residue of type ip in native structures.
Alignment by evaluating pairwise interactions in a mean field approximation
An example of a specific sequence-structure alignment A is
|
where '-' means a deletion, sp is the conformational state of the pth residue in a given structure and iq is the qth residue of type iq in the sequence that is threaded into the structure.
From Equation 6, a conditional probability P ({sp}|{ip},A) with which the sequence in alignment A takes a specific conformation {sp} can be approximated as follows:
|
where nalignedr is the number of aligned site pairs in the alignment A. Epconf ({sp}|iq,A) is an alignment energy for aligning the qth residue, iq, of the sequence to the pth residue position, sp, in the structure {sp} threaded with the alignment A.
Then, according to Bayes' rule (Feller, 1968), the conditional probability P (A|{sp},{iq}) of an alignment A for a given structure {sp} is represented as
|
where P (A) is the a priori probability for an alignment A. Here, this a priori probability is represented as follows by introducing penalties for gaps:
|
where W is a positive quantity representing a gap penalty and E0 is a negative constant that does not depend on the correspondence of the qth residue iq to the residue position p and is used as a scaling parameter together with gap penalties; it is chosen in such a way that the total energy scores of sequence-structure alignments for random sequences are always positive. In the case of gapless alignments, i.e. simple threadings of sequences into a fold, the a priori probability is the same for any alignment, because the second term in Equation 16 is equal to zero. Unless alignments between different sequence and structure pairs are compared, the use of E0 is equivalent to the addition of E0/2 to the gap penalty in one scheme for gap penalty, because nrseq + nrstr = ki + kd + 2km, where ki, kd and km are the numbers of insertions, deletions and matches/mismatches, respectively, and nrseq and nrstr are the lengths of the sequence and structure, respectively. However, not all gaps are equivalent in the present scheme, where different gap penalties are employed for terminal gaps and for gaps in the middle, and where the gap penalty is taken to have an upper limit; the parameter choices are described later.
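The counting argument above can be checked on a toy alignment. The following is only an illustrative sketch: the column representation (pairs with `None` marking a gap), the function name `count_columns` and the numerical value of E0 are assumptions made for the example and are not taken from the paper.

```python
def count_columns(columns):
    """Count insertions, deletions and matches/mismatches in an alignment.

    Each column is a pair (p, q) of a structure position and a sequence
    position; None marks a gap (hypothetical representation).
    """
    k_ins = sum(1 for p, q in columns if p is None)  # residue with no structure position
    k_del = sum(1 for p, q in columns if q is None)  # structure position with no residue
    k_match = sum(1 for p, q in columns if p is not None and q is not None)
    return k_ins, k_del, k_match


# Toy alignment of a 4-residue sequence against a 4-position structure.
columns = [(0, 0), (1, None), (2, 1), (None, 2), (3, 3)]
n_str, n_seq = 4, 4
k_ins, k_del, k_match = count_columns(columns)

# Every residue and every structure position sits in exactly one column.
assert n_seq + n_str == k_ins + k_del + 2 * k_match

# Consequently, a constant E0 per aligned pair equals a length-dependent
# constant plus a term proportional to the number of gapped residues.
e0 = -2.0  # hypothetical value
total_e0 = e0 * k_match
rewritten = e0 * (n_seq + n_str) / 2 - (e0 / 2) * (k_ins + k_del)
assert abs(total_e0 - rewritten) < 1e-12
```

For a fixed sequence and structure pair the first term of `rewritten` is the same for every alignment, which is why E0 then acts only through the per-gap term.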
Thus, the conditional probability of an alignment A for a given structure {sp} is represented as
|
|
where Z is a partition function for alignments. The energy score E ({sp}|{iq}, A) of an alignment A for a given structure {sp} is defined as
|
The energy score E({sp}|iq,A) is simply equal to the alignment energy Epconf with a scaling parameter
|
Pairwise interactions are evaluated on the basis of the probabilities for site pairs to be aligned, that is, this is a kind of mean field approximation. Thus, the pairwise interaction energies in Epconf ({sp}|iq,A) for alignment A are approximated with pairwise energies for amino acid pairs (iq,i'q) located at neighboring sites (p,p') in the structure with alignment probabilities P (p',q') of structure-sequence site pairs (p',q'):
|
|
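The mean field evaluation described in words above can be sketched as follows. Because the equations themselves are not reproduced in this text, the exact form (for example, any factor of one half or the exclusion of the residue's own site) is an assumption here; the function names, the data layout and the contact energy values are hypothetical.

```python
def pair_energy(contact_energy, a, b):
    """Look up a symmetric contact energy for residue types a and b."""
    return contact_energy[tuple(sorted((a, b)))]


def mean_field_contact_energy(p, residue, neighbors, align_prob, sequence, contact_energy):
    """Expected contact energy of `residue` placed at structure position p.

    neighbors[p]      : structure positions in contact with p
    align_prob[p2][q2]: probability that position p2 is aligned with residue q2
    sequence[q2]      : residue type of the q2-th residue of the threaded sequence
    """
    total = 0.0
    for p2 in neighbors[p]:
        for q2, prob in enumerate(align_prob[p2]):
            # Weight each possible partner residue by its alignment probability.
            total += prob * pair_energy(contact_energy, residue, sequence[q2])
    return total


# Hypothetical toy data, not values from the paper.
contact_energy = {("L", "L"): -7.0, ("K", "L"): -3.0, ("K", "K"): -1.0}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
sequence = "LKL"
sharp = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
fuzzy = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]

e_sharp = mean_field_contact_energy(0, "L", neighbors, sharp, sequence, contact_energy)
e_fuzzy = mean_field_contact_energy(0, "L", neighbors, fuzzy, sequence, contact_energy)
```

With probabilities of 0 or 1 the expression reduces to the ordinary contact energy sum for a single fixed alignment, which is the sense in which the mean field form generalizes it.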
The probability for a structure-sequence site pair (p,q) to be aligned and the probabilities for deletions (p,-) and (-,q) are calculated as
|
|
|
where Zp-1,q-1 is also a partition function but for aligning the N-terminal, partial sequence from the first to the (q-1)th residue with the N-terminal, partial structure from the first to the (p-1)th residue in the whole structure. Z'p+1,q+1 is a partition function for aligning the C-terminal sequence starting from the (q+1)th residue with the C-terminal part from p+1 to the terminal end in the whole structure. Therefore, the following relation is satisfied:
|
where nrseq and nrstr are the lengths of the sequence and structure, respectively. Such partition functions can be calculated from energy scores by a transfer matrix method; see Miyazawa (1995) for a specific description of this method for alignments. To obtain a self-consistent solution for alignment probabilities P (p,q) of structure-sequence site pairs (p,q) in Equation 23, an iteration method is employed here.
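A minimal sketch of how N-terminal and C-terminal partition functions combine into pair alignment probabilities is given below. It is not the scheme of Miyazawa (1995): it assumes a single constant penalty `gap` per gapped residue instead of the position-dependent, capped penalties used in this work, it counts an insertion followed by a deletion and the reverse order as distinct alignments, and all names are hypothetical. `score[p][q]` stands for the energy score of aligning residue q at position p.

```python
import math


def prefix_partition(score, gap, beta=1.0):
    """z[p][q]: partition function for aligning the first p structure
    positions with the first q sequence residues (global alignment)."""
    n, m = len(score), len(score[0])
    w_gap = math.exp(-beta * gap)
    z = [[0.0] * (m + 1) for _ in range(n + 1)]
    z[0][0] = 1.0
    for p in range(n + 1):
        for q in range(m + 1):
            if p == 0 and q == 0:
                continue
            total = 0.0
            if p > 0 and q > 0:  # position p aligned with residue q
                total += z[p - 1][q - 1] * math.exp(-beta * score[p - 1][q - 1])
            if p > 0:            # structure position left unmatched
                total += z[p - 1][q] * w_gap
            if q > 0:            # sequence residue left unmatched
                total += z[p][q - 1] * w_gap
            z[p][q] = total
    return z


def pair_probabilities(score, gap, beta=1.0):
    """Return (P, Z) where P[p][q] is the probability that p and q are aligned."""
    n, m = len(score), len(score[0])
    z_fwd = prefix_partition(score, gap, beta)
    # The C-terminal partition function is the N-terminal one of the reversed problem.
    reversed_score = [row[::-1] for row in score[::-1]]
    z_rev = prefix_partition(reversed_score, gap, beta)
    z_total = z_fwd[n][m]
    prob = [[z_fwd[p][q] * math.exp(-beta * score[p][q])
             * z_rev[n - p - 1][m - q - 1] / z_total
             for q in range(m)] for p in range(n)]
    return prob, z_total
```

For long proteins the plain products above would overflow or underflow, so a practical version would work with logarithms or rescale each row. In a self-consistent scheme, `score` would be recomputed from the resulting probabilities and the calculation repeated until the probabilities stop changing.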
Epconf ({sp}|iq,P (p',q')) in Equations 20-21 is calculated as the sum of contributions of short-range, secondary structure interactions and long-range, tertiary structure interactions; see Equation 9:
|
The present evaluation of the secondary structure energies does not include side chain-side chain interactions, so that the secondary structure energy, the first term in the above equation, is calculated as the short-range interaction energy between the backbone conformation (...,sp,...) and the qth residue of type iq placed at the pth residue position of the structure. In other words, the secondary structure energies have only single body interactions with respect to side chains; hence they can be evaluated without requiring alignment information.
|
where es is defined by Equation 10. The long-range component, Eptert({sp}|iq,P (p',q')), which is defined by Equations 9, 11 and 12, and which includes pairwise contact energies between residues and density-dependent packing interactions among residues, is evaluated as an alignment energy for aligning the amino acid iq at the pth residue position sp of the target structure on the basis of the alignment probabilities for site pairs obtained in the previous iteration. For the first iteration, Godzik et al. (1992) employed the native sequence to evaluate the environment surrounding a residue; here, instead, the long-range energies are evaluated by assuming the alignment that has no gaps except at both termini and in which residue iq is forced to be aligned to position sp.
Then, by evaluating the energy score of alignments with the self-consistent alignment probabilities of site pairs in Equation 21, we can easily calculate the minimum energy score alignment that is the most probable alignment with a conventional dynamic programming method; the minimum energy score alignment Amin is defined as
|
|
The approximation in Equation 21 and this approximation in Equation 30 for the minimum energy alignment become rigorous in the low-temperature limit.
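Once the alignment energies of all site pairs are fixed, the minimum energy alignment is an ordinary dynamic programming problem. The sketch below illustrates this with a Needleman-Wunsch style recursion that minimizes energy; it assumes a single constant penalty `gap` per gapped residue rather than the gap scheme of this paper, and the function name and column representation are hypothetical.

```python
def min_energy_alignment(score, gap):
    """Global alignment minimizing the total energy score.

    score[p][q]: energy for aligning sequence residue q at structure position p
    gap        : constant penalty per gapped residue (simplifying assumption)
    Returns (minimum energy, list of columns (p, q) with None marking a gap).
    """
    n, m = len(score), len(score[0])
    e = [[0.0] * (m + 1) for _ in range(n + 1)]
    for p in range(1, n + 1):
        e[p][0] = e[p - 1][0] + gap
    for q in range(1, m + 1):
        e[0][q] = e[0][q - 1] + gap
    for p in range(1, n + 1):
        for q in range(1, m + 1):
            e[p][q] = min(e[p - 1][q - 1] + score[p - 1][q - 1],
                          e[p - 1][q] + gap,
                          e[p][q - 1] + gap)
    # Trace back from the corner, recomputing the same candidate values.
    columns = []
    p, q = n, m
    while p > 0 or q > 0:
        if q == 0:
            columns.append((p - 1, None))
            p -= 1
        elif p == 0:
            columns.append((None, q - 1))
            q -= 1
        elif e[p][q] == e[p - 1][q - 1] + score[p - 1][q - 1]:
            columns.append((p - 1, q - 1))
            p, q = p - 1, q - 1
        elif e[p][q] == e[p - 1][q] + gap:
            columns.append((p - 1, None))
            p -= 1
        else:
            columns.append((None, q - 1))
            q -= 1
    columns.reverse()
    return e[n][m], columns
```

When several alignments share the minimum energy, this traceback returns one of them, preferring an aligned pair over a gap.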
In addition, we also employ here alignments consisting of the most probable site pairs by successively aligning a site pair in order of pairwise alignment probabilities P (p,q) as follows (Miyazawa, 1995):
Here we term such an alignment a probability alignment. It should be noted that a probability alignment is different from the most probable alignment, i.e. the minimum energy score alignment. The former is based on alignment probabilities of site pairs, and the latter simply means the alignment with the maximum probability, that is, the minimum energy score. Of course, the probability alignment coincides with the most probable alignment in the limit of low temperature.
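One way to build such a probability alignment is sketched below. The step-by-step procedure is not reproduced in this text, so the sketch is an assumed reading of it rather than the published algorithm: site pairs are visited in decreasing order of alignment probability and accepted whenever they preserve the sequential order of the pairs already accepted. The function name and the optional threshold `min_prob` are hypothetical.

```python
def probability_alignment(prob, min_prob=0.0):
    """Greedily assemble an alignment from pair alignment probabilities.

    prob[p][q]: probability that structure position p is aligned with residue q
    min_prob  : pairs with probability <= min_prob are never accepted
    Returns a list of (p, q, probability), sorted by position.
    """
    candidates = sorted(
        ((prob[p][q], p, q) for p in range(len(prob)) for q in range(len(prob[0]))),
        key=lambda item: -item[0],
    )
    chosen = []
    for value, p, q in candidates:
        if value <= min_prob:
            break  # all remaining candidates are at least as improbable
        # Accept only if p and q lie on the same side of every accepted pair.
        if all((p - p2) * (q - q2) > 0 for p2, q2, _ in chosen):
            chosen.append((p, q, value))
    return sorted(chosen)
```

The probability stored with each accepted pair is what makes the result informative about reliability: pairs with low values can be treated as uncertain, or left unaligned by raising `min_prob`.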
A whole ensemble of sequence-structure alignments can be characterized by such quantities as the minimum energy score, free energy score, and internal energy score. The minimum energy score and the free energy score are defined as:
|
|
The statistical average of energy scores over all alignments, which corresponds to the internal energy, is calculated from the following relation involving the partition function:
|
A preliminary test indicates that the capability of recognition of sequence-structure compatibilities seems to be about the same among these three energy scales. In the following, minimum energy scores are employed to judge sequence-structure compatibilities.
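The relation between the three energy scales can be illustrated on a small, explicitly enumerated set of alignment scores. The sketch assumes the standard statistical mechanical definitions (free energy as -(1/β) ln Z and internal energy as the Boltzmann-weighted average of the scores), since the equations themselves are not reproduced in this text; the function name and the toy scores are hypothetical.

```python
import math


def ensemble_scores(energies, beta=1.0):
    """Return (minimum energy, free energy, internal energy) of an ensemble.

    energies: energy scores of all alignments in the ensemble (toy enumeration)
    """
    e_min = min(energies)
    # Shift by the minimum so that the exponentials cannot overflow.
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    z_shifted = sum(weights)
    free_energy = e_min - math.log(z_shifted) / beta
    internal_energy = sum(e * w for e, w in zip(energies, weights)) / z_shifted
    return e_min, free_energy, internal_energy


e_min, free_energy, internal_energy = ensemble_scores([-5.0, -4.0, -1.0], beta=1.0)
# The free energy lies at or below the minimum, the internal energy at or above it.
assert free_energy <= e_min <= internal_energy
```

At low temperature (large β) all three quantities converge to the minimum energy score, which is consistent with the three scales giving similar recognition performance when one alignment dominates the ensemble.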
Gap penalty for sequence-structure alignments
The effects of amino acid replacements on protein structure are not uniform over a sequence, indicating a dependence of amino acid variabilities on residue position. It is well known that, on average, residues are more conserved in the interiors of proteins than on their surfaces (Go and Miyazawa, 1980). One may expect deletions and insertions of residues to occur more frequently in less conserved regions of a sequence. Gap penalties ought to reflect the mutability at each residue position. Here the dependence of residue mutability on residue position (Go and Miyazawa, 1980) is taken into account by setting the gap penalty to be proportional to the number of contacts at each residue position in a protein structure. The number of contacts is utilized here as a simple measure of burial and packing density of residues; see Equations 34-36. In other words, gaps will tend to be inserted in alignments more often on protein surfaces than in protein cores.
For deletions of residues of the structure in sequence-structure alignments, a gap penalty is taken as
|
where w0, w1, w2 and wc are parameters taking zero or positive values and npc is the number of residues in contact with the pth residue. The definition of a contact is the same as that for contact energies; if the two centers of side chains are within 6.5 Å, then they are defined to be in contact with each other. The summation in the equation above is taken over all deletions in a gap. The value of the gap penalty is cut off beyond a certain value wc to allow us to find single domains in multi-domain proteins, and also to reduce computational time for alignments.
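A contact-dependent deletion penalty of this kind might be computed as sketched below. The functional form, a constant w0 plus (w1 + w2 times the number of contacts) for each deleted position, capped at wc, is inferred from the parameter descriptions in the text because the equation itself is not reproduced here; the function names, coordinates and parameter values are hypothetical, and no exclusion of sequence neighbors from the contact count is assumed.

```python
import math


def contact_counts(centers, cutoff=6.5):
    """Number of residues in contact with each residue.

    centers: side chain center coordinates (x, y, z) in angstroms
    cutoff : two residues are in contact if their centers are within this distance
    """
    counts = [0] * len(centers)
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) <= cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts


def deletion_penalty(gap_contacts, w0, w1, w2, wc):
    """Penalty for one gap that deletes the structure positions whose
    contact numbers are listed in gap_contacts (assumed functional form)."""
    uncapped = w0 + sum(w1 + w2 * n for n in gap_contacts)
    return min(wc, uncapped)


# Toy coordinates: three residues in a row 5 angstroms apart and one far away.
centers = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
counts = contact_counts(centers)
penalty = deletion_penalty(counts[:2], w0=1.0, w1=2.0, w2=0.5, wc=100.0)
```

Deleting the isolated fourth residue costs less than deleting the well-packed second one, which is how a penalty of this form steers gaps toward the protein surface.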
For insertions of k residues to the sequence in sequence-structure alignments, inserted between the qth and (q+1)th residue positions in the structure, the gap penalty is set to
|
Equation 35 is trivially different for additions of terminal residues
|
where w'0, w'1, w'2 and w'c are all parameters taking zero or positive values.
Here, it should be noted that the gap penalties are convex functions of gap length because all the gap parameters above take zero or positive values. Gap parameters could be set to different values for insertions and for deletions; however, the same values are employed here for both, in order to reduce the number of parameters. On the other hand, in general, penalties for terminal gaps and gaps in the middle of the sequence should have different values. The algorithm of maximum similarity alignment (minimum energy alignment) used for the present gap scheme corresponds to Equations 22 and 23 in Miyazawa (1995).
Parameter choice for sequence-structure alignments
In conventional sequence alignments, it is well known that an alignment program can produce significantly different alignments with different parameter settings. The effects that a parameter choice has on resulting alignments have been studied (Fitch and Smith, 1983; Vingron and Waterman, 1994). Gotoh (1990) also studied the effects of the variation of gap penalties. Heuristic knowledge about gap penalties in conventional sequence alignments is used in sequence-structure alignments.
Parameters which we must specify are deletion penalties (E0 for aligning residues, w0, w1, w2, wc for gaps in the middle, and w'0, w'1, w'2, w'c for terminal gaps) and the relative temperature 1/β. First, let us consider one of the parameters, E0, which is an additional energy for aligning a residue at a structural position as opposed to a deletion or an insertion, and is a scaling parameter independent of the residue type and residue position; see Equations 16 and 19-20 for its definition. The parameter E0 is chosen in such a way that minimum energy scores for most of the dissimilar protein pairs fall above zero, and such that the minimum energy scores show no clear linear dependence on the sequence length. If this were not the case, long or short alignments would tend to have low-energy scores independent of whether the proteins aligned were related.
The mean energy for random alignments of residues is listed for each type of interaction potential in Table I. All energies in the following are represented in kT units. Here we emphasize that the mean of each energy component, i.e. short-range secondary structure energy, contact energy and repulsive energy, in native structures is set to zero; see Equations 9-12. Permitting gaps in alignments improves energy scores over the mean energy scores for random residue matches. Thus, E0 must be more positive than -1.57 [= -(0.83+0.80-0.06)] with secondary structure energies included or -0.74 [= -(0.80-0.06)] without secondary structure energies.
The penalties for a deletion or an insertion of a residue must be greater than one half of E ({sp}|iq, A), that is, the score for aligning the qth residue of type iq at position p as defined by Equation 20, because the sum of the sequence lengths of the two proteins is equal to the sum of the numbers of deletions and insertions plus two times the number of matches or mismatches in an alignment. Otherwise alignments would not be favored. In the present case, the largest average individual increment of tertiary structure alignment energy in the native environment is expected to be 3.39 (4.13 for contact energies) for misaligning a Leu to a Lys position and 3.67 (4.03 for contact energies) for misaligning a Cys to a Lys position, and for secondary structure energies to be 5.36 for misaligning Pro to a Gly position [see Table III of Miyazawa and Jernigan (1999b)]. The largest average increments of tertiary structure alignment energies are smaller for misalignments in the random environment than for those in the native structure environment. The largest increment of tertiary structure alignment energy, 4.97 (5.33 for contact energies), will occur if Leu is aligned to residue positions that are completely exposed to water. On the other hand, the largest average increment of the sums of secondary structure energies and tertiary structure energies is 5.66 (6.19 for contact energies) for misaligning Ile to a Gly position in the native environment, and 5.90 (6.44 for contact energies) for the random environment. Based on such information, the parameters defined by Equations 34-36 for gap penalties are given in Table II. w0+w1 has been set to be greater than (5.90+E0)/2 with secondary structure energies or (4.97+E0)/2 without secondary structure energies. w1+4w2 has been set to be greater than (5.90+E0)/2 with secondary structure energies or (3.67+E0)/2 without secondary structure energies; the average number of residues in contact at each residue position in proteins is 4.19. The value of the gap penalty is cut off beyond a certain value wc to avoid loading too much penalty onto long gaps. The use of upper limits for gap penalties is especially appropriate for global alignments, over a whole sequence, of sequence-structure pairs in which a compatible domain is limited to only a portion of the sequence and structure. The value for an upper limit wc is chosen to be equal to a penalty for a gap of 20 residues on average; wc ~ w0+20(w1+4w2). Based on choices in conventional sequence alignments (Miyazawa, 1995), the gap penalties and their upper limits have been set to be smaller for terminal gaps than for gaps in the middle of a sequence. The parameters for terminal gaps, w'0, w'1, w'2 and w'c, are arbitrarily set to be one-half of their corresponding parameter values for middle gaps, w0, w1, w2 and wc, respectively. All gap parameters used here are listed in Table II.
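The constraints stated above can be collected into a small calculation. The numerical thresholds (5.90, 4.97, 3.67), the gap length of 20 and the rounded mean contact number of 4 come from the text; the function names and the example values of E0, w0, w1 and w2 are hypothetical, since Table II is not reproduced here.

```python
def gap_lower_bounds(e0, with_secondary=True):
    """Lower bounds quoted in the text for w0+w1 and for w1+4*w2 (kT units)."""
    if with_secondary:
        return (5.90 + e0) / 2, (5.90 + e0) / 2
    return (4.97 + e0) / 2, (3.67 + e0) / 2


def upper_limit(w0, w1, w2, gap_length=20, mean_contacts=4):
    """Cutoff wc chosen as the average penalty of a gap of `gap_length` residues."""
    return w0 + gap_length * (w1 + mean_contacts * w2)


def terminal_parameters(w0, w1, w2, wc):
    """Terminal gap parameters, set to one half of the middle gap values."""
    return tuple(value / 2 for value in (w0, w1, w2, wc))


# Hypothetical example values, not those of Table II.
e0 = -1.0
w0, w1, w2 = 1.0, 2.0, 0.5
bound_open, bound_extend = gap_lower_bounds(e0, with_secondary=True)
assert w0 + w1 > bound_open and w1 + 4 * w2 > bound_extend
wc = upper_limit(w0, w1, w2)
terminal = terminal_parameters(w0, w1, w2, wc)
```

Writing the bounds as functions of E0 makes the coupling explicit: a less negative E0 raises the minimum gap penalties needed for aligned pairs to remain favorable.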
|
|
Datasets of protein structures
Two datasets of protein pairs were prepared; one is a set of homologous protein pairs, and the other is a set of dissimilar protein pairs. For each protein pair in these sets, we calculate minimum energy alignments and also probability alignments, and examine whether their sequences and structures are compatible with each other.
Release 1.35 of the SCOP database (Murzin et al., 1995) is used for the classification of protein folds. Representatives of superfamilies, families or domains are the first entries in the protein lists of each superfamily, each family or each domain in SCOP; if these first proteins in the lists are not appropriate to use for the present purpose, then the second ones are chosen. These families and domains are all those which belong to the protein classes 1-5, that is, the classes of all α, all β, α/β, α+β, and multi-domain proteins. Classes of membrane and cell surface proteins, small proteins, peptides and designed proteins are not used. Proteins whose structures were determined by NMR or with resolution worse than 2.5 Å are removed. Also, proteins whose coordinate sets either consist of only Cα atoms, or include many unknown residues, or lack many atoms or residues, are removed. Proteins shorter than 50 residues are also removed.
In the SCOP database, protein domains whose sequences are highly homologous may be classified into the same domains, and protein domains whose structures are extremely similar may belong to different domains although in the same family. Therefore, protein pairs with more than 90% sequence identity, or whose structures differ by less than 1 Å r.m.s.d. (root mean square deviation), are also removed from the set of domain representatives. As a result, our set of superfamily representatives includes 308 proteins, the set of family representatives has 440 proteins and the set of domain representatives has 988 proteins.
The set of homologous protein pairs is made by pairing the protein representatives of families with those of different domains within the families; the number of homologous protein pairs in this set is 548. Because some families consist of only one domain, only 164 families are included in this set. The set of dissimilar protein pairs is made by arbitrarily choosing only every 100th pair from the ordered list of all possible pairs of superfamily representatives; 505 protein pairs are chosen. In sequence–structure alignments, the first proteins in those protein pairs are used as sequences and the second ones as structures; in other words, the sequences of family representatives and the structures of domain representatives in the same families are compared in sequence–structure alignments of homologous proteins. In inverse structure–sequence alignments, the first proteins are used as structures and the second ones as sequences.
Results
First, the adequacy of sequence–structure alignments with the present method has been examined by comparing the overall characteristics of sequence–structure alignments with those of conventional sequence alignments. Both secondary structure energies and tertiary structure energies are included in the calculation of alignment energy scores. Folds of multimeric proteins and domains are evaluated in the multimeric state or within a whole protein even for sequences of monomeric proteins. Table III shows the values of gap parameters used here for our conventional sequence alignment; the Dayhoff 250 PAM matrix (Dayhoff et al., 1978) is used as a scoring matrix for the sequence alignment, but alternatively BLOSUM matrices (Henikoff and Henikoff, 1992) could have been used.
Figure 1 shows comparisons of the fractions of aligned residue pairs and the fractions of identical amino acid pairs for 548 homologous protein pairs between sequence–structure alignments and conventional sequence alignments. In Figure 1a and 1c, minimum energy score alignments defined by Equation 30 are compared with maximum similarity alignments for sequences. In Figure 1b, probability alignments, which are made by successively aligning site pairs in order of their alignment probabilities, are employed for both sequence–structure and sequence–sequence alignments. The fraction of aligned residue pairs is defined as
|
|
The fraction of identical amino acid pairs is defined in a similar way.
Both the sequence–structure alignments and the conventional sequence alignments give similar aligned fractions of residues for most proteins, indicating the values of E0 and gap parameters to be appropriate. Also, as shown in Figure 1b, we have adjusted the relative temperature (1/β) in such a way that similar fractions of residues are aligned in the probability alignments for both sequence–structure and sequence–sequence alignments.
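The probability alignments used in Figure 1b are described as being made by successively aligning site pairs in order of their alignment probabilities. One way this might be sketched in Python is shown below. It assumes that the alignment probabilities have already been computed, and that a site pair is accepted only if it keeps the alignment one-to-one and order-preserving; the function name `probability_alignment`, the `threshold` argument and the toy probabilities are hypothetical.

```python
def probability_alignment(probs, threshold=0.0):
    """Greedy alignment built from site-pair alignment probabilities.

    probs: dict mapping (i, j) -> probability that sequence site i is
           aligned with structure site j (assumed to be given).
    Site pairs are visited in decreasing order of probability. A pair is
    accepted only if it is consistent with all pairs accepted so far,
    i.e. no site is used twice and the order of sites is preserved.
    Pairs with probability below `threshold` are never aligned.
    """
    chosen = []
    for (i, j), p in sorted(probs.items(), key=lambda kv: -kv[1]):
        if p < threshold:
            break
        if all((i < a and j < b) or (i > a and j > b) for a, b in chosen):
            chosen.append((i, j))
    return sorted(chosen)


# Toy probabilities for a 3-residue sequence and a 4-site structure.
toy = {(0, 0): 0.9, (1, 1): 0.4, (1, 2): 0.6, (2, 1): 0.7, (2, 3): 0.8}
print(probability_alignment(toy))                 # all consistent pairs
print(probability_alignment(toy, threshold=0.7))  # only reliable pairs
```

Raising the threshold keeps only the most reliably aligned site pairs, at the cost of aligning fewer residues.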
As shown in Figure 1c, the present method of sequence–structure alignments yields only slightly fewer identical amino acid pairs than the conventional sequence alignment method, especially for relatively dissimilar proteins. This is understandable, since the sequence–structure alignment method does not actually maximize the sequence identity, as does the conventional sequence alignment method.
Figure 1 indicates the adequacy of sequence–structure alignments for homologous protein pairs with respect to their overall characteristics. To examine further the quality of the present sequence–structure alignments, the r.m.s.d.s in superpositions of aligned residue pairs in the sequence–structure alignments are compared in Figure 2 with those from the maximum similarity alignments of sequences. For this purpose, we employ sequence–structure probability alignments consisting of only the most reliable residue pairs, those aligned with probabilities ≥0.5. The r.m.s.d.s of aligned residue pairs calculated for dissimilar protein pairs indicate that values of r.m.s.d. can be <7 Å even for dissimilar protein pairs, if the number of superposed residues is <50. Therefore, in this figure, protein pairs whose alignments have <50 aligned residue pairs are excluded. In addition, homologous protein pairs which have positive minimum energy scores or negative maximum similarity scores are excluded: in other words, only homologous protein pairs whose similarities are identified by both methods are used. The 357 homologous protein pairs meeting these criteria are plotted in this figure for the sequence–structure alignments. Significant improvements in the values of r.m.s.d. are shown. Although these improvements result partly from choosing only the residue pairs most reliably aligned, they also indicate that the quality of the sequence–structure probability alignments is usually better than that of the corresponding conventional sequence alignments.
Two different types of sequence–structure alignments can be utilized to assess the similarity between two proteins from the viewpoint of sequence–structure relationships: using the first as sequence and the second as structure and then inversely using the first as structure and the second as sequence. Figure 3 shows comparisons between sequence–structure alignments and their inverse structure–sequence alignments for the same homologous protein pairs. In the sequence–structure alignments, family representatives are used as sequences and domain representatives are employed as structures, and in the inverse structure–sequence alignments, structures are family representatives and sequences are domain representatives.
One interesting observation is that on average the energy scores for the alignments are roughly equal for the two types of alignments; see Figure 3a. This result indicates that the present scale of energies and its reference state may be used equally well either to detect compatible sequences with a given structure or compatible folds for a given sequence.
Detection of homologous proteins from dissimilar proteins
One of the most important questions is how well this energy scale can recognize a compatible pair of structure and sequence, particularly those not found from sequence comparisons. The minimum energy scores of alignments are plotted in Figure 4a for the dissimilar protein pairs and in Figure 4b for the homologous protein pairs; see the section "Datasets of protein structures" above for these protein datasets.
As shown in Figure 1c, the present set of homologous protein pairs includes many distantly related protein pairs whose alignments have fractions of identical amino acid pairs below 10%. Thus, as shown in Figure 4b, there are many distantly related protein pairs which have positive minimum energy scores of alignment and are not identified as compatible sequence–structure pairs. The conventional sequence alignment method cannot detect similarities for all of those homologous protein pairs, either. Table IV lists the numbers of false positives and false negatives for the present sequence–structure alignment method and for the conventional sequence alignment method. Here, the judgements are made solely on the basis of the values of scores. In sequence–structure alignments, gap parameters are adjusted so that compatible sequence and structure pairs tend to take negative energy scores and incompatible ones positive energy scores. In conventional sequence alignments, by contrast, gap parameters are adjusted so that positive scores are expected for similar sequences and negative scores for dissimilar sequences. The overall capability to identify homologous protein pairs is slightly better for the conventional sequence method than for the present sequence–structure alignment method, but the two methods complement each other because each recognizes some homologous protein pairs that the other misses. Figure 5 shows a comparison of alignment scores between both methods. On the basis of the values of scores, both methods identify similarities for 392 protein pairs of the 548 homologous protein pairs, but fail for 81 protein pairs. The present sequence–structure alignments identify 25 homologous protein pairs whose similarities were not identified by the conventional sequence alignment method. In the case of the inverse structure–sequence pairs, both methods identify the similarities of 395 protein pairs but fail for 79 protein pairs. The inverse structure–sequence alignments can identify the similarities of 27 homologous protein pairs that cannot be identified by the sequence alignments; 11 of those 27 protein pairs are protein pairs whose compatibilities are identified in common by both the sequence–structure alignments and inverse structure–sequence alignments.
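The bookkeeping behind these counts can be sketched as follows, assuming that each homologous pair carries one minimum energy score (negative meaning compatible) and one maximum similarity score (positive meaning similar), as stated above. The function name `tally_detection`, the category labels, the treatment of a score of exactly zero as "not detected", and the toy scores are all assumptions made for illustration; they are not data from this study.

```python
from collections import Counter


def tally_detection(pairs):
    """Count how homologous pairs are detected by the two methods.

    pairs: iterable of (energy_score, similarity_score) tuples.
    A pair counts as detected by the sequence-structure alignment if its
    energy score is negative, and as detected by the conventional
    sequence alignment if its similarity score is positive. A score of
    exactly zero is treated as not detected (an assumption).
    """
    counts = Counter()
    for energy, similarity in pairs:
        by_structure = energy < 0
        by_sequence = similarity > 0
        if by_structure and by_sequence:
            counts["both"] += 1
        elif by_structure:
            counts["structure_only"] += 1
        elif by_sequence:
            counts["sequence_only"] += 1
        else:
            counts["neither"] += 1
    return counts


# Toy scores, invented for illustration only.
toy_scores = [(-0.3, 12.0), (0.2, -5.0), (-0.1, -3.0), (0.4, 8.0), (-0.2, 1.0)]
print(dict(tally_detection(toy_scores)))
```

If the four categories are exhaustive, the counts quoted above (392 found by both, 81 by neither and 25 by the sequence–structure alignments alone, out of 548) would leave 548 − 392 − 81 − 25 = 50 pairs found by the conventional sequence alignments alone.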
The alignment of the immunoglobulin chain A from Fab HIL (8FAB-A:3–105) and the CD2 chain A from rat (1CDC-A) has a negative minimum energy score per residue of –0.45, but has an extremely large r.m.s.d. value, 19.7 Å; this protein pair is not shown in Figure 7a, because the number of residue pairs aligned with probabilities ≥0.5 is only 47, which is <50. The reason for the large r.m.s.d. is that the coordinates of 1CDC correspond to a metastable misfolded structure. It is interesting that such a misfolded structure has been detected to be compatible with a sequence for which the alignment contains a fraction of identical amino acid pairs below 0.15.
Figure 7b shows the relationship between minimum energy scores per residue and the fractions of identical amino acid pairs in the minimum energy score alignments for the 548 homologous protein pairs. This figure indicates that almost all protein pairs having fractions of identical amino acid pairs >0.2 have negative minimum energy scores, and thus can be identified to be similar. Remarkably, some protein pairs with fractions of identical amino acid pairs <0.10 can have negative energy scores and can therefore be identified to be compatible. The strength of the new approach presented here lies in the individual cases newly identified to be similar, which are not found by sequence comparisons.
Examples of sequence–structure alignment
Figure 8a shows sequence–structure alignments between the human glutathione reductase C-terminal domain of residues 364–478 (3GRS:364–478) and the NADH peroxidase C-terminal domain of residues 322–447 from Enterococcus faecalis (1NPX:322–447). Both types of alignment, that is, the sequence of 3GRS:364–478 versus the structure of 1NPX:322–447, and inversely the structure of 3GRS:364–478 versus the sequence of 1NPX:322–447, are shown. Also, for each type of sequence–structure alignment, two kinds of alignment are shown in this figure: the minimum energy score alignment and the probability alignment that is made by successively aligning site pairs in order of their alignment probabilities.
Superposition of the two structures, 3GRS (residues 364–478) and 1NPX (residues 322–447), with the 73 matched residues, which have a probability of ≥0.5 of being aligned in the probability alignment of the sequence of 3GRS with the structure of 1NPX, is shown in Figure 8b. These aligned residues are shown in green for 3GRS and in blue for 1NPX. It can be seen that the green and blue regions constitute a type of core of these structure fragments.
The minimum energy alignments and probability alignments tend, although not always, to align the same residue pairs when alignment probabilities are >0.5. It should also be noted that the sequence–structure and inverse structure–sequence alignments tend to be identical, especially at sites aligned with probabilities >0.5; sites commonly aligned in all alignments are marked by `#' between the alignments. This fact indicates the suitability of the present scoring function for both fold and sequence recognition.
Figure 9 shows the sequence–structure alignments of bovine purine nucleoside phosphorylase (1PBN) and purine nucleoside phosphorylase A chain from E.coli (1ECP-A). In this figure, probability alignments show only those residues aligned with probabilities ≥0.5. This protein pair is also one of the protein pairs whose compatibilities were not detected by the conventional sequence alignment, but only by the present sequence–structure alignment. There are at least two sequence alignments with the maximum similarity score (25), which are completely different from each other, for this protein pair. One alignment yields only a small number of aligned residues, i.e. 27 residue pairs, and the other aligns as many as 231 residue pairs; the fraction of identical amino acids is 0.02 for the former and 0.14 for the latter. The r.m.s.d. of the 27 residue pairs for the former is 8.0 Å, a value attained only because so few residues are superposed. The r.m.s.d. for the latter is 15.4 Å. These facts indicate that the conventional sequence alignment method actually fails to find similarities for this protein pair. On the other hand, the r.m.s.d. for the minimum energy score alignment of the 1PBN structure with the 1ECP-A sequence is much smaller, 5.3 Å for 235 aligned residue pairs. The probability alignment consisting of the most reliable 107 residue pairs improves the r.m.s.d. further, to 2.6 Å.
To examine the effectiveness of secondary structure potentials on sequence–structure alignments, alignments are also calculated by including only tertiary structure energies, without secondary structure energies. In Figure 10, the fractions of identical amino acid pairs in alignments are compared between the two energy schemes, that is, with and without secondary structure energies. Alignments calculated with secondary structure energies tend to contain more identical amino acid pairs than those without secondary structure energies. This suggests that short-range energy potentials are useful to yield correct positions of residues in sequence–structure alignments. Also, as shown in Table IV, short-range energy potentials improve the capability for recognition of compatibility between sequences and structures. Such improvements in the recognition of sequence–structure compatibilities by secondary structure potentials were previously shown by threading sequences into structures without gaps (Miyazawa and Jernigan, 1999b). Figure 10 indicates that short-range energy potentials improve the recognition of sequence–structure compatibilities through yielding more correct positions of residues in sequence–structure alignments, even though long-range potentials work well principally for the recognition of overall folds. To obtain correct alignments, the short- and long-range potentials are complementary and both seem to be essential.
Discussion
Miyazawa and Jernigan (1999c) reported that this same potential could discriminate native structures from non-native folds and also distinguish native sequences from non-native sequences, in which non-native pairs of sequences and structures are generated by threading in all possible ways, without gaps. In the present paper, significantly more non-native folds are generated by making sequence–structure alignments in which gaps in both the sequences and structures are permitted.
A scoring function to estimate the stabilities of protein sequence and structure pairs has been devised to assess compatibilities between sequences and structures. The compatibility between a sequence and a structure has been taken here to be equivalent to the stability for that pair of structure and sequence. On an energy scale of stability, folds can be compared with one another.
As discussed in our previous work (Miyazawa and Jernigan, 1999c), the following problems need to be solved. First, protein structures in multimeric states and monomeric states must be compared in order to judge which is more compatible with a given sequence. Since it is difficult to evaluate rigorously the stabilities of such folds in multimeric states, because of the entropy loss due to protein binding, we choose instead to approximate those stabilities. In order to overcome this problem, only the terms of conformational energy that depend on the amino acid sequence order have been included in the present energy function. In the case of contact energies, the collapse energy err is subtracted from the contact energies (Miyazawa and Jernigan, 1996, 1999c). This modified energy scale was shown in Miyazawa and Jernigan (1996, 1999c) to provide a threading reference state for successfully discriminating native structures from non-native folds.
The second problem is more essential; the assessment of compatibilities between sequences and structures requires comparisons between different sequences for a given fold. Also, deletions and insertions in alignments must be considered in order to detect similar folds for a given sequence. As a result, the sequence dependences of the whole ensemble of protein conformations must be taken into account to measure stabilities of protein conformations. We take account of only dominant terms, i.e. native-like compact conformations in the summation of Boltzmann factors over all conformations, and then evaluate the logarithm of the partition function with the first and second terms in a high-temperature expansion. Finally, the zero energy state of the energy scoring function is adjusted for each sequence by representing conformational energies relative to a properly defined reference state, the conformational energy of a typical native structure with the same amino acid composition. For assessing the suitability of each type of residue for each structural position, the average conformational energy of each type of residue in the native structures has been chosen as a reference energy for that type of residue, relative to which conformational energies of folds are compared. In other words, sequence–structure alignments with a zero energy in this energy scale have conformational energies comparable to the native structure. It was shown in Miyazawa and Jernigan (1999c) that on the energy scale with this modification, native sequences had lower energy scores than all non-native sequences when the sequences were threaded into structures without gaps. Here it should be noted that in principle native structures ought to be the lowest energy folds for their sequences but native sequences need not be the most compatible with their native structures, even though this is highly probable; some proteins may be incompletely evolved toward the most compatible sequences.
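For orientation, the generic form of a high-temperature expansion of the logarithm of a partition function, truncated after the second-order term, is written out below. This is the standard cumulant expansion and is given only as a sketch; the symbols N (the number of conformations retained in the sum), E_c (the energy of conformation c) and the angle brackets (unweighted averages over those conformations) are notation assumed here, and the exact expression used in this work is the one given in Miyazawa and Jernigan (1999c).

```latex
\ln Z \;=\; \ln \sum_{c=1}^{N} e^{-\beta E_c}
      \;\simeq\; \ln N \;-\; \beta \langle E \rangle
      \;+\; \frac{\beta^{2}}{2}\left( \langle E^{2} \rangle - \langle E \rangle^{2} \right),
\qquad
\langle E^{k} \rangle \equiv \frac{1}{N} \sum_{c=1}^{N} E_c^{k}
```

In this form the sequence dependence of the ensemble enters through the mean and the variance of the conformational energies over the retained compact conformations.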
As a result, this energy function with two types of modifications is expected to estimate properly the stabilities of protein structures with different sequences and also for different environments, i.e. monomeric and multimeric environments. The suitability of these modifications to the energy potentials for fold and sequence recognition is supported by the present results showing that this scoring function can recognize folds compatible with sequences, and inversely sequences with folds, and can generate mostly similar alignments for these two types of aligned sequence and structure pairs.
However, in order to allow deletions and insertions in sequence–structure alignments, additional parameters, corresponding to penalties for gaps, must be introduced into the scoring function. To obtain good alignments, it is important to use a proper gap scheme and to determine a set of appropriate values for gap parameters. Lesk et al. (1986) pointed out that in globin sequences deletions and insertions are infrequently observed in the interiors of helical regions of proteins because of the importance of the stabilization for structures of the packing at helix–helix interfaces, and they introduced variable gap penalties between helical regions and inter-helical and loop regions. Barton and Sternberg (1987) also showed the superiority of their secondary structure dependent alignment method using various gap penalties. Fischel-Ghodsian et al. (1990) modified a dynamic programming method to include predicted secondary structure information. On the other hand, Kanaoka et al. (1989) assigned large gap penalties to the hydrophobic core. Ouzounis et al. (1993) demonstrated that the use of core weights considerably improves the detectability of remote homologues with sequence–structure alignments. In all of these analyses, sequence alignments were improved by the use of variable gap penalties. However, no structural information is available for most sequence alignments. Such a gap scheme is useful only for sequence–structure alignments.
Here, gap penalties are taken simply to be proportional to the number of contacts at each residue position in protein structures. The number of residue contacts is utilized as a simple measure of the packing density of residues. Thus, in densely packed regions in protein structures, insertions and deletions of residues rarely occur in alignments. If necessary, gap penalties could be set to depend on local secondary structures at each residue position. In this paper, we have not quantitatively examined how much the present gap scheme can improve sequence–structure alignments, although qualitative improvements are observed. It is difficult to determine an optimal set of values for gap parameters. It has also been shown that both sequence–structure alignments and conventional sequence alignments of homologous protein pairs have similar overall characteristics with respect to the proportions of deletions and identical residues (see Figure 1). However, it is not easy to obtain good alignments for proteins whose lengths are significantly different. Such alignments depend strongly on the gap parameters for termini. No penalty for terminal gaps may be better for aligning a single domain with a multi-domain protein for identifying a domain in multi-domain proteins, but it is not appropriate in other cases to ignore all terminal gaps.
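As an illustration of using the number of residue contacts as a measure of packing density, the sketch below counts, for each residue, how many other residues lie within a distance cut-off. The contact definition is an assumption made for this example: one representative point per residue (for instance a side-chain centre), a 6.5 Å cut-off, and exclusion of nearest neighbours along the chain. The function name `contact_counts` and the toy coordinates are hypothetical.

```python
import math


def contact_counts(centers, cutoff=6.5, min_separation=2):
    """Number of contacts at each residue position.

    centers: list of (x, y, z) coordinates in Å, one point per residue.
    Two residues i and j are counted as being in contact if they are at
    least `min_separation` apart along the chain and their points lie
    within `cutoff` Å of each other. Both defaults are assumptions.
    """
    n = len(centers)
    counts = [0] * n
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(centers[i], centers[j]) <= cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts


# Toy coordinates: four residues at the corners of a 4 Å square.
square = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 4.0, 0.0), (0.0, 4.0, 0.0)]
print(contact_counts(square))
```

Per-position counts of this kind are what a contact-dependent gap penalty would take as input, so that gaps in densely packed positions are penalized more heavily than gaps in exposed loops.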
Here, folds of multimeric proteins should always be evaluated in their multimeric states even against sequences of monomeric proteins. This is appropriate for searching for sequences compatible with a given structure. However, for searching for compatible folds with a given sequence, templates of protein folds should be evaluated in the monomeric state for monomeric sequences and in the multimeric state for multimeric sequences. Alternatively, protein folds could be evaluated in both the monomeric and multimeric states and the form with the lower energy chosen.
The present energy potential includes long-range, pairwise interaction potentials, so that the exact solution of a minimum energy score alignment cannot be calculated by the dynamic programming method. It is another technical problem in fold recognition to obtain minimal energy-score alignments with such a long-range potential.
Here, pairwise contact energies have been evaluated in a mean field approximation on the basis of probabilities of site pairs being aligned. To obtain self-consistent values of alignment probabilities of site pairs, an iterative method is used for pairwise potential extraction. In most cases, iterations converge rapidly. This approximation becomes rigorous in the low-temperature limit, but it is more useful at higher temperature where an ensemble of alignments becomes significant rather than a minimum energy alignment. In addition to the most probable alignment, i.e. the minimum energy alignment, an alignment has also been made by successively aligning site pairs in order of their alignment probabilities. This alignment method based on alignment probabilities of site pairs is consistent with the mean field approximation for pairwise contact energies to be evaluated on the basis of those probabilities. Alignments made by this probability alignment method coincide with the most probable alignments in a low-temperature limit. This method also provides information about how reliable each aligned site pair is. Figure 2 indicates that alignments consisting of residues aligned with high probabilities can improve significantly the r.m.s.d.s in superposition of two proteins. This feature is particularly desirable for aligning distantly related sequences and structures. Also, it is noteworthy that reliably aligned residue pairs between the sequence of 3GRS and the structure of 1NPX constitute a type of core of these structure fragments (see Figure 8b).
It has been clearly demonstrated that the present scoring function including the present modifications in energy scale and parameters for gap penalties can properly evaluate compatibilities between sequences and structures (see Figures 1–4), and therefore can be used both for searching for compatible folds with a given sequence and likewise for searching for compatible sequences with a given fold (see Figure 3). Figure 5 shows that this method of sequence–structure alignment complements conventional sequence alignment in detecting compatible proteins. As shown in Table V, it is further useful to find a significant number of new protein pairs in which structures are similar but in which sequences were too different for the conventional sequence alignments to detect their structural similarities.
References
Bowie,J.U., Lüthy,R. and Eisenberg,D. (1991) Science, 253, 164–170.
Bryant,S.H. and Lawrence,C.E. (1993) Proteins, 16, 92–112.
Crippen,G.M. (1991) Biochemistry, 30, 4232–4237.
Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure 1978, vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, DC, pp. 345–352.
Feller,W. (1968) An Introduction to Probability Theory and its Applications, vol. I. Wiley, New York.
Finkelstein,A.V. and Reva,B.A. (1991) Nature, 351, 497–499.
Fischel-Ghodsian,F., Mathiowitz,G. and Smith,T.F. (1990) Protein Eng., 3, 577–581.
Fitch,W.M. and Smith,T.F. (1983) Proc. Natl Acad. Sci. USA, 80, 1382–1386.
Go,M. and Miyazawa,S. (1980) Int. J. Pept. Protein Res., 15, 211–224.
Godzik,A., Kolinski,A. and Skolnick,J. (1992) J. Mol. Biol., 227, 227–238.
Gotoh,O. (1990) Bull. Math. Biol., 52, 359–373.
Hendlich,M., Lackner,P., Weitckus,S., Floechner,H., Froschauer,R., Gottsbachner,K., Casari,G. and Sippl,M.J. (1990) J. Mol. Biol., 216, 167–180.
Henikoff,S. and Henikoff,J.G. (1992) Proc. Natl Acad. Sci. USA, 89, 10915–10919.
Huang,E.S., Subbiah,S. and Levitt,M. (1995) J. Mol. Biol., 252, 709–720.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) Nature, 358, 86–89.
Jones,D. and Thornton,J. (1993) J. Comput.-Aided Mol. Des., 7, 439–456.
Kanaoka,M., Kishimoto,F., Ueki,Y. and Umeyama,H. (1989) Protein Eng., 2, 347–351.
Karlin,S. and Altschul,S.F. (1990) Proc. Natl Acad. Sci. USA, 87, 2264–2268.
Kocher,J.-P.A., Rooman,M.J. and Wodak,S.J. (1994) J. Mol. Biol., 235, 1598–1613.
Kraulis,P.J. (1991) J. Appl. Crystallogr., 24, 946–950.
Lathrop,R.H. and Smith,T.F. (1996) J. Mol. Biol., 255, 641–665.
Lesk,A.M., Levitt,M. and Chothia,C. (1986) Protein Eng., 1, 77–78.
Lüthy,R., Bowie,J.U. and Eisenberg,D. (1992) Nature, 356, 83–85.
Maiorov,V.N. and Crippen,G.M. (1992) J. Mol. Biol., 227, 876–888.
Matsuo,Y., Nakamura,H. and Nishikawa,K. (1995) J. Biochem., 118, 137–148.
Matsuo,Y. and Nishikawa,K. (1994) Protein Sci., 3, 2055–2063.
Mirny,L.A. and Shakhnovich,E.I. (1996) J. Mol. Biol., 264, 1164–1179.
Miyazawa,S. (1995) Protein Eng., 8, 999–1009.
Miyazawa,S. and Jernigan,R.L. (1985) Macromolecules, 18, 534–552.
Miyazawa,S. and Jernigan,R.L. (1996) J. Mol. Biol., 256, 632–644.
Miyazawa,S. and Jernigan,R.L. (1999a) Proteins, 34, 49–68.
Miyazawa,S. and Jernigan,R.L. (1999b) Proteins, 36, 347–356.
Miyazawa,S. and Jernigan,R.L. (1999c) Proteins, 36, 357–369.
Moews,P.C. and Kretsinger,R.H. (1975) J. Mol. Biol., 91, 201–228.
Munson,P.J. and Singh,R.K. (1997) Protein Sci., 6, 1467–1481.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.
Needleman,S.B. and Wunsch,C.B. (1970) J. Mol. Biol., 48, 443–453.
Nishikawa,K. and Matsuo,Y. (1993) Protein Eng., 6, 811–820.
Ouzounis,C., Sander,C., Scharf,M. and Schneider,R. (1993) J. Mol. Biol., 232, 805–825.
Park,B. and Levitt,M. (1996) J. Mol. Biol., 258, 367–392.
Park,B.H., Huang,E.S. and Levitt,M. (1997) J. Mol. Biol., 266, 831–846.
Samudrala,R. and Moult,J. (1998) J. Mol. Biol., 275, 895–916.
Sippl,M.J. (1990) J. Mol. Biol., 213, 859–883.
Sippl,M.J. (1993) Proteins, 17, 355–362.
Sippl,M.J. and Weitckus,S. (1992) Proteins, 13, 258–271.
Taylor,W.R. (1997) J. Mol. Biol., 269, 902–943.
Taylor,W.R. and Orengo,C.A. (1989) J. Mol. Biol., 208, 1–22.
Thomas,P.D. and Dill,K.A. (1996) Proc. Natl Acad. Sci. USA, 93, 11628–11633.
Vendruscolo,M. and Domany,M. (1998) Folding Des., 3, 329–336.
Vingron,M. and Waterman,M.S. (1994) J. Mol. Biol., 235, 1–12.
Received October 4, 1999; revised March 6, 2000; accepted April 12, 2000.