On the structural complexity of a protein

Pierre-Yves Calland

LAMIH-ROI, UMR CNRS 8530, Université de Valenciennes et du Hainaut-Cambrésis, Le Mont Houy, B.P. 311, 59304 Valenciennes Cédex, France E-mail: pierre-yves.calland{at}univ-valenciennes.fr


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 Appendix
 References
 
The determination of the configuration of a protein in three-dimensional (3D) space constitutes one of the major challenges in molecular biology research today. A method consists in choosing a protein structure from a database that minimizes an energy function. First, we model the problem in terms of dynamic programming and show that the determination of the order in which the variables must be considered to minimize the time complexity is an NP-hard problem. Second, we propose a new decomposition algorithm of the threading problem that is based on the connectivity of the graph induced by the 3D structure of a protein. Our decomposition could be used to solve the threading problem. The goal in this paper is to evaluate the intrinsic complexity of 3D structure, which can be viewed as information that may be incorporated into a solution method. It provides two indexes of complexity (time and space) and determines in polynomial time complex components of the 3D structure of a protein.

Keywords: complexity/connectivity/dynamic programming/NP-hard/protein threading


    Introduction
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 Appendix
 References
 
Each protein folds into its own unique three-dimensional (3D) structure determined by the sequence of amino acids of which the protein is composed. It is crucial to have a good understanding of how proteins fold to make predictions. The determination of the configuration of a protein in 3D space from information related to its primary structure (a chain of amino acids) and to its secondary structure (succession of loops, {alpha}-helices and ß-sheets) constitutes one of the major challenges in molecular biology research today.

One of the most studied problems consists in recognizing the three-dimensional structure of a protein of which the one-dimensional structure is known (Bryant and Lawrence, 1993Go; Lathrop and Smith, 1996Go; Madej et al., 1995Go). The prediction of the 3D structure of a protein is based on a set of experimental results providing a coefficient of repulsion or of attraction between two amino acids in a specified environment. The values of these preference functions are determinant for a good prediction of the 3D structure, and constitute the first step in a protocol of determining the 3D structure (Bryant and Lawrence, 1993Go; Madej et al., 1995Go; Lathrop and Smith, 1996Go; Xu and Xu, 2000Go). Once these functions are known and given a protein and a set of 3D structures, the problem consists in determining a score that corresponds to the appropriateness of a protein and a 3D model, and isolating a structure or a set of candidate structures. An approach which may be utilized consists in minimizing a cost function of which the parameters are the positions of the secondary structure cores in the primary structure.

More formally, B = (Bi)1<=i<=m denotes a set of m cores symbolizing the elements of the secondary structure, where bi is the length of core i and xi is its relative position (i.e. xi = x'i {Sigma}1<=j<ibj, where x'i is the absolute position) on the protein of length n (xi [1...ñ]).

Let f'i(xi) denote the value of the energy or the preference of a secondary structure element, g'ij(xi, xj) denote the potential value of the interaction between the core i and the core j and l'i,i+1(xi, xi+1) denote the energy value of the loop between core i and i + 1. We assume that these functions f'i(.), g'ij(.,.) and l'i,i+1(.,.) are pre-computed and stored in tables. These different tables are obtained from the preference tables evaluated experimentally and takes O(m2ñ2) time and space complexities. The function to minimize is the total energy of the protein’s structure. The problem of threading can be formulated as:

(1)
where l0,1(x0, x1) = l0,1(1, x1) and lm,m+1(xm, xm+1) = lm,m+1(xm, ñ) correspond respectively to the cost of the loop between the beginning of the sequence and core 1, and the cost of the loop between core m and the end of the sequence. The function (1)Go is simplified to give a reformulated threading problem (2)Go :

(2)
where PX corresponds to a subset of position variables of the set of cores X B, e.g. PX = {xi,i X}, and C = {xi <= xi+1, 1 <= xi <= ñ}.

To solve this NP-hard problem (Lathrop, 1994Go), two types of method can be used: exact methods such as the ‘divide and conquer’ approach (Xu and Xu, 2000Go), ‘branch and bound’ method (Lathrop, 1999Go), etc., or approximate methods (heuristics) such as genetic algorithms (Yadgari et al., 1998Go), etc.

This paper is organized as follows. We first explain how the structural complexity of a protein is tightly linked to the complexity of a threading algorithm based on the dynamic programming method. Then we show that the determination of the order in which the variables must be considered to minimize time complexity is an NP-hard problem and we propose a polynomial algorithm for the threading problem using a decomposition based on the connectivity of the graph induced by the 3D structure of a protein. We also analyze time and space complexities and we extend our decomposition to a weighted graph. Finally, we summarize and discuss our results.


    Materials and methods
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 Appendix
 References
 
In this section we give an intuitive explanation of the method which we will develop later to evaluate the 3D complexity of a protein by using an algorithm based on a dynamic programming principle.

Given a 3D structure and a protein, a method of threading consists in positioning the cores of the 3D structure on to the 1D chain of the target protein in such way to minimize the energy. We illustrate this method on a small example given in Figure 1Go. Each capital letter denotes an amino acid and cores (of length equal to two) are represented by vertices in Figure 1aGo and by rectangle boxes in Figure 1c and dGo. The purpose of this example is only to introduce the problem. The length of the protein is equal to n = 10 and the number of positions that a core can take is equal to ñ = 5. Figure 1cGo represents the alignment of the 3D structure (Figure 1aGo) with the target protein (Figure 1bGo) such as x1 = 1, x2 = 2 and x3 = 4 (x1, x2 and x3 are the relative positions of the cores). In this case the energy of interaction between cores 1 and 2 is equal to 1, and the energy between cores 2 and 3 is equal to 2. With this distribution, the energy of the protein is equal to 3. If we assume on the other hand that x3 = 3 (see Figure 1dGo), the energy of the protein is equal to 2. The configuration in Figure 1cGo will thus not be retained because its energy is higher than the energy of the configuration in Figure 1dGo.



View larger version (4K):
[in this window]
[in a new window]
 
Fig. 1. Example.

 
Determine the position of the cores in such a manner to minimize the energy can be carried out in various ways. We will illustrate the dynamic programming principle that we will more formally develop further. We will show with this example that the computation effort needed to find the minimal energy is strongly related to the structural complexity of the 3D structure.

A dynamic programming method proceeds by successive steps where each step uses the results of one or several previous steps. Concretely, determining the positions of the cores that minimize the energy function in the example in Figure 1Go can be done in the following way:

  1. For all i, such as 1 <= i <= 5, compute the minimum of the energy between cores 1 and 2 subject to core 2 being fixed at the position i and the variable x1 being free [namely the quantity g{2}{1} (i)].
  2. For all i, such as 1 <= i <= 5, compute the minimum of the energy of protein knowing that the core 3 is fixed at position i [namely the quantity g{3}{1,2} (i)].

Let us suppose that the result obtained in step 1 is the one obtained in Figure 2Go, where for each of the ñ possible positions of core 2 we represent the position of core 1 which minimizes the energy between cores 1 and 2.



View larger version (7K):
[in this window]
[in a new window]
 
Fig. 2. Results of dynamic programming algorithm.

 
In the second step we will use these results to compute the quantity g{3}{1,2} (i) for 1 <= i <= 5. For example, to evaluate g{3}{1,2} (3) it will be necessary and sufficient to consider configurations given in Figure 2Go, step 2. In step 2 each of the ñ possible positions of the core 2 is considered but only the positions of the core 1 calculated in the previous step are taken into account. In step 2 some configurations will therefore not be tested thanks to the previous step. For example, the configuration x1 = 2, x2 = 2 and x3 = 3 is not considered because we know according to step 1 that the configuration x1 = 1, x2 = 2 and x3 = 3 gives a better result. The total number of the configurations that were examined is thus equal to 30 (15 during the first step plus 15 for the second). In other words, the time complexity is O(ñ2).

We represent in Figure 3aGo and b the execution of the previous algorithm on the graph associated with the 3D structure. In step t the black vertices correspond to the cores of which each ñ possible positions are tested in this step. With dotted lines we denote vertices already treated prior to this step t, namely vertices which were as a dotted line in step t 1 and vertices which were in black and linked to vertices in black or to vertices in dotted lines in step t – 1. The remaining vertices (not yet considered) are depicted in white.



View larger version (3K):
[in this window]
[in a new window]
 
Fig. 3. Graph representation.

 
Let us assume now that there is an edge between cores 1 and 3 of the 3D structure. As an interaction exists between each pair of cores, the computation of the positions of the cores minimizing energy will require us to consider all the possible configurations. In this case, the diagrammatic representation of the algorithm is that represented in Figure 3cGo and the number of configurations that were tested is equal to 35 (all possible configurations), which is higher than in the previous case. In other words, the time complexity is O(ñ3).

We can note the correlation that exists between the structural complexity of the 3D structure and time complexity of the dynamic programming algorithm. Indeed, the graph in Figure 1aGo is structurally less complex than that in Figure 3cGo and time complexity in the first case is lower than that in the second case.

This complexity is crucial information. Hence it is important to be able to determine, to compute and to minimize it. Graphically it is equal to the maximum number of black vertices that appear in the graph during the computation. This computation can be carried out in different ways (order in which the vertices are considered) and we aimed at determining an order of execution that minimizes the maximum number of black vertices that appear in the graph during the execution of the algorithm.

Dynamic programming method

Throughout this paper, the 3D structure will be denoted by a graph G = (VG, EG) where the vertices symbolize the cores and the edges the interaction between the cores (functions gij != 0).

In this section we aim to solve problem (2) using the dynamic programming method that we have informally described previously. In each step t a sub-problem of the original problem [restricted to the set S composed of t variables (t cores)] is solved. In step t + 1 we add one more variable to S and we solve the new associated sub-problem using the results obtained at step t. We denote by X, Y and R three disjoint sets that form a partition of the set VG of the original problem, where the set PX (set of the positions of the elements of X) denotes the set of free variables, PY the parameters, and PR the ignored variables. Note that S = X {cup} Y and R = VGS. So we define

The function gYX(.) can be seen as a function whose the parameters are PY and the local variables are PX. The value gYX(y) is the optimal value of the problem (2) according to PX, where y is an instance of PY, and the variables in PR are ignored.

Let gYX(.) be the array of dimension |Y|, solution of the sub-problem (2) at step t and let b R. The computation of gY'X'(.) at step t+1 is performed as follows:


At step t+1 the black vertices that we have defined in the previous section are the elements of the set Y {cup} {b}, the vertices with dotted lines are the elements of the set S – Y and the white vertices are the elements of the set R'.

The cost of the computation of gY'X'(.) that we denote by |BTTCP(t)| is equal to |Y {cup} {b}|, thus the time complexity of a dynamic programming algorithm to solve the problem (2) is m x maxt[1..m] |BTTCP(t)|.

A dynamic programming approach may turn out to be inefficient if the value maxt[1..m] |BTTCP(t)| is high. Nevertheless, the minimum value (with regard to the order in which the variables are considered) of maxt[1..m] |BTTCP(t)| is tightly linked to the structural complexity of the protein and represents important information that must be taken into account to solve an algorithm. The information, which is exploited through the dynamic programming algorithm, only concerns the structure of the graph and not the value the edges can take. Thus it appears that such an algorithm must be seen as a module that may be incorporated in a more general algorithm like a branch and bound algorithm. Without giving more details (which is beyond the scope of this paper), we wish to compute the quantity min maxt[1..m] |BTTCP(t)| at a node of the decomposition tree of a branch and bound algorithm. It appears crucial to be able to compute this quantity:

The problem of the determination of the order in which the variables must be considered to minimize the time complexity is called TTCP.


    Results
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 Appendix
 References
 
In this section we briefly explain that the determination of a sequential order of the parameters which minimize maxt[1..m] |BTTCP(t)| is an NP-hard problem. We just sketch the proof; full mathematical details are given in the Appendix. Then we propose a polynomial heuristic based on a ‘divide and conquer’ method to solve this problem. Furthermore, we show how our heuristic can be extended to the case where the domain of variables is an arbitrary interval.

TTCP is an NP-hard problem

To understand the proof of the NP-completeness, it is sufficient to understand that our problem looks like the MC problem (Minimum Cut Linear Arrangement problem). [In the proof we will use a modified version of MC (namely MCM) for technical reasons.] Briefly, the MC problem can be formulated as follows: let GMC=(VMC, EMC) be a graph where VMC is the set of vertices and EMC the set of edges. At each step f such that 1 <= f <= |VMC| a vertex (not yet visited) is considered and the date f is given to this vertex (i.e. its ‘open’ date). In a second step, let i be a date such that 1 <= i <= |VMC|, we say that an edge (u, v) is ‘active’ if it connects a vertex u whose open date is f(u) <= i and a vertex v whose open date is f(v) >= i. Given an integer KMC, the MC problem consists in determining a date for each vertex of GMC such that at each step i no more than KMC edges are active. Given a date i, we can represent with dotted lines the edges which have been active before i and are no longer active at date i. Black edges represent the active edges at the date i and white edges represent edges which are not active at the date i and have not been active before. This representation allows to realize that the MC problem looks like the TTCP problem we represent in Figure 3Go. The TTCP problem takes into account active vertices (black vertices) whereas the MC problem takes into account active edges (black edges). Our proof consists in transforming the GMC graph into a graph GTTCP as follows: an edge a = (u, v) is replaced by a vertex A, that is by the edges (u, A) and (A, v). More precisely, as GTTCP must be hamiltonian, a vertex x of GMC is replaced by a set of vertices X which constitutes a clique (c-vertex of type v) and an edge a is already replaced by a c-vertex A named c-vertex of type e. The correspondence between the MC problem and the TTCP problem works in the following way: suppose that at the date i the vertex x of GMC becomes active, at the ‘same’ date we make active first all the vertices of X and second all the vertices of A if A is linked to X. The converse is more difficult to understand but works in the same way owing to technical arguments that can be found in the proof. In mathematical language, we can say that the TTCP problem is closed to the dual of the MC problem except that the graph must be hamiltonian.

Finally, the proof is organized as follows. First, we give a formulation of |BTTCP(t)| using the sequential order (a one-to-one function denoted by in) and a formulation of the problem TTCP. To prove that TTCP is NP-hard we need a set of hamiltonian graphs and we make the proof using the MCM problem. We define a set of hamiltonian graphs by using the definition of Gp and we prove that the MCM problem is NP-hard from the NP-hard Minimum Cut Linear Arrangement problem (MC). We prove that TTCP is NP-hard by setting a correspondence between an instance of MCM and an instance of TTCP whose the graph is in Gp and thus is a hamiltonian graph.

Decomposition of threading problem

We propose now a polynomial algorithm based on a ‘divide and conquer’ method to determine the order in which the variables must be taken into account in order to minimize time complexity.

We modify the definition of gYX(.) in the following way:

Let us assume there exists a partition of set B into Z,X1, ..., Xk such that the sets X1, ..., Xk are independent [i.e. {forall}x Xi,y Xj, i != j, (x, y) {notin}EG]. Problem (2) can then be broken down as follows:

(3)
gYX(.) is described as an atom if it is not decomposed.

Let us illustrate this principle with the example in Figure 5aGo. Black vertices in a circle denote variables in Z at a fixed step. White vertices are vertices which belonged to a set Z in a previous step (set of parameters Y). The dash edges are edges whose two extremities are in Y [computation of rY(.)]. The set of vertices Z={1,4} ‘disconnects’ the graph into two components (X1={2,3} and X2={5}). We apply this process recursively on the graph in Figure 5bGo, and we obtain the graphs in Figures 5e and dGo as the atoms of the decomposition.



View larger version (9K):
[in this window]
[in a new window]
 
Fig. 5. Illustration of decomposition.

 
In this example the function to be minimized can be decomposed as follows:

where g1,42,3(x1, x4) = minC,x2, x3g1,2(x1, x2) + g2,3(x2, x3) + g3,4(x3, x4) + g1,3(x1, x3),
and g1,45 (x1, x4) = minC,x5g4,5(x4, x5) + g1,5(x1, x5) and r1,4(x1, x4) = g1,4(x1, x4)

An optimal solution for the original problem can be determined in the following way: compute the tables g1,42,3(.) and g1,45 (.) (by recursive calls), then merge these two tables by taking into account the values of r1,4(.) to obtain the optimal value. The computation of the minimal value is given by the function Solve in which we do not specify how the decomposition has been obtained in each step. The function Split_min will be detailed further in the paper.


Function Solve (G,Z) Function Merge ((G'i, Zi), Z)
    L = Split_min (G,Z)     Let = ({cup} Zi)–Z
    If L = 0 then X = 0,     and X = ({cup} VG'i)–Z
        gZX(.) = Compute_Atom(G,Z)     For all x = (xj)j such that
    Else let L = ((G'i, Zi))i where G'i is a sub-graph         xj PZ, xj [l..ñ]
        of G and Zi is the parameter set of gZiXi(.)         gZX(x) = minPy,y(r{cup}Zi(x,PY)
        {forall}i, 1 <= i <= |L|, gZiXi (.) = Solve (G'i, Zi)             +{sum}gZiXi (PY))
        gZX(.) = Merge ((G'i, Zi), Z)     where Xi = VG'i–Zi
    End if     return (gZX(.))
    return (gZX(.))

Each call of the function Compute_Atom having a graph G as a parameter computes the value gZX(y) for each position y of the vertices of G.

Note that at each non-terminal step of Solve(G,Z), we must merge the gZiXi(.). The time complexity (the number of write access) of a merge is O(ñ|{cup}Zi|) where Zi defines a set of the ‘boundary’ vertices of each sub-problem. We propose an heuristic approach which is based on this observation and we define the function split_min(G,Z) in order to minimize time complexity.

Decomposition using connectivity

Let G=(VG, EG) be the graph associated with the 3D structure of a protein so that the edges are weighted by gij. A decomposition of G is valid if a partition (Z, X1, ..., Xk) of VG exists such that (2) is equivalent to (3). More precisely, a decomposition is valid if the subsets X1, ..., Xk are independent, i.e. {forall}x Xi,y Xj, i != j, (x, y) ISOdia/EG. In other words, if a and b are two distinct vertices of G, and Z is a set of vertices of G not containing a and b, we say that Z disconnects a and b if after removing Z from G there does not exist a path re-linking a and b. Likewise, Z disconnects G if two distinct vertices, say a and b, of G exist such that Z disconnects a and b. The set Z is called vertex cut of G. If it is of minimal cardinality the set Z is called the connectivity set of G and will be denoted by {kappa}(G). Our problem comes down to finding a connectivity set of G.

Connectivity set of G

{kappa}(G) can be computed in polynomial time from the following theorem of Menger (Berge, 1983Go): a necessary and sufficient condition for a simple graph to be h-connected (the minimum number of vertices to remove in order to disconnect this graph is greater than or equal to h) is that we can link two distinct vertices a and b by h disjoint elementary chains (i.e. each pair of chains will have only the vertices a and b in common). The proof of this theorem leads to the following algorithm: construct a transport network R(G, a, b), with two distinct vertices a and b as source and sink respectively, for which a maximal flow {phi} of value ||{phi}|| between a and b determines ||{phi}|| disjoint elementary chains of G between a and b. R(G, a, b) is a direct graph constructed from G: first replace each edge (x, y) of G by two arcs (x, y) and (y, x). Each vertex x != a, b is duplicated in two vertices x' and x'' linked by an arc (x'x'') of capacity 1. Each arc that goes towards x will be replaced by an arc of infinite capacity going towards x', and each arc going out of x will be replaced by an arc of infinite capacity going out of x''. Add an arc from b to a. It is well known that the value of a maximal flow is equal to the capacity of a minimal split. The capacity of a minimal cut sharing a and b represents the minimum number of vertices that need to be removed from G to disconnect a and b.

This result is used for computing the smallest set of vertices Z X {cup} Y that minimize |Z {cup} Y| and that disconnects the graph associated with gYX. This is carried out at each step of the decomposition of a sub-problem gYX.

The set of parameters Z in the function gZX(.) will be called the boundary or frontier of the associated problem. We can now define the function Split_min that takes a graph G of boundary Z as an input and that returns a set of graphs G'i of boundary Zi, as a result of the decomposition of G. Let {delta}(x) denote the set of neighbors of a vertex x [i.e. (x,{delta}(x)) EG], G(X) denotes the sub-graph generated by the subset of the vertices X. (Notation simplification: if G is a graph and Z a set of vertices, we denote by G–Z the sub-graph of G generated by the set VG–(VG {cap} Z). Moreover, we will note x G to depict a vertex x of G).


Function Split_min (G,Z)
    Cmin = VG
    For each pair (a, b) VG x (VG – Z), b != a do
        GR = R((G – Z) {cup} {a, b},a, b)
        C = Cut_min(GR)
        if |C|<|Cmin| then Cmin = C, as = a, bt = b
    if Cmin != VG (i.e. G is not atomic) then Function Decomposition_min (G,Z)
        CT = max(CT, |Z {cup} Cmin|)     L = Split_min (G,Z)
        Let (Gi = (Vi, Ei))i be the connected     While L != 0 do
        components of ((G – Z) {cup} {as, bt}) – Cmin)         Let (G',Z') L
        Let Xi = {z Z, {delta} (z) {cap} Vi != 0} be the set of         Decomposition_min (G',Z')
        nodes of Z connected to Gi in G         L = L – (G',Z')
        Let Zi = Xi {cup} Cmin, CS = max(CS, maxi|Zi|))
        Let G'i = (Zi {cup} Vi, EG(Z {cup} Cmin))
        L = (G'i, Zi)i
    Else CA = max(CA, |VG|); L = 0
    Return(L)

The function Cut_min(GR) returns a vertex set of G corresponding to the edges of a minimal cut of GR. The variables CT, CS and CA are global variables which are initialized to zero being equal to the maximum value (throughout the decomposition of the graph) of |Z {cup} Cmin|, |Zi| if G is not atomic, and |VG| if G is atomic, respectively. These variables will be useful to express the space and time complexities.

After the computation of Cmin, the connected components G'i are constructed from the graphs Gi (see Figure 6Go). The graphs Gi are connected components of {[(G – Z) {cup} {as, bt}] – Cmin}. The edges, whose two extremities are in Cmin, are taken into account in the Merge step and will not appear in the proper decomposition.



View larger version (11K):
[in this window]
[in a new window]
 
Fig. 6. Illustration of the Split_min function.

 
It is easy to show that the set C = Z {cup} Cmin is a vertex cut of G and that this set defines a vertex cut C of G that minimizes |Z {cup} C|.

Complexity of the decomposition

The function Split_min will be called recursively on the set of sub-graphs obtained by the decomposition. In this way, we obtain a graph which is hierarchically decomposed and which isolates the complex parts of the initial graph (see Figure 5Go as an example). We will now prove that our decomposition is polynomial in the following theorem.

Theorem 1. The number of recursive calls of Decomposition_ min(G,Z) is bounded by |VG|2

Proof. Let m denote the number of vertices of the initial graph G associated with the 3D structure. Let |V'G| be the number of vertices of a graph G' on which the function Decomposition_min will be applied during a recursive call. Let t(G,Z) be the number of recursive calls made by Decomposition_min (G,Z). We will prove this theorem by induction. Then, we will show that t(G,Z) <= m(|VG| – |Z|). Indeed, if G is atomic then t(G,Z) = 0, otherwise Split_min(G, Z) = ((G'i, Zi))1 <= i <= k. The set of vertices VG'iZi contains the vertices of G not belonging

to Z {cup} Cmin. Therefore, we have Then

As the graph G is not atomic, Cmin >= 1. Moreover, the number of connected components k cannot be greater than the number of the vertices of the graph, so k <= mCmin. This leads to the conclusion t(G, Z) <= m|VG – Z|.{blacksquare}

Time complexity of Decomposition_min. The number of recursive calls of the function Decomposition_min is overestimated by m2 and the maximum flow algorithm of Ford–Fulkerson is of time complexity O(m|EG|2). Hence the complexity of Decomposition_min is O(m5|EG|2).

Note that this complexity remains high but this is only an overestimation. Moreover, we may strongly expect to improve it in practice by observing the following: (1) to determine the connectivity of a graph it is unnecessary to compute a maximum flow for all the pairs of vertices as only m|{kappa}(G)| pairs are sufficient to scan; (2) we can re-use the result of the computation of a maximum flow at a given decomposition step in one of its recursive calls without having to re-compute it; (3) we can also expect to reduce this bound by adapting the efficient algorithm for connectivity computation (Henzinger et al., 1996Go) where the complexity is |{kappa}(G)||EG||VG|log|VG|.

Our decomposition differs from that of Xu et al. (Xu et al., 1998Go) in several respects, particularly: (1) our algorithm is polynomial and could be used for example in branch and bound method; and (2) we do not try to divide the problem by determining a minimal edge cut on G (i.e. by acting on the edges of the graph), but by determining disconnection sets (i.e. by acting on the vertices of the graph).

Time complexity of Solve. Note that the number of recursive calls of Solve is equal to the number of recursive calls of Decomposition_min. Moreover, the cost (number of write access memory) of a merging step is at most nCT, the cost of a Compute_Atom is at most nCA and the time complexity of Solve is then in O(m2(nCT+nCA)).

Space complexity of Solve. In each step k of Solve, the computation of a merging step needs to have in memory the tables gZiXi(.) of dimension |Zi|. That is overestimated by mnmax(CS,CA). After step k these tables are not used and will not be kept in memory. The complexity of Solve is then O(mnmax(CS,CA)).

We observe that as m is smaller than n and that the expressions of space and time complexities proposed are upper bounds then the time complexity depends essentially on max(CT, CA) and the space complexity on max(CS, CA). We will call these two coefficients index of time complexity and index of space complexity, respectively. These two indices are relevant to the intrinsic complexity of the 3D structure.

Extension to a weighed graph. We can extend this decomposition to a graph the vertices of which are weighted by the function p, such that for each vertex x, p(x) is an interval of values corresponding to the set of positions that the core x can take. We can apply the same decomposition algorithm as previously by modifying it as follows: during the construction of the network GR, the capacity on the arc between x' and x'' obtained by duplicating a vertex x is no longer equal to 1 but will be equal to the logarithm of the size of the interval associated with the vertex x, i.e. log|p(x)|. Then the problem comes down to finding a cut of minimal weight in GR (the sum of the weights). The Ford–Fulkerson method determines a maximal flow and saturates all the edges of a minimal cut. The minimal cut found by the Ford–Fulkerson algorithm determines a disconnected set of minimal weight.


    Discussion
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 Appendix
 References
 
We have shown that minimizing the time complexity of an algorithm based on a dynamic programming method is an NP-hard problem. We have proposed a new polynomial decomposition algorithm based on the connectivity of the graph induced by the 3D structure of a protein that allows the evaluation of its intrinsic complexity.

A decomposition process of the associated mathematical formulation induces our vertex cut decomposition. It recursively divides the graph associated with the 3D structure by computing vertex cuts in order to minimize the time complexity. Moreover, we have proven that our algorithm is polynomial. This distinguishes our method from the ‘divide and conquer’ algorithm of Xu et al. (Xu et al., 1998Go). It permits the isolation of structurally complex sub-graphs and provides two indices of complexity (space and time). We have evaluated an upper bound of the time complexity O(m5|EG|2) of our decomposition algorithm. This bound remains high but we may hope to reduce it in practice if necessary (see the remarks earlier). Moreover, if the indices of time and space complexity provided by our algorithm are high, the solution of the problem by a ‘divide and conquer’ approach (Solve algorithm) can be proved very expensive and inefficient. Our decomposition algorithm permits us to evaluate the structural complexity of the threading problem rather than to solve it.

Note that the information exploited in our decomposition only concerns the structure of the graph and not the values that the edges take. More precisely, if the graph G is a clique, our algorithm will make no decomposition. We therefore have shown that our algorithm could be extended to a weighed graph if it is necessary to solve the problem on a sub-space of the search space. Consequently, it can be incorporated into an algorithm based on a branch and bound method so that structural and quantitative information is taken into account simultaneously.


    Appendix
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 Appendix
 References
 
We give in the section all the details of the proof of the NP-completness of the TTCP problem.

Formulation of the problem TTCP

At each step t, the variables St are defined from St 1 plus a new variable b not yet considered. Let G be the associated graph with the 3D structure and Yt – 1 a subset of the vertices of G. A vertex (a variable) y belongs to Yt 1 if at step t – 1 there exists an edge whose one extremity is y and the other is in Rt – 1. Let in be a one-to-one function, in: VG->[1, ..., |VG|] such that in(y) = t if y is incorporated into S at step t. Then y is in Yt – 1 if there exists z Rt – 1 such that (y, z) is an edge of the graph, i.e. if z neighbors of y such that in(y) <= t < in(z). Thus, at step t, the set Yt – 1 {cup} {b} is equal to BTTCP(t) = {x VG, y {Gamma}x {cup} {x}, in(x) <= i <= in(y)}, where {Gamma}x is the set of the neighbor of x.

Note that the graph G is necessarily hamiltonian (i.e. there exists a path in which each vertex appears one and only one time) as there always exists an edge between two consecutive cores. TTCP can be formulated as follows:

TTCP: instance: Hamiltonian graph GTTCP, positive integer KTTCP.

Question: Is there a one-to-one function in: VG->{1, 2, ..., |VG|} such that for all i, 1 <= i <= |VG|, |BTTCP(i) = {x VG, y {Gamma}x {cup} {x}, in(x) <= i <= in(y)}| <= KTTCP?

Definition and preliminary properties

Let G = (VG, EG) be a connected graph. A c-partition of G is a partition P = {X1,..Xn} of VG such that for all i, the sub-graph whose vertices are in Xi is a clique (all pairs of vertices are linked) and such that either the sub-graph whose vertices are Xi {cup} Xj is a clique, or there does not exist an edge between a vertex of Xi and a vertex of Xj. Xi is called c-vertex of (G, P) and (Xi, Xj) is a c-edge of (G, P) if the sub-graph whose the vertices are Xi {cup} Xj is a clique. Xj is then a c-neighbor of Xi. Let w(Xi) denote the weight of Xi equal to the cardinality of Xi. The c-degree of Xi is the number of c-edges incident to Xi.

Gp: Gp is a set of pairs (G, P) where G is a connected graph with at least two vertices and P a c-partition such that (G, P) is formed from a graph Ginit as follows: each vertex x of Ginit is replaced by a c-vertex X of weight w(X) greater than or equal to the degree of x. If (x, y) is an edge of Ginit then (X, Y) is a c-edge of (G, P) (see Figure 4Go as an example).



View larger version (12K):
[in this window]
[in a new window]
 
Fig. 4. Construction of Gp.

 
Let X be a c-vertex of (G, P) and x a new vertex, let include (X, x) denote the following procedure: X <- X {cup} {x} and x is linked to all vertices of X and all the vertices of all the c-neighbors of X.

We define GIp as follows:

GIp: GIp is the set of pairs (G, P) where G is a graph and P a c-partition of G defined by:

  1. K2 (clique with two vertices x, y) and the c-partition P = {{x},{y}} belongs to GIp;
  2. let (G, P) GIp, the new pair (G, P) obtained after the execution of the following procedure, belongs to GIp:
    1. increasing the weight of a vertex: let x be a new vertex and X be a c-vertex of (G, P), then include(X, x);
    2. adding a c-edge: let x and y be two new vertices, X and Y two c-vertices of (G, P), then include(X, x); include (Y, y); and (X, Y) becomes a c-edge of (G, P);
    3. adding a new c-vertex: let x, y be two new vertices, X a c-vertex of (G, P), then include(X, x); {y} is a new c-vertex of (G, P); (X, {y}) is a new c-edge of (G, P).

Lemma 1. If (G, P) belongs to Gp then (G, P) is in GIp

Proof. It is easy to make the proof by induction.{blacksquare}

Lemma 2. If (G, P) belongs to Gp, then there exists a hamiltonian path in G.

Proof. If (G, P) belongs to Gp, then (see lemma 1), (G, P) belongs to GIp. We show by induction that there exists a hamiltonian path in any graph that belongs to GIp. Suppose that there exists a hamiltonian path h in G GIp and that we apply procedure (a), i.e. we increase the c-vertex X with a vertex x. Let s0 be the first vertex of X belonging to the hamiltonian path and a its predecessor, h = h1, h2, ..., a, s0, ..., hn. After applying procedure (a), the path h' = h1, h2, ..., a, x, s0, ..., hn is a hamiltonian path.

We proceed in the same way if we apply procedure (b).

If procedure (c) is applied, let s0 be the first vertex of X appearing in the hamiltonian path and a its predecessor, h = h1, h2, ..., a, s0, ..., hn. After applying procedure (c), the path: h' = h1, h2, ..., a, x, y, s0, ..., hn is a hamiltonian path.{blacksquare} To prove that TTCP is NP-hard, we will use a modified version (MCM) of the Minimum Cut Linear Arrangement (denoted by MC) problem that is NP-hard (Garey and Johnson, 1979Go). The Minimum Cut Linear Arrangement problem can be formulated as follows:

MC: instance: graph G = (VG, EG), positive integer KMC.

Question: is there a one-to-one function f: VG -> {1, 2, ..., |VG|} such that for all i, 1 < i < |VG|, |BMC(i) = {(u,v) EG : f(u) <= i <= f(v)}| <= KMC?

Also, we will show that MCM (used to prove that our problem is NP-hard) is an NP-hard problem.

MCM: instance: graph G = (VG, EG) connected, positive integer KMCM.

Question: is there a one-to-one function f': VG -> {1, 2, ..., |VG|} such that for all i, 1 <= i <= |VG|, |BMCM(i) = {(u,v) EG : f(u) <= i < f(v)}| <= KMCM?

Lemma 3. MCM is NP-hard

Proof. It is obvious that MCM is in NP and it easy to check that MC remains NP-hard if the graph is connected. Let us show that MCM remains NP-hard if i [1..|VG|] and no longer i ]1..|VG|[. Let GMC = (VMC, EMC) and KMC an instance of MC problem. We build an instance of MCM problem: GMC = GMCM = (VMCM, EMCM) and KMC = KMCM. Let us now demonstrate that if there exists f that satisfies MC then there exists f' that satisfies MCM. Given such a function f, we define the function f'in the following way:

If |BMC (1)| <= |BMC(2)| then {forall}i [1...,|VMCM|], f'(i) = f(i).

Otherwise, let x = f(1) and y = f(2). In this case we can show that (x, y) EMC. We define f' such that f'(1) = y, f'(2) = x and {forall}i [3...|VMCM|], f'(i) = f(i).

We can easily check that if f meets MC and if G is a connected graph with at least three vertices, then f' satisfies MCM.

Conversely, it is obvious that if there exists f' that satisfies MCM, then f = f' satisfies MC.{blacksquare} TTCP is an NP-hard problem. We will show first that TTCP is in NP. Indeed, given a graph GTTCP = (VTTCP, ETTCP) and given a function in, for all i [1...|VTTCP|], for all vertex x of the graph we visit its neighbors. |BTTCP(i)| can be computed in O(|VTTCP|2) and then we can check that in is a solution with O(|VTTCP|3) units of time.

We now demonstrate that TTCP is NP-hard.

Graph construction. Given a graph GMCM = (VMCM, EMCM) and a constant KMCM, we define a graph GTTCP = (VTTCP, ETTCP) and its associated c-partition P and a constant KTTCP as follows.

Let D be the maximum degree of the vertices of GMCM. If x is a vertex of GMCM, then X is a c-vertex of GTTCP with a weight equal to w(X) = QV with QV = max(D, 2KMCM + 1). X and any vertex that belongs to X are said to be of type v (i.e. they come from a vertex of GMCM). We say that X is the c-vertex associated with the vertex x.

If (x,y) is an edge of GMCM, then A(x,y) is a c-vertex of GTTCP with a weight equal to w(A(x,y)) = 2. Any vertex that belongs to A(x,y) is said to be of type e (i.e. it comes from an edge of GMCM). Furthermore, if X and Y are the c-vertices built from x and y, respectively, then (X, A(x,y)) and (Y, A(x,y)) become c-edges of GTTCP.

The pair (GTTCP, P) that is constructed this way belongs to the set Gp. Then, there exists a hamiltonian path in GTTCP (see lemma 2).

The bound KTTCP is fixed to QV + 2KMCM.

Equivalence. Let f be a one-to-one function such that f satisfies MCM that is for each i in [1, ..., |VMCM|], |BMCM(i)| <= KMCM. We will define a function in such that TTCP is satisfied: {forall}i [1, ..., |VTTCP|]|BTTCP(i)| <= KTTCP. We define first a function f': VTTCP -> N2 (with N2 sorted in the lexicographic order) and the function in is then defined as in = nof', where n is a one-to-one function that takes value in [1, ..., |VTTCP|] such that n respects the order.

Definition of f'. Step 1: Let x VMCM such that f(x) = 1. Let X be the c-vertex of type v associated with x and let s be the kth element of X, then f'(s) = (1, k). Let V+X be the set of vertices of type e connected to the vertices of X. Let a V+X then f'(a) = (1, k') with QV < k'.

Step n: Let x VMCM such that f(x) = n. Let X be a c-vertex of type v associated with x and let s be the kth element of X, then f'(s) = (n, k). Let V+X be the set of vertices of type e connected to each vertex of X and to each vertex of Y where Y is a c-vertex not yet considered. Let a V+X then f'(a) = (n, k') with QV < k'.

We have to show that {forall}i [1, ..., |VMCM|] |BMCM(i)| <= KMCMSSSS {forall}i [1, ..., |VTTCP|] |BTTCP(i)| <= KTTCP. To prove that we just have to show that {forall}i, j 2|BMCM(i)|+Qv >= |BTTCP(n((i, j))|. Indeed, the definition of f' implies that BTTCP(n((i, j))) never contains more than one c-vertex that is Qv vertices of type v. Furthermore, if one or the two vertices of A(x,y) of type e belongs to BTTCP(n((i, j))) then the edge (x, y) belongs to BTTCP(n((i, j))) never contains more than 2|BMCM(i)| vertices of type e. This leads to the following result: |BTTCP (n((i, j))| <= 2|BMCM(i)|+Qv <= KTTCP.

Conversely, we have to show that {forall}i [1, ..., |VTTCP|] |BTTCP(i)| <= KTTCP SSSS {forall}i [1, ..., |VMCM|] |BMCM(i)| <= KMCM. Let in be a one-to-one function such that TTCP is satisfied. Let x be a vertex of VTTCP of type v. We say that x is out at step t and we denote out(x) = t if t is the shortest integer such that: in(x) <= t and for all neighbor y of x, in(y) <= t. We say that x is open at step t if in(x) <= t <= out(x). We define also out(X) =maxxXout(x).

Then we define a one-to-one function f for the MCM problem: f(x) = n(out(X)), where X is the c-vertex associated with x, and n a one-to-one function which takes value in [1,...,|VMCM|] and which respects the order.

First we show that f defined in such a way is a one-to-one function. Indeed, if out(X) = out(Y) = t, then at step t, each of the vertices of X and Y are in, then at least |X| + |Y| vertices are open. This implies that |BTTCP(t)| >= |X| + |Y|. This is impossible because |X| + |Y| = 2Qv > KTTCP.

Second, to prove the result we only have to show that 2BMCM(t) + Qv <= BTTCP(n–1(t)). Let us show that if an edge (x, y) belongs to BMCM(t) then the two vertices a(x,y) of A(x, y) belong to BTTCP(n–1(t)). Indeed, let X and Y the c-vertices associated with x and y, respectively, if (x, y) BMCM(t) then f(x) <= t < f(y), i.e. out(X) <= n–1(t) < out (Y). If X is out at step lower or equal to n–1(t) then in(a(x,y)) <= n–1(t). Furthermore, there exists a c-vertex Z of GTTCP such that n–1(t) = out(Z). Taking into consideration that out(Z) < out(Y) then Z != Y. At step out(Z), at least all vertices of Z are open. As |Z|+|Y| > KTTCP, then all vertices of Y can not be open and therefore a(x,y) cannot be out. Hence there exists a neighbor w of a(x,y) such that n–1(t) < in(w), and consequently a(x,y) BTTCP(n–1(t)). If na is the number of vertices of type e which belongs to BTTCP(n–1(t)), then 2|BMCM(t)| <= na.

Before concluding, we note that at each step n–1(t), a c-vertex X becomes out and it is necessary that all the vertices of X be open. Consequently, at least Qv vertices of type v are open at step n–1(t). Hence na + Qv <= KTTCP, then 2|BMCM(t)| <= na <= KTTCPQv = 2KMCM.{blacksquare}


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 Appendix
 References
 
Berge,C. (1983) Les Graphes. Gauthier Villars, Paris.

Bryant,S. and Lawrence,C. (1993) Proteins: Struct. Funct. Genet., 16, 9–112.

Garey,M.R. and Johnson,D.S. (1979) Computers and Intractability, a Guide to the Theory of NP-Completeness. Freeman, San Francisco.

Henzinger,M.R., Rao,S. and Gabow,H.N. (1996) In Proceedings of the 37th Annual IEEE Symposium on Foundations of Computer Science (FOCS 1996). pp. 462–471.

Lathrop,R.H. (1994) Protein Eng., 7, 1059–1068.[Abstract]

Lathrop,R.H. (1999) J. Comput. Biol., 6, 405–418.[CrossRef][ISI][Medline]

Lathrop,R.H. and Smith,T.F. (1996) J. Mol. Biol., 225, 641–665.[CrossRef]

Madej,T., Gibrat,J. and Bryant,S. (1995) Proteins, 23, 356–369.[ISI][Medline]

Xu, Y. and Xu,D. (2000) Proteins: Struct. Funct. Genet., 40, 343–354.[CrossRef][ISI][Medline]

Xu,Y., Xu,D. and Uberbacher,E.C. (1998) J. Comput. Biol., 5, 597–614.[ISI][Medline]

Yadgari,J., Amir,A. and Unger,R. (1998) In: Proc. Int. Con. on Intelligent Systems for Molecular Biology, TSMB 98, 193–202, AAAI Press.

Received June 14, 2002; revised October 31, 2002; accepted December 3, 2002.





This Article
Abstract
FREE Full Text (PDF)
Alert me when this article is cited
Alert me if a correction is posted
Services
Email this article to a friend
Similar articles in this journal
Similar articles in ISI Web of Science
Similar articles in PubMed
Alert me to new issues of the journal
Add to My Personal Archive
Download to citation manager
Request Permissions
Google Scholar
Articles by Calland, P.-Y.
PubMed
PubMed Citation
Articles by Calland, P.-Y.