Institute for Bioinformatics, GSF-National Research Center for Environment and Health, Ingolstädter Landstrasse 1, 85764 Neuherberg, Germany
E-mail: d.frishman{at}gsf.de
Abstract |
---|
Keywords: fold recognition/genome analysis/sequence clustering/structural genomics
Introduction |
---|
One particular aspect of structural genomics involves the systematic structural exploration of complete proteomes. Each genome of a free-living organism codes for a complete set of functions and hence corresponding protein structures necessary to support cellular life. Statistically, structural tendencies in complete genomes, such as the fraction of residues in α-helical and β-sheet conformations, are well conserved between different species, but differ significantly from the observed distribution in the current collection of known protein structures (Frishman and Mewes, 1997a). This observation led us to suggest that determining the complete set of structures encoded in a small model organism would be of great value for structural biology and would have the potential to provide us quickly with a more objective view of the diversity of protein folds. Structural knowledge can also help to decipher the function of the majority of proteins in each genome that cannot be characterized through the application of standard bioinformatics approaches, such as similarity searches (Kim, 2000). In particular, genome-wide structure determination is the most direct way to address the problem of genomic 'ORFans', i.e. proteins without known function occurring in only one organism (Fischer and Eisenberg, 1999). The benefits of structural genomics on microbes are especially evident with pathogens and with microorganisms adapted to extreme environments because of the immediate relevance for medicine and biotechnology, respectively (Terwilliger et al., 1998). Efforts to obtain complete structural complements of several microbial species are now under way [Mycobacterium tuberculosis (http://www.doe-mbi.ucla.edu/TB/index.html), Pyrobaculum aerophilum (Mallick et al., 2000), Haemophilus influenzae (http://s2f.umbi.umd.edu), Methanococcus jannaschii (http://sb3.lbl.gov/genomics/proteinlist.html), Methanobacterium thermoautotrophicum (http://nmr.oci.utoronto.ca/arrowsmith/proteomics/index.html)]. In Japan, the Structurome Project (Yokoyama et al., 2000) pursues the determination of all structures from the thermophilic eubacterium Thermus thermophilus (http://www.rsgi.riken.go.jp/). This genome was selected for a large-scale protein structure study because of its compactness, thermostability, presumed ease of crystallization and the availability of genetic tools for further functional assays. Projects of this type are already beginning to bear fruit. For example, Hwang et al. assigned a function to a previously uncharacterized gene product of M.jannaschii by means of the crystallographic analysis of its three-dimensional structure (Hwang et al., 1999).
At a given state of the technology to determine structures, selection of the most economical set of targets is a major cost-saving factor in any experimental structural genomics project. The principal requirement of any such target list is that it must reveal the minimal collection of gene products that possess all structural domains with as yet unknown folds in the entire data set under study. Given the high abundance of duplication modules, both on the level of whole genes and of parts of genes, intrinsic to all complete genomes, a crucial step in creating the list of putative structural targets involves grouping together proteins sharing similar sequence segments. This is typically achieved through single-linkage clustering of amino acid sequences based on pairwise similarity comparisons. Owing to the well known phenomenon of domain chaining, totally unrelated protein sequences may end up in the same cluster. Sophisticated approaches have been developed to partition single-linkage clusters further into groups of proteins that are guaranteed to share sequence similarity (Sonnhammer and Kahn, 1994; Koonin et al., 1996; Park and Teichmann, 1998; Matsuda et al., 1999; Yona et al., 1999; Enright and Ouzounis, 2000). However, in many cases joint mosaic occurrence of multiple conserved protein modules (Bork et al., 1997) gives rise to very large sequence groups with a complex structure of inter-sequence similarity relationships. Partitioning the corresponding single-linkage clusters into single domain clusters may represent a significant algorithmic challenge.
It should be noted that the sequence clustering tools mentioned above were developed with the purpose of studying the family relationships between proteins for better functional, structural and evolutionary inferences. While this information is also invaluable in the context of target selection for structural genomics, the immediate technical objective here is much more limited: we want to exclude protein domains that either do not belong to our targeted class (e.g. transmembrane proteins if we are interested in soluble proteins) or already have been structurally characterized. In this paper, we argue that the computational complexity of the target selection process can be significantly reduced if the knowledge about predicted structural features and other relevant protein properties is directly incorporated into the clustering procedure. It is sufficient to perform the simple step of initial single-linkage clustering. After the application of a number of filtering criteria, many of these clusters will be excluded from consideration because all sequences constituting them have been discarded. In some other cases, single-linkage clusters will be reduced to just one candidate sequence. Finally, the remaining clusters will include sequences all of which are potential structural targets. The complicated procedure of resolving the domain structure of single-linkage clusters thus becomes obsolete. An important prerequisite of this approach is the availability of a comprehensive annotated database of completely sequenced genomes.
Materials and methods |
---|
In this work we considered completely sequenced genomes of 25 eubacterial and seven archaebacterial species (Table I). Exhaustive automatic annotation of these genomes was conducted using the PEDANT genome analysis suite (Frishman and Mewes, 1997b; Frishman et al., 2001) and can be accessed through the PEDANT genome database (http://pedant.gsf.de).
Structural categorization of gene products involves a highly sensitive comparison of each gene product with the SCOP database of known structural domains (Brenner et al., 2000; Lo Conte et al., 2000) and the sequences of proteins with known three-dimensional structure (Berman et al., 2000) using the novel IMPALA software (Schaffer et al., 1999). This program allows one to compare a query protein sequence with a collection of position-specific scoring matrices generated by PSI-BLAST and is thus well suited to similarity-based fold recognition (Wolf et al., 1999). Our current approach to genomic fold recognition involves the following steps: (i) create a complete non-redundant protein sequence database, (ii) run a PSI-BLAST search with 10 iterations with each SCOP domain or PDB sequence against the non-redundant protein sequence database and save the resulting profiles, (iii) construct a SCOP or PDB profile library using the IMPALA software suite and (iv) run an IMPALA search with each genomic sequence against the SCOP or PDB library. Additionally, for each genomic sequence a number of structural features are predicted, including secondary structure (Frishman and Argos, 1997), low-complexity regions (Wootton and Federhen, 1993), membrane regions (Klein et al., 1985), coiled coils (Lupas et al., 1991) and signal peptides (Nielsen et al., 1997).
Single-linkage clustering of complete genomic protein complements
An all-against-all comparison of proteins within each genome was performed using PSI-BLAST, with low-complexity sequence regions masked. Sequences possessing a sufficient degree of similarity in a reciprocal fashion (BLAST similarity score >45 bits) were joined into single-linkage groups. In cases where reciprocal BLAST comparisons produced only one local alignment between two sequences in each direction, this hit was made symmetrical by taking into account only the longer alignment. Optionally, it is also possible to take into account results of sensitive recognition of PFAM domains (Bateman et al., 2000) through HMMER searches (Eddy, 1998). If two or more proteins in a genome display similarity to the same PFAM domain with a significant E value (typically 0.001), it may be safely assumed that the corresponding protein sequence spans are similar to each other, even if BLAST fails to recognize such relationships.
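The reciprocal-hit rule described above can be illustrated with a short union-find sketch. This is not the PEDANT implementation: the input format (a dictionary mapping ordered query/subject pairs to bit scores) and the function name are assumptions made for the example, and only the 45-bit cutoff is taken from the text.

```python
from collections import defaultdict

BIT_SCORE_THRESHOLD = 45.0  # reciprocal cutoff quoted in the text


def single_linkage_clusters(hits, threshold=BIT_SCORE_THRESHOLD):
    """Group proteins joined by reciprocal hits scoring above `threshold`.

    `hits` maps an ordered (query, subject) pair to its best bit score
    (hypothetical input format). Returns (clusters, singlets).
    """
    parent = {}

    def find(x):
        # Register unseen proteins and follow parent links to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (query, subject), score in hits.items():
        root_q, root_s = find(query), find(subject)
        reverse = hits.get((subject, query), 0.0)
        # Join only if the hit is significant in both directions.
        if query != subject and score > threshold and reverse > threshold:
            parent[root_q] = root_s

    groups = defaultdict(set)
    for protein in list(parent):
        groups[find(protein)].add(protein)
    clusters = [members for members in groups.values() if len(members) > 1]
    singlets = {next(iter(m)) for m in groups.values() if len(m) == 1}
    return clusters, singlets


# Toy data: A-B and B-C are reciprocal; A-D is significant in one direction only.
hits = {
    ("A", "B"): 60.0, ("B", "A"): 58.0,
    ("B", "C"): 50.0, ("C", "B"): 47.0,
    ("A", "D"): 80.0, ("D", "A"): 30.0,
}
clusters, singlets = single_linkage_clusters(hits)
```

In the toy data A and C are never directly similar, yet they share a cluster through B, which is the chaining behaviour that single-linkage clustering is known for.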
The lists of clustered sequences for all identified clusters in 32 completely sequenced genomes are available through the PEDANT Web site (see category sequence clusters).
Algorithm for target selection
The flow chart of our algorithm, which we dub STRUDEL (STRUcture DEtermination Logic), is presented in Figure 1, using the genome of Escherichia coli as an example. The analysis of each protein complement begins with single-linkage clustering. As a result, all sequences are partitioned into two sets: singlets, i.e. sequences without homology to other gene products of the genome considered and hence not participating in any cluster, and clustered sequences. Both sets are subjected to filtering according to user-specified criteria. Throughout this work we excluded from further consideration predicted transmembrane proteins and sequences with known three-dimensional structure. More specifically, sequences were filtered out if (i) they had more than one predicted transmembrane region or (ii) the maximum length of a sequence span not covered by IMPALA similarity hits to proteins of known structure was below a certain threshold (denoted 3D_UNCOVERED; Figure 2), reflecting the expected length of a structural domain. Singlets remaining after the filtering are considered structure determination targets. The entire pool of sequences possessing paralogs is re-clustered and a number of newly created singlets are attributed to the structural target set.
The default values of SEQ_UNCOVERED and 3D_UNCOVERED used in this work are both 100 amino acid residues, corresponding to the expected favorable size of a structural domain (Xu and Nussinov, 1998). The influence of these parameters on the results of target selection is detailed below.
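As an illustration of the two filtering criteria stated for this work (number of predicted transmembrane regions and the 3D_UNCOVERED threshold), the following minimal sketch computes the longest stretch of a sequence not covered by structural hits. The function names, the 1-based inclusive coordinate convention and the argument layout are assumptions made for the example; SEQ_UNCOVERED is not modelled here.

```python
def longest_uncovered_span(length, hits):
    """Longest run of residues not covered by any hit.

    `hits` is a list of (start, end) pairs, 1-based and inclusive,
    assumed to lie within the sequence of the given `length`.
    """
    longest = 0
    next_free = 1  # first residue not yet covered by a processed hit
    for start, end in sorted(hits):
        if start > next_free:
            longest = max(longest, start - next_free)
        next_free = max(next_free, end + 1)
    if next_free <= length:
        longest = max(longest, length - next_free + 1)
    return longest


def passes_filter(length, tm_regions, structure_hits,
                  uncovered_3d=100, max_tm_regions=1):
    """Keep a sequence unless it has more than `max_tm_regions` predicted
    transmembrane regions or no structurally uncovered span of at least
    `uncovered_3d` residues (defaults follow the values quoted in the text).
    """
    if tm_regions > max_tm_regions:
        return False
    return longest_uncovered_span(length, structure_hits) >= uncovered_3d


# A 300-residue protein with two overlapping structural hits:
# residues 181-300 (120 residues) remain uncovered.
span = longest_uncovered_span(300, [(50, 120), (100, 180)])
```

With the default threshold of 100 residues this protein would be retained, whereas a 150-residue protein whose first 100 residues match a known structure would be discarded.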
The particular succession of steps shown in Figure 1 is not mandatory and is mainly dictated by simple practical considerations. For example, it would be possible to do the preliminary sequence filtering first and then cluster the remaining proteins. However, clustering requires much more time than filtering and it is much more likely that the user of the system will want to change filtering parameters than clustering parameters. Hence it is sensible to do the clustering of the complete set of proteins only once and save the result in the PEDANT relational database. Subsequent steps of the analysis can be quickly performed according to user-specified filtering conditions.
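The benefit of clustering once and filtering many times can be sketched as follows: the pairwise links saved from the initial clustering are reused, and only the connected components among the sequences that survive a given filter are recomputed. The data layout (a list of identifier pairs and a set of surviving identifiers) and the function name are assumptions for the example, not the PEDANT database schema.

```python
from collections import defaultdict


def recluster(links, survivors):
    """Recompute clusters among `survivors` from stored pairwise links.

    `links` is an iterable of (a, b) identifier pairs saved from the
    initial single-linkage clustering; `survivors` is the set of
    identifiers that passed the current filter.
    """
    neighbours = defaultdict(set)
    for a, b in links:
        if a in survivors and b in survivors:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen = set()
    clusters, singlets = [], []
    for start in sorted(survivors):
        if start in seen:
            continue
        # Depth-first traversal of one connected component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours.get(node, set()) - component)
        seen |= component
        if len(component) == 1:
            singlets.append(start)
        else:
            clusters.append(component)
    return clusters, singlets


# p2 bridged p1 and p3 in the original cluster; once p2 is filtered out,
# p1 and p3 become singlets, while p4 and p5 stay clustered.
links = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
new_clusters, new_singlets = recluster(links, {"p1", "p3", "p4", "p5"})
```

Changing the filter only changes the `survivors` set, so the expensive all-against-all comparison never has to be repeated.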
Graphical representation of single-linkage clusters
Owing to the multi-domain composition of many proteins, single-linkage clusters often include sequences that are totally unrelated to each other. In order to facilitate the analysis of the resulting groups we have implemented a visual representation of single-linkage clusters, hereafter referred to as a circlegram (Figure 2). A circlegram may include any number of concentric circles, each for a certain protein feature. In this work, the innermost circle on such graphics schematically represents polypeptide chains as black sectors, with the N- to C-terminal direction corresponding to the clockwise direction on the circlegram. On the next circle of larger radius, IMPALA similarity hits to proteins of known structure are depicted as brown sectors. Finally, the outermost circle indicates the location of predicted transmembrane regions in blue. Owing to the small scale of the graphics and relatively low resolution, several features, e.g. transmembrane domains, may be lumped into one contiguous sector. BLAST similarity hits between the proteins constituting the cluster are shown as stripes originating from respective black sectors, with boundaries corresponding to the start and end positions of local alignments. The stripes are colored according to the BLAST similarity scores (see the color key at the bottom of each picture). Similarity relationships between proteins based on the presence of PFAM domains (see above) may be additionally shown as stripes of a single color, irrespective of the E values of the underlying PFAM search hits. Circlegrams represent a convenient means of displaying any number of sequence-related structural and functional features together with intra-protein similarity relationships. They are similar in spirit to the circular depiction of correlated structural features within one protein sequence developed by Pazos et al. (Pazos et al., 1997) and complementary to the linear diagrams of domain similarity implemented by Storm and Sonnhammer (Storm and Sonnhammer, 2001).
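The geometry of such a diagram can be sketched in a few lines. The layout rule used here (sector width proportional to sequence length, a fixed angular gap between neighbouring proteins) is an assumption made for illustration; the text does not specify how sector sizes are chosen.

```python
def sector_layout(lengths, gap_deg=4.0):
    """Assign each protein a clockwise angular sector, in degrees.

    `lengths` maps protein identifiers to sequence lengths; sectors are
    laid out in insertion order, proportional to length (assumed rule),
    each followed by a gap of `gap_deg` degrees.
    """
    total = sum(lengths.values())
    usable = 360.0 - gap_deg * len(lengths)
    layout, angle = {}, 0.0
    for protein, n in lengths.items():
        sweep = usable * n / total
        layout[protein] = (angle, angle + sweep)
        angle += sweep + gap_deg
    return layout


def position_angle(layout, lengths, protein, position):
    """Angle of a sequence position (0 = N-terminus, length = C-terminus)."""
    start, end = layout[protein]
    return start + (end - start) * position / lengths[protein]


lengths = {"a": 100, "b": 300}
layout = sector_layout(lengths)
midpoint_b = position_angle(layout, lengths, "b", 150)
```

The boundaries of a similarity stripe or of a feature sector on an outer circle would then be obtained by converting the start and end positions of the alignment or feature with `position_angle`.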
Results and discussion |
---|
The main distinctive feature of our method is the direct incorporation of predicted protein structural features into the clustering procedure. This approach allows one to discard a large number of gene products at early stages of the target selection process and radically reduce the complexity of the resulting single-linkage clusters. Figure 3 provides an example of a single-linkage cluster from Aeropyrum pernix which is collapsed to a singlet if the standard settings described in the Materials and methods are applied. All sequences forming the cluster possess the thioredoxin domain with known tertiary structure. Two genes, gi_5106134 and gi_5106201, code for apparently single domain proteins and are discarded since they are almost completely covered by three-dimensional information. Two further proteins, gi_5105410 and gi_5105974, are discarded since they have a large membrane-spanning domain at the C-terminus and an additional putative hydrophobic region at the N-terminus. The remaining gene product gi_5104297 is retained as a structural target because in addition to the C-terminal thioredoxin domain it includes a completely uncharacterized soluble domain at the N-terminus. The results for this cluster will be different if the user is interested in shorter domains and sets both 3D_UNCOVERED and SEQ_UNCOVERED to ~75 amino acid residues. Then gi_5106201 also becomes a structural target since its N-terminal portion encompassing approximately the first 80–90 amino acids displays no similarity to any other known protein.
Depending on the objectives of a particular structure determination project, the requirement that the potential structural targets must not have significant transmembrane domains may be relaxed in order to take into account individual, sufficiently long globular domains of membrane-associated proteins. The M.tuberculosis sequence cluster centered around the 'fused nitrate reductase' rv1736c (Figure 2) provides a good illustration of this point. rv1736c is the result of re-arrangement and fusion of the α, δ and γ chains of membrane-bound nitrate reductase, encoded by genes rv1161, rv1163 and rv1164, respectively (Cole et al., 1998). The soluble α subunit (together with the β subunit) is anchored to the plasma membrane by the γ subunit, while the δ polypeptide is not part of the final enzyme and is presumably important for the stability of the αβ complex prior to its membrane attachment (Moreno-Vivian et al., 1999). rv1161, in its turn, shares a weak domain similarity with the biotin sulfoxide reductase rv1442. Since the tertiary structure of the entire rv1161 gene product is known via its homology to the R.capsulatus dimethyl sulfoxide reductase (Schneider et al., 1996), the fused protein rv1736c effectively consists of one globular domain with known structure, one more globular domain with unknown structure, corresponding to rv1163, and one transmembrane domain at the C-terminus, corresponding to rv1164. Consequently, STRUDEL yields rv1163 as the only target from this cluster. However, if rv1163 did not exist, the corresponding globular domain in rv1736c would have been overlooked because of the presence of the transmembrane domain in the latter. We have implemented an option in the STRUDEL software that allows mixed membrane/soluble proteins to be considered as potential structural targets.
Choice of the analysis parameters
The interactive application of STRUDEL allows the use of what-if scenarios to explore different outcomes for a given cluster or for the genome as a whole and to optimize the analysis parameters to suit the goal of a particular structure determination project. In this section we demonstrate the global influence of the parameters 3D_UNCOVERED and SEQ_UNCOVERED on the results of the target selection process using the E.coli genome as an example.
The total number of single-linkage clusters found in E.coli using the reciprocal BLAST score threshold of 45 bits is 479, encompassing 2235 proteins. As seen in Figure 1, after the first round of filtering (removing sequences with completely known three-dimensional information and membrane proteins), 701 sequences, or 16% of the protein complement, remain clustered in 193 clusters. The next stage of the algorithm, elimination of global redundancy between clustered sequences, results in reducing the number of clustered sequences and clusters to a mere 139 and 39, respectively. Finally, resolving mixed domain problems has a rather insignificant effect, leading to 127 sequences in 38 clusters. Thus, the application of STRUDEL with default parameters (3D_UNCOVERED = 100 and SEQ_UNCOVERED = 100) reduces the fraction of E.coli sequences grouped in single-linkage clusters from 52.3% to a mere 3% of the protein complement.
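The percentages quoted in this paragraph can be cross-checked with a few lines of arithmetic. The size of the protein complement is not stated explicitly in the text, so the sketch below infers it from the statement that 2235 clustered proteins correspond to 52.3% of the complement; that inferred total is an assumption of the example, not a reported figure.

```python
clustered_initial = 2235      # proteins in single-linkage clusters
fraction_initial = 0.523      # the same set as a fraction of the complement

# Inferred (not reported) size of the protein complement.
total_proteins = round(clustered_initial / fraction_initial)


def percent_of_complement(count):
    """Express a protein count as a percentage of the inferred complement."""
    return 100.0 * count / total_proteins


after_filtering = percent_of_complement(701)   # first round of filtering
after_redundancy = percent_of_complement(139)  # global redundancy removed
final_clustered = percent_of_complement(127)   # mixed domain cases resolved
```

Under this assumption the 701 sequences remaining after the first filtering round and the 127 sequences remaining at the end round to the 16% and 3% given above.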
The parameters 3D_UNCOVERED and SEQ_UNCOVERED strongly influence the outcome of the re-clustering procedure (Figure 5). Setting both of them to 50 amino acid residues, for example, results in an increase in the number of clustered sequences to >5%, while raising the parameter value to 250 residues eliminates sequence clusters nearly completely (0.4%). Therefore, we conclude that the whole issue of sequence clustering is only of importance if one is interested in structural information on relatively short sequence domains of multi-domain proteins.
Figure 7 provides a comparison of the fraction of structural targets, both in singlets and in single-linkage clusters, in all completely sequenced bacterial genomes. On average, 48% of gene products in a genome are globular proteins with at least one structurally uncharacterized domain. After elimination of redundancy on the domain level this figure is lowered to 41.5%. Hence the application of STRUDEL results in the reduction of the number of structural targets by ~7.5%, on average, with respect to the situation where sequence clusters are not taken into account.
Our procedure automatically yields the minimum set of gene products without any structural homologues and those partially covered by known structural domains. For the latter, structure determination of only individual uncharacterized domains is required. The main observation that we want to demonstrate in this paper is that our pragmatic filtering/re-clustering procedure allows for a dramatic reduction of the number of sequences participating in single-linkage clusters and thus makes the problem of algorithmically rigorous clustering and resolving complex domain similarity problems much less severe. As seen in Figure 1, out of 2235 E.coli sequences initially contained in single-linkage clusters, only 127 still remain clustered after the application of the complete target selection procedure.
By default, our algorithm takes as input the complete set of gene products from a given organism. However, the area of application of our technique is not necessarily limited to completely sequenced genomes. The same protocol is suited for any sufficiently large and diverse group of proteins of interest, including proteins known to interact with each other and those involved in a certain cellular process (Terwilliger et al., 1998). STRUDEL has an option to start the analysis with a manually pre-selected protein list.
In this work, we considered as initial targets all predicted soluble proteins possessing substantially large sequence domains without available structural information. All parameters of the analysis are dynamic and can be changed. For example, the choice of the minimum allowed sequence similarity required to join two proteins in a single-linkage cluster depends on the objective of the project. Two proteins sharing a common structural motif at a very low similarity level can be joined in one cluster if the purpose is to obtain a general idea about the folding topology. For detailed studies on structures involving the analysis of individual structural elements, ligand binding sites, etc., a much higher level of homology will be required to join sequences into the same target family.
The decision tree shown in Figure 1 should be considered a rough prototype of the target selection process in a realistic structural genomics project. Each step of the procedure will certainly require further detail. For example, distinguishing between soluble and insoluble proteins is a complex task which goes far beyond mere membrane region prediction. Christendat et al. (Christendat et al., 2000) developed a specialized data mining technique for this purpose which involves consideration of hydrophobic stretches in addition to Gln, Asp, Glu and aromatic composition. The process of mapping known three-dimensional structures on genomic sequences should ideally take into account discontinuous domains.
Using the wealth of pre-computed sequence attributes available through the PEDANT database, it is easy to apply a variety of other user-specified criteria for initial screening of gene products that are more likely to yield to expression and crystallization. Those should include protein size and pI, the number of cysteine and methionine residues, information on amino acid repeats, predicted exposed surface area and non-globular regions, to name just a few (E.Ulrich, personal communication). Other important features include predicted cellular localization, functional category and the size and phylogenetic distribution of the protein family to which a given protein belongs. Furthermore, the entire body of experimental evidence produced by functional analysis studies (availability of mutants, expression data, protein–protein interactions) should ideally be taken into account. We conclude that the target selection for structural genomics can be best explored in conjunction with extensive high-quality genome annotation.
Acknowledgments |
---|
References |
---|
Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Howe,K.L. and Sonnhammer,E.L. (2000) Nucleic Acids Res., 28, 263–266.
Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) Nucleic Acids Res., 28, 235–242.
Bork,P., Schultz,J. and Ponting,C.P. (1997) Trends Biochem. Sci., 22, 296–298.
Brenner,S.E., Koehl,P. and Levitt,M. (2000) Nucleic Acids Res., 28, 254–256.
Bycroft,M., Hubbard,T.J., Proctor,M., Freund,S.M. and Murzin,A.G. (1997) Cell, 88, 235–242.
Christendat,D., Yee,A., Dharamsi,A., Kluger,Y., Savchenko,A., Cort,J.R., Booth,V., Mackereth,C.D., Saridakis,V., Ekiel,I. et al. (2000) Nat. Struct. Biol., 7, 903–909.
Cole,S.T., Brosch,R., Parkhill,J., Garnier,T., Churcher,C., Harris,D., Gordon,S.V., Eiglmeier,K., Gas,S., Barry,C.E. et al. (1998) Nature, 393, 537–544.
Eddy,S.R. (1998) Bioinformatics, 14, 755–763.
Enright,A.J. and Ouzounis,C.A. (2000) Bioinformatics, 16, 451–457.
Fischer,D. and Eisenberg,D. (1999) Bioinformatics, 15, 759–762.
Frishman,D. and Argos,P. (1997) Proteins, 27, 329–335.
Frishman,D. and Mewes,H.W. (1997a) Nat. Struct. Biol., 4, 626–628.
Frishman,D. and Mewes,H.W. (1997b) Trends Genet., 13, 415–416.
Frishman,D. and Mewes,H.W. (1999) Prog. Biophys. Mol. Biol., 72, 1–17.
Frishman,D., Albermann,K., Hani,J., Heumann,K., Metanomski,A., Zollner,A. and Mewes,H.W. (2001) Bioinformatics, 17, 44–57.
Gerstein,M. (1998) Proteins, 33, 518–534.
Hegyi,H. and Gerstein,M. (1999) J. Mol. Biol., 288, 147–164.
Hwang,K.Y., Chung,J.H., Kim,S.H., Han,Y.S. and Cho,Y. (1999) Nat. Struct. Biol., 6, 691–696.
Kim,S.H. (2000) Curr. Opin. Struct. Biol., 10, 380–383.
Kimura,M., Foulaki,K., Subramanian,A.R. and Wittmann-Liebold,B. (1982) Eur. J. Biochem., 123, 37–53.
Klein,P., Kanehisa,M. and DeLisi,C. (1985) Biochim. Biophys. Acta, 815, 468–476.
Koonin,E.V., Tatusov,R.L. and Rudd,K.E. (1996) Methods Enzymol., 266, 295–322.
Lo Conte,L., Ailey,B., Hubbard,T.J., Brenner,S.E., Murzin,A.G. and Chothia,C. (2000) Nucleic Acids Res., 28, 257–259.
Lupas,A.N., van Dyke,M. and Stock,J. (1991) Science, 252, 1162–1164.
Mallick,P., Goodwill,K.E., Fitz-Gibbon,S., Miller,J.H. and Eisenberg,D. (2000) Proc. Natl Acad. Sci. USA, 97, 2450–2455.
Matsuda,H., Ishihara,T. and Hashimoto,A. (1999) Theor. Comput. Sci., 210, 305–325.
Milburn,D., Laskowski,R.A. and Thornton,J.M. (1998) Protein Eng., 11, 855–859.
Moreno-Vivian,C., Cabello,P., Martinez-Luque,M., Blasco,R. and Castillo,F. (1999) J. Bacteriol., 181, 6573–6584.
Nielsen,H., Engelbrecht,J., Brunak,S. and von Heijne,G. (1997) Protein Eng., 10, 1–6.
Park,J. and Teichmann,S.A. (1998) Bioinformatics, 14, 144–150.
Pazos,F., Olmea,O. and Valencia,A. (1997) Comput. Appl. Biosci., 13, 319–321.
Sali,A. (1998) Nat. Struct. Biol., 5, 1029–1032.
Schaffer,A.A., Wolf,Y.I., Ponting,C.P., Koonin,E.V., Aravind,L. and Altschul,S.F. (1999) Bioinformatics, 15, 1000–1011.
Schneider,F., Lowe,J., Huber,R., Schindelin,H., Kisker,C. and Knablein,J. (1996) J. Mol. Biol., 263, 53–69.
Siomi,H., Matunis,M.J., Michael,W.M. and Dreyfuss,G. (1993) Nucleic Acids Res., 21, 1193–1198.
Skolnick,J. and Fetrow,J.S. (2000) Trends Biotechnol., 18, 34–39.
Sonnhammer,E.L.L. and Kahn,D. (1994) Protein Sci., 3, 482–492.
Storm,C.E. and Sonnhammer,E.L. (2001) Bioinformatics, 17, 343–348.
Terwilliger,T.C., Waldo,G., Peat,T.S., Newman,J.M., Chu,K. and Berendzen,J. (1998) Protein Sci., 7, 1851–1856.
Wolf,Y.I., Brenner,S.E., Bash,P.A. and Koonin,E.V. (1999) Genome Res., 9, 17–26.
Wolf,Y.I., Grishin,N.V. and Koonin,E.V. (2000) J. Mol. Biol., 299, 897–905.
Wootton,J.C. and Federhen,S. (1993) Comput. Chem., 17, 149–163.
Xu,D. and Nussinov,R. (1998) Fold. Des., 3, 11–17.
Yokoyama,S., Matsuo,Y., Hirota,H., Kigawa,T., Shirouzu,M., Kuroda,Y., Kurumizaka,H., Kawaguchi,S., Ito,Y., Shibata,T. et al. (2000) Prog. Biophys. Mol. Biol., 73, 363–376.
Yona,G., Linial,N. and Linial,M. (1999) Proteins, 37, 360–378.
Received June 26, 2001; revised November 8, 2001; accepted November 21, 2001.