VaZyMolO: a tool to define and classify modularity in viral proteins

François Ferron†, Corinne Rancurel†, Sonia Longhi, Christian Cambillau, Bernard Henrissat and Bruno Canard

Architecture et Fonction des Macromolécules Biologiques, UMR 6098, CNRS and Universités Aix-Marseille I and II, ESIL, 163 Avenue de Luminy, Case 925, F-13288 Marseille Cedex 9, France

Correspondence
Bruno Canard
bruno.canard@afmb.cnrs-mrs.fr


   ABSTRACT
Viral structural genomic projects aim at unveiling the function of unknown viral proteins by employing high-throughput approaches to determine their 3D structure and to identify their function through fold-homology studies. The ‘viral enzyme module localization’ (VaZyMolO) tool has been developed, which aims at defining viral protein modules that might be expressed in a soluble and functionally active form, thereby identifying candidates for crystallization studies. VaZyMolO includes 114 complete viral genome sequences of both negative- and positive-sense, single-stranded RNA viruses available from NCBI. In VaZyMolO, a module is defined as a structural and/or functional unit. Modules were first identified by homology search and then validated by the convergence of results from sequence composition analysis, motif search, transmembrane region search and domain definitions, as found in the literature. The public interface of VaZyMolO, which is accessible from http://www.vazymolo.org, allows comparison of a query sequence to all VaZyMolO modules of known function.

†These authors contributed equally to this work.


   MAIN TEXT
The advent of the genomic era implies that biologists are now confronted with vast quantities of raw sequence data. The number of available viral genome sequences at the NCBI has increased by 7·1 % between October 2003 and March 2004. The last release is composed of a total set of 32 308 viral proteins, of which 7472 proteins have been biochemically assessed, 22 099 have a postulated function and 2737 are classified as unknown. Such large volumes of data require the development of tools capable of distilling this information, thereby aiding scientists to devise a rational approach to the study of viral proteins. Viral structural genomic projects attempt to assign functions to uncharacterized proteins, by solving their structures and identifying function through fold homology. High-throughput structure determination combines computer-based analysis of proteins, automated expression and purification of gene products, and determination of their 3D structure. One of the bottlenecks in this integrative approach is the production of pure soluble proteins suitable for crystallogenesis. We have previously reported that many viral proteins have a modular organization, containing regions (hydrophobic or disordered) that are often not compatible with the crystallization process (Ferron et al., 2002; Karlin et al., 2003). To increase the chance of producing protein domains suitable for crystallization, we have developed the ‘viral enzyme module localization’ (VaZyMolO) tool which serves to define and classify viral protein modularity.

VaZyMolO enables the handling of viral sequences at the protein level in order to define their modularity. Sequence analysis is made possible by the implementation of software such as BLASTP (Altschul et al., 1997), multalin (Corpet, 1988) and hydrophobic cluster analysis (HCA) (Callebaut et al., 1997). The two main pillars of VaZyMolO are the ‘protein sequence motif’ and the ‘protein domain’ definitions. We define a ‘protein sequence motif’ as a set of conserved amino acids located within a short distance of one another that are important for both function and structure. A ‘protein domain’ is a structurally compact, autonomously folding unit that forms a stable structure and shows a certain level of evolutionary conservation. In VaZyMolO, a ‘module’ is defined as a structural and/or functional unit, which may contain one or several protein domains. VaZyMolO organizes information about modularity on viral open reading frames from complete genome sequences derived from GenBank and RefSeq (Benson et al., 2002; Pruitt et al., 2003). We focused on single-stranded (both negative- and positive-sense) RNA viruses. We used an approach derived from that used by Coutinho & Henrissat (1999a, 1999b) to construct the carbohydrate-active enzymes modular organization database. In VaZyMolO, modules rely on structure definition and sequence similarity, and also on sequence properties and biological data (e.g. membrane anchor, protein–protein interactions, solubility, etc.). Moreover, our classification system allows us to overcome divergence due to both orthologous and paralogous origin of sequences, as long as they have significant sequence similarity. Those modules whose structures are known, or that display high solubility, serve as a reference. To allow comparison between viruses, we take into account basic taxonomic information. We have also developed a module-function classification. The identification of conserved or missing modules is a valuable tool in the comparison of different virus genomes, helping us to derive information about the viral life cycle. The final information is stored in a library of modules, which can be interrogated using the VaZyMolO interface, via a BLASTP engine.

Here, we describe the construction and organization of VaZyMolO. Virions are organized into three layers: surface proteins, matrix proteins and non-structural proteins. VaZyMolO is directly inspired by this arrangement and is therefore organized into three layers reflecting surface (layer S), matrix (layer M) and non-structural (layer F) proteins. This first classification is a way to detect and highlight the common modularity between two proteins belonging to different layers.

For the global annotation procedure, virus sequences and basic information are collected from complete viral genome sequences deposited at the NCBI. Complete genomes are identified from the ‘Viruses.ids’ files available from the NCBI ftp site (ftp://ftp.ncbi.nih.gov/genomes/IDS/Viruses.ids). Each virus file is downloaded and parsed by a semi-automatic data-processing program. Each entry in the database thus contains several items of information, such as NCBI accession number, GenBank Identifier (GI number), taxon number, virus name, taxonomy, product name, gene name, sequence length and fasta sequence (Fig. 1b). The use of complete genomes avoids redundancy in these data. Sometimes, however, the files omit proteins that are not encoded in-frame, for example when there is a frame-shift (e.g. V protein of measles virus, NC_001498). To bridge this gap, such protein sequences are processed manually.
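The entry fields listed above can be pictured as a simple record. The following is a minimal sketch only: the class name `ProteinEntry`, its field names and the `to_fasta` helper are hypothetical and are not taken from the VaZyMolO implementation, which is not described at this level of detail. The GI number, taxon number and sequence in the example are placeholders.

```python
from dataclasses import dataclass


@dataclass
class ProteinEntry:
    """One database entry, with the fields named in the text (hypothetical layout)."""
    accession: str    # NCBI accession number
    gi: str           # GenBank Identifier (GI number)
    taxon_id: str     # taxon number
    virus_name: str
    taxonomy: str
    product: str      # product name
    gene: str         # gene name
    sequence: str     # protein sequence, one-letter code

    @property
    def length(self) -> int:
        # Sequence length is derived from the sequence rather than stored twice.
        return len(self.sequence)

    def to_fasta(self, width: int = 60) -> str:
        """Render the entry as a fasta record with lines of at most `width` residues."""
        header = f">{self.accession} {self.product} [{self.virus_name}]"
        lines = [self.sequence[i:i + width] for i in range(0, len(self.sequence), width)]
        return "\n".join([header, *lines])


# Placeholder values: only the accession and names come from the text (Fig. 1b).
example = ProteinEntry(
    accession="NP_071469.1",
    gi="0",                      # placeholder, not the real GI number
    taxon_id="0",                # placeholder, not the real taxon number
    virus_name="Newcastle disease virus",
    taxonomy="Paramyxoviridae",
    product="fusion protein",
    gene="F",
    sequence="MGSRPSTKNPAPM",    # toy sequence, not the real protein
)
print(example.to_fasta())
```

Deriving the length from the sequence avoids the two values drifting apart when a sequence is corrected by hand, as happens for the manually processed proteins.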



Fig. 1. (a) General work scheme of the annotation process of VaZyMolO. (b) Snapshot of the annotator interface (not the public user interface) of VaZyMolO, showing the Newcastle disease virus fusion protein entry (NP_071469.1) as an example.

 
A library of full fasta sequences (full library) is then built from VaZyMolO. Analysis of related proteins is based on sequence similarities using BLASTP. A two-step clustering-bounding procedure is followed. First, each protein sequence is compared with the full library. The conservation of at least one region between sequences leads to an initial classification into subgroups. Then, the degree of similarity between the query and target sequences is analysed from the BLAST results. This procedure yields a first classification, leading to a kernel of families based on the strongest similarity.
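The grouping step can be illustrated with a single-linkage clustering over pairwise hits. This is a sketch under stated assumptions, not the procedure used by the authors: the function name `cluster_by_hits`, the `(query, target, evalue)` tuple layout and the `max_evalue` cut-off are hypothetical, and the text does not give the thresholds actually applied. Two sequences end up in the same group if a chain of accepted hits connects them.

```python
def cluster_by_hits(ids, hits, max_evalue=1e-3):
    """Group sequence ids by single linkage over pairwise hits.

    ids        : iterable of sequence identifiers
    hits       : iterable of (query_id, target_id, evalue) tuples
    max_evalue : hits with a larger E-value are ignored (assumed cut-off)
    """
    ids = list(ids)
    parent = {i: i for i in ids}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, target, evalue in hits:
        if evalue <= max_evalue:
            root_q, root_t = find(query), find(target)
            if root_q != root_t:
                parent[root_t] = root_q

    groups = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    # Sorted output so that results are reproducible.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Toy data: A-B and B-C are accepted, C-D is too weak.
toy_ids = ["A", "B", "C", "D"]
toy_hits = [("A", "B", 1e-20), ("B", "C", 1e-5), ("C", "D", 0.5)]
print(cluster_by_hits(toy_ids, toy_hits))
```

A two-step scheme in the spirit of the text could call the function twice, first with a permissive cut-off to form subgroups and then with a stricter one to isolate the kernel of each family.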

The modular annotation process for each sequence includes an analysis procedure that is detailed below and summarized in Fig. 1(a). After cross-validating the information as described in Fig. 1(a), modules of sequences are annotated. At this stage, a modular library of fasta sequences is built into the ModulO library.

Structural data are the basis of our module definition. Whenever possible, we have included structural information on viral proteins. This information is extended to homologous protein families using an internal BLAST procedure. The retrieval of structural information is done by searching all viral proteins against the Protein Data Bank (Berman et al., 2000) using BLAST. When the result is in the twilight zone of BLAST (i.e. according to our criteria, when the E-value is >10⁻³), we consider the candidate as a distant protein and perform threading analysis using a combination of 3D-PSSM (Kelley et al., 2000), mGenthreader (Jones, 1999) and InBGU (Fischer, 2000) servers.
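The E-value criterion amounts to a simple triage of each Protein Data Bank search result. The sketch below assumes that only the best E-value per query is considered; the function name and the returned labels are hypothetical, and the handling of queries with no hit at all is an assumption, since the text does not describe that case.

```python
TWILIGHT_EVALUE = 1e-3  # threshold stated in the text


def triage_pdb_hit(best_evalue):
    """Decide how a query is handled from its best BLAST E-value against the PDB.

    best_evalue is None when BLAST reports no hit (assumed convention).
    """
    if best_evalue is None:
        return "no_hit"       # assumption: not covered by the text
    if best_evalue > TWILIGHT_EVALUE:
        return "threading"    # twilight zone: treat as distant, run threading
    return "homolog"          # structural information transferred by similarity


for evalue in (1e-40, 0.02, None):
    print(evalue, triage_pdb_hit(evalue))
```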

Another consideration is that the results from the different threading analyses should converge with high scores (according to the criteria of the selected program). The hit is then analysed, and it should present the same function and key motif residues as the query. We perform a secondary structure prediction on the query using Predict Protein (Rost, 1996), in order to check for structural compatibility with the hit. A protein region is defined as a module only after this cross-validation procedure has been completed.
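The convergence requirement can be sketched as a consensus over the top confident hit of each threading server. Everything below is illustrative: the function name `consensus_fold`, the input layout and the fold labels are hypothetical, and each server's own confidence cut-off is assumed to have been applied beforehand (a server with no confident hit is recorded as `None`). By default all servers must agree, which is one strict reading of "should converge".

```python
from collections import Counter


def consensus_fold(server_hits, min_agree=None):
    """Return the fold on which the threading servers converge, or None.

    server_hits : dict mapping server name -> top confident fold id, or None
    min_agree   : number of servers that must agree (default: all of them)
    """
    confident = [fold for fold in server_hits.values() if fold is not None]
    if not confident:
        return None
    needed = len(server_hits) if min_agree is None else min_agree
    fold, count = Counter(confident).most_common(1)[0]
    return fold if count >= needed else None


# Hypothetical fold labels, for illustration only.
agree = {"3D-PSSM": "fold_X", "mGenthreader": "fold_X", "InBGU": "fold_X"}
split = {"3D-PSSM": "fold_X", "mGenthreader": "fold_X", "InBGU": "fold_Y"}
print(consensus_fold(agree))
print(consensus_fold(split))
print(consensus_fold(split, min_agree=2))
```

A fold returned by such a check would still need the functional, motif and secondary-structure comparisons described above before a region is accepted as a module.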

In order to produce modules suitable for crystallization, we attempt to define precisely the protein regions that may contain hydrophobic (signal peptide, hydrophobic domain and transmembrane region) or natively disordered (Uversky et al., 2000) patterns. In the absence of 3D data, we perform a systematic sequence analysis to define globular and disordered regions. Disordered regions are defined by combining the results from the analysis of the mean hydrophobicity/mean charge ratio (Uversky et al., 2000), as well as from PONDR (Iakoucheva et al., 2001) and DisEMBL (Linding et al., 2003a). We use HCA (Callebaut et al., 1997) to refine the boundaries of the modules and to identify linker regions. To define globularity we combine two approaches: HCA and GLOBplot (Linding et al., 2003b). HCA plots give patterns reflecting structural elements, for example coiled-coils. It is known that structural homology may not be reflected in terms of sequence similarity. For this reason, each module for which a 3D structure is known is analysed using HCA in order to define the corresponding HCA pattern. These patterns allow grouping of solved or predicted distantly related modules. The power of HCA in deciphering structural homology in the absence of significant sequence similarity is well illustrated in the case of the P multimerization domain (PMD) of Sendai and measles viruses (Fig. 2). The structure of Sendai virus PMD has been solved by X-ray crystallography and consists of a coiled-coil (Tarbouriech et al., 2000). The sequence similarity between the Sendai virus PMD and the corresponding region in measles virus P (aa 304–375) is 11 %, which is not high enough to be detected by PSI-BLAST. However, the measles virus PMD region exhibits an HCA profile similar to that of the Sendai virus PMD, thus designating it as a promising candidate for crystallographic studies. Indeed, this region turned out to be expressed in a soluble form in E. coli, with purification yields suitable for crystallographic studies. Finally, transmembrane regions are predicted by both TMHMM (Krogh et al., 2001) and HCA.
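The mean hydrophobicity/mean charge analysis can be sketched as follows. The boundary used here, ⟨R⟩ = 2.785⟨H⟩ − 1.151, is the one commonly quoted from Uversky et al. (2000). The use of the Kyte–Doolittle scale rescaled to the 0–1 range is an assumption about the hydropathy scale, the function names are hypothetical, and the sketch accepts only the 20 standard residues in a non-empty sequence. It is not the VaZyMolO code, and in practice the result is combined with PONDR and DisEMBL predictions as described above.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq):
    """Mean hydropathy <H>, with each residue value rescaled from [-4.5, 4.5] to [0, 1]."""
    return sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in seq) / len(seq)


def mean_net_charge(seq):
    """Absolute mean net charge <R>, counting K/R as +1 and D/E as -1."""
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("D") + seq.count("E")
    return abs(positive - negative) / len(seq)


def predicted_unfolded(seq):
    """True if the sequence falls on the 'natively unfolded' side of the boundary."""
    boundary_h = (mean_net_charge(seq) + 1.151) / 2.785
    return mean_hydropathy(seq) < boundary_h


# Toy sequences: a highly charged stretch and a strongly hydrophobic one.
for toy in ("EEEEKEEEDD", "LLLLLIIIIV"):
    print(toy, round(mean_hydropathy(toy), 3), round(mean_net_charge(toy), 3),
          predicted_unfolded(toy))
```

Applied to a sliding window rather than a whole protein, the same test gives a per-region view that can help place tentative module boundaries before they are refined with HCA.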



Fig. 2. The multimerization domain (PMD) of Sendai virus phosphoprotein (P) is a coiled-coil structure (Tarbouriech et al., 2000). The corresponding HCA pattern is interpreted as long hydrophobic stretches (underlined region). The same kind of pattern has been found in measles virus P, despite no significant sequence similarity. In VaZyMolO, the two modules are thus considered as belonging to the same class. The measles virus PMD module was indeed expressed and purified from the soluble fraction of E. coli. Its crystallization is in progress.

 
We have developed a simple functional classification to assign proteins to broad functional classes that reflect typical viral processes. So far we have defined the following classes: structural proteins, proteases, helicases, replicases and capping enzymes.

The different homologous protein families have been manually assigned to these classes and given a short functional description. We start with the original NCBI annotations in the ‘NC_xxx’ files to assign the protein to a functional group. Whenever possible, we correlate the findings of this ‘in-house’ procedure with experimental data retrieved by literature search using ‘Entrez’ (http://www.ncbi.nlm.nih.gov/Entrez/index.html). Moreover, experts working on Paramyxoviridae within a collaborative network in which we are involved (http://virrnapoldrugtarget.univ-lyon1.fr/jdc_publicHomePage.htm) contribute to the functional annotation of viral proteins. We strongly encourage virologists to provide us with functional data that will greatly help us in defining module boundaries. These data can be deposited by completing a form that is available online (http://afmb.cnrs-mrs.fr/stgen/vazymolo.html).

All modules defined in VaZyMolO are related to taxon and virus name. This allows assessment of viral phylogenetic distribution of each module. We have used the nomenclature from the International Committee on Taxonomy of Viruses (ICTV) (http://ictvdb.mirror.ac.cn/index.htm) to name species, genera and subfamilies of each virus entry. The search by virus name is facilitated by a list of standard virus names.

The VaZyMolO modular assignment is accessible online through a web interface (http://www.vazymolo.org). The VaZyMolO interface lists the number of complete genomes in the current release as well as taxonomic and structural information (Table 1). It contains a link to a listing of the complete genome sequences of viruses sorted by virus name and family. The protein module library, as defined by VaZyMolO, can be queried by a sequence search via a BLAST server. In future developments, we plan to integrate an interactive graphical interface allowing, for each entry, easy navigation between schematic domain modularity, protein information, alignment, structure and phylogeny.


Table 1. Current viral entries in VaZyMolO

The numbers in the right column indicate the present number of viral members described in VaZyMolO.

 
VaZyMolO makes use of a novel approach to define protein modularity, thus rendering it complementary to other modular databases and not redundant. Indeed, most of the modular databases are based on extensive and mainly automated annotation procedures (Bateman et al., 2000; Corpet, 1988). Conversely, VaZyMolO annotations are based on stringent manual checking, especially concerning the boundaries of the modules, and benefit from virologists' knowledge. As for motif definition, the fact that VaZyMolO deals only with viral protein sequences allows us to overcome the problems of bias that can be found in other motif databases, and enables us to derive motifs reflecting the evolution of viral proteins. The VaZyMolO interface allows the fast and easy retrieval of information on the modular organization of a query sequence, which represents a critical step in view of structural studies. Indeed, this tool is the keystone in the selection of the best targets in the SPINE structural genomics project (http://www.ebi.ac.uk/msd-srv/msdtarget) in which our laboratory is engaged. Targets defined in this way are processed by an ‘in-house’ high-throughput platform for expression, purification and crystallization. Feedback on the behaviour of each tested protein allows biochemical validation of module boundaries. Since this structural genomic project began in 2002, VaZyMolO analysis has proven to be crucial for the structural and functional characterization of the 2'-O-methyltransferase domain of dengue virus NS5 (Egloff et al., 2002), the X domain of measles virus phosphoprotein (Johansson et al., 2003; Karlin et al., 2003) and Nsp9 of SARS virus (Egloff et al., 2004).

VaZyMolO is a tool devoted to the modular description and classification of both non-structural and structural viral proteins. Viral sequences are retrieved from different plant and animal virus families. Non-redundant complete genome sequences derived from NCBI are automatically clustered into homologous protein families, following a process of pre-classification, and modules are then defined. The primary basis of this classification is the set of structural motifs detected by a variety of complementary methods. Protein families are a rich source of information for functional and evolutionary studies. Sequence alignments of conserved regions highlight important conserved amino acids, allowing the definition of new motifs within proteins. VaZyMolO is presently tailored to studies of Mononegavirales, coronaviruses and flaviviruses. It will be updated with each new GenBank release, and we are currently incorporating other animal virus families. Functional annotation should benefit from contributions and feedback from collaborating experts in the field, via an online form. This comprehensive analysis facilitates the identification of many previously undetected module families of unknown function, thereby paving the way for their structural and functional analysis.


   ACKNOWLEDGEMENTS
 
We thank Pedro Coutinho, Eric Blanc and Emeline Deleury for their support and contribution. We also thank Denis Gerlier for useful discussion. We are grateful to David Bhella and Alexander E. Gorbalenya for critical reading of the manuscript. This work was funded by the European Commission as ‘SPINE’ (contract no. QLG2-CT-2002-00988) and ‘VirRNApoldrugtarget’ (‘Towards the design of new potent antiviral drug: structures–function analysis of Paramyxoviridae RNApolymerase’, contract no. QLK2-CT2001-01225) under the specific RTD programme ‘Quality of Life and Management of Living Resources’. This publication does not necessarily reflect the Commission's views and in no way anticipates the Commission's future policy in this area.


   REFERENCES
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402.

Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L. & Sonnhammer, E. L. (2000). The Pfam protein families database. Nucleic Acids Res 28, 263–266.

Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Rapp, B. A. & Wheeler, D. L. (2002). GenBank. Nucleic Acids Res 30, 17–20.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). The Protein Data Bank. Nucleic Acids Res 28, 235–242.

Callebaut, I., Labesse, G., Durand, P., Poupon, A., Canard, L., Chomilier, J., Henrissat, B. & Mornon, J. P. (1997). Deciphering protein sequence information through hydrophobic cluster analysis (HCA): current status and perspectives. Cell Mol Life Sci 53, 621–645.

Corpet, F. (1988). Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res 16, 10881–10890.

Coutinho, P. & Henrissat, B. (1999a). Carbohydrate-active enzyme: an integrated approach. In Recent Advances in Carbohydrate Bioengineering, pp. 3–12. Edited by H. Gilbert, G. Davies, B. Henrissat & B. Svensson. Cambridge: The Royal Society of Chemistry.

Coutinho, P. & Henrissat, B. (1999b). The modular structure of cellulases and other carbohydrate-active enzymes: an integrated database approach. In Genetics, Biochemistry and Ecology of Cellulose Degradation, pp. 15–23. Edited by K. Ohmiya, K. Hayashi, K. Sakka, Y. Kobayashi, S. Karita & T. Kimura. Tokyo: Uni Publishers.

Egloff, M. P., Benarroch, D., Selisko, B., Romette, J. L. & Canard, B. (2002). An RNA cap (nucleoside-2'-O-)-methyltransferase in the flavivirus RNA polymerase NS5: crystal structure and functional characterization. EMBO J 21, 2757–2768.

Egloff, M. P., Ferron, F., Campanacci, V. & 7 other authors (2004). The severe acute respiratory syndrome-coronavirus replicative protein nsp9 is a single-stranded RNA-binding subunit unique in the RNA virus world. Proc Natl Acad Sci U S A 101, 3792–3796.

Ferron, F., Longhi, S., Henrissat, B. & Canard, B. (2002). Viral RNA-polymerases – a predicted 2'-O-ribose methyltransferase domain shared by all Mononegavirales. Trends Biochem Sci 27, 222–224.

Fischer, D. (2000). Hybrid fold recognition: combining sequence derived properties with evolutionary information. Pac Symp Biocomput, 119–130.

Iakoucheva, L. M., Kimzey, A. L., Masselon, C. D., Bruce, J. E., Garner, E. C., Brown, C. J., Dunker, A. K., Smith, R. D. & Ackerman, E. J. (2001). Identification of intrinsic order and disorder in the DNA repair protein XPA. Protein Sci 10, 560–571.

Johansson, K., Bourhis, J. M., Campanacci, V., Cambillau, C., Canard, B. & Longhi, S. (2003). Crystal structure of the measles virus phosphoprotein domain responsible for the induced folding of the C-terminal domain of the nucleoprotein. J Biol Chem 278, 44567–44573.

Jones, D. T. (1999). GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 287, 797–815.

Karlin, D., Ferron, F., Canard, B. & Longhi, S. (2003). Structural disorder and modular organization in Paramyxovirinae N and P. J Gen Virol 84, 3239–3252.

Kelley, L. A., MacCallum, R. M. & Sternberg, M. J. (2000). Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol 299, 499–520.

Krogh, A., Larsson, B., von Heijne, G. & Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305, 567–580.

Linding, R., Jensen, L. J., Diella, F., Bork, P., Gibson, T. J. & Russell, R. B. (2003a). Protein disorder prediction: implications for structural proteomics. Structure (Camb) 11, 1453–1459.

Linding, R., Russell, R. B., Neduva, V. & Gibson, T. J. (2003b). GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res 31, 3701–3708.

Pruitt, K. D., Tatusova, T. & Maglott, D. R. (2003). NCBI reference sequence project: update and current status. Nucleic Acids Res 31, 34–37.

Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile-based neural networks. Methods Enzymol 266, 525–539.

Tarbouriech, N., Curran, J., Ruigrok, R. W. & Burmeister, W. P. (2000). Tetrameric coiled coil domain of Sendai virus phosphoprotein. Nat Struct Biol 7, 777–781.

Uversky, V. N., Gillespie, J. R. & Fink, A. L. (2000). Why are "natively unfolded" proteins unstructured under physiologic conditions? Proteins 41, 415–427.

Received 7 September 2004; accepted 8 December 2004.