Architecture et Fonction des Macromolécules Biologiques, UMR 6098, CNRS and Universités Aix-Marseille I and II, ESIL, 163 Avenue de Luminy, Case 925, F-13288 Marseille Cedex 9, France
Correspondence
Bruno Canard
bruno.canard{at}afmb.cnrs-mrs.fr
ABSTRACT |
---|
These authors contributed equally to this work.
MAIN TEXT |
---|
VaZyMolO enables the handling of viral sequences at the protein level in order to define their modularity. Sequence analysis is made possible by the implementation of software such as BLASTP (Altschul et al., 1997), multalin (Corpet, 1988) and hydrophobic cluster analysis (HCA) (Callebaut et al., 1997). The two main pillars of VaZyMolO are the protein sequence motif and the protein domain definition. We define a protein sequence motif as a set of conserved amino acids, located within a short distance of one another, that are important for both function and structure. A protein domain is a structurally compact, autonomously folding unit that forms a stable structure and shows a certain level of evolutionary conservation. In VaZyMolO, a module is defined as a structural and/or functional unit, which may contain one or several protein domains. VaZyMolO organizes information about modularity on viral open reading frames from complete genome sequences derived from GenBank and RefSeq (Benson et al., 2002; Pruitt et al., 2003). We focused on single-stranded (both negative- and positive-sense) RNA viruses. We used an approach derived from that used by Coutinho & Henrissat (1999a, 1999b) to construct the carbohydrate-active enzymes modular organization database. In VaZyMolO, modules rely on structure definition and sequence similarity, and also on sequence properties and biological data (e.g. membrane anchor, protein–protein interactions, solubility, etc.). Moreover, our classification system allows us to overcome divergence due to both orthologous and paralogous origin of sequences, as long as they have significant sequence similarity. Modules whose structures are known, or that display high solubility, serve as references. To allow comparison between viruses, we take into account basic taxonomic information. We have also developed a module-function classification. The identification of conserved or missing modules is a valuable tool in the comparison of different virus genomes, helping us to derive information about the viral life cycle. The final information is stored in a library of modules, which can be interrogated using the VaZyMolO interface, via a BLASTP engine.
Here, we describe the construction and organization of VaZyMolO. Virions are organized into three layers: surface proteins, matrix proteins and non-structural proteins. VaZyMolO mirrors this arrangement and is therefore divided into three layers: surface (layer S), matrix (layer M) and non-structural proteins (layer F). This first classification is a way to detect and highlight the common modularity between two proteins belonging to different layers.
For the global annotation procedure, virus sequences and basic information are collected from complete viral genome sequences deposited at the NCBI. Complete genomes are identified from the 'Viruses.ids' files available from the NCBI ftp site (ftp://ftp.ncbi.nih.gov/genomes/IDS/Viruses.ids). Each virus file is downloaded and parsed by a semi-automatic data-processing program. Each entry in the database thus contains several items of information, such as the NCBI accession number, GenBank Identifier (GI number), taxon number, virus name, taxonomy, product name, gene name, sequence length and FASTA sequence (Fig. 1b). The use of complete genomes avoids redundancy in these data. Sometimes, however, the files do not list proteins that are not coded in-frame, for example when there is a frame-shift (e.g. V protein of measles virus, NC_001498). To bridge this gap, such protein sequences are processed manually.
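As an illustration of the kind of record described above, the following Python sketch holds the listed fields in one object and writes the sequence in FASTA format. The class name `ProteinEntry`, the field names and the header layout are assumptions made for this example; they are not the actual VaZyMolO schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinEntry:
    """One database entry; field names are hypothetical, chosen to match the text."""
    accession: str             # NCBI accession number of the genome record
    gi: int                    # GenBank Identifier (GI number)
    taxon_id: int              # NCBI taxon number
    virus_name: str
    taxonomy: tuple[str, ...]  # lineage, from the broadest to the narrowest rank
    product: str               # product name
    gene: str                  # gene name
    sequence: str              # protein sequence, one-letter code

    @property
    def length(self) -> int:
        # Sequence length is derived from the sequence rather than stored twice.
        return len(self.sequence)

    def to_fasta(self, width: int = 60) -> str:
        # The header layout is an assumption made for this sketch.
        header = f">{self.accession}|gi:{self.gi}|{self.gene}|{self.product}"
        lines = [self.sequence[i:i + width]
                 for i in range(0, len(self.sequence), width)]
        return "\n".join([header, *lines])
```

Deriving the length from the sequence keeps the two values consistent when a sequence that is not coded in-frame is entered by hand.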
The modular annotation process for each sequence includes an analysis procedure that is detailed below and summarized in Fig. 1(a). After cross-validating the information as described in Fig. 1(a), modules of sequences are annotated. At this stage, a modular library of FASTA sequences is built into the ModulO library.
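Building a FASTA library from annotated module boundaries can be sketched as follows. The `Module` class, the 1-based inclusive coordinates and the header layout are assumptions made for illustration; the format actually used by the ModulO library is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    name: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # 1-based, inclusive


def module_fasta(accession: str, sequence: str, modules: list[Module]) -> str:
    """Return one FASTA record per annotated module of a protein."""
    records = []
    for m in modules:
        if not (1 <= m.start <= m.end <= len(sequence)):
            raise ValueError(f"module {m.name} lies outside the sequence")
        # Convert 1-based inclusive coordinates to a Python slice.
        sub = sequence[m.start - 1:m.end]
        records.append(f">{accession}|{m.name}|{m.start}-{m.end}\n{sub}")
    return "\n".join(records)
```

Keeping the boundaries in the header means that a BLAST hit against the library can be mapped back to the full-length protein.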
Structural data are the basis of our module definition. Whenever possible, we have included structural information on viral proteins. This information is extended to homologous protein families using an internal BLAST procedure. Structural information is retrieved by searching all viral proteins against the Protein Data Bank (Berman et al., 2000) using BLAST. When the result is in the twilight zone of BLAST (i.e. according to our criteria, when the E-value is >10⁻³), we consider the candidate as a distant protein and perform threading analysis using a combination of the 3D-PSSM (Kelley et al., 2000), mGenTHREADER (Jones, 1999) and InBGU (Fischer, 2000) servers.
In addition, the results from the different threading analyses should converge with high scores (according to the criteria of each program). The hit is then analysed: it should present the same function and key motif residues as the query. We perform a secondary structure prediction on the query using PredictProtein (Rost, 1996), in order to check for structural compatibility with the hit. A protein region is defined as a module only after this cross-validation procedure has been completed.
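The triage and convergence check described in the two paragraphs above can be sketched in Python. The cutoff constant assumes that the twilight-zone threshold is an E-value of 10⁻³. The function names, the `(fold, confident)` tuples and the `min_servers` parameter are hypothetical; each server's own confidence criterion is represented only by a boolean. This sketch is not the procedure as implemented in VaZyMolO.

```python
E_VALUE_CUTOFF = 1e-3  # assumed twilight-zone threshold


def needs_threading(best_e_value):
    """True when there is no PDB hit or the best BLAST hit is in the twilight zone."""
    return best_e_value is None or best_e_value > E_VALUE_CUTOFF


def consensus_fold(threading_hits, min_servers=2):
    """Return the fold on which all confident servers agree, or None.

    threading_hits maps a server name to (fold identifier, confident), where
    `confident` says whether the hit passes that server's own score criterion.
    """
    confident = [fold for fold, ok in threading_hits.values() if ok]
    if len(confident) >= min_servers and len(set(confident)) == 1:
        return confident[0]
    return None
```

A fold returned by `consensus_fold` is only a candidate: the comparison of function, key motif residues and predicted secondary structure still has to be made before a region is accepted as a module.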
In order to produce modules suitable for crystallization, we attempt to define precisely the protein regions that may contain hydrophobic (signal peptide, hydrophobic domain and transmembrane) or natively disordered (Uversky et al., 2000) patterns. In the absence of 3D data, we perform a systematic sequence analysis to define globular and disordered regions. Disordered regions are defined by combining the results from the analysis of the mean hydrophobicity/mean charge ratio (Uversky et al., 2000), as well as from PONDR (Iakoucheva et al., 2001) and DisEMBL (Linding et al., 2003a). We use HCA (Callebaut et al., 1997) to refine the boundaries of the modules and to identify linker regions. To define globularity we combine two approaches: HCA and GlobPlot (Linding et al., 2003b). HCA plots give patterns reflecting structural elements, for example coiled-coils. It is known that structural homology may not be reflected in terms of sequence similarity. For this reason, each module for which a 3D structure is known is analysed using HCA in order to define the corresponding HCA pattern. These patterns allow grouping of solved or predicted distantly related modules. The power of HCA in deciphering structural homology in the absence of significant sequence similarity is well illustrated in the case of the P multimerization domain (PMD) of Sendai and measles viruses (Fig. 2). The structure of Sendai virus PMD has been solved by X-ray crystallography and consists of a coiled-coil (Tarbouriech et al., 2000). The sequence similarity between the Sendai virus PMD and the corresponding region in measles virus P (aa 304–375) is 11 %, which is not high enough to be detected by PSI-BLAST. However, the measles virus PMD region exhibits an HCA profile similar to that of the Sendai virus PMD, thus designating it as a promising candidate for crystallographic studies. Indeed, this region turned out to be expressed in a soluble form in E. coli, with purification yields suitable for crystallographic studies. Finally, transmembrane regions are predicted by both TMHMM (Krogh et al., 2001) and HCA.
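To make the mean hydrophobicity/mean charge criterion concrete, here is a minimal Python sketch. It assumes the Kyte-Doolittle scale rescaled to the range 0 to 1, a net charge computed from Lys/Arg and Asp/Glu only (histidine treated as neutral), and the linear boundary ⟨H⟩ = (⟨R⟩ + 1.151)/2.785 commonly quoted from Uversky et al. (2000). The function names are hypothetical, and the sketch covers only this one criterion, not the combination with PONDR and DisEMBL described above.

```python
# Kyte-Doolittle hydropathy values (range -4.5 to 4.5).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _residues(seq):
    # Ignore characters that are not standard amino acids (assumed policy).
    residues = [aa for aa in seq.upper() if aa in KD]
    if not residues:
        raise ValueError("no standard amino acids in sequence")
    return residues


def mean_hydropathy(seq):
    """Mean Kyte-Doolittle hydropathy, rescaled so that each residue is in [0, 1]."""
    residues = _residues(seq)
    return sum((KD[aa] + 4.5) / 9.0 for aa in residues) / len(residues)


def mean_net_charge(seq):
    """Absolute net charge per residue, counting K/R as +1 and D/E as -1."""
    residues = _residues(seq)
    net = sum((aa in "KR") - (aa in "DE") for aa in residues)
    return abs(net) / len(residues)


def predicted_unfolded(seq):
    """True when the sequence lies on the disordered side of the boundary."""
    boundary = (mean_net_charge(seq) + 1.151) / 2.785
    return mean_hydropathy(seq) < boundary
```

A highly charged, poorly hydrophobic stretch falls on the disordered side of the boundary, whereas a stretch rich in Leu, Ile and Val does not.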
The different homologous protein families have been manually assigned to these classes and given a short functional description. We start with the original NCBI annotations in the NC_xxx files to assign the protein to a functional group. Whenever possible, we correlate the findings of this in-house procedure with experimental data retrieved by literature search using Entrez (http://www.ncbi.nlm.nih.gov/Entrez/index.html). Moreover, experts working on Paramyxoviridae within a collaborative network in which we are involved (http://virrnapoldrugtarget.univ-lyon1.fr/jdc_publicHomePage.htm) contribute to the functional annotation of viral proteins. We strongly encourage virologists to provide us with functional data, which will greatly help us in defining module boundaries. These data can be deposited by completing a form that is available online (http://afmb.cnrs-mrs.fr/stgen/vazymolo.html).
All modules defined in VaZyMolO are linked to a taxon and a virus name. This allows the phylogenetic distribution of each module among viruses to be assessed. We have used the nomenclature of the International Committee on Taxonomy of Viruses (ICTV) (http://ictvdb.mirror.ac.cn/index.htm) to name the species, genus and subfamily of each virus entry. The search by virus name is facilitated by a list of standard virus names.
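Because every module carries a taxon and a virus name, its distribution across virus families can be tabulated directly. The Python sketch below assumes that module assignments are available as `(module, family, virus)` tuples; this layout and both function names are hypothetical.

```python
from collections import defaultdict


def module_distribution(assignments):
    """Group viruses by module and family.

    assignments: iterable of (module, virus_family, virus_name) tuples.
    Returns {module: {family: sorted list of virus names}}.
    """
    dist = defaultdict(lambda: defaultdict(set))
    for module, family, virus in assignments:
        dist[module][family].add(virus)
    return {m: {f: sorted(v) for f, v in fams.items()}
            for m, fams in dist.items()}


def missing_in(dist, module, families):
    """Families from `families` in which `module` has not been annotated."""
    present = dist.get(module, {})
    return sorted(f for f in families if f not in present)
```

A table of this kind shows at a glance which modules are conserved across the families of interest and which are missing from some of them.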
The VaZyMolO modular assignment is accessible online through a web interface (http://www.vazymolo.org). The VaZyMolO interface lists the number of complete genomes in the current release as well as taxonomic and structural information (Table 1). It contains a link to a listing of the complete genome sequences of viruses sorted by virus name and family. The protein module library defined by VaZyMolO can be queried by a sequence search via a BLAST server. In future development, we plan to integrate an interactive graphical interface allowing, for each entry, easy navigation between schematic domain modularity, protein information, alignment, structure and phylogeny.
VaZyMolO is a tool devoted to the modular description and classification of both non-structural and structural viral proteins. Viral sequences are retrieved from different plant and animal virus families. Non-redundant complete genome sequences derived from NCBI are automatically clustered into homologous protein families, following a process of pre-classification, and modules are then defined. The primary basis of this classification is the set of structural motifs detected by a variety of complementary methods. Protein families are a rich source of information for functional and evolutionary studies. Sequence alignments of conserved regions highlight important conserved amino acids, allowing the definition of new motifs within proteins. VaZyMolO is presently tailored to studies of Mononegavirales, coronaviruses and flaviviruses. It will be updated with each new GenBank release, and we are currently incorporating other animal virus families. Functional annotation should benefit from contributions and feedback from collaborating experts in the field, via an online form. This comprehensive analysis facilitates the identification of many previously undetected module families of unknown function, thereby paving the way for their structural and functional analysis.
ACKNOWLEDGEMENTS |
---|
REFERENCES |
---|
Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L. & Sonnhammer, E. L. (2000). The Pfam protein families database. Nucleic Acids Res 28, 263–266.
Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Rapp, B. A. & Wheeler, D. L. (2002). GenBank. Nucleic Acids Res 30, 17–20.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). The Protein Data Bank. Nucleic Acids Res 28, 235–242.
Callebaut, I., Labesse, G., Durand, P., Poupon, A., Canard, L., Chomilier, J., Henrissat, B. & Mornon, J. P. (1997). Deciphering protein sequence information through hydrophobic cluster analysis (HCA): current status and perspectives. Cell Mol Life Sci 53, 621–645.
Corpet, F. (1988). Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res 16, 10881–10890.
Coutinho, P. & Henrissat, B. (1999a). Carbohydrate-active enzyme: an integrated approach. In Recent Advances in Carbohydrate Bioengineering, pp. 3–12. Edited by H. Gilbert, G. Davies, B. Henrissat & B. Svensson. Cambridge: The Royal Society of Chemistry.
Coutinho, P. & Henrissat, B. (1999b). The modular structure of cellulases and other carbohydrate-active enzymes: an integrated database approach. In Genetics, Biochemistry and Ecology of Cellulose Degradation, pp. 15–23. Edited by K. Ohmiya, K. Hayashi, K. Sakka, Y. Kobayashi, S. Karita & T. Kimura. Tokyo: Uni Publishers.
Egloff, M. P., Benarroch, D., Selisko, B., Romette, J. L. & Canard, B. (2002). An RNA cap (nucleoside-2'-O-)-methyltransferase in the flavivirus RNA polymerase NS5: crystal structure and functional characterization. EMBO J 21, 2757–2768.
Egloff, M. P., Ferron, F., Campanacci, V. & 7 other authors (2004). The severe acute respiratory syndrome-coronavirus replicative protein nsp9 is a single-stranded RNA-binding subunit unique in the RNA virus world. Proc Natl Acad Sci U S A 101, 3792–3796.
Ferron, F., Longhi, S., Henrissat, B. & Canard, B. (2002). Viral RNA-polymerases – a predicted 2'-O-ribose methyltransferase domain shared by all Mononegavirales. Trends Biochem Sci 27, 222–224.
Fischer, D. (2000). Hybrid fold recognition: combining sequence derived properties with evolutionary information. Pac Symp Biocomput, 119–130.
Iakoucheva, L. M., Kimzey, A. L., Masselon, C. D., Bruce, J. E., Garner, E. C., Brown, C. J., Dunker, A. K., Smith, R. D. & Ackerman, E. J. (2001). Identification of intrinsic order and disorder in the DNA repair protein XPA. Protein Sci 10, 560–571.
Johansson, K., Bourhis, J. M., Campanacci, V., Cambillau, C., Canard, B. & Longhi, S. (2003). Crystal structure of the measles virus phosphoprotein domain responsible for the induced folding of the C-terminal domain of the nucleoprotein. J Biol Chem 278, 44567–44573.
Jones, D. T. (1999). GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 287, 797–815.
Karlin, D., Ferron, F., Canard, B. & Longhi, S. (2003). Structural disorder and modular organization in Paramyxovirinae N and P. J Gen Virol 84, 3239–3252.
Kelley, L. A., MacCallum, R. M. & Sternberg, M. J. (2000). Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol 299, 499–520.
Krogh, A., Larsson, B., von Heijne, G. & Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305, 567–580.
Linding, R., Jensen, L. J., Diella, F., Bork, P., Gibson, T. J. & Russell, R. B. (2003a). Protein disorder prediction: implications for structural proteomics. Structure (Camb) 11, 1453–1459.
Linding, R., Russell, R. B., Neduva, V. & Gibson, T. J. (2003b). GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res 31, 3701–3708.
Pruitt, K. D., Tatusova, T. & Maglott, D. R. (2003). NCBI reference sequence project: update and current status. Nucleic Acids Res 31, 34–37.
Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile-based neural networks. Methods Enzymol 266, 525–539.
Tarbouriech, N., Curran, J., Ruigrok, R. W. & Burmeister, W. P. (2000). Tetrameric coiled coil domain of Sendai virus phosphoprotein. Nat Struct Biol 7, 777–781.
Uversky, V. N., Gillespie, J. R. & Fink, A. L. (2000). Why are "natively unfolded" proteins unstructured under physiologic conditions? Proteins 41, 415–427.
Received 7 September 2004;
accepted 8 December 2004.