Annotating the Human Proteome
Sandra Orchard, Henning Hermjakob and Rolf Apweiler
From EMBL-The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
ABSTRACT
The completion of the human genome has shifted the attention from deciphering the sequence to the identification and characterization of the encoded components. The identification and functional annotation of the proteome is here of special interest and starts with the identification of genes and transcripts as a prerequisite of proteome annotation. Gene predictions are very powerful in predicting most of the exons in a genome, but reliable gene structure predictions of both known and novel genes are dependent on existing transcript and protein information. An enormous amount of data already exists on the function of many human proteins, but this is scattered over many resources. Public domain databases are required to manage and collate this information and present it to the user community in both a human and machine readable manner.
In November 2004, the International Human Genome Sequencing Consortium published an article in Nature announcing the finishing of the sequencing of the human genome (1). The published sequence covered 99% of the euchromatic genome and contained only 341 gaps. This incredible achievement has rightly been hailed as a foundation for biomedical research in the decades ahead but, in practice, is only the first step on a long and complicated path to deciphering the complexity of the proteome content of the human cell.
To fully understand the workings of the human proteome, scientists must first be able to identify every protein coding region contained within the genome and the amino acid sequence of the proteins that these regions encode. In addition to this basic information, an incredible amount of metadata needs to be assembled. For example, the signals that trigger the expression of these proteins must be identified, and the actual protein expression experimentally observed and catalogued. The subsequent duration of gene expression, along with the factors that can control its eventual repression, the stability of mRNA transcripts, and the rates at which they are translated into protein products, must also be known and understood. Every potential site of posttranslational modification of the protein should be identified, and the conditions under which these modifications are made and their biological significance understood. The biological function of each protein molecule needs to be catalogued along with how this varies according to the cell type in which it is expressed and the temporal position within the cell's life cycle. The significance of the intermolecular interactions each protein makes with other proteins, lipids, and nucleic acids also needs to be understood at both a temporal and functional level and in conjunction with knowledge of the intracellular pathways and processes performed by the cell.
Not only does all this information first need to be generated, and this task is currently being tackled in laboratories all over the world, but the information then needs to be collated, annotated, and stored in a manner that makes it easily accessible to anyone with an interest in the field. Although much of this data is already available in published literature, and is being added to with every journal issue, the potential user is faced with a daunting task should they wish to search out information on a particular gene product and compare this with that pertaining to other related sequences. To assist in this task, public domain databases exist to gather this information and curate it to a similar standard allowing easy comparison across individual records, while still allowing the user to access the original underlying data.
ASSESSING THE SIZE OF THE CHALLENGE
Just how many protein coding genes are present in the human genome is a question that has interested scientists since long before the start of large-scale sequencing efforts. In 1994, Antequera estimated the number to be 80,000 (2) based on the number of CpG islands, while approximations based on EST data varied between 35,000 and 64,000 (3, 4). Deriving an accurate count entails reconciling the large number of experimentally determined sequences stored in the nucleotide databases, which range from individually sequenced mRNAs to large-scale collections of cDNAs, with the output of ab initio gene prediction tools. This is done by Ensembl (www.ebi.ac.uk/ensembl) (5), a database that organizes biological information around the sequences of large genomes. Developed in response to the acceleration of the public effort to sequence the human genome, Ensembl employs two gene prediction programs: GeneWise, which predicts gene structure using similar protein sequences (6), and Genomewise, which provides a gene structure final parse across cDNA- and EST-defined spliced structure (6). Both algorithms provide high-specificity gene prediction at the expense of some loss of sensitivity. The number of protein coding genes predicted by such algorithms has varied with each build and rebuild of the genomic sequence. However, the current prediction by Ensembl of the number of coding sequences based on the recently announced final release of the human genome is 22,221, excluding pseudogenes (Release 26.35.1). RefSeq also provides predicted coding sequences derived automatically from the human genome (www.ncbi.nlm.nih.gov/RefSeq/) (7).
Reconciliation of these datasets is performed by the International Protein Index (IPI)1 (www.ebi.ac.uk/IPI) (8), which was first developed for the original analysis of the human genome draft. IPI merges the experimentally determined protein sequences held in the UniProt sequence database (9) with the protein predictions of Ensembl and both protein predictions and experimentally derived datasets provided by RefSeq to provide a minimally redundant yet maximally complete set of human, mouse, rat, and zebrafish proteins consisting of one sequence per transcript. All annotated splice variants are included in IPI as separate entries (unless their protein sequences are identical). IPI is produced automatically by mapping between the different datasets on the basis of protein similarity and maintains cross-references between the primary data sources.
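The merging step can be illustrated with a minimal Python sketch. It is a deliberate simplification: records are grouped only when their sequences are exactly identical, whereas IPI maps entries on the basis of protein similarity. The function name, the record layout, and all accessions and sequences below are hypothetical.

```python
from collections import defaultdict


def merge_by_sequence(records):
    """Group (source, accession, sequence) records that share an identical sequence.

    Simplifying assumption: exact identity stands in for the similarity-based
    mapping used by the real resource.
    """
    clusters = defaultdict(list)
    for source, accession, sequence in records:
        clusters[sequence].append((source, accession))
    # One merged entry per distinct sequence, keeping cross-references
    # back to every source database that contributed it.
    return [
        {"sequence": sequence, "xrefs": sorted(members)}
        for sequence, members in clusters.items()
    ]


# Hypothetical accessions and sequences, for illustration only.
records = [
    ("UniProt", "P00001", "MKTAYIAK"),
    ("Ensembl", "ENSP0001", "MKTAYIAK"),
    ("RefSeq", "NP_0001", "MKTAYIAK"),
    ("Ensembl", "ENSP0002", "MKTAYIAKQR"),  # a splice variant stays a separate entry
]
merged = merge_by_sequence(records)
for entry in merged:
    print(entry["sequence"], entry["xrefs"])
```

Keeping the cross-references alongside each merged sequence is what lets a user move from the nonredundant entry back to the primary data sources.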
IPI is updated monthly but maintains stable identifiers (with incremental versioning) to allow the tracking of sequences in IPI between IPI releases. When proteins disappear from source databases and a corresponding sequence cannot be identified, IPI identifiers are archived and can be traced by researchers who used the identifier in a particular dataset. Similarly if two IPI entries are merged as a result of changing data within the source databases, a secondary identifier will be maintained to allow the original entry to be traced.
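The identifier bookkeeping described here can be sketched in Python as follows. This is not the IPI implementation; the class, its method names, and the example identifiers are assumptions made purely to show how versioning, archiving, and secondary identifiers allow an old identifier to be traced.

```python
class IdentifierRegistry:
    """Toy registry tracking stable identifiers across releases (hypothetical design)."""

    def __init__(self):
        self.active = {}       # identifier -> current version number
        self.archived = set()  # identifiers retired with no corresponding sequence
        self.secondary = {}    # retired identifier -> identifier it was merged into

    def add(self, identifier):
        self.active[identifier] = 1

    def update_sequence(self, identifier):
        # The identifier stays stable; only its version is incremented.
        self.active[identifier] += 1

    def archive(self, identifier):
        del self.active[identifier]
        self.archived.add(identifier)

    def merge(self, kept, retired):
        # The retired identifier survives as a secondary identifier of the kept entry.
        del self.active[retired]
        self.secondary[retired] = kept

    def resolve(self, identifier):
        # Follow secondary identifiers until a primary one is reached.
        while identifier in self.secondary:
            identifier = self.secondary[identifier]
        if identifier in self.active:
            return ("active", identifier, self.active[identifier])
        if identifier in self.archived:
            return ("archived", identifier, None)
        return ("unknown", identifier, None)


# Made-up identifiers, for illustration only.
registry = IdentifierRegistry()
for name in ("IPI00000001", "IPI00000002", "IPI00000003"):
    registry.add(name)
registry.update_sequence("IPI00000001")
registry.merge(kept="IPI00000001", retired="IPI00000002")
registry.archive("IPI00000003")
print(registry.resolve("IPI00000002"))
print(registry.resolve("IPI00000003"))
```

A researcher holding an identifier from an older dataset would call `resolve` to learn whether the entry is still current, was merged into another entry, or was archived.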
Version 3.0 of the human IPI suggests there to be 47,094 unique transcripts (including splice variants) produced from the human genome, with only 1,500 of those suggested solely by predictive programs. It is to be hoped that the existence of these final 1,500 gene products can be experimentally confirmed (or disproven) over the next few years, to give a full profile of the human transcriptome, although it is to be expected that the number of splice variants will increase as methods to both predict their existence and to experimentally confirm the predictions are improved.
The Reference Sequence (RefSeq) collection also aims to provide a comprehensive, integrated, nonredundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for the human proteome.
MANUAL ANNOTATION OF PROTEIN SEQUENCE AND FUNCTION
With protein sequence information coming from a variety of sources, including the translation of transcripts from many different origins such as genome projects, cDNAs, and individual gene sequencing in addition to data generated by direct protein sequencing, there arose the need for a single, central database where these sequences could be merged into a unique entry and annotated with additional functional and structural information. UniProt (www.ebi.ac.uk/uniprot/) (9) was created to fulfil this role and was formed through the merger of the existing Swiss-Prot (10), TrEMBL (10), and PIR (11) sequence databases. UniProt is produced in a collaboration between the European Bioinformatics Institute, the Swiss Institute of Bioinformatics, and the Protein Information Resource, Washington, D.C. UniProt comprises three components, each optimized for different uses. The UniProt Knowledgebase (UniProt) is the central access point for extensive curated protein information, including function, protein classification, and cross-references. The UniProt Nonredundant Reference (UniRef) databases combine closely related sequences into a single record to speed up searches. The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.
The central UniProt Knowledgebase consists of two core databases, UniProt/Swiss-Prot and UniProt/TrEMBL. Within Swiss-Prot, protein sequences from many sources are merged to provide a single entry, which describes all unique protein products produced by an individual gene from a particular species. Sequences are curated to correct sequencing errors and to identify both splice variants and sites of polymorphisms (12). These observations are mapped and given unique identifiers such that each original sequence may be recreated from within the entry. Potential sites of posttranslational modification are identified, and those confirmed by experimental observation are recorded as such. The protein is given both a systematic protein name and a gene name, and all known synonyms are recorded. Taxonomic data and citation information are checked and amended, if necessary. If further information on the protein is available, the entries contain detailed annotation on items such as the function(s) of the protein, enzyme-specific information (catalytic activity, cofactors, metabolic pathway, regulation mechanisms), biologically relevant domains and sites, molecular weight determined by mass spectrometry, subcellular location(s) of the protein, tissue-specific expression, developmentally specific expression of the protein, secondary structure, quaternary structure, similarities to other proteins, use of the protein in a biotechnological process, diseases associated with deficiencies in the protein, use of the protein as a pharmaceutical drug, etc. Extensive (and increasing) use of controlled vocabularies improves computer readability.
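How an original sequence might be recreated from a merged entry can be shown with a short Python sketch. The feature representation is an assumption (1-based, inclusive, non-overlapping coordinates on the displayed sequence, each with a replacement string) and is not the Swiss-Prot feature table format; the sequence and features are invented.

```python
def apply_features(canonical, features):
    """Rebuild a source sequence from the canonical one.

    Each feature is (start, end, replacement), with 1-based inclusive
    coordinates on the canonical sequence. Features are assumed not to
    overlap and are applied from right to left so that earlier
    coordinates remain valid after each substitution.
    """
    sequence = canonical
    for start, end, replacement in sorted(features, reverse=True):
        sequence = sequence[:start - 1] + replacement + sequence[end:]
    return sequence


# Invented example: one polymorphism and one splice-variant deletion.
canonical = "MKTAYIAKQRQISFVK"
features = [
    (5, 5, "H"),   # residue 5 replaced (Y -> H), a hypothetical polymorphism
    (9, 11, ""),   # residues 9-11 missing in a hypothetical isoform
]
print(apply_features(canonical, features))
```

Storing differences against a single displayed sequence, rather than every full-length variant, is what allows one entry to describe all products of a gene without losing any submitted sequence.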
High-quality manual annotation is time consuming and limits the rate at which the UniProt/Swiss-Prot dataset can grow. TrEMBL (translation of EMBL nucleotide sequence database) was established in 1996 and consists of computer-annotated entries derived from the translation of all coding sequences in the nucleotide sequence databases, except for coding sequences already included in Swiss-Prot. It also contains those protein sequences extracted from the literature or submitted directly by the user community that are not directly entered in Swiss-Prot. TrEMBL has a certain degree of sequence redundancy, namely a single gene from an individual species may be represented by more than one entry. The UniProt/TrEMBL data content is enhanced by extensive automatic annotation procedures (13). The UniProt Knowledgebase contains a nonredundant set of 29,000 human sequences; however, this will include many splice variants, which will eventually be merged into a single entry within UniProt/Swiss-Prot.
One of the many strengths of the UniProt Knowledgebase is the extensive cross-referencing made to other, more-specialized databases. No one database can hold all the diverse pieces of information on a protein but UniProt cross-references to more than 60 other data sources, including model organism, protein classification, and structural and disease databases (Fig. 1). UniProt may be regarded as a central hub of knowledge, which extends out to many additional sources to expand the information summarized in the source record.
UniProt/Swiss-Prot has initiated a major project to annotate all known human sequences according to the quality standards of Swiss-Prot: the Human Proteome Initiative (HPI) (14). To date, 11,638 human protein records have been fully manually annotated, with an additional 4,932 splice variants being identified within these entries (Table I).
PROTEIN CLASSIFICATION AND AUTOMATIC ANNOTATION OF FUNCTION
As previously stated, the process of manual annotation is necessarily slow and can only represent data that has been experimentally verified for a given protein in a particular species. In order to transfer some or all of this information to closely related proteins within the same species or across species, there must be a means of identifying closely related families of proteins or particular functional domains or regions within less closely related sequences. A number of groups have individually developed signature and sequence cluster-based methods for protein classification. Many of these have been collated and merged into an integrated resource, InterPro (www.ebi.ac.uk/interpro/) (15). InterPro (Release 8.1) is formed from signatures provided by PROSITE (16), PRINTS (17), Pfam (18), ProDom (19), SMART (20), TIGRFAMs (21), PIRSF (22), and SUPERFAMILY (23), with InterPro protein matches calculated for all UniProt proteins and cross-references within the UniProt entries. InterPro release 8.1 contains 11,330 entries, representing 2,933 domains, 8,126 families, 222 repeats, 27 active sites, 21 binding sites, and 20 posttranslational modification sites. Structural links are generated automatically to the CATH and SCOP databases through residue-by-residue mappings with UniProt proteins and there are links to all the PDB entries for proteins that match the InterPro entry, provided they cover the signatures within that entry.
By using the tool provided, InterProScan (www.ebi.ac.uk/InterProScan/) (24), users can take a novel protein sequence and ascribe function by similarity to known protein families and identify functional domains, active sites, or binding sites within the molecule. InterPro is utilized within UniProt as the basis for automatically transferring annotation from the manually annotated Swiss-Prot entries to similar, closely related protein sequences in the TrEMBL database. This adds valuable information to a large percentage of the 1.5 million protein sequences currently residing in the UniProt/TrEMBL database (Release 28.2).
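The idea of signature-based annotation transfer can be illustrated with a simplified Python sketch. It is not the procedure used by UniProt: the rule here (transfer only those keywords shared by every curated protein that matches a signature) is an assumption chosen for clarity, and the signature labels and keywords are invented.

```python
def consensus_by_signature(curated):
    """For each signature, keep the keywords common to all curated proteins matching it."""
    consensus = {}
    for entry in curated:
        for signature in entry["signatures"]:
            if signature in consensus:
                consensus[signature] &= entry["keywords"]
            else:
                consensus[signature] = set(entry["keywords"])
    return consensus


def predict_keywords(signatures, consensus):
    """Collect the consensus keywords of every signature matched by an unannotated protein."""
    predicted = set()
    for signature in signatures:
        predicted |= consensus.get(signature, set())
    return predicted


# Invented curated entries and signature labels, for illustration only.
curated = [
    {"signatures": {"SIG_A"}, "keywords": {"Kinase", "Transferase"}},
    {"signatures": {"SIG_A", "SIG_B"}, "keywords": {"Kinase", "Transferase", "Membrane"}},
]
consensus = consensus_by_signature(curated)
print(sorted(predict_keywords({"SIG_A"}, consensus)))
```

Requiring agreement across all curated matches is a conservative choice: it trades coverage for a lower risk of propagating an annotation that holds for only part of a family.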
CAPTURING THE PROTEIN EXPRESSION AND INTERACTIONS
While the human genome encodes all potentially expressed proteins, our understanding of the mechanisms governing protein expression is still too limited to reliably predict the protein content of a given cell in a given state. The systematic experimental analysis of protein expression is currently being pursued in a number of large-scale proteomics projects, e.g. the HUPO Plasma Proteome Project (25). A major challenge in the systematic capture of protein expression data is the diversity of experimental technologies and data formats in the field. The HUPO Proteomics Standards Initiative (PSI) (26, 27) develops community standards for proteomics to facilitate the capture, analysis, and distribution of proteomics data. Two data formats have now been produced by the MS group within the PSI: mzData, which allows the capture and interchange of peak list information, and mzIdent, which describes both protein identity and the corresponding peptides from which the identification was made. The PRIDE (PRoteomics IDEntification) database (www.ebi.ac.uk/pride) implements these standards and provides a public repository for protein identification data, which is extensively cross-referenced to UniProt and further external data sources (L. Martens, in preparation).
Proteins do not function in isolation, and the role of a protein may vary with the point in a cell cycle at which the molecule is expressed, the tissue in which it is present, and the availability of the other molecules with which it is capable of interacting. It is impossible to capture such a level of detail in any one database. UniProt/Swiss-Prot summarizes this information within the Comment lines but enhances this by extensive cross-referencing to other, more specialized data sources. For example, protein interaction data is captured in IntAct (www.ebi.ac.uk/intact), a freely available, open source database (28). Information within IntAct is manually curated from two sources: either extracted from existing literature by the curation team or directly submitted by laboratories prior to publication and made available to the journal reader concomitant to publication. IntAct also makes freely available a number of tools for viewing and analyzing the data, for example ProViz (29), a graph visualization system, and MiNe, an application that computes minimal connecting networks for protein sets.
The IntAct data model has three main components: Experiment, Interaction, and Interactor. An Experiment groups a number of Interactions from one publication and classifies the experimental conditions under which these Interactions have been generated. An Experiment may have only a single Interaction, or hundreds of Interactions in the case of large-scale experiments. An Interactor is a biological entity participating in an Interaction, usually a protein, but potentially also a DNA sequence or a small molecule. An Interaction contains one or more Interactors participating in the Interaction. Extensive use of controlled vocabularies both ensures data consistency and makes it easier for computers to parse and extract specific portions of the data; for example, it is easy to select all interactions identified by x-ray crystallography or to deselect all that were generated using yeast two-hybrid technology.
IntAct is fully compatible with the Proteomics Standards Initiative (PSI) XML interchange standard and can import and export data in both PSI-MI Level 1 and 2 (30). IntAct is also a founder member of the IMEx consortium, a collaboration of interaction databases, currently also including BIND (31), DIP (32), MINT (33), and MIPS (MPACT) (34), which plan to regularly exchange curated interaction data to ensure users may eventually access an identical dataset at any one of the member databases.
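The three components of this data model, and the kind of selection that controlled vocabularies make possible, can be sketched in Python. The class fields, the helper function, and the example identifiers and publication labels are assumptions for illustration and do not reproduce the actual IntAct schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Interactor:
    identifier: str
    kind: str = "protein"  # could also be a DNA sequence or a small molecule


@dataclass
class Interaction:
    interactors: List[Interactor]


@dataclass
class Experiment:
    publication: str
    detection_method: str  # a controlled-vocabulary term
    interactions: List[Interaction] = field(default_factory=list)


def select_interactions(experiments, include=None, exclude=()):
    """Return interactions whose experiment method is included and not excluded."""
    selected = []
    for experiment in experiments:
        if include is not None and experiment.detection_method not in include:
            continue
        if experiment.detection_method in exclude:
            continue
        selected.extend(experiment.interactions)
    return selected


# Hypothetical data, for illustration only.
a, b, c = Interactor("P1"), Interactor("P2"), Interactor("P3")
experiments = [
    Experiment("pub-1", "x-ray crystallography", [Interaction([a, b])]),
    Experiment("pub-2", "two hybrid", [Interaction([a, c]), Interaction([b, c])]),
]
structural = select_interactions(experiments, include={"x-ray crystallography"})
without_two_hybrid = select_interactions(experiments, exclude={"two hybrid"})
print(len(structural), len(without_two_hybrid))
```

Because the detection method is a fixed vocabulary term rather than free text, the filter reduces to a simple set membership test.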
Higher level information, namely the metabolic and signal transduction pathways that these molecules participate in, is collected and annotated in pathway databases such as Reactome (www.reactome.org) (35). Reactome is authored by biological researchers with expertise in their field and maintained and curated by the Reactome editorial staff. Reactome maintains links to the underlying proteins by cross-linking to specific UniProt records, with corresponding links to Reactome from the UniProt entry giving information as to which pathways or reactions each specific protein plays a role in.
MAINTAINING DATA COMPATIBILITY
Data on the human proteome is now spread over an increasing number of databases and a certain degree of compatibility must be maintained to allow all information on a particular protein to be parsed and collated. Use of a stable protein identifier, such as the UniProt accession number, or of a stable gene identifier, such as those generated by the HUGO Gene Nomenclature Committee (36), allows one degree of compatibility in that the protein can be unambiguously identified across all the databases. Other efforts in establishing data standardization largely center on the increasing use of controlled vocabularies and ontologies. A leader in this field is the GO Consortium (geneontology.org), which produces terms to describe the attributes of gene products, enabling the description of their molecular function, the biological processes in which they play a role, and the cellular components in which they are expressed (37). The GO Annotation project, which combines GO annotations from a number of different sources, has added 34,791 manual GO annotations to 9,387 UniProt human proteins. Again, manual annotation is slow and the process can be supplemented by automatic annotation based on InterPro pattern matches. In this manner, a further 65,855 terms have been added to 22,624 human proteins (38). GO terms are used throughout many of the databases cross-referenced by UniProt and facilitate database querying and comparability.
Other efforts in this field include the development of ontologies to more accurately describe gene expression data, for example the work of the eVOC group, which has developed orthogonal ontologies to describe anatomical system, cell type, pathology, and developmental stage (39), and also the developing Sequence Ontology aimed at describing biological sequences. Many of these controlled vocabularies are hosted at the OBO (Open Biological Ontologies) site (obo.sourceforge.net).
SUMMARY
We are still a long way from a full understanding of the human proteome, in particular of the specific role each molecule plays in the cellular context, but our knowledge is increasing daily, through both small-scale, detailed studies and large-scale proteomics approaches. A wealth of data is being generated and made publicly accessible through an array of interlinked databases. UniProt/Swiss-Prot provides a high-quality reference set of carefully manually annotated protein sequences. It is supplemented by UniProt/TrEMBL, which contains automatically annotated protein sequences not yet in UniProt/Swiss-Prot. Together, they form the UniProt Knowledgebase, a central, high-quality, and extensively cross-referenced information hub for protein sequences. The IPI combines the UniProt human protein sequences with Ensembl and RefSeq human sequence sets into a nonredundant database of all publicly known human protein sequences. UniProt and IPI provide extensive cross-references to more than 60 external databases, among them Ensembl (genomic sequence), InterPro and GO (functional classification), PRIDE (protein identifications), IntAct (protein interactions), and Reactome (pathways). These links allow the wealth of publicly available human proteome knowledge to be accessed in a systematic, well-structured manner, providing a solid basis for new discovery and research.
FOOTNOTES
Received, January 11, 2005, and in revised form, February 2, 2005.
Published, MCP Papers in Press, February 2, 2005, DOI 10.1074/mcp.R500003-MCP200
1 The abbreviation used is: IPI, International Protein Index. 
To whom correspondence should be addressed: EMBL-The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom. Tel.: 44-(0)-1223 494-675; E-mail: orchard{at}ebi.ac.uk
REFERENCES
1. Stein, L. D. (2004) Human genome: end of the beginning. Nature 431, 915-916
2. Antequera, F., and Bird, A. (1994) Predicting the total number of human genes. Nat. Genet. 8, 114
3. Ewing, B., and Green, P. (2000) Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet. 25, 232-234
4. Fields, C., Adams, M. D., White, O., and Venter, J. C. (1994) How many genes in the human genome? Nat. Genet. 7, 345-346
5. Birney, E., Andrews, T. D., Bevan, P., Caccamo, M., Chen, Y., Clarke, L., Coates, G., Cuff, J., Curwen, V., Cutts, T., Down, T., Eyras, E., Fernandez-Suarez, X. M., Gane, P., Gibbins, B., Gilbert, J., Hammond, M., Hotz, H. R., Iyer, V., Jekosch, K., Kahari, A., Kasprzyk, A., Keefe, D., Keenan, S., Lehvaslaiho, H., McVicker, G., Melsopp, C., Meidl, P., Mongin, E., Pettett, R., Potter, S., Proctor, G., Rae, M., Searle, S., Slater, G., Smedley, D., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Storey, R., Ureta-Vidal, A., Woodwark, K. C., Cameron, G., Durbin, R., Cox, A., Hubbard, T., and Clamp, M. (2004) An overview of Ensembl. Genome Res. 14, 925-928
6. Birney, E., Clamp, M., and Durbin, R. (2004) GeneWise and Genomewise. Genome Res. 14, 988-995
7. Pruitt, K. D., Tatusova, T., and Maglott, D. R. (2005) NCBI Reference Sequence (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33, 501-504
8. Kersey, P. J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E., and Apweiler, R. (2004) The International Protein Index: An integrated database for proteomics experiments. Proteomics 4, 1985-1988
9. Bairoch, A., Apweiler, R., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M. J., Natale, D. A., O'Donovan, C., Redaschi, N., and Yeh, L. S. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res. 33, 154-159
10. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M. C., Estreicher, A., Gasteiger, E., Martin, M. J., Michoud, K., O'Donovan, C., Phan, I., Pilbout, S., and Schneider, M. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365-370
11. Wu, C. H., Yeh, L. S., Huang, H., Arminski, L., Castro-Alvear, J., Chen, Y., Hu, Z., Kourtesis, P., Ledley, R. S., Suzek, B. E., Vinayaka, C. R., Zhang, J., and Barker, W. C. (2003) The Protein Information Resource. Nucleic Acids Res. 31, 345-347
12. Farriol-Mathis, N., Garavelli, J. S., Boeckmann, B., Duvaud, S., Gasteiger, E., Gateau, A., Veuthey, A. L., and Bairoch, A. (2004) Annotation of post-translational modifications in the Swiss-Prot knowledge base. Proteomics 4, 1537-1550
13. Wieser, D., Kretschmann, E., and Apweiler, R. (2004) Filtering erroneous protein annotation. Bioinformatics 20 (Suppl. 1), I342-I347
14. O'Donovan, C., Apweiler, R., and Bairoch, A. (2001) The human proteomics initiative (HPI). Trends Biotechnol. 19, 178-181
15. Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Binns, D., Bradley, P., Bork, P., Bucher, P., Cerruti, L., Copley, R., Courcelle, E., Das, U., Durbin, R., Fleischmann, W., Gough, J., Haft, D., Harte, N., Hulo, N., Kahn, D., Kanapin, A., Krestyaninova, M., Lonsdale, D., Lopez, R., Letunic, I., Madera, M., Maslen, J., McDowall, J., Mitchell, A., Nikolskaya, A. N., Orchard, S., Pagni, M., Ponting, C. P., Quevillon, E., Selengut, J., Sigrist, C. J., Silventoinen, V., Studholme, D. J., Vaughan, R., and Wu, C. H. (2005) InterPro, progress and status in 2005. Nucleic Acids Res. 33, 201-205
16. Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist, C. J. A., Hofmann, K., and Bairoch, A. (2002) The PROSITE database, its status in 2002. Nucleic Acids Res. 30, 235-238
17. Attwood, T. K. (2002) The PRINTS database: A resource for identification of protein families. Brief Bioinform. 3, 252-263
18. Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S. R., Griffiths-Jones, S., Howe, K. L., Marshall, M., and Sonnhammer, E. L. (2002) The Pfam protein families database. Nucleic Acids Res. 30, 276-280
19. Corpet, F., Servant, F., Gouzy, J., and Kahn, D. (2000) ProDom and ProDom-CG: Tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 28, 267-269
20. Ponting, C. P., Schultz, J., Milpetz, F., and Bork, P. (1999) SMART: Identification and annotation of domains from signalling and extracellular protein sequences. Nucleic Acids Res. 27, 229-232
21. Haft, D. H., Selengut, J. D., and White, O. (2003) The TIGRFAMs database of protein families. Nucleic Acids Res. 31, 371-373
22. Huang, H., Xiao, C., and Wu, C. H. (2000) ProClass protein family database. Nucleic Acids Res. 28, 273-276
23. Andreeva, A., Howorth, D., Brenner, S. E., Hubbard, T. J. P., Chothia, C., and Murzin, A. G. (2004) SCOP database in 2004: Refinements integrate structure and sequence family data. Nucleic Acids Res. 32, D226-D229
24. Zdobnov, E. M., and Apweiler, R. (2001) InterProScan: An integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847-848
25. Omenn, G. S. (2004) The Human Proteome Organization Plasma Proteome Project pilot phase: Reference specimens, technology platform comparisons, and standardized data submissions and analyses. Proteomics 4, 1235-1240
26. Orchard, S., Hermjakob, H., Julian, R. K., Jr., Runte, K., Sherman, D., Wojcik, J., Zhu, W., and Apweiler, R. (2004) Common interchange standards for proteomics data: Public availability of tools and schema. Proteomics 4, 490-491
27. Orchard, S., Taylor, C. F., Hermjakob, H., Weimin-Zhu, Julian, R. K., Jr., and Apweiler, R. (2004) Advances in the development of common interchange standards for proteomic data. Proteomics 4, 2363-2365
28. Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., Mudali, S., Kerrien, S., Orchard, S., Vingron, M., Roechert, B., Roepstorff, P., Valencia, A., Margalit, H., Armstrong, J., Bairoch, A., Cesareni, G., Sherman, D., and Apweiler, R. (2004) IntAct: An open source molecular interaction database. Nucleic Acids Res. 32, D452-D455
29. Iragne, F., Nikolski, M., Mathieu, B., Auber, D., and Sherman, D. (2005) ProViz: Protein interaction visualization and exploration. Bioinformatics 21, 272-274
30. Hermjakob, H., Montecchi-Palazzi, L., Bader, G., Wojcik, J., Salwinski, L., Ceol, A., Moore, S., Orchard, S., Sarkans, U., von Mering, C., Roechert, B., Poux, S., Jung, E., Mersch, H., Kersey, P., Lappe, M., Li, Y., Zeng, R., Rana, D., Nikolski, M., Husi, H., Brun, C., Shanker, K., Grant, S. G., Sander, C., Bork, P., Zhu, W., Pandey, A., Brazma, A., Jacq, B., Vidal, M., Sherman, D., Legrain, P., Cesareni, G., Xenarios, I., Eisenberg, D., Steipe, B., Hogue, C., and Apweiler, R. (2004) The HUPO PSI's molecular interaction format: A community standard for the representation of protein interaction data. Nat. Biotechnol. 22, 177-183
31. Bader, G. D., Betel, D., and Hogue, C. W. V. (2003) BIND: The Biomolecular Interaction Network database. Nucleic Acids Res. 31, 248-250
32. Xenarios, I., Salwinski, L., Duan, X. J., Higney, P., Kim, S., and Eisenberg, D. (2002) DIP: The Database of Interacting Proteins. A research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 30, 303-305
33. Zanzoni, A., Montecchi-Palazzi, L., Quondam, M., Ausiello, G., Helmer-Citterich, M., and Cesareni, G. (2002) MINT: A Molecular INTeraction database. FEBS Lett. 513, 135-140
34. Pagel, P., Kovac, S., Oesterheld, M., Brauner, B., Dunger-Kaltenbach, I., Frishman, G., Montrone, C., Mark, P., Stumpflen, V., Mewes, H. W., Ruepp, A., and Frishman, D. (2004) The MIPS mammalian protein-protein interaction database. Bioinformatics [Epub ahead of print]
35. Robertson, M. (2004) Reactome: Clear view of a starry sky. Drug Discov. Today 9, 684-685
36. Wain, H. M., Lush, M. J., Ducluzeau, F., Khodiyar, V. K., and Povey, S. (2004) Genew: The Human Gene Nomenclature Database, 2004 updates. Nucleic Acids Res. 32, D255-D257
37. Harris, M. A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., Eilbeck, K., Lewis, S., Marshall, B., Mungall, C., Richter, J., Rubin, G. M., Blake, J. A., Bult, C., Dolan, M., Drabkin, H., Eppig, J. T., Hill, D. P., Ni, L., Ringwald, M., Balakrishnan, R., Cherry, J. M., Christie, K. R., Costanzo, M. C., Dwight, S. S., Engel, S., Fisk, D. G., Hirschman, J. E., Hong, E. L., Nash, R. S., Sethuraman, A., Theesfeld, C. L., Botstein, D., Dolinski, K., Feierbach, B., Berardini, T., Mundodi, S., Rhee, S. Y., Apweiler, R., Barrell, D., Camon, E., Dimmer, E., Lee, V., Chisholm, R., Gaudet, P., Kibbe, W., Kishore, R., Schwarz, E. M., Sternberg, P., Gwinn, M., Hannick, L., Wortman, J., Berriman, M., Wood, V., de la Cruz, N., Tonellato, P., Jaiswal, P., Seigfried, T., and White, R. (2004) Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 32, D258-D261
38. Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R., and Apweiler, R. (2004) The Gene Ontology Annotation (GOA) Database: Sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 32, D262-D266
39. Kelso, J., Visagie, J., Theiler, G., Christoffels, A., Bardien, S., Smedley, D., Otgaar, D., Greyling, G., Jongeneel, C. V., McCarthy, M. I., Hide, T., and Hide, W. (2003) eVOC: A controlled vocabulary for unifying gene expression data. Genome Res. 13, 1222-1230