Gene homology resources on the World Wide Web

Alexander Turchin1 and Isaac S. Kohane2

1 Department of Medicine, New England Medical Center, Boston 02111
2 Children’s Hospital Medical Informatics Program, Division of Endocrinology, Children’s Hospital, Boston, Massachusetts 02215


    ABSTRACT
 
As the amount of information available to biologists increases exponentially, data analysis becomes progressively more challenging. Sequence homology has been a traditional tool in the researcher’s armamentarium; it is a versatile instrument that can assist in numerous tasks, from establishing the function of a gene to determining the evolutionary development of an organism. Consequently, numerous specialized tools have been established in the public domain (most commonly, on the World Wide Web) to help investigators use sequence homology in their research. These homology databases differ both in the techniques they use to compare sequences and in the size of the unit of analysis, which can be the whole gene, a domain, or a motif. In this paper, we present a systematic review of the inner workings of the most commonly used databases and offer guidelines for their use.

ortholog; database; internet


    INTRODUCTION
 
AS THE PACE OF DISCOVERY in the biological sciences continues to accelerate exponentially, navigating this wealth of knowledge becomes increasingly challenging. GenBank alone contains (as of March 2002) more than 17 billion bases of DNA sequence from over 100,000 species, and it doubles in size approximately every 14 months (34). Therefore, to achieve a qualitative transition from plain data gathering to a systematic understanding of biological function, it is imperative that we find a way to organize and process the information.

A number of techniques have been developed to aid researchers in this daunting task. One of the time-tested methods is to compare newly obtained sequence data with sequences already known and then attempt to infer function from sequence homology. Because homology implies an evolutionary and structural relationship between sequences (as opposed to a similarity that may have arisen by chance), this type of analysis requires a scientifically rigorous approach. Over the years this approach has been refined to assist researchers in a number of tasks, from determining the elementary functional blocks of a newly discovered protein to analyzing the evolutionary development of an organism. Numerous homology resources have been devised and implemented in the public domain [typically, on the World Wide Web (WWW)], many of them specialized in solving a specific type of problem. In this paper we aim to offer a systematic review of the ones most commonly used; their uniform resource locators (URLs) can be found in Table 1.


 
Table 1. Web-based homology databases in common use

 

    SEQUENCE ALIGNMENT TOOLS
 
Sequence alignment programs were the first tools available to biologists to look for putative homology, and they are still very popular today. A query sequence (for example, that of a newly discovered gene) is submitted to the program, which then searches through a database of known sequences (e.g., GenBank or SWISS-PROT) to find possible homologs.

Of the programs still in common use today, FASTA (Fast-All) (38) was the first one to be developed. It achieved significantly higher speeds than previously available algorithms by matching sequence patterns or words, called "k-tuples," as opposed to comparing individual residues in two sequences, and then building a local alignment by extending these word matches with a predefined penalty for gaps. Several implementations of FASTA exist, including TFASTA, which compares a protein sequence to a DNA sequence by translating each DNA sequence into all six possible reading frames and then comparing each frame to the protein sequence, and LFASTA, which identifies one or more regions of similarity between two sequences. FASTA searches are available from numerous WWW servers throughout the world.
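The word-matching idea can be illustrated with a minimal Python sketch. It is not the FASTA implementation: it only indexes the k-tuples of one sequence and tallies how many words it shares with a query on each diagonal (subject position minus query position), which is the step that precedes extension into a gapped local alignment. The function names and the default k are assumptions made for this illustration.

```python
from collections import defaultdict


def ktuple_index(seq, k):
    """Map every k-length word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k=2):
    """Count shared k-tuples per diagonal (subject position - query position)."""
    index = ktuple_index(subject, k)
    diagonals = defaultdict(int)
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            diagonals[s - q] += 1
    return dict(diagonals)


def best_diagonal(query, subject, k=2):
    """Return (diagonal, word count) for the diagonal richest in word matches."""
    hits = diagonal_hits(query, subject, k)
    return max(hits.items(), key=lambda kv: kv[1]) if hits else None


# Toy sequences, invented for the example.
print(best_diagonal("ACDEFG", "XXACDEFGYY"))  # -> (2, 5)
```

Because each word is looked up in a precomputed index rather than compared residue by residue, the cost of the search depends mainly on the number of word matches, which is the source of the speed gain described above.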

Basic Local Alignment Search Tool (BLAST) (2) was developed later and added further improvements in processing speed. The original version of BLAST produced only ungapped local alignments (in contrast to the gapped alignments built by FASTA), so that several different matching segments [high-scoring pairs (HSPs)] could be reported for a pair of sequences. The scores for these were combined to achieve the minimum score required for the pair of sequences to be presented as a match; if one or more of the HSPs were missed, the entire sequence pair might not be reported, resulting in somewhat lower sensitivity. A later version of the algorithm, gapped BLAST, improved sensitivity and speed by adding the requirement that two words (short sequence segments) be matched within a predefined distance of each other for the match to be extended. Because the extension process was the most time-consuming step, this gain in speed allowed lower thresholds to be set for word matches and thus an overall improvement in sensitivity. A number of different versions of the BLAST algorithm exist, allowing alignment of different types of nucleic acid and protein sequences. Just like FASTA, BLAST implementations are widely available on the Internet [e.g., from the National Center for Biotechnology Information (NCBI)].
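The two-word requirement can be sketched as a simple filter over word hits. The version below is a hedged illustration rather than the BLAST code: it assumes that the two words must lie on the same diagonal and must not overlap, and the word length and window size are parameters chosen for this example only.

```python
from collections import defaultdict


def two_hit_seeds(hits, word_len=3, window=40):
    """Select pairs of word hits that would trigger an extension.

    hits is a list of (query_pos, subject_pos) word matches. A seed is a pair
    of neighboring hits on the same diagonal whose start positions are at
    least word_len apart (no overlap) and at most window apart.
    """
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[s - q].append(q)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for prev, cur in zip(positions, positions[1:]):
            if word_len <= cur - prev <= window:
                seeds.append((diagonal, prev, cur))
    return seeds


# Invented word hits: only the first two are close enough on one diagonal.
print(two_hit_seeds([(0, 10), (5, 15), (100, 110), (3, 50)]))
```

Only the seeds returned by such a filter would be passed to the costly extension step, which is why the word-match threshold itself can be lowered without increasing the overall running time.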


    DATABASES
 
Sequence homology analysis has been employed to answer a number of different questions. First, at the "gene" level, the sequence of the entire gene can be used to look for paralogs (derived from a common ancestor gene by a duplication event; Ref. 13) and orthologs (derived by a speciation event; Ref. 13). These can help elucidate the function of a newly discovered gene or establish evolutionary relationships between the species being studied. A second approach is to break the gene sequence up into domains (large blocks of protein sequence evolutionarily conserved across multiple genes). This type of analysis can be used to determine the function of the gene product from its components when no obvious paralogs or orthologs for the entire gene can be found; it can also guide bioengineering efforts to alter the function of the protein being studied. When applied to whole genomes, domain analysis has been shown to increase the fraction of proteins with tentatively predicted functions from 40–60% (based on high-level sequence similarity alone) to 75–85% (14). Finally, at the highest level of sequence "resolution," homology analysis is applied to discern motifs, short (10–20 amino acid residues) stretches of protein sequence responsible for the most elementary structural and functional components of a protein (e.g., a protein kinase C phosphorylation site or an endoplasmic reticulum targeting sequence). These can help to establish the function of a new domain or to determine the regulatory pathway for the enzyme under investigation.

According to the needs of the scientists, specialized homology databases have been developed. They fall into two main categories: ortholog databases and protein subcomponent databases.

Ortholog databases (see Table 2; also see the Supplemental Table, published online at the Physiological Genomics web site)1 employ the "holistic" or "gene level" approach. An example is EGO ("eukaryotic gene orthologs"), a screenshot of which is shown in Fig. 1. Ortholog databases contain pairs of genes from different species that have been determined to be orthologous, either by manual analysis of available experimental evidence or computationally by DNA sequence alignment (in which case the assignment is putative). These genes are typically similar in overall sequence, and their product proteins perform closely related functions. Because all gene pairs are predetermined, these databases frequently cannot be searched with a newly discovered sequence but can only be queried (by gene name or database ID) for homologs of already "established" genes. They are typically also limited in the number of species they cover.


 
Table 2. Ortholog databases: basic features

 


 
Fig. 1. EGO ("eukaryotic gene orthologs") database.

 
Protein subcomponent databases contain information on commonly encountered protein motifs or domains, which can be either curated (completely or partially) from the published literature or calculated (in which case, again, the assignments are only tentative). A screenshot of SMART ("simple modular architecture research tool"), a typical representative of subcomponent databases, can be seen in Fig. 2. Many sites combine three steps: manual "seeding" of a domain family with curated data, a computational search through sequence databases for other genes potentially containing the same motif or domain, and a final manual "pruning" of the resulting alignment. These sites can usually be queried with a protein sequence, which is then matched against the motifs and/or domains in the database. Different sites employ different matching techniques [regular expressions, fingerprints, hidden Markov models (HMMs), etc.], resulting in differences in the sensitivity and specificity of the queries.



 
Fig. 2. SMART ("simple modular architecture research tool") database.

 
Ortholog Databases
COG.
COG ("clusters of orthologous groups"; Ref. 43) stores orthologs in unicellular organisms whose complete genomes have been sequenced. The basic premise of the COG construction technique is that any three proteins from different species whose sequences are more similar to each other than to any other proteins in these species are likely to form an ortholog group. Putative orthologs are first established by running an all-against-all protein sequence comparison using the gapped BLAST program (2); triangles of mutually consistent genome-specific best hits are thus determined, and triangles with common sides are then merged into orthologous groups. These tentative groups are then proofread by an expert to weed out false positives. The database can be searched with a protein sequence using the Cognitor program (43), which employs a similar algorithm of determining reciprocal best matches to establish whether the query sequence belongs to any of the COGs in the database. Sensitivity/specificity thresholds of this query can be set by choosing the number of BeTs ("best hits") to clades (in other words, the number of phylogenetic lineages with which the query sequence should be a reciprocal best match).
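The triangle construction can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the COG pipeline: "mutually consistent" is taken here to mean that all three pairs are reciprocal best hits, a triangle is merged into a group whenever it shares two proteins with it, and the data structures (`best_hit`, `genome_of`) are hypothetical.

```python
from itertools import combinations


def find_triangles(best_hit, genome_of):
    """Find triples of proteins from three genomes that are mutual best hits.

    best_hit[(protein, genome)] is the best-scoring match of protein in genome;
    genome_of[protein] is the genome the protein belongs to.
    """
    def mutual(a, b):
        return (best_hit.get((a, genome_of[b])) == b
                and best_hit.get((b, genome_of[a])) == a)

    triangles = []
    for a, b, c in combinations(sorted(genome_of), 3):
        three_genomes = len({genome_of[a], genome_of[b], genome_of[c]}) == 3
        if three_genomes and mutual(a, b) and mutual(b, c) and mutual(a, c):
            triangles.append(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) into larger groups."""
    groups = [set(t) for t in triangles]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(groups)), 2):
            if len(groups[i] & groups[j]) >= 2:
                groups[i] |= groups[j]
                del groups[j]
                merged = True
                break  # restart the scan because the list has changed
    return groups
```

In this sketch two triangles that share the side a1-b1, for example {a1, b1, c1} and {a1, b1, d1}, would be merged into one four-member group, which would then go to an expert for proofreading as described above.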

WIT.
WIT ("what is there"; Ref. 36) is a database supported by Argonne National Laboratory whose main focus is on organizing sequenced genomes into functional metabolic pathways. Among other features of the site is "ortholog cluster retrieval." The algorithm used to form the clusters is (similarly to COG) based on the concept of bidirectional best hits; however, WIT imposes some additional restrictions, which make its selection process less sensitive but more specific. Like COG, its data are largely limited to prokaryotes (except that it also includes Caenorhabditis elegans). Query options are limited: users can only browse clusters belonging to a subset of species and cannot query the database using a specific sequence or a gene name.

HomoloGene.
HomoloGene (22) is an NCBI database that includes orthologs between 12 animal and plant species. Some of the ortholog pairs are curated from the literature and from curated databases, such as the Mouse Genome Database (MGD) and the Zebrafish Information database (ZFIN). The rest of the ortholog pairs (considered putative) are calculated by finding genes that constitute a two-way reciprocal best match (i.e., gene A in species a comes up as the best match when the sequence of gene B from species b is used to search all available sequences from species a AND gene B is the best match when gene A’s sequence is used to search all available sequences from species b) using the MegaBLAST program (47); cases where a three-way reciprocal best match is found are considered more rigorous and are termed "consistent ortholog groups." mRNA sequences are used for the computation.
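The two-way reciprocal best match criterion translates directly into code. The following Python sketch assumes that alignment scores between the genes of two species have already been computed (by whatever program) and are supplied as dictionaries; the function names and the toy scores are invented for the illustration.

```python
def best_hits(scores):
    """Return the best-scoring subject for each query.

    scores maps (query, subject) pairs to an alignment score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_matches(a_vs_b, b_vs_a):
    """Pairs (A, B) where B is A's best match and A is B's best match."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Toy scores: A2 also prefers B1, but B1 prefers A1, so only (A1, B1) qualifies.
a_vs_b = {("A1", "B1"): 90, ("A1", "B2"): 50, ("A2", "B1"): 80, ("A2", "B2"): 60}
b_vs_a = {("B1", "A1"): 90, ("B1", "A2"): 80, ("B2", "A1"): 50, ("B2", "A2"): 60}
print(reciprocal_best_matches(a_vs_b, b_vs_a))  # -> [('A1', 'B1')]
```

A three-way match of the kind termed a "consistent ortholog group" would require the same test to hold for every pair among three species.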

LocusLink.
LocusLink (38), while mostly geared toward describing curated sequence and descriptive information about genetic loci, also includes in its distribution a downloadable file "homol_seq_pairs.gz." This file contains pairs of LocusLink loci from different species whose mRNA sequences (as found in RefSeq; Ref. 38) have been found to constitute a two-way reciprocal best match by BLAST (2) alignment (29). At this time it is not accessible for queries directly from the LocusLink WWW interface; therefore, a user needs to download the file and either upload it into a relational database (e.g., Microsoft Access), which would allow complex queries, or search the actual text file for a keyword, thus forgoing the capability to discriminate between different fields in the table.
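As an illustration of the first option, the sketch below loads a tab-delimited file of ortholog pairs into a relational table so that individual fields can be queried. It uses SQLite (through the Python standard library) rather than Microsoft Access, and the column names and sample rows are hypothetical: the actual layout of "homol_seq_pairs.gz" should be checked against its documentation before the column list is filled in.

```python
import csv
import io
import sqlite3


def load_pairs(handle, columns):
    """Load tab-delimited rows from handle into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    column_defs = ", ".join(f'"{name}" TEXT' for name in columns)
    conn.execute(f"CREATE TABLE pairs ({column_defs})")
    placeholders = ", ".join("?" for _ in columns)
    # Skip rows whose field count does not match the assumed layout.
    rows = (row for row in csv.reader(handle, delimiter="\t")
            if len(row) == len(columns))
    conn.executemany(f"INSERT INTO pairs VALUES ({placeholders})", rows)
    return conn


# Hypothetical layout and made-up rows, for demonstration only.
columns = ["locus_a", "species_a", "locus_b", "species_b"]
sample = io.StringIO("1\thuman\t2\tmouse\n3\thuman\t4\trat\n")
conn = load_pairs(sample, columns)
query = "SELECT locus_a, locus_b FROM pairs WHERE species_b = ?"
print(conn.execute(query, ("mouse",)).fetchall())  # -> [('1', '2')]
```

For the real file, the handle could be obtained with `gzip.open("homol_seq_pairs.gz", "rt")`. Unlike a keyword search of the text file, the query above distinguishes between fields, so a species name appearing in one column is not confused with the same string appearing in another.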

Mammalian Homology and Comparative Maps.
Unlike the previously described databases, Mammalian Homology and Comparative Maps (7) does not contain any calculated (and therefore considered putative) orthologs but is a purely curated repository of mammalian gene homologies. All ortholog pairs stored in the database are based on published evidence. The following categories of evidence have been used by the curators: amino acid sequence comparison, coincident expression, conserved map location, cross-hybridization to the same molecular probe, formation of functional heteropolymers, functional complementation, immunologic cross reaction, nucleotide sequence comparison, similar response to specific inhibitors, similar subcellular location, similar substrate specificity, and similar subunit structure. Literature references are provided for each category of evidence that was used to establish a particular orthologous relationship. All of these ortholog relationships are also available through HomoloGene; the advantage of Mammalian Homology and Comparative Maps is in its focus on curated relationships (thus increasing the specificity of the query result) as well as in very detailed documentation of the evidence.

TIGR EGO.
EGO, formerly referred to as The Institute for Genomic Research (TIGR) Orthologous Gene Alignments, or TOGA (27), is a purely computed ortholog database. It is generated by pair-wise comparison between the tentative consensus (TC) sequences that comprise the TIGR Gene Indices from individual organisms. Tentative ortholog groups (TOGs) are identified as reciprocal best matches between at least three organisms with a minimum of 75% sequence identity over at least 400 bp for any single sequence match. One important difference between EGO and other ortholog databases is that the TC sequences it uses to calculate orthologs are themselves computed from alignments of expressed sequence tags (ESTs) as well as mRNA sequences.

Protein Subcomponent Databases
Because the sequences analyzed by protein subcomponent databases are shorter than those handled by ortholog databases, a larger number of different techniques can be employed to establish homology. These fall into two large groups: pattern methods (5) and sequence alignment. A summary of the main features of the most commonly used subcomponent databases can be found in Tables 4 and 5.


 
Table 4. Protein subcomponent databases: basic features

 

 
Table 5. Protein subcomponent databases: continued

 
Pattern databases.
The simplest approach to pattern matching is a "regular expression." In this method, every position in the sequence being matched is assigned a set of possible amino acid residues which can vary from a single possible amino acid to "any." A variation of this approach, which some term a "permissive" regular expression, does not use single amino acids when specifying the nature of different positions in the sequence, but rather small (sometimes partially overlapping) subsets of amino acids grouped by common chemical properties and empirically found to frequently replace each other in known protein sequences (35) (Table 3).
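The difference between a strict and a "permissive" regular expression can be shown with a short Python sketch. The substitution groups used here are hypothetical and serve only as an illustration; they are not the published groups. Each position of a pattern is given as `None` (any residue), a string of allowed residues, or the name of a group.

```python
import re

# Hypothetical substitution groups, invented for this example.
GROUPS = {"aliphatic": "ILV", "hydroxyl": "ST", "basic": "KR"}


def to_regex(positions):
    """Build a compiled regular expression from a list of pattern positions."""
    parts = []
    for position in positions:
        if position is None:
            parts.append(".")  # any amino acid
        else:
            allowed = GROUPS.get(position, position)
            parts.append(allowed if len(allowed) == 1 else f"[{allowed}]")
    return re.compile("".join(parts))


strict = to_regex(["S", None, "R"])                  # S, anything, R
permissive = to_regex(["hydroxyl", None, "basic"])   # [ST], anything, [KR]

print(strict.pattern, permissive.pattern)            # -> S.R [ST].[KR]
print(strict.search("AATGKA") is None)               # -> True
print(permissive.search("AATGKA").start())           # -> 2
```

The permissive pattern finds the segment TGK that the strict pattern misses, which illustrates the gain in sensitivity (and the accompanying risk of additional chance matches) that comes from allowing chemically similar residues at a position.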


 
Table 3. Substitution amino acid groups (Ref. 35)

 
A number of analytical approaches take advantage of the fact that many protein families are characterized by not one but several common motifs. One method, known as "frequency matrices" or "fingerprinting" (36), excises all sequences corresponding to a series of conserved motifs in question and calculates the observed frequency of each amino acid residue at every position of the motifs. This technique works especially well in large protein families with abundant sequence data; however, in smaller families sufficient sequence information to cover all possible amino acid substitutions may not be known. In this situation, a modification of the fingerprinting approach that uses position-specific mutation/substitution weight matrices [also known as position-specific scoring matrices (PSSMs)] is employed [PAM (10) and BLOSUM (20) are the two matrices most commonly used].
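A frequency matrix for a single motif can be computed as follows. This Python sketch is an illustration under simplifying assumptions (one ungapped motif, raw frequencies used directly as scores, toy sequences); it is not the scoring scheme of any particular database.

```python
from collections import Counter


def frequency_matrix(aligned_motifs):
    """Observed frequency of each residue at every position of the motif."""
    matrix = []
    for i in range(len(aligned_motifs[0])):
        counts = Counter(seq[i] for seq in aligned_motifs)
        total = sum(counts.values())
        matrix.append({residue: n / total for residue, n in counts.items()})
    return matrix


def score(matrix, segment):
    """Sum the frequencies of the segment's residues at their positions."""
    return sum(column.get(residue, 0.0)
               for column, residue in zip(matrix, segment))


def best_window(matrix, sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(matrix)
    windows = [(score(matrix, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1)]
    return max(windows) if windows else None


# Four invented instances of a three-residue motif.
matrix = frequency_matrix(["GKS", "GKT", "GRS", "AKS"])
print(best_window(matrix, "MMGKSMM"))  # -> (2.25, 2)
```

A residue that was never observed at a position contributes nothing to the score in this sketch. With only a few known family members, many acceptable substitutions fall into that category, which is the limitation that substitution-weighted matrices are meant to address.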

An alternative group of techniques analyzes the entire domain, emphasizing the importance not only of the conserved motifs but also of the variable sequence between them. One approach, termed a "profile" (18, 30), defines which residues are allowed at given positions, which positions are highly conserved and which are degenerate, and which positions or regions can tolerate insertions or deletions (with various scoring penalties applied). Evolutionary weights and results from structural studies are also often incorporated into the scoring system. Another "holistic" domain method uses HMMs (Ref. 24). An HMM is a probabilistic model that consists of a series of sequentially connected states: match, insert, or delete (which are assigned based on a sequence alignment). At every position each of the states is assigned a certain cost, and alignment algorithms attempt to find the lowest-cost pathway through the HMM.
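The search for a lowest-cost pathway can be sketched with dynamic programming. The Python code below is a deliberately simplified stand-in for a profile HMM: match costs are position specific, but insert and delete costs are single constants, whereas a real HMM assigns position-specific costs (derived from probabilities) to every state and transition. All parameter values are assumptions made for the example.

```python
def lowest_cost_path(sequence, match_costs, insert_cost=2.0,
                     delete_cost=2.0, default_cost=3.0):
    """Align sequence to a model; return (total cost, state path).

    match_costs[j] maps residues to the cost of matching them at model
    position j; residues not listed cost default_cost. The path uses
    M (match), I (insert a residue), and D (delete a model position).
    """
    n, m = len(sequence), len(match_costs)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            candidates = []
            if i > 0 and j > 0:
                step = match_costs[j - 1].get(sequence[i - 1], default_cost)
                candidates.append((cost[i - 1][j - 1] + step, "M"))
            if i > 0:
                candidates.append((cost[i - 1][j] + insert_cost, "I"))
            if j > 0:
                candidates.append((cost[i][j - 1] + delete_cost, "D"))
            for value, state in candidates:
                if value < cost[i][j]:
                    cost[i][j], move[i][j] = value, state
    # Trace the chosen states back from the end of both sequence and model.
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        state = move[i][j]
        path.append(state)
        if state == "M":
            i, j = i - 1, j - 1
        elif state == "I":
            i -= 1
        else:
            j -= 1
    return cost[n][m], "".join(reversed(path))


model = [{"A": 0.0}, {"C": 0.0}, {"D": 0.0}]  # toy three-position model
print(lowest_cost_path("ACD", model))   # -> (0.0, 'MMM')
print(lowest_cost_path("AD", model))    # -> (2.0, 'MDM')
print(lowest_cost_path("AXCD", model))  # -> (2.0, 'MIMM')
```

The second and third calls show how a missing or an extra residue is absorbed by a delete or an insert state at a fixed penalty instead of disrupting the matches on either side.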

PROSITE. PROSITE (12) is a mixed motif/domain-based database that uses regular expression patterns and profile algorithms for homology searches. Its patterns and profiles are both curated from the literature and manually designed by the maintainers of the database. Typically, a core pattern is designed first, and its sensitivity and specificity are later refined through repeated "runs" through the SWISS-PROT database.

PRINTS. PRINTS (4) is a fingerprint-oriented database based at the University of Manchester. All its fingerprints are manually designed by the curators with references supplied in the annotation. Similarly to PROSITE, the starting "seed" fingerprint is fine-tuned by iterative searches through SWISS-PROT. The database can be searched by a protein sequence for a matching fingerprint or for a keyword in a fingerprint entry annotation. The web site also contains a compendium of protein sequences that match when all fingerprints are run through OWL, SWISS-PROT, and TrEMBL databases. A tool called BLAST PRINT (45) allows researchers to run their sequences through a BLAST (1) search on all sequences in the source databases that match fingerprints in PRINTS, thus combining the speed of BLAST queries with specificity of fingerprint diagnoses.

BLOCKS. BLOCKS (21) is a database of multiple-motif weight matrices (referred to by the database authors as "blocks") maintained by the Fred Hutchinson Cancer Research Center (FHCRC). Unlike most other pattern databases, it is automatically computed. Protein alignments from curated pattern databases (PROSITE is used for the original BLOCKS database; BLOCKS+ adds alignments from PRINTS, ProDom, DOMO, and Pfam) are used to create the weight matrices, which are then calibrated against the SWISS-PROT database. The database can be searched with a DNA or protein sequence using several algorithms, including RPS-BLAST (40) and IMPALA (Schaffer AA, Wolf YI, Ponting CP, Koonin EV, Aravind L, and Altschul SF, unpublished observations). It also provides several additional services, such as BLOCK MAKER, which allows researchers to create their own blocks from a given sequence alignment, and CODEHOP, which designs PCR primers based on the provided alignment.

eMotif. eMotif is a single-motif, permissive regular expression database supported by the Department of Biochemistry of Stanford University (35, 23). It calculates the regular expressions from the protein alignments found in the BLOCKS+ (i.e., BLOCKS including alignments from PROSITE, PRINTS, ProDom, DOMO, and Pfam) and PRINTS databases. A given sequence can be searched for matches to these regular expressions using the eMotif-Search tool. Two other software packages are provided: the eMotif-Maker program devises a permissive regular expression pattern based on a protein alignment provided by the researcher, and eMotif-Scan searches the SWISS-PROT and GenPept databases for a regular expression supplied by the user.

eMatrix. eMatrix (46) is another database run by the Department of Biochemistry at Stanford. Like eMotif, it uses protein alignments from BLOCKS+ and PRINTS but calculates minimal risk-scoring matrices (a type of PSSM) instead of regular expressions. Similarly to eMotif, it can be searched for a match to a protein sequence; it also provides eMatrix-Maker to calculate a matrix based on an alignment and eMatrix-Scan to scan SWISS-PROT using a matrix provided by the user.

Pfam. Pfam (6) is a domain database developed by a multi-institution Pfam Consortium. It consists of two main parts: Pfam-A and Pfam-B. Pfam-A is a curated database of HMMs and sequence alignments for protein domains. At first, a seed alignment is created manually for each domain; this seed alignment is then used to generate an HMM using the HMMER package (11). Finally, the resulting HMM is used to search pfamseq, a nonredundant protein sequence database derived from SWISS-PROT and SP-TrEMBL, to develop a "full" alignment. Manually created seed alignments are annotated and proofread by human experts, while "full" alignments are not, and are therefore putative. Pfam-B is a fully computed (i.e., no annotation or manual proofreading is done) database that was developed to complement Pfam-A. ProDom (9) family alignments (which themselves are automatically computed; see below for details) serve as a source for the seed alignments (those overlapping with Pfam-A alignments are excluded to ensure nonredundancy); HMMs and full alignments are then created similarly to Pfam-A. Sequence searches against the Pfam database automatically include both Pfam-A and Pfam-B. Minor variations in available functionalities exist between Sanger Laboratory and Washington University web sites.

SMART. SMART (28) is an HMM-based domain database supported by the Bork group at European Molecular Biology Laboratory (EMBL). It specializes in non-enzymatic regulatory domains of signaling proteins as well as domains associated with DNA, RNA, chromatin, and actin cytoskeleton functions. A seed alignment is created manually and used to develop an HMM. The alignment is then enriched by traversing protein sequence databases using the developed HMM and the HMMER (11) program as well as the raw alignment and PSI-BLAST (2). The resulting alignment is manually proofread and trimmed by human experts. Additionally, transmembrane domains are identified using the TMHMM2 (26) program. To make the query more comprehensive, users can request to include Pfam domains in the search. Finally, to detect outlier homologs that may not have been captured by the HMM methodology, a WU-BLAST (15) search through the actual sequences of the alignments may be employed.

CDD. CDD ("conserved domain database"; Ref. 31) is a PSSM-based domain database run by NCBI. The protein sequence alignments it uses to create PSSMs are taken from Pfam-A (6) and SMART (28) databases, with a few more added by NCBI curators. RPS-BLAST is used to match the query sequence to the PSSMs.

Superfamily. Superfamily (16) is an HMM-based database which provides superfamily [which may be a domain or a group of domains, as defined by SCOP ("structural classification of proteins"; Ref. 33) researchers] models for all proteins of known three-dimensional structure. The HMMs are generated from alignments handcrafted by the SCOP staff using the Sequence Alignment and Modeling system (25) package.

TIGRFAMs. Similarly to Superfamily, TIGRFAMs (19) aims to establish not just sequence but functional homology; its creators introduce a term "equivalogs" to describe proteins that are similar in both structure and function. Manually developed alignments of equivalogs from complete microbial genomes are used to create HMMs (employing the HMMER package for this purpose). Although thresholds in the HMM models have been fine tuned to exclude known orthologs and paralogs with dissimilar functions, it is of course ultimately up to the user to prove with certainty (experimentally, if necessary) that an unknown sequence matching one of the models does indeed share functional as well as structural homology.

Alignment databases.
An alternative approach to the protein subcomponent analysis is an automated sequence alignment, similar to the methods utilized by the ortholog databases. Unlike the pattern databases, which are for the most part manually designed and fine-tuned, alignment databases do not provide the advantage of expert validation and detailed annotation of the protein groups they describe, including references to the published evidence, etc. As a result, this technique has lower specificity; however, it gains in sensitivity and comprehensiveness and offers a great advantage in speed.

ProDom. ProDom (9) employs a two-pronged approach to domain modeling. First, a set of expert-validated sequence alignments (some designed by ProDom staff, and others acquired from Pfam-A) is used to generate PSSMs describing domains. In a second step, the software package MKDOM (17), based on PSI-BLAST (2), is used to find domains not described by the manual alignments used in the first step by analyzing sequences in SWISS-PROT and TrEMBL. A subset of the domains found by these two methods from sequences that belong to completely sequenced genomes is included in a separate database, ProDom-CG.

SBASE. SBASE (44) is a fully curated alignment-based database. It uses domain definitions made by human experts at other databases [ProtFam (Ref. 32), Pfam, InterPro member databases (see below)] as well as those compiled from the original literature. It consists of two main subsets: SBASE-A and SBASE-B. Whereas SBASE-A contains well-known structural and functional domain types, SBASE-B comprises groups that are either less well characterized than those in SBASE-A or are defined by composition (e.g., glycine rich) or cellular location (e.g., transmembrane). As a result, some of the SBASE-B and SBASE-A domains partially overlap (for example, an "extracellular domain" from SBASE-B may contain an "EGF module" from SBASE-A). Sequence queries on either of the subsets are run using different flavors of the BLAST algorithm.

Integrative databases.
As illustrated above, over the years a great number of different approaches to description and assignment of protein subcomponents have been developed. Each of them has its benefits and disadvantages, and limiting the search to only one or two of them could adversely affect either sensitivity or specificity (or both) of the result. At the same time, manually rerunning the query on each of the databases is laborious and increasingly unmanageable as the amount of data that needs to be processed grows. It has thus become imperative to develop methods that would integrate different techniques in one place; the two best-known sites that do just that are InterPro and EASY ("expert analysis system").

InterPro. InterPro (3) brings together the data and analytical methods of the majority of the commonly used protein subcomponent databases: PROSITE, PRINTS, Pfam, ProDom, SMART, and TIGRFAMs. All entries from the member databases are manually reviewed and compared against each other to create composite InterPro entries, which contain all information found in each of the corresponding records of the member databases. Fully overlapping entities from different databases are merged into a single InterPro entry, whereas subtype entities (e.g., a PROSITE motif that is part of a Pfam domain) are given their own InterPro entries. Each InterPro entry is annotated with a functional description, literature references, and links back to the individual databases. Sequence searches are performed using all of the techniques employed by the member databases simultaneously.

EASY. Unlike InterPro, which combines different databases manually, EASY (42) aims to develop a computational integration approach (see Fig. 3 for interface screenshot). Its current implementation uses six databases in its searches: SWISS-PROT, PROSITE, PRINTS, Pfam, Profiles (8), and BLOCKS. Some of the databases have a local copy, and others are searched remotely (the software itself can be configured to search any of the databases either locally or remotely). The results of a sequence query from all of the databases are parsed for most commonly encountered words and phrases (frequently encountered but uninformative words like "protein" are excluded). The system has been designed to appreciate the semantic difference between protein sequence and subcomponent pattern databases; it uses this knowledge and the parsing results to hypothesize the identity of the protein encoded by the sequence and the family to which it belongs. The software has been empirically validated by the authors as well as numerous users, and its analytical "skills" continue to be refined.
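The word-tallying step can be illustrated with a few lines of Python. This is only a sketch of the general idea, not the EASY software: the list of uninformative words and the sample hit descriptions are invented, and the inference of protein identity and family from the counts is not attempted.

```python
import re
from collections import Counter

# Assumed stop list; "protein" is the only word named in the text above.
UNINFORMATIVE = {"protein", "the", "of", "and", "a", "in", "putative"}


def common_terms(descriptions, top=3):
    """Return the most frequent informative words across hit descriptions."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z0-9-]+", text.lower())
        counts.update(word for word in words if word not in UNINFORMATIVE)
    return counts.most_common(top)


# Made-up descriptions standing in for hits returned by several databases.
hits = [
    "Serine/threonine protein kinase",
    "Protein kinase domain",
    "Tyrosine kinase catalytic domain",
]
print(common_terms(hits, top=2))  # -> [('kinase', 3), ('domain', 2)]
```

Words that recur across independent databases, such as "kinase" in this toy input, are the kind of evidence from which a hypothesis about the identity of the query protein could then be formed.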



 
Fig. 3. EASY ("expert analysis system").

 

    PORTALS
 
This review would be incomplete without mentioning portals, i.e., web sites which house and/or link to many of the homology databases described above and thus serve as an important liaison between investigators and their tools. The NCBI Entrez at http://www.ncbi.nlm.nih.gov/Entrez/ is the largest one. NCBI develops and hosts, among others, GenBank, CDD, COG, HomoloGene, and LocusLink. It plays a preeminent role in the world of biotechnology and is the first stop on the Internet for many biologists.

A number of other important web portals exist. EnsEMBL houses DNA and protein sequences from several genome projects (human, mouse, zebrafish, etc.) with automatic baseline annotation and provides links to other genomic resources. The ExPASy Molecular Biology Server (http://kr.expasy.org/) develops the protein knowledge bases SWISS-PROT and TrEMBL and provides an entry point to a number of other bioinformatics web sites. In summary, web portals provide the invaluable service of bringing together homology resources from around the world and are recommended as a starting point on the WWW for any aspiring researcher.


    CONCLUSION
 
As we have reviewed here, a wide array of tools and resources is now available to a researcher investigating sequence homology. They differ both in the size of the sequence units analyzed and in the relative sensitivity and specificity of the scoring techniques employed, and it would be wrong to emphasize one or two over the others for every task. Instead, the resource should be selected on the basis of the specific question at hand: for example, someone trying to find out whether a newly discovered protein has GTP-binding sites would be best advised to use a motif-based package, such as PROSITE, whereas another scientist looking to identify the closest related protein in humans would benefit more from a whole gene method, as in HomoloGene. For situations in which this choice is difficult to make, or in which more than one type of homology analysis is sought at the same time, integrated databases such as InterPro and EASY can help. An example (by no means the only possible one) of an algorithm that could be used to select an appropriate database is given in Fig. 4. As we learn more about the human and other genomes, these tools will increasingly help us cross the t’s and dot the i’s.



Fig. 4. Database selection algorithm.

 


    ACKNOWLEDGMENTS
 
We thank Dr. Alan Beggs for help with the discussion of homology databases’ use by biologists.


    FOOTNOTES
 
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).

Address for reprint requests and other correspondence: A. Turchin, Division of Endocrinology, Diabetes and Hypertension, Brigham and Women’s Hospital, 221 Longwood Ave., Boston, MA 02115 (E-mail: aturchin{at}iname.com).

10.1152/physiolgenomics.00112.2002.

1 The Supplemental Table to this article is available online at http://physiolgenomics.physiology.org/cgi/content/full/11/3/165/DC1.


    References
 

  1. Altschul SF, Gish W, Miller W, Myers EW, and Lipman DJ. Basic Local Alignment Search Tool. J Mol Biol 215: 403–410, 1990.
  2. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, and Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402, 1997.
  3. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD, Durbin R, Falquet L, Fleischmann W, Gouzy J, Hermjakob H, Hulo N, Jonassen I, Kahn D, Kanapin A, Karavidopoulou Y, Lopez R, Marx B, Mulder NJ, Oinn TM, Pagni M, Servant F, Sigrist CJ, and Zdobnov EM (The InterPro Consortium). The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res 29: 37–40, 2001.
  4. Attwood TK, Blythe MJ, Flower DR, Gaulton A, Mabey JE, Maudling N, McGregor L, Mitchell AL, Moulton G, Paine K, and Scordis P. PRINTS and PRINTS-S shed light on protein ancestry. Nucleic Acids Res 30: 239–241, 2002.
  5. Attwood TK. The quest to deduce protein function from sequence: the role of pattern databases. Int J Biochem Cell Biol 32: 139–155, 2000.
  6. Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, and Sonnhammer ELL. The Pfam Protein Families Database. Nucleic Acids Res 30: 276–280, 2002.
  7. Blake JA, Eppig JT, Richardson JE, Bult CJ, Kadin JA, and the Mouse Genome Database Group. The Mouse Genome Database (MGD): integration nexus for the laboratory mouse. Nucleic Acids Res 29: 91–94, 2001.
  8. Bucher P, Karplus K, Moeri N, and Hofmann K. A flexible motif search technique based on generalized profiles. Comput Chem 20: 3–23, 1996.
  9. Corpet F, Servant F, Gouzy J, and Kahn D. ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res 28: 267–269, 2000.
  10. Dayhoff MO, Schwartz RM, and Orcutt BC. A model of evolutionary change in proteins. In: Atlas of Protein Sequence and Structure, edited by Dayhoff MO. Washington, DC: National Biomedical Research Foundation, 1978, vol. 5, suppl. 3, p. 345–352.
  11. Eddy S. HMMER 2.2: profile hidden Markov models for biological sequence analysis [Online]. Department of Genetics, Washington University School of Medicine. http://hmmer.wustl.edu/ [revised August 5, 2001].
  12. Falquet L, Pagni M, Bucher P, Hulo N, Sigrist CJ, Hofmann K, and Bairoch A. The PROSITE database, its status in 2002. Nucleic Acids Res 30: 235–238, 2002.
  13. Fitch WM. Distinguishing homologous from analogous proteins. Syst Zool 19: 99–113, 1970.
  14. Galperin MY and Koonin EV. Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement, and operon disruption. In Silico Biol 1: 55–67, 1998.
  15. Gish W. WU-BLAST [Online]. Washington University School of Medicine. http://blast.wustl.edu/ [revised August 14, 2002].
  16. Gough J and Chothia C. SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucleic Acids Res 30: 268–272, 2002.
  17. Gouzy J, Corpet F, and Kahn D. Whole genome protein domain analysis using a new method for domain clustering. Comput Chem 23: 333–340, 1999.
  18. Gribskov M, McLachlan AD, and Eisenberg D. Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci USA 84: 4355–4358, 1987.
  19. Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, and White O. TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Res 29: 41–43, 2001.
  20. Henikoff JG and Henikoff S. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89: 10915–10919, 1992.
  21. Henikoff JG, Greene EA, Pietrokovski S, and Henikoff S. Increased coverage of protein families with the blocks database servers. Nucleic Acids Res 28: 228–230, 2000.
  22. HomoloGene. National Center for Biotechnology Information [Online]. http://www.ncbi.nlm.nih.gov/HomoloGene/ [revised July 19, 2002].
  23. Huang JY and Brutlag DL. The EMOTIF database. Nucleic Acids Res 29: 202–204, 2001.
  24. Hughey R and Krogh A. Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput Appl Biosci 12: 95–107, 1996.
  25. Karchin R and Hughey R. Weighting hidden Markov models for maximum discrimination. Bioinformatics 14: 772–782, 1998.
  26. Krogh A, Larsson B, von Heijne G, and Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305: 567–580, 2001.
  27. Lee Y, Sultana R, Pertea G, Cho J, Karamycheva S, Tsai J, Parvizi B, Cheung F, Antonescu V, White J, Holt I, Liang F, and Quackenbush J. Cross referencing eukaryotic genomes: TIGR orthologous gene alignment (TOGA). Genome Res 12: 493–502, 2002.
  28. Letunic I, Goodstadt L, Dickens NJ, Doerks T, Schultz J, Mott R, Ciccarelli F, Copley RR, Ponting CP, and Bork P. Recent Improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res 30: 242–244, 2002.
  29. LocusLink. National Center for Biotechnology Information [Online]. http://www.ncbi.nlm.nih.gov/LocusLink/ [revised April 12, 2002].
  30. Luthy R, Xenarios I, and Bucher P. Improving the sensitivity of the sequence profile method. Protein Sci 3: 139–146, 1994.
  31. Marchler-Bauer A, Panchenko AR, Shoemaker BA, Thiessen PA, Geer LY, and Bryant SH. CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res 30: 281–283, 2002.
  32. Mewes HW, Frishman D, Gruber C, Geier B, Haase D, Kaps A, Lemcke K, Mannhaupt G, Pfeiffer F, Schuller C, Stocker S, and Weil B. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 28: 37–40, 2000.
  33. Murzin AG, Brenner SE, Hubbard T, and Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247: 536–540, 1995.
  34. NCBI Databases. National Center for Biotechnology Information [Online]. http://www.ncbi.nlm.nih.gov/Database/index.html [revised March 22, 2002].
  35. Nevill-Manning CG, Wu TD, and Brutlag DL. Highly specific protein sequence motifs for genome analysis. Proc Natl Acad Sci USA 95: 5865–5871, 1998.
  36. Overbeek R, Larsen N, Pusch GD, D’Souza M, Selkov E Jr, Kyrpides N, Fonstein M, Maltsev N, and Selkov E. WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res 28: 123–125, 2000.
  37. Parry-Smith DJ and Attwood TK. ADSP: a new package for computational sequence analysis. Comput Appl Biosci 8: 451–459, 1992.
  38. Pearson WR and Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85: 2444–2448, 1988.
  39. Pruitt KD and Maglott DR. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 29: 137–140, 2001.
  40. RPS-BLAST. The Blocks Server [Online]. Fred Hutchinson Cancer Research Center. http://blocks.fhcrc.org/blocks/help/about_rpsblast.html [revised Sept. 6, 2002].
  41. Sander C and Schneider R. Database of homology derived protein structures and the structural meaning of sequence alignment. Proteins 9: 56–68, 1991.
  42. Selley JN and Attwood TK. EASY: an Expert Analysis SYstem for interpreting database search outputs. Bioinformatics 17: 105–106, 2001.
  43. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, and Koonin EV. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 29: 22–28, 2001.
  44. Vlahovicek K, Murvai J, Barta E, and Pongor S. The SBASE protein domain library, release 9.0: an online resource for protein domain identification. Nucleic Acids Res 30: 273–275, 2002.
  45. Wright W, Scordis P, and Attwood TK. BLAST PRINTS: alternative perspectives on sequence similarity. Bioinformatics 15: 523–524, 1999.
  46. Wu TD, Nevill-Manning CG, and Brutlag DL. Minimal-risk scoring matrices for sequence analysis. J Comput Biol 6: 219–235, 1999.
  47. Zhang Z, Schwartz S, Wagner L, and Miller W. A greedy algorithm for aligning DNA sequences. J Comput Biol 7: 203–214, 2000.