1 Department of Medicine, New England Medical Center, Boston 02111
2 Children's Hospital Medical Informatics Program, Division of Endocrinology, Children's Hospital, Boston, Massachusetts 02215
ABSTRACT
ortholog; database; internet
INTRODUCTION
A number of techniques have been developed to aid researchers in this daunting task. One of the time-tested methods is to compare newly obtained sequence data with sequences that are already known and then attempt to infer function from sequence homology. As homology implies an evolutionary and structural relationship (as opposed to a possibly randomly occurring similarity) between sequences, this type of analysis requires a scientifically rigorous approach. Over the years this approach has been refined to assist researchers in a number of tasks, from determination of the elementary functional blocks of a newly discovered protein to the analysis of the evolutionary development of the organism. Numerous homology resources have been devised and implemented in the public domain [typically, on the World Wide Web (WWW)], many of them specialized in solving a specific type of problem. In this paper we aim to offer a systematic review of the ones most commonly used; their uniform resource locators (URLs) can be found in Table 1.
SEQUENCE ALIGNMENT TOOLS
Of the programs still in common use today, FASTA (Fast-All) (38) was the first one to be developed. It achieved significantly higher speeds than previously available algorithms by matching sequence patterns or words, called "k-tuples," as opposed to comparing individual residues in two sequences, and then building a local alignment by extending these word matches with a predefined penalty for gaps. Several implementations of FASTA exist, including TFASTA, which compares a protein sequence to a DNA sequence by translating each DNA sequence into all six possible reading frames and then comparing each frame to the protein sequence, and LFASTA, which identifies one or more regions of similarity between two sequences. FASTA searches are available from numerous WWW servers throughout the world.
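The word-matching idea described above can be illustrated with a minimal sketch. It is not the FASTA implementation: it only indexes exact k-tuples and counts matches per diagonal, omitting the rescoring and gapped extension steps. The function names `ktuple_index` and `best_diagonal` are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def ktuple_index(sequence, k):
    """Map every length-k word in `sequence` to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def best_diagonal(query, subject, k=2):
    """Count shared k-tuples per diagonal (subject_pos - query_pos).

    Returns (diagonal, number_of_word_matches) for the richest diagonal,
    or None when the two sequences share no k-tuple.
    """
    index = ktuple_index(subject, k)
    diagonals = defaultdict(int)
    for q_pos in range(len(query) - k + 1):
        for s_pos in index.get(query[q_pos:q_pos + k], ()):
            diagonals[s_pos - q_pos] += 1
    if not diagonals:
        return None
    return max(diagonals.items(), key=lambda item: item[1])


if __name__ == "__main__":
    print(best_diagonal("ACDEFG", "XXACDEFGYY", k=2))
```

A diagonal that collects many word matches marks a region where the two sequences run in parallel, which is where a local alignment would then be built.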
Basic Local Alignment Search Tool (BLAST) (2) was developed later and added further improvements in processing speed. The original version of BLAST produced only local alignments (as opposed to global ones for FASTA) so that several different matching segments [high-scoring pairs (HSPs)] could be reported for a pair of sequences. The scores for these were combined to achieve a minimum score required for the pair of sequences to be presented as a match; if one or more of the HSPs were missed, the entire sequence pair might not be reported, resulting in somewhat lower sensitivity. A later version of the algorithm, gapped BLAST, improved sensitivity and speed by adding the requirement that two words (short sequence segments) be matched within a short, predefined distance of each other for the match to be extended. Because the extension process was the most time-consuming step, this gain in speed allowed lower thresholds to be set for word matches and thus improved overall sensitivity. A number of different versions of the BLAST algorithm exist, allowing alignment of different types of nucleic acid and protein sequences. Like FASTA, BLAST implementations are widely available on the Internet [e.g., from the National Center for Biotechnology Information (NCBI)].
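The two-word requirement can be sketched as follows. This is a simplified illustration under stated assumptions, not the BLAST code: it uses exact word matches only (the real program also scores similar "neighborhood" words), and the helper names `word_hits` and `two_hit_seeds` as well as the `window` parameter are hypothetical.

```python
from collections import defaultdict


def word_hits(query, subject, w):
    """Return (query_pos, subject_pos) for every exact length-w word match."""
    positions = defaultdict(list)
    for s_pos in range(len(subject) - w + 1):
        positions[subject[s_pos:s_pos + w]].append(s_pos)
    hits = []
    for q_pos in range(len(query) - w + 1):
        for s_pos in positions.get(query[q_pos:q_pos + w], ()):
            hits.append((q_pos, s_pos))
    return hits


def two_hit_seeds(hits, w, window):
    """Keep hits that follow an earlier, non-overlapping hit on the same
    diagonal by at most `window` positions; only these would be extended."""
    last_on_diagonal = {}
    seeds = []
    for q_pos, s_pos in sorted(hits):
        diagonal = s_pos - q_pos
        previous = last_on_diagonal.get(diagonal)
        if previous is None:
            last_on_diagonal[diagonal] = q_pos
            continue
        distance = q_pos - previous
        if distance < w:
            continue  # overlaps the previous hit, so it is ignored
        if distance <= window:
            seeds.append((q_pos, s_pos))
        last_on_diagonal[diagonal] = q_pos
    return seeds


if __name__ == "__main__":
    hits = word_hits("MKVLAAGIWR", "MKVLAAGIWR", w=3)
    print(two_hit_seeds(hits, w=3, window=10))
```

Because an isolated word match never triggers the costly extension step, the word-match threshold can be relaxed without a matching loss of speed.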
DATABASES
According to the needs of the scientists, specialized homology databases have been developed. They fall into two main categories: ortholog databases and protein subcomponent databases.
Ortholog databases (see Table 2; also see the Supplemental Table, published online at the Physiological Genomics web site)1 employ the "holistic" or "gene level" approach. An example of an ortholog database is EGO ("eukaryotic gene orthologs"), a snapshot of which can be found in Fig. 1. Ortholog databases contain pairs of genes from different species that have been determined to be orthologous, either by manual analysis of available experimental evidence or computationally by DNA sequence alignment (in which case the assignment is putative). These genes are typically similar in overall sequence, and their product proteins perform closely related functions. As all gene pairs are predetermined, these databases frequently cannot be searched using a newly discovered sequence but can only be queried (by gene name or database ID) for homologs of already "established" genes. They are typically also limited in the number of species they cover.
WIT.
WIT ("what is there"; Ref. 36) is a database supported by Argonne National Laboratory whose main focus is on organizing sequenced genomes into functional metabolic pathways. Among other features of the site is "ortholog cluster retrieval." The algorithm used to form the clusters is (similarly to COG) based on the concept of bidirectional best hits; however, WIT imposes some additional restrictions, which make its selection process less sensitive but more specific. Like COG, its data are largely limited to prokaryotes (except that it also includes Caenorhabditis elegans). Query options are limited: users can only browse clusters belonging to a subset of species and cannot query the database using a specific sequence or a gene name.
HomoloGene.
HomoloGene (22) is an NCBI database that includes orthologs between 12 animal and plant species. Some of the ortholog pairs are curated from the literature and from curated databases, such as the Mouse Genome Database (MGD) and the Zebrafish Information database (ZFIN). The rest of the ortholog pairs (considered putative) are computed by finding genes that constitute a two-way reciprocal best match (i.e., gene A in species a comes up as the best match when the sequence of gene B from species b is used to search all available sequences from species a, AND gene B is the best match when gene A's sequence is used to search all available sequences from species b) using the MegaBLAST program (47); cases where a three-way reciprocal best match is found are considered more rigorous and are termed "consistent ortholog groups." mRNA sequences are used for the computation.
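The two-way reciprocal best match criterion can be written down in a few lines. The sketch below assumes that the alignment scores have already been computed and are supplied as dictionaries; the gene names, scores, and the function names `best_hits` and `reciprocal_best_matches` are hypothetical and are not part of HomoloGene.

```python
def best_hits(scores):
    """scores: {(query_gene, target_gene): score} -> {query_gene: best target}."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_matches(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best match."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Made-up scores: species a genes searched against species b, and back.
    a_vs_b = {("A1", "B1"): 90, ("A1", "B2"): 50,
              ("A2", "B1"): 70, ("A2", "B2"): 40}
    b_vs_a = {("B1", "A1"): 88, ("B1", "A2"): 65,
              ("B2", "A1"): 55, ("B2", "A2"): 30}
    print(reciprocal_best_matches(a_vs_b, b_vs_a))
```

In this made-up example A2's best match is B1, but B1's best match is A1, so only the pair (A1, B1) qualifies as a putative ortholog pair.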
LocusLink.
LocusLink (38), while mostly geared toward presenting curated sequence and descriptive information about genetic loci, also includes in its distribution a downloadable file, "homol_seq_pairs.gz." This file contains pairs of LocusLink loci from different species whose mRNA sequences (as found in RefSeq; Ref. 38) have been found to constitute a two-way reciprocal best match by BLAST (2) alignment (29). At this time the file cannot be queried directly from the LocusLink WWW interface; therefore, a user needs to download it and either load it into a relational database (e.g., Microsoft Access), which allows complex queries, or search the text file itself for a keyword, thus forgoing the ability to discriminate between different fields in the table.
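As a rough illustration of the relational-database route, the sketch below loads tab-delimited pairs into an in-memory SQLite database and runs a field-aware query. The four-column layout, the table name, the sample identifiers, and the function names are all assumptions made for this example; the actual layout of the downloaded file should be checked before adapting it.

```python
import csv
import io
import sqlite3

# Hypothetical four-column layout; the real file may be organized differently.
COLUMNS = ("locus_a", "species_a", "locus_b", "species_b")

# Placeholder records, not real LocusLink identifiers.
SAMPLE = ("L100\thuman\tL200\tmouse\n"
          "L100\thuman\tL300\trat\n"
          "L400\tmouse\tL500\trat\n")


def load_pairs(connection, tab_delimited_text):
    """Create the `pairs` table and fill it from tab-delimited text."""
    connection.execute(
        "CREATE TABLE pairs "
        "(locus_a TEXT, species_a TEXT, locus_b TEXT, species_b TEXT)")
    rows = csv.reader(io.StringIO(tab_delimited_text), delimiter="\t")
    connection.executemany(
        "INSERT INTO pairs VALUES (?, ?, ?, ?)",
        [row for row in rows if len(row) == len(COLUMNS)])
    connection.commit()


def partners(connection, locus_id, species):
    """Loci from `species` paired with `locus_id`, whichever column it is in."""
    query = ("SELECT locus_b FROM pairs WHERE locus_a = ? AND species_b = ? "
             "UNION "
             "SELECT locus_a FROM pairs WHERE locus_b = ? AND species_a = ?")
    rows = connection.execute(query, (locus_id, species, locus_id, species))
    return sorted(row[0] for row in rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_pairs(conn, SAMPLE)
    print(partners(conn, "L100", "mouse"))
```

A plain keyword search of the text file would return every line mentioning the identifier, whereas the query above can restrict the result to one species and one field.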
Mammalian Homology and Comparative Maps.
Unlike the previously described databases, Mammalian Homology and Comparative Maps (7) does not contain any calculated (and therefore considered putative) orthologs but is a purely curated repository of mammalian gene homologies. All ortholog pairs stored in the database are based on published evidence. The following categories of evidence have been used by the curators: amino acid sequence comparison, coincident expression, conserved map location, cross-hybridization to the same molecular probe, formation of functional heteropolymers, functional complementation, immunologic cross reaction, nucleotide sequence comparison, similar response to specific inhibitors, similar subcellular location, similar substrate specificity, and similar subunit structure. Literature references are provided for each category of evidence that was used to establish a particular orthologous relationship. All of these ortholog relationships are also available through HomoloGene; the advantage of Mammalian Homology and Comparative Maps is in its focus on curated relationships (thus increasing the specificity of the query result) as well as in very detailed documentation of the evidence.
TIGR EGO.
EGO, formerly referred to as The Institute for Genomic Research (TIGR) Orthologous Gene Alignments, or TOGA (27), is a purely computed ortholog database. It is generated by pair-wise comparison between the tentative consensus (TC) sequences that comprise the TIGR Gene Indices from individual organisms. Tentative ortholog groups (TOGs) are identified as reciprocal best matches between at least three organisms with a minimum of 75% sequence identity over at least 400 bp for any single sequence match. One important difference between EGO and other ortholog databases is that the TC sequences it uses to calculate orthologs are themselves computed from alignments of expressed sequence tags (ESTs) as well as mRNA sequences.
Protein Subcomponent Databases
Unlike ortholog databases, protein subcomponent databases analyze portions of a protein (such as motifs and domains) rather than the entire gene. The shorter length of the sequence being analyzed allows a large number of different techniques to be employed to establish homology. These fall into two large groups: pattern methods (5) and sequence alignment. A summary of the main features of the most commonly used subcomponent databases can be found in Tables 4 and 5.
An alternative group of techniques analyzes the entire domain, emphasizing the importance not only of the conserved motifs but also of the variable sequence between them. One approach, termed a "profile" (18, 30), defines which residues are allowed at given positions, which positions are highly conserved and which are degenerate, and which positions, or regions, can tolerate insertions or deletions (with various scoring penalties applied). Evolutionary weights and results from structural studies are also often incorporated into the scoring system. Another "holistic" domain method is the hidden Markov model (HMM; Ref. 24). This is a probabilistic model that consists of a series of sequentially connected states: match, insert, or delete (which are assigned based on a sequence alignment). At every position each of the states is assigned a certain cost, and alignment algorithms attempt to find the lowest-cost pathway through an HMM.
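The search for a lowest-cost pathway through match, insert, and delete states can be sketched with a small dynamic program. This is a deliberately simplified model, not a full profile HMM: it assumes fixed insert and delete costs and per-position match costs, whereas real HMMs carry state-specific transition and emission probabilities. The function name, the toy model, and all cost values are hypothetical.

```python
def lowest_cost_path(match_costs, sequence,
                     insert_cost=3.0, delete_cost=3.0, mismatch_cost=4.0):
    """Cost of the cheapest path aligning `sequence` to a linear model.

    match_costs: one {residue: cost} dict per model position; residues not
    listed at a position are charged `mismatch_cost`.
    """
    n_states, n_residues = len(match_costs), len(sequence)
    inf = float("inf")
    cost = [[inf] * (n_residues + 1) for _ in range(n_states + 1)]
    cost[0][0] = 0.0
    for i in range(n_states + 1):
        for j in range(n_residues + 1):
            if i > 0 and j > 0:  # match state: consume a position and a residue
                emit = match_costs[i - 1].get(sequence[j - 1], mismatch_cost)
                cost[i][j] = min(cost[i][j], cost[i - 1][j - 1] + emit)
            if i > 0:            # delete state: skip a model position
                cost[i][j] = min(cost[i][j], cost[i - 1][j] + delete_cost)
            if j > 0:            # insert state: consume an extra residue
                cost[i][j] = min(cost[i][j], cost[i][j - 1] + insert_cost)
    return cost[n_states][n_residues]


if __name__ == "__main__":
    model = [{"A": 0.0}, {"C": 0.0}, {"D": 0.0}]  # toy three-position model
    for seq in ("ACD", "AD", "ACWD", "AWD"):
        print(seq, lowest_cost_path(model, seq))
```

With the toy model above, a perfect match costs nothing, a missing or extra residue costs one delete or insert, and a substitution is charged the mismatch cost when that is cheaper than an insert plus a delete.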
PROSITE. PROSITE (12) is a mixed motif/domain-based database that uses regular expression patterns and profile algorithms for homology searches. Its patterns and profiles are either curated from the literature or manually designed by the maintainers of the database. Typically, a core pattern is designed first, and its sensitivity and specificity are later refined through repeated "runs" through the SWISS-PROT database.
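To make the pattern approach concrete, the sketch below converts a pattern written in PROSITE-style notation into a Python regular expression. It handles only a subset of the notation (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, and `<` or `>` as terminal anchors at the ends of the pattern); the function name `prosite_to_regex` is hypothetical and the code is not taken from PROSITE's own tools.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")   # N-terminal anchor
        at_end = element.endswith(">")       # C-terminal anchor
        element = element.strip("<>")
        repeat = ""
        if "(" in element:                   # repeat count, e.g. x(2,4)
            element, count = element.rstrip(")").split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                       # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"  # forbidden residues
        else:
            core = element                   # single residue or [ABC] class
        parts.append(("^" if at_start else "") + core + repeat
                     + ("$" if at_end else ""))
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("N-{P}-[ST]-{P}")
    print(regex, re.search(regex, "AANGSAA").start())
```

The example pattern reads "N, then anything but P, then S or T, then anything but P"; refining such a pattern amounts to adjusting its elements until database searches return few false positives and few false negatives.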
PRINTS. PRINTS (4) is a fingerprint-oriented database based at the University of Manchester. All its fingerprints are manually designed by the curators, with references supplied in the annotation. As in PROSITE, the starting "seed" fingerprint is fine-tuned by iterative searches through SWISS-PROT. The database can be searched with a protein sequence for a matching fingerprint or with a keyword for a match in a fingerprint entry annotation. The web site also contains a compendium of protein sequences that match when all fingerprints are run through the OWL, SWISS-PROT, and TrEMBL databases. A tool called BLAST PRINT (45) allows researchers to run their sequences through a BLAST (1) search on all sequences in the source databases that match fingerprints in PRINTS, thus combining the speed of BLAST queries with the specificity of fingerprint diagnoses.
BLOCKS. BLOCKS (21) is a database of multiple-motif weight matrices (referred to by the database authors as "blocks") maintained by the Fred Hutchinson Cancer Research Center (FHCRC). Unlike most other pattern databases, it is automatically computed. Protein alignments from curated pattern databases (PROSITE is used for the original BLOCKS database; BLOCKS+ adds alignments from PRINTS, ProDom, DOMO, and Pfam) are used to create the weight matrices, which are then calibrated against the SWISS-PROT database. The database can be searched by DNA or protein sequence using several algorithms, including RPS-BLAST (40) and IMPALA (Schaffer AA, Wolf YI, Ponting CP, Koonin EV, Aravind L, and Altschul SF, unpublished observations). It also provides several additional services, such as the BLOCK MAKER, which allows scientists to create their own blocks from a given sequence alignment, and CODEHOP, which designs PCR primers based on the provided alignment.
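The notion of a weight matrix derived from an ungapped alignment block can be sketched as follows. This is an illustrative simplification, not the BLOCKS procedure: it assumes a uniform background frequency and a flat pseudocount, with no sequence weighting or calibration. The function names and the toy block are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(block, pseudocount=1.0):
    """block: equal-length, ungapped aligned segments.

    Returns one {residue: log-odds score} dict per alignment column,
    assuming a uniform background distribution over the 20 residues.
    """
    background = 1.0 / len(ALPHABET)
    matrix = []
    for column in zip(*block):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        matrix.append({
            residue: math.log2((counts[residue] + pseudocount)
                               / total / background)
            for residue in ALPHABET})
    return matrix


def best_window(matrix, sequence):
    """Return (score, start) of the best-scoring window, or None if too short."""
    width = len(matrix)
    scored = [
        (sum(col.get(res, 0.0)
             for col, res in zip(matrix, sequence[start:start + width])),
         start)
        for start in range(len(sequence) - width + 1)]
    return max(scored) if scored else None


if __name__ == "__main__":
    block = ["GKSGT", "GKTGT", "GKSGS"]  # toy ungapped alignment
    print(best_window(build_weight_matrix(block), "AAAGKSGTAAA"))
```

Unlike a regular expression, the matrix scores every residue at every position, so a sequence that deviates slightly from the block can still obtain a positive score.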
eMotif. eMotif is a single-motif, permissive regular expression database supported by the Department of Biochemistry of Stanford University (35, 23). It calculates the regular expressions from the protein alignments found in the BLOCKS+ (i.e., BLOCKS including alignments from PROSITE, PRINTS, ProDom, DOMO, and Pfam) and PRINTS databases. A given sequence can be searched for matches to these regular expressions using the eMotif-Search tool. Two other software packages are provided: the eMotif-Maker program devises a permissive regular expression pattern based on a protein alignment provided by the researcher, and eMotif-Scan searches the SWISS-PROT and GenPept databases for a regular expression supplied by the user.
eMatrix. eMatrix (46) is another database run by the Department of Biochemistry at Stanford. Like eMotif, it uses protein alignments from BLOCKS+ and PRINTS but calculates minimal risk-scoring matrices (a type of PSSM) instead of regular expressions. Similarly to eMotif, it can be searched for a match to a protein sequence; it also provides eMatrix-Maker to calculate a matrix based on an alignment and eMatrix-Scan to scan SWISS-PROT using a matrix provided by the user.
Pfam. Pfam (6) is a domain database developed by a multi-institution Pfam Consortium. It consists of two main parts: Pfam-A and Pfam-B. Pfam-A is a curated database of HMMs and sequence alignments for protein domains. At first, a seed alignment is created manually for each domain; this seed alignment is then used to generate an HMM using the HMMER package (11). Finally, the resulting HMM is used to search pfamseq, a nonredundant protein sequence database derived from SWISS-PROT and SP-TrEMBL, to develop a "full" alignment. Manually created seed alignments are annotated and proofread by human experts, while "full" alignments are not, and are therefore putative. Pfam-B is a fully computed (i.e., no annotation or manual proofreading is done) database that was developed to complement Pfam-A. ProDom (9) family alignments (which themselves are automatically computed; see below for details) serve as a source for the seed alignments (those overlapping with Pfam-A alignments are excluded to ensure nonredundancy); HMMs and full alignments are then created similarly to Pfam-A. Sequence searches against the Pfam database automatically include both Pfam-A and Pfam-B. Minor variations in available functionalities exist between Sanger Laboratory and Washington University web sites.
SMART. SMART (28) is an HMM-based domain database supported by the Bork group at the European Molecular Biology Laboratory (EMBL). It specializes in non-enzymatic regulatory domains of signaling proteins as well as domains associated with DNA, RNA, chromatin, and actin cytoskeleton functions. A seed alignment is created manually and used to develop an HMM. The alignment is then enriched by searching protein sequence databases using the developed HMM and the HMMER (11) program as well as the raw alignment and PSI-BLAST (2). The resulting alignment is manually proofread and trimmed by human experts. Additionally, transmembrane domains are identified using the TMHMM2 (26) program. To make the query more comprehensive, users can request that Pfam domains be included in the search. Finally, to detect outlier homologs that may not have been captured by the HMM methodology, a WU-BLAST (15) search through the actual sequences of the alignments may be employed.
CDD. CDD ("conserved domain database"; Ref. 31) is a PSSM-based domain database run by NCBI. The protein sequence alignments it uses to create PSSMs are taken from Pfam-A (6) and SMART (28) databases, with a few more added by NCBI curators. RPS-BLAST is used to match the query sequence to the PSSMs.
Superfamily. Superfamily (16) is an HMM-based database which provides superfamily [which may be a domain or a group of domains, as defined by SCOP ("structural classification of proteins"; Ref. 33) researchers] models for all proteins of known three-dimensional structure. The HMMs are generated from alignments handcrafted by the SCOP staff using the Sequence Alignment and Modeling system (25) package.
TIGRFAMs. Similarly to Superfamily, TIGRFAMs (19) aims to establish not just sequence but functional homology; its creators introduce the term "equivalogs" to describe proteins that are similar in both structure and function. Manually developed alignments of equivalogs from complete microbial genomes are used to create HMMs (employing the HMMER package for this purpose). Although thresholds in the HMM models have been fine-tuned to exclude known orthologs and paralogs with dissimilar functions, it is ultimately up to the user to prove with certainty (experimentally, if necessary) that an unknown sequence matching one of the models does indeed share functional as well as structural homology.
Alignment databases.
An alternative approach to the protein subcomponent analysis is an automated sequence alignment, similar to the methods utilized by the ortholog databases. Unlike the pattern databases, which are for the most part manually designed and fine-tuned, alignment databases do not provide the advantage of expert validation and detailed annotation of the protein groups they describe, including references to the published evidence, etc. As a result, this technique has lower specificity; however, it gains in sensitivity and comprehensiveness and offers a great advantage in speed.
ProDom. ProDom (9) employs a two-pronged approach to domain modeling. First, a set of expert-validated sequence alignments (some designed by ProDom staff, and others acquired from Pfam-A) is used to generate PSSMs describing domains. In a second step, the software package MKDOM (17), based on PSI-BLAST (2), is used to find domains not described by the manual alignments used in the first step by analyzing sequences in SWISS-PROT and TrEMBL. A subset of the domains found by these two methods from sequences that belong to completely sequenced genomes is included in a separate database, ProDom-CG.
SBASE. SBASE (44) is a fully curated alignment-based database. It uses domain definitions by human experts at other databases [ProtFam (Ref. 32), Pfam, InterPro member databases (see below)] as well as those compiled from the original literature. It consists of two main subsets: SBASE-A and SBASE-B. Whereas SBASE-A contains well-known structural and functional domain types, SBASE-B comprises groups that are either less well characterized than those in SBASE-A or are defined by composition (e.g., glycine rich) or cellular location (e.g., transmembrane). As a result, some of the SBASE-B and SBASE-A domains partially overlap (for example, an "extracellular domain" from SBASE-B may contain an "EGF module" from SBASE-A). Sequence queries on either of the subsets are run using different flavors of the BLAST algorithm.
Integrative databases.
As illustrated above, over the years a great number of different approaches to description and assignment of protein subcomponents have been developed. Each of them has its benefits and disadvantages, and limiting the search to only one or two of them could adversely affect either sensitivity or specificity (or both) of the result. At the same time, manually rerunning the query on each of the databases is laborious and increasingly unmanageable as the amount of data that needs to be processed grows. It has thus become imperative to develop methods that would integrate different techniques in one place; the two best-known sites that do just that are InterPro and EASY ("expert analysis system").
InterPro. InterPro (3) brings together the data and analytical methods of the majority of the commonly used protein subcomponent databases: PROSITE, PRINTS, Pfam, ProDom, SMART, and TIGRFAMs. All entries from the member databases are manually reviewed and compared against each other to create composite InterPro entries, which contain all information found in each of the corresponding records of the member databases. Fully overlapping entities from different databases are merged into a single InterPro entry, whereas subtype entities (e.g., a PROSITE motif that is part of a Pfam domain) are given their own InterPro entries. Each InterPro entry is annotated with a functional description, literature references, and links back to the individual databases. Sequence searches are performed using all of the techniques employed by the member databases simultaneously.
EASY. Unlike InterPro, which combines different databases manually, EASY (42) aims to develop a computational integration approach (see Fig. 3 for interface screenshot). Its current implementation uses six databases in its searches: SWISS-PROT, PROSITE, PRINTS, Pfam, Profiles (8), and BLOCKS. Some of the databases have a local copy, and others are searched remotely (the software itself can be configured to search any of the databases either locally or remotely). The results of a sequence query from all of the databases are parsed for most commonly encountered words and phrases (frequently encountered but uninformative words like "protein" are excluded). The system has been designed to appreciate the semantic difference between protein sequence and subcomponent pattern databases; it uses this knowledge and the parsing results to hypothesize the identity of the protein encoded by the sequence and the family to which it belongs. The software has been empirically validated by the authors as well as numerous users, and its analytical "skills" continue to be refined.
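The parsing step described above, in which frequently recurring words are extracted from the combined search results while uninformative ones are skipped, can be sketched as follows. This is only an illustration of the idea and not the EASY software: it counts single words rather than phrases, and the stop-word list, the sample descriptions, and the function name `common_terms` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical list of frequent but uninformative words.
UNINFORMATIVE = {"protein", "proteins", "domain", "family", "putative",
                 "like", "the", "of", "and", "a", "in"}


def common_terms(descriptions, top_n=3):
    """Count in how many hit descriptions each informative word occurs."""
    counts = Counter()
    for text in descriptions:
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        counts.update(words - UNINFORMATIVE)
    return counts.most_common(top_n)


if __name__ == "__main__":
    hits = ["Protein kinase domain",
            "Serine/threonine protein kinase",
            "Tyrosine kinase receptor family",
            "ATP-binding protein"]
    print(common_terms(hits, top_n=1))
```

A word that recurs across hits from several independent databases, such as "kinase" in this made-up example, is a reasonable first hypothesis about the identity or family of the query protein.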
PORTALS
A number of other important web portals exist. EnsEMBL houses DNA and protein sequences from several genome projects (human, mouse, zebrafish, etc.) with automatic baseline annotation and provides links to other genomic resources. The Expasy Molecular Biology Server (http://kr.expasy.org/) develops the protein knowledge bases SWISS-PROT and TrEMBL and provides an entry point to a number of other bioinformatics web sites. In summary, web portals provide the invaluable service of bringing together homology resources from around the world and are recommended to any aspiring researcher as a starting point on the WWW.
CONCLUSION
ACKNOWLEDGMENTS
FOOTNOTES
Address for reprint requests and other correspondence: A. Turchin, Division of Endocrinology, Diabetes and Hypertension, Brigham and Women's Hospital, 221 Longwood Ave., Boston, MA 02115 (E-mail: aturchin{at}iname.com).
10.1152/physiolgenomics.00112.2002.
1 The Supplemental Table to this article is available online at http://physiolgenomics.physiology.org/cgi/content/full/11/3/165/DC1.
References