1 Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Marine Biological Laboratory, Woods Hole, Massachusetts 02543
2 Institut de Génétique et Microbiologie, Centre National de la Recherche Scientifique UMR 8621, Bâtiment 409, Université de Paris-Sud, 91405 Orsay Cedex, France
ABSTRACT |
---|
module; sequence similarity; protein family; predicting protein function; annotation; evolution
INTRODUCTION |
---|
The most thoroughly studied single-cell organisms are the bacterium Escherichia coli and the single-cell eukaryote Saccharomyces cerevisiae. Both organisms use mainstream metabolic pathways that are recognizably similar to the corresponding metabolic functions in all life forms including higher eukaryotes. The entire genome sequence has been determined for both organisms. The relationships between genetics and biochemistry that constitute the fundamental processes of life in these single-cell organisms serve in a sense as a foundation for ongoing investigations on the more elaborate processes that operate in the more complex, higher forms of life.
With over 60 years of intensive study yielding a voluminous scientific literature, a great deal of the physiology and molecular biology of E. coli is experimentally known. We recently updated the list of all the gene products and their immediate molecular functions (25). For E. coli, a high proportion, about half, of the genetically determined cell content is currently known directly by experiment. The genes are known in sequence and genetic location; the gene products (protein or RNA) have been characterized experimentally. A small fraction, 2.1%, are understood only in terms of their mutant phenotype. Function could be tentatively attributed to 29.5% of the total that are similar in sequence to genes of known functions in other organisms, but have not yet been experimentally checked in E. coli. Finally, 19.5% are similar to genes of unknown function in other organisms. Only 7% are currently specific to E. coli, and that number will decrease when comparisons with Salmonella species are complete. Thus we either know or have a good idea of what 81% of this organism's genes encode. To complete our understanding of the cell and its activities, we need to know not only what the gene products are and what they do, but also how the gene products interact with one another and how their activities are regulated.
In this study, the organization of protein families of E. coli is addressed. Some of the proteins are multimodular, as if they arose by fusion of two or more independent genes. To identify families of proteins related by sequence, it was necessary to identify multifunctional proteins formed of two or more proteins of separate function and unrelated sequence (13, 31). These components of multifunctional proteins are what we term "modules." Unlike a motif, a module represents an independent individual protein that in many cases has descended in the course of evolution as a single unit in some current genomes, but is found fused to another module in other genomes. Unless multimodular proteins are identified and the components treated as separate entities before collecting groups of proteins of similar sequence, false connections can be made, as diagrammed in Fig. 1. Figure 1 portrays a case in which at least nine groups of proteins with different functions can be incorrectly grouped together through a multimodular protein, RNE (Swiss-Prot accession no. P21513).
METHODS |
---|
Pairwise similarities between all E. coli proteins.
Exhaustive pairwise sequence comparison for each E. coli module against all other E. coli modules in the genome was performed using a locally installed Data Analysis and Retrieval With Indexed Nucleotide/Peptide Sequences package (DARWIN, version 2.0) obtained from the Computational Biochemistry Research Group at ETH, Zurich, Switzerland (10). The SearchPepAll and LocalAlignBestPAM functions were used to generate a list of all qualifying aligned pairs of peptides. These employ the dynamic programming of both Needleman-Wunsch (18) and Smith-Waterman (27) algorithms and test appropriate PAM ("accepted point mutations") scoring matrices for each sequence pair. The outputs contain information related to the alignment: identifiers of the sequence pairs, the start and end positions of alignment regions for both sequences, the PAM score, variance score for a panel of substitution matrices, and the percentages of sequence identity and similarity. (A PAM score of 0 signifies complete identity; higher values represent progressively less of a sequence match.) To deal with entire proteins rather than motifs and binding sites within proteins, we stipulated that all alignments have a minimal length of 100 residues. To collect evolutionarily distant relationships yet avoid artifacts, we set a limit for PAM at less than 200 to reduce false matches to a minimal level. Compared with the statistical cutoffs established by Altschul (1) for a significant sequence match (a minimum length of 83 residues and a maximal PAM distance of 250), our criteria are conservative. The DARWIN algorithms and use of multiple substitution matrices have been evaluated in relation to other sequence analysis approaches and have been given high credit for sensitivity and performance (21, 28).
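The two cutoffs stated above (alignment length of at least 100 residues, PAM distance below 200) can be illustrated with a minimal Python sketch. This is not DARWIN code: the `Alignment` record, its field names, the choice to measure length as the shorter of the two aligned regions, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass

MIN_ALIGNED_RESIDUES = 100  # minimal alignment length stated in the text
MAX_PAM_EXCLUSIVE = 200     # PAM distance must be less than this value


@dataclass(frozen=True)
class Alignment:
    """One pairwise alignment; field names are hypothetical, not DARWIN's."""
    query: str
    subject: str
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    pam: float


def aligned_length(a: Alignment) -> int:
    """Shorter of the two aligned regions, counting both end residues.

    Measuring length this way is an assumption of this sketch.
    """
    return min(a.q_end - a.q_start + 1, a.s_end - a.s_start + 1)


def qualifies(a: Alignment) -> bool:
    """Apply the length and PAM cutoffs described in the text."""
    return aligned_length(a) >= MIN_ALIGNED_RESIDUES and a.pam < MAX_PAM_EXCLUSIVE


# Made-up alignments for illustration only.
candidates = [
    Alignment("P1", "P2", 1, 250, 5, 260, 120.0),  # long and close: kept
    Alignment("P1", "P3", 10, 80, 12, 85, 90.0),   # too short: dropped
    Alignment("P2", "P4", 1, 300, 1, 310, 215.0),  # too distant: dropped
]
kept = [a for a in candidates if qualifies(a)]
```

Only the first made-up pair passes both cutoffs; the second fails on length and the third on PAM distance.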
Identification of modules.
Some proteins derive from compound genes such that direct translation of the genetic open reading frame (ORF) produces more than one protein functional unit. Other multifunctional proteins remain polycistronic polypeptides, expressing more than one function as a complex multisite protein. We identified such multimodular proteins in E. coli by the positions of regions of sequence similarity. By noting separate regions of alignment between pairs of proteins, the modular composition of genes with partners in the E. coli genome was identified. Table 1 lists examples of multimodular proteins and functions of the modules, giving the start residue and end residue of regions of pairwise alignment with other proteins. We set arbitrary threshold values for inferring more than one module in a protein on the basis of properties of some known multimodular proteins. We defined modules to occupy more than 25% and less than 80% of an entire protein. Adjacent modules in the same protein were not allowed to overlap by more than 10 residues. A heuristic method was developed to make these identifications (Le Bouder and Labedan, unpublished observations). After automatic processing, substantial hand adjustment and polishing were required to eliminate false positives and to complete missed connections.
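The threshold rules above (each module covering more than 25% and less than 80% of the protein, adjacent modules overlapping by at most 10 residues) can be sketched in Python as follows. This is only an illustration of the stated thresholds, not the unpublished heuristic itself; the function names, the inclusive `(start, end)` region format, and the 900-residue example protein are hypothetical.

```python
MIN_FRACTION = 0.25  # a module must cover more than 25% of the protein
MAX_FRACTION = 0.80  # ... and less than 80% of it
MAX_OVERLAP = 10     # adjacent modules may share at most 10 residues


def is_module_sized(start: int, end: int, protein_len: int) -> bool:
    """True if the inclusive region start..end is within the size thresholds."""
    fraction = (end - start + 1) / protein_len
    return MIN_FRACTION < fraction < MAX_FRACTION


def overlap(a: tuple, b: tuple) -> int:
    """Number of residues shared by two inclusive (start, end) regions."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def compatible_modules(regions: list, protein_len: int) -> bool:
    """True if two or more regions could be separate modules of one protein."""
    ordered = sorted(regions)
    if len(ordered) < 2:
        return False
    if not all(is_module_sized(s, e, protein_len) for s, e in ordered):
        return False
    return all(
        overlap(ordered[i], ordered[i + 1]) <= MAX_OVERLAP
        for i in range(len(ordered) - 1)
    )


# Hypothetical 900-residue protein with candidate alignment regions.
two_clean_modules = compatible_modules([(1, 440), (450, 890)], 900)
overlapping = compatible_modules([(1, 440), (400, 890)], 900)
too_small = compatible_modules([(1, 100), (450, 890)], 900)
```

In this sketch the first case is accepted, the second is rejected because the regions share 41 residues, and the third is rejected because one region covers only about 11% of the protein.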
Assembly of module-based sequence-similar groups.
The modules, instead of the whole proteins, were assembled into sequence-similar groups using a single-linkage chain clustering method to place all paired modules into groups whose memberships did not overlap. These processes were performed using the programs Module and Families, recently developed in the Labedan group (Le Bouder and Labedan, unpublished observations).
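Single-linkage clustering of paired modules into nonoverlapping groups can be sketched with a union-find structure, as below. This is a generic illustration of the technique, not the Module or Families programs; the function name and the module identifiers are hypothetical.

```python
def single_linkage_groups(pairs):
    """Group items so that any two linked by a chain of pairs share a group."""
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for item in parent:
        groups.setdefault(find(item), set()).add(item)
    # Largest groups first; ties broken alphabetically for stable output.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Hypothetical module identifiers; each tuple is one qualifying aligned pair.
module_pairs = [("modA", "modB"), ("modB", "modC"), ("modD", "modE")]
module_groups = single_linkage_groups(module_pairs)
```

Here `modA` and `modC` land in the same group through `modB` even though they were never paired directly, which is the transitive behavior that makes it important to split multimodular proteins before clustering.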
Genealogical trees of protein groups.
To analyze the genealogical relationships among members of the larger paralogous groups, protein sequences corresponding to the involved modules were extracted and a distance matrix in PAM values for the whole group was generated using DARWIN. Trees were inferred using Fitch-Margoliash and least-squares distance methods (6) in PHYLIP (5) based on the above distance matrix.
Statistical analysis.
The distribution patterns for PAM values are expressed in histograms. The differences of distribution patterns between groups were analyzed with the Wilcoxon rank sum test using StatView from SAS Institute (Cary, NC).
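For readers unfamiliar with the test, the Wilcoxon rank sum statistic can be computed as in the following minimal Python sketch. It is not StatView's implementation: it uses midranks for ties, a normal approximation without tie correction of the variance, and made-up PAM values.

```python
import math


def rank_sum(x, y):
    """Return (W, z): rank sum of sample x in the pooled data and its z-score.

    Ties receive midranks; the variance is not corrected for ties.
    """
    pooled = sorted((value, idx) for idx, value in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = midrank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(ranks[:n1])
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return w, (w - mean) / sd


# Made-up PAM distances for two hypothetical protein categories.
group_a = [95.0, 120.0, 137.0, 150.0]
group_b = [140.0, 165.0, 172.0, 188.0]
w_stat, z_score = rank_sum(group_a, group_b)
```

A strongly negative z-score indicates that the first sample tends to have lower PAM values than the second; a two-sided p-value would follow from the standard normal distribution.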
RESULTS |
---|
A simple example is the case of the ADHE protein in E. coli. The fact that this protein is an alcohol dehydrogenase and has aldehyde reductase activity as well could lead one to assume that the two related activities reflect merely reversibility of the reaction at a single catalytic site. But an examination of the alignment regions of ADHE produced by DARWIN, or by BLAST2, immediately tells us that these two catalytic activities are located in two sequence-unrelated regions of the protein, with the aldehyde reductase activity located in the NH2-terminal region and the alcohol dehydrogenase activity in the COOH-terminal region. Clearly, ADHE is composed of two modules.
Of the 2,415 proteins with at least one sequence-related partner in E. coli, 287 proteins (11.8% of all paralogs) were identified as multimodular proteins, most of which contained two modules, some more. Of the 287 multimodular proteins, 229 have modules in more than one paralogous group and 58 have 2 or more modules in the same group. The former, larger class represents the occurrence of gene fusion, and the latter, smaller class represents internal duplication in the past. Both types were present in 40 of the proteins that contained 3 or more modules. Table 1 presents a sample list of multimodular proteins and the functions of the parts of the proteins, known or predicted. For example, protein ARAG has two modules belonging in the same group, both ATP binding cassettes of a transport protein, whereas protein PTAA, another type of transporter, contains three modules, three components of phosphotransferase enzyme II (A, B, and C). Combinations of the above two situations also exist. For instance, YHIH is composed of two ATP-binding components and a membrane component of an ATP-binding cassette (ABC) transporter (data not shown).
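The distinction drawn above between gene fusion (modules in different paralogous groups) and internal duplication (two or more modules in the same group) can be expressed as a small Python sketch. The function name and the group labels are hypothetical; the three examples follow the module compositions described in the text for ARAG, PTAA, and YHIH, under the assumption that the three PTAA components fall in three different groups.

```python
from collections import Counter


def classify_protein(module_groups):
    """Classify a protein from the paralogous-group label of each module."""
    counts = Counter(module_groups)
    fused = len(counts) > 1                               # different groups
    duplicated = any(n >= 2 for n in counts.values())     # repeats in a group
    if fused and duplicated:
        return "both"
    if duplicated:
        return "internal duplication"
    if fused:
        return "gene fusion"
    return "unimodular"


# Group labels are illustrative placeholders, not GenProtEC identifiers.
arag = classify_protein(["ABC_ATP_binding", "ABC_ATP_binding"])
ptaa = classify_protein(["EII_A", "EII_B", "EII_C"])
yhih = classify_protein(["ABC_ATP_binding", "ABC_ATP_binding", "ABC_membrane"])
```

Under these assumptions ARAG is classified as an internal duplication, PTAA as a gene fusion, and YHIH as showing both.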
The total of 2,745 modules identified as having at least one homologous partner within E. coli were collected into groups having similar sequence by a clustering method by following paths of likeness among the pairs as described in METHODS. The groups were constructed transitively so that not all members of a group have detectable relatedness to all other members of the group, but no member is related in sequence to a member of any other group. There were 609 sequence-related nonoverlapping groups within the E. coli genome. The remaining 1,871 proteins, which we refer to as singles, do not match any of the other sequences in the same genome based on our criteria.
Homologous proteins encoded by genes in a single genome are defined as paralogs (7) and are believed to have arisen in the course of evolution by gene duplication followed by divergence in both sequence and function. Homologous sequences in different organisms are defined as orthologs.
Detailed information for the full list of the multimodular proteins and paralogous groups of E. coli with the position of each module within the proteins has been deposited into the GenProtEC web site (http://genprotec.mbl.edu).
Anatomy of the sequence-related groups.
Not all the groups are the same size or configuration. Groups differ in number of members ranging from 2 to 94. The number of groups of a given size is inversely related to how many members are in the group. There are many more instances of groups of two members (pairs) than of any larger size. There was only one instance of each of the groups of size larger than 28 (Table 2).
Many pairs of paralogs are isozymes that catalyze identical or very closely related reactions. For instance, there are two paralogous alanine racemases in E. coli K-12, one catabolic, the other biosynthetic. Some polymeric enzymes are also closely related in this way. The α-subunits of the isozymes ribonucleoside diphosphate reductase 1 and 2 (protein names PIR1 and PIR2, gene names nrdA and nrdE) are in one group, whereas their β-subunits, PIR3 and PIR4 (gene names nrdB and nrdF), make up another group. They differ only in the redox cofactor preferred. Among the 160 enzymes in E. coli that have at least one isozyme partner listed by Riley and Serres (23), 132 (82.5%) are found in pairs or triplets related by sequence with scores better than our threshold values.
In most of these large protein groups, there is a relationship between function and deeply branching families. Figure 3 is a distance matrix tree of a family of the 79 proteins (94 modules) of a group of ATP-binding subunits of the multimeric ABC transporters. Substrate specificities are experimentally known for 52 of the 94 group members, while 42 are annotated as putative ATP-binding components of ABC transporters with no predicted substrate information. In the tree for the protein sequences (Fig. 3), clustering is observed for different types of substrates transported: amino acids, oligopeptides, and 5- and 6-carbon sugars. The clustering represents a likely evolutionary scenario by which differentiation of replicate ATP-binding components generated families of proteins similar for a given type of substrate. Such consistency provides an opportunity to supplement annotation of genes of an organism with information from internal paralogs in addition to the more customary orthologous sequence matches. Internal groupings (paralogous relationships) bring evolutionary relationships to bear where subsets of sequence-similar groups may contain information on type of substrate and type of reaction. In this particular case, we can predict the types of substrates used by the 11 proteins with unknown substrates in the 3 branches that show good conservation for the transported substrates. The topology of the tree agrees in principle with the one constructed with fewer members by Linton and Higgins (15).
In addition, as illustrated by Fig. 4, the levels of sequence similarity in pairwise matches within families of enzymes, transporters, and regulators have different distributions of PAM distances. The pairwise sequence similarity between related enzymes ranges widely and has a median PAM value of 148 (Fig. 4A). In contrast, regulators are not as spread out at either the low or high ends as the enzymes are and, instead, have more instances in the midrange (Fig. 4C). The narrower distribution has a lower median PAM (137) than the enzymes. Transporters range as widely as enzymes, but their PAM values are more clustered at higher values and have the highest median PAM value (165) among the three types. A larger fraction of the sequence-related transporters have a PAM value over 170 compared with either the enzymes or the regulators, indicating they require the least conservation for their function as transporters compared with the other two functional categories (Fig. 4B). Figure 4D shows the differences in distribution profiles in terms of percentile as a function of PAM value.
DISCUSSION |
---|
Data on protein families internal to one organism also bear on evolutionary mechanisms, allowing us to ask whether different categories of proteins have different characteristics of molecular evolution (9, 17). The field of genomics has provided us with complete genomic sequences of over 40 organisms, most of them unicellular. In each case, the composition of protein families can be a window on protein evolution and the origin of life. A generally shared view of protein evolution is that a diversity of proteins was present in the last universal common ancestor (LUCA) (4, 30), namely, the collection of cellular entities that provided metabolism, cell components, macromolecule synthesis, and cell division capabilities. Early genes duplicated and diverged. The descendants are present in the majority of known life forms today. Thus the sequence-related groups of proteins found in E. coli that are also present in nearly all organisms in the tree of life must trace back to early ancestors. Other groups present only in bacteria, for instance, or in certain kinds of bacteria, must have arisen more recently after divergence of bacteria from other domains of the tree of life (Fig. 6).
One possible explanation is that transporters and regulators may have found a few "winning formulas" that were simply varied for specificity over and over, meeting the needs of the cell with this level of particularity. By contrast, enzymes may have diverged to a much greater extent to meet the very numerous and specific catalytic needs for the complex metabolic networks of the cell. A comparable analysis for functionally distinct paralogous groups for other genomes will tell us whether all transporters, regulators, and enzymes exhibit the same size distributions of internal protein families and thus that the course of evolution of types of proteins tends to be universal among all organisms.
Both amino acid sequence and function are closely related in the paralogous groups of E. coli. Particular types of transporters cluster together, types of transcription regulators cluster, and types of enzymes cluster (Table 3). For instance, looking at the larger families of transporters, the three types of subunits of the multimeric ABC transporters each cluster together by type (the membrane components, the ATP-binding components, the solute-binding transmembrane periplasmic subunits). Also, subsets of the major facilitator superfamily (MFS) and the amino acid/polyamine/choline (APC) superfamily of amino acid transporters each cluster by sequence similarities. In most cases the sequence-relatedness of transporters is consonant with the classification by function of types of transporters (Ref. 20; http://www.biology.ucsd.edu/ipaulsen/transport/).
Types of regulators also group by function and sequence. For transcriptional repressors of the LysR type or GntR type, transcriptional activators of the AraC type, and other groups of regulators in E. coli, the grouping by sequence agrees with grouping by class of regulator (26) (http://www.cifn.unam.mx/Computational_Genomics/regulondb). Also the modules of sensor and response regulator of two-component regulators each belong to distinct sequence families, sometimes joined in multimodular proteins, sometimes in unimodular, separate proteins (29).
Types of enzymes with distinct similarities in catalytic properties also cluster by sequence. Families such as ATP-dependent helicases, GTP-binding proteins, methyltransferases, acyl transferases, transaminases, sugar kinases, dehydratases, and acetyl-CoA synthetases each fall into a discrete sequence-similar family. Members of each family are related by chemistry of reaction but differ in substrate specificity. Other enzyme activities have more than one solution and can be achieved in more than one way. Such types of enzymes split into several families. For instance, there are several sequence types of NAD(P)-requiring dehydrogenases that separate into more than one family.
Multimodular proteins.
Multimodularity of some proteins has long been known (13). Their existence introduces complications to evolutionary reconstructions unless the individual components are identified and treated separately. Some genes are composed of multiple independently functioning modules that seem to have separate evolutionary histories but have come together by some processes of recombination leading to gene fusion.
Not all proteins in E. coli that are known experimentally to be multimodular were detected by our procedures, either because there are no paralogs within E. coli to one or more of the modules or because such relationships are not visible above the relatively conservative threshold we used to define sequence similarity. For example, with no other sequence similar to homoserine dehydrogenase detected within the E. coli genome, only the aspartokinase modules of AK1H (gene thrA) and AK2H (gene metL) were detected. There are, however, separate orthologs to both modules of thrA and metL in other organisms, for both the homoserine dehydrogenases and the aspartokinases. Thus additional modules can be identified by searching for orthologous matches. Therefore the actual number of multimodular proteins in E. coli is higher than reported here.
Cautions for functional annotation.
Any use of sequence similarity, such as for functional annotation, must recognize and operate with the appropriate alignment region and the unit of similarity, the module, as we have done here, rather than with complete genes or proteins when these are complex. For instance, similarity to the E. coli AK1H only in the NH2-terminal half should not be taken to indicate homoserine dehydrogenase activity in a homolog since the alignment was limited to the aspartokinase region. With respect to functional analysis, misassignment of the function of one module to any protein matching the other module instigates a chain of errors. Such misattributions contaminate databases but could be avoided by confining conclusions about functional similarity to the correct matching regions of sequence similarity between the query and the subject sequences.
Also, for different reasons, caution is needed when transferring the exact function of a known protein to another of unknown function based on sequence similarity. Sequence similarity does not always imply close similarity of catalytic function. There are functionally diverse superfamilies of proteins that complicate the annotation of function by sequence similarity. There are sequence-related enzyme families that are catalytically diverse, retaining an underlying similarity of the structure of the active site, but using the same chemical mechanism for different overall reactions (9). Examples among E. coli sequence-similar groups are the SDR superfamily (Kerr A and Riley M, unpublished observations) and the crotonase superfamily (McCormack T and Riley M, unpublished observations). In cases where a sequence-related group is not immediately recognized as being a superfamily of diversified proteins that catalyze related but different reactions, the attribution of an exact function of a known protein to an unknown protein can be entirely wrong.
Another kind of relationship in sequence-similar families is preservation of substrate with divergence of function. For instance, the sequence-related proteins RBSR and transporter RBSB are examples of proteins that have maintained ligand specificity (ribose) while changing action of the protein (16). RBSR is a regulator for the ribose operon, and RBSB is a transporter for ribose. The E. coli family to which this pair belongs contains both periplasmic binding proteins and transcriptional regulators. Although mode of evolution of function is interesting in these kinds of cases, unfortunately whenever protein families contain members of different function, in this case regulators and transporters, difficulty arises for accurate function prediction of unknown proteins of similar sequence.
That being said, one should not place too much emphasis on the cases where sequence similarity indicates membership in a protein family of diverse functionality, since the great majority of the sequence-similar protein families in E. coli unambiguously share a primary function. Most share chemistry of reaction and differ only in specificity of ligand/substrate (17).
Applications of paralogous protein families in genome annotation.
The results reported in this research provide useful annotation information in at least three ways. First, existing paralogous relationships of sequence similarity within a genome can be useful in attributing function to unknown proteins when there is little useful information from orthologs in current databases. The method of transitive assembly of the paralogous groups leaves some members of a group only marginally connected, yet where functions of the most distant members are known, the functions are usually clearly related to those of the group as a whole. Thus information can be derived even when the degree of sequence conservation between two paralogous proteins is sometimes below the standard threshold for detecting sequence similarity among orthologs. Internal paralog families are particularly useful in cases where the genes are unique to the studied genome or orthologs have not been found in other organisms. This is reflected in the fact that in E. coli paralogs as a class have a much lower percentage of unknown members compared with singles (8.9% vs. 35.2%) (Table 4). In our experience with E. coli K-12 and Halobacterium NRC-1 genomes, we were able to make putative assignments for at least 10% of genes by using this approach (19; http://zdna.micro.umass.edu/haloweb/).
Second, identification of multimodular proteins and location of the correct functions to the correct parts of the proteins improves the accuracy of the annotation. It is clear that multimodular proteins exist in all organisms, and in many cases only one of the functions is currently known. More complete information and location of the activities will correct misattributions and errors of omission in the annotations in current genomic databases.
Third, by examining the evolutionary relationships among the members of larger paralogous groups, additional information such as the type of enzymatic reaction or the substrate specificity may be obtained for those member proteins that are currently characterized only with a putative general function. Also, considering the finding that different degrees of sequence conservation exist among different types of proteins such as enzymes, regulators, and transporters, different thresholds may be optimal for different function types. Whereas putative transport function may be assigned with good confidence using a marginal sequence similarity, for regulators and even more so for many enzymes, level of sequence similarity may need to be more conservatively defined.
Conclusions.
Sequence-related protein families within a single organism which are assembled with special attention paid to the existence of multimodular (composite) proteins have useful applications both in understanding elements of molecular evolution and in improving genome annotation. Paralogous protein families each presumably descended from an individual ancestral protein can be inferred from families of sequence-similar proteins encoded within an individual genome. We found that most such sequence-related families contained proteins with the same type of function. In a few cases not expanded on here, there is divergence of function among group members, showing how sequence divergence among similar family members can lead to further divergence of function. Generally speaking, paralogous group membership provides a basis for assigning putative functions to unknown members, which is particularly useful when no information is available through orthologous matches. In addition, the approach improves the ability to identify multimodular proteins, locating specific functions to different parts of a protein. Membership in well-defined clusters within large paralogous groups affords the opportunity for even more specific functional characterization. Therefore, the approaches suggested here could be useful additions to existing methods for genome annotation.
ACKNOWLEDGMENTS |
---|
This work was supported by National Aeronautics and Space Administration Astrobiology Institute grant NCC2-1054 and the Merck Genome Research Institute.
FOOTNOTES |
---|
Address for reprint requests and other correspondence and present address of P. Liang: Department of Cancer Genetics, Roswell Park Cancer Institute, Elm & Carlton Streets, Buffalo, NY 14263 (E-mail: Ping.Liang{at}RoswellPark.org).
10.1152/physiolgenomics.00086.2001.
References |
---|