From the Center for Structural Biology, Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, Kentucky 40536-0298
ABSTRACT
Some previous studies of protein simple sequences have used somewhat limited protein databases and have not necessarily compared organisms (5, 6, 17, 18). Other surveys have considered whole proteomes but often remove sequences considered redundant (19, 20). There are a number of surveys where simple sequences enriched in a particular residue type or associated with a particular function have been examined (2, 3, 9, 14). Some recent studies have focused on comparisons between organisms (10, 21–23) but have mostly considered only homopolymeric sequences. Our current study differs from prior work in that we use only intact proteomes from fully sequenced genomes, including sequences annotated as hypothetical proteins. We focus solely on non-overlapping simple sequences, of 10 or more residues in length, highly enriched in a single residue type (≥50% composition). This approach provides an unbiased view of the distribution of this set of protein simple sequences as well as allowing for ready comparison of their occurrence in the organisms examined. The eukaryotes surveyed, namely a yeast, worm, fruit fly, and plant, comprise a diverse sample of members of the eukaryote kingdom. We have chosen not to include the human proteome given the current uncertain state of its completion. In addition, for comparison we have surveyed 26 prokaryotes, including 12 Archaea, two cyanobacteria, and six Gram-negative and six Gram-positive bacteria.
We find that highly enriched simple sequences are remarkably common in all of the organisms examined. Eukaryotes are found to possess more simple sequences per protein than do the prokaryotes in keeping with the findings of other groups (19, 21, 23). The occurrence of prokaryote proteins containing simple sequences is linearly correlated with proteome size. Given the limited number of organisms examined, it is not clear that this is the case for the eukaryotes. Perhaps most notably, each organism examined possesses its own unique distribution of simple sequences. We find that simple sequences display surprising length dependences with some residues preferentially populating long simple sequences regions, while others clearly prefer short simple sequences. There is no discernible correlation with residue occurrence. For example, leucine-enriched sequences appear to be discriminated against despite leucine being the most common residue in most organisms. Some observed length dependences can be explained in structural and functional terms, although many remain enigmatic. We have also found that simple sequence distributions vary according to functional groupings. For example, leucine-rich regions, despite being discriminated against in the overall distributions, are among the most common simple sequences found in membrane-associated proteins. It is clear from the sheer number found that all organisms examined, particularly eukaryotes, tolerate, and perhaps even require, large numbers of protein simple sequences. The data presented here will provide the basis for future studies of these ubiquitous and potentially extremely important sequences.
EXPERIMENTAL PROCEDURES
We represent a protein sequence of length L as a string, a1a2a3a4 . . . aL, where ai is the residue at position i. When searching for a simple sequence enriched in a certain residue type, the numerical positions in the protein string for that residue are first generated as a string of i values. Putative simple sequences are extracted based on the positions of the i values, given that gaps of 6 or more residues in length are not allowed within a simple sequence. Putative simple sequences of many lengths are identified, with all i values corresponding to the residue of interest being output. Because only the residue of interest is selected, the process automatically generates only sequences that begin and end with the residue of interest. Subsequent filtering removes sequences that are shorter than 10 residues. Remaining sequences are tested to satisfy the 50% threshold for the residue of interest. Sequences that do not satisfy the criteria are further analyzed to determine whether shorter simple sequences satisfying our criteria are within them. The entire process results in the identification of all non-overlapping simple sequences within the proteomes that satisfy all four of the above criteria. The computer programs used to identify simple sequences were written in Python/C++ and executed on a Silicon Graphics workstation.
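The search described above can be sketched in Python roughly as follows. This is a minimal illustration and not the original program: the function names are hypothetical, and the way a failing candidate is re-examined for shorter qualifying sequences (a greedy left-to-right scan that keeps the longest passing span from each start) is an assumption, since the text does not specify that step.

```python
def find_simple_sequences(seq, residue, min_len=10, max_gap=5, min_frac=0.5):
    """Return non-overlapping (start, end) spans, 0-based with end exclusive.

    Each span begins and ends with `residue`, is at least `min_len` long,
    contains no run of more than `max_gap` other residues, and is at least
    `min_frac` composed of `residue`.
    """
    positions = [i for i, aa in enumerate(seq) if aa == residue]
    if not positions:
        return []

    # Chain together positions separated by at most `max_gap` other residues.
    clusters, current = [], [positions[0]]
    for pos in positions[1:]:
        if pos - current[-1] - 1 <= max_gap:
            current.append(pos)
        else:
            clusters.append(current)
            current = [pos]
    clusters.append(current)

    spans = []
    for cluster in clusters:
        spans.extend(_qualifying_spans(cluster, min_len, min_frac))
    return spans


def _qualifying_spans(cluster, min_len, min_frac):
    """Assumed refinement step: from each start, keep the longest passing span."""
    spans, i = [], 0
    while i < len(cluster):
        chosen = None
        # Try the longest candidate first, then progressively shorter ones.
        for j in range(len(cluster) - 1, i, -1):
            length = cluster[j] - cluster[i] + 1
            count = j - i + 1
            if length >= min_len and count / length >= min_frac:
                chosen = j
                break
        if chosen is None:
            i += 1
        else:
            spans.append((cluster[i], cluster[chosen] + 1))
            i = chosen + 1  # continue after the span so results never overlap
    return spans
```

For example, `find_simple_sequences("MKQQQQQQQQQQLV", "Q")` yields the single span covering the ten glutamines, while two ten-residue runs separated by six other residues are reported as two separate simple sequences.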
We use the Poisson distribution (9, 24) to model the probability of random occurrence of simple sequences containing a given residue type in the eukaryote proteomes. This is given by
$$f(n) = \frac{e^{-m}\, m^{n}}{n!} \qquad \text{(Eq. 1)}$$
where f(n) is the probability of an event happening n times. In our studies l is the length of the simple sequence, n is the threshold value, and m is derived from
![]() | (Eq. 2) |
The expected number of simple sequences of length l in a proteome is then
![]() | (Eq. 3) |
where Tl is the total number of sequence windows of length l in the proteome.
The difference between the actual number of simple sequences, SSTot, of length l found and the number expected from the Poisson distribution is then
$$\Delta = SS_{Tot} - SS_{expect} \qquad \text{(Eq. 4)}$$
For simple sequences longer than about 25 residues, SSexpect is essentially zero, in which case the difference is equal to the number of simple sequences found. Finally, to compare the occurrence of simple sequences among organisms, we define R as follows:
$$R = \frac{SS_{Tot} - SS_{expect}}{\text{number of proteins in the proteome}} \qquad \text{(Eq. 5)}$$
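The quantities in this section can be illustrated with a short Python sketch. Only the Poisson probability, the observed-minus-expected difference, and the per-protein ratio R follow directly from the text. The choices m = l × (fractional occurrence of the residue), n = ⌈l/2⌉ for the 50% threshold, and an expected count of Tl × f(n) are assumptions made here for illustration, and all function and parameter names are hypothetical.

```python
import math


def poisson_pmf(n, m):
    """Poisson probability of exactly n events when the mean is m (Eq. 1)."""
    return math.exp(-m) * m ** n / math.factorial(n)


def expected_simple_sequences(length, residue_fraction, n_windows):
    """Expected count of simple sequences of a given length.

    Assumptions (not taken from the text): the Poisson mean is
    length * residue_fraction, the threshold n is half the window length
    rounded up, and the expectation is n_windows * f(n).
    """
    m = length * residue_fraction
    n = math.ceil(length / 2)
    return n_windows * poisson_pmf(n, m)


def excess_per_protein(ss_found, ss_expected, n_proteins):
    """R: simple sequences found above expectation, per protein in the proteome."""
    return (ss_found - ss_expected) / n_proteins
```

Under these assumptions the expected count falls off steeply with length, which is in line with the remark that SSexpect is essentially zero for long simple sequences.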
RESULTS AND DISCUSSION
Inclusion of Potentially Incorrect Protein Sequences
We have chosen to include all complete protein sequences in the proteomes that we have examined. This includes those marked hypothetical, putative, or probable and those proteins that have not as yet been annotated. Redundant sequences have also been included. This choice was made so as to be able to perform a more complete analysis of the proteomes, leading to an "unbiased" view. It is possible that some of the simple sequences found come from sequences that are not expressed as proteins. Bork and Copley (25) have pointed out that the identification of genes in sequenced genomes is difficult. It is particularly difficult for eukaryote genes where the identification of exons is error-prone. Ideally the analyses presented below should be repeated leaving out those proteins marked hypothetical or not annotated. This is, however, extremely difficult due to the wide variety of annotations used to denote such putative protein sequences. We have thus chosen to present the analyses of the complete proteomes with the caveat that some of the results may be slightly skewed by the presence of incorrect protein sequences.
Abundance of Protein Simple Sequences
All of the organisms surveyed possess a remarkable number of simple sequences in their proteomes (Table I). The number found ranges from 251 in the small proteome of MG (480 proteins) up to 27,542 in the proteome of AT (26,496 protein sequences surveyed). Furthermore, a remarkable fraction of proteins in each proteome possess at least one simple sequence. Fig. 1a is a plot of the number of proteins possessing one or more simple sequences, ProtSS, against the number of proteins in each proteome. At first glance one might deduce that there is a linear relationship between the number of simple sequence-containing proteins and the total number of proteins. The line of best fit drawn in Fig. 1a has a correlation coefficient of 0.99. However, the eukaryotes possess significantly larger proteomes than do the prokaryotes and consequently far more simple sequences. In effect, the fit to the data is reduced to a fit to five points, the four eukaryotes plus the prokaryotes essentially as a single point.
Fig. 1b is a plot of ProtSS against the number of proteins in each proteome for the 26 prokaryotes surveyed. There is a clear linear correlation with the line of best fit having a correlation coefficient of 0.92. Two prokaryotes, the archaeon HS and the bacterium DR, appear to be outliers. Excluding these from the fit results in a correlation coefficient of 0.96. The strong linear correlation observed for the prokaryotes might suggest that these simple sequences have arisen via random events, leading to random distributions that depend only upon the number of proteins in each proteome. As will be demonstrated below, however, our data suggest the opposite, that the occurrence and distributions of simple sequences are not random in nature and that many of these sequences may possess biological significance.
Fig. 1c, a bar plot of the ratio of number of simple sequences found, SSTot, to ProtSS for each organism surveyed illustrates the difference in occurrence of protein simple sequences in prokaryotes and eukaryotes. Prokaryotes have far fewer simple sequences per protein than do the eukaryotes. In all cases, the prokaryotes have fewer simple sequences than the total number of proteins in their proteomes, whereas the eukaryotes possess more (Table I). The prokaryotes average 1.40 simple sequences per protein possessing at least one simple sequence (the dashed line on Fig. 1c). Once again, HS and DR are clear outliers among the prokaryotes, possessing SSTot/ProtSS ratios of 1.68 and 1.73, respectively, both values greater than 2 standard deviations from the mean for prokaryotes. The eukaryotes have ratios that range from 1.88 in AT through 2.09 in CE and 2.18 in SC up to 3.09 simple sequences per protein possessing at least one simple sequence in DM. Eukaryotes clearly not only tolerate a significantly higher occurrence of these sequences than do the prokaryotes, they are also more likely to possess multiple simple sequences in each protein.
The ratio SSTot/ProtSS is of course dependent upon our definition of protein simple sequences. One can imagine that increasing the size of the allowable gap (currently set at 5 or fewer residues) will result in some of the simple sequences merging, resulting in fewer overall but an increase in the number of longer sequences. The result will be lower values of SSTot/ProtSS for each proteome.
A number of groups have examined the occurrence of homopolymeric runs of sequence and noted that eukaryotes possess more per protein than do prokaryotes (19, 21, 23). Nishizawa et al. (23) note that "modern" tissue-specific proteins have a higher tendency to possess homopolymeric stretches of up to 20 residues in length as compared with ancient proteins. They go on to postulate that this repetitiveness enhances the chance for intermolecular interactions. This hypothesis is supported by observations that simple sequences enriched in glutamine, proline, or charged residues are often found in protein interaction domains of transcription regulatory proteins (25) and that proline-rich sequences are common protein-protein interaction domains (11, 12). It seems likely then that eukaryotes, in particular the multicellular organisms, have evolved to require numerous protein simple sequences for functional purposes.
It is not clear why HS and DR would be outliers among the prokaryotes in Fig. 1. HS is an extreme halophile (26), the only one in the set of organisms surveyed. It is tempting to postulate that HS might possess a higher proportion of simple sequences as a result of evolving to survive in such an unusual environment. Ng et al. (26) pointed out that 36% of the putative proteins in the HS proteome were unrelated to any previously reported at that time and that these proteins may well provide the mechanisms by which HS can survive extreme salt concentrations. However, the HS proteome has not been analyzed in sufficient detail to know whether those proteins are particularly enriched in simple sequences, so we cannot draw any conclusions at this point.
DR has been nicknamed "Conan the bacterium" for its amazing ability to resist very high doses of ionizing radiation and UV irradiation (27) and is the only organism surveyed to possess these remarkable traits. It has been speculated that the radiation resistance of DR is due to its unique polyploid nature and the abundant DNA repeat elements in its genome. These DNA repeats may function to regulate DNA degradation after damage to this organism. The high number of protein simple sequences identified in this species may be attributed to such repeats although not to the polyploid nature of DR. This organism possesses more simple sequences per protein than do other prokaryotes (Table I). Simply possessing multiple copies of each gene would not raise the number of simple sequences per protein. The protein simple sequences may have arisen over time as a result of errors made by the DNA repair apparatus of DR while "rebuilding" its genome from multiple gene copies after exposure to extreme conditions such as radiation. On the other hand, some of these simple sequences may play an active role in the survival mechanisms developed by DR. Further functional analysis of the DR proteome is required to better understand why this organism possesses so many protein simple sequences.
For clarity, the remainder of this article focuses on the occurrence and distributions of protein simple sequences in eukaryotes.
Overall Length Distributions
Fig. 2, a log-log plot of the number of simple sequences found against simple sequence length, is a clear illustration of the remarkable simple sequence length distributions observed in the four eukaryotes examined. Prokaryotes display similar length distributions, although generally the longest prokaryotic simple sequences are shorter than the longest eukaryotic sequences (data not shown). At the shorter simple sequence lengths a periodicity in the data can be seen with there being fewer occurrences where the length is an odd number as compared with adjacent even-numbered lengths. This is a consequence of the algorithm used to identify the simple sequences. As an example, given the threshold of 50%, a simple sequence 11 residues long must possess at least 6 residues of a given type. This amounts to a minimum of 55% enrichment, whereas a 12-residue simple sequence can also possess 6 residues, leading to a minimum of 50% enrichment. This periodicity tends to be damped out at long simple sequence lengths.
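The odd/even effect described above follows from simple arithmetic, which the following short Python sketch reproduces. The helper names are hypothetical; the only input taken from the text is the 50% threshold.

```python
import math


def min_count(length, threshold=0.5):
    """Smallest number of residues of one type that meets the threshold."""
    return math.ceil(length * threshold)


def min_enrichment(length, threshold=0.5):
    """Lowest composition a simple sequence of this length can actually have."""
    return min_count(length, threshold) / length


# Odd lengths are held to a stricter effective threshold than even lengths.
for length in range(10, 16):
    print(length, min_count(length), f"{100 * min_enrichment(length):.1f}%")
```

The gap between odd and even lengths shrinks as the length grows (an effective minimum of 6/11 at length 11 but 51/101 at length 101), consistent with the periodicity being damped out for long simple sequences.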
The longest simple sequence was found in AT, is 410 residues long, and is enriched in glycine. AT is not alone in possessing remarkably long simple sequences. The longest in SC is 246 residues long and is enriched in serine. The longest in CE is threonine-rich and is 291 residues long, while DM possesses a 322-residue-long glycine-rich sequence. Notably, all four of these simple sequences occur in proteins that have been annotated as being hypothetical. The majority of the simple sequences found are of course much shorter than these, the vast majority being 60 or fewer residues in length (99.5%; Fig. 2).
Fig. 3 is a bar plot of the ratio of the number of simple sequences found to the number of proteins in the proteome as a function of length for the four eukaryotes. Division by the proteome size allows for direct comparison of the organisms. The data are split into three length scales: 10–20 residues (Fig. 3a), 20–40 (Fig. 3b), and 40–60 (Fig. 3c). The periodicity observed in Fig. 2 is obvious in Fig. 3a and can be seen to have subsided in Fig. 3b. It is clear from Fig. 3 that DM averages more simple sequences per protein at all lengths than do the other organisms despite AT possessing more in total and CE possessing a similar number (Table I). In fact, DM possesses more than twice as many simple sequences of lengths ≥20 residues per protein than do any of the other three eukaryotes examined. Clearly DM has evolved to tolerate large numbers of simple sequences. What is not entirely clear is whether this observation is linked to the functional requirements of DM. Nishizawa et al. (23) have pointed out that neural and immune system-specific proteins have a higher propensity to possess short runs of sequence consisting entirely of one residue type. One could reasonably expect that this would extend to the highly enriched simple sequences found in this survey. If so, it may not in fact be surprising that DM possesses such an abundance of these simple sequences as compared with the other eukaryotes examined. An analysis of the functions of the proteins in DM possessing simple sequences will shed light on this as will surveys of the proteomes of other eukaryotes as they become available.
Residue Length Dependences
From Figs. 2 and 3 it would appear that the four eukaryotes examined have similar protein simple sequence distributions albeit with differences in relative abundance. Striking differences between the organisms are revealed when simple sequence distributions are considered at the level of individual residue types. Fig. 4 shows the ratio of the number of simple sequences found above that expected from the Poisson distribution to the number of proteins in each organism's proteome, R, plotted against simple sequence length for each residue. The sequence lengths are binned into ranges: 10–20 (Fig. 4a), 21–40 (Fig. 4b), and 41 and more (Fig. 4c) residues. Data for cysteine, methionine, and tryptophan are omitted since we found very few simple sequences containing these rare residues. The ratio R is a measure of how common simple sequences are, above the Poisson distribution predictions, per protein in each eukaryote proteome. This ratio allows for easy comparison of the organisms. A higher R indicates that simple sequences of a given length in an organism are more common in comparison to the other organisms even though the actual number found might be the same or even lower. A negative value of R indicates that those simple sequences are found less often than predicted from the Poisson distribution. Such sequences are presumably discriminated against for various reasons.
We should note that the actual number of leucine-, isoleucine-, and valine-enriched simple sequences found can be quite large. For example, in DM we find 1841, 96, and 223 simple sequences of 10 residues in length enriched in each of these residues, respectively. Due to the relative abundance of these residues, however, the Poisson distribution predictions are also large (1445, 95, and 218, respectively), leading to small or negative values of the difference between the observed and expected counts and hence of R.
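As a worked illustration of the numbers quoted above, the excess of observed over expected counts for 10-residue simple sequences in DM can be computed directly. The variable names are hypothetical and the counts are those given in the text. R itself is not computed here because it additionally requires the number of proteins in the DM proteome.

```python
# Observed and Poisson-expected counts of 10-residue simple sequences in DM,
# as quoted in the text for leucine, isoleucine, and valine.
observed = {"Leu": 1841, "Ile": 96, "Val": 223}
expected = {"Leu": 1445, "Ile": 95, "Val": 218}

# Excess over expectation for each residue type.
excess = {res: observed[res] - expected[res] for res in observed}

# Fraction of the observed sequences that the Poisson model already accounts for.
explained = {res: expected[res] / observed[res] for res in observed}

for res in observed:
    print(res, excess[res], f"{100 * explained[res]:.0f}% expected by chance")
```

For isoleucine and valine nearly all of the observed sequences are accounted for by the random model, so the excess is only a handful of sequences despite the large raw counts.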
It is notable that we find positive values of R for phenylalanine and tyrosine at short, moderate, and even long lengths (Fig. 4). The tyrosine-rich sequences are particularly surprising given that this is one of the rarer residues. One might expect that sequences enriched in such large hydrophobic residues might be disfavored, and yet this does not appear to be the case. It is not clear why such sequences would be tolerated.
Careful inspection of Fig. 4 reveals that sequences highly enriched in serine, glutamate, lysine, and alanine appear to be favored by all four of the eukaryotes examined at short lengths (Fig. 4a). At moderate lengths alanine-rich sequences become less common (Fig. 4b), while at long lengths glycine-rich sequences seem to be favored. Similar distributions of runs of sequence containing these residues were observed by Green and Wang (6) and Katti et al. (5), although these authors did not normalize their data for residue occurrence nor for what might be expected were sequences random in nature. It is not entirely apparent why sequences highly enriched in serine are tolerated, or even required, by eukaryotes, although there are certainly examples of important protein domains enriched in this residue. One such example is the C-terminal domain of RNA polymerase II, which is functionally essential and consists of the heptad YSPTSPS repeated between 26 and 52 times in various organisms (29). Interestingly, this serine-enriched region (43% serine) is known to interact with proline-rich regions (12) as well as a family of serine/arginine-rich proteins (30). Wootton and Drummond (14) have suggested that sequences enriched in serine may act as flexible linkers between protein domains in much the same way as postulated for glycine-rich sequences.
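The 43% serine figure quoted for the RNA polymerase II C-terminal domain follows directly from the heptad composition, as the short sketch below shows. The helper name is hypothetical, and the 26-repeat construct is an idealized tandem array rather than a real protein sequence.

```python
def residue_fraction(seq, residue):
    """Fraction of positions in seq occupied by the given residue."""
    return seq.count(residue) / len(seq)


heptad = "YSPTSPS"

# Idealized domain built from 26 perfect repeats (real domains deviate from this).
idealized_ctd = heptad * 26

serine_fraction = residue_fraction(idealized_ctd, "S")
print(f"{100 * serine_fraction:.0f}% serine")
```

Because every repeat contributes three serines in seven residues, the fraction is 3/7 regardless of the repeat number, which is below the 50% composition threshold used to define simple sequences in this survey.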
Sequences enriched in charged residues, such as the lysine- and glutamate-rich sequences seen to be favored by the eukaryotes (Fig. 4), have been associated with DNA and RNA processing, chromatin structure, ion binding, and protein-protein interactions (13). The involvement of such simple sequences in a wide variety of functional roles might therefore explain their relative abundance. Alanine is known to be the most energetically favorable residue in α-helices (31, 32). One might therefore expect that sequences that are 50% or greater alanine in composition will have a tendency to be α-helical, although this will of course be modulated by the nature of the other residues in the sequence as well as by the tertiary structure of the proteins they are part of. The preference for short alanine-rich sequences that we have observed might then be related to secondary structure requirements. The long glycine-rich sequences found are probably tolerated for the opposite reason; that is, these most likely represent flexible linkers between protein domains.
One of the more surprising observations from Fig. 4 is that of simple sequences highly enriched in histidine. Although there are not many of these at all lengths, the number we find above that expected is significant. Some of these are quite long. For example, the four longest histidine-rich sequences in DM are 46, 51, 54, and 56 residues long. In CE the four longest are 50, 51, 84, and 251 residues long, although the longest of these is in a protein annotated as being hypothetical and could in fact be an indication that this is not an expressed protein. Histidine is one of the rarest residues, comprising just 2.2–2.7% of all residues in the four proteomes. By comparison, methionine has a similar level of occurrence, and yet we find almost no simple sequences enriched in this residue above the Poisson distribution predictions. Similarly, we find very few tryptophan- and cysteine-enriched sequences. One might postulate that histidine-rich sequences have some kind of ion binding function, although this has not been demonstrated.
Distribution Differences among the Eukaryotes Examined
It is immediately apparent from Fig. 4 that DM has a markedly different distribution of simple sequences when compared with the other three eukaryotes. The data for DM demonstrate a preference for simple sequences of all lengths enriched in alanine, glutamine, glycine, and serine. At short to moderate lengths, 10–40 residues (Fig. 4, a and b), DM also shows some preference for asparagine-, proline-, threonine-, and perhaps most surprisingly histidine-enriched sequences. To a lesser extent, there may also be a preference for aspartate- and arginine-rich sequences. These observed preferences are in large part responsible for the "unusually" high SSTot/ProtSS ratio observed for DM (Fig. 1c). The large numbers of glutamine- and to some extent asparagine-rich sequences in DM were also observed by Michelitsch and Weissman (9), who suggested that many of these may act as protein-protein interaction domains. It is not clear why DM would tolerate, and perhaps even require, large numbers of alanine-, glycine-, and serine-rich sequences.
Although DM is clearly different from the other three eukaryotes, it would be a mistake to assume that there are no significant differences between the distributions observed for the other organisms. SC has preferences for asparagine- and aspartate-enriched sequences at all lengths along with a striking preference for moderate length to long serine-rich sequences. Furthermore, SC disfavors leucine- and isoleucine-rich sequences more than do the other eukaryotes and is somewhat less tolerant of arginine-, glycine-, and proline-rich sequences. The reasons behind each of these preferences are not always clear. For example, the reasons for the large preference for moderate length to long serine-rich sequences are unknown. Wootton and Drummond (14) have suggested that sequences rich in serine form flexible linkers between protein domains. If this is true, then the preference for serine-rich sequences in SC may be linked to the observed lower tolerance for glycine-rich sequences (Fig. 4). SC may have evolved to use serine-rich sequences as linkers instead of the glycine-rich sequences that the other eukaryotes seem to prefer. Another potential role for serine-rich regions is discussed below. Michelitsch and Weissman (9) have previously observed large numbers of asparagine-rich sequences in SC as well as in other eukaryotes. These authors postulate that such regions act as modulators of protein-protein interactions. Why SC would require a larger fraction of asparagine-rich sequences for such interactions as compared with the other eukaryotes is unclear. The lower tolerance for proline-rich sequences is probably due to the unicellular nature of SC. It has no need for the proline-rich extracellular structural proteins that the multicellular eukaryotes require. The reasons for the discrimination against leucine- and isoleucine-enriched sequences and the lower tolerance for arginine-rich sequences remain enigmatic.
The worm CE also possesses its own unique distribution of protein simple sequences. From Fig. 4 it can be seen that CE has some preference for short phenylalanine-rich sequences and for long glutamine- and serine-enriched sequences. CE also appears to be less tolerant of asparagine-rich sequences than are SC, DM, and perhaps AT and is less tolerant than are DM and AT of long proline-rich sequences. AT has little tolerance for threonine-rich sequences and a lowered tolerance for glutamine-rich sequences (Fig. 4). AT does not appear to have a heightened preference for any particular simple sequences at any length scale compared with the other eukaryotes.
It is clear that each of the four eukaryotes examined possesses its own unique distribution of simple sequences (Fig. 4). Based upon the analysis of homopolymeric runs performed by Karlin et al. (10) and an analysis by Kreil and Kreil (33) of asparagine-rich sequences, it seems clear that the human proteome will also display a unique simple sequence distribution. Some of the differences observed for the four eukaryotes examined arise for understandable reasons. For example, SC would not be expected to possess as many proline-rich sequences as would the other eukaryotes examined since SC does not have the same requirements for proline-rich structural proteins. However, as noted repeatedly above, the reasons for many of the various simple sequence preferences observed are not known. Some differences might well arise as a result of an organism using particular residues for the same purposes as other organisms use a different set of residues. For example, as suggested, SC might utilize serine-rich regions as flexible linkers where CE, DM, and AT use glycine-rich sequences. A detailed analysis of the conservation of simple sequence regions will aid in resolving such issues. Huntley and Golding (19) have noted that simple sequences are the most commonly shared feature between proteins but that the identity of the residues within the sequences can vary between organisms.
Functional Analysis of Protein Simple Sequence Occurrence
Our survey of the eukaryote (and prokaryote) proteomes has resulted in the identification of an enormous number of protein simple sequences, far more than would be expected were sequences random in nature. We have postulated that many of these sequences play some kind of functional role. This postulate is supported by a limited amount of experimental and bioinformatic evidence (3, 5, 9–12, 29, 30, 34). To further examine this issue we have examined the distribution of simple sequences in proteins of known function. Specifically, we have collected the sequences of all proteins from each of the four eukaryotes that are annotated in the SWISS-PROT database (35, 36) as being involved in a protein class (e.g. membrane proteins) or set of processes (e.g. transcription). The occurrence and distributions of simple sequences in these proteins were then analyzed using the approaches used on intact proteomes above. The results are shown in Table II. Note that the data shown are highly dependent upon the completeness and accuracy of the annotations in SWISS-PROT as well as how well studied the particular classes of proteins are in each organism. As a result of these limitations we have found comparatively few protein sequences in most cases. In addition, some proteins may appear in more than one classification in Table II. Thus, it is difficult to make direct comparisons between classes as to the number of simple sequences found as well as between organisms. However, it is feasible to consider the most common types of simple sequence found (Table II).
Considering now each class of protein in Table II, it can be seen that the most common simple sequences in the limited set of cell cycle proteins identified are serine-rich. Very few cell cycle proteins were found except in the case of SC, which is perhaps the model system for studying these processes. The 102 SC cell cycle proteins found possess a total of 177 simple sequences, over one-third of which (64) are serine-rich. This is a clear enrichment of such sequences as compared with the overall distribution of simple sequences in SC (Fig. 4). Potential roles for these sequences are as discussed above.
We found relatively few proteins with the keyword "metabolism" in their annotations (Table II). With the preceding finding as a caveat, it is notable that there are fewer simple sequences per protein in metabolism-related proteins (significantly less than one per protein) than the average over intact proteomes (slightly more than one per protein; Table I). This would suggest that simple sequences are either generally not required in metabolism-related proteins or that they are discriminated against in comparison to other protein classes. However, as already noted, few proteins were identified in this class, and we could simply be observing the vagaries of poor statistics.
Using "signal" as a keyword, we have identified a significant number of proteins in all four eukaryotes (Table II). These proteins possess a significant number of simple sequences, the most common of which are enriched in serine, threonine, proline, and perhaps surprisingly leucine. Given that signal transduction processes involve significant numbers of phosphorylation and dephosphorylation events, it is perhaps not so remarkable that serine- and threonine-rich sequences are common in signaling proteins. There are also a number of small protein interaction domains common in signaling processes (e.g. Src homology 3 domains) that bind to proline-rich sequences (11), leading to an enrichment in such sequences in this class. Thus, the occurrence of serine-, threonine-, and proline-rich sequences in this class of proteins would appear to be biologically significant. The occurrence of a significant number of leucine-rich sequences is at first puzzling, particularly given that such sequences are found at levels lower than would be predicted using our Poisson-distribution model (Fig. 4). However, it is possible that a reasonable number of the proteins in this class possess membrane-spanning segments that, as will be discussed below, can be leucine-rich.
A significant number of transcription-related proteins were also identified (Table II). Remarkably, the transcription-related proteins in SC and DM possess enormous numbers of simple sequences (677 in 274 proteins and 838 in 177 proteins, respectively). Although the same level of enrichment is not seen in CE and AT, it is tempting to postulate that large numbers of simple sequences indicate important functional roles in transcription processes. Indeed, such proteins are known to often possess glutamine-rich sequences (3), so it is not surprising that such sequences are common in DM transcription-related proteins. We also find large numbers of serine-rich sequences (Table II). Perhaps the best known example of a serine-rich region acting as a phosphorylation switch is in RNA polymerase II (29). Although not enriched in serine enough to be found in our surveys, this region is known to interact with a variety of transcription factors when not phosphorylated. These interactions, and consequently transcription, are interrupted when serines become phosphorylated. It is possible that there are similar serine-rich switch/interaction regions in other transcription-related proteins.
A reasonable number of transport-related proteins were also found (Table II). These possess approximately the same numbers of simple sequences as would be expected from the overall average values for the four eukaryotes (Table I). Leucine-, alanine-, and serine-rich sequences are the most common. Many transport-related proteins will be associated with membranes, given that the transport of molecules across membranes is a common and vital process. The large numbers of leucine-rich and perhaps alanine-rich regions are then most likely indicative of membrane-spanning regions, as suggested by Schwartz et al. (28).
Finally, we have identified numerous membrane-associated proteins, many of which contain simple sequences (Table II). Presumably for the reasons noted above, large numbers of leucine-rich sequences are found in this class of proteins. In fact, many of these leucine-rich regions are annotated as being membrane-spanning in the SWISS-PROT files for these proteins. Why serine-rich regions would be so abundant is not clear. Some of these are probably found in signaling proteins associated with the membrane (see above), while others may be acting as flexible linkers separating soluble domains from integral membrane domains. Wootton and Drummond (14) have hypothesized that serine-rich regions act as flexible linkers. Notably, glycine-rich regions, also thought to act as linkers, are common in DM membrane-related proteins. Perhaps serine-rich regions are substituted for glycine-rich regions in the other organisms (Table II).
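The keyword-based grouping used for these functional classes can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used for Table II: the record layout (a `description` string and an `enriched` list of one-letter residue codes, one per simple sequence), the function name `tally_class`, and the toy records are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tally_class(records: List[Dict], keyword: str) -> Tuple[int, int, Counter]:
    """Tally simple sequences in proteins whose description contains keyword.

    Returns (number of matching proteins, total number of simple sequences
    in those proteins, count of simple sequences per enriched residue).
    """
    keyword = keyword.lower()
    n_proteins = 0
    by_residue: Counter = Counter()
    for record in records:
        # Case-insensitive substring match on the annotation text.
        if keyword in record["description"].lower():
            n_proteins += 1
            by_residue.update(record["enriched"])
    n_sequences = sum(by_residue.values())
    return n_proteins, n_sequences, by_residue


# Toy records, invented purely for illustration.
toy_records = [
    {"description": "Signal transduction protein (toy)", "enriched": ["S", "P"]},
    {"description": "Putative transcription factor (toy)", "enriched": ["Q", "Q", "S"]},
    {"description": "signal peptidase (toy)", "enriched": []},
]

n_proteins, n_sequences, by_residue = tally_class(toy_records, "signal")
print(n_proteins, n_sequences, by_residue.most_common())
```

A plain substring match of this kind also selects entries such as the toy "signal peptidase" record, which is one reason keyword-derived classes are best read as approximate groupings.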
Simple Sequence Structure
It would of course be useful to know the types of structures adopted by protein simple sequences. Unfortunately, little is known about the structural properties of such sequences. Saqi (17) and more recently Huntley and Golding (38) have looked for all occurrences of simple sequences in protein structures in the Protein Data Bank (39). Very few were found. Huntley and Golding (38) point out that simple sequences are under-represented in the Protein Data Bank and hypothesize that this indicates that such regions are intrinsically disordered. Intrinsically disordered regions of proteins are a barrier to structure determination and are consequently routinely deleted from proteins by structural biologists. That simple sequences, particularly relatively long sequences, are disordered is supported by the work of Dunker and co-workers (15, 16, 40), who use low complexity sequences as identifiers of intrinsically disordered proteins. There are indications, however, that not all protein simple sequences are unstructured. For example, leucine-rich membrane-spanning sequences will be highly structured, most likely α-helices, in the membrane. Proline-rich regions are believed, and in many cases have been shown, to adopt the left-handed polyproline II helical conformation (11). It would be a mistake to assume that all simple sequences are unstructured. This is an area that clearly requires further investigation.
CONCLUSIONS |
---|
Among the eukaryotes we find that DM possesses more simple sequences per protein than any of the other three eukaryotes (Table I). This is true for all simple sequence lengths (Fig. 3). By comparison, SC, CE, and AT possess a similar number of simple sequences per protein at most lengths with SC perhaps showing some preference for long simple sequences (Fig. 3). In the distributions for the intact proteomes, we find that simple sequences enriched in certain residues, for example alanine, glutamine, glutamate, glycine, and serine, appear to be favored, whereas other residues, specifically leucine, isoleucine, and valine, are discriminated against. These preferences do not correlate with residue occurrence. Some of these observed preferences can be rationalized in terms of structure and/or function, while others remain enigmatic.
The most notable finding of these surveys is that each of the eukaryotes possesses its own unique distribution of protein simple sequences. We find that each organism apparently has preferences for simple sequences enriched in certain residues while at times disfavoring simple sequences enriched in other residues. It is not clear why these eukaryotes have evolved to have differing simple sequence distributions. However, given the sheer number of such sequences found plus the known functional importance of those simple sequences that have been studied in detail, it is tempting to postulate that not only have eukaryotes evolved to tolerate large numbers of simple sequences but also that they require many of these. A basic analysis of the simple sequences found in different classes of proteins indicates that some classes may favor simple sequences enriched in certain residues (Table II).
The data presented here raise questions that can only be answered by further study and analysis. For example, is there an association between type of simple sequence and function? The data in Table II are suggestive but by no means conclusive. Do different organisms use different types of simple sequence for the same function? The fact that each organism possesses a unique distribution implies that this may be the case, but we have no direct evidence. What are the structural properties of such sequences? Little structural data is currently available, although it is clear that it would be incorrect to assume that all simple sequences will be disordered. Answers to questions such as these will shed light on the abundance and distributions of simple sequences highlighted here.
ACKNOWLEDGMENTS |
---|
FOOTNOTES |
---|
Published, MCP Papers in Press, December 9, 2002, DOI 10.1074/mcp.M200032-MCP200
1 The abbreviations used are: AT, A. thaliana; AF, A. fulgidus; AgT, A. tumefaciens C58; AP, A. pernix K1; BH, B. halodurans; BM, B. melitensis 16M chr1; BS, B. subtilis; CA, C. acetobutylicum ATCC824; CE, C. elegans; DM, D. melanogaster; DR, D. radiodurans chr1; EC, E. coli K-12; HI, H. influenzae; HP, H. pylori 26695; HS, Halobacterium sp. NRC-1; MG, M. genitalium; MJ, M. jannaschii; MP, M. pneumoniae; MT, M. thermoautotrophicum; Nos, Nostoc sp. PCC7120; PA, P. abyssi; PAe, P. aerophilum; PH, P. horikoshii; SC, S. cerevisiae; SS, Synechocystis sp. PCC6803; SSol, S. solfataricus; ST, S. tokodaii; TA, T. acidophilum; TV, T. volcanium; VC, V. cholerae chr1.
* This work was supported in part by National Science Foundation Grant MCB-00110720 (to T. P. C.). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Supported in part by a postdoctoral research grant in computational genomics from the Pharmacia Corp.
To whom correspondence should be addressed: Center for Structural Biology, Dept. of Molecular and Cellular Biochemistry, University of Kentucky, 800 Rose St., Lexington, KY 40536-0298. Tel.: 859-323-6037; Fax: 859-323-1037; Email: tpcrea0{at}uky.edu
REFERENCES |
---|