Streptococcus pyogenes sclB encodes a putative hypervariable surface protein with a collagen-like repetitive structure

Adrian M. Whatmore

Infectious Disease Research Group, Department of Biological Sciences, University of Warwick, Coventry CV4 7AL, UK

Tel: +44 2476 528359. Fax: +44 2476 523701. e-mail: a.m.whatmore{at}warwick.ac.uk


   ABSTRACT
Streptococcus pyogenes is the causative agent of a wide range of human diseases of varying severity. During a study scanning the genome sequence of a serotype M1 invasive isolate, SF370, for novel surface proteins, an ORF, designated sclB, was identified. The putative protein encoded by sclB contains both a signal peptide and classic Gram-positive wall-associated sequences. Comparison of the sequence of this ORF with those from a number of unrelated isolates demonstrated that sclB encodes a putative surface protein with a variable N-terminal sequence followed by a variable-length tract of collagen-like GXYn repeats. A further feature of sclB is the presence of CAAAA repeat tracts immediately downstream of the putative start codon. The number of these pentameric repeats varies from 4 to 15 between strains, and variation in repeat number results in the predicted SclB protein being either in or out of frame relative to the start codon. These observations suggest that expression of this protein may be regulated at the translational level as a result of gain or loss of CAAAA repeats. While the function of SclB remains to be elucidated, an sclB-specific transcript was detected by RT-PCR during in vitro culture. Finally, it is shown that a second gene, sclA, potentially encoding a protein with a similar extensive collagen-like structure and variable N-terminal sequence, is present in all isolates of S. pyogenes tested to date. Thus S. pyogenes harbours a novel family of structurally related and surface-exposed proteins of potential importance in the pathogenic process.


Table 1. Characteristics of strains examined in this study

 
Keywords: bacterial surface protein, genetic variation, short sequence repeats, tandem repeats

Abbreviations: SSR, short sequence repeat

The GenBank accession numbers for the sequences determined in this work are given in the text and Table 1.


   INTRODUCTION
The Group A streptococcus (Streptococcus pyogenes) remains an important human pathogen, associated with a range of superficial skin and throat infections as well as a variety of more serious invasive infections and autoimmune sequelae such as acute rheumatic fever and post-streptococcal glomerulonephritis (Cunningham, 2000). As is the case in other Gram-positive bacteria, the cell wall of S. pyogenes is associated with an array of proteins which can share common structural features (Navarre & Schneewind, 1999). Proteins destined for transport across the cytoplasmic membrane often contain a signal (leader) peptide composed of a core of hydrophobic residues and a charged N-terminal end (von Heijne, 1990). Signal peptides of Gram-positive surface proteins, which are located at the N-terminus of the protein, are proteolytically removed by signal peptidases on translocation across the cytoplasmic membrane (Dev & Ray, 1990). Many Gram-positive surface proteins also share striking C-terminal sequence similarities corresponding to cell wall sorting signals. These consist of a highly conserved LPXTG sequence motif followed by a hydrophobic domain and a charged tail of largely positively charged residues (Fischetti et al., 1990). The hydrophobic domain and charged tail are believed to be important in preventing the immediate release of the protein from the secretory pathway into the extracellular milieu. This retention is thought to allow recognition of the LPXTG motif and proteolytic cleavage between the threonine and glycine residues, permitting cross-linking of the surface protein with peptidoglycan precursors (Navarre & Schneewind, 1999).
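The tripartite sorting signal described above (LPXTG motif, hydrophobic domain, positively charged tail) can be illustrated with a minimal Python sketch. This is not the screening procedure used in this study, which relied on BLAST searches; the residue set treated as hydrophobic, the window size, the tail length and the maximum downstream length are all illustrative assumptions, and the test sequence is a hypothetical toy peptide rather than a real protein.

```python
import re

# Illustrative, permissive residue classes (assumptions, not published thresholds).
HYDROPHOBIC = set("AVLIMFWPGT")
POSITIVE = set("KR")
NEGATIVE = set("DE")


def has_hydrophobic_window(segment, window=12, min_fraction=0.8):
    """True if any window of `window` residues is mostly hydrophobic."""
    for start in range(len(segment) - window + 1):
        chunk = segment[start:start + window]
        if sum(aa in HYDROPHOBIC for aa in chunk) / window >= min_fraction:
            return True
    return False


def net_charge(segment):
    """Crude net charge: count of K/R minus count of D/E."""
    return (sum(aa in POSITIVE for aa in segment)
            - sum(aa in NEGATIVE for aa in segment))


def find_sorting_signal(protein, max_downstream=45, tail_len=6):
    """Return the index of an LPXTG motif that is followed by a hydrophobic
    stretch and a positively charged C-terminal tail, or None if absent."""
    protein = protein.upper()
    for match in re.finditer(r"LP.TG", protein):
        downstream = protein[match.end():]
        # The motif must sit close to, but not at, the C-terminus.
        if len(downstream) > max_downstream or len(downstream) <= tail_len:
            continue
        body, tail = downstream[:-tail_len], downstream[-tail_len:]
        if has_hydrophobic_window(body) and net_charge(tail) > 0:
            return match.start()
    return None


# Hypothetical toy peptide: spacer, LPSTG, hydrophobic stretch, charged tail.
toy = "MKT" + "DEQS" * 5 + "LPSTG" + "EAAVLLAVIGAAALVAG" + "KRKKEN"
print(find_sorting_signal(toy))
```

A heuristic of this kind only flags candidates; any hit would still need to be checked against alignments with known wall-associated regions.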

A number of proteins harbouring cell wall sorting signals have now been characterized from S. pyogenes. These include C5a peptidase (Chen & Cleary, 1990), the T protein (Schneewind et al., 1990), two distinct fibronectin-binding proteins, SfbI/protein F (Sela et al., 1993; Talay et al., 1994) and SfbII/OF (Kreikemeyer et al., 1995; Rakonjac et al., 1995), and a large number of proteins encoded by members of the emm gene family (reviewed by Navarre & Schneewind, 1999). As might be expected, many of the surface proteins characterized to date have been postulated to play important roles in virulence by modulating the host immune response and/or by involvement in adhesion to host surfaces (Jenkinson & Lamont, 1997; Cunningham, 2000). Many, notably members of the emm gene family, possess multiple binding activities for an array of host proteins that can include albumin, fibrinogen, IgG, IgA, kininogen, fibronectin, factor H, FHL-1, plasminogen and C4BP (Navarre & Schneewind, 1999).

A further common feature of many surface proteins of Gram-positive bacteria is the presence of tandem repeat domains that can vary in size from a few to several hundred amino acids. Strain-to-strain variations in the number of repeats are often apparent, implying that gain or loss of repeats occurs as a result of intragenic recombination events or slipped-strand mispairing. Perhaps the best examples of such repetitive structures in S. pyogenes are members of the emm gene family, which can contain up to three distinct sets of tandem repeats (Kehoe, 1994). Analysis of successive isolates from patients demonstrates that variation in the anti-phagocytic M6 protein occurs as a result of alteration in the number of tandem repeats (Hollingshead et al., 1987) and these changes can alter antigenic determinants, resulting in resistance to phagocytosis (Jones et al., 1988). Variable repetitive structure is also apparent in a number of surface proteins of other streptococci, including the S. pneumoniae surface protein PspA (Yother & Briles, 1992) and the alpha C protein encoding gene of S. agalactiae (Madoff et al., 1996). A number of possible benefits of variation in repeat number, which may be of relevance in pathogenesis, can be envisaged. These include antigenic variation and subsequent immune escape, alteration in substrate binding properties and the simple physical displacement of parts of proteins further away from the immediate extracellular environment where they may be masked by other surface components. In several Gram-negative bacteria so-called short sequence repeats (SSRs), consisting of 2–20 nucleotide tandem repeats, can be involved in gene regulation either by altering the spacing between promoter domains or by affecting the integrity of ORFs (van Belkum et al., 1998; Henderson et al., 1999). However, to date there is no strong evidence for the occurrence of similar repeats implicated in gene regulation in Gram-positive bacteria.

Sequencing of the complete genome of a serotype M1 invasive isolate of S. pyogenes (Roe et al., 1999) has facilitated the identification of previously unrecognized surface proteins. Such proteins are clearly of potential interest as many are likely to be involved in interactions with the host and may therefore play important roles in the pathogenic process. Elucidation of the nature of these proteins will facilitate a fuller understanding of the virulence of S. pyogenes and may identify novel therapeutic or vaccine targets. In this study, an ORF (sclB) identified during a screen of the genome sequence, potentially encoding a surface protein with a highly repetitive collagen-like structure, is described. In addition, the ORF is shown to be highly variable and to contain SSRs immediately downstream of the putative start codon that may regulate the expression of the corresponding SclB protein product.


   METHODS
Chromosomal DNA preparation.
Strains of S. pyogenes used in this study are listed in Table 1. Chromosomal DNA was isolated from S. pyogenes following overnight incubation at 37 °C on blood agar plates consisting of Todd–Hewitt broth supplemented with 1·5% (w/v) Bacto-agar and 5% (v/v) sheep blood (Gibco-BRL). Overnight growth from one or two spread plates was resuspended in 500 µl sterile PBS. To this suspension 10 µl of an enzyme lysis mix was added (30 µg hyaluronidase ml⁻¹, 1 mg mutanolysin ml⁻¹ and 20 mg lysozyme ml⁻¹ in PBS) and the mixture was incubated for 4 h at 37 °C. Following addition of 50 µl 10% (w/v) SDS the suspension was vortexed briefly and incubated at room temperature for 5–10 min. The resulting lysate was extracted with phenol/chloroform twice followed by chloroform extraction and DNA was precipitated with 2·5 vols ethanol and 10% (v/v) 3 M sodium acetate pH 5·2. Following washing with 70% (v/v) ethanol, DNA was dried, suspended in sterile distilled water and stored at -20 °C until required.

PCR.
Primer pairs used for initial amplification of RST00068 and RST00596 products were 210up (5'-CCAAGCCTAATCGCTTAGTCT-3') with 210dn (5'-TACTTTCCATCAGTTAGGTAGCA-3') and 174up (5'-TTTGCTTATCAGTAAGGTCTCTC-3') with 174dn (5'-CGTTCAGGAGTTACCAAAAGAAT-3') respectively. Primer pair sclup (5'-ATGTTGACATCAAAGCACCA-3') and scldn (5'-GTTGTTTTCTTTGCGTTTTGT-3') was used for amplification of sclA. PCR was performed under standard conditions with 32 cycles of 95 °C for 1 min, x °C for 1 min and 72 °C for 1 min where x °C represents an annealing temperature appropriate for the particular primer set.
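The text does not state how the annealing temperature x was chosen for each primer set. As a rough illustration only, the Python sketch below estimates primer melting temperatures with the Wallace rule, Tm = 2(A+T) + 4(G+C), and subtracts an offset of 5 °C from the lower Tm of a pair. Both the use of this rule (which is approximate for primers of this length) and the 5 °C offset are assumptions; the function names are hypothetical. Degenerate primers such as SigP are deliberately rejected.

```python
def wallace_tm(primer):
    """Approximate melting temperature (deg C) by the Wallace rule."""
    primer = primer.upper()
    if set(primer) - set("ACGT"):
        raise ValueError("only unambiguous A/C/G/T primers are supported")
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def suggest_annealing(forward, reverse, offset=5):
    """Rule-of-thumb annealing temperature: lower Tm of the pair minus offset.
    The 5 deg C default is an assumption, not a value from the study."""
    return min(wallace_tm(forward), wallace_tm(reverse)) - offset


# Primer sequences 210up and 210dn as listed in the PCR section.
primer_210up = "CCAAGCCTAATCGCTTAGTCT"
primer_210dn = "TACTTTCCATCAGTTAGGTAGCA"
print(wallace_tm(primer_210up), wallace_tm(primer_210dn))
print(suggest_annealing(primer_210up, primer_210dn))
```

In practice such an estimate is only a starting point and would normally be refined empirically.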

Sequencing.
Full-length RST00068 ORFs were sequenced from PCR products amplified using primer 210up, described above, in conjunction with primer Rst69 (5'-TTTAACTCATCTGATGTTTCCAT-3') located approximately 200 bp downstream of the RST00068 sequence in the genome sequence. Sequencing was performed using a range of internal primers and dye-labelled dideoxy terminator cycle sequencing kits (Beckman Coulter) with the CEQ2000 automated DNA analysis system.

DNA sequence analysis.
Preliminary analysis of sequences was performed using the DNASTAR package and multiple sequence alignments were performed using the CLUSTAL package (http://www.ebi.ac.uk).

Predicted structure of SclB.
The structure of the putative protein encoded by sclB was predicted using the PROTEAN program within the DNASTAR package. This package uses the Chou–Fasman (Chou & Fasman, 1978) and Garnier–Robson (Garnier et al., 1978) algorithms to predict alpha, beta and turn regions, the Garnier–Robson algorithm for predicting coil regions and the Kyte–Doolittle algorithm (Kyte & Doolittle, 1982) for calculating hydrophilicity. Protein secondary structure was confirmed using the PSIPRED program at http://insulin.brunel.ac.uk/psipred (Jones, 1999). The presence and cleavage site of the signal peptide were predicted using the SignalP program at http://www.cbs.dtu.dk/services/SignalP (Nielsen et al., 1997).

RT-PCR.
Total cell RNA was prepared from S. pyogenes cultures grown to mid-exponential phase in Todd–Hewitt broth at 37 °C. A 10 ml culture was harvested by centrifugation and the pellet was resuspended in 100 µl nuclease-free water. RNA was isolated using the Hybaid Ribolyser Blue Kit and Ribolyser according to the manufacturer’s instructions and was resuspended in 50 µl nuclease-free water. The RNA was then treated with 5 U RQ1 RNase-free DNase (Promega) for 30 min at 37 °C. Following inactivation of the DNase at 75 °C for 5 min, RNA was stored at -70 °C in the presence of 40 U RNasin (Promega) until required. RT-PCR was performed using the Access RT-PCR system (Promega) according to the manufacturer’s instructions. First-strand cDNA synthesis was performed at 48 °C for 45 min while second-strand cDNA synthesis and PCR amplification were performed using 36 amplification cycles at an annealing temperature of 53 °C. Positive control primers gtrup and gtrdn, designed to amplify an internal 598 bp fragment of the housekeeping gene glnQ (glutamine-transport ATP-binding protein), were used. Primer pairs 655Exp (5'-GGATCCGATGGTGAAGATGCCCAA-3'), corresponding to the extreme N-terminus of the mature protein, with 210dn (described above) and SigP (5'-TTWGGAGGTGYAAGYGCRGTT-3'), corresponding to the signal peptide, with Rst69 (described above) were used to detect an sclB-specific transcript.

emm sequence typing.
This was performed as described by Beall et al. (1996) using primers 1 and 2, which were previously described by Whatmore & Kehoe (1994). The identity of emm sequences was confirmed by interrogation of the Centers for Disease Control emm sequence database at http://www.cdc.gov/ncidod/biotech/strepinfo.html.


   RESULTS
Identification of ORFs potentially encoding collagen-like proteins
To identify genes encoding putative surface proteins of S. pyogenes, the genome sequence of S. pyogenes SF370, a serotype M1 isolate (http://www.genome.ou.edu/strep.html), was screened for the presence of ORFs potentially encoding proteins with the characteristic wall-associated region seen in many Gram-positive surface proteins (Navarre & Schneewind, 1999). BLAST searches were performed using the nucleotide sequence encoding the wall-associated region of the emm5 gene from strain NCTC 8193 (accession no. U02480, bases 1267–1371) described previously by Whatmore & Kehoe (1994). During the course of this investigation two ORFs were identified that potentially encoded proteins with high similarity to collagen on BLAST search as a result of extensive stretches of sequence encoding GXYn repeats (Beck & Brodsky, 1998). These sequences are identified in the WIT database (http://wit.mcs.anl.gov/WIT2/CGI/prot.cgi), which provides annotation and metabolic reconstructions of genome sequences, as RST00068 and RST00596. The unusual structure of these putative proteins and their potential surface location raised the possibility of direct involvement in virulence (e.g. as adhesins or in binding to matrix proteins) or indirect involvement in the autoimmune sequelae of S. pyogenes infection. Thus, PCR primers were designed to investigate the presence and distribution of these ORFs in a wider sample of S. pyogenes.

In the case of RST00068 the reverse primer was designed immediately downstream of the putative stop codon that could be clearly defined on the basis of the wall-associated sequence seen in many Gram-positive bacteria. However, the start of the gene was more difficult to predict: the WIT database predicted a TTG start codon, but there were no convincing RBS or promoter elements associated with this predicted start codon (see Fig. 1). Because of this uncertainty, the upstream primer (210up) was designed within a conserved 3' region of a gene upstream of RST00068 which had significant similarity to GTP-binding proteins from other Gram-positive bacteria, such as Bacillus and Mycobacterium. Initially, 17 randomly selected strains were tested by PCR and a PCR product was amplified from 15 of these; interestingly, these products were extremely variable in size. In both negative reactions the integrity of the DNA was confirmed by the amplification of the emm gene (Whatmore et al., 1994) from the same preparations. The presence of the sequence was then examined in more extensive studies using strains associated with acute rheumatic fever (6), acute post-streptococcal glomerulonephritis (16), throat infection and carriage (19) and impetigo (9). Forty-five of these 50 isolates (90%) gave a PCR product and these reactions confirmed the extensive size diversity of this product, with bands ranging from some 600 bp to approximately 2 kb. There was no clear relationship between the presence of this gene and the ecological association of the isolates and again emm could be amplified from all isolates that were PCR negative for the RST00068 ORF.



Fig. 1. (a) Comparison of sequences upstream of ORF RST00068 with three distinct isolates showing the location of the predicted start codon and the associated RBS, -10 and -35 sequences. The TTG start codon predicted in the WIT database is underlined. The ] represents the end of the predicted signal peptide. (b) Comparison of sequences downstream of RST00068 showing the stop codon and putative transcription terminator. Isolate 655 lacks the stop codon found in all ten other isolates characterized but may utilize an additional stop codon located slightly farther downstream, but prior to the transcription terminator. Residues conserved across all sequences are shown by an asterisk.

 
Primers to amplify RST00596, the second ORF potentially encoding a protein with collagen-like structure, were designed with the 3' primer 174dn again just downstream of the putative stop codon and the upstream primer, 174up, approximately 850 bp upstream of this. In contrast to the primer set described above, these primers did not amplify a product in an initial screen of S. pyogenes isolates, not even when an M1 isolate was used, although it should be noted that the genome strain (SF370) was not examined in any of these screening studies.

Sequence of RST00068
To further characterize the nature of the genetic diversity apparent in ORF RST00068, sequencing of the complete ORF and flanking regions was undertaken from 11 isolates, while an incomplete 5' sequence was obtained from a further seven isolates. The isolates chosen were from diverse diseases and various geographical localities and represented diverse emm types as determined by sequencing of the 5' emm gene (Whatmore et al., 1994).

Identification of predicted start codon and control elements. As mentioned above, the rarely used TTG start codon predicted by WIT was not convincing. Fig. 1(a) shows a comparison of the sequences from three isolates, 650, 655 and 733, with the RST00068 sequence. The TTG is marked but no convincing control elements could be identified. On comparing sequences it is also clear that the TTG is located in the middle of a highly conserved region encoding a putative signal peptide as described below. Comparison of sequences at the 5' end of the ORF enabled the identification of much more likely elements. A more promising GTG start codon was identified farther upstream; GTG is used as an initiation codon at about 5% of the frequency of ATG. A purine-rich tract was located immediately upstream of this start codon where an RBS would be expected. Located some 55 bp upstream of this were putative -35 and -10 elements which exactly matched, in both sequence and spacing, the TTTACT/A and TATAAT consensus sequences determined from other S. pyogenes genes (unpublished data). Fig. 1(b) shows an alignment downstream of the predicted stop codon. An 11 bp inverted repeat followed by a run of T residues resembling a classical transcription terminator was located downstream of the putative coding sequence.

Identification of pentameric repeats within the ORF. A further interesting feature of this region was the presence of a tract of pentameric CAAAA repeats immediately after the putative start codon. Although only four of these repeats were seen in the SF370 genome sequence, the number of these repeats was found to vary from a minimum of 4 copies up to 15 copies (Table 1). It is clear that variation in repeat number resulted in the ORF being either in or out of frame with respect to the predicted start codon, as illustrated in Fig. 2. ORFs containing 5, 8, or 14 repeats were in-frame while other numbers of repeats resulted in the generation of stop codons almost immediately downstream of the CAAAA tract. The exceptions were two isolates, each containing nine repeats (733 and 740), that also appear to be in-frame as a result of minor sequence variation at the end of the CAAAA tract. Thus, these repeats may represent a mechanism to modulate expression of this ORF at the translational level.
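The frame arithmetic behind this observation can be made explicit: each CAAAA unit adds 5 nt, so a tract of n repeats shifts the downstream reading frame by 5n mod 3. The minimal Python sketch below counts the repeats in a sequence and compares the resulting frame with that of a five-repeat allele, which is reported above to be in-frame. The function names and the short test sequence are hypothetical, and the sketch considers tract length only; it cannot account for flanking sequence variation of the kind seen in isolates 733 and 740.

```python
import re


def count_caaaa_repeats(dna):
    """Number of units in the longest uninterrupted CAAAA tract."""
    tracts = re.findall(r"(?:CAAAA)+", dna.upper())
    return max((len(tract) // 5 for tract in tracts), default=0)


def frame_offset(n_repeats, unit_len=5):
    """Reading-frame offset (0, 1 or 2) contributed by the repeat tract."""
    return (n_repeats * unit_len) % 3


def same_frame_as(n_repeats, reference_repeats=5):
    """True if a tract of n_repeats leaves the ORF in the same frame as a
    reference allele (five repeats, reported in-frame in this study)."""
    return frame_offset(n_repeats) == frame_offset(reference_repeats)


# Hypothetical fragment: start codon, four CAAAA units, arbitrary flank.
fragment = "GTG" + "CAAAA" * 4 + "CGT"
print(count_caaaa_repeats(fragment))

# Repeat numbers within the observed range (4 to 15) predicted to be in-frame.
print([n for n in range(4, 16) if same_frame_as(n)])
```

On tract length alone, repeat numbers congruent to 2 modulo 3 (5, 8, 11 and 14 within the observed range) are predicted to be in-frame, which agrees with the in-frame alleles containing 5, 8 or 14 repeats.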



Fig. 2. Comparison of sequence of the CAAAA repeat region of RST00068 located within the signal-peptide-encoding sequence from all isolates examined in this study indicating whether the gene remains in-frame.

 
Predicted protein structure. An alignment of the predicted amino acid sequences of five of the eleven RST00068 alleles sequenced in full is shown in Fig. 3. These five sequences represent alleles that appear to be in-frame relative to the predicted start codon and were chosen to illustrate the major features of the predicted proteins. The 5' region of the ORF was found to be highly conserved and was predicted by SignalP to encode a classical Gram-positive signal (Nielsen et al., 1997). Depending on the particular sequence analysed, SignalP predicted the cleavage point of the mature protein to be either after the alanine residue, indicated by an arrow, or at the alanine four residues earlier. The second alanine appears to be the most likely, given comparison with the signal peptides of other S. pyogenes proteins and the continued conservation prior to this residue. This conservation contrasts strongly with the extensive sequence diversity seen immediately downstream. The putative N-terminal tip of this protein appears to be diverse, with limited sequence similarity apparent between many strains. On moving further downstream the sequence becomes gradually less diverse, with some similarity apparent between sequences. Approximately 60 residues into the protein an extensive region of collagen-like sequence begins. This sequence is characterized by a three-residue periodicity in which the first residue is always glycine, with a high proportion of proline, alanine, lysine, aspartate and glutamate at the other residues. The length of this GXYn tract varies substantially, from about 50 aa in the case of strain PT4854 (full sequence not shown) to about 240 aa in the case of P19. The more C-terminal regions of the collagen-like domain are composed of variable numbers of more defined 15 aa repeated domains with the consensus sequence GKDGQN/DGKDGLPGKD. The sequence downstream of the collagen-like domain is highly conserved between all isolates and encodes the classic wall-associated region of Gram-positive bacteria, including an LPXTG motif and a charged tail prior to the stop codon.
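The three-residue periodicity described above lends itself to a simple scan. The Python sketch below locates the longest run of consecutive triplets that begin with glycine in a protein sequence; it is offered as an illustration only and is not the method used in this study, where collagen similarity was detected by BLAST. The function name is hypothetical and the test sequence is a toy construct built from two copies of the 15 aa consensus (with N at the variable position) flanked by arbitrary residues.

```python
import re


def longest_gxy_tract(protein):
    """Return (start_index, length_in_residues) of the longest run of
    consecutive G-X-Y triplets, or (-1, 0) if none is found.

    A lookahead is used so that runs starting at every position are
    considered, whichever triplet phase they fall in.
    """
    best_start, best_len = -1, 0
    for match in re.finditer(r"(?=((?:G..)+))", protein.upper()):
        tract = match.group(1)
        if len(tract) > best_len:
            best_start, best_len = match.start(), len(tract)
    return best_start, best_len


# Toy construct: arbitrary leader, two consensus repeats, LPXTG-like end.
toy = "MKTAV" + "GKDGQNGKDGLPGKD" * 2 + "LPSTG"
print(longest_gxy_tract(toy))
```

Applied to a set of alleles, the reported tract length gives a direct measure of the variation in GXYn length between strains.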



Fig. 3. CLUSTAL alignment of the complete predicted amino acid sequences from five of the RST00068 ORFs sequenced in full in this study. The predicted signal sequence cleavage point is shown by an arrow. Identical residues are shown by an asterisk while colons represent conserved substitutions and full points indicate semi-conserved substitutions. The conserved LPXTG motif is shown in bold.

 
Secondary structure analysis of all the putative proteins using the PROTEAN package predicted a similar structure for SclB from all strains. A diagrammatic representation of the predicted structure of one SclB protein, that from strain M63, is shown as an example in Fig. 4. Predicted molecular masses for the six isolates containing an in-frame protein range from 31·8 to 47·7 kDa, while the isoelectric points vary from 8·35 to 9·25. All the predicted proteins have a high glycine content (13·7–20·7%) and a mean proportion of 10% each of aspartate, lysine and proline. The proteins are largely hydrophilic molecules with the exception of the known hydrophobic regions constituting the N-terminal signal peptides and the C-terminal wall-associated region. Most of the molecule other than the signal peptide and wall-associated region was predicted to have a coiled structure. Regions with alpha-helical and beta-sheet structure were confined to the C- and N-terminal signals and the N-terminal hypervariable region of the protein. Secondary structure prediction using PSIPRED confirmed the above analysis, predicting an entirely coiled structure from approximately 40 residues into the mature protein as far as the wall-associated region.
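Two of the quantities reported above, residue composition and hydropathy, are straightforward to compute. The Python sketch below calculates the fraction of a given residue and a sliding-window mean of Kyte–Doolittle hydropathy values. It is a simplified stand-in for the PROTEAN analysis, not a reproduction of it: the window size of 9 is an assumption (the default parameters of PROTEAN are not given in the text), and the example peptide is the 15 aa consensus repeat rather than a complete SclB sequence.

```python
# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def residue_fraction(protein, residue):
    """Fraction of positions in `protein` occupied by `residue`."""
    return protein.upper().count(residue.upper()) / len(protein)


def hydropathy_profile(protein, window=9):
    """Mean Kyte-Doolittle hydropathy for each full window along the protein.
    Negative values indicate hydrophilic stretches."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


consensus_repeat = "GKDGQNGKDGLPGKD"
print(round(residue_fraction(consensus_repeat, "G"), 3))
print([round(v, 2) for v in hydropathy_profile(consensus_repeat)])
```

Every window across the consensus repeat has a negative mean, in line with the largely hydrophilic character predicted for the collagen-like region.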



Fig. 4. Predicted secondary structure and hydropathy profile of SclB from strain M63. The analysis was performed using the default parameters of the PROTEAN program within the DNA Star package.

 
Comparison of the 5' hypervariable region from multiple isolates. Comparison of the predicted amino acid sequences encoded by the 5' variable region of the ORF in all 18 strains examined in this study confirmed that the N-terminal end of the predicted mature protein was variable (data not shown). However, some of the sequences were clearly closely related to each other and appeared to have evolved as a result of the accumulation of nonsynonymous point mutations. In two cases, strains shared identical predicted N-terminal amino acid sequences (P195 with 669 and P15 with PT4854). The full ORFs were sequenced from both P195 and 669, both of which are emm87 isolates and the entire sequence was found to be identical with the exception of a difference in the number of CAAAA repeats. As described above, there is a clear trend towards sequences showing more similarity to each other on moving away from the extreme N terminus of the predicted protein.

Detection of expression of the RST00068 ORF
To determine whether a transcript corresponding to RST00068 was present in S. pyogenes, RT-PCR was performed using specific primers. Representative results using RNA isolated from strain 655 are shown in Fig. 5. An RNA transcript of approximately the correct size was detected using primer pair 655Exp/210dn located within the predicted RST00068 transcript and predicted from the sequence analysis to give a product of 815 bp. In contrast, when using the primer set SigP/Rst69, where Rst69 is located downstream of the predicted transcription terminator, no RNA transcript was detected in the DNase-treated preparation. In the RNA fraction that was not subjected to DNase treatment, a band of approximately 1 kb was seen, equating to the 1011 bp product which would be predicted assuming the region downstream of RST00068 in strain 655 was of identical length to the equivalent sequence in strain SF370. Primers to the housekeeping gene glnQ were used as a positive control and detected a transcript specific for this gene. Similar results to those obtained using the RNA from isolate 655 were also obtained using RNA from three other isolates, P180, P19 and 733, confirming the presence of a transcript corresponding to ORF RST00068 in S. pyogenes in culture in vitro.



Fig. 5. RT-PCR analysis of the expression of RST00068 in strain 655. Primer pairs are shown at the top of the figure. Lanes 1–3 are positive control primers to the housekeeping gene glnQ; lanes 4–6 are primers within the predicted transcript; lanes 7–9 include one primer outside the predicted transcript. Lanes 1, 4 and 7, DNase-treated RNA, -RT. Lanes 2, 5 and 8, DNase-treated RNA, +RT. Lanes 3, 6 and 9, untreated RNA preparation, -RT.

 
Confirmation of the presence of a second collagen-like protein
During the course of this work the sequence of an S. pyogenes gene, scl, encoding a collagen-like protein was deposited in GenBank (AF252861; S. Lukomski, K. Nakashima, I. Abdi, V. J. Cypriano, S. D. Reid & J. M. Musser). This gene was isolated from SF370, the genome-sequencing strain, and, based on the sequence of the conserved signal and wall-associated regions, clearly corresponded to the second ORF described earlier (RST00596) despite some sequence anomalies. The inability to amplify this gene by PCR in this study using a single primer set based on the RST00596 sequence (174up and 174dn) most probably reflects genetic variability at this locus and an unfortunate positioning of PCR primers. To confirm the presence of this gene in isolates from which ORF RST00068 had already been characterized, new PCR primers were designed based on the scl sequence (sclup and scldn). A product was obtained from all 11 isolates from which RST00068 had been sequenced in full and, as with the RST00068 product, the scl product was of variable size. Sequencing of the 5' end of two of these products from isolates 655 (accession no. AJ301823) and M60 (accession no. AJ301824) confirmed that this gene is distinct from that corresponding to RST00068 (data not shown). This gene appears to share some of the characteristics of RST00068. The limited sequencing performed suggests that, following a conserved signal peptide, the N-terminal sequence encoded by scl is variable between isolates and that the size variation of PCR products is likely to reflect a variable length of GXYn repeats. Thus it appears that many isolates of S. pyogenes have the potential to express two collagen-like proteins: scl, already deposited in the database and hereafter referred to as sclA, and a second putative gene which, for the sake of consistency, will hereafter be referred to as sclB.


   DISCUSSION
This study has characterized an ORF, sclB, with the potential to encode a previously uncharacterized hypervariable surface protein of S. pyogenes. This ORF appears to be present in a large proportion of isolates although further work using alternative primer sets and/or Southern blotting experiments is necessary to determine whether it is ubiquitous. Several strands of evidence lead to the belief that a real gene has been correctly identified. Firstly, convincing promoter elements are located upstream of the proposed gene with a sequence encoding a signal peptide located immediately downstream of the putative start codon and a sequence encoding classic Gram-positive wall-associated sequences at the 3' end of the ORF followed by a putative transcription terminator. Furthermore, the pattern of variability is consistent with that seen in some other S. pyogenes surface proteins. Both members of the emm gene family and the sof gene, encoding the fibronectin-binding opacity factor (Whatmore et al., 1994, 1995; Rakonjac et al., 1995; Courtney et al., 1999), encode proteins displaying a similar pattern of C-terminal conservation with extensive diversity in the N-terminal distal part of the protein. This pattern of diversity suggests that sclB is expressed at least transiently and is subject to selective pressures imposed by the host immune response which are likely to be strongest on the more exposed N-terminal tip of the molecule. In agreement with the idea of selective pressure acting on this region, comparison of closely related sequences such as those from P195 and M63 demonstrates an inflation of nonsynonymous change over synonymous. Thus, in the sequence encoding the N-terminal 20 aa there are eight nonsynonymous substitutions compared with only three synonymous substitutions; such an inflated ratio is indicative of positive selection for amino acid substitutions (Smith et al., 1995). This evidence that sclB can be expressed in vivo is supported by RT-PCR results demonstrating that a transcript corresponding to the ORF defined in this study is produced during in vitro cultivation.
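The comparison of nonsynonymous and synonymous change referred to above can be sketched in Python as a codon-by-codon count between two aligned coding sequences. This is a deliberately simplified illustration and an assumption about how such a tally might be made: it counts differing codons rather than individual substitutions and does not normalize by the number of synonymous and nonsynonymous sites, as formal methods such as that of Nei & Gojobori do. The function name and the example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_changes(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons between
    two aligned, gap-free, in-frame coding sequences of equal length."""
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be equal-length multiples of 3")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Hypothetical example: GGT->GGC is silent (Gly), GAT->GAA changes Asp to Glu.
print(count_codon_changes("GGTAAAGAT", "GGCAAAGAA"))
```

An excess of nonsynonymous over synonymous changes in such a tally is only suggestive; a site-normalized estimate would be needed before drawing firm conclusions about selection.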

This study also identified a novel repetitive element apparently located within the sclB coding sequence. Repetitive DNA elements, often called SSRs, have been increasingly recognized in prokaryotes in recent years. Because slipped-strand mispairing can alter the number of such repeats, they can be involved in gene regulation either when positioned upstream of a gene, thus influencing transcription, or when positioned within the gene, thus influencing translation (Henderson et al., 1999). There are now several examples of genes regulated at the translational level through variation in the number of SSRs at the 5' end of the gene, similar to those seen in sclB, where a change in repeat number alters the reading frame. For example, the adhesins encoded by the opa genes of Neisseria gonorrhoeae and Neisseria meningitidis achieve on–off phase variation at the translational level by alterations in the reading frame of a semi-variable hydrophobic portion of the leader sequence consisting of a run of CTCTT repeats, the number of which can vary as a result of slipped-strand mispairing (Stern et al., 1986). In a similar manner, translational control as a result of slipped-strand mispairing within a run of cytosine residues in the ORF has been shown to regulate expression of the Bordetella pertussis bvgS gene, a major regulator of virulence genes in this organism (Stibitz et al., 1989). Similar events have been implicated in the regulation of genes in a variety of other Gram-negative organisms, including Vibrio cholerae, Haemophilus influenzae, Helicobacter pylori, Mycoplasma hyorhinis, Haemophilus somnus, Pasteurella haemolytica and Mycoplasma fermentans (Henderson et al., 1999; van Belkum et al., 1999). Recent analysis of whole genome sequences has led to suggestions that translational slipped-strand mispairing may be a more important and widespread mechanism of regulation than has previously been realized. A substantial number of genes in both H. pylori and N. meningitidis contain such SSRs and are potentially regulated by this mechanism (Saunders et al., 1998, 2000). There is currently no direct evidence that sclB is regulated at the translational level by slipped-strand mispairing of the CAAAA repeats. However, circumstantial evidence in the form of the location of these repeats at the 5' end of the signal-peptide-encoding sequence and immediately downstream of a putative start codon suggests that this is a probable scenario. Although no attempt was made to demonstrate an alteration in repeat number in culture, different numbers of CAAAA repeats were found in isolates with otherwise identical sclB sequences (669 and P195), providing additional support for this scenario. To date, translational regulation by slipped-strand mispairing has not been demonstrated in Gram-positive bacteria, although slipped-strand mispairing of internal sequence repeats has been implicated in the antigenic variation of some surface proteins, notably those encoded by the highly repetitive emm genes, which encode one of the major virulence proteins of S. pyogenes (Harbaugh et al., 1993; Whatmore et al., 1994).
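
The frame-shifting arithmetic behind this proposal can be illustrated with a minimal sketch. Because each CAAAA unit is five nucleotides long, gaining or losing a single repeat changes the length of the tract by a number of bases that is not a multiple of three, so only every third repeat number leaves the downstream sequence in the same frame. The function name below is hypothetical and chosen for illustration. The sketch reports only the offset contributed by the repeat tract itself; which repeat numbers actually place SclB in frame depends on the flanking sequence of each allele and is not derived here.

```python
REPEAT_UNIT = "CAAAA"  # pentameric repeat found downstream of the putative sclB start codon


def frame_offset(n_repeats: int, unit: str = REPEAT_UNIT) -> int:
    """Return the reading-frame offset (0, 1 or 2) contributed by a repeat tract.

    Illustrative assumption: only the length of the tract is considered;
    the flanking sequence of a real allele decides which offset is 'in frame'.
    """
    return (n_repeats * len(unit)) % 3


# Repeat numbers from 4 to 15 were observed between strains in this study.
for n in range(4, 16):
    print(n, frame_offset(n))
```

Under these assumptions, two alleles are in the same frame relative to each other only when their repeat numbers differ by a multiple of three.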

The striking and extensive collagen-like structure of the putative SclB protein appears unprecedented amongst the prokaryotic proteins characterized to date. The extensive run of GXYn repeats resembles the remarkably regular structure of collagen, where virtually every third residue is glycine. It is thus tempting to speculate that SclB may exist in the same triple-helix conformation as collagen, composed of three protein units wound around each other to form a stiff cable. The three-residue glycine periodicity is crucial in such a structure, as there are three residues per turn of the helix and the only amino acid small enough to fit in an interior position is glycine. Previously, only very limited runs of collagen-like sequence have been reported in bacteria, such as in a hydrolase of Klebsiella pneumoniae, which contains up to six tripeptide repeats (Charalambous et al., 1988), and in the Streptococcus sanguis platelet-aggregation-associated protein (Erickson & Herzberg, 1993). Interestingly, a phage-encoded hyaluronidase found in S. pyogenes can contain a run of ten GXY repeats, although this region is often missing, implying that it is not important for hyaluronidase activity (Hynes et al., 1995). During the course of this work Lindmark & Guss (1999) reported a novel fibronectin-binding protein, SFS, from Streptococcus equi, which also shows high scores against collagen on similarity searches. The basis of this similarity was the high content of glycine, serine and proline present in both proteins and a ten-residue motif (QGERGEAGPP) seen in both collagen and the fibronectin-binding domain of SFS. Although SclB and SFS share some common characteristics in terms of amino acid composition, there are clear differences between the proteins. While the collagen-like GXY repeats of SclB are virtually perfect and very extensive, SFS has only very limited runs of perfect three-residue periodicity. In addition, although SFS does possess a signal peptide, it does not contain any of the sequence motifs known to mediate attachment to the cell wall.
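
The distinction drawn above between extensive and limited runs of three-residue glycine periodicity can be made concrete with a short sketch that counts the longest stretch of consecutive G-X-Y triplets in a protein sequence. This is an illustration only and is not the analysis used in this study; the function name is hypothetical, and the only sequence taken from the text is the ten-residue motif QGERGEAGPP.

```python
def longest_gxy_run(protein: str) -> int:
    """Return the largest number of consecutive G-X-Y triplets in a sequence.

    A triplet is any three residues beginning with glycine (G); consecutive
    triplets must follow one another without gaps, in any reading register.
    """
    n = len(protein)
    # run[i] = number of consecutive G-X-Y triplets starting at position i
    run = [0] * (n + 3)
    best = 0
    # Work backwards so that run[i + 3] is already known when run[i] is filled in.
    for i in range(n - 3, -1, -1):
        if protein[i] == "G":
            run[i] = 1 + run[i + 3]
            best = max(best, run[i])
    return best


# The motif shared by collagen and SFS contains a short run of such triplets.
print(longest_gxy_run("QGERGEAGPP"))
```

Applied to a full-length sequence, a value covering most of the protein would correspond to the extensive periodicity described for SclB, whereas small values would correspond to the limited runs described for SFS.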

The function of SclB remains unknown. Attempts to express the protein in Escherichia coli have so far proved unsuccessful, perhaps reflecting the unusual structure of SclB. It is possible that SclB is involved in either adhesion to the cell matrix or binding of host proteins from the extracellular milieu. A further intriguing possibility, given the structural similarity to such a ubiquitous host protein as collagen, is the involvement of this molecule in the autoimmune sequelae, particularly acute rheumatic fever, associated with S. pyogenes. Although the sequences of SclB are rather variable, the M protein precedent demonstrates that S. pyogenes proteins displaying little or no primary sequence similarity can have identical activities (Johnsson et al., 1998). In light of the similarity to SFS described above, it is possible that SclB represents a fibronectin-binding protein, particularly as collagens of types I to V have been shown to bind fibronectin (Engvall et al., 1978). S. pyogenes is already known to harbour at least two fibronectin-binding proteins, SfbI/protein F (Sela et al., 1993; Talay et al., 1994) and SfbII/OF (Kreikemeyer et al., 1995; Rakonjac et al., 1995), although neither is found in all strains (Goodfellow et al., 2000). In addition, some proteins encoded by certain members of the emm gene family are believed to possess fibronectin-binding activity (Frick et al., 1995; Reichardt et al., 1995). However, it is becoming apparent that binding of fibronectin by bacterial surface proteins is a common property, and multiple fibronectin-binding proteins have now been reported from a number of other Gram-positive bacteria (Joh et al., 1999).

The confirmation that at least two surface proteins with collagen-like structure are present in at least some isolates of S. pyogenes identifies a new family of variable proteins within this organism. Although only limited sequence data are available as yet for SclA, it is likely that SclA and SclB share the same highly repetitive structure, with a variable-length tract of GXYn repeats following an N terminus displaying extensive sequence diversity. However, sclA appears to lack the pentameric nucleotide repeats seen within the signal-peptide-encoding region of sclB. This family appears to share a number of common features with members of the emm gene family. Both families have variable N termini that, at least in part, reflect a positive selection for the accumulation of nonsynonymous point mutations (Whatmore et al., 1994). Both, in common with many Gram-positive surface proteins, also encode extensive tandem amino acid repeats within the structural gene. It seems likely that genetic variation in members of the scl family can be generated by the gain or loss of repeats as a result of intragenic recombination events and/or slipped-strand mispairing, as seen in members of the emm gene family (Whatmore et al., 1994). The presence of two genes sharing considerable sequence similarity further raises the possibility that variation could also be generated as a result of intergenic exchange of DNA between sclA and sclB, as has been noted between enn and emm gene representatives in the emm gene family (Whatmore & Kehoe, 1994; Podbielski et al., 1994). It is also possible that one scl gene may serve as a silent cassette acting as a potential reservoir of diversity for antigenic variation.

The results presented in this paper represent an initial characterization of a novel S. pyogenes gene, sclB, potentially encoding a surface protein and belonging to a family of previously uncharacterized proteins. Future work will be aimed at expressing SclB and elucidating its activities, determining whether post-infection human sera contain antibodies to SclB, confirming the role of the pentameric repeats in modulation of expression, and performing primer extension studies to confirm the location of control elements predicted by sequence comparisons.


   ACKNOWLEDGEMENTS
 
A.M.W. gratefully acknowledges the support of a Wellcome Trust Research Fellowship in Biodiversity (Grant no. 053589) and the assistance of Thangam Menon, Herbert Nsanze, Antoaneta Detcheva, Jorgen Henrichson and Helena Seppala in providing some of the isolates used in this study.


   REFERENCES
 
Beall, B., Facklam, R. & Thompson, T. (1996). Sequencing emm-specific PCR products for routine and accurate typing of group A streptococci. J Clin Microbiol 34, 953-958.

Beck, K. & Brodsky, B. (1998). Supercoiled protein motifs: the collagen triple-helix and the α-helical coiled coil. J Structural Biol 122, 17-29.

van Belkum, A., Scherer, S., van Alphen, L. & Verbrugh, H. (1998). Short-sequence DNA repeats in prokaryotic genomes. Microbiol Mol Biol Rev 62, 275-293.

van Belkum, A., van Leeuwen, W., Scherer, S. & Verbrugh, H. (1999). Occurrence and structure–function relationship of pentameric short sequence repeats in microbial genomes. Res Microbiol 150, 617-626.

Charalambous, B. M., Keen, J. N. & McPherson, M. J. (1988). Collagen-like sequences stabilize homotrimers of a bacterial hydrolase. EMBO J 7, 2903-2909.

Chen, C. C. & Cleary, P. P. (1990). Complete nucleotide sequence of the streptococcal C5a peptidase gene of Streptococcus pyogenes. J Biol Chem 265, 3161-3167.

Chou, P. Y. & Fasman, G. D. (1978). Prediction of secondary structure of proteins from their amino acid sequence. Adv Enzymol Relat Areas Mol Biol 47, 45-148.

Courtney, H. S., Hasty, D. L., Li, Y., Chiang, H. C., Thacker, J. L. & Dale, J. B. (1999). Serum opacity factor is a major fibronectin-binding protein and a virulence determinant of M type 2 Streptococcus pyogenes. Mol Microbiol 32, 89-98.

Cunningham, M. W. (2000). Pathogenesis of Group A streptococcal infections. Clin Microbiol Rev 13, 470-511.

Dev, I. K. & Ray, P. H. (1990). Signal peptidases and signal peptide hydrolases. J Bioenerg Biomembr 22, 271-290.

Engvall, E., Ruoslahti, E. & Miller, J. M. (1978). Affinity of fibronectin to collagen of different genetic types and to fibrinogen. J Exp Med 147, 1584-1595.

Erickson, P. R. & Herzberg, M. C. (1993). The Streptococcus sanguis platelet aggregation-associated protein. Identification and characterization of the minimal platelet-interactive domain. J Biol Chem 268, 1646-1649.

Fischetti, V. A., Pancholi, V. & Schneewind, O. (1990). Conservation of a hexapeptide sequence in the anchor region of surface proteins from Gram-positive cocci. Mol Microbiol 4, 1603-1605.

Frick, I. M., Crossin, K. L., Edelman, G. M. & Björck, L. (1995). Protein H – a bacterial surface protein with affinity for both immunoglobulin and fibronectin type III domains. EMBO J 14, 1674-1679.

Garnier, J., Osguthorpe, D. J. & Robson, B. (1978). Analysis of the accuracy and implications of a simple method for predicting the secondary structure of globular proteins. J Mol Biol 120, 97-120.

Goodfellow, A. M., Hibble, M., Talay, S. R., Kreikemeyer, B., Currie, B. J., Sriprakash, K. S. & Chhatwal, G. S. (2000). Distribution and antigenicity of fibronectin binding proteins (SfbI and SfbII) of Streptococcus pyogenes clinical isolates from the Northern Territory, Australia. J Clin Microbiol 38, 389-392.

Harbaugh, M. P., Podbielski, A., Hügl, S. & Cleary, P. P. (1993). Nucleotide substitutions and small-scale insertion produce size and antigenic variation in Group A streptococcal M1 protein. Mol Microbiol 8, 981-991.

von Heijne, G. (1990). The signal peptide. J Membr Biol 115, 195-201.

Henderson, I. R., Owen, P. & Nataro, J. P. (1999). Molecular switches – the ON and OFF of bacterial phase variation. Mol Microbiol 33, 919-932.

Hollingshead, S. K., Fischetti, V. A. & Scott, J. R. (1987). Size variation in group A streptococcal M protein is generated by homologous recombination between intragenic repeats. Mol Gen Genetics 207, 196-203.

Hynes, W. L., Hancock, L. & Ferretti, J. J. (1995). Analysis of a second bacteriophage hyaluronidase gene from Streptococcus pyogenes: evidence for a third hyaluronidase involved in extracellular enzymatic activity. Infect Immun 63, 3015-3020.

Jenkinson, H. F. & Lamont, R. J. (1997). Streptococcal adhesion and colonization. Crit Rev Oral Biol Med 8, 175-200.

Joh, D., Wann, E. R., Kreikemeyer, B., Speziale, P. & Höök, M. (1999). Role of fibronectin-binding MSCRAMMs in bacterial adherence and entry into mammalian cells. Matrix Biol 18, 211-223.

Johnsson, E., Berggard, K., Kotarsky, H., Hellwage, J., Zipfel, P. F., Sjobring, U. & Lindahl, G. (1998). Role of the hypervariable region in streptococcal M proteins: binding of a human complement inhibitor. J Immunol 161, 4894-4901.

Jones, D. T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292, 195-202.

Jones, K. F., Hollingshead, S. K., Scott, J. R. & Fischetti, V. A. (1988). Spontaneous M6 protein size mutants of group A streptococci display variation in antigenic and opsonogenic epitopes. Proc Natl Acad Sci USA 85, 8271-8275.

Kehoe, M. A. (1994). Cell-wall-associated proteins in Gram-positive bacteria. In Bacterial Cell Wall, pp. 217-261. Edited by J.-M. Ghuysen & R. Hakenbeck. Amsterdam: Elsevier Science.

Kreikemeyer, B., Talay, S. R. & Chhatwal, G. S. (1995). Characterisation of a novel fibronectin-binding surface protein in group A streptococci. Mol Microbiol 17, 137-145.

Kyte, J. & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J Mol Biol 157, 105-132.

Lindmark, H. & Guss, B. (1999). SFS, a novel fibronectin-binding protein from Streptococcus equi, inhibits the binding between fibronectin and collagen. Infect Immun 67, 2383-2388.

Madoff, L. C., Michel, J. L., Gond, E. W., Kling, D. E. & Kasper, D. L. (1996). Group B streptococci escape host immunity by deletion of tandem repeat elements of the alpha C protein. Proc Natl Acad Sci USA 93, 4131-4136.

Navarre, W. W. & Schneewind, O. (1999). Surface proteins of Gram-positive bacteria and mechanisms of their targeting to the cell wall envelope. Microbiol Mol Biol Rev 63, 174-229.

Nielsen, H., Engelbrecht, J., Brunak, S. & von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng 10, 1-6.

Podbielski, A., Hawlitzky, J., Pack, T. D., Flosdorff, A. & Boyle, M. D. (1994). A group A streptococcal Enn protein potentially resulting from intergenomic recombination exhibits atypical immunoglobulin binding characteristics. Mol Microbiol 12, 725-736.

Rakonjac, J. V., Robbins, J. C. & Fischetti, V. A. (1995). DNA sequence of the serum opacity factor of group A streptococci: identification of a fibronectin-binding repeat domain. Infect Immun 63, 622-631.

Reichardt, W., Gubbe, K. & Schmidt, K. H. (1995). M3 protein with close sequence homology to M12 protein binds fibrinogen, albumin, fibronectin but not to any subclass of IgG – localisation of binding regions. Dev Biol Stand 85, 179-182.

Roe, B. A., Linn, S. P., Song, L., Yuan, X., Clifton, S., McLaughlin, R. E., McShan, M. & Ferretti, J. (1999). Streptococcal Genome Sequencing Project. http://www.genome.ou.edu/strep.html.

Saunders, N. J., Peden, J. F., Hood, D. W. & Moxon, E. R. (1998). Simple sequence repeats in the Helicobacter pylori genome. Mol Microbiol 27, 1091-1098.

Saunders, N. J., Jeffries, A. C., Peden, J. F., Hood, D. W., Tettelin, H., Rappuoli, R. & Moxon, E. R. (2000). Repeat-associated phase variable genes in the complete genome sequence of Neisseria meningitidis strain MC58. Mol Microbiol 37, 207-215.

Schneewind, O., Jones, K. F. & Fischetti, V. A. (1990). Sequence and structural characteristics of the trypsin-resistant T6 surface protein of group A streptococci. J Bacteriol 172, 3310-3317.

Sela, S., Aviv, A., Tovi, A., Burstein, I., Caparon, M. G. & Hanski, E. (1993). Protein F: an adhesin of Streptococcus pyogenes binds fibronectin via two distinct domains. Mol Microbiol 10, 1049-1055.

Smith, N. H., Maynard Smith, J. & Spratt, B. G. (1995). Sequence evolution of the porB gene of Neisseria gonorrhoeae and Neisseria meningitidis; evidence of positive Darwinian selection. Mol Biol Evol 12, 363-370.

Stern, A., Brown, M., Nickel, P. & Meyer, T. (1986). Opacity genes in Neisseria gonorrhoeae: control of phase and antigenic variation. Cell 47, 61-71.

Stibitz, S., Aaronson, W., Monack, D. & Falkow, S. (1989). Phase variation in Bordetella pertussis by frameshift mutation in a gene for a novel two-component system. Nature 338, 266-269.

Talay, S. R., Valentin-Weigand, P., Timmis, K. N. & Chhatwal, G. S. (1994). Domain structure and conserved epitopes of Sfb protein, the fibronectin binding adhesin of Streptococcus pyogenes. Mol Microbiol 13, 531-539.

Whatmore, A. M. & Kehoe, M. A. (1994). Horizontal gene transfer in the evolution of group A streptococcal emm-like genes: gene mosaics and variation in the Vir regulon. Mol Microbiol 11, 363-374.

Whatmore, A. M., Kapur, V., Sullivan, D. J., Musser, J. M. & Kehoe, M. A. (1994). Noncongruent relationships between variation in emm gene sequences and the population genetic structure of group A streptococci. Mol Microbiol 14, 619-631.

Whatmore, A. M., Kapur, V., Musser, J. M. & Kehoe, M. A. (1995). Molecular population genetic analysis of the enn subdivision of group A streptococcal emm-like genes: horizontal gene transfer and restricted variation among enn genes. Mol Microbiol 15, 1039-1048.

Yother, J. & Briles, D. E. (1992). Structural properties and evolutionary relationships of PspA, a surface protein of Streptococcus pneumoniae as revealed by sequence analysis. J Bacteriol 174, 601-609.

Received 25 August 2000; revised 10 October 2000; accepted 16 October 2000.