Department of Microbiology, University of Virginia, Charlottesville, Virginia 22908
** Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, Virginia 22908
¶ Department of Pharmacology, Duke University, Durham, North Carolina 27710
![]() |
ABSTRACT |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
Partial protein sequences may also be determined by conventional N-terminal Edman sequencing on unseparated mixtures of peptides (11) (Fig. 1C). Because no fragment separation step is involved, femtomoles of starting material are sufficient for 1012 sequencing cycles, where each cycle produces a mixture of all the amino acids present at that cycle from each peptide. Mixed peptide sequencing usually obtains longer sequence reads than that possible by MS-based de novo sequencing and is less sensitive to post-translational modifications. However, the exact linear sequence of any individual peptide remains unknown, requiring a deconvolution of the residues at each site to reconstruct the original sequences of each peptide. Previously (11), we briefly described the FASTF algorithm for performing this task; here we provide further details of the FASTF algorithm, as well as improvements to sensitivity over the previous approach by introducing probability-based scores.
FASTS and FASTF extend the FASTA algorithm (12) and are available in the FASTA software package for sequence database searching. Both algorithms maximize the search sensitivity by (a) using scoring matrices with high information content, (b) constraining the kinds of alignments generated, and (c) using a strict probabilistic criterion for optimal alignments, which significantly improves the sensitivity and specificity of the algorithms over traditional similarity-score maximization approaches. Most importantly, these algorithms calculate accurate statistical estimates, providing the ability to robustly identify homologous proteins from large scale proteomic sequencing efforts.
![]() |
EXPERIMENTAL PROCEDURES |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
FASTF uses an identical strategy as FASTS but must also deconvolute the mixture of amino acid residues provided from each cycle of the Edman degradation reaction. This deconvolution requires an additional stage (2b in Table I) to ensure that each residue from each cycle is used only once in the peptide alignments. Because the assignment of residues to peptides is not known in stage 1 and stage 2, FASTF first calculates the best possible similarity scores by selecting the amino acid that produces the highest match score in each position in each high identity region, regardless of whether the selected amino acid had been used previously. For instance, if the residues L, K, and S were given as the amino acids present at position two in the query and were being aligned to library residues L, R, and M at the second stage, then the query residue L would align to both the library L (with a score of +20 using the MDM20 scoring matrix (14)) and the library M (score of -2), and the query residue K would align with the library R (score of 0), whereas the query residue S would remain unused.
During the rescan in stage 2b, the peptide alignments "consume" the best residues available, with the highest scoring region from stage 2a getting to choose first. In the best case, the library residue L should consume the query residue L, because the L:L alignment has the highest score and would be given first choice of the query residues to consume; K should next align to R and then S with M (score of -12). However, if the region containing the library M (the M-region) had a higher overall score than the region containing the library residue L (the L-region), then the M-region would choose its query residues first in stage 2b; the query residue L would be consumed by the M-region before the L-region could use it, forcing the library L to align to a different (and worse-scoring) query residue. Thus, this "greedy" method of deconvolution can generate suboptimal scores that lead to obvious mistakes in the reported alignment (as above where the two pairs L:L and S:M could instead be erroneously aligned as L:M and S:L). The greedy approach may reduce FASTF sensitivity. However, an optimal assignment of residues to peptide alignments is considerably more time consuming because of the combinatorial nature of the problem.
Alignment Probabilities
Unlike the similarity scores produced by FASTA, BLAST, and other sequence similarity searching programs that calculate local alignments, FASTS and FASTF calculate alignments whose scores are a combination of both global and local similarity scores. Global scores are calculated for individual peptide alignments, but peptides may be left out of the final alignment; a local alignment of the set of globally aligned segments lets contaminating peptides be excluded. These hybrid global/local alignment scores are not extreme value-distributed as are conventional BLAST, FASTA, and Smith-Waterman similarity scores (15, 16) (data not shown). To estimate the statistical significance of an alignment, we first calculate a theoretical probability for it, assuming that it was obtained by an optimal algorithm employing no heuristics. This probability is subsequently scaled to reflect the empirically observed distribution of alignments.
The statistical expectation of a FASTS or FASTF score in the context of a database search is determined by the product of two terms: 1) the probability of obtaining an alignment score in a single pairwise alignment, and 2) the number of alternative alignments considered, a value which depends on the length of the sequences involved and the number of queries and database sequences that were compared. Term 1, the probability of a single pairwise alignment score S, P(S x), or PS, can be calculated from the frequency of each amino acid and the scoring matrix (1720). There are two sources of alternative alignments that contribute to the "search space" term 2 with FASTS and FASTF: (a) the alternative arrangements of peptides made possible by the length of the library sequence, and (b) the possibility that not all peptides will be aligned.
K peptides of aggregate length M can be aligned to a sequence of length L in NA = ((L - M) + K)!(L - M)! ways. If the NA alternative alignment scores are independently and identically distributed, the pairwise alignment probability becomes
when the alternative alignment search space term is considered (21). However, because of the FASTA heuristic strategy, not all of these possible positions will have been explored, thus reducing the actual alignment search space. Moreover, the NA different alignments do not generate truly independent scores because of local compositional effects and higher order sequence dependences. Together, this means that NA is too large; the correction overestimates the alignment search space and results in statistical estimates that are much too conservative.
Neither FASTS nor FASTF requires all of the query peptides to align; this allows for contaminating peptides or the simultaneous analysis of protein complexes. This adds another level of query complexity; for FASTS, NQ = 2K - 1 unique peptide selections may be obtained from a query containing K peptides. This factor represents the maximal combinatorial search space explored during the subalignment joining stage. However, there are strong dependences between the scores obtainable by each of these combinations, so the correction is again conservative. For FASTF, the number of unique peptide selections and deconvolutions with K peptides, each of length M and having K unique residues at each position, is as follows in Equation 1.
![]() | (Eq. 1) |
The number represented by Equation 1 grows factorially with the number of peptides present in the query and exponentially with the length of the peptides. This second adjustment is again too large; it does not take into account the greatly reduced number of possible deconvolutions explored by the greedy residue consumption method used by FASTF.
The alignment search space-corrected probabilities PS|NA are used to select optimal alignments during the initial search. These alignment probabilities are then scaled to reflect the apparent combinatorial search space size. FASTS and FASTF use the initial PS|NA alignment probabilities to estimate an empirical combinatorial search space correction. The 95% of alignments with the highest probabilities most likely to be because of chance are fit to the equation, ln(PS|NA) = aln(PO) + bPO + c, where PO is the observed frequency of alignments with probabilities better than or equal to PS|NA, and a, b, and c are parameters to be estimated by multiple linear regression. This relationship fits the observed distribution of probabilities over its entire range (data not shown) and resembles mixed exponential decay; this is somewhat expected, as the FASTA algorithm is designed to find the best alignments of high quality while not spending any time optimizing an alignment of already low quality. After obtaining the parameter estimates â, b, and , the alignment probabilities PS|NA are scaled to yield the final probability P of each alignment. This value is used to report the statistical expectation estimate, E = PN, where N is the number of sequences searched in the database.
Database Searches with FASTS and FASTF
Searches with FASTS and FASTF use a shallow scoring matrix with high information content (MDM20 (14) for protein databases or MDM10 for searches against DNA) because of the small amount of sequence content in each query. No gap penalties are used. A web interface to the programs is available at fasta.bioch.virginia.edu, and the source code is available as part of the FASTA source distribution, via FTP at ftp.virginia.edu/fasta/.
The probabilistic alignment strategy that improves search sensitivity is time consuming; a runtime option is available in which no alignment probabilities are calculated initially; only the raw alignment score is used as the alignment optimality criterion. After the database has been searched, the top scoring 10% of sequences are realigned using the normal probability-driven alignment method and resorted for reporting. To obtain a distribution of alignment probabilities with which to perform the empirical scaling step described previously, additional randomly selected database sequences are realigned with a shuffled query, using probability-driven alignment. For all FASTF searches, and for those FASTS searches that recalculate compositional frequencies for each library sequence, using raw alignment scores as surrogates for the true alignment probabilities during the initial search of the database improves the run time by a factor of 10 or more, while not greatly effecting sensitivity (data not shown).
Construction of the Test Database and Queries
FASTS and FASTF performance was evaluated on a subset of proteins from the SwissProt v34 (22) database whose encoding DNA sequence was also available from GenBankTM.2 111 protein families (defined by their PROSITE (23) and PFAM (24) annotations) from the test database were selected that satisfied the following criterion: from each family, a representative sequence could be selected that shared more than 50% sequence identity, over a region of more than 50 residues, with at least 15 other family members. Additionally, any family whose chosen representative sequence was able to identify a non-annotated sequence as statistically significant using the Smith-Waterman search algorithm was considered annotated incompletely and dropped from further usage.
Five equally spaced, non-overlapping 10-mer peptides were extracted from within the identified region of shared sequence identity in each representative sequence. These 111 queries, consisting of five peptides each, were used to generate successively smaller queries containing fewer peptides and of shorter length. This process was continued until all possible sets of nested queries were obtained consisting of between two and five peptides and of length between three and ten residues each. The described sequence databases and query datasets are available via anonymous FTP at ftp.virginia.edu/fasta/data/fastsf_data.tar.gz.
Equivalence Number Calculation and the Sign Test
Search performance was evaluated using equivalence numbers, a measure of the number of related sequences found in a search (25). If all related sequences are ranked higher than all unrelated sequences, the equivalence number is 0. For all other orderings, the equivalence number ranges between 1 and the size of the family of a given query. We use the non-parametric sign test statistic to assess any differences in performance indicated by the distribution of increases and decreases in equivalence numbers from independent queries.
Comparison between FASTS and MS-Shotgun
Fourteen MS/MS-derived FASTS queries from Trypanosoma brucei 20 S proteosomal proteins, published in Ref. 8, were used to search the National Center for Biotechnology Information (NCBI) non-redundant protein database (obtained October 11, 2001). We removed from consideration all hits against sequences from organisms in the Kinetoplastida taxonomic subtree (which includes T. brucei), as determined by the NCBIs Taxonomy database at ncbi.nlm.nih.gov/Taxonomy. FASTS p values are calculated from the reported expectation (E) values as P = 1 - exp(-E). Percent identities were obtained by aligning the full-length query sequence to the best related sequence identified by FASTS; alignment gaps were not counted in the percentage calculation. For spots where FASTS failed to report any related sequences, the corresponding full-length query sequences were used to search the Kinetoplastida-filtered non-redundant database with FASTA. MS-Shotgun p values are as reported in Ref. 8. Highest scoring unrelated library sequences were identified by searching the complete non-redundant protein database with the full-length candidate library sequence, from which no hits against proteosomal sequences were found with E 10-3.
![]() |
RESULTS |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
This search against the SwissProt sequence database was performed on a 1-GHz Pentium III computer running the GNU/Linux OS and took 10 s to complete. Below is shown the list of top scoring hits, identifying the protein as a serum albumin.
|
The programs also output the calculated alignments for the top hits.
|
FASTF uses the same format as FASTS, with random assignments of the residues identified at each cycle to a specific peptide. Thus, in the FASTF query shown below, the d, g, t, and l residues obtained in cycle 1 (m is the first residue, because the peptides were produced by cyanogen bromide cleavage) have been arbitrarily assigned to peptides 1 through 4. FASTF reads each column as a position (ignoring the vertical order of the residues within the column).
|
Searching the NCBI non-redundant protein database (699,616 sequences) with this query took 80 s. The alignment shows that FASTF has identified the query as a ZIP-kinase (with an expectation of 2.7 x 10-8) and deconvoluted the input sequence while preserving the positional composition defined by the query. In this case, however, the alignment only involves three of the four peptides.
|
Accuracy of Estimates of Statistical Significance
Since the introduction of the BLAST program for rapid sequence similarity searches (26), most widely used sequence comparison programs provide an estimate of how frequently an alignment score is expected by chance. If the statistical estimates are accurate, then an unrelated sequences should have alignment scores with an expectation E of 0.01 in about 1% of independent searches, expectations of E 0.001 should be seen one time in 1000, etc. If the highest scoring unrelated sequence obtains E
0.1 only once in 1000 searches, the estimates are too conservative, and related sequences are likely to be missed as false negatives (type II errors). Conversely, if an unrelated sequence receives an E
0.01 in every search, many false positive (type I) errors are likely to occur. Thus, in evaluating the performance of a sequence comparison strategy, it is important to examine the accuracy of its statistical estimates.
To evaluate the accuracy of FASTS and FASTF statistical estimates, we used our test queries to search annotated protein and DNA databases, examining expectation estimates of the highest scoring unrelated sequence from each search (Fig. 2). Using FASTF, independent queries sharing the same peptide number and length may exhibit modest type I statistical error (Fig. 2A); most FASTS estimates are very reliable. FASTF estimates are less accurate, often lower by factors of 2050. Statistical inaccuracy depends on both the length and number of peptides, and translated sequence comparisons (TFASTS and TFASTF) provide less accurate statistical estimates (Fig. 2B). Error increases with total query content and is generally about 10-fold worse with searches against DNA databases, an effect that is also seen in standard translated DNA search algorithms (27).
|
Evaluation of Alignment Probability as Optimality Criterion
Most sequence alignment algorithms, including some designed for use with MS/MS-derived sequence (28), maximize the sum total similarity score of an alignment, rather than minimizing the probability of obtaining the alignment by chance (2931). In searches with multiple peptides, however, whereas any single peptide involved in an alignment with a higher similarity score will result in a lower probability of the score PS, the addition of a second peptide to an existing alignment may not produce a more statistically significant alignment under the statistical model we describe; the additional peptide increases the alignment search space adjustment NA. To take this potential penalty into account, FASTS and FASTF use the adjusted probabilities PS|NA when joining subalignments. This optimality criterion requires multiple-peptide alignments to be composed of higher scoring subalignments, excluding low (but positive-) scoring subalignments that would otherwise worsen the overall alignment quality (an artifact often termed the "mosaic effect" (32)).
With both FASTS and FASTF, we find that, except for queries with vanishingly small sequence content, probabilistic alignments provide better discriminatory power than subalignment joining based on similarity scores alone (Fig. 3). We also measured the direct effect of probabilistic scoring on sensitivity at a conservative expectation threshold and find that as many as 30% more related sequences at 80100% identity could be identified with probabilistic alignment optimization; at 5070% identity, however, a more modest 10% improvement is seen.
|
|
Searches with a smaller number of longer peptides are more sensitive, particularly at greater evolutionary distance. Reduced sensitivity with more peptides largely reflects an increase in the theoretical NQ term and associate increase in PS|NA scaling. With FASTS queries, the penalty for additional peptides is relatively small and is easily offset by the gain in total information content afforded by the extra residues. Thus, FASTS sensitivity nearly always increases with additional peptides of similar length (Fig. 4). When data from an MS/MS experiment fails to find a significant hit, sequence data obtainable from interpretation of additional MS fragment spectra should improve sensitivity. In contrast, FASTF queries suffer large penalties with the addition of peptides, made even worse as the peptide length increases (Fig. 4). Unlike MS/MS experiments, however, in a mixed Edman degradation experiment there is little control over the number of peptides from which sequence is obtained. Luckily, this effect is mitigated by the ability of Edman sequencing to generate longer peptide sequences that overcome the combinatorial penalty.
Comparison of FASTS to Alternative Methods
Both FASTA (CIDentify (33)) and BLAST (MS-BLAST (28), MS-Shotgun (8)) have been employed by previous methods to search databases with MS/MS-derived sequence data. These earlier methods use various forms of congruency analysis to identify database sequences that hit with the highest scores and most often against the peptide sequences in each query. Of the three, only MS-Shotgun attempts to align all of the query peptides simultaneously (by repeating gapped-BLAST searches with all possible permutations of the peptide order of a query) and to assign statistical significance to the results. Therefore, we compared FASTS with respect to MS-Shotgun by repeating the analysis performed in Ref. 8. Fourteen experimentally obtained MS/MS peptide sequence queries from the 20 S proteasome subunit of T. brucei were used to search the National Center for Biotechnology Information non-redundant database of protein sequences, with all taxonomically adjacent Kinetoplastida sequences removed.
Although FASTS and MS-Shotgun performances are similar (Table II), FASTS statistical estimates are considerably more accurate than those produced by MS-Shotgun. The highest scoring unrelated sequences in the FASTS searches had p values ranging from 0.22 to 1.0; MS-Shotgun p values ranged from 10-5 to 1.0. This wide range of p values for unrelated sequences confounds attempts to identify unambiguously homologous database sequences. The importance of accurate statistical estimates can be seen clearly in the MS-Shotgun results for spot 5, where a significant alignment to a related sequence has a worse probability than that of an unrelated sequence; FASTS has no such difficulty. Although in Ref. 8 spots 2, 12, and 15 were all determined to be identifiable, the highest scoring homologs had p values worse than 10-4, and the related sequence had less than 100-fold differences in probability between related and unrelated sequences.
|
![]() |
DISCUSSION |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Although FASTS takes into account single residue isobars (I/L and Q/K), it does not correct for other sources of sequence error from spectral misinterpretation (e.g. dipeptide isobars, reversed sequence order). If such sources of error are likely in an experiment, additional peptide sequences reflecting these alternatives may be added to the FASTS query. These additional peptide sequences will incur a small penalty for the additional combinatorial complexity; for a query with five peptides, adding five reversed peptide sequences will increase search time by 2-fold and decrease the statistical significance of a match by 25 = 32-fold. Because the addition of each additional peptide decreases significance by a factor of two, the inclusion of all possible sequence variants (however unlikely) is unadvisable. This robustness of FASTS to inclusion of peptides that may not be involved in any specific protein alignment makes it an ideal tool to simultaneously identify multiple proteins from mixtures (34); we have simultaneously identified multiple unrelated proteins in several experiments. Future versions of FASTS may be designed to analyze peptide data from more complex mixtures.
No algorithmic equivalent to FASTF currently exists. Mutation-sensitive pattern or motif search algorithms could be used to search a database with mixed Edman degradation-derived sequence data, but all matching sequences would still require further processing to determine which alignment assemblies satisfy the compositional requirements of the query, akin to the subalignment joining performed by FASTF. We are currently exploring methods to generate optimal FASTF alignments for display, correcting those mistakes made by the greedy alignment heuristic. We will also then evaluate whether taking the time to calculate optimal alignments during the database search has any measurable effect on sensitivity.
The probabilistic optimality criterion in FASTS improves search sensitivity over methods based on total similarity score alone (see Fig. 3, e.g. CIDentify and MS-BLAST). In a concrete example, a query consisting of five peptides of total length 35 from GBB3_RAT (guanine nucleotide-binding protein beta, subunit 3 from rat) achieves a nearly complete alignment against various coronavirus glycoproteins including VGL2_CVBV.
|
Inspection of the alignment suggests that this is a potentially homologous match; it has a very high similarity score (init = 170). However, the statistical exception provided by FASTS is only 2.9. Lower scoring (init = 153) yet more significant alignments (E < 8.8 x 10-5) occur with true homologs of GBB3_RAT. Probabilistic scoring, combined with accurate statistical estimates, makes FASTS a clear choice over score-based alternatives.
FASTS and FASTF achieve high sensitivity by maximizing the search potential of queries, with high information content scoring matrices, ungapped global peptide alignments, and a stringent probabilistic criterion for alignment optimality. Sensitivity can be improved by reducing the set of library sequences examined, for example by filtering the database by approximate molecular weight or isoelectric point (pI) ranges or by selecting a taxonomic subset of the data (mammals, plants, fungi). These options are all available within the FASTA search package.
![]() |
ACKNOWLEDGMENTS |
---|
![]() |
FOOTNOTES |
---|
Published, MCP Papers in Press, December 12, 2001, DOI 10.1074/mcp.M100004-MCP200
1 The abbreviations used are: MS, mass spectrometry; MS/MS, tandem mass spectrometry.
2 M.-Q. Huang and W. R. Pearson, manuscript in preparation.
* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
The on-line version of this article (available at http://www.mcponline.org) contains Supplemental Material.
Supported by Grant T32AI07046 from the National Institutes of Health.
|| Supported by Grants HL19242-24 and DK52378-04 from the National Institutes of Health.
Supported in part by Grant LM04969 from the National Library of Medicine, with additional support from the Compaq Computer Corporation. To whom correspondence should be addressed. Tel.: 434-924-2818; Fax: 434-924-5069; E-mail: wrp{at}virginia.edu.
![]() |
REFERENCES |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|