* Institute of Information Science, Academia Sinica, Taipei, Taiwan
Department of Ecology and Evolution, University of Chicago
Genomics Research Center, Academia Sinica, Taipei, Taiwan
Correspondence: E-mail: whli{at}uchicago.edu.
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Key Words: genomic sequences sequence alignment conserved regions translocations inversions
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
However, genomic sequences can be several million bases rather than only hundreds or thousands of bases long, and the traditional methods based on dynamic programming (e.g., Needleman and Wunsch 1970; Smith and Waterman 1981) do not work efficiently because the time and space complexities of these methods are proportional to the product of the lengths of the input sequences. Also, they are not sufficiently sensitive for finding short regions of good alignment flanked by longer regions of poor alignment (Batzoglou et al. 2000). Although popular heuristic similarity search methods such as Fasta (Pearson and Lipman 1988) and Blast (Altschul et al. 1990, 1997) are well suited for finding highly significant matching segment pairs between the query and target sequences, some conserved regions can often be missed because these methods were designed to align only highly conserved regions.
To solve the above problems, several tools for genomic sequence alignment have been proposed. These include MUMmer (Delcher et al. 1999), WABA (Kent and Zahler 2000), Glass (Batzoglou et al. 2000), SSAHA (Ning, Cox, and Mullikin 2001), Dialign (Morgenstern et al. 2002), Avid (Bray, Dubchak, and Pachter 2003), and Longa (Haidong Wang and Wen-Hsiung Li unpublished data). All of these methods adopt a two-step paradigm: the first step is to find alignment anchors and the second step is to use a standard alignment method such as Smith and Waterman's (1981) algorithm to align the region between each pair of two adjacent anchors. The kernel of MUMmer is the suffix tree data structure that is an attractive approach to representing strings and to finding MUMs (Maximum Unique Matches), which are exact matching segment pairs between two sequences and can be used as alignment anchors. Since all MUMs are perfectly matching pairs, not a large number of MUMs may be found when aligning two divergent sequences such as homologous mouse and human sequences. Thus, it may take a considerable amount of time to align those subsequence pairs between two neighboring MUMs by using Smith and Waterman's algorithm. To remove this disadvantage, Wang and Li (unpublished data) proposed a novel seed extension procedure and a two-out-of-three rule to find more significant matching pairs. However, the space required for the construction of a suffix tree in genomic sequence alignment can be so huge that a comparison of two 100 Mb sequences would require about 8 gigabytes of memory (Delcher et al. 1999). Therefore, how to reduce the space requirement is an important issue for this approach. Avid is a newly proposed global alignment program (Bray, Dubchak, and Pachter 2003). The kernel for finding anchors is also the suffix tree approach. With a recursion procedure, Avid can align genomic sequences rapidly. However, this program is an end-to-end global approach, so it cannot detect rearrangements and duplicated regions in sequences.
Glass (Global Alignment System) is a system for aligning hundreds of kilobases of genomic sequences (Batzoglou et al. 2000). There are three fundamental steps in Glass. First, a rough alignment map is constructed by finding long segments that match exactly. Then, the procedure is iterated on the intervening regions searching successively for shorter matching segments. Finally, the remaining regions are aligned using standard alignment techniques. One of the main advantages of Glass is a multilevel (iterative) procedure to find seeds by using different initial matching sizes, so that more long matching segment pairs can be found to serve as alignment anchors. However, like MUMmer, finding exactly matching segment pairs as initial seeds may not lead to a sufficiently large number of anchors when aligning divergent sequences.
Using two binary digits to represent each nucleotide in a DNA sequence is highly efficient for reducing the space requirement (Kent and Zahler 2000; Ning, Cox, and Mullikin 2001). Based on this idea, Ning, Cox, and Mullikin (2001) proposed a hashing tablebased approach to search large DNA databases. Since all sequences are constructed to be a hashing table, finding matching pairs between different sequences becomes more efficient and faster than character-based methods. However, the objective of this method is to search for homologous sequences in a huge database. Thus, only those very highly similar regions are found and aligned. Although Blat (Kent 2002) is an improvement in that it allows mismatches in seeds, its objective is also to search for highly homologous sequences in a large database.
In this paper, a novel genomic sequence alignment program called GS-Aligner (Genomic Sequence Aligner) is proposed. GS-Aligner is efficient in terms of both time and space for aligning two very long genomic sequences and for identifying genomic rearrangements such as translocations and inversions. It consists of several ideas from the above mentioned approaches: bit-level coding (Kent and Zahler 2000; Ning, Cox, and Mullikin 2001; Kent 2002), iteratively finding successively shorter matching segments (anchors) (Batzoglou et al. 2000; Morgenstern et al. 2002), longest increasing subsequence (LIS) (Delcher et al. 1999), and optimal local alignment (Smith and Waterman 1981). Moreover, to reduce the execution time, many parts of the program were also implemented in bit-level operations. Basically, GS-Aligner can be divided into four main parts: sequence coding, seed searching and extension, sorting alignment anchors, and closing the gaps between adjacent anchors. These steps are explained in detail below. We shall also compare the performance of our method with those of others.
![]() |
Method |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
Conversion of the Input Sequences to x-mer-Packet Numeric Sequences
Each of the four nucleotides can be represented by two bits (e.g., C by 00, T by 01, A by 10, and G by 11). Then, we can packet each subsequence of x nucleotides (called an x-mer) together to compute a 2xbits numeric. To avoid the problem of which nucleotide is the starting one, each x-mer packet overlaps with the next packet in x - 1 nucleotides. The obtained numeric sequence is called an x-mer-packet numeric sequence. For example, if the input sequence is 5'-ATGGCATCGAATCG...-3', the 2-mer-packet coding sequence is 6131538611211106112... (fig. 2). We divide the x-mer-packet codes into two categories: one for seed searching and the other for seed extension. For seed searching, we design an 8-out-of-12-mer-packet code for obtaining conserved matching segment pairs, using the two-out-of-three rule of Wang and Li (unpublished data). That is, for each three consecutive nucleotides, we require only the first two nucleotides to be identical between the two sequences and allow the third one to be either identical or different. Thus, when counting 12 consecutive nucleotides at a time, the number of the nucleotides in a packet is eight rather than 12. This strategy applies to both coding and noncoding regions because we are searching for short, highly conserved segments to be used as seeds. Note that all three possible reading frames are considered because in our encoding procedure, the first position of the x-mer moves one step to the next nucleotide in the sequence, that is, all possible x-mers are included in our encoding procedure.
|
Creation of an Address Lookup Table for Seeding Codes
An address lookup table is constructed for indexing the coordinates of all x-mers in the sequence B that are identical to the seeding codes. Then, for any seeding code in sequence A we can immediately find the locations of the same numeric in sequence B from this table. The data structure of this lookup table consists of each unique seeding code and all the coordinates of the same numerics in sequence B.
Seed Extension and Anchor Searching
Searching for Ungapped Alignment Anchors
For each of the identical seeding pairs, we attempt to extend it in both directions according to the following rule: if i or fewer than i mismatches occur between the two sequences to be aligned in the immediate adjacent x-mer in the 5' or 3' direction, then the extension is made in that direction, and this procedure is repeated in that direction until the rule is broken. The value of i is usually less than x/2. All such extended segments with a score exceeding a threshold value are kept as potential alignment anchors.
Note that actually we do not need to attempt to extend every seeding pair. Each segment without gaps can always be represented as a straight line with slope 1 in the two-dimensional space. If the line is extended, it will intersect with the vertical axis and the intersection can be readily calculated. Thus, before an extension is attempted, we can calculate the intersection of a seeding pair with the Y axis, and attempt the extension only if the coordinate of the seeding pair is outside the range of the last anchor recorded in an intersection table.
For convenience, we denote sequence A by SA and sequence B by SB and the number of bases in a seeding packet is set to x = 8. In figure 3, the upper line and the lower line are SA and SB, respectively. Let xsp be the starting position of a seed in SA and its 8-mer-packet numeric be i. Using i to index the coding seed lookup table, one can immediately find its related positions with the same numeric i in SB. Then, for each starting position of such a coding seed in SB, for example ysp, we compare the pair of next 8-mer-packet numeric at the position xsp+8 of SA and the position ysp+8 of SB. Let r be the allowed maximum mismatches within a pair of x-mers. If the number of mismatches of the pair of x-mers is less than r, the forward extending process can go ahead. Otherwise, the forward process is terminated, and the backward extension process is started. Once both of the forward and backward processes are terminated, we record the obtained xsp, xep, ysp, and yep in one of the linking lists.
|
Sorting the Anchors
As in other approaches (e.g., Delcher et al. 1999), the obtained anchors are usually unordered and can be inconsistent (fig. 4a). Thus, we have to sort these anchors according to their coordinates and scores. Using the LIS algorithm, several sets of consistent and ordered anchors can be obtained. Each set is called a group in our system. For example, the two sequences to be aligned are shown in figure 4a, and the matching pairs, (1,c), (2,d), (3,a), (4,b), and (5,e), that are obtained by the anchor searching procedure are shown in figure 4b. Here, we use a link list to store these pairs at the implementation stage. After sorting by the LIS procedure, two consistent groups, group 1 and group 2, are obtained as shown in figure 4c. Note further that if there are some blocks duplicated either in one sequence or in both sequences, one set of the duplicated matching pairs will be recorded in group 1 and the other set of the pairs will be stored in group 2, as an example shown in figure 5.
|
|
|
|
Alignment of Interanchor Subsequences
At the last stage, we align the subsequences between each pair of two neighboring anchors in the same group by developing a program for the algorithm of Smith and Waterman (1981). The default scores of match, mismatch, open gap penalty, and gap extension penalty are set to 1, -1, -5, and -0.2, respectively. These scores can be adjusted to users' demand.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
In PipMaker (Schwartz et al. 2000), the program gives pips (percent identity plots), each of which represents the position and percent identity (from 50% to 100%) of a gap-free segment of the alignment. However, this percent identity represents only the mean value of the identities in an aligned segment. If there are two aligned segments with the same percent identity, but one is several hundreds of base pairs long and the other is, say, only 30 base pairs long, we do not know how to compare the qualities of these two aligned segments.
In Waterston et al. (2002), the alignments are filtered after they have been generated to retain only those portions of the alignments that score above a certain threshold according to a suitable scoring matrix (Bray, Dubchak, and Pachter 2003). This method can be used to compare the alignments obtained with a common scoring system. However, because the default scoring systems of different programs are different, it may not be logical to recalculate the scores of the alignment results of different programs according to any particular scoring matrix and filter out those alignments under a single threshold. For example, the default scoring parameters of BlastZ were match = 1, mismatch = -1, gap open = -6, and gap extension = -0.02 in Schwartz et al. (2000). However, in their recent paper (Schwartz et al. 2003), the substitution scoring matrix has been adjusted as follows:
|
Let S(i) be the ith position of alignment S. The identical rate per aligned base pair is defined as the proportion of the matching (identical) pairs in an aligned segment with S(i) at the center of the segment; in this paper, the default length of the segment is set to 21. (Note that it is not suitable to use a long segment; that is, a large window size, because the number of windows may become very small.) Thus, the identical rate does not refer to a particular site, but to the average similarity of the sequence segment. If the identical rate is close to 1, the region is very highly conserved. On the other hand, if the rate is below 25%, the region contains mostly gaps, because even for an alignment of two random sequences, the identical rate is approximately 25% if the four nucleotides are about equally frequent.
Since the length of the segment is set to 21, all possible calculated identical rates are from 0% to 100% with a step of 5%. Thus, we can accumulate the number of all S(i) values with the same identical rate over the alignment and obtain a distribution in which the X-axis represents the identical rate and the Y-axis represents the number of base pairs with the same identical rate. This distribution is called the total number of sites for each identical rate. If the two sequences are highly conserved, the peak of the distribution will be very close to the right end (100%). In the remaining regions, the distribution will be close to zero. If numerous successive gaps are inserted into one of the two aligned sequences, the peak of the distribution with highly identical rates will still be close to the right end, but a second peak of the distribution will appear at the left end (0%). Thus, in an alignment of two highly homologous sequences, the distribution could have two peaks, one close to 1 and the other close to 0.
In addition, we also calculate the percentage of aligned sites for each identical rate and call this normalized distribution the accumulated percentage for each identical rate. In most cases, finding a relatively small number of highly conserved regions or a large number of moderately or highly conserved regions is a trade-off problem. Thus, we can analyze the quantity and quality of the final aligned result based on the distributions of the total number and the percentage of aligned sites for each identical rate, respectively.
Alignment Anchors
In GS-Aligner, the number of allowed mismatches within an 8-mer extension and the minimum score of an acceptable anchor are two parameters used at the stage of finding anchors, and the maximum aligned length (Lmax) is the parameter used at the final stage for aligning interanchor subsequence pairs; the scoring system is used at both stages. At the stage of finding anchors, the scores of a match and a mismatch are set to 1 and -1, respectively. Since we do not allow any gap within an anchor, there is no penalty at this stage for gap opening and extension. The allowed mismatches within an 8-mer extension are set to 2 and 3 at the first iteration and the second iteration, respectively. Thus, the average identical rate of each candidate anchor is expected to be at least 75% in the first iteration and 62.5% in the second iteration.
How to decide the minimum score of an acceptable anchor is based on the computational result obtained by Wang and Li (unpublished results). In their analysis, the expected number of matching pairs that appear by chance with a score over 30 is only 0.0001, when aligning two DNA sequences that are both 100 kb long. In our case, the minimum scores of an acceptable anchor are set to 35 and 25 at the first iteration and the second iteration, respectively. At the stage for the final alignment, the default scores of match, mismatch, open gap penalty, and gap extension penalty are set to 1, -1, -5, and -0.2, respectively. The maximum aligned length is set to several values (Lmax = 100, 1,000, and 5,000) for comparison (see below). All parameters, of course, can be adjusted by users themselves to suit their purpose.
Applying our program to the first set of test sequences, we found 372 anchors, of which 239 were found at the first iteration and 133 were found at the second iteration. The total length of the anchors is 31,360 bp, of which 80% are obtained at the first iteration and 20% are obtained at the second iteration. The identical rates of the anchors at the first iteration and the second iteration are 89% and 81%, respectively. The average identical rate of all the anchors is 85%.
Total Numbers of Aligned Sites for Different Identical Rates
Figure 8a shows the total number of sites on the vertical axis and the identical rate on the horizontal axis for three different maximum aligned lengths (Lmax = 100, 1,000, and 5,000). We can see that the total numbers of aligned sites with identical rates over 85% are almost the same for different lengths of Lmax, whereas the total number of sites with identical rates less than 85% varies with Lmax. Thus, if we want to increase the number of conserved regions that do not have a high identical rate (e.g., under 80% but above 40%), the Lmax has to be set to a larger one. We consider an identical rate of 40% as acceptable, but an alignment segment with an average identical rate lower than 40% should be taken with caution. When Lmax is set to a large value, there may be many aligned regions with an average identical rate lower than 40%. One should be very cautious in accepting such regions. Figure 8b shows the accumulated percentage for each identical rate of figure 8a. The quality of the aligned sites by using Lmax = 100 is better than the other ones because over 70% of its aligned sites have an identical rates higher than 80%. However, we can see from figure 8a that it has the smallest number of sites with an identical rate higher than 40%. Thus, we should consider not only the quality but also the "quantity" of an alignment.
|
In this paper, we used only input sequences that have already been repeat-masked. (In Bray, Dubchak, and Pachter [2003], the input sequences to BlastZ were not masked for repeats, so all the execution times for BlastZ were much longer than the ones shown in this paper.) In addition, because the alignment results for different programs were in different formats (see fig. 9), we converted the results from different programs into a consistent format, so that we could calculate their total numbers of aligned sites for different identical rates. Table 2 summarizes the execution times of different programs for aligning the nonhuman sequences with the human sequences. All programs were executed by using the default parameters; for BlastZ we also set the parameter c to 2 since turning the chaining opinion on can speedup its execution time (Bray, Dubchak, and Pachter 2003). In the comparisons of the cat and cow sequences with the human sequences, the execution times of GS-Aligner are slightly shorter than the others, but, for the other species, either MUMmer is faster or BlastZ is faster. However, it should be noted that BlastZ is only a local alignment program because it searches for local similarity and gives alignments of local segments but does not sort the segments into consistent orders. Skipping the sorting step is expected to save some execution time. MUMmer is a fast program for closely related sequences such as human versus chimpanzee sequences. On average, we cannot say which program is the fastest except that for Avid, the execution times of each species pair are almost twice those for the other programs. Note that as long as the differences in execution time are not large, the execution time should not matter much because the major determining factors should be the quality and quantity of alignment, which are the ultimate goal of sequence alignment and will be discussed below.
|
|
|
|
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Acknowledgements |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Footnotes |
---|
![]() |
Literature Cited |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.[CrossRef][ISI][Medline]
Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
Ansari-Lari, M. A., J. C. Oeltjen, and S. Schwartz, et al. (11 co-authors). 1998. Comparative sequences analysis of a gene-rich cluster at human chromosome 12p13 and its syntenic region in mouse chromosome 6. Genome Res. 8:29-40.
Bailey, J. A., A. M. Yavor, H. F. Massa, B. J. Trask, and E. E. Eichler. 2001. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 11:1005-1017.
Batzoglou, S., L. Pachter, J. P. Mesirov, B. Berger, and E. S. Lander. 2000. Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Res. 10:950-958.
Bray, N., I. Dubchak, and L. Pachter. 2003. Avid: a global alignment program. Genome Res. 13:97-102.
Delcher, A. L., S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg. 1999. Alignment of whole genomes. Nucleic Acids Res. 27:2369-2376.
Dubchak, I., M. Brudno, G. G. Loots, L. Pachter, C. Mayor, E. M. Rubin, and K. A. Frazer. 2000. Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res. 10:1304-1306.
Göttgens, B., J. G. Gilbert, L. M. Barton, D. Grafham, J. Rogers, D. R. Bentley, and A. R. Green. 2001. Long-range comparison of human and mouse SCL loci: localized regions of sensitivity to restriction endonucleases correspond precisely with peaks of conserved noncoding sequences. Genome Res. 11:87-97.
Jareborg, N., E. Birney, and R. Durbin. 1999. Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Res. 9:815-824.
Lund, J., F. Chen, A. Hua, B. Roe, M. Budarf, B. S. Emanuel, and R. H. Reeves. 2000. Comparative sequences analysis of 634kb of the mouse chromosome 16 region of conserved synteny with the human velocardiofacial syndrome region on chromosome 22q11.2. Genomics 63:374-383.[CrossRef][ISI][Medline]
Kent, W. J. 2002. Blatthe Blast-like alignment tool. Genome Res. 12:656-664.
Kent, W. J., and A. M. Zahler. 2000. Conservation, regulation, synteny, and introns in a large-scale C. briggsae-C. elegans genomic alignment. Genome Res. 10:1115-1125.
Mallon, A.-M., M. Platzer, and R. Bate, et al. (31 co-authors). 2000. Comparative genome sequence analysis of the Bpa/Str region in mouse and man. Genome Res. 10:758-775.
Mayor C., M. Brudno, J. R. Schwartz, A. Poliakov, E. M. Rubin, K. A. Frazer, L. S. Pachter, and I. Dubchak. 2000. VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics 16:1046-1047.[Abstract]
Miller, W. 2001. Comparsion of genomic DNA sequences: solved and unsolved problems. Bioinformatics 17:391-397.[Abstract]
Morgenstern, B., O. Rinner, S. Abdeddaïm, D. Haase, K. F. X. Mayer, A. W. M. Dress, and H.-W. Mewes. 2002. Exon discovery by genomic sequence alignment. Bioinformatics 18:777-787.
Needleman, S., and C. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443-453.[ISI][Medline]
Ning, Z., A. J. Cox, and J. C. Mullikin. 2001. SSAHA: a fast search method for large DNA databases. Genome Res. 11:1725-1729.
Pearson, W. R., and D. J. Lipman. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85:2444-2448.[Abstract]
Schwartz, S., W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller. 2003. Human-mouse alignments with BlastZ. Genome Res. 13:103-107.
Schwartz, S., Z. Zhang, K. A. Frazer, A. Smit, C. Riemer, J. Bouck, R. Gibbs, R. Hardison, and W. Miller. 2000. PipMakera web server for aligning two genomic DNA sequences. Genome Res. 10:577-586.
Smith, T., and M. Waterman. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195-197.[ISI][Medline]
Waterston, R. H., K. Lindblad-Toh, and E. Birney, et al. (222 co-authors). 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520-562.[CrossRef][ISI][Medline]
Wilson, M. D., C. Riemer, and D. W. Martindale, et al. (12 co-authors). 2001. Comparative analysis of the gene-dense ACHE/TFR2 region on human chromosome 7q22 with the orthologous region on mouse chromosome 5. Nucleic Acids Res. 29:1352-1365.
Zhang, Z., S. Schwartz, L. Wagner, and W. Miller. 2000. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7:203-214.[CrossRef][ISI][Medline]