Suffix Array Construction: KA's algorithm or Induced Sorting, which is faster?

In mid September, I started writing an internal tool for my employer that requires a suffix tree. So I went to check whether any new, faster algorithms had appeared after the O(n) skew algorithm for building suffix arrays and Ukkonen's O(n) algorithm for building suffix trees online.

I found only a few publications on suffix arrays in the nine years between 2003 and 2012, which seems to indicate that this field is inactive. It's also funny that the skew algorithm has acquired a new name, "DC3". The publications discussed in this article are:

  • Pang Ko and Srinivas Aluru, Space Efficient Linear Time Construction of Suffix Arrays, 2003;
  • Ge Nong, Sen Zhang and Wai Hong Chan, Two Efficient Algorithms for Linear Suffix Array Construction, 2008.

I was immediately attracted to KA's algorithm after reading the first two pages of their paper. Their idea is very smart. Most unfortunately, though, they stopped just one step short of the fastest O(n) suffix array algorithm. Five years later, NZC proposed their induced sorting algorithm, an improvement built on KA's algorithm.

According to a WADS'11 paper (http://arxiv.org/abs/1101.3448), the fastest O(n) suffix array implementation is the one maintained by a Japanese researcher/programmer at https://sites.google.com/site/yuta256/sais. It is an implementation of NZC's algorithm, i.e., induced sorting. His code is really cool; I'll talk about it later.

Even though yuta256 benchmarked against KA's algorithm, he didn't notice the potential power hiding in it, which could beat his sais-lite in the future.

As a preliminary, let me introduce the ideas of KA's algorithm and NZC's algorithm first. I have already written a post introducing them in Chinese at https://yangzhe1990.wordpress.com/2012/09/14/suffix-array-suffix-tree/; you can read that instead if you prefer Chinese.

We start by defining type S and type L suffixes. If s[i:] < s[i+1:], then s[i:] is a type S suffix; for short, we say position i is type S. Otherwise, i is type L. Position length - 1 is type L, because its suffix is compared against $, and $ is always smaller. We define position length itself to be type S, because its suffix is $.
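To make the definition concrete, here is a minimal sketch (my own illustration, not code from either paper) that classifies every position in a single right-to-left scan:

```python
def classify_types(s):
    """Return is_s, where is_s[i] is True iff position i is type S.

    Positions 0 .. len(s)-1 correspond to the real suffixes; position
    len(s) stands for the virtual sentinel $, defined to be type S.
    """
    n = len(s)
    is_s = [False] * (n + 1)
    is_s[n] = True   # position `length` is $, type S by definition
    # position n-1 stays type L: its suffix is compared against $
    for i in range(n - 2, -1, -1):
        # s[i:] < s[i+1:] iff s[i] < s[i+1], or the first characters
        # are equal and the outcome is inherited from position i+1
        is_s[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and is_s[i + 1])
    return is_s
```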

We can insert all the suffixes into buckets: a suffix starting with character x goes into bucket x. A type S suffix starting with x is larger than xxx… (x repeated forever), and similarly a type L suffix starting with x is smaller than xxx…. Thus within any bucket, the type L suffixes are smaller than the type S suffixes.
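Concretely, a bucket is just a contiguous slice of the output array. A small sketch (assuming byte-sized characters; the names are mine) that computes each bucket's boundaries by counting characters:

```python
def bucket_bounds(s):
    """For each character c, compute the half-open range
    [head[c], tail[c]) that bucket c occupies in the suffix array."""
    counts = [0] * 256
    for ch in s:
        counts[ord(ch)] += 1
    head, tail, acc = [0] * 256, [0] * 256, 0
    for c in range(256):
        head[c] = acc        # first slot of bucket c
        acc += counts[c]
        tail[c] = acc        # one past the last slot of bucket c
    return head, tail
```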

If we know the order of all the type S suffixes, the following algorithm sorts all the type L suffixes. Scan the buckets in increasing order. The first suffix we meet is $, the smallest suffix. For each suffix i we meet, if suffix i - 1 is type L, it is the current smallest unplaced suffix in its bucket: we move it to the head of its bucket and advance the head, which removes it from the bucket and fixes its rank. (We only need to place the type L suffixes; the type S suffixes are already in position.) Then we continue with the second suffix we meet, then the third, the fourth, and so on. We end up with a sorted suffix array.

To prove the algorithm correct, we only need to show that at the i-th step, the i smallest suffixes are already in their final positions. At the start, the ranks of all the type S suffixes among all suffixes are determined, because within each bucket every type L suffix precedes every type S suffix. At the i-th step, the i-th smallest suffix is either type S or type L. If it is type S, it is already in position. Otherwise it is some type L suffix s[p:]; then s[p+1:] < s[p:], so by the induction hypothesis there is some step j < i at which we met s[p+1:] and used it to place s[p:] at the head of its bucket, and at that moment the head of the bucket is exactly position i, the position of the smallest remaining suffix in the bucket. So s[p:] is the i-th smallest suffix.
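Here is a runnable sketch of this S-to-L deduction (my own toy code, not KA's implementation). To isolate the deduction itself, it cheats by sorting the type S suffixes with Python's built-in sort, where the real algorithm would recurse; it reuses classify_types and bucket_bounds from the sketches above:

```python
def sort_via_s_to_l(s):
    """Sort all suffixes of s, given only the sorted type S suffixes."""
    n = len(s)
    is_s = classify_types(s)
    head, tail = bucket_bounds(s)
    sa = [-1] * n   # -1 marks an empty slot

    # place the sorted type S suffixes at the tails of their buckets
    # (cheating with a comparison sort instead of recursing)
    for i in sorted((i for i in range(n) if is_s[i]),
                    key=lambda i: s[i:], reverse=True):
        tail[ord(s[i])] -= 1
        sa[tail[ord(s[i])]] = i

    # the first suffix we meet is $, which places suffix n-1
    # (always type L) at the head of its bucket
    sa[head[ord(s[n - 1])]] = n - 1
    head[ord(s[n - 1])] += 1

    # one increasing scan: meeting suffix j fixes suffix j-1,
    # if type L, by moving it to the current head of its bucket
    for i in range(n):
        j = sa[i]
        if j > 0 and not is_s[j - 1]:
            c = ord(s[j - 1])
            sa[head[c]] = j - 1
            head[c] += 1
    return sa

print(sort_via_s_to_l("banana"))   # -> [5, 3, 1, 0, 4, 2]
```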

So we only need to sort the type S suffixes. We can do this by recursion, if we can cut the original string into pieces such that each piece corresponds to a type S suffix.

For example, if we want to know the order of all the even suffixes of abacabad, we only need to cut it into ab|ac|ab|ad and solve the suffix array problem for the length-4 string ab|ac|ab|ad over the alphabet {ab, ac, ad}. But this method only works if no piece is a prefix of any other piece.
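A toy demonstration of this cut-and-recurse step, with the built-in sort standing in for the recursive call:

```python
s = "abacabad"
pieces = [s[i:i + 2] for i in range(0, len(s), 2)]         # ab|ac|ab|ad
names = {p: r for r, p in enumerate(sorted(set(pieces)))}  # ab=0, ac=1, ad=2
reduced = [names[p] for p in pieces]                       # [0, 1, 0, 2]
# sorting the suffixes of the reduced string gives the order of the
# even suffixes of s (valid here since no piece is a prefix of another)
order = sorted(range(len(reduced)), key=lambda i: reduced[i:])
print([2 * i for i in order])                              # -> [0, 4, 2, 6]
```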

So how can we sort the new alphabet if one piece is a prefix of another piece? Let's look at what a piece looks like in our case. Rewrite the string by replacing each character with its type; we get something like LSLLSSLSLSSL$. Suppose an SL piece is a prefix of an SLL piece; then we have to compare the third character, and since the SL piece is immediately followed by the S that starts the next piece, we are effectively comparing an SLS with an SLL. If their third characters differ, we are done; if they are equal, recall the property that within a bucket L comes before S, so we can deduce that SLL < SLS.

So let's define a type S substring to be the substring between an S position and the next S position, including both the leading and the terminating S. We can sort these substrings by two rules: 1. the alphabetically smaller character comes first; 2. for the same character, type L comes first. After the substrings are sorted, we can solve the problem recursively. The length of the new string for the reduction is the number of type S positions, which we can always assume is <= n/2; otherwise we use the type L positions in the recursion instead.
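A sketch of these two rules as a comparison function (my own code; it assumes is_s comes from classify_types above and that a and b are both type S positions of the same string):

```python
def s_substring_less(s, is_s, a, b):
    """Compare the type S substrings starting at positions a and b."""
    if s[a] != s[b]:                # rule 1 on the leading (S) chars
        return s[a] < s[b]
    a, b = a + 1, b + 1
    while True:
        # caveat: if a substring terminates at the virtual sentinel, a
        # real implementation must not index past the end of s
        if s[a] != s[b]:
            return s[a] < s[b]      # rule 1
        if is_s[a] != is_s[b]:
            return not is_s[a]      # rule 2: type L comes first
        if is_s[a]:                 # both reached the terminating S:
            return False            # the substrings are equal
        a, b = a + 1, b + 1
```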

Now the problem becomes: how can we sort the substrings in linear time? Unfortunately, KA had already designed the best algorithm to solve this but missed it, and wrote a really slow algorithm in their paper. Note that each substring looks like SL…LS, with zero or more Ls between the two Ss. Let's define each of L…LS, …, LS, S to be a partial-substring, since each is a suffix of the substring. We can sort all the partial-substrings with the same deduction we used above to derive the ranks of the type L suffixes from the ranks of the type S suffixes: by inserting all the type S partial-substrings into buckets, the ranks of the type S partial-substrings are determined! Then, after the S-to-L deduction, the L…LS, …, S partial-substrings are determined as well. Finally, loop over the sorted partial-substrings; for each partial-substring that starts immediately after an S position, i.e. a proper suffix of a substring, insert that substring into its bucket. This sorts all the substrings.

NZC's algorithm is based on KA's algorithm. They define a type S* position to be a type S position that comes right after a type L position. An S position that follows another S position is useless in the S-to-L deduction above, so the S* positions alone are enough to determine the ranks of all the type L suffixes. And once the type L suffixes are determined, we can determine the ranks of all the type S suffixes from them: the L-to-S deduction is just the S-to-L deduction with the scan reversed.

The substrings in NZC's algorithm look like S…SL…LS. After we have sorted the L…LS, …, S partial-substrings, we can sort SL…LS, SSL…LS, …, S…SL…LS with an L-to-S deduction.
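The L-to-S deduction is the mirror image of the scan sketched earlier: a right-to-left pass that, assuming every type L suffix is already in its final place, fills the type S suffixes in at the tails of their buckets. A sketch with the same conventions as above:

```python
def induce_s_from_l(s, sa, is_s, tail):
    """Right-to-left scan: meeting suffix j fixes suffix j-1, if type
    S, by moving it to the current tail of its bucket.  `sa` must
    already hold every type L suffix in its final position; `tail`
    holds the one-past-the-end index of each bucket."""
    for i in range(len(sa) - 1, -1, -1):
        j = sa[i]
        if j > 0 and is_s[j - 1]:
            c = ord(s[j - 1])
            tail[c] -= 1
            sa[tail[c]] = j - 1
```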

The number of type S* positions is at most 1/2 of the length, and the expectation is about 1/3 of the length. So their algorithm may be faster, because the recursion shrinks faster and the subproblems are smaller.

For people who haven't discovered how to sort the substrings efficiently in KA's algorithm, NZC's algorithm is always better. But now, can you tell which is better?

In KA's algorithm, sorting the substrings requires one scan. In this scan, when we meet a suffix i where suffix i - 1 is type L, we determine the rank of the partial-substring starting at i - 1; otherwise, if i - 1 is type S, we determine the rank of the substring starting at i - 1. And after the recursion, KA's algorithm requires only one more scan.

In NZC's algorithm, sorting the substrings requires two scans, and the post-processing after the recursion is also two scans.

So KA's algorithm gives T(n) = T(n/2) + n and NZC's algorithm gives T'(n) = T'(n/3) + 2n; which runs faster? Unrolling the recurrences, T(n) = n + n/2 + n/4 + … = 2n, while T'(n) = 2n + 2n/3 + 2n/9 + … = 3n. But this is an over-simplified computation: many other operations, such as scanning the string, building the buckets, and computing the new string for the recursion, take the same amount of time in KA's and NZC's algorithms.

We have to compare them by implementing both algorithms in the best possible way. Let's talk about implementation now.

At first, I wrote slow code whose running time was twice that of yuta256's sais. Then I found out why I was slower than him:

  1. The inverse of the suffix array is not necessary for computing the suffix array itself; the extra time spent updating it is too much.
  2. Random access into a large array is expensive, so boolean arrays such as "is this substring equal to the next one", "is this character type S or type L", "is the previous character type S or type L" can be packed into the suffix array, or deduced on the fly.
  3. Updating the bucket head is expensive. Yuta caches the current bucket: his code accumulates the updates and only writes the current bucket head back when it has to move to another bucket.
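To illustrate the third point, here is roughly what the caching looks like (my paraphrase of the idea, not yuta256's actual code). In C this saves a memory write per placement; in Python it is only an illustration:

```python
def induce_l_cached(s, sa, is_s, head):
    """The same S-to-L scan as before, but the head pointer of the
    bucket currently being filled lives in locals and is written back
    to `head` only when the scan switches to a different bucket."""
    cur_c, cur_head = -1, 0
    for i in range(len(sa)):
        j = sa[i]
        if j > 0 and not is_s[j - 1]:
            c = ord(s[j - 1])
            if c != cur_c:                  # bucket switch: flush
                if cur_c >= 0:
                    head[cur_c] = cur_head
                cur_c, cur_head = c, head[c]
            sa[cur_head] = j - 1
            cur_head += 1
    if cur_c >= 0:                          # final flush
        head[cur_c] = cur_head
```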

In both algorithms, we can reduce the size of the recursion by removing every character in the pattern that appears only once and does not follow a character appearing twice or more, because such a character has no impact on the order of the suffixes. This reduction also takes some time. When it shrinks the problem from n/3 or n/2 down to n/10 or n/100, it is worth it; otherwise, it isn't. For random input, when alphabet² = o(n) the reduction doesn't work; when the alphabet size is on the order of n, it does.

yuta256's sais is an optimized implementation of the non-reduced version of NZC's algorithm. I implemented both the reduced and non-reduced versions of KA's and NZC's algorithms, with the same optimizations. My implementation of the non-reduced NZC algorithm runs at the same speed as yuta256's code, so the benchmark is objective.

For 10M random string with alphabet size 16:

ka:               4.35, 3.80, 3.85, 3.84
sais:             3.53, 3.55, 3.52, 3.53
ka_reduced:       3.29, 3.19, 3.25, 3.25
nzc_reduced:      3.18, 3.16, 3.17, 3.17
ka_reduced_x_nzc: 3.18, 3.18, 3.19, 3.19

where ka means KA's algorithm, not reduced. ka_reduced means the reduction is used when it is worth doing; to keep it simple, I use the reduction in the recursion but not at the top level. nzc_reduced is similar. ka_reduced_x_nzc consists of NZC's substring partitioning and sorting at the top level, and KA's algorithm in the recursion.

For 20M random string with alphabet size 16:

ka:               9.32, 8.05, 9.33, 9.28
sais:             7.87, 7.87, 7.90, 7.90
ka_reduced:       7.17, 7.00, 7.22, 7.10
nzc_reduced:      6.85, 6.87, 6.83, 6.84
ka_reduced_x_nzc: 6.94, 6.95, 7.01, 6.94

And for 30M:

ka_reduced:       11.25, 11.21, 11.20
nzc_reduced:      10.74, 10.74, 10.73
ka_reduced_x_nzc: 10.46, 10.44, 10.48

In summary, the reduction really is faster when it is "worth it". NZC's substring sorting and post-recursion scans run slower than KA's, but NZC's algorithm is often faster overall because the length of the string at the second level is about 1/3 of the length at the first level, versus 1/2 for KA's algorithm.

So I wrote the ka_reduced_x_nzc code. When the reduction kicks in within KA's algorithm, the length of the string passed to the recursion is about 1/10 of the length at the parent level. At the second level, the reduction works, so the time spent at the third and deeper levels is negligible. Thus the time saved by KA's substring sorting and post-recursion scan at the second level is what makes ka_reduced_x_nzc faster than nzc_reduced in the 30M case.

Thus the conclusion is: it's still hard to tell which one is faster. We can combine them to build the fastest code.

Some people in academia surely know that KA's substring-sorting algorithm, as written in the original paper, can be improved just the way NZC did it. But that's not important to academia, because academia only cares about what is new: writing an article against something new, or advertising something old, is not interesting to the academy.

So, what about writing a paper that backports the S-to-L deduction as the substring-sorting step of KA's algorithm, and compares the reduced version of KA's algorithm against sais to advertise that it is better? Maybe it would be accepted, and then we would have a new joke :)

I'm not motivated, though: I'm lazy, and as a programmer, a joke paper won't get me promoted. There is also faster code out there, such as libdivsufsort and Archon4r0; cf. http://code.google.com/p/libdivsufsort/wiki/SACA_Benchmarks.

===== Amendment

I later ran the same benchmark on my new laptop with an i7-3517U. It turns out that ka_reduced outperformed nzc_reduced there; I think it's because memory access is faster on the new processor. Anyway, it's not worth settling which one is better, because we need a much faster O(n) algorithm to outperform the implementations based on IT's algorithm (libdivsufsort, Archon4r0).

And… the idea of deducing the suffix array from the order of the type S suffixes isn't new in KA's paper either. It was already invented in IT's (Itoh and Tanaka's) paper, which appeared in '99.
