First visit to London

I went to London on a business trip, in the spirit of wining and dining on the company's money and, while at it, having a look at the treasures Britain carried off. From before landing all the way to the hotel, the city looked clean and orderly; on a Sunday the streets were full of people, the Tube stations were sensibly spaced, and walking to a station was very convenient. My first impression was even better than Hong Kong.

Having been cooped up in the mountains, I couldn't help lighting up at suddenly being back in a big city. If I had to rank them, London would be the best-feeling city I have visited. But the drawbacks showed up immediately: Monday through Friday, life is just another Beijing. Even walking down the street you can't relax for a moment; you have to dodge the fast-moving, criss-crossing streams of people. There is no signal in the Tube carriages, but in a packed carriage you'd better not pull out your phone anyway: keep one hand free for the handrail.

On the street there are plenty of Chinese, and plenty of non-Chinese Asian faces too. Anyone dressed expensively is, nine times out of ten, a Chinese girl. Everyone else dresses casually, at a standard below that of a Swiss grandmother.

London has a fair-sized Chinatown, but the standard is about the same as Beijing's, which is to say mediocre. That's what happens when there are too many customers: no seating until the whole party has arrived, slow service, being hurried out if you sit too long; whether the food is any good doesn't even enter into it.

Exactly like Beijing.

The London Underground is about as good as old metros abroad get: cramped and shabby, but efficient and clean. A pity the curves are so tight that the screeching is ear-splitting; I had brought earplugs to London, and sure enough they came in handy. Speaking of efficiency: the Tube's escalators are narrow, just wide enough for two, with walkers on the left and standers on the right. Riding down and looking out over the sea of heads, the left lane of every escalator forever clattering with walkers, I couldn't help wondering: is everyone really in that much of a hurry? Then one day, carried along in the crowd to the foot of an escalator, I found I was standing on the left, and realized that in this river you don't get to choose your own pace.

With that feeling came a thought: people, like cats, are animals unwilling to be domesticated. A city designed too well won't last, either. People are good at surviving in the cracks; when the rules are too complete, even the cracks are gone. Switzerland, inconvenient as it is for a Chinese person, feels more like a place to make a home.


Some thoughts about Hong Kong

I have never been to Hong Kong; everything below is hearsay and daydreaming, so feel free to jump in and argue. First, the setting and worldview: you are a tall-rich-handsome with an annual salary of a million-plus HKD. As for Hong Kong prices: apart from housing, which hurts, eating out and the like are quite plebeian (understood in mainland-China, RMB terms). Ordinary people live in shells of under 20 square meters, the kind where you fold the bed up against the wall when you wake, or sleep in bunks. A "luxury" flat in the city center is 70 square meters for ten million. But never mind: in this post we are the million-plus tall-rich-handsome. Hong Kong's income tax is only 15%; subtract daily expenses and we still keep nearly all of it. Taking out a mortgage is casual, too. There is honestly more money than you know what to do with. Of course, spending it is easy enough; the real concern is spending it well.

Basically, on this income, you can eat well at every meal in Hong Kong, so well that what you eat back home starts to look like loser chow. In short, for every need of daily life, money buys the best user experience. Any hobby you'd care to cultivate on the side, most anything an ordinary person could think of (cooking, leisure and fitness, the arts), can presumably also be solved with money. This is, after all, a developed little capitalist island; in mainland China such demands would not be met nearly as well.

So everything looks great; it just feels too materialistic... What kind of spiritual life you pursue is up to you. Still, money really is nice: you can be a dedicated IT/EE gearhead without worrying about affording equipment, or hire a few PhDs to walk you through a paper every week for your amusement. The trouble is that Hong Kong has no leading manufacturing; finance alone is not a future. I don't know whether geeking out on my own would stay interesting enough... Of course, the world is flat now; how far you get is on you.

The biggest problem, though, is women. Gold-diggers? Princess syndrome? All too common... This may look like something the author has no business researching, but we still have to discuss it, because even for a married man it remains a top concern carrying at least 1/4 of the total weight, assuming you will someday have a boy, and you want his future family to be happy and him not to sink into the plebeian class. Hong Kong's wealth gap is simply too wide. Education doesn't seem to be the big problem: much as with school-district housing, as long as the kid then gets into a decent university, he mostly won't fall into the plebeian class, since the share of Hong Kong high-school graduates who go on to university is reportedly not high. So the real trouble lies with women. Of course a badly raised child is a problem too, but a child is at least half shaped by the woman, so the root problem is still the woman.

Watch some Hong Kong TV dramas and you'll pretty much get it. There is invariably a second-generation rich kid who only knows how to squander, and an old woman who pursues nothing but status and lives entirely to be envied by others. "Why aren't nerds popular?" (on Xiaonei/Renren) is a pretty good article about much the same problem.

Women were children once too, so in a society like this, education really is a serious matter. Look: even you no longer know how to spend the million, so how will you educate a child?

Especially in this commodity society, with the marketing targeted at women (the poison) and the merchandise itself (the drug), plus TV and SNS, those brainwashing boxes spreading vulgar values: one lapse of attention and a person loses the ability to think, ending up exactly as described in the "why aren't nerds popular" article.

So I think that even given the chance, I still wouldn't plan to settle in Hong Kong. Perhaps 1/2 of it is under your own control, and another 1/4 controllable with effort, but the 1/8, the 1/16, ... are beyond you. I would hate to see, or even imagine, the sorrow of a "family in decline", above all one finally ruined at the hands of a woman who shares none of your blood.

Anyone who has seen a few American movies will have noticed that America likes to export its values, nakedly

These values come in just a few flavors, repeated from film to film. Which is why I don't much care for American movies.

For someone who grew up in China, this stuff eventually leaves you speechless... "Are Americans naive?" "Is this American-style brainwashing?"

Before reading the post below, please first go and enjoy Family Guy S11E09.

Right. Next, please enjoy a picture:

[Photo: Palo Alto Airport]

This is Palo Alto Airport, a public airport; the aircraft parked here are all private. There are flying clubs nearby: you can rent a plane from a club, or hire an instructor to teach you to fly. With a license, flying requires no application to any authority; just queue for the runway, and take off whenever, to wherever, you please. At any public airport you contact the ground, get in line, and land. Public airports are as common and unremarkable as parking lots. Coming from home, you park your car at your plane's spot and pull the plane out; when you return, you drive home again: simple and efficient. And basically any normal person can learn to fly and get the license.

It was while wandering around Palo Alto Airport that I realized FG S11E09 isn't deliberately exporting values at all: this simply is American life, the American dream. It is a miraculous country, so real it all looks fake.

How did the world become just like mathematics

Whatever applied research you are doing, the moment you find you need some piece of mathematics, you almost always discover that the thing you need is already sitting there: someone, for who knows what purpose, however many years ago, proved or constructed exactly what you require.

And whenever some IT company needs to analyze text, audio, or video, or build recommender systems, ..., the Masters and PhDs of the relevant fields are already lined up by the handful, waiting.

How alike.

At times like this I wish I could evolve my neurons, from linear to Bayesian... it seems that's what it takes to survive...

The frightening thing is that for the simple things, the foundational things, the work to be done on top of them is finite. Once the hares are dead, the hounds get cooked. Only "mathematics" can never be finished. You discover that if you are not good at "mathematics", you can only ever do 1/N. And although "mathematics" can never be finished... strangely, there are also too many people "doing mathematics".

Suppose you are a "non-math" person who believes that building usable things matters more. You look over and find the "math" side awfully complex... (of course, "complex" is precisely their fortress). Suppose you mean to someday fight "math" to the finish: when you come out of seclusion and stand before them again, you will find there are so many of them, and you are self-taught, with no publications, still shut out. You recognize that your foothold is 1/N, so there is only one road: be professional, level up, hold your title (experience, write-ups) as a shield, and, when you have spare time, build some good tools to gild yourself further.

Ah. Actually, it still looks pretty good.

On the "math" side, it seems that merely being able to program, or rather, a programmer with some machine learning skills, can find a job fairly easily, because such positions outnumber the pure coding ones.

What I don't quite understand is what force produces so many overqualified people who, as in mathematics, build things decades or even a century ahead of their time.

Thinking about it, the demand is real. And perhaps because so many people find the existing work uninteresting and prefer to "look ahead", the crowd is no surprise. I suspect it reflects something like a state of "knowledge explosion"; in other words, the simple things have mostly been done (look closely and quite a few remain, it's just that the compromises required keep piling up).

Suffix Array Construction: KA’s algorithm and Induced Sorting, which is faster?

In mid September, I started writing an internal tool for my employer, which requires a suffix tree. So I went to check whether any new, faster algorithms had appeared beyond the O(n) skew algorithm for building suffix arrays and the O(n) Ukkonen's algorithm for building suffix trees online.

I found only a few publications on suffix arrays in the 9 years between 2003 and 2012, which seems to indicate that the field is inactive. It's also funny that the skew algorithm has acquired a new name, "DC-3". The publications discussed in this article are:

  • Pang Ko and Srinivas Aluru, Space Efficient Linear Time Construction of Suffix Arrays, 2003;
  • Ge Nong, Sen Zhang and Wai Hong Chan, Two Efficient Algorithms for Linear Suffix Array Construction, 2008.

I was immediately attracted to KA's algorithm after reading the first 2 pages of their paper. Their idea is so smart. The most unfortunate thing is that they stopped just one step short of the fastest O(n) suffix array algorithm. Five years later, NZC proposed their Induced Sorting algorithm, an improvement built on KA's algorithm.

According to a WADS'11 paper, http://arxiv.org/abs/1101.3448, the fastest O(n) suffix array implementation is maintained by a Japanese researcher/programmer at https://sites.google.com/site/yuta256/sais. It is an implementation of NZC's algorithm, i.e. the induced sorting algorithm. His code is really cool; I'll talk about it later.

Although yuta256 benchmarked against KA's algorithm, he didn't notice the potential power hidden in it, which could beat his sais-lite in the future.

As a preliminary, let me introduce the ideas of KA's algorithm and NZC's algorithm first. I already wrote a post introducing them in Chinese at https://yangzhe1990.wordpress.com/2012/09/14/suffix-array-suffix-tree/; you can read that instead if you prefer Chinese.

We start by defining type S and type L suffixes. If s[i:] < s[i+1:], then s[i:] is a type S suffix; let's say i is type S for short. Otherwise i is type L. Position length - 1 is type L, because it is compared against $, and $ is always smaller. Position length itself, i.e. $, is defined to be type S.
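To make the definition concrete, here is a minimal sketch (mine, in Python, not from either paper) that computes the type of every position in a single right-to-left scan; equal characters simply inherit the type of the position after them:

def classify(s):
    # t[i] is the type of position i; position len(s) is the sentinel $,
    # defined to be type S
    n = len(s)
    t = ['S'] * (n + 1)
    if n > 0:
        t[n - 1] = 'L'          # the last real suffix compares against $, hence type L
    for i in range(n - 2, -1, -1):
        if s[i] < s[i + 1]:
            t[i] = 'S'
        elif s[i] > s[i + 1]:
            t[i] = 'L'
        else:
            t[i] = t[i + 1]     # equal chars: the comparison is decided further right
    return t

print(classify("abacabad"))     # ['S', 'L', 'S', 'L', 'S', 'L', 'S', 'L', 'S']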

We can insert all the suffixes into buckets: a suffix starting with character x goes into bucket x. A type S suffix is larger than xxx... (x repeated forever), and similarly a type L suffix is smaller than xxx.... Thus, within any bucket, the type L suffixes are smaller than the type S suffixes.

If we know the order of all type S suffixes, the following algorithm sorts all the type L suffixes. Scan the buckets in increasing order. The first suffix we meet is $, the smallest suffix. For each suffix i we meet, if suffix i - 1 is type L, it must be the smallest not-yet-placed suffix of its bucket, so we move it to the current head of its bucket and advance that head, marking its rank as determined. Then we continue with the second suffix we meet, then the third, the fourth, .... We end up with a sorted suffix array.

To prove correctness, we only need to show that at the i-th step, the i smallest suffixes are already in their right positions. Initially, the ranks of all type S suffixes among all suffixes are determined. At the i-th step, the i-th suffix is either type S or type L. If it is type S, it is the i-th smallest by assumption. Otherwise it is type L, say s[p:]; then s[p-1:] < s[p:], so by the induction hypothesis there is some j < i such that at the j-th step we used s[p-1:] to update s[p:], and that update fixed the rank of s[p:] at i, the rank of the smallest unsettled suffix in its bucket. So s[p:] is the i-th smallest suffix.
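Continuing the sketch above (and reusing classify), this is roughly how the S-to-L deduction looks in code. The type S suffixes are sorted by brute force here purely for illustration; in the real algorithm their order comes from the recursion:

from collections import Counter

def induce_L(s):
    n = len(s)
    t = classify(s)
    # per-character bucket head (filled by type L) and tail (filled by type S);
    # slot 0 is reserved for $
    cnt = Counter(s)
    head, tail, p = {}, {}, 1
    for c in sorted(cnt):
        head[c], tail[c] = p, p + cnt[c] - 1
        p += cnt[c]
    sa = [-1] * (n + 1)
    sa[0] = n                                    # $ is the smallest suffix
    # place the (already sorted) type S suffixes at their bucket tails
    for i in reversed(sorted((i for i in range(n) if t[i] == 'S'),
                             key=lambda i: s[i:])):
        sa[tail[s[i]]] = i
        tail[s[i]] -= 1
    # one left-to-right scan settles every type L suffix
    for k in range(n + 1):
        i = sa[k]
        if i > 0 and t[i - 1] == 'L':
            sa[head[s[i - 1]]] = i - 1
            head[s[i - 1]] += 1
    return sa

s = "abacabad"
assert induce_L(s) == sorted(range(len(s) + 1), key=lambda i: s[i:])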

So we only need to sort all the type S suffixes. We can do that by recursion, provided we can divide the original string into pieces such that each piece corresponds to a type S suffix.

For example, if we want to know the order of all the even suffixes of abacabad, we only need to cut it into ab|ac|ab|ad and solve the suffix array problem for the length-4 string ab|ac|ab|ad over the alphabet {ab, ac, ad}. But this method only works if no piece is a prefix of any other piece.

So how can we sort the new alphabet when one piece is a prefix of another? Let's look at what a piece looks like in our case. Rewrite the string by replacing each char with its type; we get something like LSLLSSLSLSSL$. If an SL piece is a prefix of an SLL piece, we have to compare the third char, so we end up comparing an SLS against an SLL. If their third chars differ, we're done; if they're equal, recall the property that within a bucket L comes before S, and we can deduce that SLL < SLS.

So let's define a type S substring to be the substring between an S position and the next S position, including both the leading and the terminating S. We can sort these substrings by two rules: 1. the alphabetically smaller char comes first; 2. for the same char, type L comes first. Once the substrings are sorted, we can solve the problem recursively. The length of the new string in the reduction is the number of type S positions, which we can always assume is <= n/2; otherwise we use the type L positions in the recursion instead.
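Here is a compact sketch of the reduction (mine again, reusing classify; the real algorithms name the substrings in linear time, whereas this sketch leans on comparison sorting just to exhibit the order). The pair (character, type), with 'L' < 'S' as strings, encodes exactly the two rules above:

def reduce_string(s):
    n = len(s)
    t = classify(s)
    pos = [i for i in range(n + 1) if t[i] == 'S']       # type S positions, $ included
    def key(a, b):
        # the type S substring s[a..b]; for equal chars, ('x', 'L') < ('x', 'S'),
        # i.e. type L comes first, as rule 2 demands
        return [((s[k] if k < n else ''), t[k]) for k in range(a, b + 1)]
    subs = [key(pos[j], pos[j + 1]) for j in range(len(pos) - 1)]
    order = sorted(range(len(subs)), key=lambda j: subs[j])
    names, name = [0] * len(subs), 0
    for r, j in enumerate(order):
        if r > 0 and subs[j] != subs[order[r - 1]]:
            name += 1
        names[j] = name
    return names                  # the shorter string handed to the recursion

print(reduce_string("abacabad"))  # [0, 1, 0, 2] for the substrings at positions 0, 2, 4, 6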

Now the problem becomes: how do we sort the substrings in linear time? Unfortunately, KA had already designed the best algorithm to solve this, yet missed it; the algorithm written in their paper is really slow. Note that each substring looks like SL...LS, with zero or more L's between the two S's. Define each of L...LS, ..., S to be a partial-substring, i.e. a suffix of the substring. We can sort all the partial-substrings with the earlier algorithm that deduces the ranks of the type L suffixes from the ranks of the type S suffixes, because by merely inserting all the type S partial-substrings into their buckets, the ranks of the type S partial-substrings are already determined! After the S-to-L deduction, the L...LS, ..., S partial-substrings are determined as well. Then, looping over every partial-substring that starts immediately after an S position, a.k.a. the tail of a full substring, and inserting that substring into its bucket, we have sorted all the substrings.
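To pin the definition down, a tiny sketch (mine) that enumerates the partial-substrings; for abacabad (with $ appended) the pieces aba, aca, aba, ad$ yield ba, a, ca, a, ba, a, d$, $:

def partial_substrings(s):
    t = classify(s)                      # from the first sketch
    text = s + '$'
    pos = [i for i in range(len(s) + 1) if t[i] == 'S']
    out = []
    for a, b in zip(pos, pos[1:]):       # each type S substring s[a..b]
        for k in range(a + 1, b + 1):    # its suffixes L...LS, ..., S
            out.append(text[k:b + 1])
    return out

print(partial_substrings("abacabad"))    # ['ba', 'a', 'ca', 'a', 'ba', 'a', 'd$', '$']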

NZC's algorithm is based on KA's. They define a type S* position to be a type S position that comes right after a type L position; an S position right after another S is useless in the S-to-L deduction above. The S* positions alone are enough to determine the ranks of all type L suffixes. And once the type L suffixes are determined, we can determine the ranks of all type S suffixes from them: the L-to-S deduction is just a reverse-scan version of the S-to-L deduction.

The substrings in NZC's algorithm look like S...SL...LS. After we sort the L...LS, ..., S partial-substrings, we can sort SL...LS, SSL...LS, ..., S...SL...LS with an L-to-S deduction.

The number of type S* positions is at most half the length, and its expectation is about a third of the length. So their algorithm may be faster, because there are fewer levels of recursion and the subproblems are smaller.
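A quick empirical check of those counts (my own throwaway sketch, again reusing classify); on random input the S* fraction does come out near 1/3:

import random
random.seed(0)
s = ''.join(random.choice('abcdefghijklmnop') for _ in range(100000))
t = classify(s)
n_s = sum(1 for i in range(len(s)) if t[i] == 'S')
n_s_star = sum(1 for i in range(1, len(s)) if t[i] == 'S' and t[i - 1] == 'L')
print(n_s / len(s), n_s_star / len(s))   # roughly 0.5 and 0.33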

For people who haven't discovered how to sort the substrings efficiently in KA's algorithm, NZC's algorithm is always better. But now, can you tell which is better?

In KA's algorithm, sorting the substrings takes one scan. In this scan, when we meet a suffix i such that suffix i - 1 is type L, we determine the rank of the partial-substring starting at i - 1; otherwise, if i - 1 is type S, we determine the rank of the substring starting at i - 1. And after the recursion, KA's algorithm requires only one more scan.

In NZC's algorithm, sorting the substrings takes two scans, and the post-processing after the recursion is also two scans.

So KA's algorithm gives T(n) = T(n/2) + n and NZC's gives T'(n) = T'(n/3) + 2n; which runs faster? Solving them, T(n) = 2n while T'(n) = 3n. But this is an over-simplified computation: many other operations, such as scanning the string, building the buckets, and computing the new string for the recursion, take the same amount of time in both algorithms.
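For the record, the worked expansion of the two recurrences (a plain geometric series in each case):

T(n)  = \sum_{k \ge 0} \frac{n}{2^k}  = 2n, \qquad
T'(n) = \sum_{k \ge 0} \frac{2n}{3^k} = 3n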

We have to compare them by implementing both algorithms in the best way. Let's talk about implementation now.

At first, I wrote slow code whose running time was twice that of yuta256's sais. Then I found out why I was slower than him:

  1. The inverse of the suffix array is not necessary for computing the suffix array; the extra time spent updating it is too much.
  2. Random access into a large array is expensive, so boolean arrays such as "is this substring equal to the next", "is this char type S or type L", "is the previous char type S or type L" can be packed into the suffix array or deduced on the fly.
  3. Updating bucket heads is expensive. Yuta caches the current bucket: his code accumulates the updates and writes the current bucket back only when it has to touch another bucket.

In both algorithms, we can also shrink the recursion by removing every char that appears only once and doesn't follow a char appearing twice or more, because such a char has no impact on the order of the suffixes. This reduction takes some time of its own: when it shrinks the problem from n/3 or n/2 down to n/10 or n/100, it's worth it; otherwise it isn't. For random input, when alphabet^2 = o(n) the reduction doesn't help; when alphabet = O(n), it does.

yuta256's sais is an optimized implementation of the non-reduced version of NZC's algorithm. I implemented KA's and NZC's algorithms, both reduced and non-reduced, with the same optimizations. My non-reduced NZC code runs at the same speed as yuta256's, so the benchmark is objective.

For a 10M random string with alphabet size 16:

ka:               4.35, 3.80, 3.85, 3.84
sais:             3.53, 3.55, 3.52, 3.53
ka_reduced:       3.29, 3.19, 3.25, 3.25
nzc_reduced:      3.18, 3.16, 3.17, 3.17
ka_reduced_x_nzc: 3.18, 3.18, 3.19, 3.19

where ka means KA's algorithm without the reduction, and ka_reduced means the reduction is applied when it's worth doing. To keep things simple, I use the reduction inside the recursion but not at the top level; nzc_reduced is analogous. ka_reduced_x_nzc consists of NZC's substring partitioning and sorting at the top level, with KA's algorithm in the recursion.

For a 20M random string with alphabet size 16:

ka:               9.32, 8.05, 9.33, 9.28
sais:             7.87, 7.87, 7.90, 7.90
ka_reduced:       7.17, 7.00, 7.22, 7.10
nzc_reduced:      6.85, 6.87, 6.83, 6.84
ka_reduced_x_nzc: 6.94, 6.95, 7.01, 6.94

And for 30M:

ka_reduced:       11.25, 11.21, 11.20
nzc_reduced:      10.74, 10.74, 10.73
ka_reduced_x_nzc: 10.46, 10.44, 10.48

In summary, the reduction really is faster when it's "worth it". NZC's substring sorting and post-recursion run slower than KA's, but NZC's algorithm is often faster overall, because the string at the second level is about 1/3 of the length at the first level, versus 1/2 in KA's algorithm.

That's why I wrote the ka_reduced_x_nzc code. When the reduction kicks in within KA's algorithm, the string passed to the recursion is about 1/10 of the parent level's length. At the second level the reduction works, so the time spent at the third and deeper levels is negligible. Hence the time saved by KA's substring sorting and post-recursion at the second level makes ka_reduced_x_nzc faster than nzc_reduced in the 30M case.

So the conclusion is: it's still hard to tell which one is faster. We can combine them to build the fastest code.

Some people in academia surely know that KA's substring-sorting algorithm, as given in the original paper, can be improved exactly the way NZC did it. But that doesn't matter to academia, because academia cares only about something new; writing an article against something new, or advertising something old, is of no interest to the academy.

So, what about writing a paper that backports the S-to-L deduction as the substring-sorting step of KA's algorithm, and compares the reduced version of KA's algorithm against sais to advertise that it's better? Maybe it would be accepted, and then we'd have a new joke :)

I'm not motivated, because I'm lazy, and as a programmer a joke paper won't get me promoted. There is also faster code out there, such as libdivsufsort and Archon4r0; c.f. http://code.google.com/p/libdivsufsort/wiki/SACA_Benchmarks.

===== Amendment.

I later ran the same benchmark on my new laptop with an i7-3517U. It turns out that ka_reduced outperformed nzc_reduced; I think that's because memory access is faster on the newer processor. Anyway, it is not worth settling which one is better, because we'd need a much faster O(n) algorithm to outperform the implementations based on IT's algorithm (libdivsufsort, Archon4r0).

And... the idea of deducing the SA from the order of the type S suffixes isn't new in KA's paper either; it had already been invented in IT's paper, which appeared in '99.

The easiest suffix tree (suffix array) algorithm to write?

Actually I just want to drop a few links here so I can close the related Chrome tabs (to save memory).

It has been at least 6 years since I last wrote a suffix array, and some time in the past half year I heard someone mention a 2004-ish algorithm that the OI crowd doesn't really use. Now I happen to need a suffix tree, and I am absolutely not masochistic enough to attempt Ukkonen, so, hoping there might be something simpler than the Ukkonen suffix tree, I went to see what's new in suffix trees... Regrettably, there seems... to... be... nothing............ For suffix arrays there are one or two things, though.

http://homepage3.nifty.com/wpage/links/suffix_trees.html I don't know how complete this guy's list is; nothing on it looks very interesting anyway. I'll take another look someday when I'm bored beyond endurance... (I suspect I won't be, for years)

And since I've decided to write this post, I might as well do a bit of exposition...

SA used to have an algorithm that groups the string into threes, recurses on the suffixes starting at offsets 1 and 2, and then merges the offset-0 suffixes into the sorted result (Juha Karkkainen and Peter Sanders 2003, a.k.a. the DC-3 algorithm?). Back then I felt it wouldn't be pleasant to write: three radix passes plus a recursion on 2/3 of the string looked like a big constant factor, so I never wrote it. (Which is why I had always written the O(n log n) one, and, to save effort, even the O(n log^2 n) one. This disease needs treatment, but by now it's chronic.)

This Pang Ko and Srinivas Aluru one looks good. Briefly: if s[i:] < s[i+1:] then i is labeled small, otherwise large; s[len:], i.e. $, gets a label too. The labels are computed in one backward scan.

Then sort all the suffixes by their first character (radix sort, though let's say bucket sort here, because we won't throw the buckets away afterwards). Now, a property: among suffixes starting with the same character, the large-labeled ones always come before the small-labeled ones. Indeed, if the first character is x, the large ones are < xxx... and the small ones are > xxx..., so you can picture the large ones leaning left and the small ones leaning right. Suppose the large suffixes are more numerous; then we use the recursion to sort the small ones, and vice versa. To recurse, we have to compress the stretches in between into a new alphabet (variable length is a big problem, though). So the sorting is a subproblem; let's first deal with what happens after it is solved.

If the small suffixes are all in order, then the small suffixes in each original bucket can also be ordered in linear time (go through them backwards, drop each at the current end of its own bucket, and fix its position). That leaves sorting the large suffixes. A large suffix has a quirk: the suffix right after it must rank before it. The first bucket is really $, containing only the suffix $. Read each bucket and the suffixes inside it in order; for the current suffix s[i:], if s[i-1:] is a large-labeled suffix, move s[i-1:] to the first free slot of its bucket and fix it there. This is correct because everything we have read so far is already in order, and we adjust s[i-1:] immediately based on what has been read; every adjustment puts the adjusted suffix in its correct position, because within one bucket, ordering the suffixes by everything from their second character onwards agrees with the overall suffix order. And since a large suffix is always adjusted by something read before it, it is guaranteed to be sitting in its correct position by the time we reach it.

Back to preparing the recursion. The biggest problem in turning the in-between substrings into a new alphabet is that, with variable-length strings, if one is a prefix of another, you're sunk. The fix is to define each substring to include the first character of the next small suffix: if two adjacent small suffixes start at positions i and j, the substring is s[i..j]. Then, if one substring is a prefix of another, the long one reads small-large-large-large-large-large-small while the short one reads small-large-large-small; by the earlier rule that large leans left of small, the short one goes later and the long one earlier, and their order is settled.

The remaining question is how to sort these variable-length strings; this was honestly the first time in my life I had seen a bucket sort that adds new buckets as it goes! Achieving linear time is left to you; worst case, use linked lists. Actually, you can let the last element of a bucket stand for the bucket. Concretely: throw all the characters into (physical) buckets, then move each character to row k if it is the k-th character of its substring, drawing a vertical line between characters that came from different buckets. Using the partition of the second row, refine the buckets of the first row (how to do this needs no further explanation), then refine the first row again using the third row's partition, ... and so on. With the new alphabet built, we can recurse. This bucket structure has rather nice properties, practically a tree already. Then, only buckets holding more than one substring need sorting; unfortunately not much can be optimized here, beyond collapsing a run of consecutive identical single-character buckets in the original string into one, and shifting things within the current buckets after the recursion returns. (Skip that optimization and you never need to move anything inside a bucket; weigh the trade-off yourself.)

The list linked above contains Ge Nong, Sen Zhang and Wai Hong Chan, Two Efficient Algorithms for Linear Suffix Array Construction, 2008, which is really a small improvement on this algorithm. If the original cuts the string at every small suffix, this method merely stops cutting between consecutive small suffixes, since those either begin with different characters or have a known order within the same bucket. That is, one rule is added: no cut between two adjacent smalls. Call the small suffixes where we still cut the non-degenerate small suffixes. If the non-degenerate small suffixes can be sorted by recursion, they can be used to sort the large suffixes just as in the earlier algorithm, and the proof of correctness is equally obvious: in the original algorithm, an adjustment happens only when the suffix before a small suffix is large, and we have merely excluded the degenerate smalls. The sorted large suffixes can then, in turn, sort all the small suffixes.

The benefit of this algorithm is that the number of non-degenerate small suffixes is n/2-ish, so there's no more case split on whether small or large suffixes are more numerous, which simplifies the code quite a bit. But the extra pass that re-sorts the small suffixes from the large ones is a slight pain and adds to the constant factor. Also, substring sorting has to be discussed again: what if one substring is a prefix of another and both read small-small-...-small, differing only in length (the case containing a large is as before)? In fact this can't happen, because a substring must contain a large, otherwise we wouldn't have cut there; so we just compare, with large going first, and we're OK.

The substring sorting algorithm needn't really change either, but they say they devised a new sorting method. The idea: you want to compare the second character, the third character? Don't. The post-recursion sorting procedure from before is roughly right as it is: instead of comparing second and third characters, use the shorter piece to update the piece that is one character longer, the same trick as updating s[i-1:] from s[i:], except that for similar reasons it has to be run twice. The principle: a substring has the shape small...small-large...large-small. Initially the trailing smalls are placed correctly (small/large go to the small/large part of their bucket); after one pass, the large...large-small parts are correct; after another pass, small...small-large...large-small is correct too. For the original KA03 algorithm, whose substrings look like small-large...large-small, a single left-to-right pass suffices; note that during this pass the large part and the small part of a bucket must be treated separately, and s[i-1:] must be repositioned according to s[i:] whether s[i-1:] is large or small.

Then the paper roughly sketches a "new algorithm" that seems to chop long substrings into pieces of size d; either way, after chopping, the number of substrings is still on the order of n/2, so the substrings can simply be radix sorted. I really had no interest in reading any further.

Two pages in, this thing gave me a strong whiff of "Chinese academic ethics at rock bottom"... It would make a fine blog post, and the practical results are genuinely good, but as a paper it feels very wrong, because the concepts are all the same as [KA03]; if you ran anti-plagiarism software over it, I fear it honestly wouldn't pass. Objectively speaking, it's filler... Presenting the improvement honestly would be fine; absorbing the original wholesale is not. Of course, if it went to a lower conference, that's forgivable. But that "forgivable" theory reminds me of how you mustn't pollute freely abroad yet may in China; look at Beijing's sky... In short, it left a very bad taste. Before opening the paper I had no prejudice; I never assumed a paper must be China-flavored just because its authors are Chinese... And of course, it's also possible the authors simply didn't realize that you should avoid carrying other people's material straight over.

zgl (you haven't seen the id cmdButtons in a while, so let me boost its presence) sent a link, http://arxiv.org/abs/1101.3448, saying the LCP array can be computed together with this algorithm, without the lcp(s[i:], s[next(i):]) >= lcp(s[i-1:], s[next(i-1):]) - 1 method, and faster. I'll spare myself the pain...
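For reference, the inequality quoted above is the heart of the classic Kasai et al. LCP construction; here is a minimal Python sketch of it (mine, not from the linked paper), using the same sa layout as before, with $ at sa[0]:

def kasai_lcp(s, sa):
    # lcp[r] = length of the common prefix of the suffixes at sa[r - 1] and sa[r]
    n = len(s)
    rank = [0] * len(sa)
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * len(sa)
    h = 0
    for i in range(n):                 # visit the suffixes in text order
        r = rank[i]
        j = sa[r - 1]                  # r >= 1, since rank[n] == 0 ($ is smallest)
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1                     # the inequality: lcp drops by at most 1
    return lcp

s = "abacabad"
sa = sorted(range(len(s) + 1), key=lambda i: s[i:])
print(kasai_lcp(s, sa))                # [0, 0, 3, 1, 1, 0, 2, 0, 0]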

Recursive DNS servers within the territory of the People's Republic of China shall be configured with the China national DNSSEC trust source as the trust anchor

I just came across the MIIT's "Communications industry standard of the People's Republic of China: technical requirements for the domain name service security framework". At first glance, what an advanced table of contents! Can it be real? Yep, it's real! http://www.ccsa.org.cn/publish/download_bp.php?stdtype=yd1&sno=112

Then I read the sentence in the title, and I understood...