Two-way string-matching algorithm

In computer science, the two-way string-matching algorithm is an efficient string-searching algorithm that can be viewed as a combination of the forward-going Knuth–Morris–Pratt algorithm and the backward-running Boyer–Moore string-search algorithm. Maxime Crochemore and Dominique Perrin invented this algorithm in 1991.[1] The preprocessing time is linear to the needle size. It has a linear worst-case performance at 2n-m comparisons.[2] Breslauer has two improvements with fewer comparisons: one with constant space and n+floor(1+eps/2 * (n-m)) comparisons, the other with log(m) space and n + floor((n-m)/2) comparisons.[3]

ClassString-searching algorithm
Data structureAny string with an ordered alphabet
Worst-case performanceO(n)
Best-case performanceO(n)
Worst-case space complexityO(1)

As with KMP and BM, the algorithms utilizes shifts based on partially repeating periods in the pattern. However, it does so via partitioning (critical factorization) of the needle into two halves, so that only one value needs to be remembered from preprocessing.

The algorithm is considered fairly efficient in real-world conditions, being cache-friendly and containing operations amenable to replacement by library functions. It is selected as the glibc (and the derived newlib; str-two-way.h) and musl algorithm for the memmem and strstr family of substring functions.[4][5][6] However, as with most advanced string-search algorithms, there tends to be a break-even point in the size of both the haystack and the needle, before which a naive quadratic (memchr-memcmp) implementation is more efficient.[7] Glibc provides the Breslauer algorithm in both forms.[6]

Critical factorization

Before we define critical factorization, we should define:[1]

  • Factorization: a string is considered factorized when it is split into two halves. Suppose a string x is split into two parts (u, v), then (u, v) is called a factorization of x
  • Period: a period p for a string x is defined as a value such that for any integer 0 < i ≤ |x| - p, x[i] = x[i + p]. In other words, "p is a period of x if two letters of x at distance p always coincide". The minimum period of x is a positive integer denoted as p(x).
  • A repetition w in (u, v) is a substring of x such that:
    • w is a suffix of u or the other way around;
    • w is a prefix of v or the other way around;
    In other words, w occurs on both sides of the cut with a possible overflow on either side. Each factorization trivially has at least one repetition, the original string x.[2]
  • A local period is the length of a repetition in (u, v). The smallest local period in (u, v) is denoted as r(u, v). For any factorization, 0 < r(u, v) ≤ |x|.
  • A critical factorization is a factorization (u, v) of x such that r(u, v) = p(x). For a needle of length m in an ordered alphabet, it can be computed in 2m comparisons, by computing the lexicographically larger of two ordered maximal suffixes, defined for order ≤ and ≥.[6]

The algorithm

The algorithm starts by critical factorization of the needle as the preprocessing step. This step produces the index (starting point) of the periodic right-half, and the period of this stretch. The suffix computation here follows the authors' formulation. It can alternatively be computed using the simpler Duval's algorithm, which is slower but also linear time.[8]

Shorthand for inversion.
function cmp(a, b)
    if a > b return 1
    if a = b return 0
    if a < b return -1

function maxsuf(n, rev)
    l ← len(n)
    p ← 1       currently known period.
    k ← 1       index for period testing, 0 < k <= p.
    j ← 0       index for maxsuf testing. greater than maxs.
    i ← -1      the proposed starting index of maxsuf
    
    while j + k < l
        cmpv ← cmp(n[j + k], n[i + k])
        if rev
            cmpv ← -cmpv   invert the comparison
        if cmpv < 0
            Suffix (j+k) is smaller. Period is the entire prefix so far.
            j ← j + k
            k ← 1
            p ← j - i
        else if cmpv = 0
            They are the same - we should go on.
            if k = p
                We are done checking this stretch of p. reset k.
                j ← j + p
                k ← 1
            else
                k ← k + 1
        else
            Suffix is larger. Start over from here.
            i ← j
            j ← j + 1
            p ← 1
            k ← 1
   return [i, p]

function crit_fact(n)
    [idx1, per1] ← maxsuf(n, false)
    [idx2, per2] ← maxsuf(n, true)
    if idx1 > idx2
        return [idx1, per1]
    else
        return [idx2, per2]

The comparison proceeds by first matching for the right-hand-side, and then for the left-hand-side if it matches. Linear-time skipping is done using the period.

function match(n, h)
    nl ← len(n)
    hl ← len(h)
    [l, p] ← crit_fact(n)
    P ← {}                  set of matches. 
    
    Match the suffix.
    Use a library function like memcmp, or write your own loop.
    if h[0] ... h[l] = h[p] ... h[l+p]
        P ← {}
        pos ← 0
        s ← 0
        
    TODO. At least put the skip in.
gollark: Haskell would do it far more "based"ly.
gollark: So the obvious approach would be to selfhost your own custom ABR instance.
gollark: Well, ABR's reminders are admittedly not optimal for safety-sensitive reminders and also are not repeating.
gollark: ENTITY 1048-001T IS AWAKENING
gollark: ÆÆÆÆ

References

  1. Crochemore, Maxime; Perrin, Dominique (1 July 1991). "Two-way string-matching" (PDF). Journal of the ACM. 38 (3): 650–674. doi:10.1145/116825.116845.
  2. "Two Way algorithm".
  3. Breslauer, Dany (May 1996). "Saving comparisons in the Crochemore-Perrin string-matching algorithm". Theoretical Computer Science. 158 (1–2): 177–192. doi:10.1016/0304-3975(95)00068-2.
  4. "musl/src/string/memmem.c". Retrieved 23 November 2019.
  5. "newlib/libc/string/memmem.c". Retrieved 23 November 2019.
  6. "glibc/string/str-two-way.h".
  7. "Eric Blake - Re: PATCH] Improve performance of memmem". Newlib mailing list.
  8. Adamczyk, Zbigniew; Rytter, Wojciech (May 2013). "A note on a simple computation of the maximal suffix of a string". Journal of Discrete Algorithms. 20: 61–64. doi:10.1016/j.jda.2013.03.002.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.