Implement glob Matcher

15

1

Implement a function of pattern and string to be matched, return true if the pattern matches the WHOLE string, otherwise false.

Our glob pattern syntax is:

  • ? matches any one character
  • + matches one or more characters
  • * matches zero or more characters
  • \ escapes

Rules:

  • No eval, no converting into regular expression, no calling a system glob function.
  • I/O are not required: you can just write a function
  • Shortest wins

Examples:

glob('abc', 'abc') => true
glob('abc', 'abcdef') => false IMPORTANT!
glob('a??', 'aww') => true
glob('a*b', 'ab') => true
glob('a*b', 'agwijgwbgioeb') => true
glob('a*?', 'a') => false
glob('?*', 'def') => true
glob('5+', '5ggggg') => true
glob('+', '') => false
glob('a\*b', 'a*b') => true

Here is a tip to get started: http://en.wikipedia.org/wiki/Backtracking

Ming-Tang

Posted 2011-02-04T06:04:50.113

Reputation: 5 383

Some examples on escaping please? ("") – Eelvex – 2011-02-14T13:40:16.670

1May I suggest an additional tag "pattern-matching"? – dmckee --- ex-moderator kitten – 2011-02-04T20:45:55.547

1Could you clarify what you mean by "no standard function"? That you can't call functions from the standard library? How is that supposed to work? – sepp2k – 2011-02-05T04:23:11.887

Answers

4

Golfscript - 82 chars

{1,\@n+:|;{:<;{:I)I|="\\+*?"[<]+?[{|=<={I))}*}I~{I\C}{}.{;}]=~}:C%}/{|>'*'-n=},}:g

Assumes that there are no newlines in the strings. Returns an empty array for false, and a non-empty array for true (consistent with golfscript definition of true/false).

This is a non-recursive solution (except for consecutive *s), which maintains a list of the positions in the pattern string i such that pattern[0..i] matches string[0..cur].

This has the potential to run for a very long time. You can add .& after :C% to prevent this.

Nabb

Posted 2011-02-04T06:04:50.113

Reputation: 2 540

5

Haskell, 141 characters

c('\\':a:z)s=a&s>>=c z
c(a:z)s=a%s>>=c z
c[]s=[null s]
p&(a:z)|a==p=[z]
_&_=[]
'?'%(a:z)=[z]
'*'%a=a:'+'%a
'+'%(a:z)='*'%z
l%a=l&a
g=(or.).c

Works for all input, both patterns and strings to match against. Handles trailing backslash in the pattern as a literal match (behavior was unspecified.)

This can be run with the following test driver:

main = do
    globtest "abc" "abc"    True
    globtest "abc" "abcdef" False
    globtest "a??" "aww"    True
    globtest "a*b" "ab"     True
    globtest "a*b" "agwijgwbgioeb" True
    globtest "a*?" "a"      False
    globtest "?*" "def"     True
    globtest "5+" "5ggggg"  True
    globtest "+" ""         False
    globtest "a\\*b" "a*b"  True
  where
    globtest p s e =
      if g p s == e
        then putStrLn "pass"
        else putStrLn$"fail: g " ++ show p ++ " " ++ show s ++ " /= " ++ show e

Update: I wrote a blog post about this particular answer, as I think it shows off well how Haskell so easily encodes the problem.


  • Edit: (181 -> 174) replaced d and m with operators
  • Edit: (174 -> 151) inlined r into c
  • Edit: (151 -> 149) removed an unnecessarily generated option in the + case
  • Edit: (149 -> 141) removed an unnecessary clause for %, which was handled by &

MtnViewMark

Posted 2011-02-04T06:04:50.113

Reputation: 4 779

2

C function - 178 necessary characters

Compiled with GCC, this produces no warnings.

#define g glob
int g(p,s)const char*p,*s;{return*p==42?g(p+1,s)||(*s&&g(p,
s+1)):*p==43?*s&&(g(p+1,++s)||g(p,s)):*p==63?*s&&g(p+1,s+1)
:*p==92?*++p&&*s++==*p++&&g(p,s):*s==*p++&&(!*s++||g(p,s));}
#undef g

The first and last lines are not included in the character count. They are provided for convenience only.

Blown-up:

int glob(p,s)
const char *p, *s; /* K&R-style function declaration */
{
    return
        *p=='*'  ? glob(p+1,s) || (*s && glob(p,s+1)) :
        *p=='+'  ? *s && (glob(p+1,++s) || glob(p,s)) :
        *p=='?'  ? *s && glob(p+1,s+1)                :
        *p=='\\' ? *++p && *s++==*p++ && glob(p,s)    :
        *s==*p++ && (!*s++ || glob(p,s));
}

Joey Adams

Posted 2011-02-04T06:04:50.113

Reputation: 9 929

2

JavaScript - 259 characters

My implementation is very recursive, so the stack will overflow if an extremely long pattern is used. Ignoring the plus sign (which I could have optimized but chose not to for simplicity), one level of recursion is used for each token.

glob=function f(e,c){var b=e[0],d=e.slice(1),g=c.length;if(b=="+")return f("?*"+d,c);if(b=="?")b=g;else if(b=="*"){for(b=0;b<=g;++b)if(f(d,c.slice(b)))return 1;return 0}else{if(b=="\\"){b=e[1];d=e.slice(2)}b=b==c[0]}return b&&(!d.length&&!g||f(d,c.slice(1)))}

The function sometimes returns a number instead of a boolean. If that is a problem, you can use it as !!glob(pattern, str).


Ungolfed (unminified, rather) to serve as a useful resource:

function glob(pattern, str) {
    var head = pattern[0], tail = pattern.slice(1), strLen = str.length, matched;
    if(head == '+') {
        // The plus is really just syntactic sugar.
        return glob('?*' + tail, str);
    }
    if(head == '?') { // Match any single character
        matched = strLen;
    } else if(head == '*') { // Match zero or more characters.
        // N.B. I reuse the variable matched to save space.
        for(matched = 0; matched <= strLen; ++matched) {
            if(glob(tail, str.slice(matched))) {
                return 1;
            }
        }
        return 0;
    } else { // Match a literal character
        if(head == '\\') { // Handle escaping
            head = pattern[1];
            tail = pattern.slice(2);
        }
        matched = head == str[0];
    }
    return matched && ((!tail.length && !strLen) || glob(tail, str.slice(1)));
}

Note that indexing into characters of a string as for array elements is not part of the older (ECMAScript 3) language standard, so it may not work in older browsers.

PleaseStand

Posted 2011-02-04T06:04:50.113

Reputation: 5 369

2

PHP - 275 243 characters

<?function g($P,$I){$o='array_shift';if(@$I[0]==="")return 0;for(;$P;$o($P)){$p=$P[0];if($p=='?'|$p=='+'&&@$N===$o($I))return 0;if($p=='+'|$p=='*'&&$I&&g($P,array_slice($I,1)))return 1;if(!strpos(" ?+*\\",$p)&&$p!==$o($I))return 0;}return!$I;}

Ungolfed:

<?php

function g($P,$I) {
        if ($I && $I[0] === "") return false;
        for(;$P;array_shift($P)) {
                $p = $P[0];
                if( $p == '?' || $p == '+') {
                        if (NULL === array_shift($I)) {
                                return false;
                        }
                }
                if( $p=='+' || $p=='*' ) {
                        if ($I && g($P, array_slice($I,1))) {
                                return true;
                        }
                }
                if (!strpos(" ?+*\\",$p) && $p !== array_shift($I)) {
                        return false;
                }
        }
        return !$I;
}

function my_glob($pattern,$subject) {
    return !!g(str_split($pattern),str_split($subject));
}

Arnaud Le Blanc

Posted 2011-02-04T06:04:50.113

Reputation: 2 286

2

Overly Verbose Python (384 367 Characters)

t=lambda a:a[1:] 
h=lambda a:a[0] 
n=lambda p,s:s and(h(p)==h(s)and m(t(p),t(s))) 
def m(p,s): 
 if not p: 
  return not s 
 else: 
  return { 
   '?':lambda p,s:s and m(t(p),t(s)), 
   '+':lambda p,s:s and(m(p,t(s))or m(t(p),t(s))), 
   '*':lambda p,s:m(t(p),s)or(s and m(p,t(s))), 
   '\\':lambda p,s:n(t(p),s), 
  }.get(h(p),n)(p,s) 
glob=lambda p,s:not not m(p,s)

It's not the shortest, but it's nice and functional. The dispatch dict thing in the middle could presumably be rewritten as a disjunction over (h(p) == '?') and (? lambda body) type things. Defining that h operator costs me some characters for no benefit, but its nice to have a keyword for head.

I'd like to have a crack at golfscripting it later if time permits.

edit: removed unnecessary third branch in '*' case after reading user300's ruby answer

roobs

Posted 2011-02-04T06:04:50.113

Reputation: 299

2

Ruby - 199 171

g=->p,s{x=(b=->a{a[1..-1]})[p];y=s[0];w=b[s];v=p[0];_=->p,s{p[0]==y&&g[x,w]}
v==??? g[x,y&&w||s]:v==?+? y&&g[?*+x,w]:v==?*?
y&&g[p,w]||g[x,s]:v==?\\? _[x,s]:v ? _[p,s]:!y}

Ungolfed:

def glob(pattern, subject)
        b=->a{a[1..-1]}
        _=->p,s{ p[0]==s[0] && glob(b[p],b[s]) }
        ({
                ??=>->p,s { glob(b[p], s[0] ? b[s] : s) },
                ?+=>->p,s { s[0] && glob(?*+b[p], b[s]) },
                ?*=>->p,s { s[0] && glob(p,b[s]) || glob(b[p],s) },
                ?\\=>->p,s{ _[b[p],s] },
                nil=>->p,s{ !subject[0] }
        }[pattern[0]] || _)[pattern, subject]
end

Tests:

p glob('abc', 'abc')
p glob('abc', 'abcdef')
p glob('a??', 'aww')
p glob('a*b', 'ab')
p glob('a*b', 'agwijgwbgioeb')
p glob('a*?', 'a')
p glob('?*', 'def')
p glob('5+', '5ggggg')
p glob('+', '')

Inspired by roobs' answer

Arnaud Le Blanc

Posted 2011-02-04T06:04:50.113

Reputation: 2 286

i know nothing about ruby, but from your code i've learnt that accessing out of bounds indices returns nil. so popping an empty string yields a nil value which can be used as a string terminator symbol. C-style! nifty! i guess it could be mimicked in python by passing each input string through lambda s : list(s)+[None] – roobs – 2011-02-13T02:09:32.980

From the looks of it, Ruby has built in pattern matching. That's certainly handy for this sort of problem. – Jonathan M Davis – 2011-02-14T05:54:29.220

Actually the ?? are literal chars, => are key/value separators in Ruby Hashes, and -> starts a lambda :-) ( { ?? => ->{...} } is a hash with key "?" and a lambda as value.) But yes the way its used together looks like pattern-matching on single chars :-) – Arnaud Le Blanc – 2011-02-14T10:09:02.157

2

Shorter Snappier Python 2.6 (272 characters)

golfed:

n=lambda p,s:p[0]==s[0]and m(p[1:],s[1:]) 
def m(p,s): 
 q,r,t,u=p[0],p[1:],s[0],s[1:] 
 return any((q=='?'and(t and m(r,u)),q=='+'and(t and(m(p,u)or m(r,u))),q=='*'and(m(r,s)or(t and m(p,u))),q=='\\'and n(r,s),q==t==0))or n(p,s) 
glob=lambda*a:m(*[list(x)+[0]for x in a])

ungolfed:

TERMINATOR = 0 

def unpack(a): 
    return a[0], a[1:] 

def terminated_string(s): 
    return list(s) + [TERMINATOR] 

def match_literal(p, s): 
    p_head, p_tail = unpack(p) 
    s_head, s_tail = unpack(s) 
    return p_head == s_head and match(p_tail, s_tail) 

def match(p, s): 
    p_head, p_tail = unpack(p) 
    s_head, s_tail = unpack(s) 
    return any(( 
        p_head == '?' and (s_head and match(p_tail, s_tail)), 
        p_head == '+' and (s_head and(match(p, s_tail) or match(p_tail, s_tail))), 
        p_head == '*' and (match(p_tail, s) or (s_head and match(p, s_tail))), 
        p_head == '\\' and match_literal(p_tail, s), 
        p_head == s_head == TERMINATOR, 
    )) or match_literal(p, s) 

def glob(p, s): 
    return match(terminated_string(p), terminated_string(s))

featuring:

  • lazily-evaluated logical mess!
  • C style strings!
  • cute multiple comparison idiom!
  • plenty of ugly!

credit to user300's answer for illustrating how things are simplified if you can get some kind of terminator value when popping the head from an empty string.

i wish the head/tail unpacking could be performed inline during the declaration of m's arguments. then m could be a lambda, just like its friends n and glob. python2 cannot do it, and after a bit of reading, it looks like python3 cannot either. woe.

testing:

test_cases = { 
    ('abc', 'abc') : True, 
    ('abc', 'abcdef') : False, 
    ('a??', 'aww') : True, 
    ('a*b', 'ab') : True, 
    ('a*b', 'aqwghfkjdfgshkfsfddsobbob') : True, 
    ('a*?', 'a') : False, 
    ('?*', 'def') : True, 
    ('5+', '5ggggg') : True, 
    ('+', '') : False, 
}   
for (p, s) in test_cases: 
    computed_result = glob(p, s) 
    desired_result = test_cases[(p, s)] 
    print '%s %s' % (p, s) 
    print '\tPASS' if (computed_result == desired_result) else '\tFAIL' 

roobs

Posted 2011-02-04T06:04:50.113

Reputation: 299

1

C# (251 chars)

static bool g(string p,string i){try{char c;System.Func<string,string>s=t=>t.Remove(0,1);return p==i||((c=p[0])==92?p[1]==i[0]&g(s(s(p)),s(i)):c==42?g(s(p),i)||g(p,s(i)):c==43?g(s(p),s(i))|g(p,s(i)):g(s(p),s(i))&(c==i[0]|c==63));}catch{return false;}}

Slightly more readable:

static bool g(string p /* pattern */, string i /* input string */)
{
    // Instead of checking whether we’ve reached the end of the string, just
    // catch the out-of-range exception thrown by the string indexing operator
    try
    {
        char c;

        // .Remove(0,1) is shorter than .Substring(1)...
        System.Func<string, string> s = t => t.Remove(0, 1);

        // Note that every glob matches itself!† This saves us having to write
        // “(p=="" & i=="")” which would be much longer — very convenient!
        return p == i || (

            // backslash escapes
            (c = p[0]) == 92 ? p[1] == i[0] & g(s(s(p)), s(i)) :

            // '*' — need “||” so that s(i) doesn’t throw if the first part is true
            c == 42 ? g(s(p), i) || g(p, s(i)) :

            // '+'
            c == 43 ? g(s(p), s(i)) | g(p, s(i)) :

            // '?' or any other character
            g(s(p), s(i)) & (c == i[0] | c == 63)
        );
    }

    // If we ever access beyond the end of the string, we know the glob doesn’t match
    catch { return false; }
}

I know, I know... except for globs containing the backslash. Which is really unfortunate. It would have been really clever otherwise. :(

Timwi

Posted 2011-02-04T06:04:50.113

Reputation: 12 158

1

Python (454 characters)

def glob(p,s):
  ps,pns=[0],[]
  for ch in s:
    for i in ps:
      if i<0:
        pns+=[i]
        if i>-len(p) and p[-i]==ch:pns+=[-i]
      elif i<len(p):
        pc=p[i]
        d={'?':[i+1],'+':[i,-i-1],'*':[i+1,-i-1]}
        if pc in d:pns+=d[pc]
        else:
          if pc=='\\':pc=p[i+1]
          if pc==ch:pns+=[i+1]
    ps,pns=pns,[]
  if (s or p in '*') and (len(p) in ps or -len(p)+1 in ps or -len(p) in ps): return True
  return False

Hoa Long Tam

Posted 2011-02-04T06:04:50.113

Reputation: 1 902

1

D: 363 Characters

bool glob(S)(S s,S t){alias front f;alias popFront p;alias empty e;while(!e(s)&&!e(t)){switch(f(s)){case'+':if(e(t))return false;p(t);case'*':p(s);if(e(s))return true;if(f(s)!='+'&&f(s)!='*'){for(;!e(t);p(t)){if(f(s)==f(t)&&glob(s,t))return true;}}break;case'\\':p(s);if(e(s))return false;default:if(f(s)!=f(s))return false;case'?':p(s);p(t);}}return e(s)&&e(t);}

More Legibly:

bool glob(S)(S s, S t)
{
    alias front f;
    alias popFront p;
    alias empty e;

    while(!e(s) && !e(t))
    {
        switch(f(s))
        {
            case '+':
                if(e(t))
                    return false;

                p(t);
            case '*':
                p(s);

                if(e(s))
                    return true;

                if(f(s) != '+' && f(s) != '*')
                {
                    for(; !e(t); p(t))
                    {
                        if(f(s) == f(t) && glob(s, t))
                            return true;
                    }
                }

                break;
            case '\\':
                p(s);

                if(e(s))
                    return false;
            default:
                if(f(s) != f(s))
                    return false;
            case '?':
                p(s);
                p(t);
        }
    }

    return e(s) && e(t);
}

Jonathan M Davis

Posted 2011-02-04T06:04:50.113

Reputation: 705

1

golfscript

{{;;}2$+}:x;{x if}:a;{x\if}:o;{1$1$}:b;{(@(@={\m}a}:r;{b(63={\({\m}a}a{b(43={\({\b m{'+'\+m}o}a}a{b(42={b m{\({\'*'\+m}a}o}a{b(92={r}a{b 0=0=\0=0=*{r}o}o}o}o}o}:m;{[0]+\[0]+m}:glob;

it's built out of functions that consume two arguments from the stack, s and p, and produce a single boolean return value. there's a bit of mucking around to make that compatible with the lazy and and lazy or operators. i really doubt this approach is anywhere near optimal, or even in the right direction.

there's also a few entertainingly stupid moments, such as popping a '*' off the pattern, consuming the '*' in a comparison, only to realise that the subsequent branch doesn't match. in order to go down the other branch, we need the pattern with the '*' on the front, but we've consumed that original pattern when we popped the '*', and we consumed the '*', so to get ourselves the pattern again we load a shiny new string constant '*', and prepend it in place. it gets even uglier because for some reason the character matching needs to be done with ascii values, but prepending back onto the string needs strings.

less golfed golfscript

{[0]+}:terminate_string;
{{;;}2$+if}:_and;
{{;;}2$+\if}:_or;
{1$1$}:branch;
{(@(@={\match}_and}:match_literal;
{0=0=\0=0=*}:match_terminator;
{(92={match_literal}_and}:match_escape;
{(63={\({\match}_and}_and}:match_wildcard;
{(43={\({\branch match{'+'\+match}_or}_and}_and}:match_wildcard_plus;
{(42={branch match{\({\'*'\+match}_and}_or}_and}:match_wildcard_star;
{branch match_wildcard{branch match_wildcard_plus{branch match_wildcard_star{branch match_escape{branch match_terminator{match_literal}_or}_or}_or}_or}_or}:match;
{terminate_string\terminate_string match}:glob;

tests

{2$2$glob = "test passed: " "test FAILED: " if print \ print ' ; ' print print "\n" print}:test_case;

'abc' 'abc' 1 test_case
'abc' 'abcdef' 0 test_case
'a??' 'aww' 1 test_case
'a*b' 'ab' 1 test_case
'a*b' 'agwijgwbgioeb' 1 test_case
'a*?' 'a' 0 test_case
'?*' 'def' 1 test_case
'5+' '5ggggg' 1 test_case
'+' '' 0 test_case

roobs

Posted 2011-02-04T06:04:50.113

Reputation: 299