Implement PCRE in your language.

13

4

Note: After trying this myself, I soon realized what a mistake this was. Therfore, I am modifying the rules a bit.

The minimum required functionality:

  • Character classes (., \w, \W, etc.)
  • Multipliers (+, *, and ?)
  • Simple capture groups

Your challenge is to implement PCRE in the language of your choice subject to the following conditions:

  • You may not use your language's native RegEx facilities in any way. You may not use a 3rd party RegEx library either.
  • Your entry should implement as much of the PCRE spec. as possible.
  • Your program should accept as input, 2 lines:

    • the regular expression
    • the string input to match against
  • Your program should indicate in its output:

    • Whether the RegEx matched anywhere in the input string
    • The results of any capturing groups
  • The winner shall be the entry that implements as much of the spec. as possible. In case of a tie, the winner shall be the most creative entry, as judged by me.


Edit: to clarify a few things, here are some examples of input and expected output:


  • Input:
^\s*(\w+)$
         hello
  • Output:
Matches: yes
Group 1: 'hello'

  • Input:
(\w+)@(\w+)(?:\.com|\.net)
sam@test.net
  • Output:
Matches: yes
Group 1: 'sam'
Group 2: 'test'

Nathan Osman

Posted 2011-02-21T20:18:32.000

Reputation: 596

That's a really challenging challenge, given the amount of features in PCRE. Recursion, backtracking, lookahead/assertions, unicode, conditional subpatterns, ... – Arnaud Le Blanc – 2011-02-21T20:42:00.553

1

See PCRE docs; PERL RE; PHP PCRE docs are great too.

– Arnaud Le Blanc – 2011-02-21T20:44:30.597

@user300: The goal is to implement as much as possible. Obviously everything would be a little too hard. – Nathan Osman – 2011-02-21T21:41:36.157

2@George: How about you list the features you want and give some test cases, just so that we're all on even ground. – Marko Dumic – 2011-02-21T22:09:39.843

@Marko: Is that better? – Nathan Osman – 2011-02-22T00:07:01.013

1@George: I think @Marko was after specific features, or rather, the minimal subset you'd want to people to implement first. On the whole, though, PCRE is really way too hard a challenge for a casual coding competition. I suggest changing this to a very small, specific RE subset, and make the challenge to implement. – MtnViewMark – 2011-02-22T16:04:36.710

@Mtn: You're right. I've edited the question now. – Nathan Osman – 2011-02-22T18:38:20.257

@George, why not just ask for POSIX regexes instead? That roughly corresponds to your minimum subset. – Peter Taylor – 2011-02-22T19:09:03.587

Awesome task. Click. Whirrrrr. Working... – Jonathan Watmough – 2011-04-28T23:54:24.100

@Jonathan: I'm guessing it was harder than it seemed? – Nathan Osman – 2011-05-12T22:22:36.800

@JonathanWatmough (3 years later) Still waiting... ;-) – Digital Trauma – 2014-05-02T01:12:28.233

I think the part with the groups is pretty complicated. I have given a little thought and had trouble identifying characters inside groups inside groups inside groups (as in Inception). – Alexandru – 2011-06-30T20:36:33.003

Actually, I have a good chunk working, just need to clean it up a bit and finish the \w part. Got distracted back to working on 'paying' software. Plus the baby is a lot more active, and it's really hard to program with a noisy baby ;-) Wouldn't have it any other way. – Jonathan Watmough – 2011-07-11T22:00:49.117

Answers

10

Python

Since implementing full PCRE is too much I implemented only an essential subset of it.

Supports |.\.\w\W\s+*(). Input regexp must be correct.

Examples:

$ python regexp.py 
^\s*(\w+)$
   hello
Matches:     hello
Group 1 hello

$ python regexp.py
(a*)+
infinite loop

$ python regexp.py 
(\w+)@(\w+)(\.com|\.net)
sam@test.net
Matches:  sam@test.net
Group 1 sam
Group 2 test
Group 3 .net

How it works:

For detailed theory read this Introduction to Automata Theory, Languages, and Computation.

The idea is to convert the original regular expression into a nondeterminist finite automata (NFA). Actually, PCRE regular expressions are at least context free grammars for which we need push-down automata, but we'll limit ourselves to a subset of PCRE.

Finite automatas are directed graphs in which nodes are states, edges are transitions and each transition has a matching input. Initially you start from a start node, predefined. Whenever you receive an input that matches one of the transition you take that transition to a new state. If you reached a terminal node it is called that automata accepted input. In our case input is a matching function that returns true.

They are called nondeterminist automata because sometimes there are more matching transitions that you can take from the same state. In my implementation all transition to the same state must match the same thing, so I stored the matching function together with the destination state (states[dest][0]).

We transform our regexp to a finite automata using building blocks. A building block has a start node (first) and an end node (last) and matches something from the text (possible empty string).

The simplest examples include

  • matching nothing: True (first == last)
  • matching a character: c == txt[pos] (first == last)
  • matching end of string: pos == len(txt)(first == last`)

You will also need the new position in text where to match next token.

More complicated examples are (capital letters stand for blocks).

  • matching B+:

    • create nodes: u, v (matching nothing)
    • create transitions: u -> B.first, B.last -> v, v -> u
    • when you get to node v you already matched B. Then you have two options: go further, or try to match B again.
  • matching A|B|C:

    • create nodes: u, v (matching nothing)
    • create transitions: u -> A.first, u -> C.first, u -> C.first,
    • create transitions: A->last -> v, B->last -> v, C->last -> v,
    • from u you can go to any block

All regexp operators can be transformed like this. Just give a try for *.

The last part is to parse the regexp which requires a very simple grammar:

 or: seq ('|' seq)*
 seq: empty
 seq: atom seq
 seq: paran seq
 paran: '(' or ')'

Hopefully implementing a simple grammar (I think is LL(1), but correct me if I'm wrong) is much easier than building an NFA.

Once you have the NFA you need to backtrack until you reach terminal node.

Source code (or here):

from functools import *

WORDCHAR = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890_'


def match_nothing(txt, pos):
  return True, pos

def match_character(c, txt, pos):
  return pos < len(txt) and txt[pos] == c, pos + 1

def match_space(txt, pos):
  return pos < len(txt) and txt[pos].isspace(), pos + 1

def match_word(txt, pos):
  return pos < len(txt) and txt[pos] in WORDCHAR, pos + 1

def match_nonword(txt, pos):
  return pos < len(txt) and txt[pos] not in WORDCHAR, pos + 1

def match_dot(txt, pos):
  return pos < len(txt), pos + 1

def match_start(txt, pos):
  return pos == 0, pos

def match_end(txt, pos):
  return pos == len(txt), pos


def create_state(states, match=None, last=None, next=None, name=None):
  if next is None: next = []
  if match is None: match = match_nothing

  state = len(states)
  states[state] = (match, next, name)
  if last is not None:
    states[last][1].append(state)

  return state


def compile_or(states, last, regexp, pos):
  mfirst = create_state(states, last=last, name='or_first')
  mlast = create_state(states, name='or_last')

  while True:
    pos, first, last = compile_seq(states, mfirst, regexp, pos)
    states[last][1].append(mlast)
    if pos != len(regexp) and regexp[pos] == '|':
      pos += 1
    else:
      assert pos == len(regexp) or regexp[pos] == ')'
      break

  return pos, mfirst, mlast


def compile_paren(states, last, regexp, pos):
  states.setdefault(-2, [])   # stores indexes
  states.setdefault(-1, [])   # stores text

  group = len(states[-1])
  states[-2].append(None)
  states[-1].append(None)

  def match_pfirst(txt, pos):
    states[-2][group] = pos
    return True, pos

  def match_plast(txt, pos):
    old = states[-2][group]
    states[-1][group] = txt[old:pos]
    return True, pos

  mfirst = create_state(states, match=match_pfirst, last=last, name='paren_first')
  mlast = create_state(states, match=match_plast, name='paren_last')

  pos, first, last = compile_or(states, mfirst, regexp, pos)
  assert regexp[pos] == ')'

  states[last][1].append(mlast)
  return pos + 1, mfirst, mlast


def compile_seq(states, last, regexp, pos):
  first = create_state(states, last=last, name='seq')
  last = first

  while pos < len(regexp):
    p = regexp[pos]
    if p == '\\':
      pos += 1
      p += regexp[pos]

    if p in '|)':
      break

    elif p == '(':
      pos, first, last = compile_paren(states, last, regexp, pos + 1)

    elif p in '+*':
      # first -> u ->...-> last -> v -> t
      # v -> first (matches at least once)
      # first -> t (skip on *)
      # u becomes new first
      # first is inserted before u

      u = create_state(states)
      v = create_state(states, next=[first])
      t = create_state(states, last=v)

      states[last][1].append(v)
      states[u] = states[first]
      states[first] = (match_nothing, [[u], [u, t]][p == '*'])

      last = t
      pos += 1

    else:  # simple states
      if p == '^':
    state = create_state(states, match=match_start, last=last, name='begin')
      elif p == '$':
    state = create_state(states, match=match_end, last=last, name='end')
      elif p == '.':
    state = create_state(states, match=match_dot, last=last, name='dot')
      elif p == '\\.':
    state = create_state(states, match=partial(match_character, '.'), last=last, name='dot')
      elif p == '\\s':
    state = create_state(states, match=match_space, last=last, name='space')
      elif p == '\\w':
    state = create_state(states, match=match_word, last=last, name='word')
      elif p == '\\W':
    state = create_state(states, match=match_nonword, last=last, name='nonword')
      elif p.isalnum() or p in '_@':
    state = create_state(states, match=partial(match_character, p), last=last, name='char_' + p)
      else:
    assert False

      first, last = state, state
      pos += 1

  return pos, first, last


def compile(regexp):
  states = {}
  pos, first, last = compile_or(states, create_state(states, name='root'), regexp, 0)
  assert pos == len(regexp)
  return states, last


def backtrack(states, last, string, start=None):
  if start is None:
    for i in range(len(string)):
      if backtrack(states, last, string, i):
    return True
    return False

  stack = [[0, 0, start]]   # state, pos in next, pos in text
  while stack:
    state = stack[-1][0]
    pos = stack[-1][2]
    #print 'in state', state, states[state]

    if state == last:
      print 'Matches: ', string[start:pos]
      for i in xrange(len(states[-1])):
    print 'Group', i + 1, states[-1][i]
      return True

    while stack[-1][1] < len(states[state][1]):
      nstate = states[state][1][stack[-1][1]]
      stack[-1][1] += 1

      ok, npos = states[nstate][0](string, pos)
      if ok:
    stack.append([nstate, 0, npos])
    break
      else:
    pass
    #print 'not matched', states[nstate][2]
    else:
      stack.pop()

  return False



# regexp = '(\\w+)@(\\w+)(\\.com|\\.net)'
# string = 'sam@test.net'
regexp = raw_input()
string = raw_input()

states, last = compile(regexp)
backtrack(states, last, string)

Alexandru

Posted 2011-02-21T20:18:32.000

Reputation: 5 485

1+1 Wow... I tried to do this myself with PHP and utterly failed. – Nathan Osman – 2011-07-01T19:34:40.547

TIL (a+b)+ matches abaabaaabaaaab. – Alexandru – 2011-07-02T00:09:19.830