Count regex matches

6

1

Your task

Given a simple regular expression, you have to count how many strings of length n have a match of length n with the given simple regex. This will just be a subset of regexs. Like, no lookaheads or named groups or recursion or whatever weird things regexs have.

Simple regular expression

For the purposes of this challenge, a regex is said to be simple if, for one, it only contains characters in the ASCII range 32-126. Furthermore, it should only use the following functionalities:

  • match literal characters, much like the regex abc would only match the string "abc";
  • match options, like abc|def would match "abc" and "def";
  • match exactly 0 or 1 occurrence of something, e.g. https? matches "http" and "https";
  • match 1 or more occurrences of something, e.g. ah+ would match "ah", "ahh", "ahhh", "ahhhh", etc;
  • match any amount of occurrences of something, e.g. 1* matches "", "1", "11", "111", "1111", etc;
  • match between n and m occurrences of something, e.g. lo{1,4}l matches only "lol", "lool", "loool" and "looool". If n is ommited, it matches up to m occurrences. If m is ommited, it matches at least n occurrences. Assume at least one of n or m is present;
  • use () to group, e.g. ab(c|d)ef would match "abcef" and "abdef" (c.f. 2nd item in this list) or (10)+ would match "10", "1010", "101010", "10101010", etc;
  • use . to match any character (in the ASCII range [32, 126]), so ab. would match "abc", "ab9", "ab)", etc;
  • use \ to escape the special meaning of a character, e.g. ab? would match "a" and "ab", while ab\? only matches "ab?";
  • use [] as a group of possible characters. Inside the brackets, all characters lose their special behaviours, except for - and \. This means that, for one, ab[cde] is shorthand for ab(c|d|e) and secondly, ab[?+*] matches "ab?", "ab+" and "ab*"; also related to []:
  • use - to specify a character range within brackets. The ranges you have to support are a-z, A-Z and 0-9, as well as their subsets, like h-z or 3-8. E.g., the regex ab[c-g] matches "abc", "abd", "abe", "abf" and "abg"; Note that - has no special meaning outside of [] so a-z would only match "a-z".

Input

The input for your program/function/routine/etc should be a string representing the regex and an integer n. For the regex, you can further assume:

  • all characters that show up are in the ASCII range [32, 126]
  • if {n,m} is used, then \$n \leq m \$
  • if - is used inside [] then the specified range is well-formed

Output

The number of strings of length n that match the given regex. You only have to account for characters in the ASCII range [32, 126].

Test cases

".*", 0 -> 1
".*", 1 -> 95
".*", 2 -> 9025
".*", 3 -> 857375
".*", 4 -> 81450625
"abc", 2 -> 0
"abc", 4 -> 0
"ab|ac|ad", 2 -> 3
"a(b|c)", 2 -> 2
"hell(o|oo)", 5 -> 1
"https?", 5 -> 1
"ho{1,4}ly", 6 -> 1
"ho{3,}ly", 137 -> 1
"[abcde]{,2}", 2 -> 25
"(10)+", 7 -> 0
"(10)+", 8 -> 1
"ab\?", 3 -> 1
"[t7]h[i1][s5] is c[0o]d[Ee3] g[0oO][l1L]f", 17 -> 432
"\+351 9[1236] [0-9]{3,3} [0-9]{2,2} [0-9][0-9]", 17 -> 40000000
"-", 1 -> 1
"\\", 1 -> 1
"[+?*]", 1 -> 3
"Abc([d-z]*|(.H)+)", 11 -> 5132188812826241
"ab|ab", 2 -> 1
".(.(.(.(.|a))))|hello", 5 -> 7737809375

This is code so shortest solution in bytes, wins. If you like this challenge, consider upvoting it... And happy golfing!

RGS

Posted 2020-02-13T07:29:38.087

Reputation: 5 047

1Okay, I count 4 regex matches. Do I win? :P – Lyxal – 2020-02-13T07:41:49.587

1@Lyxal python, 12 bytes, lambda s,n:4 – RGS – 2020-02-13T07:44:20.073

1Keg, 1 byte: 4. :P – Lyxal – 2020-02-13T07:46:49.547

1Suggested test case: one containing {n:m} without n (i.e. "[abcde]{:2}", 2 -> 25) – Kevin Cruijssen – 2020-02-13T09:30:33.587

1Was the notion of simple regex an attempt at allowing someone to solve this by other means than brute force? (If yes, I think it's still much too complicated for that, as almost all key features of regular expressions are included.) – Arnauld – 2020-02-13T14:32:45.993

@Arnauld yes D: would you suggest further trimming down the specs of a simple regex? To what? – RGS – 2020-02-13T15:01:21.893

@RGS I would not recommend to change the spec now. But for another challenge, it might be interesting for instance to just count the number of strings matching a sequence of [...], e;g. [ac-f][xyz] -> 15, without \\ escaping or any other nasty features. – Arnauld – 2020-02-13T15:11:25.230

@Arnauld Thanks for your feedback! And what about the +*? "modifiers"? Would you think they make for a good addition? – RGS – 2020-02-13T15:44:32.783

1I don't know for sure. The more features you add, the less likely you'll get non-brute-force solutions. – Arnauld – 2020-02-13T15:47:35.223

Answers

6

Java 10, 356 342 284 275 233 225 220 bytes

import java.util.*;List S;r->k->{S=new Stack();p("",k);k=0;for(var p:S)k+=(p+"").matches(r.replaceAll("\\{(,\\d+\\})","{0$1"))?1:0;return k;}void p(String p,int k){if(k<1)S.add(p);else for(char i=32;i<127;)p(p+i++,k-1);}

Works, but way too slow for \$n\geq4\$.

Try it online.

Explanation:

import java.util.*; // Required import for Stack and List
List S;             // List on class-level, uninitialized

k->r->{             // Method with integer and String parameter and integer return-type
  S=new Stack();    //  Create a new List
  p("",k);          //  Put all permutations of length `k` consisting of printable ASCII
                    //  characters in this list
  k=0;              //  Re-use `k` as counter-integer, starting at 0
  for(var p:S)      //  Loop over all permutations in the List:
    k+=             //   Increase the counter by:
       (p+"")       //    Convert the permutation from Object to String
       .matches(r   //    If it matches the input-regex,
         .replaceAll("\\{(,\\d+\\})","{0$1"))?
                    //    after we've replaced all {,m} with {0,m}
          1         //     Increase the counter by 1
       :            //    Else:
          0;        //     Leave the count the same by increasing with 0
  return k;}        //  And return the counter as result

void p(             // Separated method with two parameters and no return-type, where:
    String p,       //   `p` is the prefix-String, starting empty
    int k){         //   `k` is the length the permutations we want to generate
  if(k<1)           //  If `k` is 0:
    S.add(p);       //   Add the current prefix-String to the List
  else              // Else:
    for(char i=32;i<127;)
                    //  Loop over the printable ASCII characters:
      p(            //   And do a recursive call, with:
        p+i++,      //    The prefix-String appended with the current character
        k-1);}      //    And `k` decreased by 1

Some notes:

  • Java's regex only supports {n,m} and {n,}, since {,m} could be written as {0,m}.
  • And Java's String#matches builtin implicitly adds a leading ^ and trailing $ to check the entire String.

Kevin Cruijssen

Posted 2020-02-13T07:29:38.087

Reputation: 67 575

1Will wait for further golfing and explanation :D – RGS – 2020-02-13T09:33:56.867

1Will leave a comment appearing to be commenting on the answer but actually end up being a meta continuation of a now existent joke. – Lyxal – 2020-02-13T10:09:24.147

Good luck with that, hope it beats my brute-force solution!

– RGS – 2020-02-13T12:10:22.017

@RGS I doubt it.. Java's regex uses {n,m} instead of {n:m}, so I have to account for that. In addition, it only supports {n,} and {n,m} (since {,m} can be {0,m}). So some bytes are already lost there in comparison to your answer. And Python just has better/shorter builtins in general tbh.. :) – Kevin Cruijssen – 2020-02-13T12:14:29.337

3

Python 3, 139 137 129 bytes

Saved 2 naughty bytes thanks to @Kevin, then 6 thanks to @ovs and then 2 more by rewriting f, thanks to @wilkben!

lambda r,n:sum(1for s in map(''.join,i.product(map(chr,range(32,127)),repeat=n))if re.match(f"^{r}$",s))
import re,itertools as i

You can try the smaller test cases online. My solution generates all strings of length n with the generator g and then tries to match the regex to the whole string. If the test case is the regex r, I use the regex ^r$ to force the match to be on the whole string.

RGS

Posted 2020-02-13T07:29:38.087

Reputation: 5 047

1This fails on any test case that uses the {n:m} syntax, since Python regexes don't have that (see the [abcde]{:2}, 2 -> 0 X in the results of your test suite). – Grimmy – 2020-02-13T12:17:09.667

1@Grimmy you are absolutely right! My problem was that I wrote the challenge believing the syntax I knew was {n:m} and didn't even check it. Since there is only one other answer that also benefits from me correcting the syntax to {n,m} which was what I intended, I edited the OP. Thanks for being on the lookout! Now the test cases are fine :) – RGS – 2020-02-13T12:36:18.730

1g can be 6 bytes shorter with some rearrangements: g=lambda n:[""][n:]or(s+chr(c+32)for c in range(95)for s in g(n-1)). – ovs – 2020-02-13T16:15:00.473

-8 bytes using itertools lambda r,n:sum(1for s in map(''.join,i.product(map(chr,range(32,127)),repeat=n))if re.match(f"^{r}$",s)) import re,itertools as i – wilkben – 2020-02-13T18:32:07.413

3

JavaScript (Node.js), 86 bytes

Another brute-force solution taking input as (n)(regex).

n=>g=(p,s='',c=127)=>n*!s[n-1]?--c>>5&&g(p,s+Buffer([c]))+g(p,s,c):!!s.match(`^${p}$`)

Try it online!

Commented

n =>                      // n = target string length
  g = (                   // g is a recursive function taking:
    p,                    //   p = regex pattern
    s = '',               //   s = current string
    c = 127               //   c = current ASCII code
  ) =>                    //
    n * !s[n - 1] ?       // if n is not equal to 0 and s is less than n char. long:
      --c >> 5 &&         //   decrement c; if it's greater than or equal to 32:
      g(                  //     do a 1st recursive call where:
        p,                //
        s + Buffer([c])   //       the character with ASCII code c is added to s
                          //       c is not passed so that it's reset to 127
      ) +                 //
      g(                  //     do a 2nd recursive call where:
        p, s, c           //       this value of c is discarded
      )                   //
    :                     // else:
      !!s.match(`^${p}$`) //   return true (1) if s matches p, or false (0) otherwise

Arnauld

Posted 2020-02-13T07:29:38.087

Reputation: 111 334

+1 Looking good! I see you also had to do the ^_$ tweak :) – RGS – 2020-02-13T14:30:20.430

1

I have the feeling you can contribute so many tips to the JavaScript tips page xD I.e. the Buffer([c]) to convert from integer to character. PS: Why do codegolfers always like to obscure their code? ;) The >>5 can simply be >31 (which imo makes it a lot more readable). Not that it matters here, since the byte-count remains the same, but stil.

– Kevin Cruijssen – 2020-02-13T15:00:48.830

1

@KevinCruijssen Buffer() doesn't exist in JS. It's a Node thing. So maybe we need a Node tips page, but I'm not sure there are enough Node-specific tricks to justify it. Or maybe we should include Node in Tips for Golfing in ECMAScript 6 and above.

– Arnauld – 2020-02-13T15:43:55.253

1

perl -E, 79 bytes

($,,$")=@ARGV;@@="";@@=map{$,=$_;map$,.chr,32..126}@@for 1..$,;say~~grep/$"/,@@

Humpty Dumpty

Posted 2020-02-13T07:29:38.087

Reputation: 11

1Welcome to this community! Can you please include a tio.run link, so that people can check your code passes the test cases? :) – RGS – 2020-02-13T17:12:11.623