Regex golf and ccTLDs, hooray!

4

Write a regex that only matches valid country code top level domains (ccTLDs). Your regex must match both the domains with the dot and without the dot (it must match tv and .tv). Any domain that is not a ccTLD or does not exist (e.g. .info or .jobs or .xz) must not be matched.

Use Perl, POSIX, PCRE or Python.

For reference, this is the full list of ccTLDs, as of challenge creation:

.ac .ad .ae .af .ag .ai .al .am .ao .aq .ar .as .at .au .aw .ax .az .ba .bb .bd .be .bf .bg .bh .bi .bj .bm .bn .bo .br .bs .bt .bw .by .bz .ca .cc .cd .cf .cg .ch .ci .ck .cl .cm .cn .co .cr .cu .cv .cw .cx .cy .cz .de .dj .dk .dm .do .dz .ec .ee .eg .er .es .et .eu .fi .fj .fk .fm .fo .fr .ga .gd .ge .gf .gg .gh .gi .gl .gm .gn .gp .gq .gr .gs .gt .gu .gw .gy .hk .hm .hn .hr .ht .hu .id .ie .il .im .in .io .iq .ir .is .it .je .jm .jo .jp .ke .kg .kh .ki .km .kn .kp .kr .kw .ky .kz .la .lb .lc .li .lk .lr .ls .lt .lu .lv .ly .ma .mc .md .me .mg .mh .mk .ml .mm .mn .mo .mp .mq .mr .ms .mt .mu .mv .mw .mx .my .mz .na .nc .ne .nf .ng .ni .nl .no .np .nr .nu .nz .om .pa .pe .pf .pg .ph .pk .pl .pm .pn .pr .ps .pt .pw .py .qa .re .ro .rs .ru .rw .sa .sb .sc .sd .se .sg .sh .si .sk .sl .sm .sn .so .sr .ss .st .sv .sx .sy .sz .tc .td .tf .tg .th .tj .tk .tl .tm .tn .to .tr .tt .tv .tw .tz .ua .ug .uk .us .uy .uz .va .vc .ve .vg .vi .vn .vu .wf .ws .ye .yt .za .zm .zw -- open for registration

.an .bl .bq .bv .eh .gb .mf .sj .su .tp .um -- not open (still must be matched)

^ and $ anchors are not needed.

Andrew

Posted 2019-08-02T14:43:56.787

Reputation: 2 067

6

There are a lot of different flavors of regex (some of which are not even regular expressions) and there are a lot of things that are regular expressions under the computer science definition. You don't have to specify a single flavor but it would be nice to know exactly what counts as a regex for this challenge, that is what langauges can compete.

– Post Rock Garf Hunter – 2019-08-02T15:07:17.703

Answers

4

275 bytes

\.?([cmst][cdghk-orvz]|[agim][delmq-t]|[dft][jkmo]|[acms]x|[bcgkmp][ghmnrwy]|[psu][agksy]|[aen][cegru]|[abcn][fioz]|l[abcikr-vy]|[bdjkprsvy]e|[bcgmnqvz]a|[behpsty]t|[cghmrv]u|[fgksv]i|[gjkmn]p|b[bdjs]|h[kmnr]|[gptw]f|[artz]w|v[cgn]|[joz]m|[ijr]o|[erw]s|[dku]z|[np]l|fr|in|sb)

(277 bytes including ^$.) A Python program was used to autogenerate part of this regular expression. Here is the first version of the program:

def makeregex(list):
  if len(list) == 1:
    return alpha[list[0]]
  regex = "["
  j = list[0]
  while j < 26:
    if j in list:
      if j + 1 in list and j + 2 in list and j + 3 in list:
        regex += alpha[j] + "-"
        while j + 1 in list:
          j += 1
      regex += alpha[j]
    j += 1
  return regex + "]"

def perform(rows, cols):
  replaced = '|'.join([alpha[j] + alpha[k] for j in rows for k in cols if array[j][k] > 0])
  regex = makeregex(rows) + makeregex(cols)
  print regex, 'replaces', replaced, 'saving', len(replaced) - len(regex), 'bytes'
  for j in rows:
    for k in cols:
      if (array[j][k]):
        array[j][k] = -1
      else:
        raise Exception("filling illegal entry")

all = ['ac', 'ad', 'ae', 'af', 'ag', 'ai', 'al', 'am', 'ao', 'aq', 'ar', 'as',
       'at', 'au', 'aw', 'ax', 'az', 'ba', 'bb', 'bd', 'be', 'bf', 'bg', 'bh',
       'bi', 'bj', 'bm', 'bn', 'bo', 'br', 'bs', 'bt', 'bw', 'by', 'bz', 'ca',
       'cc', 'cd', 'cf', 'cg', 'ch', 'ci', 'ck', 'cl', 'cm', 'cn', 'co', 'cr',
       'cu', 'cv', 'cw', 'cx', 'cy', 'cz', 'de', 'dj', 'dk', 'dm', 'do', 'dz',
       'ec', 'ee', 'eg', 'er', 'es', 'et', 'eu', 'fi', 'fj', 'fk', 'fm', 'fo',
       'fr', 'ga', 'gd', 'ge', 'gf', 'gg', 'gh', 'gi', 'gl', 'gm', 'gn', 'gp',
       'gq', 'gr', 'gs', 'gt', 'gu', 'gw', 'gy', 'hk', 'hm', 'hn', 'hr', 'ht',
       'hu', 'id', 'ie', 'il', 'im', 'in', 'io', 'iq', 'ir', 'is', 'it', 'je',
       'jm', 'jo', 'jp', 'ke', 'kg', 'kh', 'ki', 'km', 'kn', 'kp', 'kr', 'kw',
       'ky', 'kz', 'la', 'lb', 'lc', 'li', 'lk', 'lr', 'ls', 'lt', 'lu', 'lv',
       'ly', 'ma', 'mc', 'md', 'me', 'mg', 'mh', 'mk', 'ml', 'mm', 'mn', 'mo',
       'mp', 'mq', 'mr', 'ms', 'mt', 'mu', 'mv', 'mw', 'mx', 'my', 'mz', 'na',
       'nc', 'ne', 'nf', 'ng', 'ni', 'nl', 'no', 'np', 'nr', 'nu', 'nz', 'om',
       'pa', 'pe', 'pf', 'pg', 'ph', 'pk', 'pl', 'pm', 'pn', 'pr', 'ps', 'pt',
       'pw', 'py', 'qa', 're', 'ro', 'rs', 'ru', 'rw', 'sa', 'sb', 'sc', 'sd',
       'se', 'sg', 'sh', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sr', 'ss', 'st',
       'sv', 'sx', 'sy', 'sz', 'tc', 'td', 'tf', 'tg', 'th', 'tj', 'tk', 'tl',
       'tm', 'tn', 'to', 'tr', 'tt', 'tv', 'tw', 'tz', 'ua', 'ug', 'uk', 'us',
       'uy', 'uz', 'va', 'vc', 've', 'vg', 'vi', 'vn', 'vu', 'wf', 'ws', 'ye',
       'yt', 'za', 'zm', 'zw']
alpha = 'abcdefghijklmnopqrstuvwxyz'
array = [[alpha[i] + alpha[j] in all for j in range(26)] for i in range(26)]

#perform([2, 12, 18, 19], [2, 3, 6, 7, 10, 11, 12, 13, 14, 17, 21, 25])
#perform([0, 6, 8, 12], [3, 4, 11, 12, 16, 17, 18, 19])
#perform([3, 5, 19], [9, 10, 12, 14])
#perform([0, 2, 12, 18], [23])

while max(max(row) for row in array) > 0:
  best = 0
  i = 1
  while i < 1 << 26:
    row = [-1 for j in range(26)]
    for j in range(26):
      if (i & 1 << j):
        for k in range(26):
          if array[j][k] == 0:
            row[k] = 0
          elif array[j][k] == 1 and row[k]:
            row[k] = max(row[k], 0) + 1
        if max(row) < 1:
          break;
    if max(row) > 0:
      total = 3 * sum(max(j, 0) for j in row)
      if total > best:
        rows = [j for j in range(26) if i & 1 << j]
        cols = [j for j in range(26) if row[j] > 0]
        total -= len(makeregex(rows)) + len(makeregex(cols)) + 1
        if total > best:
          best = total
          bestrows = rows
          bestcols = cols
      i += 1
    else:
      i += i & -i
  if best:
    perform(bestrows, bestcols)
  else:
    print '|'.join([alpha[j] + alpha[k] for j in range(26) for k in range(26) if array[j][k] > 0])
    break

This 282-byte regex results from its output:

\.?([bgmps][aeghmnrsty]|[cmst][cdghk-orvz]|[acn][cfgiloruz]|[agim][delmq-t]|[bck][ghimnrwyz]|[elm][cr-u]|[dft][jkmo]|[agpt][fltw]|[gnv][aegiu]|[lpu][aksy]|[gjkm][emp]|h[kmnrtu]|b[bdfjo]|r[eosuw]|[ls][biv]|[acms]x|[cmz][amw]|[dey]e|dz|eg|fi|fr|in|io|jo|np|om|qa|ug|uz|vc|vn|wf|ws|yt)

The program can also be seeded by adding calls to perform, so I tried seeding it with [cmst][cdghk-orvz]|[agim][delmq-t]|[dft][jkmo]|[acms]x but that just resulted in a rearranged regex. Obviously this wasn't good enough. The problem seemed to be that I was comparing the performance against a simple alternation. I therefore developed a slower but more accurate scoring function, which works by assuming that all of the remaining matches should be either x[yz] or [xy]z alternations, generating the largest alternations first, and compares the result with and without the trial regex.

def makeregex(list):
  if len(list) == 1:
    return alpha[list[0]]
  regex = "["
  j = list[0]
  while j < 26:
    if j in list:
      if j + 1 in list and j + 2 in list and j + 3 in list:
        regex += alpha[j] + "-"
        while j + 1 in list:
          j += 1
      regex += alpha[j]
    j += 1
  return regex + "]"

def simple(array):
  unused = [[j > 0 for j in row] for row in array]
  regex = ""
  while max(max(row) for row in unused):
    rowcount = [sum(row) for row in unused]
    colcount = [sum(col) for col in zip(*unused)]
    if max(colcount) > max(rowcount):
      colindex = colcount.index(max(colcount))
      col = zip(*unused)[colindex]
      regex += "|" + makeregex([j for j in range(26) if col[j]]) + alpha[colindex]
      for j in range(26):
        unused[j][colindex] = 0
    else:
      rowindex = rowcount.index(max(rowcount))
      row = unused[rowindex]
      regex += "|" + alpha[rowindex] + makeregex([j for j in range(26) if row[j]])
      unused[rowindex] = [0 for j in range(26)]
  return regex

def perform(rows, cols):
  previous = simple(array)[1:].split('|')
  for j in rows:
    for k in cols:
      if (array[j][k]):
        array[j][k] = -1
      else:
        raise Exception("filling illegal entry")
  regex = (makeregex(rows) + makeregex(cols) + simple(array)).split('|')
  replaced = '|'.join(j for j in previous if not j in regex)
  regex = '|'.join(j for j in regex if not j in previous)
  print regex, 'replaces', replaced, 'saving', len(replaced) - len(regex), 'bytes'

all = ['ac', 'ad', 'ae', 'af', 'ag', 'ai', 'al', 'am', 'ao', 'aq', 'ar', 'as',
       'at', 'au', 'aw', 'ax', 'az', 'ba', 'bb', 'bd', 'be', 'bf', 'bg', 'bh',
       'bi', 'bj', 'bm', 'bn', 'bo', 'br', 'bs', 'bt', 'bw', 'by', 'bz', 'ca',
       'cc', 'cd', 'cf', 'cg', 'ch', 'ci', 'ck', 'cl', 'cm', 'cn', 'co', 'cr',
       'cu', 'cv', 'cw', 'cx', 'cy', 'cz', 'de', 'dj', 'dk', 'dm', 'do', 'dz',
       'ec', 'ee', 'eg', 'er', 'es', 'et', 'eu', 'fi', 'fj', 'fk', 'fm', 'fo',
       'fr', 'ga', 'gd', 'ge', 'gf', 'gg', 'gh', 'gi', 'gl', 'gm', 'gn', 'gp',
       'gq', 'gr', 'gs', 'gt', 'gu', 'gw', 'gy', 'hk', 'hm', 'hn', 'hr', 'ht',
       'hu', 'id', 'ie', 'il', 'im', 'in', 'io', 'iq', 'ir', 'is', 'it', 'je',
       'jm', 'jo', 'jp', 'ke', 'kg', 'kh', 'ki', 'km', 'kn', 'kp', 'kr', 'kw',
       'ky', 'kz', 'la', 'lb', 'lc', 'li', 'lk', 'lr', 'ls', 'lt', 'lu', 'lv',
       'ly', 'ma', 'mc', 'md', 'me', 'mg', 'mh', 'mk', 'ml', 'mm', 'mn', 'mo',
       'mp', 'mq', 'mr', 'ms', 'mt', 'mu', 'mv', 'mw', 'mx', 'my', 'mz', 'na',
       'nc', 'ne', 'nf', 'ng', 'ni', 'nl', 'no', 'np', 'nr', 'nu', 'nz', 'om',
       'pa', 'pe', 'pf', 'pg', 'ph', 'pk', 'pl', 'pm', 'pn', 'pr', 'ps', 'pt',
       'pw', 'py', 'qa', 're', 'ro', 'rs', 'ru', 'rw', 'sa', 'sb', 'sc', 'sd',
       'se', 'sg', 'sh', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sr', 'ss', 'st',
       'sv', 'sx', 'sy', 'sz', 'tc', 'td', 'tf', 'tg', 'th', 'tj', 'tk', 'tl',
       'tm', 'tn', 'to', 'tr', 'tt', 'tv', 'tw', 'tz', 'ua', 'ug', 'uk', 'us',
       'uy', 'uz', 'va', 'vc', 've', 'vg', 'vi', 'vn', 'vu', 'wf', 'ws', 'ye',
       'yt', 'za', 'zm', 'zw']
alpha = 'abcdefghijklmnopqrstuvwxyz'
array = [[alpha[i] + alpha[j] in all for j in range(26)] for i in range(26)]

#perform([2, 12, 18, 19], [2, 3, 6, 7, 10, 11, 12, 13, 14, 17, 21, 25])
#perform([0, 6, 8, 12], [3, 4, 11, 12, 16, 17, 18, 19])
#perform([3, 5, 19], [9, 10, 12, 14])
#perform([0, 2, 12, 18], [23])

while max(max(row) for row in array) > 0:
  best = simple(array)[1:]
  bestrows = []
  bestcols = []
  i = 1
  while i < 1 << 26:
    row = [-1 for j in range(26)]
    for j in range(26):
      if (i & 1 << j):
        for k in range(26):
          if array[j][k] == 0:
            row[k] = 0
          elif array[j][k] == 1 and row[k]:
            row[k] = max(row[k], 0) + 1
        if max(row) < 1:
          break;
    if max(row) > 0:
      rows = [j for j in range(26) if i & 1 << j]
      cols = [j for j in range(26) if row[j] > 0]
      unused = [row[:] for row in array]
      for j in rows:
        for k in cols:
          unused[j][k] = 0
      total = makeregex(rows) + makeregex(cols) + simple(unused)
      if len(total) < len(best):
        best = total
        bestrows = rows
        bestcols = cols
      i += 1
    else:
      i += i & -i
  if bestrows:
    perform(bestrows, bestcols)
  else:
    print best
    break

Now without seeding, this actually produces the worse 286-byte regex:

\.?([bgp][ae-hmnrstwy]|[cst][cdghk-orvz]|[nv][acegiu]|m[acdeghk-z]|a[c-gilmoq-uwxz]|k[eghimnprwyz]|l[abcikr-vy]|i[del-oq-t]|s[abeistxy]|c[afiuwxy]|e[cegr-u]|b[bdijoz]|d[ejkmoz]|f[ijkmor]|g[dilpqu]|h[kmnrtu]|n[floprz]|u[agksyz]|r[eosuw]|j[emop]|t[fjtw]|z[amw]|p[kl]|w[fs]|y[et]|om|qa|vn)

When seeded with [cmst][cdghk-orvz] (which I had independently discovered manually before writing the first program) this improves to this 282-byte regex, thus equalling the first program:

\.?([cmst][cdghk-orvz]|[bgmps][aeghmnrsty]|[ag][d-gilmq-uw]|[pt][fkltw]|[acs][cioxz]|[nv][acegiu]|k[eghimnprwyz]|l[abcikr-vy]|i[del-oq-t]|b[bdfijowz]|e[cegr-u]|d[ejkmoz]|f[ijkmor]|h[kmnrtu]|n[floprz]|u[agksyz]|c[afuwy]|m[pquwx]|r[eosuw]|j[emop]|z[amw]|w[fs]|y[et]|gp|om|qa|sb|tj|vn)

After some experimentation I discovered that when seeded with [cmst][cdghk-orvz]|[agim][delmq-t]|[dft][jkmo]|[acms]x (selected from the output of the first version), it results in the given 275-byte regex.

For completeness, I think the best I achieved manually was the following 282-byte regex:

\.?(a[c-gilmoq-uwxz]|b[abd-jostz]|c[afiuwxy]|d[ejkmoz]|e[cegr-u]|f[ijkmor]|g[ad-ilp-u]|h[kmnrtu]|i[del-oq-t]|j[emop]|k[ipz]|l[abcikr-vy]|m[aep-y]|n[acefgilopruz]|om|p[aflst]|qa|r[eosuw]|s[abeistxy]|t[fjtw]|u[agksyz]|v[aceginu]|w[fs]|y[et]|z[amw]|[cmst][cdghk-orvz]|[bgkp][eghmnrwy])

Neil

Posted 2019-08-02T14:43:56.787

Reputation: 95 035

3

282 bytes

\.?([abcmpt][fglmnrw]|[abgim][delmnq-t]|[acms]x|[adkn][ez]|[aemsv][cegu]|[afn][ior]|[bcmst][dghl-orvz]|[bgls][abirsty]|[clmpu][aky]|[clnv][aciu]|[dfst][jkmo]|[egp][eghrst]|[gmnt][fglpr]|[gr][esuw]|[ir]o|[ty]t|bj|h[kmnrtu]|j[emop]|k[ghimnprwy]|lv|om|qa|tc|u[gmsz]|vn|w[fs]|ye|z[amw])

Try it online!

276 bytes—old list

\.?([abt][dfgmortwz]|[aclnv][ciu]|[acms]x|[agim][delmq-t]|[bcgkmp][ghmnrwy]|[bdijnr][eo]|[cgnp][afglr]|[cmst][cdghk-orvz]|[dft][jkmo]|[dknu]z|[epsy][et]|[fgks]i|[gjkmn][ep]|[gmv][aegnu]|[jo]m|[psu][agksy]|b[abijs]|e[cgrsu]|fr|h[kmnrtu]|in|l[abkr-vy]|qa|r[suw]|sb|w[fs]|z[amw])

Try it online!

Anders Kaseorg

Posted 2019-08-02T14:43:56.787

Reputation: 29 242

Back in 2014 you claimed an incredible 17 character score on Glob of Regex Golf. Nobody has come close; somebody got 21 eventually; I've only been able to get 22 (the best published one is 23). Could you please share your amazing solution now? I think it's been long enough :) Or give a hint as to what schema was used. It would be sad for your solution to disappear into the mists of time. (My attempts to email you have failed, so I'm trying this now.)

– Deadcode – 2019-12-24T13:04:44.760