Regex: Match an egalitarian series

18

2

Introduction

I don't see many regex challenges on here, so I would like to offer up this deceptively simple one which can be done in a number of ways using a number of regex flavours. I hope it provides regex enthusiasts with a bit of fun golfing time.

Challenge

The challenge is to match what I've very loosely dubbed an "egalitarian" series: a series of equal numbers of different characters. This is best described with examples.

Match:

aaabbbccc
xyz 
iillppddff
ggggggoooooollllllffffff
abc
banana

Don't match:

aabc
xxxyyzzz
iilllpppddff
ggggggoooooollllllfff
aaaaaabbbccc
aaabbbc
abbaa
aabbbc

To generalize, we want to match a subject of the form (c1)n(c2)n(c3)n...(ck)n for any list of characters c1 to ck, where ci != ci+1 for all i, k > 1, and n > 0.

Clarifications:

  • Input will not be empty.

  • A character may repeat itself later in the string (eg. "banana")

  • k > 1, so there will always be at least 2 different characters in the string.

  • You can assume only ASCII characters will be passed as input and no character will be a line terminator.

Rules

(Thank you to Martin Ender for this excellently-stated block of rules)

Your answer should consist of a single regex, without any additional code (except, optionally, a list of regex modifiers required to make your solution work). You must not use features of your language's regex flavour that allow you to invoke code in the hosting language (e.g. Perl's e modifier).

You can use any regex flavour which existed before this challenge, but please specify the flavour.

Do not assume that the regex is anchored implicitly, e.g. if you're using Python, assume that your regex is used with re.search and not with re.match. Your regex must match the entire string for valid egalitarian strings and yield no matches for invalid strings. You may use as many capturing groups as you wish.

You may assume that the input will always be a string of two or more ASCII characters not containing any line-terminators.

This is regex golf, so the shortest regex in bytes wins. If your language requires delimiters (usually /.../) to denote regular expressions, don't count the delimiters themselves. If your solution requires modifiers, add one byte per modifier.

Criteria

This is good ol' fashioned golf, so forget efficiency and just try to get your regex as small as possible.

Please mention which regex flavour you have used and, if possible, include a link showing an online demo of your expression in action.

jaytea

Posted 2017-11-19T07:59:26.750

Reputation: 467

Is this specifically a regex golf? You should probably clarify that, along with the rules for it. Most challenges on this site are golfs of assorted programming languages. – LyricLy – 2017-11-19T09:14:40.720

@LyricLy Thanks for the advice! Yes, I would like it to be purely regex, ie. a single regular expression in a regex flavour of the submitter's choice. Are there any other rules I should be mindful of including? – jaytea – 2017-11-19T10:48:57.493

I don't understand your definition of "egalitarian", such that banana is egalitarian. – msh210 – 2017-11-20T11:31:14.173

@msh210 When I came up with the term "egalitarian" to describe the series, I didn't consider that I would allow characters to be repeated later in the series (such as in "banana", or "aaabbbcccaaa", etc.). I just wanted a term to represent the idea that every chunk of repeated characters is the same size. Since "banana" has no repeated characters, this definition holds true for it. – jaytea – 2017-11-20T12:44:52.227

Answers

11

.NET flavour, 48 bytes

^(.)\1*((?<=(\5())*(.))(.)(?<-4>\6)*(?!\4|\6))+$

Try it online! (using Retina)

Well, turns out that not negating the logic is simpler after all. I'm making this a separate answer, because the two approaches are completely different.

Explanation

^            # Anchor the match to the beginning of the string.
(.)\1*       # Match the first run of identical characters. In principle, 
             # it's possible that this matches only half, a quarter, an 
             # eighth etc of of the first run, but that won't affect the 
             # result of the match (in other words, if the match fails with 
             # matching this as the entire first run, then backtracking into
             # only matching half of it won't cause the rest of the regex to
             # match either).
(            # Match this part one or more times. Each instance matches one
             # run of identical letters.
  (?<=       #   We start with a lookbehind to record the length
             #   of the preceding run. Remember that the lookbehind
             #   should be read from the bottom up (and so should
             #   my comments).
    (\5())*  #     And then we match all of its adjacent copies, pushing an
             #     empty capture onto stack 4 each time. That means at the
             #     end of the lookbehind, we will have n-1 captures stack 4, 
             #     where n is the length of the preceding run. Due to the 
             #     atomic nature of lookbehinds, we don't have to worry 
             #     about backtracking matching less than n-1 copies here.
    (.)      #     We capture the character that makes up the preceding
             #     run in group 5.
  )
  (.)        #   Capture the character that makes up the next run in group 6.
  (?<-4>\6)* #   Match copies of that character while depleting stack 4.
             #   If the runs are the same length that means we need to be
             #   able to get to the end of the run at the same time we
             #   empty stack 4 completely.
  (?!\4|\6)  #   This lookahead ensures that. If stack 4 is not empty yet,
             #   \4 will match, because the captures are all empty, so the
             #   the backreference can't fail. If the stack is empty though,
             #   then the backreference will always fail. Similarly, if we
             #   are not at the end of the run yet, then \6 will match 
             #   another copy of the run. So we ensure that neither \4 nor
             #   \6 are possible at this position to assert that this run
             #   has the same length das the previous one.
)+
$            # Finally, we make sure that we can cover the entire string
             # by going through runs of identical lengths like this.

Martin Ender

Posted 2017-11-19T07:59:26.750

Reputation: 184 808

I love that you seesawed between the two methods! I also thought the negative approach ought to have been shorter until I actually attempted it and found it a lot more awkward (even though it feels like it should be simpler). I have 48b in PCRE, and 49b in Perl with a completely different method, and with your 3rd method in .NET around the same size, I'd say this is quite a cool regex challenge :D – jaytea – 2017-11-19T13:06:43.240

@jaytea I'd love to see those. If no one comes up with anything for a week or so, I hope you'll post them yourself. :) And yeah agreed, it's nice that the approaches are so close in byte count. – Martin Ender – 2017-11-19T15:02:30.647

I just might! Also, Perl one has been golfed down to 46b ;) – jaytea – 2017-11-19T16:44:22.127

So I figured you might want to see these now! Here's 48b in PCRE: ((^.|\2(?=.*\4\3)|\4(?!\3))(?=\2*+((.)\3?)))+\3$ I was experimenting with \3* in place of (?!\3) to make it 45b but that fails on "aabbbc" :( The Perl version is easier to understand, and it's down to 45b now: ^((?=(.)\2*(.))(?=(\2(?4)?\3)(?!\3))\2+)+\3+$ - the reason I call it Perl even though it appears to be valid PCRE is that PCRE thinks (\2(?4)?\3) could recurse indefinitely whereas Perl is a little smarter/forgiving! – jaytea – 2017-11-27T11:58:23.727

@jaytea Ah, those are really neat solutions. You should really post them in a separate answer. :) – Martin Ender – 2017-11-27T12:11:07.890

Thanks :) I may do when I have time to write out explanations – jaytea – 2017-12-02T08:47:52.700

9

.NET flavour, 54 bytes

^(?!.*(?<=(\2)*(.))(?!\2)(?>(.)(?<-1>\3)*)(?(1)|\3)).+

Try it online! (using Retina)

I'm pretty sure this is suboptimal, but it's the best I'm coming up with for balancing groups right now. I've got one alternative at the same byte count, which is mostly the same:

^(?!.*(?<=(\3())*(.))(?!\3)(?>(.)(?<-2>\4)*)(\2|\4)).+

Explanation

The main idea is to invert the problem, match non-egalitarian strings and put the whole thing in a negative lookahead to negate the result. The benefit is that we don't have to keep track of n throughout the entire string (because due to the nature of balancing groups, you usually consume n when checking it), to check that all runs are of equal length. Instead, we just look for a single pair of adjacent runs that don't have the same length. That way, I only need to use n once.

Here is a breakdown of the regex.

^(?!.*         # This negative lookahead means that we will match
               # all strings where the pattern inside the lookahead
               # would fail if it were used as a regex on its own.
               # Due to the .* that inner regex can match from any
               # position inside the string. The particular position
               # we're looking for is between two runs (and this
               # will be ensured later).

  (?<=         #   We start with a lookbehind to record the length
               #   of the preceding run. Remember that the lookbehind
               #   should be read from the bottom up (and so should
               #   my comments).
    (\2)*      #     And then we match all of its adjacent copies, capturing
               #     them separately in group 1. That means at the
               #     end of the lookbehind, we will have n-1 captures
               #     on stack 1, where n is the length of the preceding
               #     run. Due to the atomic nature of lookbehinds, we
               #     don't have to worry about backtracking matching
               #     less than n-1 copies here.
    (.)        #     We capture the character that makes up the preceding
               #     run in group 2.
  )
  (?!\2)       #   Make sure the next character isn't the same as the one
               #   we used for the preceding run. This ensures we're at a
               #   boundary between runs.
  (?>          #   Match the next stuff with an atomic group to avoid
               #   backtracking.
    (.)        #     Capture the character that makes up the next run
               #     in group 3.
    (?<-1>\3)* #     Match as many of these characters as possible while
               #     depleting the captures on stack 1.
  )
               #   Due to the atomic group, there are three two possible
               #   situations that cause the previous quantifier to stopp
               #   matching. 
               #   Either the run has ended, or stack 1 has been depleted.
               #   If both of those are true, the runs are the same length,
               #   and we don't actually want a match here. But if the runs
               #   are of different lengths than either the run ended but
               #   the stack isn't empty yet, or the stack was depleted but
               #   the run hasn't ended yet.
  (?(1)|\3)    #   This conditional matches these last two cases. If there's
               #   still a capture on stack 1, we don't match anything,
               #   because we know this run was shorter than the previous
               #   one. But if stack 1, we want to match another copy of 
               #   the character in this run to ensure that this run is 
               #   longer than the previous one.
)
.+             # Finally we just match the entire string to comply with the
               # challenge spec.

Martin Ender

Posted 2017-11-19T07:59:26.750

Reputation: 184 808

I tried to make it fail on: banana, aba, bbbaaannnaaannnaaa, bbbaaannnaaannnaaaaaa, The Nineteenth Byte, 11, 110, ^(?!.*(?<=(\2)*(.))(?!\2)(?>(.)(?<-1>\3)*)(?(1)|\3)).+, bababa. It's me who failed. :( +1 – Erik the Outgolfer – 2017-11-19T11:12:45.093

1That moment when you finish your explanation and then figure out that you can save 1 byte by using the exact opposite approach... I guess I'll make another answer in a bit... :| – Martin Ender – 2017-11-19T11:34:01.790

@MartinEnder ... And then realise you could golf this one by 2 bytes haha :P – Mr. Xcoder – 2017-11-19T11:36:06.157

@Mr.Xcoder Would have to be 7 bytes now, so I hope I'm safe. ;) – Martin Ender – 2017-11-19T12:02:51.323