Similar to our threads for language-specific golfing tips: what are general tricks to shorten regular expressions?

I can see three uses of regex when it comes to golfing: classic regex golf ("here is a list that should match, and here is a list that should fail"), using regex to solve computational problems and regular expressions used as parts of larger golfed code. Feel free to post tips addressing any or all of these. If your tip is limited to one or more flavours, please state these flavours at the top.

As usual, please stick to one tip (or family of very closely related tips) per answer, so that the most useful tips can rise to the top via voting.

Martin Ender

Posted 2015-03-05T11:06:40.263

Reputation: 184 808

Flagrant self-promotion: what category of regex-use does this fall into? http://codegolf.stackexchange.com/a/37685/8048

– Kyle Strand – 2015-03-08T08:53:06.547

@KyleStrand "regular expressions used as parts of larger golfed code." – Martin Ender – 2015-03-08T12:42:37.647

Answers

When not to escape

These rules apply to most flavours, if not all:

] doesn't need escaping when unmatched.
{ and } don't need escaping when they are not part of a repetition, e.g. {a} matches {a} literally. Even if you want to match something like {2}, you only need to escape one of them, e.g. {2\}.

In character classes:

] doesn't need escaping when it's the first character in a character set, e.g. []abc] matches one of ]abc, or when it's the second character after a ^, e.g. [^]] matches anything but ]. (Notable exception: ECMAScript flavour!)
[ doesn't need escaping at all. Together with the above tip, this means you can match both brackets with the horribly counter-intuitive character class [][].
^ doesn't need escaping when it's not the first character in a character set, e.g. [ab^c].
- doesn't need escaping when it's either the first (second after a ^) or last character in a character set, e.g. [-abc], [^-abc] or [abc-].
No other characters need escaping inside a character class, even if they are meta characters outside of character classes (except for backslash \ itself).

Also, in some flavours ^ and $ are matched literally when they are not at the start or end of the regex respectively.

(Thanks to @MartinBüttner for filling in a few details)
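A few of these rules can be sanity-checked in Python's re module. This is only a sketch for that one flavour, and the test strings are made up for illustration; other flavours may behave differently.

```python
import re

# pattern -> string it should match in full (examples made up for illustration)
checks = {
    r'[]abc]+': 'a]cb',   # ] first in the class
    r'[^]]+':   'abc[',   # ] right after ^
    r'[][]+':   '[]][',   # both brackets, neither escaped
    r'[ab^c]+': 'a^b',    # ^ not first
    r'[-abc]+': 'a-b',    # - first
    r'[abc-]+': 'a-b',    # - last
    r'{a}':     '{a}',    # braces that are not a repetition
    r'x{2\}':   'x{2}',   # only one brace escaped
    r'a]':      'a]',     # unmatched ] outside a class
}
for pattern, text in checks.items():
    print(pattern, bool(re.fullmatch(pattern, text)))
```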

Sp3000

Posted 2015-03-05T11:06:40.263

Reputation: 58 729

Some prefer escaping the actual dot by enclosing it in a character class where it doesn't need escaping (eg. [.]). Escaping it normally would save 1 byte in this case \. – CSᵠ – 2015-03-06T23:14:25.720

Note that [ must be escaped in Java. Not sure about ICU (used in Android and iOS) or .NET, though. – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2015-03-09T06:14:57.890

A simple regular expression to match all printable characters in the ASCII table.

[ -~]
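The range runs from space (0x20) to tilde (0x7E). A minimal check in Python's re, assuming plain str input:

```python
import re

printable = ''.join(chr(c) for c in range(0x20, 0x7F))  # space .. tilde
all_match = bool(re.fullmatch(r'[ -~]+', printable))

# TAB, newline and DEL fall outside the range
outside = [c for c in '\t\n\x7f' if re.fullmatch(r'[ -~]', c)]
print(len(printable), all_match, outside)
```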

hwnd

Posted 2015-03-05T11:06:40.263

Reputation: 433

pure awesomeness, all the chars from a standard US keyboard! note: the standard ascii table (not including the extended range 127-255) – CSᵠ – 2015-03-07T08:07:37.997

I use it often, but it is missing a common "regular" character: TAB. And it assumes you are using LC_ALL="C" (or similar) as some other locales will fail. – Olivier Dulac – 2016-12-28T18:27:06.013

Can the hyphen be used like that to specify any range of characters in the ASCII table? Does that work for all flavours of regex? – Josh Withee – 2017-12-14T16:03:14.703

Know your regex flavours

A surprising number of people think that regular expressions are essentially language agnostic. However, there are actually quite substantial differences between flavours, and especially for code golf it's good to know a few of them, and their interesting features, so you can pick the best for each task. Here is an overview of several important flavours and what sets them apart from others. (This list can't really be complete, but let me know if I missed something really glaring.)

Perl and PCRE

I'm throwing these into a single pot, as I'm not too familiar with the Perl flavour and they're mostly equivalent (PCRE is for Perl-Compatible Regular Expressions after all). The main advantage of the Perl flavour is that you can actually call Perl code from inside the regex and substitution.

Recursion/subroutines. Probably the most important feature for golfing (which only exists in a couple of flavours).
Conditional patterns (?(group)yes|no).
Supports change of case in the replacement string with \l, \u, \L and \U.
PCRE allows alternation in lookbehinds, where each alternative can have a different (but fixed) length. (Most flavours, including Perl require lookbehinds to have an overall fixed length.)
\G to anchor a match to the end of the previous match.
\K to reset the beginning of the match.
PCRE supports both Unicode character properties and scripts.
\Q...\E to escape longer runs of characters. Useful when you're trying to match a string that contains many meta-characters.
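Conditional patterns are not exclusive to Perl/PCRE. Python's re happens to accept the same (?(group)yes|no) syntax, so a minimal sketch can be run there. The bracketed-word example is made up for illustration.

```python
import re

# (<)? optionally captures an opening bracket; (?(1)>) then demands the
# closing bracket only if group 1 took part in the match.
tag = re.compile(r'(<)?\w+(?(1)>)')

results = {s: bool(tag.fullmatch(s)) for s in ['<a>', 'a', '<a', 'a>']}
print(results)
```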

.NET

This is probably the most powerful flavour, with only very few shortcomings.

Supports arbitrary-length lookbehinds, which are matched right-to-left.
Has the unique concept of balancing groups - originally designed to match strings with balanced parentheses, they can be used to perform arithmetic and match 2D patterns.
Also supports conditional patterns.
Has concise syntax for character class difference: [\w-[aeiou]]
The normal character classes like \d are Unicode aware.
Supports Unicode categories and blocks.

One important shortcoming in terms of golfing is that it doesn't support possessive quantifiers like some other flavours. Instead of .?+ you'll have to write (?>.?).

Java

Due to a bug (see Appendix) Java supports a limited type of variable-length lookbehind: you can lookbehind all the way to the beginning of the string with .* from where you can now start a lookahead, like (?<=(?=lookahead).*).
Supports union and intersection of character classes.
Has the most extensive support for Unicode, with character classes for "Unicode scripts, blocks, categories and binary properties".
\Q...\E as in Perl/PCRE.

Ruby

In recent versions, this flavour is about as powerful as PCRE, including the support for subroutine calls. Like Java, it also supports union and intersection of character classes. One special feature is the built-in character class for hex digits: \h (and the negated \H).

The most useful feature for golfing though is how Ruby handles quantifiers. Most notably, it's possible to nest quantifiers without parentheses. .{5,7}+ works and so does .{3}?. Also, as opposed to most other flavours, if the lower bound on a quantifier is 0 it can be omitted, e.g. .{,5} is equivalent to .{0,5}.

As for subroutines, the major difference between PCRE's subroutines and Ruby's subroutines is that Ruby's syntax is a byte longer (\g<n> in Ruby vs (?n) in PCRE), but Ruby's subroutines can be used for capturing, whereas PCRE resets captures after a subroutine finishes.

Finally, Ruby has different semantics for line-related modifiers than most other flavours. The modifier that's usually called m in other flavours is always on in Ruby. So ^ and $ always match the beginning and end of a line not just the beginning and end of the string. This can save you a byte if you need this behaviour, but it will cost you extra bytes if you don't, because you'll have to replace ^ and $ with \A and \z, respectively. In addition to that, the modifier that is usually called s (which makes . match linefeeds) is called m in Ruby instead. This doesn't affect byte counts, but should be kept in mind to avoid confusion.

Python

Python has a solid flavour, but I'm not aware of any particularly useful features that you wouldn't find anywhere else.

However, there is an alternative flavour (the third-party regex module) which is intended to replace the re module at some point, and which contains a lot of interesting features. In addition to adding support for recursion, variable-length lookbehinds and character class combination operators, it also has the unique feature of fuzzy matching. In essence you can specify a number of errors (insertions, deletions, substitutions) which are allowed, and the engine will also give you approximate matches.

ECMAScript

The ECMAScript flavour is very limited, and hence rarely very useful for golfing. The only thing it's got going for it is the negated empty character class [^] to match any character as well as the unconditionally failing empty character class [] (as opposed to the usual (?!)). Unfortunately, the flavour does not have any features which make the latter useful for normal problems.

Lua

Lua has its own fairly unique flavour, which is quite limited (e.g. you can't even quantify groups) but does come with a handful of useful and interesting features.

It's got a large number of shorthands for built-in character classes, including punctuation, upper/lower case characters and hex digits.
With %b it supports a very compact syntax to match balanced strings. E.g. %b() matches a ( and then everything up to a matching ) (correctly skipping inner matched pairs). ( and ) can be any two characters here.

Boost

Boost's regex flavour is essentially Perl's. However, it has some nice new features for regex substitution, including case changes and conditionals. The latter is unique to Boost as far as I'm aware.

Martin Ender

Posted 2015-03-05T11:06:40.263

Reputation: 184 808

Note that look-ahead in look-behind will punch through the bound limit in the look-behind. Tested in Java and PCRE. – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2015-03-09T05:57:12.397

Isn't .?+ equivalent to .*? – CalculatorFeline – 2017-02-23T22:04:58.580

@CalculatorFeline The former is a possessive 0-or-1 quantifier (in flavours that support possessive quantifiers), the latter is a 0-or-more quantifier. – Martin Ender – 2017-02-23T22:06:53.363

@CalculatorFeline ah I understand the confusion. There was a typo. – Martin Ender – 2017-02-23T22:07:56.823

Know your character classes

Most regex flavours have predefined character classes. For example, \d matches a decimal digit, which is three bytes shorter than [0-9]. Yes, they might be slightly different, as \d may also match Unicode digits in some flavours, but for most challenges this won't make a difference.

Here are some character classes found in most regex flavours:

\d      Match a decimal digit character
\s      Match a whitespace character
\w      Match a word character (typically [a-zA-Z0-9_])

In addition, we also have:

\D \S \W

which are negated versions of the above.

Be sure to check your flavour for any additional escape codes it might have. For example, PCRE has \R for newlines and Lua even has classes for lowercase and uppercase characters.
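The difference between \d and [0-9] can be seen in Python 3, where str patterns are Unicode-aware by default. This is a small sketch for that flavour only; the sample digit is chosen for illustration.

```python
import re

arabic_three = '\u0663'  # ARABIC-INDIC DIGIT THREE

unicode_d = bool(re.fullmatch(r'\d', arabic_three))            # Unicode-aware \d
ascii_d   = bool(re.fullmatch(r'\d', arabic_three, re.ASCII))  # re.ASCII restricts \d to [0-9]
explicit  = bool(re.fullmatch(r'[0-9]', arabic_three))         # explicit range
print(unicode_d, ascii_d, explicit)
```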

(Thanks to @HamZa and @MartinBüttner for pointing these out)

Sp3000

Posted 2015-03-05T11:06:40.263

Reputation: 58 729

\R for newlines in PCRE. – HamZa – 2015-03-05T11:56:15.873

Don't bother with non-capturing groups (unless...)

This tip applies to (at least) all popular Perl-inspired flavours.

This may be obvious, but (when not golfing) it's good practice to use non-capturing groups (?:...) whenever possible. These two extra characters ?: are wasteful when golfing though, so just use capturing groups, even if you're not going to backreference them.

There's one (rare) exception though: if you happen to backreference group 10 at least 3 times, you can actually save bytes by turning an earlier group into a non-capturing group, such that all those \10s become \9s. (Similar tricks apply, if you use group 11 at least 5 times and so on.)
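The byte arithmetic for the group-10 case can be checked with a toy pattern. The sketch below uses Python's re and a made-up pattern of ten single-letter groups; it only illustrates the counting.

```python
import re

# Ten capturing groups, with group 10 backreferenced three times.
capturing = r'(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10\10\10'
# Make one earlier group non-capturing so the last group becomes group 9.
renumbered = r'(?:a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\9\9\9'

text = 'abcdefghij' + 'jjj'
same_match = bool(re.fullmatch(capturing, text)) and bool(re.fullmatch(renumbered, text))
print(same_match, len(capturing), len(renumbered))
```

With only two backreferences the two forms would tie, which is why three uses are the threshold.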

Martin Ender

Posted 2015-03-05T11:06:40.263

Reputation: 184 808

Why does 11 need 5 times to be worth it when 10 requires 3? – Fund Monica's Lawsuit – 2016-03-11T02:58:40.400

@QPaysTaxes being able to use $9 instead of $10 or $11 once saves one byte. Turning $10 into $9 requires one ?:, which is two bytes, so you'll need three $10s to save something. Turning $11 into $9 requires two ?:s which is four bytes, so you'll need five $11s to save something (or five of $10 and $11 combined). – Martin Ender – 2016-03-11T08:06:08.163

Recursion for pattern reuse

A handful of flavours support recursion (to my knowledge, Perl, PCRE and Ruby). Even when you're not trying to solve recursive problems, this feature can save a lot of bytes in more complicated patterns. There is no need to make the call to another (named or numbered) group inside that group itself. If you have a certain pattern that appears several times in your regex, just group it and refer to it outside that group. This is no different from a subroutine call in normal programming languages. So instead of

...someComplexPatternHere...someComplexPatternHere...someComplexPatternHere...

in Perl/PCRE you could do:

...(someComplexPatternHere)...(?1)...(?1)...

or in Ruby:

...(someComplexPatternHere)...\g<1>...\g<1>...

provided that is the first group (of course, you can use any number in the recursive call).

Note that this is not the same as a backreference (\1). Backreferences match the exact same string that the group matched last time. These subroutine calls actually evaluate the pattern again. As an example for someComplexPatternHere take a lengthy character class:

a[0_B!$]b[0_B!$]c[0_B!$]d

This would match something like

aBb0c!d

Note that you cannot use backreferences here while preserving the behaviour. A backreference would fail on the above string, because B and 0 and ! are not the same. However, with subroutine calls, the pattern is actually reevaluated. The above pattern is completely equivalent to

a([0_B!$])b(?1)c(?1)d
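Python's re has no subroutine calls, so the sketch below substitutes string interpolation for (?1). That is a stand-in technique, not the Perl/PCRE syntax, but it shows the same contrast between re-evaluating a pattern and backreferencing matched text.

```python
import re

C = '[0_B!$]'                      # the reused sub-pattern
expanded = f'a{C}b{C}c{C}d'        # what a([0_B!$])b(?1)c(?1)d is equivalent to
backref  = r'a([0_B!$])b\1c\1d'    # NOT equivalent: \1 repeats the matched text

reevaluated    = bool(re.fullmatch(expanded, 'aBb0c!d'))  # each slot re-evaluates the class
backref_mixed  = bool(re.fullmatch(backref,  'aBb0c!d'))  # B, 0 and ! differ
backref_repeat = bool(re.fullmatch(backref,  'aBbBcBd'))  # same character three times
print(reevaluated, backref_mixed, backref_repeat)
```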

Capturing in subroutine calls

One note of caution for Perl and PCRE: if group 1 in the above examples contains further groups, then the subroutine calls will not remember their captures. Consider this example:

(\w(\d):)\2 (?1)\2 (?1)\2

This will not match

x1:1 y2:2 z3:3

because after the subroutine calls return, the new capture of group 2 is discarded. Instead, this pattern would match this string:

x1:1 y2:1 z3:1

This is different from Ruby, where subroutine calls do retain their captures, so the equivalent Ruby regex (\w(\d):)\2 \g<1>\2 \g<1>\2 would match the first of the examples above.

Martin Ender

Posted 2015-03-05T11:06:40.263

Reputation: 184 808

You can use \1 for Javascript. And PHP too (I guess). – Ismael Miguel – 2015-03-05T23:13:29.360

@IsmaelMiguel This is not a backreference. This actually evaluates the pattern again. For instance (..)\1 would match abab but fail on abba whereas (..)(?1) will match the latter. It's actually a subroutine call in the sense that the expression is applied again, instead of literally matching what it matched last time. – Martin Ender – 2015-03-05T23:16:03.280

Wow, I had no idea! Learning something new everyday – Ismael Miguel – 2015-03-05T23:22:23.420

In .NET (or other flavors without this feature): (?=a.b.c)(.[0_B!$]){3}d – jimmy23013 – 2015-03-13T08:45:46.897

@user23013 that seems very specific to this particular example. I'm not sure that's applicable if I reuse a certain subpattern in various lookarounds. – Martin Ender – 2015-03-13T09:19:19.907

Causing a match to fail

When using regex to solve computational problems or match highly non-regular languages, it is sometimes necessary to make a branch of the pattern fail regardless of where you are in the string. The naive approach is to use an empty negative lookahead:

(?!)

The contents (the empty pattern) always match, so the negative lookahead always fails. But more often than not, there is a much simpler option: just use a character you know will never appear in the input. E.g. if you know your input will always consist only of digits, you can simply use

a

or any other non-digit, non-meta character to cause failure.

Even if your input could potentially contain any substrings whatsoever, there are shorter ways than (?!). Any flavour which allows anchors to appear in the middle of a pattern, rather than only at its start or end, can use either of the following 2-character solutions:

a^
$a

Note however that some flavours will treat ^ and $ as literal characters in these positions, because they obviously don't actually make sense as anchors.

In the ECMAScript flavour there is also the rather elegant 2-character solution

[]

This is an empty character class, which tries to make sure that the next character is one of those in the class - but there are no characters in the class, so this always fails. Note that this won't work in any other flavour, because character classes can't usually be empty.
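Python's re is one flavour where the anchor variants work (the empty class [] is a syntax error there). A minimal sketch, with made-up inputs, that looks for any accidental match:

```python
import re

never = [r'(?!)', r'a^', r'$a']          # three always-failing patterns
inputs = ['', 'a', 'aa', 'a\na', '123']  # sample inputs, made up for illustration

# Collect every (pattern, input) pair that matches; none should.
hits = [(p, s) for p in never for s in inputs if re.search(p, s)]
print(hits)
```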

Martin Ender

Posted 2015-03-05T11:06:40.263

Reputation: 184 808

Optimize your ORs

Whenever you have 3 or more alternatives in your RegEx:

/aliceblue|antiquewhite|aquamarine|azure/

Check to see if there's a common start:

/a(liceblue|ntiquewhite|quamarine|zure)/

And maybe even a common ending?

/a(liceblu|ntiquewhit|quamarin|zur)e/

Note: with a one-character common prefix, 3 alternatives is only the break-even point (the length stays the same); 4 or more make a difference.

But what if not all of them have a common prefix? (whitespace only added for clarity)

/aliceblue|antiquewhite|aqua|aquamarine|azure
|beige|bisque|black|blanchedalmond|blue|blueviolet|brown|burlywood
|cadetblue|chartreuse|chocolate|coral|cornflowerblue|cornsilk|crimson|cyan/

Group them, as long as the 3+ rule makes sense:

/a(liceblue|ntiquewhite|qua|quamarine|zure)
|b(eige|isque|lack|lanchedalmond|lue|lueviolet|rown|urlywood)
|c(adetblue|hartreuse|hocolate|oral|ornflowerblue|ornsilk|rimson|yan)/

Or even generalise if the entropy satisfies your usecase:

/\w(liceblue|ntiquewhite|qua|quamarine|zure
|eige|isque|lack|lanchedalmond|lue|lueviolet|rown|urlywood
|adetblue|hartreuse|hocolate|oral|ornflowerblue|ornsilk|rimson|yan)/

(^ in this case we're sure we don't get any clue or crown slack Ryan)

According to some tests, this also improves performance, as it provides an anchor to start at.
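The savings for the first four colour names can be measured directly. A small Python sketch, which also checks that the factored patterns still accept exactly the same names:

```python
import re

names = ['aliceblue', 'antiquewhite', 'aquamarine', 'azure']
plain    = 'aliceblue|antiquewhite|aquamarine|azure'
prefixed = 'a(liceblue|ntiquewhite|quamarine|zure)'
both     = 'a(liceblu|ntiquewhit|quamarin|zur)e'

for pattern in (plain, prefixed, both):
    # every name must still match, and a non-member must still be rejected
    assert all(re.fullmatch(pattern, n) for n in names)
    assert not re.fullmatch(pattern, 'aqua')
    print(len(pattern), pattern)
```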

CSᵠ

Posted 2015-03-05T11:06:40.263

Reputation: 484

If the common start or end is longer than one character, even grouping two can make a difference. Like aqua|aquamarine → aqua(|marine) or aqua(marine)?. – Paŭlo Ebermann – 2015-03-08T11:16:40.040

This one is fairly simple, but worth stating:

If you find yourself repeating the character class [a-zA-Z] you can probably just use [a-z] and append the i (case-insensitive modifier) to your regex.

For example, in Ruby, the following two regexes are equivalent:

/[a-zA-Z]+\d{3}[a-zA-Z]+/
/[a-z]+\d{3}[a-z]+/i - 5 bytes shorter

For that matter, the other modifiers can shorten your total length as well. Instead of doing this:

/(.|\n)/

which matches ANY character (because dot doesn't match newline), use the single-line modifier s, which makes dot match newlines.

/./s - 4 bytes shorter
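The same two modifiers exist in Python's re, where they can be written inline as (?i) and (?s). A minimal sketch on a few made-up ASCII samples (Unicode case folding can make i match a little more than [a-zA-Z] on non-ASCII input):

```python
import re

long_form  = r'[a-zA-Z]+\d{3}[a-zA-Z]+'
short_form = r'(?i)[a-z]+\d{3}[a-z]+'      # inline form of the i modifier

samples = ['Ab123cD', 'ab123', 'AB12cd']
agree = all(bool(re.fullmatch(long_form, s)) == bool(re.fullmatch(short_form, s))
            for s in samples)

dot_default = re.findall(r'.', 'a\nb')       # newline is skipped
dot_all     = re.findall(r'(?s).', 'a\nb')   # s modifier: dot matches newline too
print(agree, dot_default, dot_all)
```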

In Ruby, there are a ton of built-in Character Classes for regex. See this page and search for "Character Properties".
A great example is the "Currency Symbol". According to Wikipedia there are a ton of possible currency symbols, and to put them in a character class would be very expensive ([$฿¢₡Ð₫€.....]) whereas you can match any of them in 6 bytes: \p{Sc}

Devon Parsons

Posted 2015-03-05T11:06:40.263

Reputation: 173

Excepting JavaScript, where s modifier is not supported. :( But there you can use JavaScript's proprietary /[^]/ trick. – manatwork – 2015-03-05T15:08:55.853

Note that (.|\n) doesn't even work in some flavours, because . often also doesn't match other types of line separators. However, the customary way to do this (without s) is [\s\S] which is the same byte count as (.|\n). – Martin Ender – 2015-03-05T23:55:33.023

@MartinBüttner, my idea was to keep it together with the other line ending related tips. But if you feel this answer is more about modifiers, I have no objections if you repost it. – manatwork – 2015-03-06T08:40:58.087

@manatwork done (and added a related non-ES specific trick as well) – Martin Ender – 2015-03-06T09:07:59.147

A simple language parser

You can build a very simple parser with an RE like \d+|\w+|".*?"|\n|\S. The tokens you need to match are separated with the RE 'or' character.

Each time the RE engine tries to match at the current position in the text, it will try the first pattern, then the second, etc. If it fails (on a space character here for example), it moves on and tries the matches again. Order is important. If we placed the \S term before the \d+ term, the \S would match first on any non-space character which would break our parser.

The ".*?" string matcher uses a non-greedy modifier so we only match one string at a time. If your RE doesn't have non-greedy functions, you can use "[^"]*" which is equivalent.

Python Example:

import re  # needed for re.findall

text = 'd="dogfinder"\nx=sum(ord(c)*872 for c in "fish"+d[3:])'
pat = r'\d+|\w+|".*?"|\n|\S'
print(re.findall(pat, text))  # parenthesised print works in Python 2 and 3

['d', '=', '"dogfinder"', '\n', 'x', '=', 'sum', '(', 'ord', '(', 'c', ')',
    '*', '872', 'for', 'c', 'in', '"fish"', '+', 'd', '[', '3', ':', ']', ')']

Golfed Python Example:

# assume `from re import*`, language text in A, and a token processing function P
map(P,findall(r'\d+|\w+|".*?"|\n|\S',A))

You can adjust the patterns and their order for the language you need to match. This technique works well for JSON, basic HTML, and numeric expressions. It has been used successfully many times with Python 2, but should be general enough to work in other environments.
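The point about ordering can be checked by moving \S to the front of the same pattern. A short sketch with a made-up input:

```python
import re

good = r'\d+|\w+|".*?"|\n|\S'
bad  = r'\S|\d+|\w+|".*?"|\n'    # same alternatives, \S moved to the front

good_tokens = re.findall(good, 'x = 12')
bad_tokens  = re.findall(bad,  'x = 12')   # \S wins first, one character at a time
print(good_tokens)
print(bad_tokens)
```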

Logic Knight

Posted 2015-03-05T11:06:40.263

Reputation: 6 622

`\K` instead of positive lookbehind

PCRE and Perl support the escape sequence \K, which resets the beginning of the match. That is ab\Kcd will require your input string to contain abcd but the reported match will only be cd.

If you are using a positive lookbehind at the start of your pattern (which is probably the most likely place), then in most cases, you can use \K instead and save 3 bytes:

(?<=abc)def
abc\Kdef

This is equivalent for most purposes, but not entirely. The differences bring both advantages and disadvantages with them:

Upside: PCRE and Perl don't support arbitrary-length lookbehinds (only .NET does). That is, you can't do something like (?<=ab*). But with \K you can put any sort of pattern in front of it! So ab*\K works. This actually makes this technique vastly more powerful in the cases where it's applicable.
Upside: Lookarounds don't backtrack. This is relevant if you want to capture something in the lookbehind to backreference later, but there are several possible captures which all lead to valid matches. In this case, the regex engine would only ever try one of those possibilities. When using \K that part of the regex is being backtracked like everything else.
Downside: As you probably know, several matches of a regex cannot overlap. Often, lookarounds are used to partially work around this limitation, since the lookbehind can validate a portion of the string that was already consumed by an earlier match. So if you wanted to match all the characters that followed ab you might use (?<=ab).. Given the input
```
ababc
```
this would match the second a and the c. This cannot be reproduced with \K. If you used ab\K., you would only get the first match, because now the ab is not in a lookaround.
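Python's re does not support \K, so the sketch below uses a capturing group as a stand-in for ab\K. (that substitution is an assumption made for illustration). It reproduces the overlap difference and shows that this flavour also rejects variable-width lookbehinds.

```python
import re

overlapping = re.findall(r'(?<=ab).', 'ababc')   # lookbehind consumes nothing
consuming   = re.findall(r'ab(.)', 'ababc')      # stand-in for ab\K.

try:
    re.compile(r'(?<=ab*).')                     # variable-width lookbehind
    variable_ok = True
except re.error:
    variable_ok = False

print(overlapping, consuming, variable_ok)
```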

Martin Ender

Posted 2015-03-05T11:06:40.263

Reputation: 184 808

If a pattern uses the \K escape sequence within a positive assertion, the reported start of a successful match can be greater than the end of the match. – hwnd – 2015-03-07T04:03:58.233

@hwnd My point is that given ababc, there is no way to match both the second a and the c with \K. You'll only get one match. – Martin Ender – 2015-03-07T04:07:52.570

You're correct, not with the feature itself. You would have to anchor with \G – hwnd – 2015-03-07T04:08:57.243

@hwnd Ah I see your point now. But I guess at that point (from a golfing perspective) you're better off with a negative lookbehind, because you actually might even need it anyway since you can't be sure that the . from the last match was actually an a. – Martin Ender – 2015-03-07T04:12:09.890

Interesting use of \K =)

– hwnd – 2015-03-07T15:57:48.810

Matching any character

The ECMAScript flavour is lacking the s modifier, which makes . match any character (including newlines). This means there is no single-character solution to matching completely arbitrary characters. The standard solution in other flavours (when one doesn't want to use s for some reason) is [\s\S]. However, ECMAScript is the only flavour (to my knowledge) which supports empty character classes, and hence has a much shorter alternative: [^]. This is a negated empty character class - that is, it matches any character whatsoever.

Even for other flavours, we can learn from this technique: if we don't want to use s (e.g. because we still need the usual meaning of . in other places), there can still be a shorter way to match both newline and printable characters, provided there is some character we know doesn't appear in the input. Say, we're processing numbers delimited by newlines. Then we can match any character with [^!], since we know that ! won't ever be part of the string. This saves two bytes over the naive [\s\S] or [\d\n].
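A quick Python check of the numbers-and-newlines case, with a made-up input:

```python
import re

text = '12\n34'

dot      = re.findall(r'.', text)        # misses the newline
classic  = re.findall(r'[\s\S]', text)   # 6 bytes
excluded = re.findall(r'[^!]', text)     # 4 bytes, valid while '!' never occurs
print(dot, classic == list(text), excluded == list(text))
```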

Martin Ender

Posted 2015-03-05T11:06:40.263

Reputation: 184 808

In Perl, \N means exactly what . means outside of /s mode, except it isn't affected by a mode. – Konrad Borowski – 2015-03-06T17:48:12.590

Use atomic groups and possessive quantifiers

I have found atomic groups ((?>...)) and possessive quantifiers (?+, *+, ++, {m,n}+) sometimes very useful for golfing. They match a string and disallow backtracking into it later, so only the first match the regex engine finds for that part is ever tried.

For example: To match a string with odd number of a's at the beginning, which is not followed by more a's, you can use:

^(aa)*+a
^(?>(aa)*)a

This allows you to use things like .* freely, and if there is an obvious match, there won't be another possibility matching too many or too few characters, which may break your pattern.
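Older versions of Python's re lack atomic groups and possessive quantifiers, so the sketch below swaps in the well-known lookahead-plus-backreference emulation instead of the syntax above. It is meant only to show the effect of forbidding backtracking on the odd-number-of-a's example.

```python
import re

backtracking = r'^(aa)*a'              # (aa)* may give back a pair
atomic_like  = r'^(?=((?:aa)*))\1a'    # lookahead + backreference emulation

samples = ['a', 'aa', 'aaa', 'aaaa']
plain_results  = [bool(re.match(backtracking, s)) for s in samples]
atomic_results = [bool(re.match(atomic_like, s)) for s in samples]
for s, p, a in zip(samples, plain_results, atomic_results):
    print(s, p, a)
```

The lookahead grabs the longest run of aa pairs once, and the backreference consumes exactly that text, so the engine cannot retry with fewer pairs.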

In .NET regex (which doesn't have possessive quantifiers), you can use this to pop group 1 the greatest multiple of 3 (with maximum 30) times (not golfed very well):

(?>((?<-1>){3}|){10})

jimmy23013

Posted 2015-03-05T11:06:40.263

Reputation: 34 042

ECMAScript is also missing possessive quantifiers or atomic groups :( – CSᵠ – 2015-03-06T23:24:35.320

Forget a captured group after a subexpression (PCRE)

For this regex:

^((a)(?=\2))(?!\2)

If you want to clear the \2 after group 1, you can use recursion:

^((a)(?=\2)){0}(?1)(?!\2)

It will match aa while the previous one won't. Sometimes you can also use ?? or even ? in place of {0}.

This might be useful if you used recursions a lot, and some of the backreferences or conditional groups appeared in different places in your regex.

Also note that atomic groups are assumed for recursions in PCRE. So this won't match a single letter a:

^(a?){0}(?1)a

I didn't try it in other flavors yet.

For lookaheads, you can also use double negatives for this purpose:

^(?!(?!(a)(?=\1))).(?!\1)

jimmy23013

Posted 2015-03-05T11:06:40.263

Reputation: 34 042

Optional expressions

It is sometimes useful to remember that

(abc)?

is mostly the same as

(abc|)

There is a small difference though: in the first case, the group either captures abc or doesn't capture at all. The latter case would make a backreference fail unconditionally. In the second expression, the group will either capture abc or an empty string, where the latter case would make a backreference match unconditionally. To emulate the latter behaviour with ? you'd need to surround everything in another group which would cost two bytes:

((abc)?)
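The backreference difference can be observed in Python's re, which fails a backreference to a group that did not participate (other flavours may differ). The x separator and test strings are made up for illustration.

```python
import re

patterns = {
    'optional':    r'(abc)?x\1',     # group may not participate at all
    'alternation': r'(abc|)x\1',     # group always captures, possibly ''
    'wrapped':     r'((abc)?)x\1',   # outer group always captures
}
# For each pattern: (matches 'abcxabc', matches 'x')
outcome = {name: (bool(re.fullmatch(p, 'abcxabc')), bool(re.fullmatch(p, 'x')))
           for name, p in patterns.items()}
print(outcome)
```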

The version using | is also useful when you want to wrap the expression in some other form of group anyway and don't care about the capturing:

(?=(abc)?)
(?=abc|)

(?>(abc)?)
(?>abc|)

Finally, this trick can also be applied to ungreedy ? where it saves a byte even in its raw form (and consequently 3 bytes when combined with other forms of groups):

(abc)??
(|abc)

Martin Ender

Posted 2015-03-05T11:06:40.263

Reputation: 184 808

Capturing groups hold the last value matched

(REGEX)* will hold in capturing group 1 the last match of REGEX. (REGEX) can be combined with any such repeaters.

For getting the last character of a string, there is the straightforward 5-byter

.*(.)

which captures the last character in capturing group 1. Another byte can be saved by noting the point in the title of this post, giving the 4-byter

(.)*

Another example, getting the penultimate character

.*(.).        # 6 bytes
(.)*.         # 5 bytes
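Both examples can be tried in Python's re. This is a minimal sketch with a made-up input; the group keeps the character from its final iteration.

```python
import re

last        = re.match(r'(.)*', 'hello').group(1)    # last iteration captured 'o'
penultimate = re.match(r'(.)*.', 'hello').group(1)   # trailing . takes the final char
print(last, penultimate)
```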

user41805

Posted 2015-03-05T11:06:40.263

Reputation: 16 320

Multiple lookaheads that always match (.NET)

If you have 3 or more lookahead constructs that always match (to capture subexpressions), or there is a quantifier on a lookahead followed by something else, so they should be in a not necessarily captured group:

(?=a)(?=b)(?=c)
((?=a)b){...}

These are shorter:

(?(?(?(a)b)c))
(?(a)b){...}

where a should not be the name of a captured group. You can't use | to mean the usual thing in b and c without adding another pair of parentheses.

Unfortunately, balancing groups in the conditionals seemed buggy, making it useless in many cases.

jimmy23013

Posted 2015-03-05T11:06:40.263

Reputation: 34 042

Tips for Regex Golf
