Match strings whose length is a fourth power

28

4

Within the scope of this question, let us consider only strings which consist of the character x repeated arbitrary number of times.

For example:

<empty>
x
xx
xxxxxxxxxxxxxxxx

(Well, actually it doesn't have to be x - any character is fine as long as the whole string only has 1 type of character)

Write a regex in any regex flavor of your choice to match all strings whose length is n4 for some non-negative integer n (n >= 0). For example, strings of length 0, 1, 16, 81, etc. are valid; the rest are invalid.

Due to the technical limitation, values of n bigger than 128 is hard to test against. However, your regex should logically work correctly regardless.

Note that you are not allowed to execute arbitrary code in your regex (to Perl users). Any other syntax (look-around, back-reference, etc.) is allowed.

Please also include a short explanation about your approach to the problem.

(Please don't paste auto generated regex syntax explanation, since they are useless)

n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳

Posted 2014-01-23T20:21:47.400

Reputation: 5 683

"xx" isn't valid, is it? – Kendall Frey – 2014-01-23T20:32:43.957

@KendallFrey: Nope. It is not valid. – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2014-01-23T20:34:23.680

@nhahtdh do you think there is a possible answer to that? – xem – 2014-01-23T20:49:28.677

@xem: I have found and verify the answer up to 2^16. – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2014-01-23T20:51:03.523

you mean lengths up to 2^16 power 4? – xem – 2014-01-23T20:56:59.650

@xem: I mean I tested against string length 0 to 2^16. Which means n was from 0 to 16. – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2014-01-23T21:00:11.353

I'm pretty sure this is impossible using "true" regular expressions; it's easy to prove (e.g. by the Myhill-Nerode theorem) that the language of strings whose length is a fourth power is not regular. You're going to need some kind of "extended" regexps to solve this.

– Ilmari Karonen – 2014-01-23T21:29:37.007

@IlmariKaronen: This question does not limit the syntax to theoretical regular expression. Just that it doesn't allow arbitrary code execution as allowed in Perl. Other stuffs such as backref, etc. are allowed – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2014-01-23T21:32:34.860

Am I right in assuming that you know a solution to this in at least one regex flavour? – Timwi – 2014-01-23T23:17:07.850

@Timwi Yup, he knows. I've found the solution for n^2, now I need to work it out for n^4 :) – HamZa – 2014-01-23T23:28:15.673

1@Timwi: Yes. Java, PCRE (probably Perl also, but can't test), .NET. Mine doesn't work in Ruby/JS, though. – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2014-01-23T23:42:55.137

In your post you say that "strings of length 0, 1, 16, 81, etc. are valid". Even if I assume you meant 64=4^3 instead of 81-3^4 I still have to ask, why is the empty string allowed? – Kaya – 2014-02-24T16:53:10.633

@Kaya: I think you misunderstood my question. I am asking for length n to the 4th power (n^4). I even wrote that in the question. Empty string is allowed, since 0^4 = 0. – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2014-02-24T17:44:27.107

For sure--thanks for the clarification. This is indeed what occurred. – Kaya – 2014-02-24T19:09:41.717

1

This question has been added to the Stack Overflow Regular Expression FAQ, under "Advanced Regex-Fu".

– aliteralmind – 2014-04-10T01:33:37.327

Answers

21

This (ir)regular expression seems to work.

^((?(1)((?(2)\2((?(3)\3((?(4)\4x{24}|x{60}))|x{50}))|x{15}))|x))*$

This regex is compatible with PCRE, Perl, .NET flavors.

This basically follows a "difference tree" (not sure if there's a proper name for it), which tells the regex how many more x's to match for the next fourth power:

1     16    81    256   625   1296  2401 ...
   15    65    175   369   671   1105 ...
      50    110   194   302   434 ...
         60    84    108   132 ...
            24    24    24 ...  # the differences level out to 24 on the 4th iteration

\2, \3, \4 stores and updates the difference as shown on the 2nd, 3rd and 4th rows, respectively.

This construct can easily be extended for higher powers.

Certainly not an elegant solution, but it does work.

Volatility

Posted 2014-01-23T20:21:47.400

Reputation: 3 206

neat idea re difference tree. for squares the tree is 1 4 9 16 ... 3 5 7 ... 2 2 2, right? – Sparr – 2016-02-26T21:20:39.360

@Sparr thanks, and yes – Volatility – 2016-02-27T02:52:38.153

+1. Great answer. Although this answer is different from mine (it uses conditional regex, while mine does not), it has the same spirit as my solution (exploiting the difference tree and make use of the forward-declared back-reference of some regex engines). – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2014-01-26T08:32:56.570

24

Another Solution

This is, in my opinion, one of the most interesting problems on the site. I need to thank deadcode for bumping it back up to the top.

^((^|xx)(^|\3\4\4)(^|\4x{12})(^x|\1))*$

39 bytes, without any conditionals or assertions... sort of. The alternations, as they're being used (^|), are a type of conditional in a way, to select between "first iteration," and "not first iteration."

This regex can be seen to work here: http://regex101.com/r/qA5pK3/1

Both PCRE and Python interpret the regex correctly, and it has also been tested in Perl up to n = 128, including n4-1, and n4+1.


Definitions

The general technique is the same as in the other solutions already posted: define a self-referencing expression which on each subsequent iteration matches a length equal to the next term of the forward difference function, Df, with an unlimited quantifier (*). A formal definition of the forward difference function:

Definition 1: forward difference function

Additionally, higher order difference functions may also be defined:

Definition 2: second forward difference function

Or, more generally:

Definition 3: kth forward difference function

The forward difference function has a lot of interesting properties; it is to sequences what the derivative is to continuous functions. For example, Df of an nth order polynomial will always be an n-1th order polynomial, and for any i, if Dfi = Dfi+1, then the function f is exponential, in much the same way that the derivative of ex is equal to itself. The simplest discrete function for which f = Df is 2n.


f(n) = n2

Before we examine the above solution, let's start with something a bit easier: a regex which matches strings whose lengths are a perfect square. Examining the forward difference function:

FDF: n^2

Meaning, the first iteration should match a string of length 1, the second a string of length 3, the third a string of length 5, etc., and in general, each iteration should match a string two longer than the previous. The corresponding regex follows almost directly from this statement:

^(^x|\1xx)*$

It can be seen that the first iteration will match only one x, and each subsequent iteration will match a string two longer than the previous, exactly as specified. This also implies an amazingly short perfect square test in perl:

(1x$_)=~/^(^1|11\1)*$/

This regex can be further generalized to match any n-gonal length:

Triangular numbers:
^(^x|\1x{1})*$

Square numbers:
^(^x|\1x{2})*$

Pentagonal numbers:
^(^x|\1x{3})*$

Hexagonal numbers:
^(^x|\1x{4})*$

etc.


f(n) = n3

Moving on to n3, once again examining the forward difference function:

FDF: n^3

It might not be immediately apparent how to implement this, so we examine the second difference function as well:

FDF^2: n^3

So, the forwards difference function does not increase by a constant, but rather a linear value. It's nice that the initial ('-1th') value of Df2 is zero, which saves an initialization on the second iteration. The resulting regex is the following:

^((^|\2x{6})(^x|\1))*$

The first iteration will match 1, as before, the second will match a string 6 longer (7), the third will match a string 12 longer (19), etc.


f(n) = n4

The forward difference function for n4:

FDF: n^4

The second forward difference function:

FDF^2: n^4

The third forward difference function:

FDF^3: n^4

Now that's ugly. The initial values for Df2 and Df3 are both non-zero, 2 and 12 respectively, which will need to be accounted for. You've probably figured out by now that the regex will follow this pattern:

^((^|\2\3{b})(^|\3x{a})(^x|\1))*$

Because the Df3 must match a length of 12 on the second iteration, a is necessarily 12. But because it increases by 24 each term, the next deeper nesting must use its previous value twice, implying b = 2. The final thing to do is initialize the Df2. Because Df2 influences Df directly, which is ultimately what we want to match, its value can be initialized by inserting the appropriate atom directly into the regex, in this case (^|xx). The final regex then becomes:

^((^|xx)(^|\3\4{2})(^|\4x{12})(^x|\1))*$

Higher Orders

A fifth order polynomial can be matched in with the following regex:
^((^|\2\3{c})(^|\3\4{b})(^|\4x{a})(^x|\1))*$

f(n) = n5 is a fairly easy excercise, as the initial values for both the second and fourth forward difference functions are zero:

^((^|\2\3)(^|\3\4{4})(^|\4x{30})(^x|\1))*$

For six order polynomials:
^((^|\2\3{d})(^|\3\4{c})(^|\4\5{b})(^|\5x{a})(^x|\1))*$

For seventh order polynomials:
^((^|\2\3{e})(^|\3\4{d})(^|\4\5{c})(^|\5\6{b})(^|\6x{a})(^x|\1))*$

etc.

Note that not all polynomials can be matched in exactly this way, if any of the necessary coefficients are non-integer. For example, n6 requires that a = 60, b = 8, and c = 3/2. This can be worked around, in this instance:

^((^|xx)(^|\3\6\7{2})(^|\4\5)(^|\5\6{2})(^|\6\7{6})(^|\7x{60})(^x|\1))*$

Here I've changed b to 6, and c to 2, which have the same product as the above stated values. It's important that the product doesn't change, as a·b·c·… controls the constant difference function, which for a sixth order polynomial is Df6. There are two initialization atoms present: one to initialize Df to 2, as with n4, and the other to initialize the fifth difference function to 360, while at the same time adding in the missing two from b.

primo

Posted 2014-01-23T20:21:47.400

Reputation: 30 891

@Lynn thanks! Was not expecting that... – primo – 2019-02-11T12:14:19.627

Which engines have you tested this on? – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2014-02-25T11:08:14.493

I finally understand what is going on. Indeed the only thing needed is support for forward-reference. +1 – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2014-02-25T11:39:56.337

@nhahtdh ahh, you're right. Forward-references aren't necessarily a universal feature either. – primo – 2014-02-25T13:23:59.893

1

Excellent! I love how short, simple and easy to understand this is. With its shallow nesting, it is easy to calculate by hand how it will behave. Also, it is equally as fast as Volatility's and nhahtdh's solutions. And I love your detailed explanation, including the demonstration that this can even be extended to polynomials. I'd give bonus points if I could.

– Deadcode – 2014-02-26T18:17:42.877

14

Here is a solution that does not use conditionals, forward-declared or nested backreferences, lookbehind, balancing groups, or regex recursion. It only uses lookahead and standard backreferences, which are very widely supported. I was inspired to operate under these limitations due to Regex Golf, which uses the ECMAScript regex engine.

The way this 50 byte regex works is conceptually rather simple, and completely different than all the other submitted solutions to this puzzle. It was surprising to discover that this kind of mathematical logic was expressible in a regex.

      \2                     \4  \5
^((?=(xx+?)\2+$)((?=\2+$)(?=(x+)(\4+)$)\5){4})*x?$

(Capture groups are labeled above the regex)

The regex can be generalized to any power simply by replacing the 4 in {4} with the desired power.

Try it online!

It works by repeatedly dividing away the smallest fourth power of a prime that the current value is divisible by. As such the quotient at each step is always a fourth power, iff the original value was a fourth power. A final quotient of 1 indicates that the original value was indeed a fourth power; this completes the match. Zero is also matched.

First it uses a lazy capture group \2 to capture the number's smallest factor larger than 1. As such, this factor is guaranteed to be prime. For example, with 1296 (6^4) it will initially capture \2 = 2.

Then, at the beginning of a loop that is repeated 4 times, it tests to see if the current number is divisible by \2, with (?=\2+$). The first time through this loop, this test is useless, but its purpose will become apparent later.

Next inside this inner loop, it uses the greedy capture group \4 to capture the number's largest factor smaller than itself: (?=(x+)(\4+)$). In effect this divides the number by its smallest prime factor, \2; for example, 1296 will initially be captured as \4 = 1296/2 = 648. Note that the division of the current number by \2 is implicit. While it is possible to explicitly divide the current number by a number contained in a capture group (which I only discovered four days after posting this answer), doing this would make for a slower and harder-to-understand regex, and it is not necessary, since a number's smallest factor larger than 1 will always match up with its largest factor smaller than itself (such that their product is equal to the number itself).

Since this kind of regex can only "eat away" from the string (making it smaller) by leaving a result at the end of the string, we need to "move" the result of the division to the end of the string. This is done by capturing the result of subtraction (the current number minus \4), into the capture group \5, and then, outside the lookahead, matching a portion of the beginning of the current number corresponding to \5. This leaves the remaining unprocessed string at the end matching \4 in length.

Now it loops back to the beginning of the inner loop, where it becomes apparent why there is a test for divisibility by the prime factor. We have just divided by the number's smallest prime factor; if the number is still divisible by that factor, it means the original number might be divisible by the fourth power of that factor. The first time this test is done it is useless, but the next 3 times, it determines if the result of implicitly dividing by \2 is still divisible by \2. If it is still divisible by \2 at the beginning of each iteration of the loop, then this proves that each iteration divided the number by \2.

In our example, with an input of 1296, this will loop through as follows:

\2 = 2
\4 = 1296/2 = 648
\4 = 648/2 = 324
\4 = 324/2 = 162
\4 = 162/2 = 81

Now the regex can loop back to the first step; this is what the final * does. In this example, 81 will become the new number; the next loop will go as follows:

\2 = 3
\4 = 81/3 = 27
\4 = 27/3 = 9
\4 = 9/3 = 3
\4 = 3/3 = 1

It will now loop back to the first step again, with 1 as the new number.

The number 1 cannot be divided by any prime, which would make it a non-match by (?=(xx+?)\2+$), so it exits the top-level loop (the one with * at the end). It now reaches the x?$. This can only match zero or one. The current number at this point will be 0 or 1 if and only if the original number was a perfect fourth power; if it is 0 at this point, it means that the top-level loop never matched anything, and if it is 1, it means the top-level loop divided a perfect fourth power down until it wasn't divisible by anything anymore (or it was 1 in the first place, meaning the top-level loop never matched anything).

It's also possible to solve this in 49 bytes by doing repeated explicit division (which is also generalized for all powers – replace the desired power minus one into the {3}), but this method is far, far slower, and an explanation of the algorithm it uses is beyond the scope of this Answer:

^((x+)((\2(x+))(?=(\4*)\2*$)\4*(?=\5$\6)){3})?x?$

Try it online!

Deadcode

Posted 2014-01-23T20:21:47.400

Reputation: 3 022

From my testing (up to length 1024), it seems that it is correct. However, the regex is too slow - it takes a lot of time just to match length 16^4, so it is very hard to verify for large number. But since performance is not required, I'll upvote when I understand your regex. – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2014-02-24T12:50:29.687

1Your regex and Volatility's are awesome. Their speed and brevity amaze me, both of them matching 100000000 in 7.5 seconds on my i7-2600k, much faster than I would have expected a regex to be.

My solution here is on a totally different order of magnitude, as it takes 12 seconds to match 50625. But the goal with mine was not speed, but rather, accomplishing the job in minimal code length using a much more limited set of operations. – Deadcode – 2014-02-24T15:14:12.430

Our answers are fast, since they barely do any backtracking. Yours do a lot of backtracking in ((((x+)\5+)\4+)\3+)\2+$. Yours is also amazing in its own way, since I can't even think of how to match a square number without forward-declared backreference. – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2014-02-24T15:14:25.407

By the way, this question is not code-golf, but a puzzle. I don't judge solution by code length. – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2014-02-24T17:59:52.917

Oh. That explains why you used (?:). So should I edit my answer to make the optimized version the primary one? – Deadcode – 2014-02-24T18:42:51.820

Feel free to do so. As a suggestion, you might want to insert parts of the regex into your explanation, so that it would be easier to follow. – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳ – 2014-02-24T19:04:49.793

After my optimizations, this regex now matches 234256=22^4 faster than it formerly matched 50625=15^4. (Still nowhere near as fast as the solutions using forward-declared/nested backreferences.) – Deadcode – 2014-02-26T18:30:39.893

8

Solution

^(?:(?=(^|(?<=^x)x|xx\1))(?=(^|\1\2))(^x|\3\2{12}xx))*$

This regex is compatible with Java, Perl, PCRE and .NET flavors. This regex uses quite a range of features: look-ahead, look-behind and forward-declared back-reference. Forward-declared back-reference kinds of limits the compatibility of this regex to a few engines.

Explanation

This solution makes use of the following derivation.

By fully expanding the summation, we can prove the following equality:

\sum\limits_{i=1}^n (i+1)^4 - \sum\limits_{i=1}^n i^4 = (n+1)^4 - 1
\sum\limits_{i=1}^n i^4 - \sum\limits_{i=1}^n (i-1)^4 = n^4

Let us combine the summation on the left-hand-side:

\sum\limits_{i=1}^n (4(i+1)^3 - 6(i+1)^2 + 4(i+1) - 1) = (n+1)^4 - 1
\sum\limits_{i=1}^n (4i^3 - 6i^2 + 4i - 1) = n^4

Subtract the 2 equations (top equation minus bottom equation) and then combine the summations on the left-hand-side, then simplify it:

\sum\limits_{i=1}^n (12i^2 + 2) = (n+1)^4 - n^4 - 1

We obtain the difference between consecutive fourth powers as power sum:

(n+1)^4 - n^4 = \sum\limits_{i=1}^n (12i^2 + 2) + 1

This means that the difference between consecutive fourth powers will increase by (12n2 + 2).

To make it easier to think, referring to the difference tree in Volatility's answer:

  • The right-hand-side of the final equation is the 2nd row in the difference tree.
  • The increment (12n2 + 2) is the 3rd row in the difference tree.

Enough mathematics. Back to the solution above:

  • The 1st capturing group maintains a series of odd number to calculate i2 as seen in the equation.

    Precisely speaking, the length of the 1st capturing group will be 0 (unused), 1, 3, 5, 7, ... as the loop iterates.

    (?<=^x)x sets the initial value for the odd number series. The ^ is just there to allow the look-ahead to be satisfied in the first iteration.

    xx\1 adds 2 and advance to the next odd number.

  • The 2nd capturing group maintains the square number series for i2.

    Precisely speaking, the length of the 2nd capturing group will be 0, 1, 4, 9, ... as the loop iterates.

    ^ in (^|\1\2) sets the initial value for the square number series. And \1\2 adds the odd number to the current square number to advance it to the next square number.

  • The 3rd capturing group (outside any look-ahead and actually consume text) matches the whole right-hand-side of the equation we derived above.

    ^x in (^x|\3\2{12}xx) sets the initial value, which is + 1 the right-hand-side of the equation.

    \3\2{12}xx adds the increase in difference (12n2 + 2) using n2 from capturing group 2, and match the difference at the same time.

This arrangement is possible due to the amount of text matched in each iteration is more than or equal to the amount of text needed to execute the look-ahead to construct n2.

n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳

Posted 2014-01-23T20:21:47.400

Reputation: 5 683