ValiDate ISO 8601 by RX

16

4

Challenge

Find the shortest regex that

  1. validates, i.e. matches, every possible date in the Proleptic Gregorian calendar (which also applies to all dates before its first adoption in 1582) and
  2. does not match any invalid date.

Output

Output is therefore truthy or falsey.

Input

Input is in any of 3 expanded ISO 8601 date formats – no times.

The first two are ±YYYY-MM-DD (year, month, day) and ±YYYY-DDD (year, day). Both need special-casing for the leap day. They are naively matched separately by these extended RXs:

(?<year>[+-]?\d{4,})-(?<month>\d\d)-(?<day>\d\d)
(?<year>[+-]?\d{4,})-(?<doy>\d{3})

The third input format is ±YYYY-wWW-D (year, week, day). It is the complicated one because of the complex leap week pattern.

(?<year>-?\d{4,})-W(?<week>\d\d)-(?<dow>\d)

A basic, but insufficient validity check for all three combined would look something like this:

[+-]?\d{4,}-((0\d|1[0-2])-([0-2]\d|3[01]) ↩
            |([0-2]\d\d|3[0-5]\d|36[0-6]) ↩
            |(W([0-4]\d|5[0-3])-[1-7]))

Conditions

A leap year in the Proleptic Gregorian calendar contains the leap day …-02-29 and thus it is 366 days long, hence …-366 exists. This happens in any year whose ordinal number is divisible by 4, but not by 100 unless it’s also divisible by 400. Year zero exists in this calendar and it is a leap year.

A long year in the ISO week calendar contains a 53rd week, which one could term a “leap week”. This happens in all years where 1 January is a Thursday and additionally in all leap years where it’s a Wednesday. It turns out to occur every 5 or 6 years usually, in a seemingly irregular pattern.

A year has at least 4 digits. Years with more than 10 digits do not have to be supported, because that’s close enough to the age of the universe (ca. 14 billion years). The leading plus sign is optional, although the actual standard suggests it should be required for years with more than 4 digits.

Partial or truncated dates, i.e. with less than day-precision, must not be accepted.

The parts of the date notation, e.g. the month, do not have to be matched by a group that could be referenced.

Rules

This is code-golf. The shortest regex without executed code wins. Update: You may use features like recursion and balanced groups, but will be fined by a factor of 10, which the character count then is multiplied with! This is now different from the rules in Hard code golf: Regex for divisibility by 7. Earlier answer wins a tie.

Test cases

Valid tests

2015-08-10
2015-10-08
12015-08-10
-2015-08-10
+2015-08-10
0015-08-10
1582-10-10
2015-02-28
2016-02-29
2000-02-29
0000-02-29
-2000-02-29
-2016-02-29
200000-02-29
2016-366
2000-366
0000-366
-2016-366
-2000-366
2015-081
2015-W33-1
2015-W53-7
 2015-08-10 

The last one is optionally valid, i.e. leading and trailing spaces in input strings may be trimmed.

Invalid formats

-0000-08-10     # that's an arbitrary decision
15-08-10        # year is at least 4 digits long
2015-8-10       # month (and day) is exactly two digits long, i.e. leading zero is required
015-08-10       # year is at least 4 digits long
20150810        # though a valid ISO format, we require separators; could also be interpreted as a 8-digit year
2015 08 10      # separator must be hyphen-minus
2015.08.10      # separator must be hyphen-minus
2015–08–10      # separator must be hyphen-minus
2015-0810
201508-10       # could be October in the year 201508
2015 - 08 - 10  # no internal spaces allowed
2015-w33-1      # letter ‘W’ must be uppercase
2015W33-1       # it would be unambiguous to omit the separator in front of a letter, but not in the standard
2015W331        # though a valid ISO format we require separators
2015-W331
2015-W33        # a valid ISO date, but we require day-precision
2015W33

Invalid dates

2015        # a valid ISO format, but we require day-precision
2015-08     # a valid ISO format, but we require day-precision
2015-00-10  # month range is 1–12
2015-13-10  # month range is 1–12
2015-08-00  # day range is 1–28 through 31
2015-08-32  # max. day range is 1–31
2015-04-31  # day range for April is 1–30
2015-02-30  # day range for February is 1–28 or 29
2015-02-29  # day range for common February is 1–28
2100-02-29  # most century years are non-leap
-2100-02-29 # most century years are non-leap
2015-000    # day range is 1–365 or 366
2015-366    # day range is 1–365 in common years
2016-367    # day range is 1–366 in leap years
2100-366    # most century years are non-leap
-2100-366   # most century years are non-leap
2015-W00-1  # week range is 1–52 or 53
2015-W54-1  # week range is 1–53 in long years
2016-W53-1  # week range is 1–52 in short years
2015-W33-0  # day range is 1–7
2015-W33-8  # day range is 1–7

Crissov

Posted 2015-08-12T12:20:25.937

Reputation: 333

2This question is not well-defined because the regex language is not specified. – orlp – 2015-08-12T18:28:41.517

1@orlp If it’s not specified, the choice is not limited. I wrote “regex” or “RX” on purpose, so one could use dialects that allow recursion etc. (i.e. CFG, not RG). – Crissov – 2015-08-12T21:56:24.407

I would strongly suggest you to limit the regex language, because it will be very sour for a contestant to work for hours on a solution only to be trivially beaten by a language that is fundamentally more powerful. If you were to limit the language to the actual CS definition of regular expressions (like DFA), then the problem becomes an interesting optimization answer. – orlp – 2015-08-12T22:04:40.413

Validating ISO-8601 dates using regular expressions is something I've actually had to do for work. But agree with orlp, I think a language is necessary here. – Alex A. – 2015-08-13T02:50:07.020

@orlp I will not disallow recursion etc., but put a heavy penalty on it, since there have not been any submissions yet. – Crissov – 2015-08-13T08:59:40.583

Some dates in year 1582 should also be invalid, eg. 1582-10-10, as after October 4th there was immediately October 15th, also days larger than 355 (eg. 1582-365 is invalid) and week numbers over 51 for that year.

– Voitcus – 2015-08-13T10:22:31.880

@Voitcus No! This task is specifically about the proleptic Gregorian calendar. Otherwise the leap rule would also change to the simplistic Julian „every fourth year“ and the first day of the years had to be 1 March up—I don’t know when, furthermore there would probably be no year zero, but some strange affixes like AD or BCE. Let’s not do that this time. – Crissov – 2015-08-13T10:45:30.573

1Regex inherits from Method in Perl 6 so is itself a form of executable code. – Brad Gilbert b2gills – 2016-01-17T19:27:42.753

Answers

4

PCRE (also Perl), 778 bytes

/^([+-]?\d*((([02468][048]|[13579][26]|\d\d(?!00))([02468][048]|[13579][26]))|\d{4}(?!-02-29|-366))-((?!02-3|(0[469]|11)-31|000)((0[1-9]|1[012])-(0[1-9]|[12]\d|30|31)|([012]\d\d|3([0-5]\d|6[0-6])))|(W(?!00)([0-4]\d|51|52)-[1-7]))|((\+?\d*([02468][048]|[13579][26])|-\d*([02468][159]|[13579][37]))(04|09|15|20|26|32|37|43|48|54|60|65|71|76|82|88|93|99)|(\+?\d*([02468][159]|[13579][37])|-\d*([02468][26]|[13579][048]))(05|11|16|22|28|33|39|44|50|56|61|67|72|78|84|89|95)|(\+?\d*([02468][26]|[13579][048])|-\d*([02468][37]|[13579][159]))(01|07|12|18|24|29|35|40|46|52|57|63|68|74|80|85|91|96)|\+?\d*(([02468][37]|[13579][159])(03|14|20|25|31|36|42|53|59|64|70|76|81|87|92|[049]8))|-\d*(([02468][048]|[13579][26])([059]2|08|13|19|24|30|36|41|47|58|64|69|75|80|86|97)))-W53-[1-7])$/

I have included the delimiters in the byte count to show that it doesn't rely on any flags.

It does not match valid dates within other strings, such as 1234-56-89 2016-02-29 9876-54-32. The regex is shorter by not checking for a maximum of 10 digits for the year.

Extended with comments:

/^  # Start of pattern (no leading space)
  (
    # YEAR
    # Optional sign and digits if more than 4 in year
    [+-]?\d*(
      # Years 00??, 04??, 08?? ... 92??, 96?? OR dd not followed by 00
      # followed by 00, 04, 08 ... 92, 96 OR
      (([02468][048]|[13579][26]|\d\d(?!00))([02468][048]|[13579][26])) |
      # any year not followed by 29 February or day 366
      \d{4}(?!-02-29|-366)
    # dash
    ) -
    # MONTH AND DAY, or DAY OF YEAR, or WEEK OF YEAR AND DAY if less than 53 weeks
    (
      # Not (30 or 31 February OR 31 April, June, September or December OR day 0)
      (?!02-3|(0[469]|11)-31|000)
      (
        # Month         dash         day         OR
        (0[1-9]|1[012]) - (0[1-9]|[12]\d|30|31) |
        # 001-299 OR 300-359 OR 360-366
        ([012]\d\d | 3([0-5]\d | 6[0-6]))
      # OR
      ) |
      (
        # W    01-52    dash    1-7
        W(?!00)([0-4]\d|51|52)-[1-7]
      )
    # OR
    ) |
    # WEEK OF YEAR AND DAY only if week is 53
    (
      # Optional plus and extra year digits
      \+?\d*(
        # Years +0303 - +9998
        ([02468][37]|[13579][159])(03|14|20|25|31|36|42|53|59|64|70|76|81|87|92|[049]8)
      ) |
      # Minus and extra year digits
      -\d*(
        # Years -0002 - -9697
        ([02468][048]|[13579][26])([059]2|08|13|19|24|30|36|41|47|58|64|69|75|80|86|97)
      ) |
      # Years +0004 - +9699, -0104 - -9799
      (\+?\d*([02468][048]|[13579][26])|-\d*([02468][159]|[13579][37]))
          (04|09|15|20|26|32|37|43|48|54|60|65|71|76|82|88|93|99) |
      # Years +0105 - +9795, -0205 - -9895
      (\+?\d*([02468][159]|[13579][37])|-\d*([02468][26]|[13579][048]))
          (05|11|16|22|28|33|39|44|50|56|61|67|72|78|84|89|95) |
      # Years +0201 - +9896, -0301 - -9996
      (\+?\d*([02468][26]|[13579][048])|-\d*([02468][37]|[13579][159]))
          (01|07|12|18|24|29|35|40|46|52|57|63|68|74|80|85|91|96)
    # dash W 53 dash 1-7
    )-W53-[1-7]
  # End of pattern (no trailing space)
  )$/x

CJ Dennis

Posted 2015-08-12T12:20:25.937

Reputation: 4 104

I haven‘t checked everything yet, but it seems you gain the most bytes by (?!…) expressions compared to my solution. – Crissov – 2016-03-05T12:22:07.187

1@Crissov The (?!…) expressions only save a few bytes each. I reduced a lot of bytes by combining three of the positive/negative week-of-year/day-of-week year patterns into one each. The last ones don't correspond to each other. So I got 8 long sub-patterns down to 5. Also, since |20|25| is the same length as |2[05]| I went for the more readable option. – CJ Dennis – 2016-03-06T04:02:42.843

This expression matches the test case -0000-08-10 and doesn’t match ␠2015-08-10␠ with leading and trailing whitespace, but since both were arbitrary decisions or optional features I’ll let it slide.

– Crissov – 2016-04-06T09:38:08.353

I think this solution has a bug for dates within W50. – Crissov – 2018-10-29T23:49:41.910

W(?!00)([0-4]\d|51|52)-[1-7] must be something equivalent to W(?!00)([0-4]\d|5[0-2])-[1-7]. This adds one character to the length. 779 – Crissov – 2018-10-30T03:53:25.340

@Crissov No, 51|52 matches 51 or 52. 5[0-2] matches 50, 51 or 52. – CJ Dennis – 2018-10-31T02:29:57.533

@Crissov If you think there's a bug, you should post an example that demonstrates the bug. – CJ Dennis – 2018-10-31T02:31:50.850

2018-W50-1 vs. 2018-W51-1 and 2018-W49-1 – Crissov – 2018-10-31T04:32:55.090

9

PCRE: 603 940 947 949 956 bytes

^\s*[+-]?(\d{4,10}-((00[1-9]|0[1-9]\d|[12]\d\d|3[0-5]\d|36[0-5])|(0[1-9]|1[0-2])-(0[1-9]|1\d|2[0-8])|(0[13-9]|1[0-2])-(29|30)|(0[13578]|1[02])-31|W(0[1-9]|[1-4]\d|5[0-2])-[1-7]))|((\d{2,8}([13579][26]|[2468][048]|0[48])|(\d{0,6}([13579][26]|[02468][048])00))-(366|02-29))|(\+?\d{0,6}(([02468][048]|[13579][26])([26]0|71|[38]2|[49]3|[05]4|15|[27]6|37|[48]8|[09]9)|([02468][159]|[13579][37])(50|[16]1|[27]2|33|[48]4|[09]5|[15]6|67|[27]8|[38]9)|([02468][26]|[13579][048])([48]0|[09]1|[15]2|63|[27]4|[38]5|[49]6|[05]7|[16]8|29)|([02468][37]|[13579][159])([27]0|[38]1|[49]2|[05]3|[16]4|25|[37]6|87|[049]8|[5]9))|-\d{0,6}(([02468][048]|[13579][26])(0[28]|1[39]|24|3[06]|4[17]|5[28]|6[49]|75|8[06]|9[27])|([02468][159]|[13579][37])(0[49]|15|2[06]|3[27]|4[38]|54|6[05]|7[16]|8[28]|9[39])|([02468][26]|[13579][048])(0[51]|16|2[28]|3[39]|44|5[06]|6[17]|7[28]|8[49]|95)|([02468][37]|[13579][159])(0[17]|1[28]|2[49]|35|4[06]|5[27]|6[38]|74|8[05]|9[16])))-W53-[1-7]\s*$

Note: Some pairs of parentheses could possibly be dropped.

Divisibility by 4

The multiples of 4 repeat in a simple pattern:

  • 00, 04, 08, 12, 16,
    20, 24, 28, 32, 36,
    40, 44, 48, 52, 56,
    60, 64, 68, 72, 76,
    80, 84, 88, 92, 96, …

This, or the inverse, could be matched by a likewise simple regular expression for all two-digit numbers with leading zero:

(?<divisible-by-four>[13579][26]|[02468][048])
(?<not-divisible-by-four>[13579][048]|[02468][26]|\d[13579])

It could save some bytes if there were character classes for odd and even digits (like \o and \e), but there are not as far as I’m aware.

Years

That expression would suffice for the Julian calendar, but the Gregorian leap year detection needs to special-case trailing 00 with century divisibility by 4:

(?<leap-year>[+-]?(\d{2,8}([13579][26]|[2468][048]|0[48])|(\d{0,6}([13579][26]|[02468][048])00))
(?<year>[+-]?\d{4,10})

This would need some changes to outlaw -0000-… (along with -00000-… etc.) or to enforce the plus sign for positive year numbers with more than 4 digits. The latter would be rather simple, but is not required:

(?<leap-year>([+-]?(\d\d([13579][26]|[2468][048]|0[48])|(([13579][26]|[02468][048])00)))|([+-](\d{3,8}([13579][26]|[2468][048]|0[48])|(\d{1,6}([13579][26]|[02468][048])00))))
(?<year>([+-]?\d{4})|([+-]\d{5,10}))

Day of year

Three-digit ordinal dates are rather simple, we just have to restrict -366 to leap years (and disallow -000).

(?<ordinal-day>-(00[1-9]|0[1-9]\d|[12]\d\d|3[0-5]\d|36[0-5]))
(?<ordinal-leap-day>-366)

Day of month of year

The seven months with 31 days are 01 January, 03 March, 05 May, 07 July, 08 August, 10 October and 12 December. Just four month have exactly 30 days, 04 April, 06 June, 09 September and 11 November. Finally, 02 February has 28 days in common years and 29 in leap years. We can first construct a regular expression for the always valid days 01 through 28 and then add special cases.

(?<month-day>-(0[1-9]|1[0-2])-(0[1-9]|1\d|2[0-8]))
(?<short-month-day>-(0[13-9]|1[0-2])-(29|30))
(?<long-month-day>-(0[13578]|1[02])-31)
(?<month-leap-day>-02-29)

Neither month nor day must be 00 which was not covered by an earlier version.

Day of week of year

All years include 52 weeks

(?<week-day>-W(0[1-9]|[1-4]\d|5[0-2])-[1-7])

Long years that include -W53 repeat in a 400-year cycle, e.g. add 2000 for the current cycle and find the current year in the third entry:

  • 004, 009, 015, 020, 026, 032, 037, 043, 048, 054, 060, 065, 071, 076, 082, 088, 093, 099,
  • 105, 111, 116, 122, 128, 133, 139, 144, 150, 156, 161, 167, 172, 178, 184, 189, 195,
  • 201, 207, 212, 218, 224, 229, 235, 240, 246, 252, 257, 263, 268, 274, 280, 285, 291, 296,
  • 303, 308, 314, 320, 325, 331, 336, 342, 348, 353, 359, 364, 370, 376, 381, 387, 392, 398.

Each of the four centuries has a unique pattern. There is probably not much room for optimization.

  1. 04|09|15|20|26|32|37|43|48|54|60|65|71|76|82|88|93|99
  2. 05|11|16|22|28|33|39|44|50|56|61|67|72|78|84|89|95
  3. 01|07|12|18|24|29|35|40|46|52|57|63|68|74|80|85|91|96
  4. 03|08|14|20|25|31|36|42|48|53|59|64|70|76|81|87|92|98

We can group by either digit to find out that we can save two bytes or so:

  • Grouped by 1st digit.
    1. 0[49]|15|2[06]|3[27]|4[38]|54|6[05]|7[16]|8[28]|9[39]
    2. 05|1[16]|2[28]|3[39]|44|5[06]|6[17]|7[28]|8[49]|95
    3. 0[17]|1[28]|2[49]|35|4[06]|5[27]|6[38]|74|8[05]|9[16]
    4. 0[38]|14|2[05]|3[16]|4[28]|5[39]|64|7[06]|8[17]|9[28]
  • Grouped by 2nd digit.
    1. [26]0|71|[38]2|[49]3|[05]4|15|[27]6|37|[48]8|[09]9
    2. 50|[16]1|[27]2|33|[48]4|[09]5|[15]6|67|[27]8|[38]9
    3. [48]0|[09]1|[15]2|63|[27]4|[38]5|[49]6|[05]7|[16]8|29
    4. [27]0|[38]1|[49]2|[05]3|[16]4|25|[37]6|87|[049]8|[5]9

The century number is easily matched again by a variation of the divisibility expression.

  • 1st century: [02468][048]|[13579][26]
  • 2nd century: [02468][159]|[13579][37]
  • 3rd century: [02468][26]|[13579][048]
  • 4th century: [02468][37]|[13579][159]

So far, this does only work for positive years, including year zero. For negative years, we have to subtract the values from the list above from 400 and do the rest again, because the pattern is not symmetric.

  1. 02|08|13|19|24|30|36|41|47|52|58|64|69|75|80|86|92|97
  2. 04|09|15|20|26|32|37|43|48|54|60|65|71|76|82|88|93|99
  3. 05|11|16|22|28|33|39|44|50|56|61|67|72|78|84|89|95
  4. 01|07|12|18|24|29|35|40|46|52|57|63|68|74|80|85|91|96

or

  1. 0[28]|1[39]|24|3[06]|4[17]|5[28]|6[49]|75|8[06]|9[27]
  2. 0[49]|15|2[06]|3[27]|4[38]|54|6[05]|7[16]|8[28]|9[39]
  3. 0[51]|16|2[28]|3[39]|44|5[06]|6[17]|7[28]|8[49]|95
  4. 0[17]|1[28]|2[49]|35|4[06]|5[27]|6[38]|74|8[05]|9[16]

Putting it all together

Any year

[+-]?\d{4,10}-((00[1-9]|0[1-9]\d|[12]\d\d|3[0-5]\d|36[0-5])|(0[1-9]|1[0-2])-(0[1-9]|1\d|2[0-8])|(0[13-9]|1[0-2])-(29|30)|(0[13578]|1[02])-31|W(0[1-9]|[1-4]\d|5[0-2])-[1-7])

Leap-day year additions

[+-]?(\d{2,8}([13579][26]|[2468][048]|0[48])|(\d{0,6}([13579][26]|[02468][048])00))-(366|02-29)

Leap-week year additions

+?\d{0,6}(([02468][048]|[13579][26])([26]0|71|[38]2|[49]3|[05]4|15|[27]6|37|[48]8|[09]9)|([02468][159]|[13579][37])(50|[16]1|[27]2|33|[48]4|[09]5|[15]6|67|[27]8|[38]9)|([02468][26]|[13579][048])([48]0|[09]1|[15]2|63|[27]4|[38]5|[49]6|[05]7|[16]8|29)|([02468][37]|[13579][159])([27]0|[38]1|[49]2|[05]3|[16]4|25|[37]6|87|[049]8|[5]9))-W53-[1-7]
-\d{0,6}(([02468][048]|[13579][26])(0[28]|1[39]|24|3[06]|4[17]|5[28]|6[49]|75|8[06]|9[27])|([02468][159]|[13579][37])(0[49]|15|2[06]|3[27]|4[38]|54|6[05]|7[16]|8[28]|9[39])|([02468][26]|[13579][048])(0[51]|16|2[28]|3[39]|44|5[06]|6[17]|7[28]|8[49]|95)|([02468][37]|[13579][159])(0[17]|1[28]|2[49]|35|4[06]|5[27]|6[38]|74|8[05]|9[16]))-W53-[1-7]

Crissov

Posted 2015-08-12T12:20:25.937

Reputation: 333

Your pattern is not anchored at the beginning and end, so it will match valid dates inside an otherwise invalid string. – CJ Dennis – 2016-03-05T09:30:06.040

@CJDennis That’s true, I’ll add the two characters now. – Crissov – 2016-03-05T12:22:55.270

I’ve also added optional leading and trailing spaces \s*. – Crissov – 2016-04-06T09:31:05.717