Based on my SO question, but more specifically:
Find a RegEx such that, given a text where paragraphs end with an empty line (or $ at the very end), it matches up to the last twenty characters of a paragraph, excluding the \r or $, but no more than three complete "words", including the "separators" in between and the punctuation following them. The following constraints apply:
- "words" also include abbreviations and the like (and thus punctuation), i.e. "i.e." and "non-trivial" count as one word each
- whitespaces are "separators"
- isolated dashes do not count as "words" but as "separators"
- "separators" before the first word must not be included
- trailing punctuation counts toward that limit, but leading punctuation does not, nor do the linebreaks and $, which must not be in the matching group
- whether trailing punctuation is included in the match or not is up to you
Some examples:
The early bird catches the worm. Two words.
But only if they got up - in time.
Here we have a supercalifragilisticexpialidocious etc. sentence.
Short words.
Try accessing the .html now, please.
Vielen Dank und viele Grüße.
Did any-one try this?
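For checking candidate answers against the examples above, a small reference helper may be handy. The Python sketch below is not a regex answer and not part of the challenge; it encodes one reading of the rules: tokens are whitespace-delimited, a token without any letter or digit (such as an isolated dash) is a separator, leading punctuation of the first word is dropped, and trailing punctuation is kept. The name paragraph_tail and its parameters are made up for illustration, and the fallback for a single word longer than the character limit is an assumption, since the challenge does not cover that case.

```python
import re


def paragraph_tail(paragraph, max_words=3, max_chars=20):
    """Return the tail of one paragraph under one reading of the rules."""
    text = paragraph.rstrip()
    # Whitespace-delimited tokens; tokens with no letter or digit
    # (e.g. an isolated dash) count as separators, not as words.
    words = [m for m in re.finditer(r"\S+", text) if re.search(r"\w", m.group())]
    # Try the longest candidate first (max_words words), then shorter ones.
    for first in words[-max_words:]:
        # Skip leading punctuation of the candidate's first word.
        start = first.start() + re.search(r"\w", first.group()).start()
        candidate = text[start:]
        if len(candidate) <= max_chars:
            return candidate
    # Assumption: if even the last word alone is too long, cut it to size.
    return text[-max_chars:] if words else ""


examples = [
    "But only if they got up - in time.",
    "Try accessing the .html now, please.",
]
for example in examples:
    print(repr(paragraph_tail(example)))
```

Under this reading the two calls yield "up - in time." and "html now, please."; both limits are plain parameters, so they can be changed in one place.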
Score is given in bytesteps: bytes multiplied by the number of steps according to https://regex101.com, but there are three penalties:
- if the upper limit of words cannot trivially+ be modified: +10%
- if the upper limit of character count cannot trivially+ be modified: +10%
- failure to handle unicode used in actual words: add the percentage of the world population that, according to Wikipedia, speaks the language the failing word stems from. Example: Failing to match the German word "Grüße" as a word => +1.39%. Whether you consider a language's letters as single characters or count the bytes to utf8-encode them is up to you. The penalty does not have to be applied before someone actually provides a counter-example, so feel free to demote your competitors ;)
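The scoring rule can be written out as a small calculation. This sketch assumes that the penalty percentages are summed and then applied to the product of bytes and steps, which the challenge does not state explicitly, and that the pattern is measured in UTF-8 bytes. The step count has to be read off regex101 by hand; the name bytesteps and its parameters are hypothetical.

```python
def bytesteps(pattern, steps, penalties_percent=()):
    """Score = bytes * steps, increased by the summed penalty percentages.

    Assumption: penalties add up (e.g. 10 + 10 + 1.39) before being applied.
    """
    size = len(pattern.encode("utf-8"))  # byte length of the regex itself
    return size * steps * (1 + sum(penalties_percent) / 100)


# Hypothetical entry: a pattern with a made-up step count and one +10% penalty.
score = bytesteps(r"(?:\w+\s){0,2}\w+", steps=100, penalties_percent=[10])
print(score)
```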
Since this is my first challenge, please suggest any clarifications. I assume regular-expression means there are no "loopholes" to exclude.
+ By "trivially modified" I mean that expressions such as {0,2} and {21} are valid, but repeating something three times (which would have to be repeated again to increase the number of words) is not.
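To illustrate the difference between a limit expressed as a quantifier and one expressed by repetition, here is a minimal Python sketch. The two patterns are deliberately simplified and do not solve the challenge; the helper name word_limit_pattern is made up for illustration.

```python
import re


def word_limit_pattern(max_words):
    # The word limit appears once, as the upper bound of a quantifier.
    return r"(?:\w+\s){0,%d}\w+" % (max_words - 1)


# Hard-coded repetition: raising the limit means writing more copies.
explicit = re.compile(r"\w+\s\w+\s\w+")
# Quantified form: raising the limit means changing one number.
general = re.compile(word_limit_pattern(3))

print(general.pattern)
print(bool(general.fullmatch("a b")), bool(explicit.fullmatch("a b")))
```

The repeated form matches exactly three words, while the quantified form matches one to three and becomes a four-word limit by passing 4 instead of 3.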
Bonuses and penalties are generally disliked in challenges. And, while it makes sense to restrict this to regular expressions, why not remove the tag and let other languages compete as well? – Addison Crump – 2016-05-19T09:21:34.680
@VTCAKAVSMoACE I assume because it's not just scored by bytes but also by the number of steps executed by the regex engine, which doesn't make sense for arbitrary languages. – Martin Ender – 2016-05-19T09:22:37.343
Could you add test cases where a) there's leading punctuation, b) there's a hyphen that isn't surrounded by whitespace, c) a paragraph consists only of one or two words taking up much less than 20 characters? – Martin Ender – 2016-05-19T09:24:52.417
Also the penalties aren't entirely clear. Of course the upper limit of words and characters can always be modified, it's just a question of how many places need to be changed. Do you mean 3 and 20 should appear as numbers in a single place which can be changed? – Martin Ender – 2016-05-19T09:25:38.077
Will \s\s+ ever match the paragraph? – Leaky Nun – 2016-05-19T09:41:03.323
@MartinBüttner Thanks for the feedback, I hope my examples and clarifications help. I don't require 3 and 20 to exactly occur, things like {0,2} and {21} are ok as well. I just wanted to penalize explicit pattern repetitions (e.g. \w+\s\w+\s\w+ instead of a more generalizable (?:\w+\s){0,2}\w+ even though that is longer) – Tobias Kienzler – 2016-05-19T09:44:36.180
Can you clarify what you mean by matching group, i.e. can you specify a particular capturing group as the result? – Neil – 2016-05-19T10:01:46.890
@Neil I'd prefer having only one capturing group which would then be the result – Tobias Kienzler – 2016-05-19T10:06:34.077
@KennyLau Markdown's manual line-break (two spaces at the end of the line and no empty line) matches that but I wouldn't consider it a paragraph. – Tobias Kienzler – 2016-05-19T10:08:10.157
Is there any manual repetition that is not able to be converted into a thing like {0,2}? – Leaky Nun – 2016-05-19T10:21:58.137
Can I match the leading whitespace? – Leaky Nun – 2016-05-19T10:26:55.170
You should clarify more. I thought the spaces do not count towards the character limit... – Leaky Nun – 2016-05-19T10:31:07.717
@KennyLau 1) no there isn't but one shouldn't use repetition even if it's shorter 2) matching the leading whitespace is not valid 3) the spaces in between also count towards the character limit - the regex would become pretty complex otherwise I guess? – Tobias Kienzler – 2016-05-19T10:55:33.140
@KennyLau Sorry, you're right. Matching it is disallowed. – Tobias Kienzler – 2016-05-19T10:57:08.953
I am still not seeing the word "regex" in your question anywhere. – Leaky Nun – 2016-05-19T10:59:56.457
@KennyLau fixed. I also forgot an example with hyphen... – Tobias Kienzler – 2016-05-19T11:15:03.820