Find the last three words of each paragraph if they are shorter than 20 characters in total


Based on my SO question, but more specifically:

Find a RegEx such that, given a text where paragraphs end with an empty line (or $ at the very end), it matches up to the last twenty characters of a paragraph, excluding the \r or $, but no more than three complete "words", including the "separators" in between and the punctuation following them. The following constraints apply:

  • "words" also include abbreviations and the like (and thus punctuation), i.e. "i.e." and "non-trivial" count as one word each
  • whitespaces are "separators"
  • isolated dashes do not count as "words" but as "separators"
  • "separators" before the first word must not be included
  • trailing punctuation counts towards that limit, but leading punctuation does not, nor do the line breaks and $, which must not be in the matching group
  • whether trailing punctuation is included in the match or not is up to you

Some examples:

The early bird catches the worm. Two words.

But only if they got up - in time.

Here we have a supercalifragilisticexpialidocious etc. sentence.

Short words.

Try accessing the .html now, please.

Vielen Dank und viele Grüße.

Did any-one try this?

Score is given in bytesteps: bytes multiplied by the number of steps according to https://regex101.com, but there are three penalties:

  • if the upper limit of words cannot trivially+ be modified: +10%
  • if the upper limit of character count cannot trivially+ be modified: +10%
  • failure to handle Unicode used in actual words: add the percentage of the world population that, according to Wikipedia, speaks the language the failing word stems from. Example: Failing to match the German word "Grüße" as a word => +1.39%. Whether you consider a language's letters as single characters or count the bytes needed to UTF-8-encode them is up to you. The penalty does not have to be applied before someone actually provides a counter-example, so feel free to demote your competitors ;)
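To make the Unicode penalty concrete, here is a minimal sketch of how a word pattern can fail on "Grüße". It uses Python's `re` module rather than the PCRE flavour that regex101 scores, so it only illustrates the idea; the word patterns below are illustrative assumptions, not taken from any answer.

```python
import re

word = "Grüße"

# An ASCII-only character class cannot match the non-ASCII letters.
ascii_class = re.fullmatch(r"[A-Za-z]+", word)

# In Python 3, \w on str patterns is Unicode-aware by default ...
unicode_w = re.fullmatch(r"\w+", word)

# ... unless the ASCII flag restricts it to [a-zA-Z0-9_].
ascii_w = re.fullmatch(r"\w+", word, re.ASCII)

# \S (anything that is not whitespace) accepts the word either way.
non_space = re.fullmatch(r"\S+", word, re.ASCII)

print(bool(ascii_class), bool(unicode_w), bool(ascii_w), bool(non_space))
```

A pattern built on `\S` sidesteps the question of which letters count as word characters, at the cost of treating punctuation as part of the word.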

Since this is my first challenge, please suggest any clarifications. I assume the [tag:regular-expression] tag means there are no "loopholes" to exclude.


+ By "trivially modified" I mean that expressions such as {0,2} and {21} are valid, but repeating something three times (which would have to be re-repeated to increase the number of words) is not.
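The scoring rule can be written down as a small helper. This is only a sketch under assumptions: `bytesteps` is a hypothetical name, the step count has to be read off regex101 by hand, and the penalties are assumed to add up and then scale the base score, which the challenge text does not spell out. The pattern and step count are the ones reported in the answer on this page.

```python
def bytesteps(pattern: str, steps: int, penalty_percent: float = 0.0) -> float:
    """Bytes of the UTF-8 encoded pattern times steps, scaled by a penalty.

    Assumption: penalties are summed into one percentage and applied once.
    """
    size = len(pattern.encode("utf-8"))
    return size * steps * (1 + penalty_percent / 100)


# Pattern and step count as reported in the answer (steps come from regex101).
pattern = r"(?<!\S)(?!.{21})\S+(( | - )\S+){0,2}$"
base_score = bytesteps(pattern, 666)
penalised = bytesteps(pattern, 666, penalty_percent=10)
print(base_score, penalised)
```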

Tobias Kienzler

Posted 2016-05-19T08:43:47.570

Reputation: 179

Bonuses and penalties are generally disliked in challenges. And, while it makes sense to restrict this to regular expressions, why not remove the tag and let other languages compete as well? – Addison Crump – 2016-05-19T09:21:34.680

@VTCAKAVSMoACE I assume because it's not just scored by bytes but also by the number of steps executed by the regex engine, which doesn't make sense for arbitrary languages. – Martin Ender – 2016-05-19T09:22:37.343

Could you add test cases where a) there's leading punctuation, b) there's a hyphen that isn't surrounded by whitespace, and c) a paragraph consists of only one or two words taking up much less than 20 characters? – Martin Ender – 2016-05-19T09:24:52.417

Also the penalties aren't entirely clear. Of course the upper limit of words and characters can always be modified, it's just a question of how many places need to be changed. Do you mean 3 and 20 should appear as numbers in a single place which can be changed? – Martin Ender – 2016-05-19T09:25:38.077

Will \s\s+ ever match the paragraph? – Leaky Nun – 2016-05-19T09:41:03.323

@MartinBüttner Thanks for the feedback, I hope my examples and clarifications help. I don't require 3 and 20 to occur exactly; things like {0,2} and {21} are OK as well. I just wanted to penalize explicit pattern repetitions (e.g. \w+\s\w+\s\w+ instead of a more generalizable (?:\w+\s){0,2}\w+ even though that is longer) – Tobias Kienzler – 2016-05-19T09:44:36.180
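The difference between the two forms in this comment can be sketched in Python. `word_run` is a hypothetical helper, and Python's `re` stands in for the PCRE flavour scored on regex101.

```python
import re


def word_run(max_words: int) -> str:
    """Build a pattern for one up to max_words whitespace-separated words."""
    return rf"(?:\w+\s){{0,{max_words - 1}}}\w+"


repeated = r"\w+\s\w+\s\w+"  # the limit is baked into the pattern's shape
quantified = word_run(3)     # the limit is a single number

for text in ["one", "one two", "one two three"]:
    print(text, bool(re.fullmatch(repeated, text)), bool(re.fullmatch(quantified, text)))
```

Besides being harder to change, the hand-repeated form requires exactly three words, whereas the quantified form accepts one to three.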

Can you clarify what you mean by matching group, i.e. can you specify a particular capturing group as the result? – Neil – 2016-05-19T10:01:46.890

@Neil I'd prefer having only one capturing group which would then be the result – Tobias Kienzler – 2016-05-19T10:06:34.077

@KennyLau Markdown's manual line-break (two spaces at the end of the line and no empty line) matches that but I wouldn't consider it a paragraph. – Tobias Kienzler – 2016-05-19T10:08:10.157

Is there any manual repetition that is not able to be converted into a thing like {0,2}? – Leaky Nun – 2016-05-19T10:21:58.137

Can I match the leading whitespace? – Leaky Nun – 2016-05-19T10:26:55.170

You should clarify more. I thought the spaces do not count towards the character limit... – Leaky Nun – 2016-05-19T10:31:07.717

@KennyLau 1) no there isn't but one shouldn't use repetition even if it's shorter 2) matching the leading whitespace is not valid 3) the spaces in between also count towards the character limit - the regex would become pretty complex otherwise I guess? – Tobias Kienzler – 2016-05-19T10:55:33.140

2. Please avoid unclear terms such as "desired". 3. < – Leaky Nun – 2016-05-19T10:56:33.957

    @KennyLau Sorry, you're right. Matching it is disallowed. – Tobias Kienzler – 2016-05-19T10:57:08.953

    I am still not seeing the word "regex" in your question anywhere. – Leaky Nun – 2016-05-19T10:59:56.457

    @KennyLau fixed. I also forgot an example with hyphen... – Tobias Kienzler – 2016-05-19T11:15:03.820

    Answers


    37 bytes * 1403 steps = 51911 bytesteps

    34 bytes * 1036 steps = 35224 bytesteps

    37 bytes * 666 steps = 24642 bytesteps

    (?<!\S)(?!.{21})\S+(( | - )\S+){0,2}$
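As a rough check outside regex101, the pattern can be run over a few of the question's example paragraphs with Python's `re` module. This is a sketch under assumptions: Python's engine is not the PCRE flavour used for scoring, and the `MULTILINE` flag is added here so that `$` matches at the end of every line.

```python
import re

# Pattern from the answer above; MULTILINE lets $ match at the end of each line.
LAST_WORDS = re.compile(r"(?<!\S)(?!.{21})\S+(( | - )\S+){0,2}$", re.MULTILINE)

text = (
    "The early bird catches the worm. Two words.\n"
    "\n"
    "But only if they got up - in time.\n"
    "\n"
    "Short words.\n"
    "\n"
    "Did any-one try this?"
)

# group(0) is the whole match; the inner group only exists for the alternation.
matches = [m.group(0) for m in LAST_WORDS.finditer(text)]
for match in matches:
    print(repr(match))
```

Under these assumptions the four matches are 'worm. Two words.', 'up - in time.', 'Short words.' and 'any-one try this?'. The isolated dash is absorbed by the ` - ` alternative instead of being counted as a word.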
    

    Verify it here!

    Leaky Nun

    Posted 2016-05-19T08:43:47.570

    Reputation: 45 011

    I think the challenge is explicitly asking for just a regex, not a solution in any other programming language. So a) you can omit the `!``, but b) you'll have to make sure it works in PCRE so that it can be scored on regex101. – Martin Ender – 2016-05-19T09:46:05.653

    I don't think you need the 1 in {1,3}, neither in {1,20} – Bálint – 2016-05-19T09:46:53.660

    @Bálint He does, otherwise it matches exactly 3 words or 20 characters. (And if you mean {,20} that only works in Ruby and some other flavours, but definitely not .NET.) – Martin Ender – 2016-05-19T09:58:46.423
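The point about the lower bound can be checked with a short sketch. It uses Python's `re` module, which is one of the flavours that accepts an omitted lower bound; the sample line is taken from the question's examples.

```python
import re

line = "Short words."  # 12 characters

exact = re.fullmatch(r".{20}", line)      # needs exactly 20 characters
ranged = re.fullmatch(r".{1,20}", line)   # accepts 1 to 20 characters
open_low = re.fullmatch(r".{,20}", line)  # Python reads {,20} as {0,20}

print(bool(exact), bool(ranged), bool(open_low))
```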

    @MartinBüttner is right I'm afraid, unless you can provide means to calculate an equivalent to regex101's step counting in order to obtain a score in bytesteps. Sorry for the confusion - maybe I should ask a second question that is pure code-golf without the regex-bytesteps, but it would be otherwise identical, so I don't know if that is acceptable – Tobias Kienzler – 2016-05-19T10:00:00.997

    @MartinBüttner nope, I meant {20} – Bálint – 2016-05-19T10:02:13.930

    @Bálint Yeah, that won't work for fewer than 3 words and 20 characters then. Also Kenny, I don't think your use of \b works when the match would start with something other than a word character. – Martin Ender – 2016-05-19T10:06:12.703

    @TobiasKienzler It's extremely unclear. The word regex only appears once throughout your whole question (and even then only as part of a website's address), and it is not even tagged as [tag:regex] – Leaky Nun – 2016-05-19T10:37:57.187

    @KennyLau Sorry about the confusion - unfortunately there is only a tag [tag:regular-expression], I also wanted to use [tag:regex]... – Tobias Kienzler – 2016-05-19T10:49:07.903

    Oh, I overlooked. Anyhow, the word regex still isn't explicitly mentioned in the challenge. – Leaky Nun – 2016-05-19T10:51:27.643

    Sorry, I forgot to add a hyphen-example. That increases the steps (my bad) slightly to 1403

    – Tobias Kienzler – 2016-05-19T11:19:59.307

    Instead of (?=.{1,20}$) you can use (?!.{21}), right? The $ is checked for again at the end anyway. (?<!\S)(?!.{21})(\S+ ?(- )?){1,3}$ gives you 34*1036 = 35224 bytesteps

    – Tobias Kienzler – 2016-05-19T11:44:08.500
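A quick way to convince yourself that the two lookaheads are interchangeable here is to compare them on a few lines. This is a sketch in Python's `re`, not PCRE; the second pattern is the one from the comment above, while the first is an assumed reconstruction that only swaps the lookahead back in.

```python
import re

# Only the lookahead differs between the two patterns.
positive = re.compile(r"(?<!\S)(?=.{1,20}$)(\S+ ?(- )?){1,3}$")  # assumed earlier form
negative = re.compile(r"(?<!\S)(?!.{21})(\S+ ?(- )?){1,3}$")     # from the comment

lines = [
    "Short words.",
    "Did any-one try this?",
    "But only if they got up - in time.",
]


def last_words(pattern, line):
    """Return the matched text, or None if the pattern finds nothing."""
    found = pattern.search(line)
    return found.group(0) if found else None


results = [(last_words(positive, l), last_words(negative, l)) for l in lines]
for with_positive, with_negative in results:
    print(repr(with_positive), repr(with_negative))
```

Because the rest of the pattern must reach `$` anyway, "at most 20 characters up to the end" and "not 21 characters ahead" accept the same starting positions.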

    You can get down to 578 steps by replacing (?<!\S with (?<=\s

    – Tobias Kienzler – 2016-05-19T14:37:50.713

    @TobiasKienzler Doesn't work for the beginning? – Leaky Nun – 2016-05-19T14:57:27.870

    You're right, I didn't consider that, thanks for pointing it out – Tobias Kienzler – 2016-05-20T06:36:11.490
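The start-of-text problem with the positive lookbehind can be reproduced with a short sketch (Python's `re`, assumed here to behave like PCRE for these two assertions):

```python
import re

# "not preceded by a non-space character": also true at the very start of the text
negative = re.compile(r"(?<!\S)(?!.{21})\S+(( | - )\S+){0,2}$")
# "preceded by a whitespace character": never true at the very start of the text
positive = re.compile(r"(?<=\s)(?!.{21})\S+(( | - )\S+){0,2}$")


def last_words(pattern, text):
    """Return the matched text, or None if the pattern finds nothing."""
    found = pattern.search(text)
    return found.group(0) if found else None


two_words = (last_words(negative, "Short words."), last_words(positive, "Short words."))
one_word = (last_words(negative, "Short."), last_words(positive, "Short."))
print(two_words, one_word)
```

With `(?<=\s)` a paragraph that starts the text loses its first word, and a one-word text is not matched at all.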