Time for a new typography challenge! It’s a common problem when copy-pasting between various document formats: hyphenation. While it reduces the raggedness of a left-aligned layout or evens spacing in a justified layout, it’s a complete pain when your PDF is not properly constructed and retains the hyphens in the layout, making your copied text difficult to edit or reflow.
Luckily, if we are to believe the countless self-help books out there, nothing is a problem if you see it as a challenge. I believe these self-help books are without exception referring to PPCG, where any problem will be solved if presented as a challenge. Your task is to remove offending hyphenation and linebreaks from a text, so that it is ready to paste in any text editor.
Problem description
You will write a program or function that removes hyphenation and line-breaks where applicable. The input will be a string on stdin (or closest alternative) or a function argument. The output (on stdout, the closest alternative, or as a function return value) will be the 'corrected' text. This text should be directly copy-pastable. This means that leading or trailing output is OK, but additional output partway through your corrected text (e.g., leading spaces on every line) is not.
The most basic case is the following (note: no trailing spaces):
Lorem ipsum dolor sit amet, con-
sectetur adipiscing elit. Morbi
lacinia nisi sed mauris rhoncus.
The offending hyphen and linebreaks should be removed, to obtain:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi lacinia nisi sed mauris rhoncus.
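For illustration, the basic case alone can be sketched in a few lines of (ungolfed) Python. The function name `join_basic` is made up for this example, and the sketch assumes `\n` linebreaks and deliberately ignores the exceptions.

```python
def join_basic(text: str) -> str:
    """Handle only the basic case: no paragraphs, names, or word groups."""
    # A hyphen directly before a linebreak is treated as hyphenation:
    # drop both so the two word halves are joined.
    joined = text.replace("-\n", "")
    # Every remaining linebreak becomes a single space.
    return joined.replace("\n", " ")


sample = (
    "Lorem ipsum dolor sit amet, con-\n"
    "sectetur adipiscing elit. Morbi\n"
    "lacinia nisi sed mauris rhoncus."
)
print(join_basic(sample))
```

Run on the three lines above, this prints the single joined line.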
However, a few exceptions should be observed.
- Double newlines indicate a paragraph break, and should be retained.
- Proper nouns and names are never broken across two lines, unless they already contain a hyphen (e.g. Navier-Stokes equations). The line-break should be removed, but the hyphen retained. These cases can be identified by having only the first letter capitalized.
- Sometimes, a hyphen indicates a word group (e.g. nineteenth- and twentieth century). When this happens across two lines, this is indicated with a leading space on the next line.
An example: (views expressed in this example are fictional and do not necessarily represent the view of the author; opponents of the Runge-Kutta-Fehlberg method are equally welcome to participate in this challenge)
Differential equations can
be solved with the Runge-Kutta-
Fehlberg method.

Developed in the nineteenth-
 or twentieth century, this
method is completely FANTAS-
TIC.
will become
Differential equations can be solved with the Runge-Kutta-Fehlberg method.

Developed in the nineteenth- or twentieth century, this method is completely FANTASTIC.
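An ungolfed Python sketch of one way to read the three exceptions is given here as a reference, not as a competitive answer. The names `dehyphenate` and `_join_lines` are invented for this example. It assumes that a "proper noun" continuation is a capital letter followed by a lowercase letter at the start of the next line, and that a paragraph break is two or more consecutive linebreaks.

```python
import re


def _join_lines(paragraph: str) -> str:
    # Word group: hyphen, linebreak, then a leading space.
    # Keep the hyphen and the space, drop only the linebreak.
    p = re.sub(r"-\n(?= )", "-", paragraph)
    # Proper noun (assumption: capital followed by lowercase, e.g. "Fehlberg").
    # Keep the hyphen, drop the linebreak.
    p = re.sub(r"-\n(?=[A-Z][a-z])", "-", p)
    # Ordinary hyphenation (including all-caps words): drop hyphen and linebreak.
    p = p.replace("-\n", "")
    # Any remaining linebreak separates two words.
    return p.replace("\n", " ")


def dehyphenate(text: str) -> str:
    # Accept either linebreak convention by normalising to "\n".
    text = text.replace("\r\n", "\n")
    # Two or more linebreaks mark a paragraph break, which is retained
    # (runs longer than two are collapsed to exactly two).
    paragraphs = re.split(r"\n{2,}", text)
    return "\n\n".join(_join_lines(p) for p in paragraphs)
```

Calling `dehyphenate` on the example input, with a blank line between the two paragraphs and a leading space before "or twentieth", gives the two output paragraphs shown above. The order of the substitutions matters: the two "keep the hyphen" rules have to run before the generic hyphen removal.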
The linebreaks can be either \n or \r\n (ASCII), depending on your preference, and the hyphen is a simple ASCII - (minus sign). UTF-8 support is not required. This challenge is code-golf, so shortest code wins.
I actually wanted to tackle this one in Retina, but didn't want to mess with Mono :( – orlp – 2015-11-16T19:10:09.090