Remove the hyphenation

15

Time for a new challenge! It’s a common problem when copy-pasting between various document formats: hyphenation. While it reduces the raggedness of a left-aligned layout or evens spacing in a justified layout, it’s a complete pain when your PDF is not properly constructed and retains the hyphens in the layout, making your copied text difficult to edit or reflow.

Luckily, if we are to believe the countless self-help books out there, nothing is a problem if you see it as a challenge. I believe these self-help books are without exception referring to PPCG, where any problem will be solved if presented as a challenge. Your task is to remove offending hyphenation and linebreaks from a text, so that it is ready to paste in any text editor.

Problem description

You will write a program or function that removes hyphenation and line breaks where applicable. The input will be a string on stdin (or closest alternative) or a function argument. The output (on stdout, its closest alternative, or as the function's return value) will be the 'corrected' text. This text should be directly copy-pastable. This means that leading or trailing output is OK, but additional output in the middle of your corrected text (e.g., leading spaces on every line) is not.

The most basic case is the following (note: no trailing spaces)

Lorem ipsum dolor sit amet, con-
sectetur adipiscing elit. Morbi
lacinia nisi sed mauris rhoncus.

The offending hyphen and linebreaks should be removed, to obtain

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi lacinia nisi sed mauris rhoncus.

However, a few exceptions should be observed.

  • Double newlines indicate a paragraph break, and should be retained.
  • Proper nouns and names are never broken across two lines, unless they already contain a hyphen (e.g. Navier-Stokes equations). In that case, the line break should be removed but the hyphen retained. These cases can be identified by the word after the line break having only its first letter capitalized.
  • Sometimes, a hyphen indicates a word group (e.g. nineteenth- and twentieth century). When this happens across two lines, this is indicated with a leading space on the next line.

An example: (views expressed in this example are fictional and do not necessarily represent the view of the author; opponents of the Runge-Kutta-Fehlberg method are equally welcome to participate in this challenge)

Differential equations can
be solved with the Runge-Kutta-
Fehlberg method.

Developed in the nineteenth-
 or twentieth century, this
method is completely FANTAS-
TIC.

will become

Differential equations can be solved with the Runge-Kutta-Fehlberg method. 

Developed in the nineteenth- or twentieth century, this method is completely FANTASTIC. 
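For reference, the rules and the example above can be reproduced with an ungolfed Python sketch. This is only one possible reading of the specification, not a submitted answer: the function names `dehyphenate` and `_join_lines` are made up for illustration, and "only the first letter capitalized" is approximated by checking for an uppercase letter followed by a lowercase one.

```python
import re


def _join_lines(paragraph: str) -> str:
    """Join the lines of a single paragraph (no blank lines inside)."""
    # Word group: hyphen, line break, leading space -> keep hyphen and space.
    paragraph = paragraph.replace("-\n ", "- ")
    # Hyphenated proper noun: the next line starts with a capital followed by
    # a lowercase letter (an approximation of the rule) -> keep the hyphen.
    paragraph = re.sub(r"-\n(?=[A-Z][a-z])", "-", paragraph)
    # Ordinary hyphenation: drop both the hyphen and the line break.
    paragraph = paragraph.replace("-\n", "")
    # Any line break that is left is a soft wrap and becomes a space.
    return paragraph.replace("\n", " ")


def dehyphenate(text: str) -> str:
    """Remove layout hyphenation and soft line breaks, keeping paragraphs."""
    # Treat \r\n like \n so only one kind of line break needs handling.
    text = text.replace("\r\n", "\n")
    # A double newline is a paragraph break and is put back unchanged.
    return "\n\n".join(_join_lines(p) for p in text.split("\n\n"))


if __name__ == "__main__":
    sample = (
        "Differential equations can\n"
        "be solved with the Runge-Kutta-\n"
        "Fehlberg method.\n"
        "\n"
        "Developed in the nineteenth-\n"
        " or twentieth century, this\n"
        "method is completely FANTAS-\n"
        "TIC."
    )
    print(dehyphenate(sample))
```

The order of the three hyphen rules matters: the two exceptions have to be handled before the catch-all that deletes every remaining hyphen at the end of a line.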

The linebreaks can be either \n or \r\n, depending on your preference, and the hyphen is a simple ASCII - (minus sign). UTF-8 support is not required. This challenge is code-golf, so shortest code wins.

Sanchises

Posted 2015-11-16T11:56:57.533

Reputation: 8 530

Answers

9

Retina, 58 bytes

(?<!\n)\n(?!\n)
<space>
- (?! |[A-Z][a-z])| (?= )|(?<=-) (?=[A-Z])
<empty>

<space> represents a single space on its own line and <empty> represents an empty trailing line. For counting purposes, each line goes into a separate file and the \n are replaced with actual linefeed characters. For convenience you can put all of the above in a single file though and run it with the -s flag.

I'm pretty sure there's a shorter way to do this, so I'll wait with an explanation until I'm done golfing.
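As a rough reading of the two stages, here is a Python sketch that applies the same two regular expressions with `re.sub`. It is an illustration only: the function name `retina_style` is invented, and the sketch assumes the input uses \n linebreaks.

```python
import re


def retina_style(text: str) -> str:
    # Stage 1: a newline with no newline on either side is a soft wrap,
    # so it becomes a space. Double newlines are left alone.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Stage 2: delete any of the following.
    #   "- " not followed by a space or Capital+lowercase (plain hyphenation)
    #   a space that is followed by another space (word group: "-  or")
    #   a space between a hyphen and a capital letter (hyphenated name)
    return re.sub(r"- (?! |[A-Z][a-z])| (?= )|(?<=-) (?=[A-Z])", "", text)
```

Because the first stage turns linebreaks into ordinary spaces, the second stage cannot tell them apart from spaces that were already in the text. In this sketch an input such as "a - b" on a single line would therefore lose its "- " as well.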

Martin Ender

Posted 2015-11-16T11:56:57.533

Reputation: 184 808

I actually wanted to tackle this one in Retina, but didn't want to mess with Mono :( – orlp – 2015-11-16T19:10:09.090

2

GNU Sed, 68

Score includes +2 for -zr options passed to sed.

s/\n\n/:/g
s/-\n([A-Z][a-z])/-\1/g
s/-\n /- /g
s/-\n//g
y/\n:/ \n/

Assumes that the input stream doesn't contain any : characters. If this is not acceptable, then the :'s in the code may all be replaced with a non-printable ASCII character, e.g. 0x7 (BEL).
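The placeholder technique can be sketched in Python, line for line against the sed commands. This is an illustrative port, not part of the answer: the names `sed_style` and `PLACEHOLDER` are invented, and BEL is used as the placeholder on the assumption that it never occurs in the input.

```python
import re

PLACEHOLDER = "\x07"  # BEL; assumed to be absent from the input


def sed_style(text: str) -> str:
    # s/\n\n/:/g  protect paragraph breaks behind a placeholder
    text = text.replace("\n\n", PLACEHOLDER)
    # s/-\n([A-Z][a-z])/-\1/g  hyphenated name: keep the hyphen
    text = re.sub(r"-\n([A-Z][a-z])", r"-\1", text)
    # s/-\n /- /g  word group: keep the hyphen and the space
    text = text.replace("-\n ", "- ")
    # s/-\n//g  ordinary hyphenation: drop hyphen and line break
    text = text.replace("-\n", "")
    # y/\n:/ \n/  remaining newlines -> spaces, placeholder -> newline
    return text.translate(str.maketrans("\n" + PLACEHOLDER, " \n"))
```

Since the final transliteration maps one character to one character, each paragraph break in this port comes back as a single newline rather than a double one.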

Digital Trauma

Posted 2015-11-16T11:56:57.533

Reputation: 64 644

2

TeaScript, 76 bytes

xB(`([A-Z][a-z]+-)
(\\S)`,b="$1$2",`(\\S)-
(\\S)`,b,`(\\S-?)
 ?(.)`,"$1 $2")

Very "brute force" method.

Downgoat

Posted 2015-11-16T11:56:57.533

Reputation: 27 116