Text Processing #1: Hyphenation

14

1

Background

This is the first part of a 3-hole golf course on text processing. The overarching idea is that if you take an input text and pipe it through the solutions to all three challenges (with a small amount of glue code), it will spit out a beautifully formatted paragraph. In this first challenge, your task is to hyphenate a piece of text using given hyphenation patterns.

Input

Your program shall take two string inputs: a piece of text and a list of hyphenation patterns. The first input is simply a non-empty string of printable ASCII characters and spaces; it will not contain line breaks or tildes ~. The second input is a comma-delimited list of words, which consist of tilde-delimited syllables of lowercase ASCII characters. An example is ex~cel~lent,pro~gram~ming,abil~i~ties.

Output

Your program shall modify the first input in the following way. Any word (maximal substring of alphabetical ASCII characters) whose hyphenated lowercase version is found in the second input shall be replaced by that hyphenated version, but its case shall be preserved. With the above example list, if the text contains the word Excellent, it shall be replaced by Ex~cel~lent; however, Excellently shall not be modified. Your output shall be this modified string.
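For readers who want an ungolfed reference to check answers against, here is a minimal Python sketch of the rule above. It is not part of the challenge: the function name hyphenate and the helper names table and fix are made up for illustration, and it relies on the input guarantees stated in the challenge (lowercase patterns, no tildes in the text).

```python
import re

def hyphenate(text: str, patterns: str) -> str:
    # Map each unhyphenated lowercase word to its hyphenated pattern,
    # e.g. "excellent" -> "ex~cel~lent".
    table = {p.replace("~", ""): p for p in patterns.split(",")}

    def fix(match: re.Match) -> str:
        word = match.group(0)
        pattern = table.get(word.lower())
        if pattern is None:
            return word  # not in the list: leave the word untouched
        # Walk the pattern; emit tildes as-is and take every letter
        # from the original word so that its case is preserved.
        letters = iter(word)
        return "".join("~" if c == "~" else next(letters) for c in pattern)

    # A word is a maximal run of ASCII letters, so digits and
    # punctuation act as word boundaries.
    return re.sub(r"[A-Za-z]+", fix, text)
```

Because the regex only ever matches maximal runs of letters, Excellently is looked up as a whole and never matches the entry for excellent.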

Detailed Rules and Scoring

You can assume the following about the inputs:

  • The first input contains no tildes, and no leading, trailing or repeated spaces. It is not empty.
  • The second input contains at least one word, and each word in it contains at least two syllables. Each syllable is non-empty.
  • The second input does not contain a word that occurs as a syllable in another word.

You can change the order of the two inputs, if desired, and optionally add one trailing newline to the output.

You can write a function or a full program. The lowest byte count wins, and standard loopholes are disallowed.

Test Cases

These are listed in the format 1st input [newline] 2nd input [newline] output.

Excellent programming abilities, you work excellently!
ex~cel~lent,pro~gram~ming,abil~i~ties
Ex~cel~lent pro~gram~ming abil~i~ties, you work excellently!

Superman (sometimes incorrectly spelled "Super-man") is super #&%@ing strong.
su~per,some~times,in~cor~rectly,spell~ing
Superman (some~times in~cor~rectly spelled "Su~per-man") is su~per #&%@ing strong.

IncONsISTent caPItalizATIon!
in~con~sis~tent,cap~i~tal~iza~tion
In~cON~sIS~Tent caP~I~tal~izA~TIon!

Such short words.
awk~ward
Such short words.

Digits123 are456cool789.
dig~its,dig~i~tal,are~cool
Dig~its123 are456cool789.

magic magic
ma~gic
ma~gic ma~gic

Any possible hyphenation error in this challenge is due to this hyphenation tool.

Zgarb

Posted 2015-09-03T00:14:29.160

Reputation: 39 083

I assume the input is standard 7-bit ASCII, and not some extended 8-bit version? – orlp – 2015-09-03T00:50:11.400

Is it okay to assume that any non-alphanumerical character will not count as a change to a word (e.g. a first input like #programming! will still be affected by a second input of pro~gram~ming)? Do numbers also not count (i.e. are only alphabetical characters allowed)? – cole – 2015-09-03T00:57:45.710

@orlp Yes, input consists of standard printable ASCII characters as listed here. – Zgarb – 2015-09-03T01:22:46.147

@Cole Non-alphabetical characters are not part of words (see the second test case). Digits count as non-alphabetical, I'll add a test case about that. – Zgarb – 2015-09-03T01:26:28.980

Can I assume some maximum number of syllables in one word? – Qwertiy – 2015-09-03T15:30:12.223

Actually I was wrong - I don't need this. – Qwertiy – 2015-09-03T15:42:54.307

Please add a test case: input magic magic, patterns ma~gic, output ma~gic ma~gic. – Qwertiy – 2015-09-03T20:40:24.193

@Qwertiy Thanks, I'll do that. Also, for the record, there is no limit for the number of syllables. – Zgarb – 2015-09-03T21:38:27.950

Where are Part 2 and Part 3 of this? Can you link from the question? – ShreevatsaR – 2017-06-08T22:24:21.013

Answers

5

Pip, 60 54 bytes

Fwa^`([A-Za-z]+)`O{aQ'~?'~w@++y}M(LCwQ_RM'~FIb^',Yv)|w

GitHub repository for Pip

Takes inputs as command-line arguments (which necessitates quotes around input 1, assuming it contains spaces). No trailing newline is printed (add an x to the end of the program to add one).

Somewhat ungolfed, with comments:

 ; Split 1st input on runs of letters, including the separators in the results
a^:`([A-Za-z]+)`
 ; Split 2nd input on commas
b^:',
 ; Iterate over the pieces w of the split text a (not over the pattern list b)
Fwa {
  ; Filter b for entries that match the current word (lowercase, with tildes removed)
 m:(LCw EQ _RM'~)FIb
  ; We expect this to be a list of 0 or 1 elements
  ; If it has one, m gets that element (the hyphenation pattern); if it's empty, m gets nil
 i:-1
 m:m@i
  ; Map this function to each character of pattern m: if it's tilde, return tilde;
  ; otherwise, return corresponding character of w
 m:{aEQ'~ ? '~ w@++i}Mm
  ; Output the result, unless it was nil (falsey), in which case output the original word
 Om|w
}

Sample run:

C:\Users\dlosc> pip.py hyphens.pip "IncONsISTent caPItalizATIon!" in~con~sis~tent,cap~i~tal~iza~tion
In~cON~sIS~Tent caP~I~tal~izA~TIon!
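
For readers who do not know Pip, the same steps can be sketched in Python. This is only an illustration of the approach described in the comments above, not a translation checked against the Pip interpreter; the function name hyphenate_split and the variable names are made up for this sketch.

```python
import re

def hyphenate_split(text, patterns):
    # Split on runs of letters; the capture group keeps the words in the result,
    # so the list alternates between separators and words.
    pieces = re.split(r"([A-Za-z]+)", text)
    entries = patterns.split(",")
    out = []
    for w in pieces:
        # Keep the patterns that equal the lowercased piece once tildes are removed.
        # By the challenge rules this list has zero or one element.
        matches = [p for p in entries if p.replace("~", "") == w.lower()]
        if matches:
            i = -1
            chars = []
            for c in matches[-1]:
                if c == "~":
                    chars.append("~")   # copy the tilde from the pattern
                else:
                    i += 1
                    chars.append(w[i])  # take the letter (and its case) from the text
            out.append("".join(chars))
        else:
            out.append(w)               # separators and unknown words pass through
    return "".join(out)
```

Separators and empty pieces never match an entry, because every pattern contains at least one letter after its tildes are removed.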

DLosc

Posted 2015-09-03T00:14:29.160

Reputation: 21 213

8

Retina, 88 bytes

+is`(?<![a-z~])([a-z~]+)(?=([a-z]+)+[^a-z~].*(?<=[\n,]\1(?(2)!)(?<-2>~\2)+[\n,]))
$1~
\n.*
<empty>

For counting purposes, each line goes into a separate file, \n are replaced with actual newline characters and <empty> is an empty file. For convenience, you can run the above code from a single file (where <empty> is an empty line) if you use the -s interpreter flag.

Martin Ender

Posted 2015-09-03T00:14:29.160

Reputation: 184 808

2

Javascript ES6, 117 141 chars

f=(t,p)=>p.split`,`.map(p=>t=t.replace(RegExp("((?:^|[^a-z])"+p.replace(/~/g,")(")+")(?=$|[^a-z])","ig"),(...x)=>x.slice(1,-2).join("~")))&&t

Test:

document.querySelector(".question pre").textContent.split("\n\n").map(t=>(t=t.split("\n"))&&f(t[0],t[1])==t[2])
// Array [ true, true, true, true, true ]
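
The idea in the answer above is to turn each pattern into a regex with one capture group per syllable and then join the captured groups with tildes. A rough Python sketch of that idea follows. The name hyphenate_groups is hypothetical, and the sketch uses a negative lookbehind for the left word boundary where the JavaScript code folds the preceding character into the first group.

```python
import re

def hyphenate_groups(text, patterns):
    for p in patterns.split(","):
        # "ma~gic" becomes "(ma)(gic)": one capture group per syllable.
        body = "".join("(" + re.escape(s) + ")" for s in p.split("~"))
        # The lookarounds require that no letter touches the match on either side,
        # without consuming the neighbouring characters.
        rx = re.compile(r"(?<![A-Za-z])" + body + r"(?![A-Za-z])", re.IGNORECASE)
        # The groups hold the text as it was written, so case is preserved.
        text = rx.sub(lambda m: "~".join(m.groups()), text)
    return text
```

Since the patterns are applied one after another to text that may already contain tildes, this relies on the rule that no word in the list occurs as a syllable of another word; otherwise a later pattern could match inside an already hyphenated word.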

Qwertiy

Posted 2015-09-03T00:14:29.160

Reputation: 2 697

You can use eval instead of RegExp constructor. String templates may also save a few bytes – Downgoat – 2015-09-03T22:03:30.350

1

Javascript (ES6), 173 169

Basic regex search and replace

(a,b)=>(b.split`,`.map(s=>a=a.replace(eval(`/(^|[^a-z])(${s.replace(/~/g,"")})(?=[^a-z]|$)/gi`),(_,n,o)=>(x=0,n+s.split``.map((q,i)=>(q=='~'&&++x?q:o[i-x])).join``))),a)

Edit: Fixed bug for test case magic magic, ma~gic

DankMemes

Posted 2015-09-03T00:14:29.160

Reputation: 2 769

Wrong: f("magic magic", "ma~gic") returns "ma~gic magic" – Qwertiy – 2015-09-03T20:39:02.400

@Qwertiy fixed. Somehow fixing it saved me 4 bytes too! – DankMemes – 2015-09-03T22:42:47.677

0

Perl, 146

$a=<>;$d=$_=~s/~//rg,$a=~s/(?<!\pL)$d(?!\pL)/h($&,$_)/gie for(split/,|\n/,<>);
print$a;
sub h{($g,$h)=@_;while($h=~/~/g){substr($g,"@-",0)='~'}$g}

Just a first attempt, lots of things can be shortened - will continue tomorrow!

Jarmex

Posted 2015-09-03T00:14:29.160

Reputation: 2 045