Guess how to pronounce German words

37

11

Introduction

Unlike English, German is considered to have quite a phonemic writing system. That means that the correspondence between spelling and pronunciation is close. Given any word you aren't familiar with, you would still know how to pronounce it because of the spelling system. This means a computer should be able to do it too, right?

Challenge

Write a program or function that takes as input a string representing a German word, and prints or returns its pronunciation in the International Phonetic Alphabet (IPA).

I am of course not going to make you learn German or the full IPA. This one Wikipedia section provides almost all the German to IPA rules you need, and I've coded an ungolfed C# reference implementation.

Also provided in that link is a list of 400 common German words and their IPA pronunciation (needed for validation). Taking an example from that list, if the input is solltest, the correct output is ˈzɔltəst.

The reference implementation adds two helpful rules not mentioned in the Wikipedia section: it assumes word stress is on the first syllable (very likely in German), and it uses a better heuristic for determining when the letter "e" represents the schwa sound /ə/. It also implements special processing for prefixes, but that didn't improve results as much as I thought it would.
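The first-syllable stress rule can be sketched roughly as follows. This is only an illustration and not the reference implementation: the helper names `count_syllables` and `add_initial_stress` are hypothetical, and syllables are approximated by counting runs of vowel letters in the spelling, which is an assumption made here for simplicity.

```python
import re

# Vowel letters as listed in the challenge's input requirement.
VOWEL_RUN = re.compile(r"[aeiouyäöü]+")

def count_syllables(word: str) -> int:
    """Rough syllable count: one syllable per run of vowel letters (an approximation)."""
    return len(VOWEL_RUN.findall(word.lower()))

def add_initial_stress(word: str, ipa: str) -> str:
    """Prefix the stress mark (U+02C8) when the word appears to have several syllables."""
    if count_syllables(word) > 1:
        return "\u02c8" + ipa
    return ipa

print(add_initial_stress("solltest", "zɔltəst"))  # stress mark added
print(add_initial_stress("gut", "\u0261u\u02d0t"))  # single syllable, left unchanged
```

Words such as "darüber" (stressed on the second syllable) show where this heuristic guesses wrong.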

Details

To be considered a valid entry, your program must meet the following requirements:

  • Your IPA output must be an exact match for at least 300 of the 400 words in the reference word list (the reference implementation gets 333 correct)
  • Your program must make a guess for any plausibly German word. As a technical requirement, this means that for any input that matches the regex [a-zA-ZäÄöÖüÜ][a-zäöüß]* and has at least one vowel (aeiouyäöü), you must produce non-whitespace-only output and not error out.
  • The program must be deterministic (always produce the same output given the same input)
  • Otherwise, standard loopholes are forbidden (especially the one about fetching off-site resources)
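The first two requirements above could be checked with a small harness like the one below. It is a sketch under stated assumptions: the names `must_handle` and `count_exact_matches` are hypothetical, the reference list is assumed to be available as a word-to-IPA dictionary, and an uppercase initial vowel is assumed to count as a vowel.

```python
import re

# The regex from the requirements; the whole input must match it.
PLAUSIBLE = re.compile(r"[a-zA-ZäÄöÖüÜ][a-zäöüß]*")
VOWELS = set("aeiouyäöü")

def must_handle(word: str) -> bool:
    """True if the rules oblige a program to produce output for `word`."""
    has_vowel = any(ch in VOWELS for ch in word.lower())
    return bool(PLAUSIBLE.fullmatch(word)) and has_vowel

def count_exact_matches(pronounce, reference: dict) -> int:
    """Count words whose output equals the reference IPA.

    Leading and trailing whitespace is stripped first, since the
    challenge allows it in the output.
    """
    return sum(1 for word, ipa in reference.items()
               if pronounce(word).strip() == ipa)

print(must_handle("solltest"))  # True
print(must_handle("hAus"))      # False: uppercase only allowed at the start
```

An entry would be valid when `count_exact_matches` returns at least 300 for the 400-word list.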

Miscellaneous things you are allowed to do:

  • Have leading and trailing whitespace in your output if you must
  • Use any pre-existing character encoding in the output (I can't imagine anything other than Unicode working well, but if you can, congrats)
  • Assume the input is in some normalized form, such as one of the Unicode normalization forms (NFD, NFC, etc.). This settles, for example, whether ä is written as a single character or as a base character plus a combining character.
  • Use standard input and output methods
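To make the normalization point concrete, here is a short sketch of how the two spellings of ä differ and how an entry might settle on one form. The helper name `to_nfc` is hypothetical; `unicodedata.normalize` is part of the Python standard library.

```python
import unicodedata

def to_nfc(word: str) -> str:
    """Return the composed (NFC) form, so ä is a single code point."""
    return unicodedata.normalize("NFC", word)

composed = to_nfc("a\u0308")                         # a + COMBINING DIAERESIS -> ä
decomposed = unicodedata.normalize("NFD", "\u00e4")  # ä -> a + COMBINING DIAERESIS

print(len(composed), len(decomposed))  # 1 2
```

Note that the requirement regex, as written with precomposed umlauts, would only match the composed form.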

Scoring & IPA Characters

Scoring is in bytes. Be warned that the German characters and the IPA characters are 2 bytes each in UTF-8. Also, the IPA character U+032F COMBINING INVERTED BREVE BELOW ( ̯) is a Unicode combining character and is a 2-byte UTF-8 character on its own. That means something like ɐ̯ would count as 4 bytes in UTF-8. For the curious, this symbol means the vowel does not form the syllable nucleus (the previous one does instead).
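The byte counts above can be verified with a quick check. This is a minimal sketch; the helper name `utf8_len` is made up for illustration.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

print(utf8_len("\u0250"))        # 2: the vowel ɐ on its own
print(utf8_len("\u032f"))        # 2: COMBINING INVERTED BREVE BELOW on its own
print(utf8_len("\u0250\u032f"))  # 4: the vowel plus the combining mark
print(utf8_len("ä"))             # 2: a German umlaut
```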

Also, beware of these IPA characters that in some fonts look like other ASCII characters: ɡ, ɪ, ʏ, ː (marks a long vowel), ˈ (marks which syllable has the stress in a multisyllable word).
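One way to catch such mix-ups is to scan output for the ASCII characters that resemble these symbols. The sketch below is illustrative only: the pairing of each IPA symbol with a particular ASCII lookalike is an assumption made here, and `suspicious_ascii` is a hypothetical helper name.

```python
import unicodedata

# ASCII character -> the IPA character it can be confused with (assumed pairing).
LOOKALIKES = {
    "g": "\u0261",  # ɡ
    "I": "\u026a",  # ɪ
    "Y": "\u028f",  # ʏ
    ":": "\u02d0",  # ː long-vowel mark
    "'": "\u02c8",  # ˈ stress mark
}

def suspicious_ascii(ipa: str) -> list:
    """Return the ASCII characters in `ipa` that may be mistyped IPA symbols."""
    return [ch for ch in ipa if ch in LOOKALIKES]

# Show the code point and Unicode name of each IPA symbol.
for ascii_ch, ipa_ch in LOOKALIKES.items():
    print(f"{ascii_ch!r} vs {ipa_ch!r}: U+{ord(ipa_ch):04X} {unicodedata.name(ipa_ch)}")

print(suspicious_ascii("gu:t"))  # ['g', ':']
```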

How the reference word list was created

This section is extra info not needed for the challenge.

The word list was grabbed from this Wiktionary word frequency list, removing repeats because of casing difference and two words that didn't have German entries in the English Wiktionary (oh & hey). The IPA was from looking at both the English and German Wiktionaries. Where multiple pronunciations were offered, I chose the more formal and standard one. If this wasn't clear, I chose the one that most fit the general rules.

I also had to standardize how the letter "r" is pronounced. How this letter is pronounced depends heavily on the region, and Wiktionary was not at all consistent in which one it chose. I felt it tended towards the following: "r" is pronounced /ɐ̯/ when preceded by a long vowel and not followed by a vowel; otherwise, it is /ʁ/. So, I changed all of them to follow that rule, except for the ver- and er- prefixes, which were quite consistently /(f)ɛɐ̯/. Similarly, I standardized "eu" as /ɔʏ̯/.
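The "r" standardization can be written as a tiny decision function. This sketch reads the rule as applying to an "r" that comes after a long vowel and is not followed by a vowel, which is an interpretation; the function name `r_sound` and its boolean parameters are hypothetical, and the ver-/er- prefix exception is deliberately left out.

```python
def r_sound(after_long_vowel: bool, before_vowel: bool) -> str:
    """Pick the IPA symbol for the letter "r" under the standardized rule."""
    if after_long_vowel and not before_vowel:
        return "\u0250\u032f"  # ɐ̯: vocalized r, e.g. at the end of "für"
    return "\u0281"            # ʁ: consonantal r, e.g. between vowels in "darüber"

print(r_sound(after_long_vowel=True, before_vowel=False))
print(r_sound(after_long_vowel=True, before_vowel=True))
```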

DPenner1

Posted 2017-05-15T22:30:00.040

Reputation: 971

Mathematica has a built-in for this (#~WordData~"PhoneticForm"&), but it works only for English words. – JungHwan Min – 2017-05-15T23:12:58.353

@JungHwanMin I read your comment as follows: Immediate heart attack, sigh of relief. – DPenner1 – 2017-05-15T23:35:05.627

How should I know "gestern" is pronounced "GHES-tern" instead of "ge-SHTERN"? "bester" as "BEST-er" not "be-SHTER"? – Leaky Nun – 2017-05-16T08:15:41.060

@LeakyNun There's no 100% algorithm for this, but your implementation only needs to get 75%. My reference implementation also gets those words wrong. – DPenner1 – 2017-05-16T12:01:55.267

@LeakyNun Well, your program will also be able to deal with German dialects, so it is more powerful. – P. Siehr – 2017-06-22T15:30:53.157

Do we need to take 'ẞ' as well as 'ß'? – Élektra – 2017-08-22T20:00:22.323

@Xenon No, I purposely left that character out of the requirements. I did, however, just edit the requirement regex to be a bit more lax - I don't expect people to have to support random uppercase in the middle of a word, just at the start. Also, no need to support 'ß' at the start of a word. – DPenner1 – 2017-08-22T20:34:33.343

@P.Siehr which German dialect pronounces the second syllable of "gestern" or "bester"? – Leif Willerts – 2017-08-22T21:32:54.750

@LeifWillerts A lot of German dialects contain the softening/flattening of hard sounds, like 'st' (as in the English word "best") to 'sch' (as in the English word "English"), or a hard 't' to a soft 'd' sound. Examples are Schwäbisch (Swabian), kurpfälzisch (dialect of Electoral Palatinate (?)), and probably sächsisch (Saxon), bayrisch (Bavarian). [Hard to tell how word x is spoken in dialect z, if you don't speak it.] About the pronunciation of the first or second syllable: I'd say that you still pronounce the first syllable, and the word just sounds different: ge-stern to ge-schtern. – P. Siehr – 2017-08-23T06:41:58.433

The problem of "st" can be solved by: If "st" is at the beginning of the word, it is pronounced soft [ʃt]. Examples: Stadt (city), Stuhl (chair), Stuttgart. If it is in the middle of the word, it is pronounced hard [st]. Examples: gestern (yesterday), bester (best), hast ([you] have), Hast (haste). (Same goes for "sp" at beginning and end.) I can't think of words breaking these rules. – P. Siehr – 2017-08-23T07:04:05.987

@PhilippSiehr It seems to be more about the beginning of the syllable: Hauptstadt, Gestirn, Gestalt, Versteck, Besteck... – Christian Sievers – 2017-10-14T09:03:20.257

@PhilippSiehr negative for Saxon – hiergiltdiestfu – 2017-10-16T14:53:38.960

@ChristianSievers More correctly, it’s the beginning of a morpheme. – Jan – 2017-10-27T09:52:02.230

@PhilippSiehr CC previous comment. A word that violates your rule (but admittedly is a proper noun) is Konstanz (pronounced with [ʃt]). Also, overwhelming but not complete negative for Bavarian for the words you mention (only very small areas very near to the Allgäu may). Erst, however, is pronounced with [ʃt] in Bavarian pretty consistently (to provide a different example). – Jan – 2017-10-27T09:54:27.423

Answers

9

PHP, 3311 2988 2916 2845 2759 2671 2667 2509 2484 bytes, passing 301/400

<?$f=fopen(__FILE__,r);fseek($f,__COMPILER_HALT_OFFSET__);eval(strtr(stream_get_contents($f),[F=>'=>',M=>'==','&'=>'&&',H=>'function ',A_=>'array',G=>'if',4=>'for','|'=>'||','~'=>'))','%'=>str,7=>'=$',8=>'[]',9=>'!$','@'=>'count(','#'=>';$',5=>'return ',3=>':(']));__halt_compiler();define(J,[ieh,ah,aa,Ah,eh,ee,ie,ih,oh,oo,Oh,uh,Uh,au,eu,Au,ei,ai,ey,ay,a,e,i,o,u,A,O,U,y])#b7e=8;Hv($a){5in_A_($a,J);}Hn($a){5!v($a);}Hpronounce($w){global$b,$e#w=%tr(%tolower(%tr($w,[ßF1,ÄF2,äF2,ÖF0,öF0,ÜF6,üF6]~,[1FS,2FA,0FO,6FU])#W=8#L7w;while($L)4each(A__merge([tzsch,dsch,tsch,zsch,sch,chs,ch,ck,dt,ng,nk,pf,ph,qu,ss,th,tz,b,c,d,f,g,h,j,k,l,m,n,p,r,s,S,t,v,w,x,z],J)as$c){$l=%len($c);G(sub%($L,0,$l)M$c){$W87c#L=sub%($L,$l);break;}}$s=8#l=@$W);4($o7t7i=0#i<$l#i++){$c7W[$i]#p7i?$W[$i-1]:0#n7iM$l-1?0:$W[$i+1];G(9n|!(n($c)&$cM$n&n($W[$i+2]~)$s[$o]87c;G($p&((9n&v($c)&n($p~|(n($n)&v($W[$i+2]~~$s[++$o]=8;}$s[@$s)-1]|A__pop($s);4each($s as$z){$b87t#t+=@$z)#e87t;}$o=[sieFziQ,duFduQ,'die'FdiQ,derFdeQT,zuFtsuQ,wirFviQT,mirFmiQT,denFdeQn,dassFdas,erFeQT,soFzoQ,warFvaQT,fürFfyQT,jaFjaQ,wieFviQ,dirFdiQT,nurFnuQT,demFdeQm,ihnFiQn,auchFaUBx,ihrFiQT,daFdaQ,schonFʃoQn,wennFvEn,malFmaQl,gutFguQt,nachFnaQx,willFvIl,mussFmUs,habFhaQp,vorFfoQT,ihmFmiQm,tunFtuQn][$w]?:'';G($o)goto x#P7B7S7V7Z=0;@$s)>1&$o=[verFfET,daFda,geFgC][join($s[0])]#o&$P=1&A__shGt($s);(($P|@$s)M2)&$o.=W)|(@$s)>2&$d=1&$h=(int)@$s)/2)#w=A__merge(...$s);4each($w 
as$l){G(0<$S--)continue#n7w[$B+1]#p7w[$B-1]#f=''#Z+=in_A_($B,$b)#f7lMd&9n?t3$lMb&(9n|$nMt)?p3$lMg&(9n|$nMt)?((9n&$pMi)?K:k)3$lMc?(($nMA|$nMe|$nMi)?ts:k)3$lMch?(($pMa|$pMo|$pMu)?x:K)3$lMchs|$lMx?ks3$lMck?k3$lMdsch?dZ3$lMdt|$lMth?t3$lMph|$lMv?f39f&$lMg?g3$lMh?(v($n)?h:'')3$lMng?N3$lMnk?Nk3$lMqu?kv3$lMr?((!v($n)&9nMt)?T:R)3$lMsch?S3$lMss|$lMS?s3$lMtsch|$lMtzsch|$lMzsch?tS3$lMtz|$lMz?ts3$lMw?v3$lMs?(9p&($nMp|$nMt~?S3v($n)?z:s):$f~~~~~~~~~~)#U=0;G(v($l~{G(%len($l)>1)($f=[auFaUB,euFcYB,eiFaIB][$l])|$U=1;else{G(n($n)&((9w[$B+2]&$n!=n)|v($w[$B+2]~)$U=1;G($lMe){$U=9n?:$U;G(9w[$B+2]){G($nMr)($f=A)&$U=9S=1;G($nMm|$nMl)$f=C;}}elseG($nMch)$U=0;G(in_A_($B,$e~$U=0;}$f=($U|9Z)&9f?($l[0]MO?D3$l[0]MU?y3$l[0]MA?E:$l[0]~).Q39f?($lMe?((9n|(9w[$B+2]&($nMn|$nMs~)?C:E)3$lMA?E3$lMi?I3$lMo?c3$lMu?U3($lMU|$lMy)?Y:$l~~~:$f)#V++;}$f7f?:$l;G($d&$ZM$h)$f.=W#o.7f#B++;}G(%pos($o,W)M=false&$V>1)$o=W.$o;4(#j++<%len($o);)G($o[$j]M$o[$j+1])$o=sub%($o,0,$j).sub%($o,$j+1);x:5%tr($o,[SFʃ,ZFʒ,KFç,gFɡ,NFŋ,QF'ː',WFˈ,TFɐ̯,BF'̯',RFʁ,AFɐ,EFɛ,OFœ,IFɪ,YFʏ,UFʊ,cFɔ,CFə,DFø]);}

Defines pronounce(string $word).

Usage:

assert(
    pronounce('darüber') == "daˈʁyːbɐ"
);

One note: 3 prefixes and 33 words are hard-coded, and some of the code is mildly optimized towards the testing list.

Testing code is here, though it does depend on this file.

To test:

php test.php all

Powered by the tears of ternary statements.

EDIT 7: Squeezed off ~170 bytes by writing a preprocessor into the program. As a consequence, the actual program (everything after __halt_compiler();) is kinda hard to read. If you want the unprocessed program, switch eval with print in the third statement.
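The preprocessor idea from EDIT 7 (store the program with short placeholder tokens, expand them at run time, then evaluate the result) can be illustrated in a few lines. This is a Python sketch of the general technique, not a translation of the PHP above: the table, the packed string and the name `expand` are all made up for the example. Like PHP's strtr with an array, it replaces in a single pass and prefers longer keys.

```python
import re

def expand(source: str, table: dict) -> str:
    """Replace every key of `table` found in `source`, longest keys first, in one pass."""
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], source)

# Hypothetical abbreviation table and packed program text.
TABLE = {"H": "def ", "5": "return ", "M": "=="}
packed = "Hsame(a, b): 5a M b"

expanded = expand(packed, TABLE)
print(expanded)  # def same(a, b): return a == b

# Evaluate the expanded source, as the answer does with eval().
namespace = {}
exec(expanded, namespace)
print(namespace["same"](2, 2))  # True
```

Printing `expanded` instead of executing it corresponds to the "switch eval with print" tip for reading the unprocessed program.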

Élektra

Posted 2017-05-15T22:30:00.040

Reputation: 284

This is actually 2667 bytes, not 2671 (assuming UTF-8) – caird coinheringaahing – 2017-10-13T19:40:00.903