Romanize Korean

Yes, it's basically "You're a Romanizer, Baby", but harder. Like, way harder.

Learning Korean is HARD, at least for a person outside Asia. But they at least have the chance to learn, right?

What you must do

You will be given a Korean statement, for example 안녕하세요. You must convert the input to its romanized pronunciation. For the given example, the output can be annyeonghaseyo.

Now it gets technical

A Korean character has three parts: a Starting consonant, a Vowel, and an Ending consonant. The Ending consonant may be absent.

For example, is (Starting consonant) and (Vowel), and is (Starting consonant), (Vowel), and (Ending consonant).
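As a rough illustration of this three-part structure (not part of the challenge text): precomposed Hangul syllables in Unicode start at U+AC00 and are laid out as 19 starting consonants × 21 vowels × 28 ending slots, where slot 0 means "no ending consonant", so the parts can be recovered with integer arithmetic. The name `decompose` is made up for this sketch.

```python
HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable, 가
INITIAL_COUNT = 19
VOWEL_COUNT = 21
FINAL_COUNT = 28      # 27 ending consonants plus "none" (index 0)

def decompose(syllable):
    """Return (starting, vowel, ending) indices for one Hangul syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < INITIAL_COUNT * VOWEL_COUNT * FINAL_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    starting, rest = divmod(offset, VOWEL_COUNT * FINAL_COUNT)
    vowel, ending = divmod(rest, FINAL_COUNT)
    return starting, vowel, ending

print(decompose("안"))  # ㅇ + ㅏ + ㄴ
print(decompose("하"))  # ㅎ + ㅏ, ending index 0 (no ending consonant)
```

The indices follow the same order as the consonant and vowel tables in this post, with extra ending slots for the compound endings that the challenge does not cover.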

Every consonant and vowel has its own pronunciation. The pronunciation for each consonant is as follows.

Korean                 ㄱ   ㄲ  ㄴ  ㄷ   ㄸ  ㄹ  ㅁ  ㅂ  ㅃ  ㅅ  ㅆ  ㅇ   ㅈ   ㅉ  ㅊ ㅋ  ㅌ   ㅍ  ㅎ
Romanization Starting   g   kk  n   d   tt  r   m   b   pp  s   ss  –   j   jj  ch  k   t   p   h
               Ending   k   k   n   t   –   l   m   p   –   t   t   ng  t   –   t   k   t   p   h

(– means no pronunciation or not used. You do not have to handle them.)

The pronunciation for each vowel is as follows.

Hangul          ㅏ  ㅐ  ㅑ  ㅒ   ㅓ  ㅔ  ㅕ  ㅖ  ㅗ   ㅘ   ㅙ  ㅚ ㅛ  ㅜ  ㅝ  ㅞ  ㅟ   ㅠ  ㅡ   ㅢ ㅣ
Romanization    a   ae  ya  yae eo  e   yeo ye  o   wa  wae oe  yo  u   wo  we  wi  yu  eu  ui  i
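To make the two tables concrete, here is a minimal sketch that romanizes each syllable on its own, ignoring any interaction between neighbouring syllables. The names `naive_romanize`, `INITIAL`, `VOWEL` and `FINAL` are made up for the example, the ending indices are the Unicode ones, and compound endings are deliberately left out because the challenge does not cover them.

```python
INITIAL = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
           "ss", "", "j", "jj", "ch", "k", "t", "p", "h"]
VOWEL = ["a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
         "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"]
# Unicode ending index -> romanization; index 0 means "no ending consonant".
FINAL = {0: "", 1: "k", 2: "k", 4: "n", 7: "t", 8: "l", 16: "m", 17: "p",
         19: "t", 20: "t", 21: "ng", 22: "t", 23: "t", 24: "k", 25: "t",
         26: "p", 27: "h"}

def naive_romanize(text):
    """Romanize syllable by syllable, with no sound changes applied."""
    out = []
    for ch in text:
        offset = ord(ch) - 0xAC00
        if 0 <= offset < 19 * 21 * 28:
            lead, rest = divmod(offset, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(INITIAL[lead] + VOWEL[vowel] + FINAL[tail])
        else:
            out.append(ch)  # spaces and anything else pass through unchanged
    return "".join(out)

print(naive_romanize("안녕하세요"))
print(naive_romanize("나랏말싸미"))
```

The first call already gives annyeonghaseyo, but the second gives naratmalssami rather than the expected naranmalssami: the tables alone are not enough once an ending consonant meets the next starting consonant.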

Now it's the really hard part

A Starting consonant's pronunciation changes depending on the Ending consonant that comes before it. The pronunciation for every Starting/Ending consonant pair is shown in the following image. Thank you, Wikipedia. If this didn't exist, I'd have to WRITE all this. (You do not have to put the hyphen between pronunciations. It's unnecessary. If a cell has two or more pronunciations, choose one. If there's no ending consonant, use the original pronunciation.)

Examples

Korean => English
안녕하세요 => annyeonghaseyo
나랏말싸미 듕귁에달아 => naranmalssami dyunggwigedara  //See how the ㅅ in 랏 changes from 't' to 'n'

Example suggestions are welcome. You can get answers for your own inputs here. (The one in "General text", Revised is what I'm asking for)

Matthew Roh

Posted 2018-02-10T05:14:52.663

Reputation: 5 043

Will the input always consist of Unicode characters AC00-D7AF + space? – Arnauld – 2018-02-10T09:53:26.593

There are several special ㅎ + X combinations which are not highlighted in yellow (e.g. ㅎ+ ㅈ = ch). Does that mean that we don't have to support them? (Also, ㅎ is 'romanized' as t instead of h in the picture, which is a bit confusing.) – Arnauld – 2018-02-10T10:33:24.940

Test cases: https://gist.github.com/perey/563282f8d62c2292d11aabcde0b94d2d As @Arnauld says, there are some oddities in the special combinations; this has tests for all of the ones I found in the table, whether highlighted or not. Where multiple options exist, they are space-separated. No hyphens are used as I expect people to golf them out. – Tim Pederick – 2018-02-10T11:47:11.730

I don't see "General text" in your suggested output-checking link; do you mean "General things"? If so, which one of the three should we use (Revised, McCune, Yale)? None seem to match your table; for example, ㅈ followed by ㄹ should be "nn" according to you but is "tr" or "cl" at that link. (Note that my test cases in the previous comment are based on transliterations in the question!) – Tim Pederick – 2018-02-10T11:57:23.270

ㅎ followed by ㄱ, ㄷ, ㅈ are also special cases (they become aspirated to ㅋ, ㅌ, ㅊ (k, t, ch)); you should highlight those too. – JungHwan Min – 2018-02-10T17:03:53.733

As a Korean myself, I like to see Korean-related challenges. But it's kinda sad that double final consonants aren't considered in this challenge... – JungHwan Min – 2018-02-10T17:06:36.313

@JungHwanMin I could (as I am Korean myself too), but that would only complicate things more. :P – Matthew Roh – 2018-02-10T18:05:52.053

@JungHwanMin Also, it's already highlighted in the image. – Matthew Roh – 2018-02-12T03:37:47.020

@lol doesn't seem like it: (for instance ㅎ then ㄱ makes k, but that's not highlighted) – JungHwan Min – 2018-02-12T19:05:20.617

Answers

8

Python 3.6, 400 394 bytes

Edit: Thanks to RootTwo for -6 bytes.

This is my first submission on CodeGolf, so I'm pretty sure there are better ways to golf it, but I thought I'd still post it, as nobody has mentioned the key idea yet, and this is still significantly shorter than other solutions.

import re,unicodedata as u
t='-'.join(u.name(i)[16:]for i in input()).lower()
for i in range(19):t=re.sub('h-[gdb]|(?<!n)([gdbsjc]+)(?!\\1)(?!-?[aeiouyw]) gg dd bb -- - h(?=[nmrcktp])|hh hj l(?=[aeiouyw]) l[nr] [nt][nr] tm pm [pm][nr] km kn|kr|ngr c yi weo'.split()[i],([lambda m:'ktpttt'['gdbsjc'.index(m[0][-1])]]+'kk,tt,pp, ,,t,c,r,ll,nn,nm,mm,mn,ngm,ngn,ch,ui,wo'.split(","))[i],t)
print(t)

How it works

The solution attempts to exploit the fact (which I learned from the original Japanese romanization challenge) that romanized character names are accessible through Python's unicodedata module. For Korean, they take the form HANGUL SYLLABLE <NAME>. Unfortunately, processing these names to meet the provided specification and to cover all the syllable combination scenarios still requires quite a bit of effort (and bytes).

The obtained character names list all consonants in their voiced form anywhere in the syllable, e.g. GGAGG for 깎. R/L are transcribed as intended (starting R, ending L), and CH is given as C (this actually saves us a bit of headache).

First of all, we strip off the HANGUL SYLLABLE part (first 16 chars), mark the syllable boundaries with -, and then apply a series of RegEx'es to do the conversions.
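A small un-golfed sketch of this first step, assuming the input contains only Hangul syllables and spaces; the helper name `syllable_names` is made up for the example.

```python
import unicodedata

PREFIX_LENGTH = len("HANGUL SYLLABLE ")  # 16 characters

def syllable_names(text):
    """Join the lower-cased Unicode short names of each syllable with '-'."""
    return "-".join(unicodedata.name(ch)[PREFIX_LENGTH:] for ch in text).lower()

print(unicodedata.name("안"))        # full character name
print(syllable_names("안녕하세요"))  # names only, with '-' at syllable boundaries
```

A space has the name SPACE, which is shorter than 16 characters, so it contributes an empty name and leaves a `--` in the joined string. The golfed code later turns that `--` back into a space.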

The first RegEx looks particularly nasty. What it basically does is convert starting consonants into their ending equivalents (also removing the extra letter in case of double consonants) when they are not followed by a vowel, or, for some letters, when they are preceded by h. The (?<!n) lookbehind prevents matching a g which is part of ng, and the (?!\\1) lookahead ensures that we don't convert, e.g., ssa to tsa.
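To see the lookarounds at work in isolation, here is the same pattern applied on its own to a few hyphen-separated name strings. The names `FIRST_RULE`, `to_final` and `apply_first_rule` are only for this sketch.

```python
import re

# h followed by g/d/b across a boundary, or a run of g/d/b/s/j/c that is not
# part of "ng", not the first half of a doubled letter, and not before a vowel.
FIRST_RULE = re.compile(r"h-[gdb]|(?<!n)([gdbsjc]+)(?!\1)(?!-?[aeiouyw])")

def to_final(match):
    """Map the last matched letter to its syllable-final sound."""
    return "ktpttt"["gdbsjc".index(match[0][-1])]

def apply_first_rule(names):
    return FIRST_RULE.sub(to_final, names)

for sample in ["na-ras-mal", "ggagg", "ssa", "nyeong", "joh-da"]:
    print(sample, "->", apply_first_rule(sample))
```

In these samples the s of ras becomes t and the trailing gg collapses to k, while ssa and the g of nyeong are left alone. The h-d pair in joh-da is replaced as a whole by t.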

The next few RegEx'es convert starting double consonants into their unvoiced equivalents. Here's where the - separators also come in handy, as they help distinguish boundary collisions (g-g) from double consonants (gg). After this step the separators can be removed.

Next, we handle the remaining h+consonant combinations, l->r before vowels, and other special cases.

Finally, we restore c to ch, and resolve some other peculiarities of our incoming char names, such as yi instead of ui and weo instead of wo.

I'm not an expert in Korean and can't comment much more, but this seems to pass all the tests posted in the task and on GitHub. Obviously, a few more bytes could be shaved off if uppercase output is acceptable, as this is what we get from the name function.
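For readers who prefer it spelled out, the whole sequence of substitutions can be written as an ordered list of (pattern, replacement) pairs. This is meant as a readable unrolling of the golfed loop above, not a different algorithm; the names `RULES`, `to_final` and `romanize` are made up, and the order of the rules matters.

```python
import re
import unicodedata

def to_final(match):
    """Map the last matched letter to its syllable-final sound."""
    return "ktpttt"["gdbsjc".index(match[0][-1])]

RULES = [
    (r"h-[gdb]|(?<!n)([gdbsjc]+)(?!\1)(?!-?[aeiouyw])", to_final),
    ("gg", "kk"), ("dd", "tt"), ("bb", "pp"),  # doubled starting consonants
    ("--", " "),                               # an empty name was a space
    ("-", ""),                                 # drop syllable separators
    (r"h(?=[nmrcktp])|hh", "t"),               # remaining h + consonant cases
    ("hj", "c"),
    (r"l(?=[aeiouyw])", "r"),                  # l -> r before a vowel
    ("l[nr]", "ll"),
    ("[nt][nr]", "nn"),
    ("tm", "nm"),
    ("pm", "mm"),
    ("[pm][nr]", "mn"),
    ("km", "ngm"),
    ("kn|kr|ngr", "ngn"),
    ("c", "ch"),                               # restore ch
    ("yi", "ui"),                              # quirks of the Unicode names
    ("weo", "wo"),
]

def romanize(text):
    t = "-".join(unicodedata.name(ch)[16:] for ch in text).lower()
    for pattern, replacement in RULES:
        t = re.sub(pattern, replacement, t)
    return t

print(romanize("안녕하세요"))
print(romanize("나랏말싸미 듕귁에달아"))
```

Like the golfed version, this assumes the input consists only of Hangul syllables and spaces.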

Kirill L.

Posted 2018-02-10T05:14:52.663

Reputation: 6 693

Welcome to PPCG! Great first answer. – FantaC – 2018-02-15T16:17:44.460

Nice answer. As of Python 3.6, m[0] is the same as m.group(0), saving 6 bytes. – RootTwo – 2018-02-16T22:51:50.613

5

JavaScript (ES6), 480 bytes (WIP)

This is an early attempt based on the current specs to get the ball rolling. It may require some fixing when the questions in the comments are addressed.

s=>[...s].map(c=>c<'!'?c:(u=c.charCodeAt()-44032,y='1478ghjlmnpr'.search((p=t).toString(36)),t=u%28,u=u/28|0,v=u%21,x=[2,5,6,11,18].indexOf(u=u/21|0),~x&~y&&(z=parseInt(V[y+68][x],36))>10?V[z+69]:V[p+40]+V[u+21])+V[v],t=0,V='8a6y8ye6e46ye4y64w8wa6o6y4u/w4w6wi/yu/eu/ui/i/g/k21d/t7r/3b/p0s/ss95j5ch/270h922/197l999930/77ng/77270h/bbcd6afaa8gghi5ffak8alaa8llmn4gghp8abaa8gghq5gghr5ggha5gghs8ng1ng3g/2ll/n1n3d/7r/m1m3b/0s/5ch/h'.replace(/\d/g,n=>'pnkmojeta/'[n]+'/').split`/`).join``

Test cases

let f =

s=>[...s].map(c=>c<'!'?c:(u=c.charCodeAt()-44032,y='1478ghjlmnpr'.search((p=t).toString(36)),t=u%28,u=u/28|0,v=u%21,x=[2,5,6,11,18].indexOf(u=u/21|0),~x&~y&&(z=parseInt(V[y+68][x],36))>10?V[z+69]:V[p+40]+V[u+21])+V[v],t=0,V='8a6y8ye6e46ye4y64w8wa6o6y4u/w4w6wi/yu/eu/ui/i/g/k21d/t7r/3b/p0s/ss95j5ch/270h922/197l999930/77ng/77270h/bbcd6afaa8gghi5ffak8alaa8llmn4gghp8abaa8gghq5gghr5ggha5gghs8ng1ng3g/2ll/n1n3d/7r/m1m3b/0s/5ch/h'.replace(/\d/g,n=>'pnkmojeta/'[n]+'/').split`/`).join``

console.log(f("안녕하세요"))
console.log(f("나랏말싸미 듕귁에달아"))

How?

Once decompressed, the array V contains the following data:

00-20 vowels
a/ae/ya/yee/eo/e/yeo/ye/o/wa/wae/oe/yo/u/wo/we/wi/yu/eu/ui/i

21-39 starting consonants
g/kk/n/d/tt/r/m/b/pp/s/ss//j/jj/ch/k/t/p/h

40-67 ending consonants
/k/k//n///t/l////////m/p//t/t/ng/t/t/k/t/p/h

68-79 indices of substitution patterns for consecutive consonants
      ('a' = no substitution, 'b' = pattern #0, 'c' = pattern #1, etc.)
bbcde/afaaa/gghij/ffaka/alaaa/llmno/gghpa/abaaa/gghqj/gghrj/gghaj/gghsa

80-97 substitution patterns
ngn/ngm/g/k/ll/nn/nm/d/t/r/mn/mm/b/p/s/j/ch/h

We split each Hangul character into starting consonant, vowel and ending consonant. We append to the result:

  • V[80 + substitution] + V[vowel] if there's a substitution
  • V[40 + previousEndingConsonant] + V[21 + startingConsonant] + V[vowel] otherwise
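The decompression of V can be mirrored in Python as a rough sketch. Each digit n in the packed string stands for the letter 'pnkmojeta/'[n] followed by a '/' separator, which is how frequent "letter + separator" pairs are squeezed into one character. Only the vowel part of the packed string is used here, and the names `decompress` and `VOWEL_PART` are made up for the example.

```python
import re

DIGIT_EXPANSION = "pnkmojeta/"

def decompress(packed):
    """Expand each digit to its letter plus '/', then split on '/'."""
    expanded = re.sub(r"\d", lambda m: DIGIT_EXPANSION[int(m[0])] + "/", packed)
    return expanded.split("/")

# The first section of the packed string, up to the last vowel.
VOWEL_PART = "8a6y8ye6e46ye4y64w8wa6o6y4u/w4w6wi/yu/eu/ui/i"
print(decompress(VOWEL_PART))
```

This reproduces the 21 vowel entries listed above, including the "yee" at index 3.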

Arnauld

Posted 2018-02-10T05:14:52.663

Reputation: 111 334

Can '!' not be 33? – Jonathan Frech – 2018-02-19T12:12:31.130

@JonathanFrech c is not a byte. It's a 1-character string. That said, when applying an arithmetic operation, a space is coerced to 0 while other non-digit characters are coerced to NaN. Which means that c<1 should actually work as expected. (And c<33 would also work for non-digit characters, although this is kind of fortuitous.) – Arnauld – 2018-02-19T12:38:41.847

@JonathanFrech Addendum: c<1 would also be truthy for "0" (which is probably OK if the input is guaranteed not to contain any Arabic numeral.) – Arnauld – 2018-02-19T12:49:20.920

Thanks. I did not think that JavaScript would have characters implemented as a single byte, though tried nonetheless. It, however, seemed to work. Glad to now know why. – Jonathan Frech – 2018-02-19T12:58:40.530

2

Tcl, 529 bytes

fconfigure stdin -en utf-8
foreach c [split [read stdin] {}] {scan $c %c n
if {$n < 256} {append s $c} {incr n -44032
append s [string index gKndTrmbPsS-jJCktph [expr $n/588]][lindex {a ae ya yae eo e yeo ye o wa wae oe yo u wo we wi yu eu ui i} [expr $n%588/28]][string index -Ak-n--tl-------mp-BGQDEkFph [expr $n%28]]}}
puts [string map {nr nn
A- g An ngn Ar ngn Am ngm A kk
t- d p- b B- s D- j
nr ll l- r ln ll lr ll
A k B t G t D t E t F t
K kk T tt P pp S ss J jj C ch Q ng
- ""} [regsub -all -- {[tpBDEFh]([nrm])} $s n\\1]]

Algorithm

  1. Decomposition into lead, vowel, and tail indices
  2. First lookup to intermediate alphabetic representation
  3. Apply an initial pass for all the xn→nn/xm→nm transforms
  4. Apply a final pass for the remaining transforms
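Steps 1 to 3 can be sketched in Python as follows, reusing the lookup strings from the Tcl code. The final `string map` pass is not reproduced, so the output is still the intermediate representation rather than the finished romanization. The function names `intermediate` and `nasalize` are made up for this sketch.

```python
import re

LEAD = "gKndTrmbPsS-jJCktph"            # one letter per starting consonant
VOWELS = "a ae ya yae eo e yeo ye o wa wae oe yo u wo we wi yu eu ui i".split()
TAIL = "-Ak-n--tl-------mp-BGQDEkFph"   # one letter per ending slot, '-' = none

def intermediate(text):
    """Steps 1-2: decompose each syllable and look up its placeholder letters."""
    out = []
    for ch in text:
        n = ord(ch)
        if n < 256:
            out.append(ch)              # non-Hangul input passes through
        else:
            n -= 44032
            out.append(LEAD[n // 588] + VOWELS[n % 588 // 28] + TAIL[n % 28])
    return "".join(out)

def nasalize(s):
    """Step 3: certain endings turn into n before n, r or m."""
    return re.sub(r"[tpBDEFh]([nrm])", r"n\1", s)

step2 = intermediate("나랏말")
print(step2)
print(nasalize(step2))
```

The uppercase letters are placeholders that keep, for example, an ending ㅅ (B) distinct from a starting s until the last pass. This is also why the input must not contain Latin letters.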

This algorithm is crunched for purposes of the challenge; the trade-off being that the input is assumed to not contain any Latin alphabetic characters, nor to use characters outside the U+AC00 Hangul block as described in the challenge. Were this real code, I'd keep all the transforms in Jamo until the final pass.

I suppose I could throw some more brainpower at crunching those vowels and some of the repetitions in the lookup table, but this is as good as it gets from me today.

Testing

Make sure you can supply UTF-8 input to the Tcl interpreter. This is most easily accomplished with a simple UTF-8 text file. Alas, Tcl still does not default to UTF-8; this cost me 33 bytes.

Here’s my (currently pathetic) test file:

한
안녕하세요
나랏말싸미 듕귁에달아

Notes

I know nothing about the Korean language (except what little I have learned here). This is a first attempt, pending potential revision due to updates in the question specification.

On that note, some additional information is useful. In particular, there is not a 1:1 correspondence between lead and tail consonants, as the challenge seems to suggest. The following two sites helped immensely in figuring that out:
Wikipedia: Korean language, Hangul
Wikipedia: Hangul Jamo (Unicode block)

Dúthomhas

Posted 2018-02-10T05:14:52.663

Reputation: 541