North Korean dictionary order

9

1

The objective

Given a string of Hangul syllables, sort the characters in North Korean dictionary order.

Introduction to Hangul syllables

Hangul(한글) is the Korean writing system invented by Sejong the Great. Hangul syllables are allocated in Unicode point U+AC00 – U+D7A3. A Hangul syllable consists of an initial consonant, a vowel, and an optional final consonant.

The initial consonants are:

ㄱ ㄲ ㄴ ㄷ ㄸ ㄹ ㅁ ㅂ ㅃ ㅅ ㅆ ㅇ ㅈ ㅉ ㅊ ㅋ ㅌ ㅍ ㅎ

The vowels are:

ㅏ ㅐ ㅑ ㅒ ㅓ ㅔ ㅕ ㅖ ㅗ ㅘ ㅙ ㅚ ㅛ ㅜ ㅝ ㅞ ㅟ ㅠ ㅡ ㅢ ㅣ

The final consonants are:

(none) ㄱ ㄲ ㄳ ㄴ ㄵ ㄶ ㄷ ㄹ ㄺ ㄻ ㄼ ㄽ ㄾ ㄿ ㅀ ㅁ ㅂ ㅄ ㅅ ㅆ ㅇ ㅈ ㅊ ㅋ ㅌ ㅍ ㅎ

For example, has initial consonant , vowel , and final consonant .

South Korean dictionary order

The consonants and vowels above are sorted in South Korean dictionary order. The syllables are firstly sorted by initial consonants, secondly by vowels, and finally by (optional) final consonants.

The Unicode block for Hangul syllables contains every consonant/vowel combinations, and is entirely sorted in South Korean dictionary order.

The Unicode block can be seen here, and the first 256 characters are shown for illustrative purpose:

가각갂갃간갅갆갇갈갉갊갋갌갍갎갏감갑값갓갔강갖갗갘같갚갛개객갞갟갠갡갢갣갤갥갦갧갨갩갪갫갬갭갮갯갰갱갲갳갴갵갶갷갸갹갺갻갼갽갾갿걀걁걂걃걄걅걆걇걈걉걊걋걌걍걎걏걐걑걒걓걔걕걖걗걘걙걚걛걜걝걞걟걠걡걢걣걤걥걦걧걨걩걪걫걬걭걮걯거걱걲걳건걵걶걷걸걹걺걻걼걽걾걿검겁겂것겄겅겆겇겈겉겊겋게겍겎겏겐겑겒겓겔겕겖겗겘겙겚겛겜겝겞겟겠겡겢겣겤겥겦겧겨격겪겫견겭겮겯결겱겲겳겴겵겶겷겸겹겺겻겼경겾겿곀곁곂곃계곅곆곇곈곉곊곋곌곍곎곏곐곑곒곓곔곕곖곗곘곙곚곛곜곝곞곟고곡곢곣곤곥곦곧골곩곪곫곬곭곮곯곰곱곲곳곴공곶곷곸곹곺곻과곽곾곿

For example, the following sentence (without spaces and punctuations):

키스의고유조건은입술끼리만나야하고특별한기술은필요치않다

is sorted to:

건고고기끼나다리만별술술스않야요유은은의입조치키특필하한

In C++, if the string is in std::wstring, the sorting above is plain std::sort.

North Korean dictionary order

North Korean dictionary has different consonant/vowel order.

The initial consonants are sorted like:

ㄱ ㄴ ㄷ ㄹ ㅁ ㅂ ㅅ ㅈ ㅊ ㅋ ㅌ ㅍ ㅎ ㄲ ㄸ ㅃ ㅆ ㅉ ㅇ

The vowels are sorted like:

ㅏ ㅑ ㅓ ㅕ ㅗ ㅛ ㅜ ㅠ ㅡ ㅣ ㅐ ㅒ ㅔ ㅖ ㅚ ㅟ ㅢ ㅘ ㅝ ㅙ ㅞ

The final consonants are sorted like:

(none) ㄱ ㄳ ㄴ ㄵ ㄶ ㄷ ㄹ ㄺ ㄻ ㄼ ㄽ ㄾ ㄿ ㅀ ㅁ ㅂ ㅄ ㅅ ㅇ ㅈ ㅊ ㅋ ㅌ ㅍ ㅎ ㄲ ㅆ

Like South, the syllables are firstly sorted by initial consonants, secondly by vowels, and finally by (optional) final consonants.

If the sentence above is given, the output must be:

건고고기나다리만별술술스조치키특필하한끼않야요유은은입의

Rules

  1. If the input contains a character not within U+AC00 – U+D7A3, it falls in don't care situation.

  2. As this is a code-golf, the shortest code in bytes wins.

Dannyu NDos

Posted 2019-10-06T09:22:12.377

Reputation: 2 087

Partly related. – Arnauld – 2019-10-06T11:02:40.007

If that makes sense, I'd suggest to add a test case where the characters are sorted differently because of the final consonant exclusively (using ㄲ or ㅆ with the same initial consonant and the same vowel). – Arnauld – 2019-10-07T08:00:59.450

(More generally speaking, adding a few more test cases would be great.) – Arnauld – 2019-10-07T08:02:12.277

Suggested test cases: 가까나다따라마바빠사싸아자짜차카타파 (all initial consonants), 가개갸걔거게겨계고과괘괴교구궈궤귀규그긔기 (all vowels), 가각갂갃간갅갆갇갈갉갊갋갌갍갎갏감갑값갓갔강갖갗갘같갚갛 (all trailing consonants). – Grimmy – 2019-10-07T12:25:47.230

1Well, so much for that... 86 different Korean SQL collations; all of them sort in the "South Korean" manner. Nice (tough) question. – BradC – 2019-10-07T16:54:28.940

Answers

1

05AB1E, 47 45 38 bytes

Σ•¡®šúIтÝ„Š’#„λ†x!·“•4B33¡€.ā`ââyÇ68+è

Try it online!

Σ                        # sort characters of the input by:
 •...•                   #  compressed integer 13096252834522000035292405913882127943177557
      4B                 #  converted to base 4: 211211121231211111033010101010231002310010331121111111111111111121111111
        33¡              #  split on 33: [2112111212312111110, 010101010231002310010, 1121111111111111111121111111]
           €.ā           #  enumerate each (pairs each digit with its index)
              `ââ        #  reduce by cartesian product (yields a list with 11172 elements)
                 yÇ      #  codepoint of the current character
                   68+   #  + 68
                      è  #  index into the large list (with wraparound)

Grimmy

Posted 2019-10-06T09:22:12.377

Reputation: 12 521

7

JavaScript (ES6),  150 148  137 bytes

Saved 10 bytes thanks to @Grimy

I/O: arrays of characters.

a=>a.map(c=>"ANBCODEFPGQSHRIJKLM"[(n=c.charCodeAt()-44032)/588|0]+"AKBLCMDNERTOFGSUPHIQJ"[n/28%21|0]+~(n%28%18==2)+c).sort().map(s=>s[4])

Try it online!

Splitting Hangul syllables

Given a Hangul character of code point 0xAC00 + \$n\$, the initial consonant \$I\$, vowel \$V\$ and final consonant \$F\$ are given by:

$$I=\left\lfloor\frac{n}{588}\right\rfloor,\ V=\left\lfloor\frac{n}{28}\right\rfloor\bmod 21,\ F=n\bmod 28$$

Commented

a => a.map(c =>                  // for each character c in the input:
  "ANBCODEFPGQSHRIJKLM"[         //   start with a letter from 'A' to 'S'
    (n = c.charCodeAt() - 44032) //   for the initial consonant
    / 588 | 0                    //
  ] +                            //
  "AKBLCMDNERTOFGSUPHIQJ"[       //   append a letter from 'A' to 'U'
    n / 28 % 21 | 0              //   for the vowel
  ] +                            //
  ~(                             //   append "-2" for ㄲ or ㅆ (the only North
    n % 28 % 18 == 2             //   Korean final consonants that are sorted
  ) +                            //   differently) or "-1" otherwise
  c                              //   append the original character
)                                // end of map()
.sort()                          // sort in lexicographical order
.map(s => s[4])                  // isolate the original characters

Arnauld

Posted 2019-10-06T09:22:12.377

Reputation: 111 334

1

Charcoal, 80 bytes

F”&→∧⁶⍘⎚%γD¦ρJG”F”E⎇↓Nη⊙��⭆Ws@}4”F”E↖hY9 t⟧⊙γIO↶5ε∧¬⁶⦃”Φθ⁼℅μΣ⟦⁴⁴⁰³²×⌕βι⁵⁸⁸⍘⁺κλ²⁸

Try it online! Link is to verbose version of code. Explanation: Works by generating all 11172 Hangul syllables in North Korean dictionary order and checking to see which ones are present in the input (so all other characters get deleted; also somewhat slow: takes 18 seconds on TIO). Explanation:

F”&→∧⁶⍘⎚%γD¦ρJG”

Loop over the compressed string acdfghjmopqrsbeiknl. This represents the list of South Korean initial consonants (numbered using the Western lowercase alphabet) in North Korean dictionary order.

F”E⎇↓Nη⊙��⭆Ws@}4”

Loop over the compressed string 02468cdhik1357bgj9eaf. This represents the list of South Korean vowels (numbered using ASCII digits and lowercase alphabet) in North Korean dictionary order.

F”E↖hY9 t⟧⊙γIO↶5ε∧¬⁶⦃”

Loop over the compressed string 013456789abcdefghijlmnopqr2k. This represents the list of South Korean final consonants (using the same numbering as the vowels) in North Korean dictionary order.

Φθ⁼℅μΣ⟦⁴⁴⁰³²×⌕βι⁵⁸⁸⍘⁺κλ²⁸

Concatenate the vowel and final consonant and decode as a base 28 number, then add on 588 times the initial vowel and 0xAC00. Print all characters from the input that have that as their ordinal.

Neil

Posted 2019-10-06T09:22:12.377

Reputation: 95 035

Are the replacement characters valid syntax? – Dannyu NDos – 2019-10-14T21:56:52.050

@DannyuNDos It represents byte value \xFF in Charcoal's code page. – Neil – 2019-10-14T23:58:59.273