You're a Romanizer, Baby

38

5

Romanization of Japanese is converting Japanese text into Latin characters. In this challenge, you will be given a string of Japanese characters as input and expected to convert them to the correct ASCII string.

What You'll Need To Know

The Japanese language has three writing systems: hiragana (the curvy one used for short words), katakana (the angle-y one used for sounds and words borrowed from other languages), and kanji (the dense characters originally from Chinese). In this challenge we will only worry about hiragana.

There are 46 characters in the hiragana syllabary. Each character represents a syllable. The characters are organized by first sound (consonant) and second sound (vowel). The columns in order are aiueo.

 : あいうえお
k: かきくけこ
s: さしすせそ
t: たちつてと
n: なにぬねの
h: はひふへほ
m: まみむめも
y: や ゆ よ
r: らりるれろ
w: わ   を
N: ん

(if you copy and paste this table note that I have used ideographic spaces U+3000 to space out y and w)

So, for instance, あとめ should produce an output of atome. The first character is a, the second is to, and the third is me.

Exceptions

Like any good language, Japanese has exceptions to its rules, and the hiragana table has several. These characters are pronounced slightly differently than their location in the table would imply:

し: shi, not si
ち: chi, not ti
つ: tsu, not tu
ふ: fu, not hu
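
A minimal, non-golfed Python sketch of the table lookup described so far. The names `ROWS`, `BASE` and `romanize_basic` are made up for illustration, and the sketch only covers the 46 basic characters plus the four exceptions above (no dakuten, no small characters).

```python
# Regular rows of the table: consonant -> the five kana in a-i-u-e-o order.
ROWS = {
    "": "あいうえお",
    "k": "かきくけこ",
    "s": "さしすせそ",
    "t": "たちつてと",
    "n": "なにぬねの",
    "h": "はひふへほ",
    "m": "まみむめも",
    "r": "らりるれろ",
}
VOWELS = "aiueo"

BASE = {}
for consonant, kana_row in ROWS.items():
    for kana, vowel in zip(kana_row, VOWELS):
        BASE[kana] = consonant + vowel

# Rows with gaps (y, w) and the lone n are easier to list explicitly.
BASE.update({"や": "ya", "ゆ": "yu", "よ": "yo", "わ": "wa", "を": "wo", "ん": "n"})

# The four irregular readings override the regular table entries.
BASE.update({"し": "shi", "ち": "chi", "つ": "tsu", "ふ": "fu"})


def romanize_basic(text):
    """Romanize a string made only of the 46 basic hiragana."""
    return "".join(BASE[ch] for ch in text)


print(romanize_basic("あとめ"))  # atome
```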

Dakuten ゛

The word 'dakuten' means 'muddy mark': the dakuten turns sounds into their voiced equivalents (usually); for example, か ka turns into か゛ ga. A full list of the changes:

k → g
s → z
t → d
h → b

The exceptions change too:

し゛: ji (or zhi), not zi
ち゛: ji, not di
つ゛: dzu, not du
(ふ゛ acts as you would expect; it is not an exception)

The handakuten is an additional character ゜ that applies to the h row. If placed after a character, it changes the character's sound to p rather than b.

Both the dakuten and handakuten are going to be given as individual characters. You will not need to deal with the precomposed forms or the combining characters.
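
A small Python sketch of the two marks, working on a syllable that has already been romanized. The function and table names are hypothetical, and the sketch assumes the `ji` spelling rather than `zhi`.

```python
# Consonant changes caused by the dakuten (U+309B).
VOICED = {"k": "g", "s": "z", "t": "d", "h": "b"}

# Irregular syllables need their own voiced forms; fu simply becomes bu.
VOICED_EXCEPTIONS = {"shi": "ji", "chi": "ji", "tsu": "dzu", "fu": "bu"}


def apply_dakuten(syllable):
    """Voice a romanized syllable from the k, s, t or h row."""
    if syllable in VOICED_EXCEPTIONS:
        return VOICED_EXCEPTIONS[syllable]
    return VOICED[syllable[0]] + syllable[1:]


def apply_handakuten(syllable):
    """Turn an h-row syllable into its p form (ha -> pa, fu -> pu)."""
    return "p" + syllable[1:]


print(apply_dakuten("ka"))     # ga
print(apply_handakuten("fu"))  # pu
```

Inputs outside the affected rows are undefined behavior in the challenge, so the sketch makes no attempt to handle them gracefully.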

Small Characters

Finally, there are small versions of some of the characters. They modify characters that come before or after them.

ゃゅょ

These are the small forms of ya, yu, and yo. They are only placed after sounds in the i-column; they remove the i and add their sound. So, きや turns into kiya; きゃ turns into kya.

If placed after chi or shi (or their dakuten-ed forms), the y is removed too. しゆ is shiyu; しゅ is shu.
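
As a rough Python sketch of that rule (the names `SMALL_Y` and `combine_small_y` are invented here, and the syllable is assumed to be already romanized and to end in i):

```python
SMALL_Y = {"ゃ": "ya", "ゅ": "yu", "ょ": "yo"}


def combine_small_y(syllable, small):
    """Merge an i-column syllable such as 'ki' or 'shi' with a small ya/yu/yo."""
    stem = syllable[:-1]            # drop the trailing i
    glide = SMALL_Y[small]
    if syllable in ("shi", "chi", "ji"):
        glide = glide[1:]           # sh, ch and j also absorb the y
    return stem + glide


print(combine_small_y("ki", "ゃ"))   # kya
print(combine_small_y("shi", "ゅ"))  # shu
```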

The last thing you'll have to deal with is the small tsu. っ doubles the consonant that comes after it, no matter what; it does nothing else. For instance, きた is kita; きった is kitta.
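
One possible way to handle the small tsu, sketched in Python. It assumes the text has already been split into romanized syllables, with the small tsu left in the list as a marker; `apply_small_tsu` is a hypothetical helper name.

```python
def apply_small_tsu(syllables):
    """Double the first letter of whatever syllable follows a small tsu."""
    out = []
    double_next = False
    for syl in syllables:
        if syl == "っ":
            double_next = True      # remember to double the next syllable
        elif double_next:
            out.append(syl[0] + syl)
            double_next = False
        else:
            out.append(syl)
    return out


print("".join(apply_small_tsu(["ki", "っ", "ta"])))  # kitta
```

Doubling only the first letter means a small tsu before shi gives sshi.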

Summary, Input, and Output

Your program must be able to transliterate: the 46 basic hiragana, their dakuten and handakuten forms, and their combinations with small characters.

Undefined behavior includes: small ya, yu, and yo not after a character with i, small tsu at the end of a string, dakuten on an unaffected character, handakuten on a character outside the h row, and anything else not mentioned in the above spec/introduction.

You may assume all inputs are valid and contain only the Japanese characters mentioned above.

Case does not matter in output; you may also replace r with l or a lone n with m. Output can have either one space between every syllable or no spaces at all.

This is code-golf: shortest code in bytes wins.

Test Cases

Many test cases for each individual part are given in the spec. Some additional cases:

ひらか゛な → hiragana

かたかな → katakana

た゛いき゛ゃくてんさいは゛ん → daigyakutensaiban

ふ゜ろく゛らみんく゛は゜す゛るこうと゛こ゛るふ → puroguramingupazurukoudogorufu

か゛んは゛って → ganbatte
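
For checking answers, here is a non-golfed reference sketch in Python that puts the rules of the spec together. It is only one possible reading of the spec: all names are made up, it assumes the `ji` spelling, it expects the dakuten and handakuten as the standalone characters U+309B and U+309C, and it does nothing sensible for inputs the spec calls undefined.

```python
ROWS = {"": "あいうえお", "k": "かきくけこ", "s": "さしすせそ", "t": "たちつてと",
        "n": "なにぬねの", "h": "はひふへほ", "m": "まみむめも", "r": "らりるれろ"}
BASE = {kana: c + v for c, row in ROWS.items() for kana, v in zip(row, "aiueo")}
BASE.update({"や": "ya", "ゆ": "yu", "よ": "yo", "わ": "wa", "を": "wo", "ん": "n",
             "し": "shi", "ち": "chi", "つ": "tsu", "ふ": "fu"})
VOICED = {"k": "g", "s": "z", "t": "d", "h": "b"}
VOICED_EXCEPTIONS = {"shi": "ji", "chi": "ji", "tsu": "dzu", "fu": "bu"}
SMALL_Y = {"ゃ": "ya", "ゅ": "yu", "ょ": "yo"}
DAKUTEN, HANDAKUTEN, SMALL_TSU = "\u309b", "\u309c", "っ"


def romanize(text):
    syllables = []
    for ch in text:
        if ch == DAKUTEN:
            prev = syllables[-1]
            if prev in VOICED_EXCEPTIONS:
                syllables[-1] = VOICED_EXCEPTIONS[prev]
            else:
                syllables[-1] = VOICED[prev[0]] + prev[1:]
        elif ch == HANDAKUTEN:
            syllables[-1] = "p" + syllables[-1][1:]
        elif ch in SMALL_Y:
            prev = syllables[-1]
            glide = SMALL_Y[ch]
            if prev in ("shi", "chi", "ji"):
                glide = glide[1:]           # sh, ch and j absorb the y
            syllables[-1] = prev[:-1] + glide
        elif ch == SMALL_TSU:
            syllables.append(ch)            # resolved in the second pass
        else:
            syllables.append(BASE[ch])

    # Second pass: a small tsu doubles the first letter of the next syllable.
    out = []
    double_next = False
    for syl in syllables:
        if syl == SMALL_TSU:
            double_next = True
            continue
        out.append(syl[0] + syl if double_next else syl)
        double_next = False
    return "".join(out)


print(romanize("ひらか\u309bな"))  # hiragana
print(romanize("きった"))          # kitta
```

The small tsu is handled in a second pass so that a dakuten on the following syllable is applied before its consonant is doubled.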

Notes

  • I do not know much Japanese besides what I've written here. Please let me know if I've made any mistakes.

  • I was originally planning to include katakana too (so my English transliteration test case could be slightly more accurate), but that would be too much for a code golf challenge.

  • The Unicode names include the transliteration of each character individually, but without the exceptions. This may or may not be helpful to you.

  • Thanks to squeamishossifrage for correcting two typos!

  • I'm sorry if this is too long; I attempted to fit most of the quirks of hiragana into the challenge but some things (like small vowel-only hiragana, changing n to m in front of some consonants, and the repetition mark) had to be cut to keep the challenge manageable.

  • I'm not at all sorry for the title. It's a masterpiece.

Deusovi

Posted 2015-11-29T21:16:39.747

Reputation: 1 420

What should be the output for きっった? – lirtosiast – 2015-11-29T21:57:25.557

@Thomas: That's an invalid input. Output can be whatever you want. – Deusovi – 2015-11-29T21:58:26.760

should っし be sshi or shshi? – lirtosiast – 2015-11-29T23:19:43.983

@Thomas: It should be sshi. – Deusovi – 2015-11-29T23:20:22.700

Romanization is converting any non-Latin text into Latin characters. – samgak – 2015-11-30T00:00:01.483

Should the programs support ーs? For example: こーどごるふ? – clap – 2015-11-30T01:32:22.677

@ConfusedMr_C: No need; AFAIK that's only (or at least mostly) used with katakana. – Deusovi – 2015-11-30T01:33:09.647

True enough, but you specified that katakana should be written as hiragana, so コードゴルフ should be こーどごるふ, correct? – clap – 2015-11-30T01:34:53.153

@Mr_C: Hm? I don't quite understand. I said your program only had to support hiragana; the program does not have to know the rules of Japanese besides the ones mentioned in the spec. – Deusovi – 2015-11-30T01:36:52.913

@Deusovi: Oh, okay. Just leaving out support for ー is okay then? – clap – 2015-11-30T01:39:17.577

@Mr_C: Yeah, you don't need to support ー at all. In my test case I specifically used こうと゛こ゛るふ rather than こーどごるふ because as far as I know using ー with hiragana is extremely uncommon. – Deusovi – 2015-11-30T01:40:56.930

"I'm not at all sorry for the title. It's a masterpiece." Downvoted – Fatalize – 2015-11-30T08:07:38.793

@Fatalize No need to bring your anti-Britney bias here. Even though I may personally be more of a J-Lo fan, I'm not gonna downvote an excellent puzzle over that. – semi-extrinsic – 2015-11-30T22:19:15.583

It's never zhi, while zi is standard for kunrei-shiki. – idrougge – 2017-08-03T10:17:04.860

Answers

7

Python 2, 638 bytes

import unicodedata
s=input()
k=[0x309B,0x309C,0x3063]
m=[0x3083,0x3085,0x3087]
e={0x3057:'shi',0x3061:'chi',0x3064:'tsu',0x3075:'fu'}
d={0x3057:'ji',0x3061:'ji',0x3064:'dzu'}
D=dict(zip('ksth','gzdb'))
f=lambda c:unicodedata.name(c).split()[-1].lower()if ord(c)not in e else e[ord(c)]
g=lambda c:d[c]if c in d else D[f(c)[0]]+f(c)[1:]
R=[]
r=[]
t=[]
i=0
while i<len(s):
 c=ord(s[i])
 if c==k[0]:R[-1]=g(s[i-1])
 elif c==k[1]:R[-1]='p'+R[-1][1:]
 elif c in m:R[-1]=R[-1][:-1];n=f(s[i]);R+=[n[1:]]if r[-1]in[0x3057,0x3061]else[n];r+=[c]
 elif c==k[2]:t+=[len(R)]
 else:R+=[f(s[i])];r+=[c]
 i+=1
for i in t:R[i]=R[i][0]+R[i]
print ''.join(R)

Takes input as unicode string.

Test it on Ideone

TFeld

Posted 2015-11-29T21:16:39.747

Reputation: 19 246

You can save a measly byte by changing print ''.join(R) to print''.join(R) – Zacharý – 2017-08-03T11:21:36.773

6

Python 2, 447 bytes

import unicodedata as u
r=str.replace
i=''.join('x'*('SM'in u.name(x)or ord(x)==12444)+u.name(x)[-2:].strip()for x in raw_input().decode('utf-8'))
for a,o in zip('KSTH','GZDB'):
    for b in'AEIOU':i=r(r(i,a+b+'xRK','P'+b),a+b+'RK',o+b)
for a,b,c,d in zip('STDZ',('SH','CH','J','J'),'TDHH',('TS','DZ','F','F')):i=r(r(i,a+'I',b+'I'),c+'U',d+'U')
for a in'CH','SH','J':i=r(i,a+'IxY',a)
for a in'BCDFGHJKMNPRSTWYZ':i=r(i,'xTSU'+a,a+a)
print r(i,'Ix','')

This takes Unicode input directly, which costs a few bytes because of the decode('utf-8'), but I think it is more in the spirit of the challenge.

I began by replacing every character by the last two characters of its unicode name, as suggested in the notes of the puzzle. Unfortunately, this doesn't distinguish between alternate versions of the same character, so I had to make an ugly hack to add an 'x' before the small characters and the handakuten.
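
To illustrate the idea, here is a short Python 3 sketch of that first step. The helper name `tag` is made up, and it tests for "SMALL" in the name rather than the shorter 'SM' used in the golfed code.

```python
import unicodedata as u


def tag(ch):
    """Last two letters of the Unicode name, with an 'x' prefix for
    small kana and for the handakuten (U+309C)."""
    name = u.name(ch)
    prefix = "x" if "SMALL" in name or ch == "\u309c" else ""
    return prefix + name[-2:].strip()


print([tag(c) for c in "きゃ"])               # ['KI', 'xYA']
print([tag(c) for c in "\u309b\u309c"])      # ['RK', 'xRK']
```

Both sound marks end in "MARK", which is why the dakuten shows up as RK and the handakuten needs the extra x to be told apart.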

The rest of the for loops are just fixing exceptions, in order:

  1. the first for loop turns dakuten and handakuten into the correct consonants;
  2. the second for loop deals with the hiragana exceptions shi, chi, tsu and fu;
  3. the third for loop deals with the exceptions before a small y- character (like sha, jo);
  4. the fourth for loop deals with doubling consonants after a small tsu;
  5. the final line deals with small y-.

I wish I could have combined more steps, but in some cases the steps have to be performed in order to avoid conflicts.

Try it online! (a multi-line version with more examples can be found here).

ffao

Posted 2015-11-29T21:16:39.747

Reputation: 161

TIO link – boboquack – 2017-08-03T08:03:55.273

Welcome to PPCG. Very nice first solution :) – Shaggy – 2017-08-03T10:26:21.433

Turn your four spaces in front of for b in'AEIOU' into a tab or a single space to save 3 bytes. You may also be able to use from unicodedata import* to save some bytes - not sure. – Stephen – 2017-08-03T15:03:08.443

4

Swift 3, 64 characters (originally 67)

Original version, 67 characters:

let r={(s:String) in s.applyingTransform(.toLatin, reverse: false)}

With the optional whitespace removed, 64 characters:

let r={(s:String)in s.applyingTransform(.toLatin,reverse:false)}

idrougge

Posted 2015-11-29T21:16:39.747

Reputation: 641

A builtin, really, Swift has a BUILTIN FOR THIS? – Zacharý – 2017-08-03T11:19:56.820

Don't know Swift at all, but can you chop the whitespaces after s:String) and .toLatin,? – Yytsi – 2017-08-03T11:23:19.343

@TuukkaX, well spotted! – idrougge – 2017-08-03T11:29:31.633

@Zacharý, well Foundation has. – idrougge – 2017-08-03T11:29:57.783

3

Python 3, 259 bytes

import re,unicodedata as u
s=re.sub
n=u.normalize
k,*r=r'NFKC DZU DU TSU TU \1\1 SM.{6}(.) \1 (CH|J|SH)Y \1 ISMALL.(Y.) CHI TI JI [ZD]I SHI SI FU HU'.split()
t=''.join(u.name(c)[16:]for c in n(k,s(' ','',n(k,input()))))
while r:t=s(r.pop(),r.pop(),t)
print(t)

Try it online!

Explanation

We’re in luck with this input format! Look what happens if I pass the input through NFKC normalization:

>>> nfkc = lambda x: u.normalize('NFKC', x)
>>> [u.name(c) for c in 'は゛']
['HIRAGANA LETTER HA', 'KATAKANA-HIRAGANA VOICED SOUND MARK']
>>> [u.name(c) for c in nfkc('は゛')]
['HIRAGANA LETTER HA', 'SPACE', 'COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK']

The dakuten gets replaced by a space and a combining dakuten. Now that space is all that’s separating the は from its dakuten. So we get rid of it and normalize again:

>>> [u.name(c) for c in nfkc(nfkc('は゛').replace(' ', ''))]
['HIRAGANA LETTER BA']

Bingo. The fifth line turns the input into something like

KONOSUBARASIISEKAINISISMALL YUKUHUKUWO

Then we apply 9 boring regex substitutions crammed in r, and we’re done:

KONOSUBARASHIISEKAINISHUKUFUKUWO

(Jonathan Frech saved 4 bytes, writing import re,unicodedata as u instead of import re;from unicodedata import*. Thanks!)
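
For readers who want to follow the substitutions one at a time, here is an ungolfed sketch of the same approach. The names `RULES` and `romanize_nfkc` are invented for this sketch; the rules are the nine pattern/replacement pairs from the golfed string, listed in the order the while loop pops them.

```python
import re
import unicodedata as u

# (pattern, replacement) pairs, in the order the golfed loop applies them.
RULES = [
    (r"HU", "FU"),
    (r"SI", "SHI"),
    (r"[ZD]I", "JI"),
    (r"TI", "CHI"),
    (r"ISMALL.(Y.)", r"\1"),   # small ya/yu/yo: drop the i
    (r"(CH|J|SH)Y", r"\1"),    # ch, j and sh also absorb the y
    (r"SM.{6}(.)", r"\1\1"),   # small tsu: double the next letter
    (r"TU", "TSU"),
    (r"DU", "DZU"),
]


def romanize_nfkc(text):
    def nfkc(x):
        return u.normalize("NFKC", x)

    # Normalize, drop the spaces left in front of the combining marks,
    # then normalize again so each kana composes with its mark.
    composed = nfkc(nfkc(text).replace(" ", ""))
    # Strip the 16-character prefix "HIRAGANA LETTER " from every name.
    t = "".join(u.name(c)[16:] for c in composed)
    for pattern, repl in RULES:
        t = re.sub(pattern, repl, t)
    return t


print(romanize_nfkc("きった"))  # KITTA
```

The order matters: the small tsu rule has to run before TU becomes TSU, otherwise the TU inside "SMALL TU" would be rewritten first and the pattern would no longer line up.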

Lynn

Posted 2015-11-29T21:16:39.747

Reputation: 55 648

Abusing normalisation for fun and profit. That's beautiful. – Tim Pederick – 2018-02-10T05:27:38.810

import re,unicodedata as u, as in Kirill L.'s answer to a related challenge, saves 4 bytes. – Jonathan Frech – 2018-02-19T12:05:40.127