How to convert this string to Japanese using GNU/Linux tools?


Here is a string from a text file:

@™TdaŽ®Æ‚êƒ~ƒNƒXƒgƒŒ[ƒgEƒrƒLƒjver1.11d1.d2iƒrƒLƒjƒ‚ƒfƒ‹ver.1.1³Ž®”z•z”Åj

It includes many nonprinting characters and is copied here: https://pastebin.com/TUG4agN4

Using https://2cyr.com/decode/?lang=en, we can confirm that it translates to the following:

 ☆Tda式照れミクストレート・ビキニver1.11d1.d2(ビキニモデルver.1.1正式配布版)

This is with source encoding = SJIS (shift-jis), displayed as Windows-1252.
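"Source encoding = SJIS, displayed as Windows-1252" can be reproduced locally in a few lines of Python. A sketch on a short substring only, since the full line contains nonprinting bytes that were lost in copy-paste:

```python
# "ƒ~ƒNƒX" is a run of Shift-JIS bytes mis-displayed as Windows-1252.
mojibake = "\u0192~\u0192N\u0192X"     # "ƒ~ƒNƒX", substring of the line above
raw = mojibake.encode("windows-1252")  # undo the mis-display: original bytes
print(raw)                             # b'\x83~\x83N\x83X'
print(raw.decode("shift-jis"))         # ミクス (part of ミクストレート)
```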

But how can we obtain the same result without a website? The relevant tool is iconv, but something in the tool chain is broken. If I cat the source text file, or feed it as standard input with '<' in bash, one of the iconv invocations in the chain quickly errors out. If I instead copy the string above from the text editor gedit (which reads the file as UTF-16LE), or from the output of an iconv UTF-16-to-UTF-8 conversion, the result is close but still wrong:

@儺da式ニれミクストレ[トEビキニver1.11d1.d2iビキニモデルver.1.1ウ式配布版j

Some evidence of the tool chain failing:

$ cat 'utf8.txt' |head -1

@™TdaŽ®Æ‚êƒ~ƒNƒXƒgƒŒ[ƒgEƒrƒLƒjver1.11d1.d2iƒrƒLƒjƒ‚ƒfƒ‹ver.1.1³Ž®”z•z”Å

$ cat 'utf8.txt' |head -1| iconv -f utf8 -t utf16

���@�"!Tda}��� ��~�N�X�g�R�[�g�E�r�L�jver1.11d1.d2�i�r�L�j� �f�9 ver.1.1��}� z" z ��j

Note three invalid characters at start.

$ cat 'utf8.txt' |head -1| iconv -f utf8 -t utf16|iconv -f utf16 -t windows-1252

iconv: illegal input sequence at position 2

$ echo "@™TdaŽ®Æ‚êƒ~ƒNƒXƒgƒŒ[ƒgEƒrƒLƒjver1.11d1.d2iƒrƒLƒjƒ‚ƒfƒ‹ver.1.1³Ž®”z•z”Åj"| iconv -f utf8 -t utf16

��@"!Tda}�� ��~�N�X�g�R[�gE�r�L�jver1.11d1.d2i�r�L�j� �f�9 ver.1.1�}� z" z �j

Note the two invalid characters at the start, and other differences. The sequence copied from the terminal matches the string displayed in the text editor (confirmed by Find (Ctrl-F) matching it), and it is the same string that gives the correct result on 2cyr.com.

Extending the last command above with '|iconv -f utf16 -t windows-1252|iconv -f shift-jis -t utf8' gives the close, but incorrect result quoted above, instead of erroring out as the direct chain does.

When I made a file whose name was the example string and ran the tool convmv on it, convmv said the output filename contained "characters, which are not POSIX filesystem conform! This may result in data loss." Most filenames that are invalid UTF-8 don't trigger this warning.

Is there any bit sequence that piping in bash can't handle? If not, why is the tool chain not working?

Apparently the difference is because bash won't paste nonprinting characters (the boxes with numbers) into the command line; maybe readline can't handle them? But the result being close suggests the conversion order in the tool chain is correct, so why isn't it working?

The original file, with its filename scrambled in a different way (expires after 30 days): https://ufile.io/oorcq

Misaki

Posted 2018-03-30T11:11:22.930

Reputation: 33

Answers


Pipes are an OS feature that works with byte buffers and does not interpret their contents in any way. So piped text never goes through bash at all, and in particular never through readline. Text pasted as a command-line argument does. (And yes, both readline and the terminal may filter out control characters as a security measure.)
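This also answers the "is there any bit sequence that piping in bash can't handle?" question directly. A quick sketch, using cat as a trivial pipe stage:

```python
import subprocess

# A pipe is just a byte buffer: push all 256 possible byte values through
# `cat` and verify they come back untouched (no shell/readline in between).
data = bytes(range(256))
result = subprocess.run(["cat"], input=data, capture_output=True)
assert result.stdout == data  # every bit sequence survives a pipe
```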

Your file is actually a mix of two encodings, windows-1252 and iso8859-1, due to the different ways they use the C1 control character block (0x80..0x9F).

  • ISO 8859-1 uses this entire range for control characters, and bytes 0x80..0x9F correspond to Unicode codepoints U+0080..U+009F.
  • Windows-1252 cannot represent C1 control characters; it uses most of this range for printable characters and has a few "holes" – i.e. byte values which have nothing assigned (0x81, 0x8D, 0x8F, 0x90, 0x9D).
  • The two encodings are otherwise identical in 0x00..0x7F and 0xA0..0xFF ranges.
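The difference in the C1 range is easy to check directly. A small sketch enumerating the Windows-1252 "holes" (Python's windows-1252 codec raises on unassigned bytes):

```python
holes = []
for b in range(0x80, 0xA0):
    raw = bytes([b])
    # ISO 8859-1 maps every byte 1:1 to U+0000..U+00FF, so this never fails:
    assert raw.decode("iso8859-1") == chr(b)
    try:
        raw.decode("windows-1252")
    except UnicodeDecodeError:
        holes.append(b)          # byte has no assignment in Windows-1252
print([hex(b) for b in holes])   # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```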

Let's take the first line of your "bad" input file, decoded from UTF-16 to Unicode text and with nonprintable characters escaped:

\u0081@\u0081™TdaŽ®\u008FÆ‚êƒ~ƒNƒXƒgƒŒ\u0081[ƒg\u0081EƒrƒLƒjver1.11d1.d2\u0081iƒrƒLƒjƒ‚ƒfƒ‹ver.1.1\u0090³Ž®”z•z”Å\u0081j\n
  • You can see \u0081 (U+0081), which maps to byte 0x81 in ISO 8859-1 but cannot be encoded in Windows-1252.
  • You can also see the symbol ƒ (U+0192), which maps to 0x83 in Windows-1252 but does not exist at all in ISO 8859-1.
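Both observations can be verified with a couple of encode calls (byte values per the mappings described above):

```python
# U+0081 exists in ISO 8859-1 (as a C1 control) but not in Windows-1252:
assert "\u0081".encode("iso8859-1") == b"\x81"
try:
    "\u0081".encode("windows-1252")
except UnicodeEncodeError:
    print("U+0081 has no Windows-1252 encoding")

# ƒ (U+0192) exists in Windows-1252 but not in ISO 8859-1:
assert "\u0192".encode("windows-1252") == b"\x83"
try:
    "\u0192".encode("iso8859-1")
except UnicodeEncodeError:
    print("U+0192 has no ISO 8859-1 encoding")
```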

So the trick is to use Windows-1252 when possible and ISO 8859-1 as the fallback, deciding individually for each codepoint. (libiconv could do this via 'ICONV_SET_FALLBACKS', but the CLI iconv tool cannot.) It is easy to write your own tool:

#!/usr/bin/env python3
import sys

# Decode UTF-16 from stdin, then re-encode each code point as
# Windows-1252 where possible, falling back to ISO 8859-1.
# The resulting byte stream is Shift-JIS.
for rune in sys.stdin.buffer.read().decode("utf-16"):
    try:
        encoded = rune.encode("windows-1252")
    except UnicodeEncodeError:
        encoded = rune.encode("iso8859-1")
    sys.stdout.buffer.write(encoded)
Note that only half of your input file is mis-encoded Shift-JIS. The other half (English) is perfectly fine UTF-16; fortunately Shift-JIS will pass it through so no manual splitting is needed:

#!/usr/bin/env python3
with open("éΦé╟é▌üEé╓é╚é┐éσé▒éªéΦé⌐.txt", "r", encoding="utf-16") as infd:
    with open("りどみ・へなちょこえりか.txt", "w", encoding="utf-8") as outfd:
        buf = b""
        for rune in infd.read():
            try:
                buf += rune.encode("windows-1252")
            except UnicodeEncodeError:
                try:
                    buf += rune.encode("iso8859-1")
                except UnicodeEncodeError:
                    buf += rune.encode("shift-jis")
        outfd.write(buf.decode("shift-jis"))
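The pass-through claim above is easy to sanity-check: Shift-JIS agrees with ASCII in the 0x00..0x7F range, so ASCII-only text survives the Windows-1252 encode / Shift-JIS decode round trip unchanged. A minimal sketch (the sample text is made up):

```python
english = "readme v1.1 (official distribution)"
# Encoding ASCII-only text as Windows-1252 and then decoding it as
# Shift-JIS is a no-op, because both encodings agree with ASCII here:
roundtrip = english.encode("windows-1252").decode("shift-jis")
assert roundtrip == english
```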

user1686

Posted 2018-03-30T11:11:22.930

Reputation: 283 655

This is a good solution that answers the question of how to retrieve the original text. My questions are these: – Misaki – 2018-03-30T20:33:52.220


    1) Is there a way to read the original file that doesn't involve a fallback to a second encoding? My assumption that UTF-16 is involved is partly because I tried to open it as other encodings in gedit and they all failed. 2) Does this method of reading and converting one character/"rune" at a time always work? Could 2-byte characters be improperly decoded as 3-byte or 1-byte characters, resulting in a 'rune' with too much or too little information? – Misaki – 2018-03-30T20:42:00.057

    3) Is 2cyr.com forced to use the same fallback? The string is sent to it as UTF-8 as I understand it, and when selecting the decoding settings there's no mention of either UTF-16 or ISO 8859-1. It seems simple enough to test pairs of encodings, like SJIS+Windows-1252, but detecting that UTF-16 is also involved is an increase in complexity, and my understanding is poor enough that I'm not entirely sure this must be done. – Misaki – 2018-03-30T20:52:59.240

    Some of these comments might be extraneous and could be deleted. I don't think it's a coincidence that the missing symbol in Windows-1252, 0x81, is U+0081. I think the text editor that originally read the SJIS file as Windows-1252 saw 0x81, was unable to convert it, and then just passed it on. 2cyr then did a similar thing when converting from Unicode (any type) to Windows-1252. <del>I'm guessing U+0081 is not actually</del> ok, it is 0x0081 in UTF-16. So instead of the fallback being a second encoding, it would be the raw bit sequence. Maybe sub-255 assumed to be clean by programs. – Misaki – 2018-03-30T21:51:32.150

    Or, since U+0081 in UTF8 is 0xC2 0x81, the fallback bit sequence would be the Unicode codepoint. – Misaki – 2018-03-30T22:27:17.547

    @Misaki: 1) Yes, UTF-16 is involved (your file is 100% UTF-16), but even after UTF-16 decoding, the first half contains nonsensical data and this conversion is unavoidable. 2) It works as shown – every Unicode rune / code point will map to something useful; in your input file, 100% of them can be mapped to a single byte each. But you're also correct that it won't map to a whole Shift-JIS sequence, which is why my example waits until the end to finally decode the complete buffer as Shift-JIS. Immediately using rune.encode("windows-1252").decode("shift-jis") would very quickly fail. – user1686 – 2018-03-31T09:46:58.680

    @Misaki: 3) I'd assume it does. "If it fails, try ISO 8859-1" is a fairly common approach. And UTF-16 is no longer involved when you submit the text to 2cyr.com – your text editor has already decoded UTF-16 for you. The browser encodes the submitted text to UTF-8 and the server decodes it, but that's a transparent detail. – user1686 – 2018-03-31T09:47:04.810

    @Misaki: As for how the file was originally created, "saw 0x81, was unable to convert it, and then just passed it on" – this could be true, but it could also be interpreted as fallback to ISO 8859-1, where 0x81 is indeed mapped to U+0081. (Like I said, this type of fallback is very common...) – user1686 – 2018-03-31T09:48:34.590