How to convert this string to Japanese using GNU/Linux tools?


Here is a string from a text file:

@™TdaŽ®Æ‚êƒ~ƒNƒXƒgƒŒ[ƒgEƒrƒLƒjver1.11d1.d2iƒrƒLƒjƒ‚ƒfƒ‹ver.1.1³Ž®”z•z”Åj

It includes many nonprinting characters and is copied here: https://pastebin.com/TUG4agN4

Using https://2cyr.com/decode/?lang=en, we can confirm that it translates to the following:

 ☆Tda式照れミクストレート・ビキニver1.11d1.d2(ビキニモデルver.1.1正式配布版)

This is with source encoding = SJIS (shift-jis), displayed as Windows-1252.
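"Source encoding = SJIS, displayed as Windows-1252" can be reproduced locally in a few lines of Python. A sketch on a short substring only, since the full line contains nonprinting bytes that were lost in copy-paste:

```python
# "ƒ~ƒNƒX" is a run of Shift-JIS bytes mis-displayed as Windows-1252.
mojibake = "\u0192~\u0192N\u0192X"     # "ƒ~ƒNƒX", substring of the line above
raw = mojibake.encode("windows-1252")  # undo the mis-display: original bytes
print(raw)                             # b'\x83~\x83N\x83X'
print(raw.decode("shift-jis"))         # ミクス (part of ミクストレート)
```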

But how can we obtain the same result without a website? The relevant tool is iconv, but something in the tool chain is broken. If I cat the source text file, or feed it as standard input with '<' in bash, one of the iconv invocations in the chain quickly errors out. If I instead copy the string above from the text editor gedit (which reads the file as UTF-16LE), or from the output of an iconv UTF-16-to-UTF-8 conversion, the result is close but still wrong:

@儺da式ニれミクストレ[トEビキニver1.11d1.d2iビキニモデルver.1.1ウ式配布版j

Some evidence of the tool chain failing:

$ cat 'utf8.txt' |head -1

@™TdaŽ®Æ‚êƒ~ƒNƒXƒgƒŒ[ƒgEƒrƒLƒjver1.11d1.d2iƒrƒLƒjƒ‚ƒfƒ‹ver.1.1³Ž®”z•z”Å

$ cat 'utf8.txt' |head -1| iconv -f utf8 -t utf16

���@�"!Tda}��� ��~�N�X�g�R�[�g�E�r�L�jver1.11d1.d2�i�r�L�j� �f�9 ver.1.1��}� z" z ��j

Note three invalid characters at start.

$ cat 'utf8.txt' |head -1| iconv -f utf8 -t utf16|iconv -f utf16 -t windows-1252

iconv: illegal input sequence at position 2

$ echo "@™TdaŽ®Æ‚êƒ~ƒNƒXƒgƒŒ[ƒgEƒrƒLƒjver1.11d1.d2iƒrƒLƒjƒ‚ƒfƒ‹ver.1.1³Ž®”z•z”Åj"| iconv -f utf8 -t utf16

��@"!Tda}�� ��~�N�X�g�R[�gE�r�L�jver1.11d1.d2i�r�L�j� �f�9 ver.1.1�}� z" z �j

Note the two invalid characters at the start, and other differences. The sequence copied from the terminal matches the string displayed in the text editor (confirmed by Find (Ctrl-F) matching it), and it is the same string that gives the correct result on 2cyr.com.

Extending the last command above with '|iconv -f utf16 -t windows-1252|iconv -f shift-jis -t utf8' gives the close, but incorrect result quoted above, instead of erroring out as the direct chain does.

When I made a file whose name was the example string and ran the tool convmv on it, convmv said the output filename contained "characters, which are not POSIX filesystem conform! This may result in data loss." Most filenames that are invalid UTF-8 don't trigger this warning.

Is there any bit sequence that piping in bash can't handle? If not, why is the tool chain not working?

Apparently the difference is because bash won't paste nonprinting characters (the boxes with numbers) into the command line; maybe readline can't handle them? But the result being close suggests the conversion order in the tool chain is correct, so why isn't it working?

The original file, with its filename scrambled in a different way (expires after 30 days): https://ufile.io/oorcq

Misaki

Posted 2018-03-30T11:11:22.930

Reputation: 33

Answers


Pipes are an OS feature that works with byte buffers and does not interpret their contents in any way. So piped text never goes through bash at all, and in particular never through readline. Text pasted as a command-line argument does. (And yes, both readline and the terminal may filter out control characters as a security measure.)
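This also answers the "is there any bit sequence that piping in bash can't handle?" question directly. A quick sketch, using cat as a trivial pipe stage:

```python
import subprocess

# A pipe is just a byte buffer: push all 256 possible byte values through
# `cat` and verify they come back untouched (no shell/readline in between).
data = bytes(range(256))
result = subprocess.run(["cat"], input=data, capture_output=True)
assert result.stdout == data  # every bit sequence survives a pipe
```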

Your file is actually a mix of two encodings, windows-1252 and iso8859-1, due to the different ways they use the C1 control character block (0x80..0x9F).

  • ISO 8859-1 uses this entire range for control characters, and bytes 0x80..0x9F correspond to Unicode codepoints U+0080..U+009F.
  • Windows-1252 cannot represent C1 control characters; it uses most of this range for printable characters and has a few "holes" – i.e. byte values which have nothing assigned (0x81, 0x8D, 0x8F, 0x90, 0x9D).
  • The two encodings are otherwise identical in 0x00..0x7F and 0xA0..0xFF ranges.
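The difference in the C1 range is easy to check directly. A small sketch enumerating the Windows-1252 "holes" (Python's windows-1252 codec raises on unassigned bytes):

```python
holes = []
for b in range(0x80, 0xA0):
    raw = bytes([b])
    # ISO 8859-1 maps every byte 1:1 to U+0000..U+00FF, so this never fails:
    assert raw.decode("iso8859-1") == chr(b)
    try:
        raw.decode("windows-1252")
    except UnicodeDecodeError:
        holes.append(b)          # byte has no assignment in Windows-1252
print([hex(b) for b in holes])   # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```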

Let's take the first line of your "bad" input file, decoded from UTF-16 to Unicode text and with nonprintable characters escaped:

\u0081@\u0081™TdaŽ®\u008FÆ‚êƒ~ƒNƒXƒgƒŒ\u0081[ƒg\u0081EƒrƒLƒjver1.11d1.d2\u0081iƒrƒLƒjƒ‚ƒfƒ‹ver.1.1\u0090³Ž®”z•z”Å\u0081j\n
  • You can see \u0081 (U+0081), which maps to byte 0x81 in ISO 8859-1 but cannot be encoded in Windows-1252.
  • You can also see the symbol ƒ (U+0192), which maps to 0x83 in Windows-1252 but does not exist at all in ISO 8859-1.
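Both observations can be verified with a couple of encode calls (byte values per the mappings described above):

```python
# U+0081 exists in ISO 8859-1 (as a C1 control) but not in Windows-1252:
assert "\u0081".encode("iso8859-1") == b"\x81"
try:
    "\u0081".encode("windows-1252")
except UnicodeEncodeError:
    print("U+0081 has no Windows-1252 encoding")

# ƒ (U+0192) exists in Windows-1252 but not in ISO 8859-1:
assert "\u0192".encode("windows-1252") == b"\x83"
try:
    "\u0192".encode("iso8859-1")
except UnicodeEncodeError:
    print("U+0192 has no ISO 8859-1 encoding")
```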

So the trick is to use Windows-1252 when possible and ISO 8859-1 as the fallback, deciding individually for each codepoint. (libiconv could do this via 'ICONV_SET_FALLBACKS', but the CLI iconv tool cannot.) It is easy to write your own tool:

#!/usr/bin/env python3
import sys

# Decode UTF-16 from stdin, then re-encode each code point as
# Windows-1252 where possible, falling back to ISO 8859-1.
# The resulting byte stream is Shift-JIS.
for rune in sys.stdin.buffer.read().decode("utf-16"):
    try:
        encoded = rune.encode("windows-1252")
    except UnicodeEncodeError:
        encoded = rune.encode("iso8859-1")
    sys.stdout.buffer.write(encoded)
Note that only half of your input file is mis-encoded Shift-JIS. The other half (English) is perfectly fine UTF-16; fortunately Shift-JIS will pass it through so no manual splitting is needed:

#!/usr/bin/env python3
with open("éΦé╟é▌üEé╓é╚é┐éσé▒éªéΦé⌐.txt", "r", encoding="utf-16") as infd:
    with open("りどみ・へなちょこえりか.txt", "w", encoding="utf-8") as outfd:
        buf = b""
        for rune in infd.read():
            try:
                buf += rune.encode("windows-1252")
            except UnicodeEncodeError:
                try:
                    buf += rune.encode("iso8859-1")
                except UnicodeEncodeError:
                    buf += rune.encode("shift-jis")
        outfd.write(buf.decode("shift-jis"))
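The pass-through claim above is easy to sanity-check: Shift-JIS agrees with ASCII in the 0x00..0x7F range, so ASCII-only text survives the Windows-1252 encode / Shift-JIS decode round trip unchanged. A minimal sketch (the sample text is made up):

```python
english = "readme v1.1 (official distribution)"
# Encoding ASCII-only text as Windows-1252 and then decoding it as
# Shift-JIS is a no-op, because both encodings agree with ASCII here:
roundtrip = english.encode("windows-1252").decode("shift-jis")
assert roundtrip == english
```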

user1686

Posted 2018-03-30T11:11:22.930

Reputation: 283 655

This is a good solution that answers the question of how to retrieve the original text. My questions are these: – Misaki – 2018-03-30T20:33:52.220


    1) Is there a way to read the original file that doesn't involve a fallback to a second encoding? My assumption that UTF-16 is involved is partly because I tried to open it as other encodings in gedit and they all failed. 2) Does this method of reading and converting one character/"rune" at a time always work? Could 2-byte characters be improperly decoded as 3-byte or 1-byte characters, resulting in a 'rune' with too much or too little information? – Misaki – 2018-03-30T20:42:00.057

    3) Is 2cyr.com forced to use the same fallback? The string is sent to it as UTF-8 as I understand it, and when selecting the decoding settings there's no mention of either UTF-16 or ISO 8859-1. It seems simple enough to test pairs of encodings, like SJIS+Windows-1252, but detecting that UTF-16 is also involved is an increase in complexity, and my understanding is poor enough that I'm not entirely sure this must be done. – Misaki – 2018-03-30T20:52:59.240

    Some of these comments might be extraneous and could be deleted. I don't think it's a coincidence that the missing symbol in Windows-1252, 0x81, is U+0081. I think the text editor that originally read the SJIS file as Windows-1252 saw 0x81, was unable to convert it, and then just passed it on. 2cyr then did a similar thing when converting from Unicode (any type) to Windows-1252. <del>I'm guessing U+0081 is not actually</del> ok, it is 0x0081 in UTF-16. So instead of the fallback being a second encoding, it would be the raw bit sequence. Maybe sub-255 assumed to be clean by programs. – Misaki – 2018-03-30T21:51:32.150

    Or, since U+0081 in UTF8 is 0xC2 0x81, the fallback bit sequence would be the Unicode codepoint. – Misaki – 2018-03-30T22:27:17.547

    @Misaki: 1) Yes, UTF-16 is involved (your file is 100% UTF-16), but even after UTF-16 decoding, the first half contains nonsensical data and this conversion is unavoidable. 2) It works as shown – every Unicode rune / code point will map to something useful; in your input file, 100% of them can be mapped to a single byte each. But you're also correct that it won't map to a whole Shift-JIS sequence, which is why my example waits until the end to finally decode the complete buffer as Shift-JIS. Immediately using rune.encode("windows-1252").decode("shift-jis") would very quickly fail. – user1686 – 2018-03-31T09:46:58.680

    @Misaki: 3) I'd assume it does. "If it fails, try ISO 8859-1" is a fairly common approach. And UTF-16 is no longer involved when you submit the text to 2cyr.com – your text editor has already decoded UTF-16 for you. The browser encodes the submitted text to UTF-8 and the server decodes it, but that's a transparent detail. – user1686 – 2018-03-31T09:47:04.810

    @Misaki: As for how the file was originally created, "saw 0x81, was unable to convert it, and then just passed it on" – this could be true, but it could also be interpreted as fallback to ISO 8859-1, where 0x81 is indeed mapped to U+0081. (Like I said, this type of fallback is very common...) – user1686 – 2018-03-31T09:48:34.590