iconv generating UTF-16 with BOM

Inspired by this question, can I use the iconv command to generate UTF-16 output with a BOM and with specified endianness?

The iconv command converts text from one encoding to another.

For example:

echo hello | iconv -f ascii -t utf-16

generates a UTF-16 representation of "hello\n".

UTF-16 files often, but not always, start with a Byte Order Mark (BOM), which is a 2-byte encoding of the Unicode character U+FEFF. You can determine the endianness of a UTF-16 file with BOM by checking whether the first two bytes are FE FF or FF FE.
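As a rough illustration of that check, the sketch below reads the first two bytes of its input and reports the byte order they imply. The function name `bom_endianness` is made up for this example, and it assumes the input is already known to be UTF-16 (the bytes FF FE also begin a UTF-32LE BOM).

```shell
# Hypothetical helper: report the byte order implied by a UTF-16 BOM on stdin.
bom_endianness() {
    # od -An -tx1 -N 2 dumps the first two bytes as hex with no address column;
    # tr strips the spaces and newline so the result is e.g. "fffe".
    first2=$(od -An -tx1 -N 2 | tr -d ' \n')
    case "$first2" in
        feff) echo "big-endian" ;;
        fffe) echo "little-endian" ;;
        *)    echo "no BOM" ;;
    esac
}

# Hand-built samples: a BOM followed by the letter "h" in each byte order.
printf '\376\377\000h' | bom_endianness   # prints "big-endian"
printf '\377\376h\000' | bom_endianness   # prints "little-endian"
```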

The iconv command has several options for generating UTF-16 output:

$ iconv --list | grep -i utf-16
UTF-16//
UTF-16BE//
UTF-16LE//

This command:

echo hello | iconv -f ascii -t utf-16be

generates big-endian UTF-16 with no BOM; it seems to assume that if you specified the endianness, you don't need to indicate it in the output. Similarly, utf-16le generates little-endian UTF-16 with no BOM.

This:

echo hello | iconv -f ascii -t utf-16

generates (on my x86 Ubuntu system) little-endian UTF-16 with a BOM -- but I've seen a report of a similar command generating big-endian UTF-16 with a BOM, even on a little-endian system.

I can always use utf-16be or utf-16le and prepend the BOM manually, but I'm looking for a solution that just uses the iconv command.
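For reference, the manual approach might look like the sketch below. It is not the iconv-only solution being asked for: the BOM bytes are written with `printf` using octal escapes, and iconv only does the conversion. The function name `utf16_with_bom` and its argument convention are invented for this example.

```shell
# Hypothetical helper: utf16_with_bom be|le [from-encoding]
# Writes the BOM by hand, then lets iconv emit BOM-less UTF-16 in that order.
utf16_with_bom() {
    case "$1" in
        be) printf '\376\377' ;;   # U+FEFF encoded as FE FF
        le) printf '\377\376' ;;   # U+FEFF encoded as FF FE
        *)  echo "usage: utf16_with_bom be|le [from-encoding]" >&2
            return 2 ;;
    esac
    # The source encoding defaults to ascii, as in the examples above.
    iconv -f "${2:-ascii}" -t "utf-16$1"
}

echo hello | utf16_with_bom le | od -An -tx1
```

Because the byte order is named explicitly, the result does not depend on what a bare `-t utf-16` happens to produce on a given system.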

Another workaround, if you know what endianness -t utf-16 generates, is:

echo hello | iconv -f ascii -t utf-16 | dd conv=swab 2>/dev/null
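When the byte order produced by `-t utf-16` is not known in advance, the swap can be made conditional on the BOM that iconv actually wrote. The following is only a sketch under that assumption: `force_utf16` is a made-up name, and the output goes through a temporary file so that `dd` reads whole byte pairs.

```shell
# Hypothetical helper: force_utf16 be|le
# Reads ASCII on stdin, writes UTF-16 with a BOM in the requested byte order.
force_utf16() {
    tmp=$(mktemp) || return 1
    status=0
    iconv -f ascii -t utf-16 > "$tmp" || status=$?
    # First two bytes as hex, e.g. "fffe" for a little-endian BOM.
    bom=$(od -An -tx1 -N 2 "$tmp" | tr -d ' \n')
    if [ "$status" -eq 0 ]; then
        case "$1:$bom" in
            be:feff|le:fffe) cat "$tmp" ;;                         # already as requested
            be:fffe|le:feff) dd conv=swab < "$tmp" 2>/dev/null ;;  # swap each byte pair
            *) echo "force_utf16: no BOM found or bad argument" >&2
               status=2 ;;
        esac
    fi
    rm -f "$tmp"
    return "$status"
}

echo hello | force_utf16 be | od -An -tx1
```

Swapping every byte pair also swaps the BOM itself, so the output stays self-consistent.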

What I'd like to use is something like:

iconv -f ascii -t utf-16bebom # big-endian with BOM
iconv -f ascii -t utf-16lebom # little-endian with BOM

but iconv doesn't support that.

EDIT:

Can someone with access to an x86 Mac OS X system post a comment showing the (copy-and-pasted) output of the following command?

echo hello | iconv -f ascii -t utf-16 | od -x

Keith Thompson

Posted 2012-01-22T01:46:24.117

Reputation: 4 645

A BOM reduces the portability of the data but you can add it this way

– RedGrittyBrick – 2012-01-22T09:35:41.720

@RedGrittyBrick: How does it reduce portability (specifically for UTF-16)? I know I can generate the BOM explicitly; I'm looking for a way to do so just using iconv -- and wondering why -t utf-16 seems to leave the endianness unspecified. – Keith Thompson – 2012-01-22T09:43:05.077

I guess iconv assumes the current platform's byte ordering if you don't specify it explicitly. On some platforms other than Windows, some text processing tools don't expect BOMs and so do the wrong thing. An example might be when concatenating text files, or using file-based templates to construct content. "For the IANA registered charsets UTF-16BE and UTF-16LE, a byte order mark should not be used because the names of these character sets already determine the byte order" – RedGrittyBrick – 2012-01-22T12:17:46.730

This question shows iconv -f UTF-8 -t UTF-16, run on a little-endian system (MacOS), generating big-endian UTF-16 with a BOM, which seems very odd. – Keith Thompson – 2012-01-22T19:46:58.127

Answers

No, if you specify the byte ordering, iconv does not insert a BOM.

This is from the Unicode Consortium:

Q: How I should deal with BOMs?

A: Here are some guidelines to follow:

  1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
  2. Some protocols allow optional BOMs in the case of untagged text. In those cases,
    • Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
    • Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
  3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
  4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.

(my emphasis)

I expect iconv is attempting to be faithful to the last of these guidelines.


Update.

A digression

In my opinion:

  1. An option to specify a BOM would certainly be a useful additional feature for iconv.

  2. A UTF-16LE file without a BOM is usable in Windows, albeit with additional effort sometimes. For example Notepad's File Open dialogue allows you to select "Unicode" which is Microsoft's name for "UTF-16LE" and (unsurprisingly) seems to work on files without a BOM.

  3. I can open a UTF-16LE test file (without BOM) or a UTF-8 test file (without BOM) in Windows Notepad (XP) in the usual way, e.g. by double-clicking the file's name in Explorer. That seems usable to me. I am aware that Windows will sometimes guess the encoding incorrectly, in which case you have to tell Notepad the encoding when opening the file. This inconvenience means including a BOM is preferable for text files intended for use on Windows.

  4. If a specific application will not work with anything other than a UTF-16LE file with BOM, then I would agree that a UTF-16LE file without BOM is not usable for that specific application.

  5. I suspect that if you can make everything work with UTF-8 (without BOM), that is the best solution in the long term.

However, the answer to the question "can I use the iconv command to generate UTF-16 output with a BOM and with specified endianness" is currently "No".

RedGrittyBrick

Posted 2012-01-22T01:46:24.117

Reputation: 70 632

This answer helped me - helped me learn why I was screwed. The standard Windows program to export/import from the registry, C:\Windows\System32\reg.exe exports UTF-16 LE WITH BOM and will only read UTF-16 LE WITH BOM - will not read UTF-16 LE without BOM and will not read UTF-16 BE with BOM - in other words, it demands the BOM when reading but it damn well better be the right one! (Fortunately, it reads UTF-8.) – davidbak – 2016-04-29T21:17:35.160

And what about the first guideline, A.1? If I want to generate a Unicode text file that's usable on an x86 Windows system, it should be a little-endian UTF16 file with a BOM. – Keith Thompson – 2012-01-22T19:36:03.387

@KeithThompson: Systems should accept both UTF16LE and UTF16BE. At least Windows Notepad accepts both, when it comes to .txt's - as long as the file has a BOM. – user1686 – 2012-01-22T20:08:37.843

@KeithThompson: I agree that guideline 1 should take priority; however, iconv doesn't provide a way for you to specify a BOM. The answer to your original question is simply "No". – RedGrittyBrick – 2012-01-23T10:30:54.143

Not the answer I was hoping for, but an answer, and a thorough one! – Keith Thompson – 2012-01-23T16:54:12.080