Inspired by this question, can I use the iconv command to generate UTF-16 output with a BOM and with specified endianness?
The iconv command converts text from one encoding to another.
For example:
echo hello | iconv -f ascii -t utf-16
generates a UTF-16 representation of "hello\n".
UTF-16 files often, but not always, start with a Byte Order Mark (BOM), which is a 2-byte encoding of the Unicode character U+FEFF. You can determine the endianness of a UTF-16 file with a BOM by checking whether the first two bytes are FE FF (big-endian) or FF FE (little-endian).
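To make the BOM check concrete, here is a minimal sketch that reads the first two bytes of standard input and reports the byte order they indicate. The function name bom_endianness is made up for illustration, and the sketch assumes the input is already known to be UTF-16 (a UTF-32 little-endian BOM also starts with FF FE).

```shell
# bom_endianness is a hypothetical helper name, not part of iconv.
# It assumes the data on stdin is UTF-16.
bom_endianness() {
    # Dump only the first two bytes as hex and strip spacing, e.g. "fffe".
    bom=$(od -An -tx1 -N2 | tr -d ' \n')
    case "$bom" in
        feff) echo "big-endian" ;;
        fffe) echo "little-endian" ;;
        *)    echo "no BOM" ;;
    esac
}

# Result depends on which byte order this platform's iconv picks.
echo hello | iconv -f ascii -t utf-16 | bom_endianness
# utf-16be output starts with 00 68, so no BOM is reported.
echo hello | iconv -f ascii -t utf-16be | bom_endianness
```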
The iconv command has several options for generating UTF-16 output:
$ iconv --list | grep -i utf-16
UTF-16//
UTF-16BE//
UTF-16LE//
This command:
echo hello | iconv -f ascii -t utf-16be
generates big-endian UTF-16 with no BOM; it seems to assume that if you specified the endianness, you don't need to indicate it in the output. Similarly, utf-16le generates little-endian UTF-16 with no BOM.
This:
echo hello | iconv -f ascii -t utf-16
generates (on my x86 Ubuntu system) little-endian UTF-16 with a BOM -- but I've seen a report of a similar command generating big-endian UTF-16 with a BOM, even on a little-endian system.
I can always use utf-16be or utf-16le and prepend the BOM manually, but I'm looking for a solution that just uses the iconv command.
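For reference, the manual approach might look like the sketch below. It is not the iconv-only solution being asked for. The function name utf16_with_bom and its arguments are made up for illustration, and the BOM bytes are written with octal escapes because not every printf understands \x.

```shell
# utf16_with_bom is a hypothetical helper, not an iconv feature.
# Usage: utf16_with_bom be|le [from-encoding]   (text on stdin)
utf16_with_bom() {
    case "$1" in
        be) printf '\376\377' ;;   # U+FEFF in big-endian order: FE FF
        le) printf '\377\376' ;;   # U+FEFF in little-endian order: FF FE
        *)  echo "usage: utf16_with_bom be|le [from-encoding]" >&2
            return 2 ;;
    esac
    # utf-16be and utf-16le write no BOM themselves, so the result has exactly one.
    iconv -f "${2:-ascii}" -t "utf-16$1"
}

# Show the resulting bytes in hex.
echo hello | utf16_with_bom be | od -An -tx1
echo hello | utf16_with_bom le | od -An -tx1
```

The first command's output should begin with fe ff and the second with ff fe, followed by the text in the matching byte order.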
Another workaround, if you know what endianness -t utf-16 generates, is:
echo hello | iconv -f ascii -t utf-16 | dd conv=swab 2>/dev/null
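If the endianness that -t utf-16 generates is not known in advance, one option is to probe it first and swap bytes only when needed. This is a rough sketch: the function name utf16_bom_via_swab is made up, and it assumes that iconv -t utf-16 writes a BOM on the system in question and that dd sees the data in even-sized chunks.

```shell
# utf16_bom_via_swab is a hypothetical helper name.
# Usage: utf16_bom_via_swab be|le   (ASCII text on stdin)
utf16_bom_via_swab() {
    case "$1" in
        be|le) want=$1 ;;
        *) echo "usage: utf16_bom_via_swab be|le" >&2; return 2 ;;
    esac
    # Probe which BOM this platform's iconv writes for plain "utf-16".
    probe=$(printf 'x' | iconv -f ascii -t utf-16 | od -An -tx1 -N2 | tr -d ' \n')
    case "$probe" in
        feff) have=be ;;
        fffe) have=le ;;
        *)    echo "iconv -t utf-16 wrote no BOM here" >&2; return 1 ;;
    esac
    if [ "$have" = "$want" ]; then
        iconv -f ascii -t utf-16
    else
        # Swapping every byte pair flips both the BOM and the text.
        iconv -f ascii -t utf-16 | dd conv=swab 2>/dev/null
    fi
}

echo hello | utf16_bom_via_swab be | od -An -tx1
```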
What I'd like to use is something like:
iconv -f ascii -t utf-16bebom # big-endian with BOM
iconv -f ascii -t utf-16lebom # little-endian with BOM
but iconv doesn't support that.
EDIT:
Can someone with access to an x86 Mac OS X system post a comment showing the (copy-and-pasted) output of the following command?
echo hello | iconv -f ascii -t utf-16 | od -x
A BOM reduces the portability of the data but you can add it this way – RedGrittyBrick – 2012-01-22T09:35:41.720
@RedGrittyBrick: How does it reduce portability (specifically for UTF-16)? I know I can generate the BOM explicitly; I'm looking for a way to do so just using iconv -- and wondering why -t utf-16 seems to leave the endianness unspecified. – Keith Thompson – 2012-01-22T09:43:05.077
I guess iconv assumes current platform byte-ordering if you don't specify it explicitly. On some platforms other than Windows, some text processing tools don't expect BOMs and so do the wrong thing. An example might be when concatenating text files, or using file-based templates to construct content. "For the IANA registered charsets UTF-16BE and UTF-16LE, a byte order mark should not be used because the names of these character sets already determine the byte order" – RedGrittyBrick – 2012-01-22T12:17:46.730
This question shows iconv -f UTF-8 -t UTF-16, run on a little-endian system (MacOS), generating big-endian UTF-16 with a BOM, which seems very odd. – Keith Thompson – 2012-01-22T19:46:58.127