The issue is that GNU tr, which you have on Linux, doesn't really have a concept of multibyte characters, but instead works one byte at a time. The tr man page and online documentation speak of characters, but that's a bit of a simplification. The TODO file in the source code package mentions this item (picked from coreutils 8.30):
Adapt tools like wc, tr, fmt, etc. (most of the textutils) to be
multibyte aware. The problem is that I want to avoid duplicating
significant blocks of logic, yet I also want to incur only minimal
(preferably 'no') cost when operating in single-byte mode.
On a Linux system, even with a UTF-8 locale (en_US.UTF-8), GNU tr treats an ä as two "characters" and replaces each of them (the UTF-8 representation of ä has two bytes):
linux$ echo 'ä' | tr 'ä' 'x'
xx
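To see where the two x's come from, it can help to look at the raw bytes. This is only a sketch: ä is written as the octal escapes \303\244 so the result doesn't depend on the terminal encoding, LC_ALL=C is set to force byte-oriented behavior on any tr, and the variable names are purely illustrative.

```shell
# ä in UTF-8 is the two bytes c3 a4, written here as octal escapes for printf
a_umlaut_bytes=$(printf '\303\244' | od -An -tx1 | tr -d ' \n')
echo "$a_umlaut_bytes"    # c3a4

# In byte mode, set1 holds two bytes and set2 only one, so both bytes map to x
replaced=$(printf '\303\244\n' | LC_ALL=C tr '\303\244' 'x')
echo "$replaced"          # xx
```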
In the same vein, mixing an ä and an ö produces funny results, since their UTF-8 representations share a common byte:
linux$ echo 'ö' | tr ä x
x�
Or the other way around (the x doesn't apply here, since the two bytes of ä already pair up with a and b):
linux$ echo ab | tr ab äx
ä
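The two mixed cases above are easier to follow as hex dumps. A rough sketch, again forcing byte mode with LC_ALL=C and spelling the non-ASCII characters as octal escapes (ä is \303\244, ö is \303\266); the variable names are made up for the example.

```shell
# ö is c3 b6; set1 is c3 a4, so only the shared lead byte c3 gets replaced
mixed=$(printf '\303\266' | LC_ALL=C tr '\303\244' 'x' | od -An -tx1 | tr -d ' \n')
echo "$mixed"      # 78b6 (an x followed by a stray continuation byte)

# a -> c3 and b -> a4; the third element of set2 (x) has no partner in set1
reversed=$(printf 'ab' | LC_ALL=C tr 'ab' '\303\244x' | od -An -tx1 | tr -d ' \n')
echo "$reversed"   # c3a4 (which a UTF-8 terminal shows as ä)
```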
And in your case, GNU tr takes the \377 as a raw byte value.
The tr on Mac is different: it knows the concept of multibyte characters and acts accordingly:
mac$ echo 'ä' | tr ä x
x
mac$ echo ab | tr ab äx
äx
The UTF-8 representation of the character with numerical value 0377 (U+00ff) is the two bytes c3 bf, so that's what you get.
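One way to check that encoding yourself, assuming iconv is installed: in ISO-8859-1 the single byte ff is exactly code point U+00FF, so converting it to UTF-8 should give the two bytes mentioned above. The variable name is just for illustration.

```shell
# Byte ff in ISO-8859-1 is code point U+00FF; re-encode it as UTF-8 and dump the bytes
utf8_of_ff=$(printf '\377' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "$utf8_of_ff"   # c3bf
```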
The easy way to have tr work byte-by-byte is to have it use the C locale, instead of a UTF-8 locale. This gives the funny behavior again:
$ echo 'ä' | LC_ALL=C tr 'ä' 'x'
xx
And in your case, you can use:
... | LC_ALL=C tr "\000" "\377"
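A quick way to confirm the result is a hex dump of a few bytes. In this sketch, the four NUL bytes from /dev/zero are only a stand-in for whatever feeds your real pipeline, and the variable name is illustrative.

```shell
# Take 4 NUL bytes as sample input, translate them in byte mode, and dump as hex
ff_bytes=$(head -c 4 /dev/zero | LC_ALL=C tr '\000' '\377' | od -An -tx1 | tr -d ' \n')
echo "$ff_bytes"   # ffffffff
```

Because LC_ALL=C is set only for the tr command, the rest of the pipeline keeps its normal locale.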
Or you could use something like Perl to generate those \xff bytes:
perl -e 'printf "\377" x 1000 for 1..100'
"Linux has that set to C while macOS has it set to something like en_US.UTF-8" -- I'm not sure this is the whole story. In my Kubuntu or Debian env | grep -E 'LANG|LC' returns LANG=pl_PL.UTF-8 only, so it's Unicode. Still the OP's original command yields 0xff out of the box. Could it be because tr implementation itself differs between Linux and Mac? – Kamil Maciorowski – 2018-08-16T05:27:13.560

Regarding my doubt, I have found this answer which says "many implementations of tr, including the one in GNU coreutils, don't support multibyte encodings". Seems legit. In my Debian tr 'Ł' 'L' translates Ł to LL (Ł is a Polish letter, I use LANG=pl_PL.UTF-8), so it apparently treats its first argument as two characters. – Kamil Maciorowski – 2018-08-16T05:46:10.290

Yes, it has to be done by tr. It would make negative sense for such conversion to happen when writing to file. – user1686 – 2018-08-16T05:47:32.333

It's not really hard to test that it's not about the locale setting. With LANG=en_US.UTF-8 (on a Linux system that has that locale generated), printf ' ' | tr ' ' '\377' | hexdump -C plainly shows ff. – ilkkachu – 2018-08-16T09:03:19.560

And, actually, changing LANG might not be enough. The relevant locale setting is LC_CTYPE, and the value it gets comes first from LC_ALL, then LC_CTYPE, then LANG, with the first one set taking effect (that's the same for all other locale settings). So, if LC_CTYPE is set, changing LANG doesn't do anything in this case. To reliably override it, you'd need to set LC_ALL. Also, it's enough to set it just for tr, i.e. ... | LC_ALL=C tr ' ' '\377' | ... – ilkkachu – 2018-08-16T09:08:08.553

@ilkkachu Thanks for the tips! Edits made to improve the answer. Thanks community! – JakeGould – 2018-08-16T16:00:40.067
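
The lookup order described in the comments (LC_ALL first, then LC_CTYPE, then LANG) can be sketched as a small shell function. This is only an illustration: effective_ctype is a hypothetical helper, not a real command, and the final fallback to C assumes the usual default when nothing is set.

```shell
# Hypothetical helper: print the value that would decide LC_CTYPE, given the
# values of LC_ALL, LC_CTYPE and LANG as arguments (empty counts as unset).
effective_ctype() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # LC_ALL overrides everything
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"      # then LC_CTYPE itself
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"      # then LANG as the fallback
    else
        printf 'C\n'            # nothing set: typically the C/POSIX locale
    fi
}

# Show what applies in the current environment
effective_ctype "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}"
```

For example, effective_ctype "" en_US.UTF-8 C still prints en_US.UTF-8, which is why changing LANG alone may not be enough and LC_ALL=C is the reliable override.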