Program to check/look up UTF-8/Unicode characters in string on command line?

I've just realized I have a file on my system; it lists normally:

$ ls -la TΕSТER.txt 
-rw-r--r-- 1 user user 8 2013-04-11 18:07 TΕSТER.txt
$ cat TΕSТER.txt 
testing

... yet, it crashes a piece of software with a UTF-8/Unicode related error. I was really puzzled, since I couldn't tell why such a file is a problem; and finally I remembered to check the output of ls with hexdump:

$ ls TΕSТER.txt 
TΕSТER.txt
$ ls TΕSТER.txt | hexdump -C
00000000  54 ce 95 53 d0 a2 45 52  2e 74 78 74 0a           |T..S..ER.txt.|
0000000d

... Well, obviously there are some bytes in between/instead of some letters, so I guess it is a Unicode encoding problem. And I can try to echo the bytes back to see what is printed:

$ echo -e "\x54\xCE\x95\x53\xD0\xA2\x45\x52\x2E\x74\x78\x74"
TΕSТER.txt

... but I still cannot tell which - if any - Unicode characters these are.

So is there a command line tool which I can use to inspect a string on the terminal, and get Unicode information about its characters?

sdaau

Posted 2013-04-11T17:06:54.660

Reputation: 3 758

Answers

Try using uniname, part of the uniutils package on Debian and Ubuntu systems. Here's an example of uniname in action:

echo -e "\x54\xCE\x95\x53\xD0\xA2\x45\x52\x2E\x74\x78\x74" | uniname
No LINES variable in environment so unable to determine lines per page.
Using default of 24.
character  byte       UTF-32   encoded as     glyph   name
        0          0  000054   54             T      LATIN CAPITAL LETTER T
        1          1  000395   CE 95          Ε      GREEK CAPITAL LETTER EPSILON
        2          3  000053   53             S      LATIN CAPITAL LETTER S
        3          4  000422   D0 A2          Т      CYRILLIC CAPITAL LETTER TE
        4          6  000045   45             E      LATIN CAPITAL LETTER E
        5          7  000052   52             R      LATIN CAPITAL LETTER R
        6          8  00002E   2E             .      FULL STOP
        7          9  000074   74             t      LATIN SMALL LETTER T
        8         10  000078   78             x      LATIN SMALL LETTER X
        9         11  000074   74             t      LATIN SMALL LETTER T
       10         12  00000A   0A                     LINE FEED (LF)

rmiesen

Posted 2013-04-11T17:06:54.660

Reputation: 331

Well, I looked a bit on the net, and found a one-liner ugrep in Look up a unicode character by name | commandlinefu.com; but that doesn't help me much here.

Then I saw codecs – String encoding and decoding - Python Module of the Week, which does have a lot of options - but not much related to Unicode character names.

So finally I coded a small tool utfinfo.pl, which only accepts input on stdin:

http://sdaaubckp.svn.sourceforge.net/viewvc/sdaaubckp/single-scripts/utfinfo.pl

... which gives me the following information:

$ ls TΕSТER.txt | perl utfinfo.pl 
Got 10 uchars
Char: 'T' u: 84 [0x0054] b: 84 [0x54] n: LATIN CAPITAL LETTER T [Basic Latin]
Char: 'Ε' u: 917 [0x0395] b: 206,149 [0xCE,0x95] n: GREEK CAPITAL LETTER EPSILON [Greek and Coptic]
Char: 'S' u: 83 [0x0053] b: 83 [0x53] n: LATIN CAPITAL LETTER S [Basic Latin]
Char: 'Т' u: 1058 [0x0422] b: 208,162 [0xD0,0xA2] n: CYRILLIC CAPITAL LETTER TE [Cyrillic]
Char: 'E' u: 69 [0x0045] b: 69 [0x45] n: LATIN CAPITAL LETTER E [Basic Latin]
Char: 'R' u: 82 [0x0052] b: 82 [0x52] n: LATIN CAPITAL LETTER R [Basic Latin]
Char: '.' u: 46 [0x002E] b: 46 [0x2E] n: FULL STOP [Basic Latin]
Char: 't' u: 116 [0x0074] b: 116 [0x74] n: LATIN SMALL LETTER T [Basic Latin]
Char: 'x' u: 120 [0x0078] b: 120 [0x78] n: LATIN SMALL LETTER X [Basic Latin]
Char: 't' u: 116 [0x0074] b: 116 [0x74] n: LATIN SMALL LETTER T [Basic Latin]

... which then identifies which characters are not the "plain" ASCII ones.
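For anyone who cannot fetch the script, here is a minimal Python sketch that prints similar per-character information using only the standard `unicodedata` module. It is not the original utfinfo.pl: the function names `char_info` and `format_row` are made up for this illustration, and it does not print the block name (e.g. [Basic Latin]), since `unicodedata` does not expose that.

```python
import unicodedata


def char_info(text):
    """Return (char, code point, UTF-8 bytes, Unicode name) for each character."""
    return [
        (ch, ord(ch), ch.encode("utf-8"), unicodedata.name(ch, "<no name>"))
        for ch in text
    ]


def format_row(row):
    """Format one tuple roughly like the utfinfo.pl lines shown above."""
    ch, codepoint, raw, name = row
    hex_bytes = ",".join(f"0x{b:02X}" for b in raw)
    return f"Char: {ch!r} u: {codepoint} [0x{codepoint:04X}] b: [{hex_bytes}] n: {name}"


if __name__ == "__main__":
    # The filename bytes from the question, written as escapes.
    sample = b"\x54\xCE\x95\x53\xD0\xA2\x45\x52\x2E\x74\x78\x74".decode("utf-8")
    for row in char_info(sample):
        print(format_row(row))
```

To use it on piped input instead of the hard-coded sample, the text could be read with `sys.stdin.buffer.read().decode("utf-8")`; control characters such as the trailing newline have no Unicode name and show up as `<no name>`.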

Hope this helps someone,
Cheers!

sdaau

Posted 2013-04-11T17:06:54.660

Reputation: 3 758

http://sdaaubckp.svn.sourceforge.net/viewvc/sdaaubckp/single-scripts/utfinfo.pl is a dead link – Winny – 2019-03-25T02:50:03.863

Nice tool, but the downloaded version is missing the ! of the shebang... – mpy – 2013-04-12T16:26:43.877

Cheer @mpy - fixed now... – sdaau – 2013-04-16T01:19:05.847

Let's work with a non-ASCII character, for instance á. To see the UTF-8 bytes of á:

echo -n 'á' | xxd

To see the Unicode code point of á:

echo -en 'á' | iconv -f utf-8 -t UNICODEBIG | xxd -g 2

So for your filename we have:

echo -e "\x54\xCE\x95\x53\xD0\xA2\x45\x52\x2E\x74\x78\x74"  | iconv -f utf-8 -t UNICODEBIG | xxd -g 2

This shows that the code point of the character that looks like a capital E is \u0395 (GREEK CAPITAL LETTER EPSILON), which is drawn with the same glyph as the ASCII E (\x45).
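The same inspection can be sketched in Python, in case iconv or xxd is not at hand. This is only an illustration under a few assumptions: Python's `utf-16-be` codec stands in for iconv's UNICODEBIG target, and the helper names `utf16_units` and `non_ascii` are made up here. The second helper lists only the characters outside ASCII, which is where look-alikes such as \u0395 hide.

```python
import unicodedata


def utf16_units(text):
    """Return the big-endian UTF-16 code units of text as 4-digit hex strings.

    Characters outside the Basic Multilingual Plane yield two units
    (a surrogate pair), so units and code points differ for those.
    """
    raw = text.encode("utf-16-be")
    return [raw[i:i + 2].hex() for i in range(0, len(raw), 2)]


def non_ascii(text):
    """Return (index, 'U+XXXX', name) for every character outside ASCII."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]


if __name__ == "__main__":
    # The filename bytes from the question, decoded as UTF-8.
    name = b"\x54\xCE\x95\x53\xD0\xA2\x45\x52\x2E\x74\x78\x74".decode("utf-8")
    print(" ".join(utf16_units(name)))
    for hit in non_ascii(name):
        print(*hit)
```

For the filename in the question, only the characters at index 1 and index 3 are reported; everything else is plain ASCII.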

Danilo G. Veraszto

Posted 2013-04-11T17:06:54.660

Reputation: 11

How does this address the question? – RalfFriedl – 2019-02-10T17:11:25.393

Hello Ralf, @sdaau needed to find out the Unicode code points, as he stated:

_... but I still cannot tell which - if any - Unicode characters these are.

So is there a command line tool, which I can to inspect a string on the terminal, and get Unicode information about it's characters?_ and that is what i explained – Danilo G. Veraszto – 2019-02-12T09:24:52.930