Why does VIM show the Unicode code point and not the UTF-8 code value?

8

1

Consider this supposed line of code that I found in a PHP blog, note the quotes:

throw new Exception(“That's not a server name!”);

The closing quote is RIGHT DOUBLE QUOTATION MARK (Unicode code point: U+201D; UTF-8 encoding: 0xE2 0x80 0x9D). With the cursor on it, pressing ga in VIM displays the following in the status bar:

<”> 8221, Hex 201d, Octal 20035

Why is the Unicode code point being displayed and not the UTF-8 code value?

Considering that the file is stored as UTF-8 and it is the terminal translating the bytes into glyphs, I would expect VIM to show the raw value of the file (UTF-8 code value), not to translate it into a Unicode code point.

dotancohen

Posted 2014-07-23T06:38:10.100

Reputation: 9 798

TIL new #Vim commands. Thank you for the question! – Boldewyn – 2014-07-23T10:05:03.000

Answers

17

Why is the Unicode code point being displayed and not the UTF-8 code value?

Because you use ga:

<”> 8221, Hex 201d, Octal 20035

instead of g8:

e2 80 9d
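Both displays can be reproduced outside Vim. The following is a minimal Python sketch, not anything Vim itself runs; the variable names are illustrative and the `ga_line` format merely imitates the status-bar text quoted above:

```python
# The character under the cursor in the question: RIGHT DOUBLE QUOTATION MARK.
ch = "\u201d"

# What ga reports: the Unicode code point in decimal, hex, and octal.
code_point = ord(ch)
ga_line = f"<{ch}> {code_point}, Hex {code_point:04x}, Octal {code_point:o}"

# What g8 reports: the bytes of the character's UTF-8 encoding, in hex.
g8_line = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))

print(ga_line)  # <”> 8221, Hex 201d, Octal 20035
print(g8_line)  # e2 80 9d
```

The code point is a single number identifying the character; the three bytes are just one possible way of storing that number.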

romainl

Posted 2014-07-23T06:38:10.100

Reputation: 19 227

Thank you romainl. Actually, I did read :h ga but did not read the g8 section that came after it. I quote from :h ga "Print the ascii value of the character under the cursor ... When the character is a non-standard ASCII character, ... the non-printable version is also given." I guess that the text of that document may have been written before UTF-8 support. Since the Unicode code point and the UTF-8 code value are the same hexadecimal value for code points <=127, there was no need to make a distinction at the time. – dotancohen – 2014-07-23T08:31:33.357

13

Because Vim is a text editor and works with text codepoints, not bytes. There is more than just one translation happening – when opening a file, the editor must decode it from the byte encoding to an internal representation (usually Unicode); when saving back to a file, or when displaying its contents on the terminal, the editor must encode the text back to bytes.

One reason for this is simple – the file and the terminal might be using different character sets. For example, you're editing some old documents in ISO 8859-13 or KOI8-R, and want them to show up correctly on a UTF-8 terminal.
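The decode-then-encode round trip can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the sample word and the variable names are made up, and Vim's own conversion code is of course not written this way.

```python
# Bytes as they might sit on disk in an old KOI8-R document (sample word assumed).
file_bytes = "Привет".encode("koi8_r")

# Opening the file: decode from the file's encoding into code points.
text = file_bytes.decode("koi8_r")

# Displaying it: encode the same code points for a UTF-8 terminal.
terminal_bytes = text.encode("utf-8")

print(file_bytes.hex(" "))      # the bytes stored in the file
print(terminal_bytes.hex(" "))  # different bytes sent to the terminal
print(len(text), len(file_bytes), len(terminal_bytes))  # 6 6 12
```

The text in the middle stays the same six characters; only the byte sequences on either side differ. In Vim, the 'fileencoding' and 'termencoding' options describe those two sides.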

The second reason, again, is that text editors work with text. For example, ” is one character and its width is one terminal cell, regardless of its byte encoding (3 bytes in UTF-8, 1 byte in Windows-1257, 2 bytes in Shift-JIS, and so on). If Vim merely counted it as three bytes but the terminal showed it as one cell, the result would be misaligned vertical splits, lines wrapped too soon, tabs appearing too narrow, and so on.
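The byte counts quoted above can be checked for a curly quote such as ” (U+201D). A small Python sketch, assuming Python's `cp1257` and `shift_jis` codecs as stand-ins for Windows-1257 and Shift-JIS:

```python
ch = "\u201d"  # RIGHT DOUBLE QUOTATION MARK

# Size of the same character in three different byte encodings.
sizes = {}
for encoding in ("utf-8", "cp1257", "shift_jis"):
    encoded = ch.encode(encoding)
    sizes[encoding] = len(encoded)
    print(f"{encoding:10} {len(encoded)} byte(s): {encoded.hex(' ')}")

# However it is stored, it is a single character to the editor.
print(len(ch))  # 1
```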

Instead of this...                ...you would see this.

┌───────────────────────────┐     ┌───────────────────────────┐
│She said, "Hello."         │     │She said, "Hello."         │
│                           │     │                           │
│She said, “Hello.”         │     │She said, “Hello.”     │
│                           │     │                           │
│Ji pasakė, „Sveiki“.       │     │Ji pasakė, „Sveiki“. │
└───────────────────────────┘     └───────────────────────────┘

Not to mention, you'd have to Backspace three times to delete a single character.
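The misalignment is easy to reproduce. In this rough Python sketch, the two helper functions and the 27-cell width are assumptions made for illustration; `len()` on a string counts code points, which matches terminal cells only for simple single-width characters like the ones used here.

```python
WIDTH = 27  # assumed interior width of the box, in terminal cells

def pad_by_chars(line, width=WIDTH):
    # Correct: one curly quote counts as one cell.
    return line + " " * (width - len(line))

def pad_by_bytes(line, width=WIDTH):
    # Wrong: one curly quote counts as three cells, so too few spaces are added.
    return line + " " * (width - len(line.encode("utf-8")))

for line in ('She said, "Hello."', "She said, “Hello.”"):
    print(f"│{pad_by_chars(line)}│     │{pad_by_bytes(line)}│")
```

For the ASCII line both helpers agree. For the line with curly quotes, byte-based padding comes out four cells short, because each of the two quotes takes three bytes but only one cell.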

user1686

Posted 2014-07-23T06:38:10.100

Reputation: 283 655

Thanks Grawity, your opening sentence makes the critical point. – dotancohen – 2014-08-25T10:11:29.117