Unicode is intended to be a universal character set covering all the characters required for written text, including all writing systems, technical symbols and punctuation.
Unicode
Unicode assigns each character a code point to act as a unique reference:
- U+0041 A
- U+0042 B
- U+0043 C
- ...
- U+039B Λ
- U+039C Μ
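As a rough illustration of the list above, the following Python sketch converts between characters and their code points. The helper name `code_point_label` is made up for this example; `ord` and `chr` are Python built-ins.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character.

    Hypothetical helper for illustration; not part of any library.
    """
    # ord() gives the code point as an integer; pad to at least four hex digits.
    return f"U+{ord(ch):04X}"


# The sample characters from the list above: A, B, C, Greek capital lambda and mu.
sample = ["A", "B", "C", chr(0x039B), chr(0x039C)]
labels = [code_point_label(ch) for ch in sample]

for label, ch in zip(labels, sample):
    print(label, ch)
```

`chr` is the inverse of `ord`, so `chr(0x0041)` gives back `"A"`.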
Unicode Transformation Formats
Unicode Transformation Formats (UTFs) describe how code points are encoded as sequences of bytes. The most common forms are UTF-8 (which encodes each code point as one, two, three or four bytes) and UTF-16 (which encodes each code point as two or four bytes).
Code Point    UTF-8    UTF-16 (big-endian)
U+0041        41       00 41
U+0042        42       00 42
U+0043        43       00 43
...
U+039B        CE 9B    03 9B
U+039C        CE 9C    03 9C
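The rows in the table above can be reproduced with Python's built-in codecs. This is only a sketch: `hex_bytes` and `encode_row` are hypothetical helper names, and the final code point (U+1F600) is not in the table; it is added here as an assumption-free but extra example of a character that needs four bytes in both encodings.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs, e.g. 'CE 9B'."""
    return " ".join(f"{b:02X}" for b in data)


def encode_row(cp: int) -> tuple[str, str, str]:
    """Return (code point label, UTF-8 bytes, UTF-16 big-endian bytes)."""
    ch = chr(cp)
    return (
        f"U+{cp:04X}",
        hex_bytes(ch.encode("utf-8")),
        # "utf-16-be" fixes the byte order and writes no byte order mark.
        hex_bytes(ch.encode("utf-16-be")),
    )


# The first five code points come from the table; U+1F600 is an extra example
# of a code point outside the Basic Multilingual Plane.
for cp in (0x0041, 0x0042, 0x0043, 0x039B, 0x039C, 0x1F600):
    print(*encode_row(cp), sep="  |  ")
```

Using the plain `"utf-16"` codec instead would prepend a byte order mark and use the platform's byte order, which is why the sketch names the big-endian variant explicitly.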
Specification
The Unicode Consortium also defines standards for collation (sorting), case mapping, character normalization and other character operations, some of which are locale-sensitive.
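A few of these operations are exposed by Python's standard library, which makes for a small, hedged illustration. The sketch below uses the `unicodedata` module and the built-in string methods. Note that these built-ins are locale-independent, and that Python's default string sort orders by code point rather than by the Unicode collation rules, so locale-aware sorting would need a separate library.

```python
import unicodedata

# Normalization: the same visible character can have more than one
# code point sequence.
composed = "\u00E9"       # LATIN SMALL LETTER E WITH ACUTE as one code point
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# The raw strings differ, but they are equal after normalization.
raw_equal = composed == decomposed
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed

# Case mapping: one character can map to several characters.
upper_sharp_s = "\u00DF".upper()        # German sharp s, uppercased
folded_sharp_s = "\u00DF".casefold()    # case folding for caseless matching

# Character properties: every code point has a formal name and a category.
lambda_name = unicodedata.name("\u039B")
lambda_category = unicodedata.category("\u039B")

print(raw_equal, nfc_equal, nfd_equal)
print(upper_sharp_s, folded_sharp_s)
print(lambda_name, lambda_category)
```

Comparing strings only after normalizing both to the same form is the usual way to avoid treating the composed and decomposed spellings as different text.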
Identifying Characters