Extended Unix Code

Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.

The structure of EUC is based on the ISO-2022 standard, which specifies a way to represent character sets containing a maximum of 94 characters, or 8836 (94²) characters, or 830584 (94³) characters, as sequences of 7-bit codes. Only ISO-2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme.

G0 is almost always an ISO-646 compliant coded character set such as US-ASCII, ISO 646:KR (KS X 1003) or ISO 646:JP (the lower half of JIS X 0201) that is invoked on GL (i.e. with the most significant bit cleared). An exception from US-ASCII is that 0x5C (backslash in US-ASCII) is often used to represent a Yen sign in EUC-JP (see below) and a won sign in EUC-KR.

To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code.
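The bit-setting step can be illustrated with a short Python sketch. The function names `iso2022_to_euc` and `is_g0_byte` are hypothetical, chosen only for this example; the sample value 0x24 0x22 is the 7-bit JIS X 0208 code for the hiragana あ.

```python
def iso2022_to_euc(seven_bit: bytes) -> bytes:
    """Return the EUC form of a 7-bit ISO 2022 code by setting each byte's high bit."""
    for b in seven_bit:
        if not 0x21 <= b <= 0x7E:
            raise ValueError(f"0x{b:02X} is outside the 94-character range 0x21-0x7E")
    return bytes(b | 0x80 for b in seven_bit)  # equivalent to adding 128


def is_g0_byte(b: int) -> bool:
    """True if the high bit is cleared, i.e. the byte belongs to the ISO 646 set."""
    return b < 0x80


euc = iso2022_to_euc(b"\x24\x22")   # 7-bit JIS X 0208 code for the hiragana "あ"
print(euc.hex())                     # a4a2
print(euc.decode("euc_jp") == "あ")  # True
```

Because every byte of a G1 character ends up at 0xA1 or above, a single comparison against 0x80 is enough to tell the two kinds of byte apart.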

The most commonly used EUC codes are variable-width encodings with a character belonging to G0 (ISO-646 compliant coded character set) taking one byte and a character belonging to G1 (taken by a 94x94 coded character set) represented in two bytes. The EUC-CN form of GB 2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes whereas a single character in EUC-TW can take up to four bytes.

Modern applications are more likely to use UTF-8, which supports all of the characters of the EUC codes and more, and is generally more portable, with fewer vendor deviations and errors. EUC nevertheless remains in widespread use, especially EUC-KR in South Korea.

EUC-CN

EUC-CN
MIME / IANA: GB2312
Alias(es): csGB2312
Language(s): Simplified Chinese, English, Russian
Standard: GB 2312 (1980)
Classification: Extended ASCII, variable-width encoding, CJK encoding, EUC
Extends: US-ASCII
Extensions: 748, GBK, GB 18030, x-mac-chinesesimp
Transforms / Encodes: GB 2312
Succeeded by: GBK, GB 18030

EUC-CN[1] is the usual encoded form of the GB 2312 standard for simplified Chinese characters. Unlike the case of Japanese JIS X 0208 and ISO-2022-JP, GB 2312 is not normally used in a 7-bit ISO 2022 code version,[note 1] although a variant form called HZ (which delimits GB 2312 text with ASCII sequences) was sometimes used on USENET.

An ASCII character is represented in its usual encoding. A character from GB 2312 is represented by two bytes, both from the range 0xA1–0xFE.
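As a minimal sketch of this two-class layout, the following Python snippet splits an EUC-CN byte string into characters. The helper `split_euc_cn` is a hypothetical name introduced here; the snippet relies on Python's built-in `gb2312` codec, which implements EUC-CN.

```python
def split_euc_cn(data: bytes) -> list:
    """Split EUC-CN bytes into one-byte ASCII and two-byte GB 2312 characters."""
    chars = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:                 # ASCII: usual one-byte encoding
            chars.append(data[i:i + 1])
            i += 1
        else:                              # GB 2312: two bytes in 0xA1-0xFE
            pair = data[i:i + 2]
            if len(pair) != 2 or not all(0xA1 <= b <= 0xFE for b in pair):
                raise ValueError(f"invalid EUC-CN sequence at offset {i}")
            chars.append(pair)
            i += 2
    return chars


encoded = "A中".encode("gb2312")
print(encoded.hex())          # 41d6d0
print(split_euc_cn(encoded))  # [b'A', b'\xd6\xd0']
```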

748 code

An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB 2312, but is not ISO 2022-compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is therefore more similar in structure to Big5 and other non-ISO 2022-compliant DBCS encoding systems.) The non-GB 2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.

GBK and GB 18030

GBK is an extension to GB 2312. It defines an extended form of the EUC-CN encoding capable of representing a larger array of CJK characters sourced largely from Unicode 1.1, including traditional Chinese characters and characters used only in Japanese. It is not, however, a true EUC code, because ASCII bytes may appear as trail bytes (and C1 bytes, not limited to the single shifts, may appear as lead or trail bytes), due to a larger encoding space being required.

Variants of GBK are implemented by Windows code page 936 (the Microsoft Windows code page for simplified Chinese), and by IBM's code page 1386.

The Unicode-based GB 18030 character encoding defines an extension of GBK capable of encoding the entirety of Unicode. However, Unicode encoded as GB 18030 is a variable-width encoding which may use up to four bytes per character, due to an even larger encoding space being required. Being an extension of GBK, it is a superset of EUC-CN but is not itself a true EUC code. Being a Unicode encoding, its repertoire is identical to that of other Unicode transformation formats such as UTF-8.
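The relationship between the three encodings can be checked with Python's built-in `gb2312`, `gbk` and `gb18030` codecs. This is only an illustrative sketch; the sample characters were chosen for this example and do not come from the standards' own text.

```python
# A GB 2312 character has the same two bytes in all three encodings.
same = "中".encode("gb2312") == "中".encode("gbk") == "中".encode("gb18030")
print(same)  # True

# U+4E02 is outside GB 2312 but inside GBK; its trail byte is an ASCII byte,
# which is why GBK is not a true EUC code.
gbk_bytes = "丂".encode("gbk")
print(gbk_bytes.hex())       # 8140
print(gbk_bytes[1] < 0x80)   # True

# A character beyond the Basic Multilingual Plane needs four bytes in GB 18030.
print(len("😀".encode("gb18030")))  # 4
```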

Mac OS Chinese Simplified

Other EUC-CN variants deviating from the EUC mechanism include the Mac OS Chinese Simplified script (known as Code page 10008 or x-mac-chinesesimp).[2] It uses the bytes 0x80, 0x81, 0x82, 0xA0, 0xFD, 0xFE and 0xFF for the U with umlaut (ü), two special font metric characters, the non-breaking space, the copyright sign (©), the trademark sign (™) and the ellipsis (…) respectively.[1] This differs in what is regarded as a single-byte character versus the first byte of a two-byte character from both EUC (where, of those, 0xFD and 0xFE are defined as lead bytes) and GBK (where, of those, 0x81, 0x82, 0xFD and 0xFE are defined as lead bytes).

This use of 0xA0, 0xFD, 0xFE and 0xFF matches Apple's Shift_JIS variant.

EUC-JP

EUC-JP
MIME / IANA: EUC-JP
Alias(es): Unixized JIS (UJIS), csEUCPkdFmtJapanese
Language(s): Japanese, English, Russian
Classification: Extended ISO 646, variable-width encoding, CJK encoding, EUC
Extends: US-ASCII or ISO 646:JP
Transforms / Encodes: JIS X 0208, JIS X 0212, JIS X 0201
Succeeded by: EUC-JISx0213

EUC-JIS-2004
Alias(es): EUC-JISx0213
Language(s): Japanese, Ainu, English, Russian
Standard: JIS X 0213
Classification: Extended ASCII, variable-width encoding, CJK encoding, EUC
Extends: US-ASCII
Transforms / Encodes: JIS X 0213, JIS X 0201 (Kana)
Preceded by: EUC-JP

EUC-JP is a variable-width encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201. Other names for this encoding include Unixized JIS (or UJIS) and AT&T JIS.[3] As of August 2018, 0.1% of all web pages use EUC-JP,[4] while 3.2% of Japanese web sites use this encoding (less than Shift JIS or UTF-8). It is called Code page 954 by IBM.[5][6] Microsoft has two code page numbers for this encoding (51932 and 20932).

This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape characters employed by ISO-2022-JP, which is based on the same character set standards, and without ASCII bytes appearing as trail bytes (unlike Shift JIS).

A related and partially compatible encoding, called EUC-JISx0213 or EUC-JIS-2004, encodes JIS X 0201 and JIS X 0213[7] (similarly to Shift_JISx0213, its Shift_JIS-based counterpart).

Compared to EUC-CN or EUC-KR, EUC-JP did not become as widely adopted on PC and Macintosh systems in Japan, which used Shift JIS or its extensions (Windows code page 932 on Microsoft Windows, and MacJapanese on classic Mac OS), although it became heavily used by Unix or Unix-like operating systems (except for HP-UX). Therefore, whether Japanese web sites use EUC-JP or Shift_JIS often depends on what OS the author uses.

Vendor extensions to EUC-JP were usually allocated within the individual code sets,[8] as opposed to using invalid EUC sequences (as in popular extensions of EUC-CN and EUC-KR).

Characters are encoded as follows:

  • As an EUC/ISO 2022 compliant encoding, the C0 control characters, space and DEL are represented as in ASCII.
  • A graphical character from ASCII (code set 0) is represented as its usual one-byte representation, in the range 0x21–0x7E. While some variants of EUC-JP encode the lower half of JIS X 0201 here, most encode ASCII,[9] including the W3C/WHATWG Encoding standard used by HTML5,[10] and so does EUC-JIS-2004.[7] While this means that 0x5C is typically mapped to Unicode as U+005C REVERSE SOLIDUS (the ASCII backslash), U+005C may be displayed as a Yen sign by certain Japanese-locale fonts, e.g. on Microsoft Windows, for compatibility with the lower half of JIS X 0201.[11][12]
  • A character from JIS X 0208 (code set 1) is represented by two bytes, both in the range 0xA1–0xFE. This differs from the ISO-2022-JP representation by having the high bit set. This code set may also contain vendor extensions in some EUC-JP variants. In EUC-JIS-2004, the first plane of JIS X 0213 is encoded here, which is effectively a superset of standard JIS X 0208.[7]
  • A character from the upper half of JIS X 0201 (half-width kana, code set 2) is represented by two bytes, the first being 0x8E, the second being the usual JIS X 0201 representation in the range 0xA1–0xDF. This set may contain IBM vendor extensions in some variants.
  • A character from JIS X 0212 (code set 3) is represented in EUC-JP by three bytes, the first being 0x8F, the following two being in the range 0xA1–0xFE, i.e. with the high bit set. In addition to standard JIS X 0212, code set 3 of some EUC-JP variants may also contain extensions in rows 83 and 84 to represent characters from IBM's Shift JIS extensions which lack standard JIS X 0212 mappings, which may be coded in either of two layouts, one defined by IBM themselves and one defined by the OSF.[8][13] In EUC-JIS-2004, the second plane of JIS X 0213 is encoded here,[7] which does not collide with the allocated rows in standard JIS X 0212.[14] Some implementations of EUC-JIS-2004, such as the one used by Python, allow both JIS X 0212 and JIS X 0213 plane 2 characters in this set.[14]
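The byte layout listed above can be sketched as a small classifier that labels each character of an EUC-JP byte string with its code set. The function name `euc_jp_code_sets` is hypothetical, and the sketch covers only the standard layout, not vendor extensions; the three bytes 0x8F 0xB0 0xA1 are used purely as a structurally valid code set 3 sequence.

```python
def euc_jp_code_sets(data: bytes) -> list:
    """Return (code_set, raw_bytes) pairs for an EUC-JP byte string."""
    result = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                 # code set 0: ASCII / C0 controls
            code_set, length, trail_max = 0, 1, 0xFE
        elif lead == 0x8E:              # code set 2: half-width kana
            code_set, length, trail_max = 2, 2, 0xDF
        elif lead == 0x8F:              # code set 3: JIS X 0212
            code_set, length, trail_max = 3, 3, 0xFE
        elif 0xA1 <= lead <= 0xFE:      # code set 1: JIS X 0208
            code_set, length, trail_max = 1, 2, 0xFE
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
        chunk = data[i:i + length]
        if len(chunk) < length:
            raise ValueError(f"truncated sequence at offset {i}")
        if not all(0xA1 <= b <= trail_max for b in chunk[1:]):
            raise ValueError(f"invalid trail byte at offset {i}")
        result.append((code_set, chunk))
        i += length
    return result


sample = "aあｱ".encode("euc_jp") + b"\x8f\xb0\xa1"
for code_set, raw in euc_jp_code_sets(sample):
    print(code_set, raw.hex())
# 0 61
# 1 a4a2
# 2 8eb1
# 3 8fb0a1
```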

EUC-KR

EUC-KR
EUC-KR code structure
MIME / IANA: EUC-KR
Alias(es): Wansung, IBM-970
Language(s): Korean, English, Russian
Standard: KS X 2901 (KS C 5861)
Classification: Extended ISO 646, variable-width encoding, CJK encoding, EUC
Extends: US-ASCII or ISO 646:KR
Extensions: Mac OS Korean, IBM-949, Unified Hangul Code (Windows-949)
Transforms / Encodes: KS X 1001
Succeeded by: Unified Hangul Code (web standards)

EUC-KR is a variable-width encoding to represent Korean text using two coded character sets, KS X 1001 (formerly KS C 5601)[15][16] and either ISO 646:KR (KS X 1003, formerly KS C 5636) or US-ASCII, depending on variant. KS X 2901 (formerly KS C 5861) stipulates the encoding, and RFC 1557 dubbed it EUC-KR.

A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1–0xFE) and a character from KS X 1003 or US-ASCII (G0, code set 0) takes one byte in GL (0x21–0x7E).
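Because each GR byte is the corresponding 7-bit code with the high bit set, the row and cell of a KS X 1001 character can be recovered by subtracting 0xA0 from each byte. The sketch below assumes Python's built-in `euc_kr` codec; `ksx1001_row_cell` is a hypothetical helper name.

```python
def ksx1001_row_cell(pair: bytes) -> tuple:
    """Return the (row, cell) position, each in 1-94, of a two-byte EUC-KR character."""
    if len(pair) != 2 or not all(0xA1 <= b <= 0xFE for b in pair):
        raise ValueError("expected two bytes in the range 0xA1-0xFE")
    return pair[0] - 0xA0, pair[1] - 0xA0


encoded = "한글 OK".encode("euc_kr")
print(encoded.hex())                   # c7d1b1db204f4b
print(ksx1001_row_cell(encoded[0:2]))  # (39, 49)
print(ksx1001_row_cell(encoded[2:4]))  # (17, 59)
```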

When used with ASCII, it is called Code page 970 by IBM.[17][18][19] It is known as Code page 51949 by Microsoft.[20] It is usually referred to as Wansung (Korean: 완성, romanized: Wanseong, lit. 'precomposed[21]') in the Republic of Korea.

A common extension of EUC-KR is the Unified Hangul Code (통합형 한글 코드, Tonghabhyeong Hangeul Kodeu,[22] or 통합 완성형, Tonghab Wansunghyung), which is the default Korean codepage on Microsoft Windows (code page 949, numbered 1363 by IBM). The W3C/WHATWG Encoding Standard used by HTML5 incorporates the Unified Hangul Code extensions into its definition of EUC-KR.[23] Other EUC-KR compatible extensions include the Mac OS Korean encoding, used by the classic Mac OS. IBM's code page 949 is yet another, unrelated, EUC-KR extension. Similarly to the EUC-CN extensions described above, these extensions do not conform to the EUC structure.

As of July 2020, 0.1% of all web pages globally use EUC-KR,[4] a figure that is misleading, as 17.4% of South Korean (.kr) web pages use it (South Korea being the only country the encoding is meant for),[24] making it the most popular non-UTF-8/Unicode encoding for a language/web domain, although only 8.4% of web pages in the Korean language use it (seemingly making UTF-8 less popular in South Korea than in any other country).[25] Including extensions, it is the most widely used legacy character encoding in Korea on all three major platforms (macOS, other Unix-like OSes, and Windows), but usage has been slowly shifting to UTF-8, especially on Linux and macOS.

As with most other encodings, UTF-8 is now preferred for new use, solving problems with consistency between platforms and vendors.

EUC-TW

EUC-TW is a variable-width encoding that supports US-ASCII and 16 planes of CNS 11643, each of which is 94x94. It is a rarely used encoding for traditional Chinese characters as used in Taiwan. Big5 is much more common.

  • As an EUC/ISO 2022 encoding, the C0 control characters, ASCII space and DEL are encoded as in ASCII.
  • A graphical character from US-ASCII (G0, code set 0) is encoded in GL as its usual single byte representation (0x21–0x7E).
  • A character from CNS 11643 plane 1 (code set 1) is encoded as two bytes in GR (0xA1–0xFE).
  • A character in planes 1 through 16 of CNS 11643 (code set 2) is encoded as four bytes:
    • The first byte is always 0x8E (Single Shift 2).
    • The second byte (0xA1–0xB0) indicates the plane, the number of which is obtained by subtracting 0xA0 from that byte.
    • The third and fourth bytes are in GR (0xA1–0xFE).

Note that plane 1 of CNS 11643 is encoded twice: as code set 1 and as part of code set 2.
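Python's standard library has no EUC-TW codec, so the following is a hand-written sketch of the layout described above rather than a real decoder: it reports the CNS 11643 plane, row and cell of each character without mapping them to Unicode. The function name `parse_euc_tw` and the tuple format are assumptions made for this example.

```python
def parse_euc_tw(data: bytes) -> list:
    """Return ("ascii", byte) or ("cns", plane, row, cell) tuples for EUC-TW bytes."""

    def gr(b: int) -> int:
        # Convert a GR byte (0xA1-0xFE) to a 1-94 row or cell number.
        if not 0xA1 <= b <= 0xFE:
            raise ValueError(f"byte 0x{b:02X} is not in GR")
        return b - 0xA0

    result = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                      # code set 0: US-ASCII
            result.append(("ascii", lead))
            i += 1
        elif lead == 0x8E:                   # code set 2: SS2 + plane + row + cell
            chunk = data[i:i + 4]
            if len(chunk) < 4 or not 0xA1 <= chunk[1] <= 0xB0:
                raise ValueError(f"invalid four-byte sequence at offset {i}")
            result.append(("cns", chunk[1] - 0xA0, gr(chunk[2]), gr(chunk[3])))
            i += 4
        else:                                # code set 1: plane 1, two GR bytes
            chunk = data[i:i + 2]
            if len(chunk) < 2:
                raise ValueError(f"truncated sequence at offset {i}")
            result.append(("cns", 1, gr(chunk[0]), gr(chunk[1])))
            i += 2
    return result


# The same plane 1 character written in code set 1 and in code set 2,
# followed by a plane 2 character.
sample = b"A\xa4\xa1\x8e\xa1\xa4\xa1\x8e\xa2\xa1\xa1"
print(parse_euc_tw(sample))
# [('ascii', 65), ('cns', 1, 4, 1), ('cns', 1, 4, 1), ('cns', 2, 1, 1)]
```

The second and third tuples are identical, which shows the double encoding of plane 1.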

UTF-8 is becoming more common than EUC-TW, as with most code pages.

Packed versus fixed-length form

The encodings described above (using bytes in 0x21–0x7E for code set 0, bytes in 0xA1–0xFE for code set 1, 0x8E followed by bytes in 0xA1–0xFE for code set 2 and 0x8F followed by bytes in 0xA1–0xFE for code set 3) are in a variable-width form referred to as the EUC packed format. This is the form usually labelled as EUC.[3]

Internal processing may make use of a fixed-length alternative form called the EUC complete two-byte format. This represents:[3]

  • Code set 0 as two bytes in the range 0x21–0x7E (except that the first may be 0x00).
  • Code set 1 as two bytes in the range 0xA0–0xFF (except that the first may be 0x80).
  • Code set 2 as a byte in the range 0x20–0x7E (or 0x00) followed by a byte in the range 0xA0–0xFF.
  • Code set 3 as a byte in the range 0xA0–0xFF (or 0x80) followed by a byte in the range 0x21–0x7E.

Initial bytes of 0x00 and 0x80 are used in cases where the code set uses only one byte. There is also a four-byte fixed-length format.[3] These fixed-length forms are suited to internal processing and are not usually encountered in interchange.
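As a rough sketch of how the two forms relate, the following Python function converts packed EUC-JP into the complete two-byte format using the ranges listed above. It is an illustration under stated assumptions, not a reference implementation: the name `euc_jp_packed_to_fixed` is hypothetical, input validation is minimal, and C0 control bytes are treated like code set 0.

```python
def euc_jp_packed_to_fixed(data: bytes) -> bytes:
    """Convert packed EUC-JP to the complete two-byte (fixed-length) format."""
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length = 1
        elif lead == 0x8F:
            length = 3
        elif lead == 0x8E or 0xA1 <= lead <= 0xFE:
            length = 2
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
        chunk = data[i:i + length]
        if len(chunk) < length:
            raise ValueError(f"truncated sequence at offset {i}")
        if lead < 0x80:        # code set 0: pad with an initial 0x00
            out += bytes((0x00, lead))
        elif lead == 0x8E:     # code set 2: 0x00, then the kana byte (high bit set)
            out += bytes((0x00, chunk[1]))
        elif lead == 0x8F:     # code set 3: high byte kept, low byte's high bit cleared
            out += bytes((chunk[1], chunk[2] & 0x7F))
        else:                  # code set 1: both bytes unchanged
            out += chunk
        i += length
    return bytes(out)


packed = b"a\xa4\xa2\x8e\xb1\x8f\xb0\xa1"   # one character from each code set
fixed = euc_jp_packed_to_fixed(packed)
print(fixed.hex())   # 0061a4a200b1b021
print(len(fixed))    # 8
```

Every character occupies exactly two bytes in the result, and the high bits of the two bytes identify the code set without any single-shift prefix.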

EUC-JP is registered with the IANA in both formats, the packed format as "EUC-JP" or "csEUCPkdFmtJapanese" and the fixed width format as "csEUCFixWidJapanese".[26] Only the packed format is included in the WHATWG Encoding Standard used by HTML5.[27]

Notes

  1. 7-bit ISO 2022 code versions supporting GB 2312 include ISO-2022-CN (with shift codes) and ISO-2022-JP-2 (without shift codes), both of which also support other non-ASCII sets.

References

  1. "Map (external version) from Mac OS Chinese Simplified encoding to Unicode 3.0 and later". Apple, Inc.
  2. "Encoding.WindowsCodePage Property - .NET Framework (current version)". MSDN. Microsoft.
  3. Lunde, Ken (2008). CJKV Information Processing: Chinese, Japanese, Korean, and Vietnamese Computing. O'Reilly. pp. 242–244. ISBN 9780596800925.
  4. "Historical trends in the usage of character encodings for websites". W3Techs.
  5. "CCSID 954 information document". Archived from the original on 2016-03-27.
  6. International Components for Unicode (ICU), ibm-954_P101-2007.ucm, 2002-12-03
  7. "JIS X 0213 Code Mapping Tables". x0213.org.
  8. "4.2 Review Process of Rules for Code Set Conversion Between eucJP-open and UCS". Problems and Solutions for Unicode and User/Vendor Defined Characters. The Open Group Japan. Archived from the original on 1999-02-03. Retrieved 2019-08-14.
  9. "Ambiguities in conversion from Japanese EUC to Unicode (Non-Normative)". XML Japanese Profile. W3C.
  10. "EUC-JP decoder". Encoding Standard. WHATWG. "If byte is an ASCII byte, return a code point whose value is byte."
  11. "3.1.1 Details of Problems". Problems and Solutions for Unicode and User/Vendor Defined Characters. The Open Group Japan. Archived from the original on 1999-02-03. Retrieved 2019-08-14.
  12. Kaplan, Michael S. (2005-09-17). "When is a backslash not a backslash?".
  13. Lunde, Ken. "Appendix J: Japanese Character Sets" (PDF). CJKV Information Processing (2nd ed.). ISBN 978-0-596-51447-1.
  14. Chang, Hyeshik. "Readme for CJKCodecs". cPython. Python Software Foundation.
  15. "KS X 1001:1992" (PDF).
  16. "KS C 5601:1987" (PDF). 1988-10-01.
  17. "CCSID 970". IBM Globalization. IBM. Archived from the original on 2014-12-01.
  18. "ibm-970_P110_P110-2006_U2 (alias euc-kr)". Converter Explorer - ICU Demonstration. International Components for Unicode.
  19. International Components for Unicode (ICU), ibm-970_P110_P110-2006_U2.ucm, 2002-12-03
  20. "Code Page Identifiers". Windows Dev Center. Microsoft.
  21. Lunde, Ken (2009). "Chapter 3: Character Set Standards". CJKV Information Processing. p. 146. ISBN 0596514476.
  22. "한글 코드에 대하여" (in Korean). W3C. Archived from the original on 2013-05-24. Retrieved 2019-01-07.
  23. "5. Indexes (§ index EUC-KR)", Encoding Standard, WHATWG
  24. "Distribution of Character Encodings among websites that use .kr". w3techs.com. Retrieved 2020-07-03.
  25. "Distribution of Character Encodings among websites that use Korean". w3techs.com. Retrieved 2020-07-03.
  26. "Character Sets". IANA.
  27. "4.2. Names and labels". Encoding Standard. WHATWG.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.