Big5

Big5
Language(s)	Traditional Chinese
Classification	Extended ASCII,[a][b] Variable-width encoding, DBCS, CJK encoding
Extends	ASCII[b]
Extensions	Windows-950, Big5-HKSCS, numerous others
Other related encoding(s)	CNS 11643
Notes	[a] Not in the strictest sense of the term, as ASCII bytes can appear as trail bytes. [b] Big5 does not specify a single-byte component; however, ASCII (or an extension) is used in practice.

Big-5 or Big5 is a Chinese character encoding method used in Taiwan, Hong Kong, and Macau for traditional Chinese characters.

The People's Republic of China (PRC), which uses simplified Chinese characters, uses the GB 18030 character set instead.

Big5 gets its name from the consortium of five companies in Taiwan that developed it.[1]

Organization

The original Big5 character set is sorted first by usage frequency, second by stroke count, lastly by Kangxi radical.

The original Big5 character set lacked many commonly used characters. To solve this problem, each vendor developed its own extension. The ETen extension became part of the current Big5 standard through popularity.

The structure of Big5 does not conform to the ISO 2022 standard, but rather bears a certain similarity to the Shift JIS encoding. It is a double-byte character set (DBCS) with the following structure:

First byte ("lead byte")	0x81 to 0xfe (or 0xa1 to 0xf9 for non-user-defined characters)
Second byte	0x40 to 0x7e, 0xa1 to 0xfe

(the prefix 0x signifying hexadecimal numbers).

Standard assignments (excluding vendor or user-defined extensions) do not use the bytes 0x7F through 0xA0, nor 0xFF, as either lead (first) or trail (second) bytes. Bytes 0xA1 through 0xFE are used for both lead and trail bytes for double-byte (Big5) codes. Bytes 0x40 through 0x7E are used as trail bytes following a lead byte, or for single-byte codes otherwise. If the second byte is not in either range, behaviour is unspecified (i.e., varies from system to system). Additionally, certain variants of the Big5 character set, for example the HKSCS, use an expanded range for the lead byte, including values in the 0x81 to 0xA0 range (similar to Shift JIS), whereas others use reduced lead byte ranges (for instance, the Apple Macintosh variant uses 0xFD through 0xFF as single-byte codes, limiting the lead byte range to 0xA1 through 0xFC).[2]
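The byte ranges above can be turned into a small scanner. The following is a minimal sketch, not a reference decoder: the function names are hypothetical, it assumes the standard lead byte range 0xA1 to 0xFE, and it treats a lead byte followed by an out-of-range second byte as a lone single byte, which is only one of the system-dependent behaviours the text mentions.

```python
def is_lead_byte(b: int) -> bool:
    """Lead byte range for standard double-byte codes (assumed: 0xA1-0xFE)."""
    return 0xA1 <= b <= 0xFE


def is_trail_byte(b: int) -> bool:
    """Trail byte ranges: 0x40-0x7E and 0xA1-0xFE."""
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE


def split_big5(data: bytes) -> list:
    """Split a byte string into single-byte and double-byte units."""
    units = []
    i = 0
    while i < len(data):
        if (is_lead_byte(data[i])
                and i + 1 < len(data)
                and is_trail_byte(data[i + 1])):
            # A valid lead/trail pair forms one double-byte code.
            units.append(data[i:i + 2])
            i += 2
        else:
            # Anything else is kept as a single byte (a choice made for
            # this sketch; real systems differ on invalid sequences).
            units.append(data[i:i + 1])
            i += 1
    return units


if __name__ == "__main__":
    for unit in split_big5(b"A\xa1\x40B"):
        print(unit.hex().upper())
```

Variants with an extended lead byte range (such as 0x81 to 0xFE) would only need a different `is_lead_byte`.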

The numerical values of individual Big5 codes are frequently given as 4-digit hexadecimal numbers, each describing the two bytes that make up the Big5 code as if they were a big-endian representation of a 16-bit number. For example, the Big5 code for a full-width space, which consists of the bytes 0xa1 0x40, is usually written as 0xa140 or just A140.
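As an illustration of this notation, the sketch below converts between a byte pair and its 4-digit hexadecimal form. The helper names `big5_code` and `big5_bytes` are hypothetical and exist only for this example.

```python
def big5_code(pair: bytes) -> str:
    """Return the 4-digit hexadecimal notation for a two-byte Big5 code."""
    if len(pair) != 2:
        raise ValueError("a Big5 code is exactly two bytes")
    # Read the two bytes as a big-endian 16-bit number.
    return format(int.from_bytes(pair, "big"), "04X")


def big5_bytes(code: str) -> bytes:
    """Return the two bytes for a notation such as 'A140' or '0xa140'."""
    return int(code, 16).to_bytes(2, "big")


if __name__ == "__main__":
    print(big5_code(b"\xa1\x40"))          # A140
    print(big5_bytes("0xa140") == b"\xa1\x40")  # True
```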

Strictly speaking, the Big5 encoding contains only DBCS characters. However, in practice, the Big5 codes are always used together with an unspecified, system-dependent single-byte character set (ASCII, or an 8-bit character set such as code page 437), so that you will find a mix of DBCS characters and single-byte characters in Big5-encoded text. Bytes in the range 0x00 to 0x7f that are not part of a double-byte character are assumed to be single-byte characters. (For a more detailed description of this problem, please see the discussion on "The Matching SBCS" below.)
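One practical consequence of mixing single-byte and double-byte codes is that a byte in the ASCII range may be the second half of a double-byte character. The sketch below illustrates this with the character 功, whose Big5 code is A55C and therefore ends in the byte 0x5C, the ASCII backslash. It uses Python's built-in "big5" codec; the two helper functions are hypothetical names for this example.

```python
def contains_backslash_naive(raw: bytes) -> bool:
    """Byte-level search: may match the trail byte of a double-byte code."""
    return b"\\" in raw


def contains_backslash_decoded(raw: bytes, encoding: str = "big5") -> bool:
    """Character-level search: decode first, then look for a backslash."""
    return "\\" in raw.decode(encoding)


if __name__ == "__main__":
    raw = "功".encode("big5")
    print(raw.hex().upper())
    print(contains_backslash_naive(raw))    # True: false positive on 0x5C
    print(contains_backslash_decoded(raw))  # False: no backslash character
```

Software that scans Big5 text byte by byte for ASCII delimiters can therefore misfire unless it tracks lead bytes or decodes the text first.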

The meaning of non-ASCII single bytes outside the permitted values that are not part of a double-byte character varies from system to system. In old MS-DOS-based systems, they are likely to be displayed as 8-bit characters; in modern systems, they are likely to either give unpredictable results or generate an error.

A more detailed look at the organization

In the original Big5, the encoding is compartmentalized into different zones:

0x8140 to 0xa0fe	Reserved for user-defined characters 造字
0xa140 to 0xa3bf	"Graphical characters" 圖形碼
0xa3c0 to 0xa3fe	Reserved, not for user-defined characters
0xa440 to 0xc67e	Frequently used characters 常用字
0xc6a1 to 0xc8fe	Reserved for user-defined characters
0xc940 to 0xf9d5	Less frequently used characters 次常用字
0xf9d6 to 0xfefe	Reserved for user-defined characters
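The zone table above can be expressed as a simple lookup. This is a minimal sketch: the `ZONES` list copies the ranges from the table, while the function name `zone_of` and the short English zone labels are assumptions made for this example. Codes whose second byte falls outside the trail byte ranges are reported as belonging to no zone.

```python
# (first code, last code, zone label), taken from the table above.
ZONES = [
    (0x8140, 0xA0FE, "user-defined"),
    (0xA140, 0xA3BF, "graphical characters"),
    (0xA3C0, 0xA3FE, "reserved"),
    (0xA440, 0xC67E, "frequently used characters"),
    (0xC6A1, 0xC8FE, "user-defined"),
    (0xC940, 0xF9D5, "less frequently used characters"),
    (0xF9D6, 0xFEFE, "user-defined"),
]


def zone_of(code: int):
    """Return the zone label for a 16-bit Big5 code, or None if no zone fits."""
    trail = code & 0xFF
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        return None  # second byte is not a valid trail byte
    for first, last, label in ZONES:
        if first <= code <= last:
            return label
    return None


if __name__ == "__main__":
    for code in (0xA140, 0xA4A4, 0xC6A1, 0xF9D6, 0xA37F):
        print(f"0x{code:04X}", zone_of(code))
```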

The "graphical characters" actually comprise punctuation marks, partial punctuation marks (e.g., half of a dash, half of an ellipsis; see below), dingbats, foreign characters, and other special characters (e.g., presentational "full width" forms, digits for Suzhou numerals, zhuyin fuhao, etc.)

In most vendor extensions, extended characters are placed in the various zones reserved for user-defined characters, each of which is normally regarded as associated with the preceding zone. For example, additional "graphical characters" (e.g., punctuation marks) would be expected to be placed in the 0xa3c0–0xa3fe range, and additional logograms would be placed in either the 0xc6a1–0xc8fe or the 0xf9d6–0xfefe range. Sometimes, this is not possible due to the large number of extended characters to be added; for example, Cyrillic letters and Japanese kana have been placed in the zone associated with "frequently-used characters".

What a Big5 code actually encodes

An individual Big5 code does not always represent a complete semantic unit. The Big5 codes of logograms are always logograms, but codes in the "graphical characters" section are not always complete "graphical characters". What Big5 encodes are particular graphical representations of characters or part of characters that happen to fit in the space taken by two monospaced ASCII characters. This is a property of double-byte character sets as normally used in CJK (Chinese, Japanese, and Korean) computing, and is not a unique problem of Big5.

(Some historical perspective helps here, as the above is not correct in theory. Back when text-mode personal computing was still the norm, characters were normally represented as single bytes and each character took one position on the screen. There was therefore a practical reason to insist that double-byte characters must take up two positions on the screen, namely that off-the-shelf, American-made software would then be usable without modification in a DBCS-based system. If a character could take an arbitrary number of screen positions, software that assumes that one byte of text takes one screen position would produce incorrect output. Of course, if a computer never had to deal with the text screen, the manufacturer would not enforce this artificial restriction; the Apple Macintosh is an example. Nevertheless, the encoding itself must be designed so that it works correctly on text-screen-based systems.)

To illustrate this point, consider the Big5 code 0xa14b (…). To English speakers this looks like an ellipsis and the Unicode standard identifies it as such; however, in Chinese, the ellipsis consists of six dots that fit in the space of two Chinese characters (……), so in fact there is no Big5 code for the Chinese ellipsis, and the Big5 code 0xa14b just represents half of a Chinese ellipsis. It represents only half of an ellipsis because the whole ellipsis should take the space of two Chinese characters, and in many DBCS systems one DBCS character must take exactly the space of one Chinese character.
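The half-ellipsis can be inspected with Python's built-in "big5" codec and the standard `unicodedata` module. This is only an illustrative sketch; it shows that the single code A14B decodes to the Unicode horizontal ellipsis, and that the six-dot Chinese ellipsis is stored as that code twice.

```python
import unicodedata

# Decode the single Big5 code 0xA14B and look up its Unicode identity.
half = bytes.fromhex("A14B").decode("big5")
print(f"U+{ord(half):04X}", unicodedata.name(half))

# The full Chinese ellipsis is written as two such characters,
# so it occupies two Big5 codes (four bytes).
full_ellipsis = half * 2
encoded = full_ellipsis.encode("big5")
print(encoded.hex().upper(), len(encoded), "bytes")
```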

Characters encoded in Big5 do not always represent things that can be readily used in plain text files; an example is "citation mark" (0xa1ca, ﹋), which is, when used, required to be typeset under the title of literary works. Another example is the Suzhou numerals, which are a form of scientific notation that requires the number to be laid out in a 2-D form consisting of at least two rows.

The Matching SBCS

In practice, Big5 cannot be used without a matching single-byte character set (SBCS), mostly for compatibility reasons. However, as with other CJK DBCS character sets, the SBCS to use has never been specified. Big5 has always been defined as a DBCS; in use it must be paired with a suitable, unspecified SBCS and therefore functions as what some people call a multi-byte character set (MBCS). Nevertheless, Big5 by itself, as defined, is strictly a DBCS.

The SBCS to use being unspecified implies that the SBCS used can theoretically vary from system to system. Nowadays, ASCII is the only possible SBCS one would use. However, in old DOS-based systems, Code Page 437—with its extra special symbols in the control code area including position 127—was much more common. Yet, on a Macintosh system with the Chinese Language Kit, or on a Unix system running the cxterm terminal emulator, the SBCS paired with Big5 would not be Code Page 437.

Outside the valid range of Big5, the old DOS-based systems would routinely interpret things according to the SBCS that is paired with Big5 on that system. In such systems, characters 127 to 160, for example, were very likely not avoided because they would produce invalid Big5, but used because they would be valid characters in Code Page 437.

The modern characterization of Big5 as an MBCS consisting of the DBCS of Big5 plus the SBCS of ASCII is therefore historically incorrect and potentially flawed, as the choice of the matching SBCS was, and theoretically still is, quite independent of the flavour of Big5 being used.

History

The inability of ASCII to support large character sets such as those used for Chinese, Japanese and Korean led governments and industry to find creative solutions to enable their languages to be rendered on computers. A variety of ad hoc and usually proprietary input methods led to efforts to develop a standard system. As a result, Big5 encoding was defined by the Institute for Information Industry of Taiwan in 1984. The name "Big5" recognizes that the standard emerged from the collaboration of five of Taiwan's largest IT firms: Acer (宏碁), MiTAC (神通), JiaJia (佳佳), ZERO ONE Technology (零壹 or 01tech), and First International Computer (FIC) (大眾).

Big5 was rapidly popularized in Taiwan and worldwide among Chinese who used the traditional Chinese character set through its adoption in several commercial software packages, notably the E-TEN Chinese DOS input system (ETen Chinese System). The Republic of China government declared Big5 as its standard in the mid-1980s since it was, by then, the de facto standard for using traditional Chinese on computers.

Extensions

The original Big-5 includes only CJK logograms from two lists, "常用國字標準字體表; cháng yòng guó zì biāo zhǔn zì tǐ biǎo" (4808 characters) and "次常用國字標準字體表; cì cháng yòng guó zì biāo zhǔn zì tǐ biǎo" (6343 characters); it lacks characters used in people's names, place names, dialects, chemistry, and biology, as well as Japanese kana. As a result, much Big-5-supporting software includes extensions to address these gaps.

The plethora of variations makes UTF-8 or UTF-16 a more consistent choice of encoding for modern use.
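Moving legacy Big5 data to UTF-8 amounts to decoding with the right variant and re-encoding. The sketch below assumes Python's built-in codecs ("big5", "cp950" and "big5hkscs" are codec names that ship with CPython); the function name `big5_to_utf8` is hypothetical, and choosing the correct variant for a given file remains the caller's responsibility.

```python
def big5_to_utf8(raw: bytes, variant: str = "big5") -> bytes:
    """Transcode Big5-family bytes to UTF-8.

    `variant` names the Python codec to decode with, e.g. "big5",
    "cp950" or "big5hkscs". Decoding is strict, so bytes that are
    invalid in the chosen variant raise UnicodeDecodeError instead
    of being silently replaced.
    """
    return raw.decode(variant).encode("utf-8")


if __name__ == "__main__":
    legacy = "中文".encode("big5")
    converted = big5_to_utf8(legacy)
    print(len(legacy), "bytes in Big5;", len(converted), "bytes in UTF-8")
```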

Vendor extensions

ETEN extensions

In the ETEN (倚天) Chinese operating system, the following code points were added to make it compliant with the IBM 5550 code page:

A3C0–A3E0: 33 control characters.
C6A1–C875: circle 1–10, bracket 1–10, Roman numerals 1–9 (i–ix), CJK radical glyphs, Japanese hiragana, Japanese katakana, Cyrillic characters
F9D6–F9FE: '碁', '銹', '恒', '裏', '墻', '粧', '嫺', and 34 extra symbols.

In some versions of Eten, there are extra graphical symbols and Simplified Chinese characters.

Microsoft code pages

Microsoft (微軟) created its own version of Big5 extension as Code page 950 for use with Microsoft Windows, which supports ETEN's extensions, but only the F9D6-F9FE code points. In Windows ME, the euro currency symbol was mapped to Big-5 code point A3E1, but not in later versions of the operating system.

After installing Microsoft's HKSCS patch on top of traditional Chinese Windows (or any version of Windows 2000 and above with proper language pack), applications using code page 950 automatically use a hidden code page 951 table. The table supports all code points in HKSCS-2001, except for the compatibility code points specified by the standard.[3]

Code page 950 as used by Windows 2000 and Windows XP maps hiragana and katakana characters to the Unicode Private Use Area when exporting to Unicode, whereas Windows Vista maps them to the proper hiragana and katakana Unicode blocks.

ChinaSea font

ChinaSea fonts (中國海字集)[4] are Traditional Chinese fonts made by ChinaSea. The fonts are rarely sold separately, but are bundled with other products, such as the Chinese version of Microsoft Office 97. The fonts support Japanese kana, kokuji, and other characters missing in Big-5. As a result, the ChinaSea extensions have become more popular than the government-supported extensions. Some Hong Kong BBSes had used encodings in ChinaSea fonts before the introduction of HKSCS.

'Sakura' font

The 'Sakura' font (日和字集 Sakura Version) was developed in Hong Kong and is designed to be compatible with HKSCS. It adds support for kokuji and proprietary dingbats (including Doraemon) not found in HKSCS.

Unicode-at-on

Unicode-at-on (Unicode補完計畫), formerly BIG5 extension, extends BIG-5 by altering code page tables, but uses the ChinaSea extensions starting with version 2. However, with the bankruptcy of ChinaSea, late development, and the increasing popularity of HKSCS and Unicode (the project is not compatible with HKSCS), the success of this extension is limited at best.

Despite the problems, characters previously mapped to Unicode Private Use Area are remapped to the standardized equivalents when exporting characters to Unicode format.

OPG

The web sites of the Oriental Daily News and Sun Daily, belonging to the Oriental Press Group Limited (東方報業集團有限公司) in Hong Kong, used a downloadable font with a different Big-5 extension coding than the HKSCS.

Official extensions

Taiwan Ministry of Education font

The Taiwan Ministry of Education supplied its own font, the Taiwan Ministry of Education font (臺灣教育部造字檔) for use internally.

Taiwan Council of Agriculture font

Taiwan's Council of Agriculture, Executive Yuan, introduced a 133-character custom font, the Taiwan Council of Agriculture font (臺灣農委會常用中文外字集), which includes 84 characters from the 'fish' radical and 7 from the 'bird' radical.

Big5+

The Chinese Foundation for Digitization Technology (中文數位化技術推廣委員會) introduced Big5+ in 1997, which used over 20000 code points to incorporate all CJK logograms in Unicode 1.1. However, the extra code points exceeded the original Big-5 definition (Big5+ uses high byte values 81-FE and low byte values 40-7E and 80-FE), preventing it from being installed on Microsoft Windows without new codepage files.

Big-5E

To allow Windows users to use custom fonts, the Chinese Foundation for Digitization Technology introduced Big-5E, which added 3954 characters (in three blocks of code points: 8E40-A0FE, 8140-86DF, 86E0-875C) and removed the Japanese kana from the ETEN extension. Unlike Big5+, Big-5E extends Big-5 within its original definition. Mac OS X 10.3 and later support Big-5E in the fonts LiHei Pro (儷黑 Pro.ttf) and LiSong Pro (儷宋 Pro.ttf).

Big5-2003

The Chinese Foundation for Digitization Technology made a Big5 definition and put it into CNS 11643 in note form, making it part of the official standard in Taiwan.

Big5-2003 incorporates all Big-5 characters introduced in the 1984 ETEN extensions (code points A3C0-A3E0, C6A1-C7F2, and F9D6-F9FE) and the Euro symbol. Cyrillic characters were not included because the authority claimed CNS 11643 does not include such characters.

CDP

Academia Sinica made a Chinese Data Processing font (漢字構形資料庫) in the late 1990s; its latest release, version 2.5, included 112,533 characters, somewhat fewer than the Mojikyo fonts.

HKSCS

Hong Kong also adopted Big5 for character encoding. However, written Cantonese has its own characters not available in the normal Big5 character set. To solve this problem, the Hong Kong Government created the Big5 extensions Government Chinese Character Set (GCCS) in 1995 and Hong Kong Supplementary Character Set (HKSCS) in 1999. The Hong Kong extensions were commonly distributed as a patch. They are still distributed as a patch by Microsoft, but a full Unicode font is also available from the Hong Kong Government's web site.

There are two encoding schemes of HKSCS: one encoding scheme is for the Big-5 coding standard and the other is for the ISO 10646 standard. Subsequent to the initial release, there are also HKSCS-2001 and HKSCS-2004. The HKSCS-2004 is aligned technically with the ISO/IEC 10646:2003 and its Amendment 1 published in April 2004 by the International Organization for Standardization (ISO).

HKSCS includes all the characters from the common ETEN extension, plus some characters from Simplified Chinese, place names, people's names, and Cantonese phrases (including profanity).

As of 2020, the most recent edition of HKSCS is HKSCS-2016; however, the last edition of HKSCS to encode all of its characters in Big5 was HKSCS-2008, while the characters added in more recent editions are mapped to ISO 10646 / Unicode only (as a CJK Unified Ideographs horizontal glyph extension where appropriate).[5] Similarly to Hong Kong's situation, Macao needs characters that are included in neither Big5 nor HKSCS; hence, the Macao Supplementary Character Set (MSCS) was developed, comprising characters not found in Big5 or HKSCS. This set, however, is also not encoded in Big5. The first batch of 121 MSCS characters was submitted for inclusion in or mapping to Unicode in 2009,[6] and the first final version of MSCS was established in 2020.[5]

Kana and Cyrillic

There are two major Big5 extension layouts for encoding kana, Russian Cyrillic and list markers in the range 0xC6A1 through 0xC875. These are not compatible with one another.[7] They are compared in the table below.

The ETEN layout of kana and Cyrillic is also used by the HKSCS[8] (including HTML5)[9] and Unicode-At-On[10] variants, and the ETEN layout of the kana (with Cyrillic omitted) is also used by the Big5-2003 variant.[11] The published mapping files for Windows-950 include neither, and this Big5 range is mapped to the Private Use Area by the Windows-950 implementation from International Components for Unicode.[12] Python's cp950 codec uses the BIG5.TXT layout.[13]

Big5 codes 0xC6A1 through 0xC875

Big5 code	BIG5.TXT layout[14]	ETEN layout[15]
0xC6A1	ヾ	①
0xC6A2	ゝ	②
0xC6A3	ゞ	③
0xC6A4	々	④
0xC6A5	ぁ	⑤
0xC6A6	あ	⑥
0xC6A7	ぃ	⑦
0xC6A8	い	⑧
0xC6A9	ぅ	⑨
0xC6AA	う	⑩
0xC6AB	ぇ	⑴
0xC6AC	え	⑵
0xC6AD	ぉ	⑶
0xC6AE	お	⑷
0xC6AF	か	⑸
0xC6B0	が	⑹
0xC6B1	き	⑺
0xC6B2	ぎ	⑻
0xC6B3	く	⑼
0xC6B4	ぐ	⑽
0xC6B5	け	ⅰ
0xC6B6	げ	ⅱ
0xC6B7	こ	ⅲ
0xC6B8	ご	ⅳ
0xC6B9	さ	ⅴ
0xC6BA	ざ	ⅵ
0xC6BB	し	ⅶ
0xC6BC	じ	ⅷ
0xC6BD	す	ⅸ
0xC6BE	ず	ⅹ
0xC6BF	せ	丶
0xC6C0	ぜ	丿
0xC6C1	そ	亅
0xC6C2	ぞ	亠
0xC6C3	た	冂
0xC6C4	だ	冖
0xC6C5	ち	冫
0xC6C6	ぢ	勹
0xC6C7	っ	匸
0xC6C8	つ	卩
0xC6C9	づ	厶
0xC6CA	て	夊
0xC6CB	で	宀
0xC6CC	と	巛
0xC6CD	ど	⼳
0xC6CE	な	广
0xC6CF	に	廴
0xC6D0	ぬ	彐
0xC6D1	ね	彡
0xC6D2	の	攴
0xC6D3	は	无
0xC6D4	ば	疒
0xC6D5	ぱ	癶
0xC6D6	ひ	辵
0xC6D7	び	隶
0xC6D8	ぴ	¨
0xC6D9	ふ	ˆ
0xC6DA	ぶ	ヽ
0xC6DB	ぷ	ヾ
0xC6DC	へ	ゝ
0xC6DD	べ	ゞ
0xC6DE	ぺ	〃
0xC6DF	ほ	仝
0xC6E0	ぼ	々
0xC6E1	ぽ	〆
0xC6E2	ま	〇
0xC6E3	み	ー
0xC6E4	む	［
0xC6E5	め	］
0xC6E6	も	✽
0xC6E7	ゃ	ぁ
0xC6E8	や	あ
0xC6E9	ゅ	ぃ
0xC6EA	ゆ	い
0xC6EB	ょ	ぅ
0xC6EC	よ	う
0xC6ED	ら	ぇ
0xC6EE	り	え
0xC6EF	る	ぉ
0xC6F0	れ	お
0xC6F1	ろ	か
0xC6F2	ゎ	が
0xC6F3	わ	き
0xC6F4	ゐ	ぎ
0xC6F5	ゑ	く
0xC6F6	を	ぐ
0xC6F7	ん	け
0xC6F8	ァ	げ
0xC6F9	ア	こ
0xC6FA	ィ	ご
0xC6FB	イ	さ
0xC6FC	ゥ	ざ
0xC6FD	ウ	し
0xC6FE	ェ	じ
0xC740	エ	す
0xC741	ォ	ず
0xC742	オ	せ
0xC743	カ	ぜ
0xC744	ガ	そ
0xC745	キ	ぞ
0xC746	ギ	た
0xC747	ク	だ
0xC748	グ	ち
0xC749	ケ	ぢ
0xC74A	ゲ	っ
0xC74B	コ	つ
0xC74C	ゴ	づ
0xC74D	サ	て
0xC74E	ザ	で
0xC74F	シ	と
0xC750	ジ	ど
0xC751	ス	な
0xC752	ズ	に
0xC753	セ	ぬ
0xC754	ゼ	ね
0xC755	ソ	の
0xC756	ゾ	は
0xC757	タ	ば
0xC758	ダ	ぱ
0xC759	チ	ひ
0xC75A	ヂ	び
0xC75B	ッ	ぴ
0xC75C	ツ	ふ
0xC75D	ヅ	ぶ
0xC75E	テ	ぷ
0xC75F	デ	へ
0xC760	ト	べ
0xC761	ド	ぺ
0xC762	ナ	ほ
0xC763	ニ	ぼ
0xC764	ヌ	ぽ
0xC765	ネ	ま
0xC766	ノ	み
0xC767	ハ	む
0xC768	バ	め
0xC769	パ	も
0xC76A	ヒ	ゃ
0xC76B	ビ	や
0xC76C	ピ	ゅ
0xC76D	フ	ゆ
0xC76E	ブ	ょ
0xC76F	プ	よ
0xC770	ヘ	ら
0xC771	ベ	り
0xC772	ペ	る
0xC773	ホ	れ
0xC774	ボ	ろ
0xC775	ポ	ゎ
0xC776	マ	わ
0xC777	ミ	ゐ
0xC778	ム	ゑ
0xC779	メ	を
0xC77A	モ	ん
0xC77B	ャ	ァ
0xC77C	ヤ	ア
0xC77D	ュ	ィ
0xC77E	ユ	イ
0xC7A1	ョ	ゥ
0xC7A2	ヨ	ウ
0xC7A3	ラ	ェ
0xC7A4	リ	エ
0xC7A5	ル	ォ
0xC7A6	レ	オ
0xC7A7	ロ	カ
0xC7A8	ヮ	ガ
0xC7A9	ワ	キ
0xC7AA	ヰ	ギ
0xC7AB	ヱ	ク
0xC7AC	ヲ	グ
0xC7AD	ン	ケ
0xC7AE	ヴ	ゲ
0xC7AF	ヵ	コ
0xC7B0	ヶ	ゴ
0xC7B1	Д	サ
0xC7B2	Е	ザ
0xC7B3	Ё	シ
0xC7B4	Ж	ジ
0xC7B5	З	ス
0xC7B6	И	ズ
0xC7B7	Й	セ
0xC7B8	К	ゼ
0xC7B9	Л	ソ
0xC7BA	М	ゾ
0xC7BB	У	タ
0xC7BC	Ф	ダ
0xC7BD	Х	チ
0xC7BE	Ц	ヂ
0xC7BF	Ч	ッ
0xC7C0	Ш	ツ
0xC7C1	Щ	ヅ
0xC7C2	Ъ	テ
0xC7C3	Ы	デ
0xC7C4	Ь	ト
0xC7C5	Э	ド
0xC7C6	Ю	ナ
0xC7C7	Я	ニ
0xC7C8	а	ヌ
0xC7C9	б	ネ
0xC7CA	в	ノ
0xC7CB	г	ハ
0xC7CC	д	バ
0xC7CD	е	パ
0xC7CE	ё	ヒ
0xC7CF	ж	ビ
0xC7D0	з	ピ
0xC7D1	и	フ
0xC7D2	й	ブ
0xC7D3	к	プ
0xC7D4	л	ヘ
0xC7D5	м	ベ
0xC7D6	н	ペ
0xC7D7	о	ホ
0xC7D8	п	ボ
0xC7D9	р	ポ
0xC7DA	с	マ
0xC7DB	т	ミ
0xC7DC	у	ム
0xC7DD	ф	メ
0xC7DE	х	モ
0xC7DF	ц	ャ
0xC7E0	ч	ヤ
0xC7E1	ш	ュ
0xC7E2	щ	ユ
0xC7E3	ъ	ョ
0xC7E4	ы	ヨ
0xC7E5	ь	ラ
0xC7E6	э	リ
0xC7E7	ю	ル
0xC7E8	я	レ
0xC7E9	①	ロ
0xC7EA	②	ヮ
0xC7EB	③	ワ
0xC7EC	④	ヰ
0xC7ED	⑤	ヱ
0xC7EE	⑥	ヲ
0xC7EF	⑦	ン
0xC7F0	⑧	ヴ
0xC7F1	⑨	ヵ
0xC7F2	⑩	ヶ
0xC7F3	⑴	А
0xC7F4	⑵	Б
0xC7F5	⑶	В
0xC7F6	⑷	Г
0xC7F7	⑸	Д
0xC7F8	⑹	Е
0xC7F9	⑺	Ё
0xC7FA	⑻	Ж
0xC7FB	⑼	З
0xC7FC	⑽	И
0xC7FD	(not used)	Й
0xC7FE	(not used)	К
0xC840	(not used)	Л
0xC841	(not used)	М
0xC842	(not used)	Н
0xC843	(not used)	О
0xC844	(not used)	П
0xC845	(not used)	Р
0xC846	(not used)	С
0xC847	(not used)	Т
0xC848	(not used)	У
0xC849	(not used)	Ф
0xC84A	(not used)	Х
0xC84B	(not used)	Ц
0xC84C	(not used)	Ч
0xC84D	(not used)	Ш
0xC84E	(not used)	Щ
0xC84F	(not used)	Ъ
0xC850	(not used)	Ы
0xC851	(not used)	Ь
0xC852	(not used)	Э
0xC853	(not used)	Ю
0xC854	(not used)	Я
0xC855	(not used)	а
0xC856	(not used)	б
0xC857	(not used)	в
0xC858	(not used)	г
0xC859	(not used)	д
0xC85A	(not used)	е
0xC85B	(not used)	ё
0xC85C	(not used)	ж
0xC85D	(not used)	з
0xC85E	(not used)	и
0xC85F	(not used)	й
0xC860	(not used)	к
0xC861	(not used)	л
0xC862	(not used)	м
0xC863	(not used)	н
0xC864	(not used)	о
0xC865	(not used)	п
0xC866	(not used)	р
0xC867	(not used)	с
0xC868	(not used)	т
0xC869	(not used)	у
0xC86A	(not used)	ф
0xC86B	(not used)	х
0xC86C	(not used)	ц
0xC86D	(not used)	ч
0xC86E	(not used)	ш
0xC86F	(not used)	щ
0xC870	(not used)	ъ
0xC871	(not used)	ы
0xC872	(not used)	ь
0xC873	(not used)	э
0xC874	(not used)	ю
0xC875	(not used)	я
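To see how a particular implementation treats this range, one can decode the same code with several codecs and compare the results. The sketch below uses Python's built-in "big5", "cp950" and "big5hkscs" codecs; the helper names are hypothetical, and no particular output is claimed here because the mappings depend on the codec tables shipped with the interpreter.

```python
def decode_with(codec: str, raw: bytes):
    """Decode raw bytes with the named codec, or return None if unmapped."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


def compare_codecs(code: int, codecs=("big5", "cp950", "big5hkscs")) -> dict:
    """Map each codec name to its decoding of one 16-bit Big5 code."""
    raw = code.to_bytes(2, "big")
    return {name: decode_with(name, raw) for name in codecs}


if __name__ == "__main__":
    # A few codes from the table above; differing results across codecs
    # indicate that the codecs follow different layouts for this range.
    for code in (0xC6A1, 0xC6E7, 0xC7F3):
        print(f"0x{code:04X}", compare_codecs(code))
```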

References

chinese mac Character Sets
Apple, Inc (2005-04-04) [1996-06-31]. Map (external version) from Mac OS Chinese Traditional encoding to Unicode 3.0 and later. Unicode Consortium.
"狗爺語錄 » Blog Archive » What is Code Page 951 (CP951)?". Archived from the original on 2007-02-22. Retrieved 2006-09-27.
黃國書. "Chinasea 1.0 中國海字集". ISU FTP. Archived from the original on 2005-03-19. Retrieved 2016-12-05.
Macao Special Administrative Region Government (2020-06-11). "Submission of Macao's Vertical Extension (UNC Characters), Horizontal Extension, and IVSes Registration for MSCS" (PDF). ISO/IEC JTC 1/SC 2/WG 2 IRGN 2430.
Computer Chinese Characters Encoding Workgroup (2009-06-12). "Submission of Characters from Macao Information Systems Character Set" (PDF). ISO/IEC JTC 1/SC 2/WG 2 IRGN 1580. Archived from the original (PDF) on 2015-01-04.
Lunde, Ken (1996-07-12). "2.3.1: BIG FIVE". CJK.INF Version 2.1.
"Big5HKSCS-2004". Mozilla Taiwan.
van Kesteren, Anne. "big5". Encoding Standard. WHATWG.
"UAO 2.41 b2u". Mozilla Taiwan.
"Big5-2003 b2u". Mozilla Taiwan.
IBM; Unicode Consortium (2002-12-03). "windows-950-2000". International Components for Unicode.
Script showing output of cp950 codec for lead bytes 0xC6 and 0xC7
Unicode Consortium (2015-12-02) [1994-02-11]. BIG5 to Unicode table (complete).
"Big5-ETen vs Unicode mapping table". Mozilla Taiwan. 2002-02-24.

External links

Mozilla and the Big5 Family of Encodings: an overview of Big5 encodings with code charts for each extension and relevant Firefox bugs (Traditional Chinese)
Big5 character code table
Chinese character codes: an update by Christian Wittern
CNS 11643 official web site has information about the Big5e character set (an extended version of Big5) in the "Chinese Information Code" section.
Big5 introduction Contains differences between extensions.
Graphical View of Big5 in ICU's Converter Explorer
教育部標準字體 Download page of the Taiwan Ministry of Education fonts
文獻處理實驗室 Download pages of the CDP font
Hong Kong Supplementary Character Set Info Downloadable HKSCS documents & font
香港參考宋體 Download page of Dynalab(華康科技有限公司)'s HKSCS font.
Microsoft's Windows Codepage 950 (Traditional Chinese Big5)
on.cc Download page of the OPG font
中國海字集視窗版(v3.0)下載網頁 Download page of the ChinaSea font
Big5 Codeset Overview
Python Script to print cp950 codec

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
