Task

Today, your task is to invent your own Unicode encoding, following these rules:

Write an encoder and a decoder in any language of your choice

The encoder's input is a list of code points (as integers) and it outputs a list of bytes (as integers) corresponding to your encoding.

The decoder does the opposite (bytes => code points)

Your implementation has to cover all Unicode 7.0.0 assigned code points

It has to stay backwards-compatible with ASCII, i.e. encode Basic Latin characters (U+0000-U+007F) in one byte, with 0 as the most significant bit.

Encode all the other assigned code points in any form and any number of bytes you want, as long as there is no ambiguity (i.e. two code points or groups of code points can't have the same encoding, and a byte sequence can't decode in more than one way)

Your implementation doesn't have to cover UTF-16 surrogates (code points U+D800-U+DFFF) nor private use areas (U+E000-U+F8FF, U+F0000-U+10FFFF)

Your encoding must be context-independent (i.e. not rely on previously encoded characters) and does NOT require self-synchronization (i.e. each byte doesn't have to reveal where it is located in the encoding of a code point, as it does in UTF-8).
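One simple way to satisfy the no-ambiguity rule is to make the set of codewords prefix-free: if no codeword is a prefix of another, a byte stream can only be split in one way. This is a sufficient condition, not a requirement of the challenge. A minimal sketch of such a check, where `is_prefix_free` and the sample codewords are made up for illustration:

```python
def is_prefix_free(codewords):
    """Return True if no codeword equals or is a prefix of another."""
    ordered = sorted(codewords)
    # After sorting, any codeword that is a prefix of another one is
    # immediately followed by a codeword that starts with it.
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))

# Hypothetical codeword sets, only to exercise the check.
unambiguous = [b'A', b'B', b'\x80\x00', b'\xfe\x00\x00']
ambiguous = [b'A', b'B', b'AB']   # b'AB' could also be read as 'A' then 'B'

print(is_prefix_free(unambiguous))
print(is_prefix_free(ambiguous))
```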

To sum up, here are the blocks that you have to cover, in JSON:

[
  [0x0000,0x007F], // Basic Latin
  [0x0080,0x00FF], // Latin-1 Supplement
  [0x0100,0x017F], // Latin Extended-A
  [0x0180,0x024F], // Latin Extended-B
  [0x0250,0x02AF], // IPA Extensions
  [0x02B0,0x02FF], // Spacing Modifier Letters
  [0x0300,0x036F], // Combining Diacritical Marks
  [0x0370,0x03FF], // Greek and Coptic
  [0x0400,0x04FF], // Cyrillic
  [0x0500,0x052F], // Cyrillic Supplement
  [0x0530,0x058F], // Armenian
  [0x0590,0x05FF], // Hebrew
  [0x0600,0x06FF], // Arabic
  [0x0700,0x074F], // Syriac
  [0x0750,0x077F], // Arabic Supplement
  [0x0780,0x07BF], // Thaana
  [0x07C0,0x07FF], // NKo
  [0x0800,0x083F], // Samaritan
  [0x0840,0x085F], // Mandaic
  [0x08A0,0x08FF], // Arabic Extended-A
  [0x0900,0x097F], // Devanagari
  [0x0980,0x09FF], // Bengali
  [0x0A00,0x0A7F], // Gurmukhi
  [0x0A80,0x0AFF], // Gujarati
  [0x0B00,0x0B7F], // Oriya
  [0x0B80,0x0BFF], // Tamil
  [0x0C00,0x0C7F], // Telugu
  [0x0C80,0x0CFF], // Kannada
  [0x0D00,0x0D7F], // Malayalam
  [0x0D80,0x0DFF], // Sinhala
  [0x0E00,0x0E7F], // Thai
  [0x0E80,0x0EFF], // Lao
  [0x0F00,0x0FFF], // Tibetan
  [0x1000,0x109F], // Myanmar
  [0x10A0,0x10FF], // Georgian
  [0x1100,0x11FF], // Hangul Jamo
  [0x1200,0x137F], // Ethiopic
  [0x1380,0x139F], // Ethiopic Supplement
  [0x13A0,0x13FF], // Cherokee
  [0x1400,0x167F], // Unified Canadian Aboriginal Syllabics
  [0x1680,0x169F], // Ogham
  [0x16A0,0x16FF], // Runic
  [0x1700,0x171F], // Tagalog
  [0x1720,0x173F], // Hanunoo
  [0x1740,0x175F], // Buhid
  [0x1760,0x177F], // Tagbanwa
  [0x1780,0x17FF], // Khmer
  [0x1800,0x18AF], // Mongolian
  [0x18B0,0x18FF], // Unified Canadian Aboriginal Syllabics Extended
  [0x1900,0x194F], // Limbu
  [0x1950,0x197F], // Tai Le
  [0x1980,0x19DF], // New Tai Lue
  [0x19E0,0x19FF], // Khmer Symbols
  [0x1A00,0x1A1F], // Buginese
  [0x1A20,0x1AAF], // Tai Tham
  [0x1AB0,0x1AFF], // Combining Diacritical Marks Extended
  [0x1B00,0x1B7F], // Balinese
  [0x1B80,0x1BBF], // Sundanese
  [0x1BC0,0x1BFF], // Batak
  [0x1C00,0x1C4F], // Lepcha
  [0x1C50,0x1C7F], // Ol Chiki
  [0x1CC0,0x1CCF], // Sundanese Supplement
  [0x1CD0,0x1CFF], // Vedic Extensions
  [0x1D00,0x1D7F], // Phonetic Extensions
  [0x1D80,0x1DBF], // Phonetic Extensions Supplement
  [0x1DC0,0x1DFF], // Combining Diacritical Marks Supplement
  [0x1E00,0x1EFF], // Latin Extended Additional
  [0x1F00,0x1FFF], // Greek Extended
  [0x2000,0x206F], // General Punctuation
  [0x2070,0x209F], // Superscripts and Subscripts
  [0x20A0,0x20CF], // Currency Symbols
  [0x20D0,0x20FF], // Combining Diacritical Marks for Symbols
  [0x2100,0x214F], // Letterlike Symbols
  [0x2150,0x218F], // Number Forms
  [0x2190,0x21FF], // Arrows
  [0x2200,0x22FF], // Mathematical Operators
  [0x2300,0x23FF], // Miscellaneous Technical
  [0x2400,0x243F], // Control Pictures
  [0x2440,0x245F], // Optical Character Recognition
  [0x2460,0x24FF], // Enclosed Alphanumerics
  [0x2500,0x257F], // Box Drawing
  [0x2580,0x259F], // Block Elements
  [0x25A0,0x25FF], // Geometric Shapes
  [0x2600,0x26FF], // Miscellaneous Symbols
  [0x2700,0x27BF], // Dingbats
  [0x27C0,0x27EF], // Miscellaneous Mathematical Symbols-A
  [0x27F0,0x27FF], // Supplemental Arrows-A
  [0x2800,0x28FF], // Braille Patterns
  [0x2900,0x297F], // Supplemental Arrows-B
  [0x2980,0x29FF], // Miscellaneous Mathematical Symbols-B
  [0x2A00,0x2AFF], // Supplemental Mathematical Operators
  [0x2B00,0x2BFF], // Miscellaneous Symbols and Arrows
  [0x2C00,0x2C5F], // Glagolitic
  [0x2C60,0x2C7F], // Latin Extended-C
  [0x2C80,0x2CFF], // Coptic
  [0x2D00,0x2D2F], // Georgian Supplement
  [0x2D30,0x2D7F], // Tifinagh
  [0x2D80,0x2DDF], // Ethiopic Extended
  [0x2DE0,0x2DFF], // Cyrillic Extended-A
  [0x2E00,0x2E7F], // Supplemental Punctuation
  [0x2E80,0x2EFF], // CJK Radicals Supplement
  [0x2F00,0x2FDF], // Kangxi Radicals
  [0x2FF0,0x2FFF], // Ideographic Description Characters
  [0x3000,0x303F], // CJK Symbols and Punctuation
  [0x3040,0x309F], // Hiragana
  [0x30A0,0x30FF], // Katakana
  [0x3100,0x312F], // Bopomofo
  [0x3130,0x318F], // Hangul Compatibility Jamo
  [0x3190,0x319F], // Kanbun
  [0x31A0,0x31BF], // Bopomofo Extended
  [0x31C0,0x31EF], // CJK Strokes
  [0x31F0,0x31FF], // Katakana Phonetic Extensions
  [0x3200,0x32FF], // Enclosed CJK Letters and Months
  [0x3300,0x33FF], // CJK Compatibility
  [0x3400,0x4DBF], // CJK Unified Ideographs Extension A
  [0x4DC0,0x4DFF], // Yijing Hexagram Symbols
  [0x4E00,0x9FFF], // CJK Unified Ideographs
  [0xA000,0xA48F], // Yi Syllables
  [0xA490,0xA4CF], // Yi Radicals
  [0xA4D0,0xA4FF], // Lisu
  [0xA500,0xA63F], // Vai
  [0xA640,0xA69F], // Cyrillic Extended-B
  [0xA6A0,0xA6FF], // Bamum
  [0xA700,0xA71F], // Modifier Tone Letters
  [0xA720,0xA7FF], // Latin Extended-D
  [0xA800,0xA82F], // Syloti Nagri
  [0xA830,0xA83F], // Common Indic Number Forms
  [0xA840,0xA87F], // Phags-pa
  [0xA880,0xA8DF], // Saurashtra
  [0xA8E0,0xA8FF], // Devanagari Extended
  [0xA900,0xA92F], // Kayah Li
  [0xA930,0xA95F], // Rejang
  [0xA960,0xA97F], // Hangul Jamo Extended-A
  [0xA980,0xA9DF], // Javanese
  [0xA9E0,0xA9FF], // Myanmar Extended-B
  [0xAA00,0xAA5F], // Cham
  [0xAA60,0xAA7F], // Myanmar Extended-A
  [0xAA80,0xAADF], // Tai Viet
  [0xAAE0,0xAAFF], // Meetei Mayek Extensions
  [0xAB00,0xAB2F], // Ethiopic Extended-A
  [0xAB30,0xAB6F], // Latin Extended-E
  [0xABC0,0xABFF], // Meetei Mayek
  [0xAC00,0xD7AF], // Hangul Syllables
  [0xD7B0,0xD7FF], // Hangul Jamo Extended-B
  [0xF900,0xFAFF], // CJK Compatibility Ideographs
  [0xFB00,0xFB4F], // Alphabetic Presentation Forms
  [0xFB50,0xFDFF], // Arabic Presentation Forms-A
  [0xFE00,0xFE0F], // Variation Selectors
  [0xFE10,0xFE1F], // Vertical Forms
  [0xFE20,0xFE2F], // Combining Half Marks
  [0xFE30,0xFE4F], // CJK Compatibility Forms
  [0xFE50,0xFE6F], // Small Form Variants
  [0xFE70,0xFEFF], // Arabic Presentation Forms-B
  [0xFF00,0xFFEF], // Halfwidth and Fullwidth Forms
  [0xFFF0,0xFFFF], // Specials
  [0x10000,0x1007F], // Linear B Syllabary
  [0x10080,0x100FF], // Linear B Ideograms
  [0x10100,0x1013F], // Aegean Numbers
  [0x10140,0x1018F], // Ancient Greek Numbers
  [0x10190,0x101CF], // Ancient Symbols
  [0x101D0,0x101FF], // Phaistos Disc
  [0x10280,0x1029F], // Lycian
  [0x102A0,0x102DF], // Carian
  [0x102E0,0x102FF], // Coptic Epact Numbers
  [0x10300,0x1032F], // Old Italic
  [0x10330,0x1034F], // Gothic
  [0x10350,0x1037F], // Old Permic
  [0x10380,0x1039F], // Ugaritic
  [0x103A0,0x103DF], // Old Persian
  [0x10400,0x1044F], // Deseret
  [0x10450,0x1047F], // Shavian
  [0x10480,0x104AF], // Osmanya
  [0x10500,0x1052F], // Elbasan
  [0x10530,0x1056F], // Caucasian Albanian
  [0x10600,0x1077F], // Linear A
  [0x10800,0x1083F], // Cypriot Syllabary
  [0x10840,0x1085F], // Imperial Aramaic
  [0x10860,0x1087F], // Palmyrene
  [0x10880,0x108AF], // Nabataean
  [0x10900,0x1091F], // Phoenician
  [0x10920,0x1093F], // Lydian
  [0x10980,0x1099F], // Meroitic Hieroglyphs
  [0x109A0,0x109FF], // Meroitic Cursive
  [0x10A00,0x10A5F], // Kharoshthi
  [0x10A60,0x10A7F], // Old South Arabian
  [0x10A80,0x10A9F], // Old North Arabian
  [0x10AC0,0x10AFF], // Manichaean
  [0x10B00,0x10B3F], // Avestan
  [0x10B40,0x10B5F], // Inscriptional Parthian
  [0x10B60,0x10B7F], // Inscriptional Pahlavi
  [0x10B80,0x10BAF], // Psalter Pahlavi
  [0x10C00,0x10C4F], // Old Turkic
  [0x10E60,0x10E7F], // Rumi Numeral Symbols
  [0x11000,0x1107F], // Brahmi
  [0x11080,0x110CF], // Kaithi
  [0x110D0,0x110FF], // Sora Sompeng
  [0x11100,0x1114F], // Chakma
  [0x11150,0x1117F], // Mahajani
  [0x11180,0x111DF], // Sharada
  [0x111E0,0x111FF], // Sinhala Archaic Numbers
  [0x11200,0x1124F], // Khojki
  [0x112B0,0x112FF], // Khudawadi
  [0x11300,0x1137F], // Grantha
  [0x11480,0x114DF], // Tirhuta
  [0x11580,0x115FF], // Siddham
  [0x11600,0x1165F], // Modi
  [0x11680,0x116CF], // Takri
  [0x118A0,0x118FF], // Warang Citi
  [0x11AC0,0x11AFF], // Pau Cin Hau
  [0x12000,0x123FF], // Cuneiform
  [0x12400,0x1247F], // Cuneiform Numbers and Punctuation
  [0x13000,0x1342F], // Egyptian Hieroglyphs
  [0x16800,0x16A3F], // Bamum Supplement
  [0x16A40,0x16A6F], // Mro
  [0x16AD0,0x16AFF], // Bassa Vah
  [0x16B00,0x16B8F], // Pahawh Hmong
  [0x16F00,0x16F9F], // Miao
  [0x1B000,0x1B0FF], // Kana Supplement
  [0x1BC00,0x1BC9F], // Duployan
  [0x1BCA0,0x1BCAF], // Shorthand Format Controls
  [0x1D000,0x1D0FF], // Byzantine Musical Symbols
  [0x1D100,0x1D1FF], // Musical Symbols
  [0x1D200,0x1D24F], // Ancient Greek Musical Notation
  [0x1D300,0x1D35F], // Tai Xuan Jing Symbols
  [0x1D360,0x1D37F], // Counting Rod Numerals
  [0x1D400,0x1D7FF], // Mathematical Alphanumeric Symbols
  [0x1E800,0x1E8DF], // Mende Kikakui
  [0x1EE00,0x1EEFF], // Arabic Mathematical Alphabetic Symbols
  [0x1F000,0x1F02F], // Mahjong Tiles
  [0x1F030,0x1F09F], // Domino Tiles
  [0x1F0A0,0x1F0FF], // Playing Cards
  [0x1F100,0x1F1FF], // Enclosed Alphanumeric Supplement
  [0x1F200,0x1F2FF], // Enclosed Ideographic Supplement
  [0x1F300,0x1F5FF], // Miscellaneous Symbols and Pictographs
  [0x1F600,0x1F64F], // Emoticons
  [0x1F650,0x1F67F], // Ornamental Dingbats
  [0x1F680,0x1F6FF], // Transport and Map Symbols
  [0x1F700,0x1F77F], // Alchemical Symbols
  [0x1F780,0x1F7FF], // Geometric Shapes Extended
  [0x1F800,0x1F8FF], // Supplemental Arrows-C
  [0x20000,0x2A6DF], // CJK Unified Ideographs Extension B
  [0x2A700,0x2B73F], // CJK Unified Ideographs Extension C
  [0x2B740,0x2B81F], // CJK Unified Ideographs Extension D
  [0x2F800,0x2FA1F], // CJK Compatibility Ideographs Supplement
  [0xE0000,0xE007F], // Tags
  [0xE0100,0xE01EF]  // Variation Selectors Supplement
]

Total: 116,816 code points.

Answers

Score: 318080 (~2.72 bytes/char)

I assume the encoding does not need to be self-synchronizing (as UTF-8 and UTF-16 are).
Because there is no example text, I assume the encoding is context-independent (unlike e.g. SCSU), i.e. the encoder converts every code point to a unique byte sequence.
log₂ 116816 = 16.83 > 16, so any encoding must map some code points to 3 bytes.
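The three-byte claim can be double-checked with a counting argument. The sketch below assumes the Kraft-McMillan inequality as the relevant bound (any uniquely decodable code over bytes satisfies Σ 256^(−length) ≤ 1); the variable names are chosen here for illustration only.

```python
import math
from fractions import Fraction

TOTAL = 116816   # code points the challenge requires
ASCII = 128      # U+0000-U+007F, fixed at one byte each

# Width of a fixed-size number that can distinguish every code point.
bits_needed = math.ceil(math.log2(TOTAL))

# Kraft-McMillan budget left after the 128 one-byte ASCII codewords.
budget = 1 - Fraction(ASCII, 256)
# Spending the whole budget on two-byte codewords gives the largest
# possible number of non-ASCII codewords that are at most two bytes long.
max_short = int(budget * 256**2)

print(bits_needed)                 # 17
print(max_short)                   # 32768
print(TOTAL - ASCII > max_short)   # True, so 3-byte codewords are required
```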

For simplicity, let's assume we need to encode exactly 2¹⁷ = 131072 code points. We could first map every code point to an integer from 0 to 131071. What remains is an optimization problem: choose a code E that minimizes L(E) = ∑_{0≤i<131072} |E(i)|, the total encoded length in bytes.

One possible encoding is a prefix code. The specification already reserves the 0-prefixed byte for the values 0 ... 127. A simple but inefficient encoding uses a 1-prefixed sequence for all remaining values:

0xxxxxxx                   = x        (this encodes values < 128)
1xxxxxxx xxxxxxxx xxxxxxxx = x + 128  (this encodes values < 131072)

The total length is 1 byte × 128 + 3 bytes × (131072 − 128) = 392960 bytes.
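The same arithmetic works for any layout built from tiers that are filled in order. A minimal sketch, where `total_length` and the `(byte_length, capacity)` tier representation are names invented here for illustration rather than part of the answer's code:

```python
def total_length(tiers, n_values=131072):
    """Sum of encoded lengths over the values 0 .. n_values - 1.

    tiers is a list of (byte_length, capacity) pairs, filled in order.
    """
    remaining = n_values
    total = 0
    for byte_length, capacity in tiers:
        used = min(capacity, remaining)
        total += used * byte_length
        remaining -= used
    if remaining:
        raise ValueError('layout cannot represent %d of the values' % remaining)
    return total

# 0xxxxxxx carries 7 payload bits; 1xxxxxxx xxxxxxxx xxxxxxxx carries 23.
one_bit_prefix = [(1, 2**7), (3, 2**23)]

print(total_length(one_bit_prefix))   # 392960
```

Swapping in a different list of tiers gives the total for that layout, which makes it easy to compare candidates before writing an encoder.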

We could also use a 2-bit prefix, e.g.

0xxxxxxx                   = x       (< 128)
10xxxxxx                   = x + 128 (< 192)
11xxxxxx xxxxxxxx xxxxxxxx = x + 192 (< 131072)
    L(E) = 192×1 + 130880×3 = 392832

0xxxxxxx                   = x         (< 128)
10xxxxxx xxxxxxxx          = x + 128   (< 16512)
11xxxxxx xxxxxxxx xxxxxxxx = x + 16512 (< 131072)
    L(E) = 376576

And so on. I think the most efficient coding looks like this:

0xxxxxxx                   = x
10xxxxxx xxxxxxxx          = x + 128 
110xxxxx xxxxxxxx          = x + 16512 \
1110xxxx xxxxxxxx          = x + 24704  | squeeze all possible values
11110xxx xxxxxxxx          = x + 28800  | encodable by 2 bytes
111110xx xxxxxxxx          = x + 30848  | until 17 bits remain.
1111110x xxxxxxxx          = x + 31872 /
1111111x xxxxxxxx xxxxxxxx = x + 32384 (up to 163456)
    L(E) = 360704
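To make the table concrete, here is a sketch that derives each row's offset from its prefix, encodes and decodes a single number, and recomputes L(E). The names `TIERS`, `tier_table`, `encode_number`, `decode_number` and `total_length` are illustrative assumptions, not functions from the answer.

```python
# (prefix bits, prefix length in bits, total bytes) for each row of the table
TIERS = [
    (0b0,       1, 1),
    (0b10,      2, 2),
    (0b110,     3, 2),
    (0b1110,    4, 2),
    (0b11110,   5, 2),
    (0b111110,  6, 2),
    (0b1111110, 7, 2),
    (0b1111111, 7, 3),
]

def tier_table():
    """Return (offset, capacity, prefix, prefix_len, n_bytes) per tier."""
    rows, offset = [], 0
    for prefix, prefix_len, n_bytes in TIERS:
        capacity = 1 << (8 * n_bytes - prefix_len)
        rows.append((offset, capacity, prefix, prefix_len, n_bytes))
        offset += capacity
    return rows

def encode_number(number):
    for offset, capacity, prefix, prefix_len, n_bytes in tier_table():
        if number < offset + capacity:
            payload_bits = 8 * n_bytes - prefix_len
            word = (prefix << payload_bits) | (number - offset)
            return word.to_bytes(n_bytes, 'big')
    raise ValueError('number too large: %d' % number)

def decode_number(bs):
    """Decode one number from the start of bs; return (number, bytes_used)."""
    for offset, capacity, prefix, prefix_len, n_bytes in tier_table():
        if bs[0] >> (8 - prefix_len) == prefix:
            payload_bits = 8 * n_bytes - prefix_len
            word = int.from_bytes(bs[:n_bytes], 'big')
            return offset + (word & ((1 << payload_bits) - 1)), n_bytes
    raise ValueError('no tier matches lead byte %#x' % bs[0])

def total_length(n_values=131072):
    """Total bytes needed to encode the values 0 .. n_values - 1."""
    total, remaining = 0, n_values
    for _, capacity, _, _, n_bytes in tier_table():
        used = min(capacity, remaining)
        total += used * n_bytes
        remaining -= used
    return total

print([row[0] for row in tier_table()])
print(encode_number(32384).hex())
print(total_length())   # 360704
```

Because consecutive two-byte rows are contiguous both in value and in bit pattern, every two-byte result equals `number + 0x7f80` written big-endian, so the seven prefixes never have to be handled one by one.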

Since the encoding supports all numbers below 163456, I have chosen to close the "narrow" gaps (those under 14000 code points) to simplify the code point → number conversion, so it supports these ranges (upper bounds exclusive):

 U+0000 .. U+17000
U+1b000 .. U+2b900
U+2f800 .. U+2fb00
U+e0000 .. U+e0200

Since the range of numbers is extended, the score will be a few bytes above optimal, but I accept that compromise for the sake of a simpler encoder.
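The shift applied to each merged range follows mechanically from packing the ranges back to back. A small sketch under that assumption; `MERGED_RANGES` and `build_offsets` are hypothetical names used only here:

```python
# (first code point, one past the last code point) for each merged range
MERGED_RANGES = [
    (0x00000, 0x17000),
    (0x1B000, 0x2B900),
    (0x2F800, 0x2FB00),
    (0xE0000, 0xE0200),
]

def build_offsets(ranges):
    """Return ([(start, stop, shift), ...], count) with number = code point - shift."""
    table, next_number = [], 0
    for start, stop in ranges:
        table.append((start, stop, start - next_number))
        next_number += stop - start
    return table, next_number

OFFSETS, NUMBER_COUNT = build_offsets(MERGED_RANGES)

for start, stop, shift in OFFSETS:
    print(hex(start), hex(stop), hex(shift))
print(NUMBER_COUNT, NUMBER_COUNT <= 163456)
```

By my count, the cost of keeping the gaps is 144 bytes: the unassigned stretches U+0860-U+089F, U+1C80-U+1CBF and U+2FE0-U+2FEF (144 code points in total) occupy two-byte slots, so 144 assigned code points land in the three-byte group that a gap-free numbering would have fitted into two bytes.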

So the encoder and decoder may be written as follows (Python 3):

def codepoint_to_number(codepoint):
    if 0 <= codepoint < 0x17000:
        return codepoint
    elif 0x1b000 <= codepoint < 0x2b900:
        return codepoint - 0x4000
    elif 0x2f800 <= codepoint < 0x2fb00:
        return codepoint - 0x7f00
    elif 0xe0000 <= codepoint < 0xe0200:
        return codepoint - 0xb8400
    else:
        raise UnicodeError('Cannot encode ' + hex(codepoint))

def number_to_bytes(number):
    if number < 0x80:
        # One byte, identical to ASCII.
        return bytes([number])
    elif number < 0x7e80:
        # Two bytes, lead byte 0x80..0xfd (number 0x80 becomes 0x8000).
        number += 0x7f80
        return bytes([number >> 8, number & 0xff])
    else:
        # Three bytes, lead byte 0xfe or 0xff (number 0x7e80 becomes 0xfe0000).
        number += 0xfd8180
        return bytes([number >> 16, (number >> 8) & 0xff, number & 0xff])

def bytes_to_number(bs):
    # Decode one number from the front of bs; return (number, remaining bytes).
    first_byte = bs[0]
    if first_byte < 0x80:
        return (first_byte, bs[1:])
    elif first_byte < 0xfe:
        return (0x80 + ((first_byte & 0x7f) << 8 | bs[1]), bs[2:])
    else:
        return (0x7e80 + ((first_byte & 1) << 16 | bs[1] << 8 | bs[2]), bs[3:])

def number_to_codepoint(number):
    if number < 0x17000:
        return number
    elif number < 0x27900:
        return number + 0x4000
    elif number < 0x27c00:
        return number + 0x7f00
    else:
        return number + 0xb8400

def encode(codepoints):
    result = bytearray()
    for codepoint in codepoints:
        result.extend(number_to_bytes(codepoint_to_number(codepoint)))
    return result

def decode(bytes_sequence):
    bs = memoryview(bytes_sequence)
    while bs:
        (number, bs) = bytes_to_number(bs)
        yield number_to_codepoint(number)

# Test case:

import itertools
import unittest

# {{{
CODEPOINT_RANGES = [
    [0x0000,0x007F], # Basic Latin
    [0x0080,0x00FF], # Latin-1 Supplement
    [0x0100,0x017F], # Latin Extended-A
    [0x0180,0x024F], # Latin Extended-B
    [0x0250,0x02AF], # IPA Extensions
    [0x02B0,0x02FF], # Spacing Modifier Letters
    [0x0300,0x036F], # Combining Diacritical Marks
    [0x0370,0x03FF], # Greek and Coptic
    [0x0400,0x04FF], # Cyrillic
    [0x0500,0x052F], # Cyrillic Supplement
    [0x0530,0x058F], # Armenian
    [0x0590,0x05FF], # Hebrew
    [0x0600,0x06FF], # Arabic
    [0x0700,0x074F], # Syriac
    [0x0750,0x077F], # Arabic Supplement
    [0x0780,0x07BF], # Thaana
    [0x07C0,0x07FF], # NKo
    [0x0800,0x083F], # Samaritan
    [0x0840,0x085F], # Mandaic
    [0x08A0,0x08FF], # Arabic Extended-A
    [0x0900,0x097F], # Devanagari
    [0x0980,0x09FF], # Bengali
    [0x0A00,0x0A7F], # Gurmukhi
    [0x0A80,0x0AFF], # Gujarati
    [0x0B00,0x0B7F], # Oriya
    [0x0B80,0x0BFF], # Tamil
    [0x0C00,0x0C7F], # Telugu
    [0x0C80,0x0CFF], # Kannada
    [0x0D00,0x0D7F], # Malayalam
    [0x0D80,0x0DFF], # Sinhala
    [0x0E00,0x0E7F], # Thai
    [0x0E80,0x0EFF], # Lao
    [0x0F00,0x0FFF], # Tibetan
    [0x1000,0x109F], # Myanmar
    [0x10A0,0x10FF], # Georgian
    [0x1100,0x11FF], # Hangul Jamo
    [0x1200,0x137F], # Ethiopic
    [0x1380,0x139F], # Ethiopic Supplement
    [0x13A0,0x13FF], # Cherokee
    [0x1400,0x167F], # Unified Canadian Aboriginal Syllabics
    [0x1680,0x169F], # Ogham
    [0x16A0,0x16FF], # Runic
    [0x1700,0x171F], # Tagalog
    [0x1720,0x173F], # Hanunoo
    [0x1740,0x175F], # Buhid
    [0x1760,0x177F], # Tagbanwa
    [0x1780,0x17FF], # Khmer
    [0x1800,0x18AF], # Mongolian
    [0x18B0,0x18FF], # Unified Canadian Aboriginal Syllabics Extended
    [0x1900,0x194F], # Limbu
    [0x1950,0x197F], # Tai Le
    [0x1980,0x19DF], # New Tai Lue
    [0x19E0,0x19FF], # Khmer Symbols
    [0x1A00,0x1A1F], # Buginese
    [0x1A20,0x1AAF], # Tai Tham
    [0x1AB0,0x1AFF], # Combining Diacritical Marks Extended
    [0x1B00,0x1B7F], # Balinese
    [0x1B80,0x1BBF], # Sundanese
    [0x1BC0,0x1BFF], # Batak
    [0x1C00,0x1C4F], # Lepcha
    [0x1C50,0x1C7F], # Ol Chiki
    [0x1CC0,0x1CCF], # Sundanese Supplement
    [0x1CD0,0x1CFF], # Vedic Extensions
    [0x1D00,0x1D7F], # Phonetic Extensions
    [0x1D80,0x1DBF], # Phonetic Extensions Supplement
    [0x1DC0,0x1DFF], # Combining Diacritical Marks Supplement
    [0x1E00,0x1EFF], # Latin Extended Additional
    [0x1F00,0x1FFF], # Greek Extended
    [0x2000,0x206F], # General Punctuation
    [0x2070,0x209F], # Superscripts and Subscripts
    [0x20A0,0x20CF], # Currency Symbols
    [0x20D0,0x20FF], # Combining Diacritical Marks for Symbols
    [0x2100,0x214F], # Letterlike Symbols
    [0x2150,0x218F], # Number Forms
    [0x2190,0x21FF], # Arrows
    [0x2200,0x22FF], # Mathematical Operators
    [0x2300,0x23FF], # Miscellaneous Technical
    [0x2400,0x243F], # Control Pictures
    [0x2440,0x245F], # Optical Character Recognition
    [0x2460,0x24FF], # Enclosed Alphanumerics
    [0x2500,0x257F], # Box Drawing
    [0x2580,0x259F], # Block Elements
    [0x25A0,0x25FF], # Geometric Shapes
    [0x2600,0x26FF], # Miscellaneous Symbols
    [0x2700,0x27BF], # Dingbats
    [0x27C0,0x27EF], # Miscellaneous Mathematical Symbols-A
    [0x27F0,0x27FF], # Supplemental Arrows-A
    [0x2800,0x28FF], # Braille Patterns
    [0x2900,0x297F], # Supplemental Arrows-B
    [0x2980,0x29FF], # Miscellaneous Mathematical Symbols-B
    [0x2A00,0x2AFF], # Supplemental Mathematical Operators
    [0x2B00,0x2BFF], # Miscellaneous Symbols and Arrows
    [0x2C00,0x2C5F], # Glagolitic
    [0x2C60,0x2C7F], # Latin Extended-C
    [0x2C80,0x2CFF], # Coptic
    [0x2D00,0x2D2F], # Georgian Supplement
    [0x2D30,0x2D7F], # Tifinagh
    [0x2D80,0x2DDF], # Ethiopic Extended
    [0x2DE0,0x2DFF], # Cyrillic Extended-A
    [0x2E00,0x2E7F], # Supplemental Punctuation
    [0x2E80,0x2EFF], # CJK Radicals Supplement
    [0x2F00,0x2FDF], # Kangxi Radicals
    [0x2FF0,0x2FFF], # Ideographic Description Characters
    [0x3000,0x303F], # CJK Symbols and Punctuation
    [0x3040,0x309F], # Hiragana
    [0x30A0,0x30FF], # Katakana
    [0x3100,0x312F], # Bopomofo
    [0x3130,0x318F], # Hangul Compatibility Jamo
    [0x3190,0x319F], # Kanbun
    [0x31A0,0x31BF], # Bopomofo Extended
    [0x31C0,0x31EF], # CJK Strokes
    [0x31F0,0x31FF], # Katakana Phonetic Extensions
    [0x3200,0x32FF], # Enclosed CJK Letters and Months
    [0x3300,0x33FF], # CJK Compatibility
    [0x3400,0x4DBF], # CJK Unified Ideographs Extension A
    [0x4DC0,0x4DFF], # Yijing Hexagram Symbols
    [0x4E00,0x9FFF], # CJK Unified Ideographs
    [0xA000,0xA48F], # Yi Syllables
    [0xA490,0xA4CF], # Yi Radicals
    [0xA4D0,0xA4FF], # Lisu
    [0xA500,0xA63F], # Vai
    [0xA640,0xA69F], # Cyrillic Extended-B
    [0xA6A0,0xA6FF], # Bamum
    [0xA700,0xA71F], # Modifier Tone Letters
    [0xA720,0xA7FF], # Latin Extended-D
    [0xA800,0xA82F], # Syloti Nagri
    [0xA830,0xA83F], # Common Indic Number Forms
    [0xA840,0xA87F], # Phags-pa
    [0xA880,0xA8DF], # Saurashtra
    [0xA8E0,0xA8FF], # Devanagari Extended
    [0xA900,0xA92F], # Kayah Li
    [0xA930,0xA95F], # Rejang
    [0xA960,0xA97F], # Hangul Jamo Extended-A
    [0xA980,0xA9DF], # Javanese
    [0xA9E0,0xA9FF], # Myanmar Extended-B
    [0xAA00,0xAA5F], # Cham
    [0xAA60,0xAA7F], # Myanmar Extended-A
    [0xAA80,0xAADF], # Tai Viet
    [0xAAE0,0xAAFF], # Meetei Mayek Extensions
    [0xAB00,0xAB2F], # Ethiopic Extended-A
    [0xAB30,0xAB6F], # Latin Extended-E
    [0xABC0,0xABFF], # Meetei Mayek
    [0xAC00,0xD7AF], # Hangul Syllables
    [0xD7B0,0xD7FF], # Hangul Jamo Extended-B
    [0xF900,0xFAFF], # CJK Compatibility Ideographs
    [0xFB00,0xFB4F], # Alphabetic Presentation Forms
    [0xFB50,0xFDFF], # Arabic Presentation Forms-A
    [0xFE00,0xFE0F], # Variation Selectors
    [0xFE10,0xFE1F], # Vertical Forms
    [0xFE20,0xFE2F], # Combining Half Marks
    [0xFE30,0xFE4F], # CJK Compatibility Forms
    [0xFE50,0xFE6F], # Small Form Variants
    [0xFE70,0xFEFF], # Arabic Presentation Forms-B
    [0xFF00,0xFFEF], # Halfwidth and Fullwidth Forms
    [0xFFF0,0xFFFF], # Specials
    [0x10000,0x1007F], # Linear B Syllabary
    [0x10080,0x100FF], # Linear B Ideograms
    [0x10100,0x1013F], # Aegean Numbers
    [0x10140,0x1018F], # Ancient Greek Numbers
    [0x10190,0x101CF], # Ancient Symbols
    [0x101D0,0x101FF], # Phaistos Disc
    [0x10280,0x1029F], # Lycian
    [0x102A0,0x102DF], # Carian
    [0x102E0,0x102FF], # Coptic Epact Numbers
    [0x10300,0x1032F], # Old Italic
    [0x10330,0x1034F], # Gothic
    [0x10350,0x1037F], # Old Permic
    [0x10380,0x1039F], # Ugaritic
    [0x103A0,0x103DF], # Old Persian
    [0x10400,0x1044F], # Deseret
    [0x10450,0x1047F], # Shavian
    [0x10480,0x104AF], # Osmanya
    [0x10500,0x1052F], # Elbasan
    [0x10530,0x1056F], # Caucasian Albanian
    [0x10600,0x1077F], # Linear A
    [0x10800,0x1083F], # Cypriot Syllabary
    [0x10840,0x1085F], # Imperial Aramaic
    [0x10860,0x1087F], # Palmyrene
    [0x10880,0x108AF], # Nabataean
    [0x10900,0x1091F], # Phoenician
    [0x10920,0x1093F], # Lydian
    [0x10980,0x1099F], # Meroitic Hieroglyphs
    [0x109A0,0x109FF], # Meroitic Cursive
    [0x10A00,0x10A5F], # Kharoshthi
    [0x10A60,0x10A7F], # Old South Arabian
    [0x10A80,0x10A9F], # Old North Arabian
    [0x10AC0,0x10AFF], # Manichaean
    [0x10B00,0x10B3F], # Avestan
    [0x10B40,0x10B5F], # Inscriptional Parthian
    [0x10B60,0x10B7F], # Inscriptional Pahlavi
    [0x10B80,0x10BAF], # Psalter Pahlavi
    [0x10C00,0x10C4F], # Old Turkic
    [0x10E60,0x10E7F], # Rumi Numeral Symbols
    [0x11000,0x1107F], # Brahmi
    [0x11080,0x110CF], # Kaithi
    [0x110D0,0x110FF], # Sora Sompeng
    [0x11100,0x1114F], # Chakma
    [0x11150,0x1117F], # Mahajani
    [0x11180,0x111DF], # Sharada
    [0x111E0,0x111FF], # Sinhala Archaic Numbers
    [0x11200,0x1124F], # Khojki
    [0x112B0,0x112FF], # Khudawadi
    [0x11300,0x1137F], # Grantha
    [0x11480,0x114DF], # Tirhuta
    [0x11580,0x115FF], # Siddham
    [0x11600,0x1165F], # Modi
    [0x11680,0x116CF], # Takri
    [0x118A0,0x118FF], # Warang Citi
    [0x11AC0,0x11AFF], # Pau Cin Hau
    [0x12000,0x123FF], # Cuneiform
    [0x12400,0x1247F], # Cuneiform Numbers and Punctuation
    [0x13000,0x1342F], # Egyptian Hieroglyphs
    [0x16800,0x16A3F], # Bamum Supplement
    [0x16A40,0x16A6F], # Mro
    [0x16AD0,0x16AFF], # Bassa Vah
    [0x16B00,0x16B8F], # Pahawh Hmong
    [0x16F00,0x16F9F], # Miao
    [0x1B000,0x1B0FF], # Kana Supplement
    [0x1BC00,0x1BC9F], # Duployan
    [0x1BCA0,0x1BCAF], # Shorthand Format Controls
    [0x1D000,0x1D0FF], # Byzantine Musical Symbols
    [0x1D100,0x1D1FF], # Musical Symbols
    [0x1D200,0x1D24F], # Ancient Greek Musical Notation
    [0x1D300,0x1D35F], # Tai Xuan Jing Symbols
    [0x1D360,0x1D37F], # Counting Rod Numerals
    [0x1D400,0x1D7FF], # Mathematical Alphanumeric Symbols
    [0x1E800,0x1E8DF], # Mende Kikakui
    [0x1EE00,0x1EEFF], # Arabic Mathematical Alphabetic Symbols
    [0x1F000,0x1F02F], # Mahjong Tiles
    [0x1F030,0x1F09F], # Domino Tiles
    [0x1F0A0,0x1F0FF], # Playing Cards
    [0x1F100,0x1F1FF], # Enclosed Alphanumeric Supplement
    [0x1F200,0x1F2FF], # Enclosed Ideographic Supplement
    [0x1F300,0x1F5FF], # Miscellaneous Symbols and Pictographs
    [0x1F600,0x1F64F], # Emoticons
    [0x1F650,0x1F67F], # Ornamental Dingbats
    [0x1F680,0x1F6FF], # Transport and Map Symbols
    [0x1F700,0x1F77F], # Alchemical Symbols
    [0x1F780,0x1F7FF], # Geometric Shapes Extended
    [0x1F800,0x1F8FF], # Supplemental Arrows-C
    [0x20000,0x2A6DF], # CJK Unified Ideographs Extension B
    [0x2A700,0x2B73F], # CJK Unified Ideographs Extension C
    [0x2B740,0x2B81F], # CJK Unified Ideographs Extension D
    [0x2F800,0x2FA1F], # CJK Compatibility Ideographs Supplement
    [0xE0000,0xE007F], # Tags
    [0xE0100,0xE01EF]  # Variation Selectors Supplement
]
#}}}

ALL_CODEPOINTS = list(itertools.chain.from_iterable(range(x, y+1) for x, y in CODEPOINT_RANGES))

ENCODE_RESULT = encode(ALL_CODEPOINTS)

print(len(ENCODE_RESULT), '/', len(ALL_CODEPOINTS))

DECODE_RESULT = list(decode(ENCODE_RESULT))

tc = unittest.TestCase()
tc.assertListEqual(ALL_CODEPOINTS, DECODE_RESULT)

kennytm

Posted 2014-10-28T07:43:38.447

Reputation: 6 847

Great entry! You're right, the challenge doesn't require self-sync, and it's context-independent. My code point count was indeed wrong; it seems the correct count is 116816. (Warning: if I'm right, it's not 116570 as you say, and you may have to recount your score.) – xem – 2014-10-28T12:30:49.413

@grawity: UTF-16 is self-synchronizing if you view every 16-bit unit as a single word. – kennytm – 2014-10-28T12:51:57.867

@xem: Thanks, updated the code to match the correct count. – kennytm – 2014-10-28T12:52:16.527

log₂ 113706 = 16.79, so one could encode all these code points in 17 bits (3 bytes). – kennytm – 2014-10-28T07:55:11.677

@KennyTM the ASCII compatibility requirement will throw that estimate off a little. By almost one bit, if my mental math is right. – hobbs – 2014-10-28T08:08:29.347

What exactly is the ambiguity requirement? I assume it doesn't have to be self-synchronising, and you seem to assume that each byte in the output is part of the representation of precisely one code point (correct?), so are you asking for a prefix code? Or is it sufficient that the encoder implement an injective function, but allowing arithmetic encoding? – Peter Taylor – 2014-10-28T08:10:24.260

@hobbs log₂ (113706 - 128) is still 16.79, so it just requires exactly 1 more bit, and still 3 bytes are needed. – kennytm – 2014-10-28T08:10:30.217

@PeterTaylor I'm not sure about what you mean by self-synchronizing, prefix, injective and algorithmic encoding (lol), but what I meant was: "two (groups of) code points can't have the same encoding, and N bytes can't be decoded in more than one way". Each byte of the output can but DOES NOT HAVE TO be identifiable as the Xth byte of the Y-bytes encoding of a character, (like UTF-8 does). – xem – 2014-10-28T08:19:44.487

I got 116816 adding up the amount of possible values. – feersum – 2014-10-28T08:47:36.717

As others have pointed out, doesn't this just amount to assigning a unique 17-bit integer to each code point? – COTO – 2014-10-28T09:23:27.307

@COTO yeah almost. you can encode all code points on 1 to 3 bytes. the challenge is to do it efficiently – xem – 2014-10-28T12:24:28.593

@feersum thanks, I also get 116816 after recounting. Let's use that. :) – xem – 2014-10-28T12:31:33.087

@KennyTM right. Well, 0.9998375 bits really, but I was in bed without a proper calculator when I wrote my previous comment :) – hobbs – 2014-10-28T15:48:10.250

Of possible interest: Reading the Wiki UTF-8 article made this challenge a great deal clearer. @xem, you might consider including this link in the OP somewhere. – COTO – 2014-10-28T16:26:10.163

Invent your own Unicode 7.0.0 encoding (as efficient as possible)
