LEB128

LEB128 or Little Endian Base 128 is a form of variable-length code compression used to store an arbitrarily large integer in a small number of bytes. LEB128 is used in the DWARF debug file format[1][2] and the WebAssembly binary encoding for all integer literals.[3]

Encoding format

LEB128 format is very similar to variable-length quantity format; the primary difference is that LEB128 is little-endian whereas variable-length quantities are big-endian. Both allow small numbers to be stored in a single byte, while also allowing encoding of arbitrarily long numbers. There are 2 versions of LEB128: unsigned LEB128 and signed LEB128. The decoder must know whether the encoded value is unsigned LEB128 or signed LEB128.

Unsigned LEB128

To encode an unsigned number using unsigned LEB128 first represent the number in binary. Then zero extend the number up to a multiple of 7 bits (such that if the number is non-zero, the most significant 7 bits are not all 0). Break the number up into groups of 7 bits. Output one encoded byte for each 7 bit group, from least significant to most significant group. Each byte will have the group in its 7 least significant bits. Set the most significant bit on each byte except the last byte. The number zero is encoded as a single byte 0x00.

As an example, here is how the unsigned number 624485 gets encoded:

MSB ------------------ LSB
      10011000011101100101  In raw binary
     010011000011101100101  Padded to a multiple of 7 bits
 0100110  0001110  1100101  Split into 7-bit groups
00100110 10001110 11100101  Add high 1 bits on all but last (most significant) group to form bytes
    0x26     0x8E     0xE5  In hexadecimal

→ 0xE5 0x8E 0x26            Output stream (LSB to MSB)

Unsigned LEB128 and VLQ (variable-length quantity) both compress any given integer into not only the same number of bits, but exactly the same bits—the two formats differ only in exactly how those bits are arranged.

Signed LEB128

A signed number is represented similarly: Starting with an $N$ -bit two's complement representation, where $N$ is a multiple of 7, the number is broken into groups as for the unsigned encoding.

For example the signed number -123456 is encoded as 0xC0 0xBB 0x78:

MSB ------------------ LSB 
         11110001001000000  Binary encoding of 123456
     000011110001001000000  As a 21-bit number
     111100001110110111111  Negating all bits (one’s complement)
     111100001110111000000  Adding one (two’s complement)
 1111000  0111011  1000000  Split into 7-bit groups
01111000 10111011 11000000  Add high 1 bits on all but last (most significant) group to form bytes
    0x78     0xBB     0xC0  In hexadecimal

→ 0xC0 0xBB 0x78            Output stream (LSB to MSB)

C-like pseudocode

Encode unsigned integer

do {
  byte = low order 7 bits of value;
  value >>= 7;
  if (value != 0) /* more bytes to come */
    set high order bit of byte;
  emit byte;
} while (value != 0);

Encode signed integer

more = 1;
negative = (value < 0);

/* the size in bits of the variable value, e.g., 64 if value's type is int64_t */
size = no. of bits in signed integer; 

while (more) {
  byte = low order 7 bits of value;
  value >>= 7;
  /* the following is only necessary if the implementation of >>= uses a 
     logical shift rather than an arithmetic shift for a signed left operand */
  if (negative)
    value |= (~0 << (size - 7)); /* sign extend */

  /* sign bit of byte is second high order bit (0x40) */
  if ((value == 0 && sign bit of byte is clear) || (value == -1 && sign bit of byte is set))
    more = 0;
  else
    set high order bit of byte;
  emit byte;
}

Decode unsigned integer

result = 0;
shift = 0;
while (true) {
  byte = next byte in input;
  result |= (low order 7 bits of byte) << shift;
  if (high order bit of byte == 0)
    break;
  shift += 7;
}

Decode signed integer

result = 0;
shift = 0;

/* the size in bits of the result variable, e.g., 64 if result's type is int64_t */
size = number of bits in signed integer;

do {
  byte = next byte in input;
  result |= (low order 7 bits of byte << shift);
  shift += 7;
} while (high order bit of byte != 0);

/* sign bit of byte is second high order bit (0x40) */
if ((shift <size) && (sign bit of byte is set))
  /* sign extend */
  result |= (~0 << shift);

JavaScript Code

Encode signed 32bit integer

const encodeSignedLeb128FromInt32 = (value) => {
  value |= 0;
  const result = [];
  while (true) {
    const byte = value & 0x7f;
    value >>= 7;
    if (
      (value === 0 && (byte & 0x40) === 0) ||
      (value === -1 && (byte & 0x40) !== 0)
    ) {
      result.push(byte);
      return result;
    }
    result.push(byte | 0x80);
  }
};

Decode signed 32bit integer

const decodeSignedLeb128 = (input) => {
  let result = 0;
  let shift = 0;
  while (true) {
    const byte = input.shift();
    result |= (byte & 0x7f) << shift;
    shift += 7;
    if ((0x80 & byte) === 0) {
      if (shift < 32 && (byte & 0x40) !== 0) {
        return result | (~0 << shift);
      }
      return result;
    }
  }
};

Uses

The DWARF file format uses both unsigned and signed LEB128 encoding for various fields.[2]
The mpatrol debugging tool uses LEB128 in its tracing file format.[4]
The Android project uses LEB128 in its Dalvik Executable Format (.dex) file format.[5]
Compressing tables in Hewlett-Packard IA-64 exception handling.[6]
It is used in the Linux kernel for its DWARF implementation.[7]
It is used in WebAssembly's portable binary encoding of the modules.[8]
It is used in LLVM's Coverage Mapping Format.[9] LLVM's implementation of LEB128 encoding and decoding is useful alongside the pseudo-code above.[10]
osu! uses LEB128 in its osu! replay (.osr) format.[11]
It is used in xz file format.[12]

Related Encodings

The LLVM bitcode file format uses a similar technique[13] except that the value is broken into groups of bits of context-dependent size, with the highest bit indicating a continuation, instead of a fixed 7 bits.
Dlugosz' Variable-Length Integer Encoding uses multiples of 7 bits for the first three size breaks, but after that the increments vary. It also puts all the prefix bits at the beginning of the word, instead of at the beginning of each byte.
Protocol Buffers use the same encoding for unsigned integers, but encode signed integers by prepending the sign as the least significant bit.
W3C Efficient XML Interchange (EXI) [14] represents unsigned integers using LEB128, in exactly the same way as described here.

gollark: What format?

gollark: You said we are all "constantly accelerated". I do not experience acceleration.

gollark: I'm not accelerated.

gollark: Oh, I'm not. GTech™ Physics Site-21114 isn't, due to testing of INTRANSITIVE WINTERS technology.

gollark: Clearly you are moving at different, relativistic velocities.

References

UNIX International (July 1993). "DWARF Debugging Information Format Specification Version 2.0, Draft" (PDF). Retrieved 2009-07-19. |section= ignored (help)
Free Standards Group (December 2005). "DWARF Debugging Information Format Specification Version 3.0" (PDF). p. 70. Retrieved 2009-07-19.
WebAssembly Community Group (January 2020). "WebAssembly Specification Release 1.0". Retrieved 2020-01-13.
"MPatrol documentation". December 2008. Retrieved 2009-07-19.
"Dalvik Executable Format". 2007. Retrieved 2009-07-19.
Christophe de Dinechin (October 2000). "C++ Exception Handling for IA-64". Retrieved 2009-07-19.
Matt Fleming (2009). "DWARF implementation". Retrieved 2011-05-22.
WebAssembly (2016). "WebAssembly Binary Encoding". Retrieved 2016-03-15.
LLVM Project (2016). "LLVM Code Coverage Mapping Format". Retrieved 2016-10-20.
LLVM Project (2019). "LLVM LEB128 encoding and decoding". Retrieved 2019-11-02.
"Osr (file format) - osu!wiki". osu.ppy.sh. Retrieved 2017-03-18.
"The .xz File Format". tukaani.org. 2009. Retrieved 2017-10-30.
http://llvm.org/docs/BitCodeFormat.html#variable-width-value
https://www.w3.org/TR/exi/#encodingUnsignedInteger