Inspiration:

Heavily inspired by Smallest chess board compression I decided to make a similar but distinctly different competition.

tl;dr

Take the file Chess_Games.txt and compress it as much as you can so that you can expand it to the original file.

Objective:

Write an algorithm to encode and decode an entire chess database from start position to end

The encoding must be able to determine for all of the games all the positions:

The location all the pieces
Whose turn it is
Whether the player can castle on each side.
Whether the player can perform an en-passant, and if so, which of their pawns?
The previous position

Additionally:

Each game should also include who won and how it ended (forfeit, draw, checkmate, stalemate, etc...)

Input/Output:

There should be 2 algorithms Compress/Expand which satisfy the following properties:

Compress takes in a file of games via a sequence of moves via chess notation and output compressed file
Expand does the opposite, taking in a compressed file and outputting the original file with all the games in the same order
Correctness: Expand(Compress(file)) = file for all properly formed files

Any game that is not well formed or violates the rules of chess is considered bad. All bad games may be skipped.

It must be able to parse sen notation. Look at chessgames.com and https://database.lichess.org/ for some examples.

I have compiled a file of the first 10000 games from "May 2017" in Chess_Games.txt

The file should look like the following:

e4 d5 exd5 Nf6 d3 Qxd5 Nc3 Qf5 Be2 Bd7 g4 Qe6 g5 Nd5 Ne4 Bc6 Bg4 Qe5 f4 Qd4 Nf3 Qb6 Qe2 e6 Be3 Qa6 O-O Nd7 c4 Ne7 f5 Bxe4 dxe4 exf5 exf5 O-O-O f6 gxf6 gxf6 Nc6 Nd4 Nxd4 Bxd4 Bc5 Bxc5 Rhg8 Be7 Rde8 b4 Qxf6 Bxf6 Rxe2 h3 h5 Kh1 hxg4 Rg1 Nxf6 Rae1 Rxe1 Rxe1 gxh3 Kh2 Ng4+ Kxh3 Nf2+ Kh2 Ng4+ Kg3 f5 Kf4 Nh6 Re7 Rg4+ Ke5 Kd8 Kf6 Ng8+ Kf7 Nxe7 Kf8 f4 c5 f3 c6 f2 cxb7 Nc6 b8=Q+ Nxb8 Kf7 Kd7 0-1
d3 e6 e3 d5 Nf3 Nf6 Be2 Be7 O-O O-O Nc3 Nbd7 Ne1 c5 f3 e5 e4 d4 Nb1 b6 f4 exf4 Rxf4 Qc7 Rf3 Bd6 g3 Ng4 Rf1 Ndf6 Bxg4 Bxg4 Qd2 Rae8 Qf2 Re5 Bf4 Rh5 Bxd6 Qxd6 Nf3 c4 e5 Rxe5 Nxe5 Qxe5 Nd2 cxd3 cxd3 Be2 Rfe1 Qe3 Qxe3 dxe3 Ne4 Bxd3 Nxf6+ gxf6 Rxe3 Bg6 Rae1 Kg7 h4 h5 R1e2 Bf5 Rg2 Bg4 Re4 Kg6 Rge2 f5 Re5 f6 Re6 Rg8 Re7 Rg7 Re8 1-0
d4 d5 e4 dxe4 c4 Nf6 Qc2 Bf5 Nc3 Qxd4 Be3 Qe5 Nge2 e6 Ng3 Bb4 a3 Bxc3+ bxc3 Nc6 Be2 Na5 O-O Nxc4 Bd4 Qb5 Nxf5 exf5 Bxf6 gxf6 Rab1 Nxa3 Qa2 Qxb1 Rxb1 Nxb1 Qxb1 O-O-O Qb3 b6 Ba6+ Kb8 Qb5 Rhe8 Qc6 1-0
e3 c5 d3 d5 Nf3 Nc6 Be2 Nf6 O-O g6 h3 Bg7 Re1 O-O Nbd2 Re8 Nf1 e5 c3 b6 N3h2 Bb7 Qc2 Qc7 b3 Rac8 Bb2 a5 Rac1 a4 Qb1 axb3 axb3 Ba6 d4 Bxe2 Rxe2 exd4 cxd4 Qb8 dxc5 d4 Ree1 dxe3 Rxe3 Nd5 Rxe8+ Rxe8 Bxg7 Kxg7 Re1 Rxe1 Qxe1 bxc5 Qd1 Nd4 Ne3 Nf4 Neg4 Qxb3 Qe1 Qd5 Ne3 Qe6 Qf1 c4 Nhg4 Nde2+ Kh2 c3 g3 c2 gxf4 c1=Q Qxc1 Nxc1 Ng2 Qe4 Kg3 Nd3 f3 Qe2 Nh4 h5 Nh2 Qe1+ Kg2 Nf2 Nf5+ gxf5 Kg3 Qg1+ Kh4 Nxh3 Kxh5 1-0
e4 d5 exd5 Qxd5 Nc3 Qd8 d4 Bf5 Nf3 e6 Nh4 Bg6 Nxg6 hxg6 Be3 Bd6 Bc4 a6 Qf3 c6 O-O-O Nd7 Ne4 Qc7 Nxd6+ Qxd6 Bf4 Qb4 Bb3 Ngf6 Rhe1 O-O-O Bg3 a5 Qf4 Qb6 Re3 Rh5 Bc4 Rf5 Qd6 Ne8 Qa3 Qa7 Red3 b5 Bxb5 cxb5 Rc3+ Kb7 Qe7 b4 Rc5 Rxc5 dxc5 Qxc5 Qxd8 Ndf6 Qb8+ Ka6 Rd8 Qa7 Rxe8 Nxe8 Qxe8 Qc5 Qa8+ Kb5 Qb8+ Ka4 b3+ Ka3 Qf4 Qc3 Qe5 1-0
e4 d5 exd5 Qxd5 Nf3 Qd8 Nc3 e6 Bc4 Nf6 Bb3 Be7 a3 a6 O-O O-O h3 b6 d3 Bb7 Bg5 Nbd7 Ne4 h6 Bh4 Nxe4 Bxe7 Qxe7 dxe4 Bxe4 Re1 Bb7 Ba2 Nf6 b4 Rfd8 Qc1 Qd6 c4 Qc6 Qc2 Rd7 Rac1 Rad8 c5 bxc5 Qxc5 Qb5 Qxb5 axb5 Ne5 Rd2 Bb3 Rb2 Bd1 Rdd2 Re2 Rxe2 Bxe2 Rxe2 Nf3 Ra2 Rxc7 Bd5 Rc8+ Kh7 Ne5 Rxa3 Rf8 Ra1+ Kh2 h5 Rxf7 Kh6 Rf8 Kg5 g3 Kf5 Nd7 Ra2 Nxf6 gxf6 Rg8 Rxf2+ Kg1 Rg2+ Kf1 Rh2 g4+ hxg4 hxg4+ Ke5 Re8 Rh1+ Kf2 Rh2+ Kg3 Rg2+ Kh4 Rf2 Kg3 Rf4 g5 Re4 gxf6 Kxf6 Rf8+ Ke5 Rh8 Re3+ Kf2 Re4 Rh5+ Kd4 Rh6 Rf4+ Kg3 Re4 Rh8 Re3+ 0-1
...

Scoring:

In order to make things objective, the winner is the algorithm that can compress the files at https://database.lichess.org/ as small as possible. The target database is the "May 2017". The winner is whoever has the smallest file that expands properly.

The file to use is Chess_Games.txt which is the first 10000 games from the "May 2017" database at https://database.lichess.org/ with all of the header information removed. This file will be the benchmark to use. It should be 2,789,897 bytes with a SHA-256 hash of 56b6d2fffc7644bcd126becd03d859a3cf6880fa01a47fa08d4d6a823a4866bc (Pastebin might remove the last newline after game 10000)

Naive Solution:

Using generic compression algorithms:

zip: 647,772 bytes (76.841%)
7z: 652,813 bytes (76.661%)
bz2: 722,158 bytes (74.182%)
xz: 784,980 bytes (71.936%)
rar: 853,482 bytes (69.487%)
gz: 923,474 bytes (66.985%)

edggy

Posted 2017-07-16T20:01:00.620

Reputation: 227

There shouldn't be a space after http:// – JungHwan Min – 2017-07-16T20:03:42.577

Thanks for fixing it. The site wouldn't let me post a question with more than 1 link since I don't have the appropriate reputation. So i added a space and hoped someone else would fix it or I could fix it later. – edggy – 2017-07-16T20:08:56.577

No problem, I understand the pain. – notjagan – 2017-07-16T20:11:24.637

Nice question! Could you elaborate more on how the scoring is calculated? Also, which variant of the standard chess notation should we expect and output? In the future, you might find the Sandbox helpful :)

– musicman523 – 2017-07-16T20:27:43.413

1Could we get the chess game/notation you'll be using to score the answers? – wrymug – 2017-07-16T20:37:20.180

@AndersKaseorg: That's a very good suggestion, I just don't have the time to find all those games and run each algorithm on them. If you would like you are more than welcome to compile a collection of games and I will add them to the winning criterion. Also although there are a finite number of games the space complexity is still objective. For example if i asked you "which sorting algorithm is faster on lists of 8-byte integers of length 10000, bubble sort or quicksort?" the answer would still be quicksort even though there are only a finite number of values (specifically (2^(8*8))^10000) – edggy – 2017-07-16T21:29:25.267

@AndersKaseorg: I took your suggestion and found several databases at https://database.lichess.org/ for us to use. The goal is now to have the highest compression ratio from raw games to compressed games.

– edggy – 2017-07-17T00:46:39.000

I agree, that is why I am taking that file and extracting only the moves so it looks like the example above. Then I'll truncate it to only the first 10,000 games and upload that to the pastebin link above. – edggy – 2017-07-18T15:20:43.593

@AndersKaseorg I had forgotten about that, thanks! – Nnnes – 2017-07-18T19:59:57.667

1I stripped the annotations from the benchmark file and removed the confusing references the draw rules to make everything clearer. I changed the statement about parsing to use those sites as a reference. – edggy – 2017-07-18T21:49:54.150

Answers

Python, score = 426508 bytes

Compression Function: ( m+1 ) ⋅ log₂ ( b+1 )

m is number of moves and b is branching factor

Attempt 1:

I answered the question at Smallest chess board compression so I'm trying to fix it up to post here.

Naively I thought that I can encode each move in 12 bits, 4 triplets of the form (start x, start y, end x, end y) where each is 3 bits.

We would assume the starting position and move pieces from there with white going first. The board is arranged such that ( 0 , 0 ) is white's lower left corner.

For example the game:

  e4    e5
 Nf3    f6
Nxe5  fxe5
...    ...

Would be encoded as:

100001 100010 100110 100100
110000 101010 101110 101101
101010 100100 101101 100100
...

This leads to an encoding of 12 m bits where m is the number of moves made.

Attempt 2:

I realized that each move in the previous attempt encodes many illegal moves. So I decided to only encode legal moves. We enumerate possible moves as follows, number each square such that ( 0 , 0 ) → 0 , ( 1 , 0 ) → 1 , ( x , y ) → x + 8 y . Iterate through the tiles and check if a piece is there and if it can move. If so add the positions it can go to to a list. Choose the list index that is the move you want to make. Add that number to the running total of moves weighted by 1 plus the number of possible moves.

Example as above: From the starting position the first piece that can move is the knight on square 1, it can move to square 16 or 18, so add those to the list [(1,16),(1,18)]. Next is the knight on square 6, add it's moves. Overall we get:

[(1,16),(1,18),(6,21),(6,23),(8,16),(8,24),(9,17),(9,25),(10,18),(10,26),(11,19),(11,27),(12,20),(12,28),(13,21),(13,29),(14,22),(14,30),(15,23),(15,31)]

Since we want the move ( 12 , 28 ) , we encode this as 13 in base 20 since there are 20 possible moves.

So now we get the game number g₀ = 13

Next we do the same for black except we number the tiles in reverse (to make it easier, not required) to get the list of moves:

[(1,16),(1,18),(6,21),(6,23),(8,16),(8,24),(9,17),(9,25),(10,18),(10,26),(11,19),(11,27),(12,20),(12,28),(13,21),(13,29),(14,22),(14,30),(15,23),(15,31)]

Since we want the move ( 11 , 27 ) , we encode this as 11 in base 20 since there are 20 possible moves.

So now we get the game number g₁ = ( 11 ⋅ 20 ) + 13 = 233

Next we get the following list of moves for white:

[(1,16),(1,18),(3,12),(3,21),(3,30),(3,39),(4,12),(5,12),(5,19),(5,26),(5,33),(5,40),(6,12),(6,21),(6,23),(8,16),(8,24),(9,17),(9,25),(10,18),(10,26),(11,19),(11,27)(13,21),(13,29),(14,22),(14,30),(15,23),(15,31)]

Since we want the move ( 6 , 21 ) , we encode this as 13 in base 29 since there are 29 possible moves.

So now we get the game number g₂ = ( ( 13 ⋅ 20 ) + 11 ) 20 + 13 = 5433

Next we get the following list of moves for black: [(1,11),(1,16),(1,18),(2,11),(2,20),(2,29),(2,38),(2,47),(3,11),(4,11),(4,18),(4,25),(4,32),(6,21),(6,23),(8,16),(8,24),(9,17),(9,25),(10,18),(10,26),(12,20),(12,28),(13,21),(13,29),(14,22),(14,30),(15,23),(15,31)]

Since we want the move ( 10 , 18 ) , we encode this as 19 in base 29 since there are 29 possible moves.

So now we get the game number g₃ = ( ( ( 19 ⋅ 29 + 13 ) 20 ) + 11 ) 20 + 13 = 225833

And continue this process for all the remaining moves. You can think of g as the function g ( x , y , z ) = x y + z . Thus g₀ = g ( 1 , 1 , 13 ) , g₁ = g ( g ( 1 , 1 , 11 ) , 20 , 13 ) , g₂ = g ( g ( g ( 1 , 1 , 13 ) , 20 , 11 ) , 20 , 13 ) , g₃ = g ( g ( g ( g ( 1 , 1 , 19 ) , 29 , 13 ) , 20 , 11 ) , 20 , 13 )

To decode a game number g₀, we start at the initial position and enumerate all possible moves. Then we compute g₁ = g₀ // l, m₀ = g₀ % l, where l is the number of possible moves, '//' is the integer division operator and '%' is the modulus operator. It should hold that g₀ = g₁ + m₀. Next we make the move m₀ and repeat.

From the example above if g₀ = 225833 then g₁ = 225833 // 20 = 11291 and m₀ = 225833 % 20= 13. Next g₂ = 11291 // 20 = 564 and m₁ = 11291 % 20 = 11. Then g₃ = 11291 // 20 = 564 and m₂ = 11291 % 20 = 11. Therefore g₄ = 564 // 29 = 19 and_m_₃ = 564 % 29 = 13. Finally g₅= 19 // 29 = 0 and m₄ = 19 % 29 = 19.

So how many bits are used to encode a game this way?

For simplicity, let's say there are always 35 moves each turn (https://en.wikipedia.org/wiki/Branching_factor) and for the worst case scenario we always pick the biggest one, 34. The number we will get is 34 ⋅ 35^m + 34 ⋅ 35^m-1 + 34 ⋅ 35^m-2 + ⋯ + 34 ⋅ 35 + 34 = 35^m+1 − 1 where _m is the number of moves. To encode 35^m+1 − 1 we need about log₂ ( 35^m+1 ) bits which is about ( m + 1 ) ⋅ log₂ ( 35 ) = 5.129 ⋅ ( m + 1 )

On average m = 80 (40 moves per player) so this would take 416 bits to encode. If we were recording many games we would need a universal encoding since we don't know how many bits each number will need

Worst case when m = 11741 so this would take 60229 bits to encode.

Additional Notes:

Note that g = 0 denotes a valid game, the one where the piece on the lowest square moves to the lowest square it can.

If you want to refer to a specific position in the game you may need to encode the index. This can be added either manually e.g concatenate the index to the game, or add an additional "end" move as the last possible move each turn. This can now account for players conceding, or 2 in a row to denote the players agreed to a draw. This is only necessary if the game did not end in a checkmate or stalemate based on the position, in this case it is implied. In this case it brings the number of bits needed on average to 419 and in the worst case 60706.

A way to handle forks in a what-if scenario is to encode the game up to the fork and then encode each branch starting at the forked position instead of the initial position.

Implementation:

Take a look at my github repo for the code: https://github.com/edggy/ChessCompress

edggy

Posted 2017-07-16T20:01:00.620

Reputation: 227

Python, score = 418581 bytes

This uses a bijective variant of asymmetric numeral systems. Because it’s bijective, you can not only compress any valid list of chess games into a file and expand it back to the same list—you can also expand any file into a valid list of chess games and compress it back to the same file! Perfect for hiding your porn collection.

Requires python-chess. Run with python script.py compress or python script.py expand, both of which will read from standard input and write to standard output.

import chess
import sys

RESULTS = ['1-0\n', '1/2-1/2\n', '0-1\n', '*\n']
BITS = 24

def get_moves(board):
    if board.is_insufficient_material() or not board.legal_moves:
        return [board.result() + '\n']
    else:
        return RESULTS + sorted(
            board.legal_moves,
            key=lambda move: (move.from_square, move.to_square, move.promotion))

def read_bijective():
    buf = bytearray(getattr(sys.stdin, 'buffer', sys.stdin).read())
    carry = 0
    for i in range(len(buf)):
        carry += buf[i] + 1
        buf[i] = carry & 0xff
        carry >>= 8
    if carry:
        buf.append(carry)
    return buf

def write_bijective(buf):
    carry = 0
    for i in range(len(buf)):
        carry += buf[i] - 1
        buf[i] = carry & 0xff
        carry >>= 8
    while carry:
        carry = (carry << 8) + buf.pop() + 1
    getattr(sys.stdout, 'buffer', sys.stdout).write(buf)

def add_carry(buf, carry):
    for i in range(len(buf)):
        if carry == 0:
            break
        carry += buf[i]
        buf[i] = carry & 0xff
        carry >>= 8
    return carry

def do_compress():
    board = chess.Board()
    state = 0
    buf = bytearray()

    games = []
    for sans in sys.stdin:
        game = []
        for san in sans.split(' '):
            move = san if san in RESULTS else board.parse_san(san)
            moves = get_moves(board)
            game.append((len(moves), moves.index(move)))
            if move in RESULTS:
                board.reset()
            else:
                board.push(move)
        games.append(game)

    for game in reversed(games):
        for (n, i) in reversed(game):
            q = ((1 << BITS) - 1 - i) // n + 1
            while state >= q << 8:
                buf.append(state & 0xff)
                state >>= 8
            hi, j = divmod(state, q)
            lo = n * j + i
            state = hi << BITS | lo
        state += add_carry(buf, 1)

    while state:
        buf.append(state & 0xff)
        state >>= 8
    write_bijective(buf)

def do_expand():
    board = chess.Board()
    state = 0
    buf = read_bijective()

    while True:
        while buf and state < 1 << BITS:
            state = state << 8 | buf.pop()
        if state == 0:
            break
        state += add_carry(buf, -1)

        while True:
            moves = get_moves(board)
            while buf and state < 1 << BITS:
                state = state << 8 | buf.pop()
            n = len(moves)
            hi, lo = divmod(state, 1 << BITS)
            j, i = divmod(lo, n)
            q = ((1 << BITS) - 1 - i) // n + 1
            state = j + q * hi
            move = moves[i]
            if move in RESULTS:
                sys.stdout.write(move)
                board.reset()
                break
            else:
                sys.stdout.write(board.san(move).rstrip('+#') + ' ')
                board.push(move)

if __name__ == '__main__':
    {'compress': do_compress, 'expand': do_expand}[sys.argv[1]]()

Anders Kaseorg

Posted 2017-07-16T20:01:00.620

Reputation: 29 242

1Verified on Ubuntu 17.04 using Python 2.7.13 and ran each command separately. – edggy – 2017-07-24T19:40:20.677

This is really great! I am also intrigued what pornographic chess games will be like in practice. – Anush – 2018-11-25T11:46:46.953

Smallest chess game compression

Inspiration:

tl;dr

Objective:

Input/Output:

Scoring:

Naive Solution:

Answers

Python, score = 426508 bytes

Compression Function: ( m+1 ) ⋅ log2 ( b+1 )

Attempt 1:

Attempt 2:

So how many bits are used to encode a game this way?

Additional Notes:

Implementation:

Python, score = 418581 bytes

Compression Function: ( m+1 ) ⋅ log₂ ( b+1 )