Model a probability table using 15-bit fixed probabilities


A histogram is an array of integers that counts the number of times a symbol occurs. For example, for the string "aaabbbabbdebabbbebebdbaaabbabaaaaeebeaabaaba" a histogram is a: 18, b: 18, c: 0, d: 2, e: 6, or [18, 18, 0, 2, 6] in short.

If we were to pick a random character out of this string, what would the chance be that it is a particular character? We can calculate a probability table from the histogram: simply divide every count by the sum of the histogram. So we have an 18/44 chance to get a.
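The calculation above can be sketched in a few lines (the example histogram is the one from the string):

```python
hist = [18, 18, 0, 2, 6]           # counts for a, b, c, d, e
total = sum(hist)                  # 44
probs = [h / total for h in hist]  # real probabilities
print(probs[0])                    # chance of picking 'a' is 18/44
```
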

Probability tables are used in compression algorithms. But compression algorithms have to be fast and always do the same computation on every machine. We can't use ratios, as they're slow and get big (1248148/13468235) and we can't use floating point as rounding errors differ from machine to machine. So we use... fixed point probability!

A 15-bit fixed point probability is a 15-bit integer representing a probability. If n is the integer, then it represents the real probability n / 0x7fff. So 0 is never, 0x7fff is always, and 0x3fff is (nearly) 50%.
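The conversion in both directions can be sketched as follows (`to_fixed` is an illustrative helper, not part of the challenge spec):

```python
SCALE = 0x7fff  # 32767, the fixed-point representation of probability 1

def to_fixed(p):
    """Round a real probability in [0, 1] to 15-bit fixed point (illustrative)."""
    return round(p * SCALE)

print(to_fixed(0.0))  # 0: never
print(to_fixed(1.0))  # 32767 (0x7fff): always
# Going the other way, 0x3fff maps back to 16383/32767, just under 50%.
```
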

So, the challenge is to compute a 15-bit fixed point probability table, given a histogram. But there are a few requirements that the output must fulfill:

  • Every symbol in the histogram has a respective 15-bit fixed point probability in the probability table.
  • A symbol with count 0 in the histogram must have zero probability.
  • A symbol with count > 0 in the histogram must not have zero probability.
  • The cumulative probability must be 1: the probability table must sum to 0x7fff.
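A naive baseline that satisfies all four requirements might look like this (a sketch, not a competitive answer; `naive_table` is a made-up name, and it assumes the largest entry stays positive after the correction):

```python
def naive_table(hist):
    total = sum(hist)
    # Floor-scale each count to 15 bits, but never let a nonzero count become 0.
    res = [max(1, h * 0x7fff // total) if h else 0 for h in hist]
    # Dump the leftover (or over-allocated) units onto the largest entry
    # so the table sums to exactly 0x7fff.
    res[res.index(max(res))] += 0x7fff - sum(res)
    return res

print(naive_table([18, 18, 0, 2, 6]))
```
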

Your score for a certain histogram will be the sum of squared differences between the real probabilities and the 15-bit fixed point approximations.

The score for your answer will be the sum of scores you get on the set of histograms generated by the following Python (2 or 3) program:

import random
random.seed(0)
for _ in range(100):
    l = 1 + min(int(random.expovariate(1/100.)), 1000)
    print([int(random.expovariate(1/10.)) for _ in range(l)])

Alternatively, you can view the test set here. I may change the test set in the future, do not optimize for this particular one.

If you write your probability tables in a Python list format, one per line, the following Python program can score your answer in one fell swoop using python score.py hist.txt prob.txt:

from __future__ import division
import sys
scores = []
with open(sys.argv[1]) as hist, open(sys.argv[2]) as prob:
    skip_empty_lines = lambda it: filter(lambda s: s.strip(), it)
    for hist_l, prob_l in zip(skip_empty_lines(hist), skip_empty_lines(prob)):
        h, p = eval(hist_l), eval(prob_l)  # fresh names: don't shadow the file handles
        total = sum(h)
        scores.append(sum((pi/0x7fff - hi/total)**2 for hi, pi in zip(h, p)))
print(sum(scores))

Lowest score wins, ties broken by asymptotic complexity before first answer.

orlp

Posted 2016-02-20T21:19:13.690

Reputation: 37 067

Answers


Python, score 8.682671033510566e-07, O(n * log(n))

#!/usr/bin/env python
from ast import literal_eval
from fractions import Fraction
from operator import itemgetter
import fileinput

def model(hist):
    TOTAL_FINAL = 0x7fff
    TOTAL_INIT = sum(hist)
    SCALE = Fraction(TOTAL_FINAL, TOTAL_INIT)
    hist_probs = [Fraction(x, TOTAL_INIT) for x in hist]
    # Initial guess: floor-scale to 15 bits, keeping nonzero counts nonzero.
    res = [max(int(x*SCALE), 1) if x > 0 else 0 for x in hist]

    adjust = TOTAL_FINAL - sum(res)
    res_probs = [Fraction(x, TOTAL_FINAL) for x in res]
    # Signed error of each approximation; positive means we undershot.
    d2 = [x - y for x, y in zip(hist_probs, res_probs)]
    if adjust > 0:
        # Hand the missing units to the symbols that undershoot the most.
        idx = [i for i, x in sorted(((i, x) for i, x in enumerate(d2) if hist[i]),
                                    key=itemgetter(1), reverse=True) if x > 0]
        for i in range(adjust):
            res[idx[i % len(idx)]] += 1
    elif adjust < 0:
        # Take the excess units back from the symbols that overshoot the most.
        idx = [i for i, x in sorted(((i, x) for i, x in enumerate(d2) if hist[i]),
                                    key=itemgetter(1)) if x < 0]
        for i in range(-adjust):  # the original `while i < adjust` never ran
            res[idx[i % len(idx)]] -= 1

    return res

if __name__ == '__main__':
    for line in fileinput.input():
        print(model(literal_eval(line)))

This takes input as a file (or sequence of files) with one Python-list-format histogram per line, with the filename(s) provided on the command line (like python prog.py hist.txt).

The general strategy here is to make a (decent) initial guess at the best probability representation, by doing integer scaling of the input histogram to 15 bits. This almost always gets the resultant list just shy of a 32767 total. To adjust the probabilities, the items with the largest undershoot relative to the input histogram get incremented, one unit at a time. If the initial sum overshoots 0x7fff instead, the excess is removed one unit at a time from the items that overshoot the most.
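The undershoot branch of that adjustment can be sketched in isolation (simplified; `adjust_up` is an illustrative name, and the example assumes the initial guess undershoots):

```python
from fractions import Fraction

def adjust_up(hist, res, total=0x7fff):
    # Rank nonzero symbols by how far their fixed-point probability sits
    # below their true probability, using exact arithmetic via Fraction.
    n = sum(hist)
    order = sorted((i for i, h in enumerate(hist) if h),
                   key=lambda i: Fraction(hist[i], n) - Fraction(res[i], total),
                   reverse=True)
    # Hand out the missing probability units, biggest undershoot first.
    for k in range(total - sum(res)):
        res[order[k % len(order)]] += 1
    return res

# Floor-scaled guess for [18, 18, 0, 2, 6] sums to 32765, two units short.
print(adjust_up([18, 18, 0, 2, 6], [13404, 13404, 0, 1489, 4468]))
```
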

Time Complexity

O(n * log(n)), based on the sorting, which is the most time-consuming operation for each histogram.

Mego

Posted 2016-02-20T21:19:13.690

Reputation: 32 998