Model a probability table using 15-bit fixed probabilities


A histogram is an array of integers that counts the number of times a symbol occurs. For example, for the string "aaabbbabbdebabbbebebdbaaabbabaaaaeebeaabaaba" a histogram is a: 18, b: 18, c: 0, d: 2, e: 6, or [18, 18, 0, 2, 6] in short.

If we were to pick a random character out of this string, what would the chance be that it is a particular character? We can calculate a probability table from the histogram: simply divide every count by the sum of the histogram. So we have an 18/44 chance to get a.
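The calculation above can be sketched in a few lines (the example histogram is the one from the string):

```python
hist = [18, 18, 0, 2, 6]           # counts for a, b, c, d, e
total = sum(hist)                  # 44
probs = [h / total for h in hist]  # real probabilities
print(probs[0])                    # chance of picking 'a' is 18/44
```
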

Probability tables are used in compression algorithms. But compression algorithms have to be fast and always do the same computation on every machine. We can't use ratios, as they're slow and get big (1248148/13468235) and we can't use floating point as rounding errors differ from machine to machine. So we use... fixed point probability!

A 15-bit fixed point probability is a 15-bit integer representing a probability. If n is the integer, then it represents the real probability n / 0x7fff. So 0 is never, 0x7fff is always, and 0x3fff is (nearly) 50%.
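The conversion in both directions can be sketched as follows (`to_fixed` is an illustrative helper, not part of the challenge spec):

```python
SCALE = 0x7fff  # 32767, the fixed-point representation of probability 1

def to_fixed(p):
    """Round a real probability in [0, 1] to 15-bit fixed point (illustrative)."""
    return round(p * SCALE)

print(to_fixed(0.0))  # 0: never
print(to_fixed(1.0))  # 32767 (0x7fff): always
# Going the other way, 0x3fff maps back to 16383/32767, just under 50%.
```
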

So, the challenge is to compute a 15-bit fixed point probability table, given a histogram. But there are a few requirements that the output must fulfill:

  • Every symbol in the histogram has a respective 15-bit fixed point probability in the probability table.
  • A symbol with count 0 in the histogram must have zero probability.
  • A symbol with count > 0 in the histogram must not have zero probability.
  • The cumulative probability must be 1: the probability table must sum to 0x7fff.
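A naive baseline that satisfies all four requirements might look like this (a sketch, not a competitive answer; `naive_table` is a made-up name, and it assumes the largest entry stays positive after the correction):

```python
def naive_table(hist):
    total = sum(hist)
    # Floor-scale each count to 15 bits, but never let a nonzero count become 0.
    res = [max(1, h * 0x7fff // total) if h else 0 for h in hist]
    # Dump the leftover (or over-allocated) units onto the largest entry
    # so the table sums to exactly 0x7fff.
    res[res.index(max(res))] += 0x7fff - sum(res)
    return res

print(naive_table([18, 18, 0, 2, 6]))
```
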

Your score for a certain histogram will be the sum of squared differences between the real probabilities and the 15-bit fixed point approximations.

The score for your answer will be the sum of scores you get on the set of histograms generated by the following Python (2 or 3) program:

import random
random.seed(0)
for _ in range(100):
    l = 1 + min(int(random.expovariate(1/100.)), 1000)
    print([int(random.expovariate(1/10.)) for _ in range(l)])

Alternatively, you can view the test set here. I may change the test set in the future, do not optimize for this particular one.

If you write your probability tables in a Python list format, one per line, the following Python program can score your answer in one fell swoop using python score.py hist.txt prob.txt:

from __future__ import division
import sys
scores = []
with open(sys.argv[1]) as hist, open(sys.argv[2]) as prob:
    skip_empty_lines = lambda it: filter(lambda s: s.strip(), it)
    for hist_l, prob_l in zip(skip_empty_lines(hist), skip_empty_lines(prob)):
        h, p = eval(hist_l), eval(prob_l)  # fresh names: don't shadow the file handles
        total = sum(h)
        scores.append(sum((pi/0x7fff - hi/total)**2 for hi, pi in zip(h, p)))
print(sum(scores))

Lowest score wins, ties broken by asymptotic complexity before first answer.

orlp

Posted 2016-02-20T21:19:13.690

Reputation: 37 067

Answers


Python, score 8.682671033510566e-07, O(n * log(n))

#!/usr/bin/env python
from ast import literal_eval
from fractions import Fraction
from operator import itemgetter
import fileinput

def model(hist):
    TOTAL_FINAL = 0x7fff
    TOTAL_INIT = sum(hist)
    SCALE = Fraction(TOTAL_FINAL, TOTAL_INIT)
    hist_probs = [Fraction(x, TOTAL_INIT) for x in hist]
    # Initial guess: floor-scale to 15 bits, keeping nonzero counts nonzero.
    res = [max(int(x*SCALE), 1) if x > 0 else 0 for x in hist]

    adjust = TOTAL_FINAL - sum(res)
    res_probs = [Fraction(x, TOTAL_FINAL) for x in res]
    # Signed error of each approximation; positive means we undershot.
    d2 = [x - y for x, y in zip(hist_probs, res_probs)]
    if adjust > 0:
        # Hand the missing units to the symbols that undershoot the most.
        idx = [i for i, x in sorted(((i, x) for i, x in enumerate(d2) if hist[i]),
                                    key=itemgetter(1), reverse=True) if x > 0]
        for i in range(adjust):
            res[idx[i % len(idx)]] += 1
    elif adjust < 0:
        # Take the excess units back from the symbols that overshoot the most.
        idx = [i for i, x in sorted(((i, x) for i, x in enumerate(d2) if hist[i]),
                                    key=itemgetter(1)) if x < 0]
        for i in range(-adjust):  # the original `while i < adjust` never ran
            res[idx[i % len(idx)]] -= 1

    return res

if __name__ == '__main__':
    for line in fileinput.input():
        print(model(literal_eval(line)))

This takes input as a file (or sequence of files) with one Python-list-format histogram per line, with the filename(s) provided on the command line (like python prog.py hist.txt).

The general strategy here is to make a (decent) initial guess at the best probability representation, by doing integer scaling of the input histogram to 15 bits. This almost always gets the resultant list just shy of a 32767 total. To adjust the probabilities, the items with the largest undershoot relative to the input histogram get incremented, one unit at a time. If the initial sum overshoots 0x7fff instead, the excess is removed one unit at a time from the items that overshoot the most.
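The undershoot branch of that adjustment can be sketched in isolation (simplified; `adjust_up` is an illustrative name, and the example assumes the initial guess undershoots):

```python
from fractions import Fraction

def adjust_up(hist, res, total=0x7fff):
    # Rank nonzero symbols by how far their fixed-point probability sits
    # below their true probability, using exact arithmetic via Fraction.
    n = sum(hist)
    order = sorted((i for i, h in enumerate(hist) if h),
                   key=lambda i: Fraction(hist[i], n) - Fraction(res[i], total),
                   reverse=True)
    # Hand out the missing probability units, biggest undershoot first.
    for k in range(total - sum(res)):
        res[order[k % len(order)]] += 1
    return res

# Floor-scaled guess for [18, 18, 0, 2, 6] sums to 32765, two units short.
print(adjust_up([18, 18, 0, 2, 6], [13404, 13404, 0, 1489, 4468]))
```
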

Time Complexity

O(n * log(n)), based on the sorting, which is the most time-consuming operation for each histogram.

Mego

Posted 2016-02-20T21:19:13.690

Reputation: 32 998