Implement cdbmake

4

Write the shortest program to build a cdb database, using the following syntax for input:

+klen,dlen:key->data

followed by a newline. Example:

+3,5:one->Hello
+3,7:two->Goodbye

(Unlike the real cdbmake, your program doesn't need to catch the empty-line-as-end-of-input case if you don't want to.) Your program needs to handle arbitrary strings (including NUL, newline, +, ,, :, and/or ->) in both the key and data correctly; therefore, using any kind of line-based input is likely to be wrong. You may assume the input is always valid, and you can assume that there is at most one instance of any key in the input.

The hash tables your program generates must behave like real hash tables. In other words, the hash tables should provide amortised constant-time lookups.

If your language provides built-in hash tables, you're allowed to use them, if they help any. However, do not use any cdb libraries or cdb programs to solve this question.

If your program writes the cdb to standard output, you may assume that it's redirected to a regular file. For example, you may seek or mmap it.

In the case of multiple shortest entries, the earliest-posted wins (as stipulated by the fastest gun in the west rule).


Here's a complete description of cdb's format as written by cdb's inventor:

A cdb is an associative array: it maps strings (keys) to strings (data).

A cdb contains 256 pointers to linearly probed open hash tables. The hash tables contain pointers to (key,data) pairs. A cdb is stored in a single file on disk:

+----------------+---------+-------+-------+-----+---------+
| p0 p1 ... p255 | records | hash0 | hash1 | ... | hash255 |
+----------------+---------+-------+-------+-----+---------+

Each of the 256 initial pointers states a position and a length. The position is the starting byte position of the hash table. The length is the number of slots in the hash table.

Records are stored sequentially, without special alignment. A record states a key length, a data length, the key, and the data.

Each hash table slot states a hash value and a byte position. If the byte position is 0, the slot is empty. Otherwise, the slot points to a record whose key has that hash value.

Positions, lengths, and hash values are 32-bit quantities, stored in little-endian form in 4 bytes. Thus a cdb must fit into 4 gigabytes.

A record is located as follows. Compute the hash value of the key in the record. The hash value modulo 256 is the number of a hash table. The hash value divided by 256, modulo the length of that table, is a slot number. Probe that slot, the next higher slot, and so on, until you find the record or run into an empty slot.

The cdb hash function is h = ((h << 5) + h) ^ c, with a starting hash of 5381.

Chris Jester-Young

Posted 2011-02-20T02:57:45.333

Reputation: 4 464

Could you be more precise as to where the boundary is between hashes that are accepted and hashes that aren't? – J B – 2011-02-20T11:58:30.817

@J B: Updated the question. :-) Basically, you need to provide amortised constant-time lookups. Once you get a small handful of collisions, it's time to expand the hash table! – Chris Jester-Young – 2011-02-20T16:07:56.453

@Chris but do we need to try and keep the table as small as possible? – Juan – 2011-02-20T23:31:24.120

My understanding of cdb format is that it doesn't offer constant-time lookups in the usual sense (i.e. "linearly probed open hash tables"). The lookup time merely seems fairly constant because only 2 disk seeks are usually required, and the time spent linearly probing those data is insignificant because cpu+ram is much faster than a disk seek (for mechanical disks anyway) – gnibbler – 2011-02-20T23:35:29.703

Also, do read from stdin or a file? – Juan – 2011-02-20T23:35:59.870

@gnibbler: Actually, with a sufficiently sparse hash table, they are indeed amortised constant-time lookups. You want to minimise collisions as much as possible (i.e., find a value of n where (hash >> 8) % n is relatively unique for all elements in the same hash table). The way I understand the format, you can expect n to be somewhat larger than count(elements). – Chris Jester-Young – 2011-02-21T00:09:14.260

@Juan: No, you don't need to try to golf the hash tables. :-) Also, the input can be from a file or stdin, as you please. Just as the output can be to a file or stdout, as you please. – Chris Jester-Young – 2011-02-21T00:12:07.703

@gnibbler: Indeed, due to the "linearly probed" aspect (i.e., the slots are essentially all sized for a single element, with collisions spilling to the next slot), you cannot implement cdb with n < count(elements). – Chris Jester-Young – 2011-02-21T00:12:57.937

Answers

2

Python - 398 chars

import os;from struct import*
Q=256;S=[[]for x in' '*Q];D=[0]*Q;O=Q*8;R=os.read(0,1e9)[1:];J="".join
while R:
 i,R=R.split(":",1);l,m=map(int,i.split(','));k,v,R=R[:l],R[l+2:2+l+m],R[2+l+m:];R=R.partition("+")[2];D+=[pack("II",l,m)+k+v];h=5381
 for c in k:h=h*33^ord(c)%Q**4
 S[h%Q]+=[pack("II",h,O)];O+=len(D[-1])
for i,s in enumerate(S):D[i]=pack("II",O,len(s));D+=[J(s)];O+=len(D[-1])
print J(D)

gnibbler

Posted 2011-02-20T02:57:45.333

Reputation: 14 170

1

Python - 552 characters

from sys import*
from struct import*
y=lambda f,i:stdout.write(pack(f,*i))
u=lambda i:y('<II',i)
i=stdin.read()
n=0
t=[[]for c in' '*256]
q=[]
m=[]
o=2048
w=list.append
while len(i)>n:
 d=i.find(':',n);f,g=map(int,i[n+1:d].split(','));n=d+1+f;k=i[d+1:n];d=i[n+2:n+2+g];h=5831;for c in k:h=(((h<<5)+h)^ord(c))&0xffffffff
 w(t[h%256],(h,o));o+=8+f+g;w(m,map(str,(f,g,k,d)));n+=3+g
for d in t:
 l=len(d)**2;t=[(0,0)]*l;u((o,l));o+=l*8
 for r in d:
  i=r[0]>>8
  while t[i%l][0]:i+=1
  t[i%l]=r
 w(q,t)
for v in m:y('<II'+zip(v,'ss'),v)
for t in q:map(u,t)

Juan

Posted 2011-02-20T02:57:45.333

Reputation: 2 738

q=[]:m=[] can be shortened to q=m=[], no? – Gaffi – 2013-07-05T18:55:47.570

@Gaffi that would make them point to the same array, instead of two different instances – Juan – 2013-07-05T19:06:47.010

Ah, makes sense. Thanks! – Gaffi – 2013-07-05T19:08:17.893