While auditing a database, I found that some attachment content did not match the hashes given in the document's _attachments map.
I tested this by downloading each attachment and calculating its hash. Comparing that to the digest CouchDB reports showed that they did not match. I then noticed that the mismatched attachments were ones that CouchDB was configured to compress. It appears that my CouchDB is configured to use snappy compression:
foobox# grep -E 'file_compression|compressible_types' /etc/couchdb/{default,local}.ini
/etc/couchdb/default.ini:file_compression = snappy
/etc/couchdb/default.ini:compressible_types = text/*, application/javascript, application/json, application/xml
However, when I compress the attachment content using snappy and calculate the hash of the compressed data, it still does not match the CouchDB hash. In my example below, document-25977 is uncompressed (type application/pdf), and the uncompressed hash matches the one provided by CouchDB. The second, document-78608, is a compressible type (text/plain), and the hashes do not match:
foobox$ python hashcompare.py
document-25977
couch len: 142918
couch hash: 028540dd92e1982bcb65c29d32e9617e (md5)
local uncompressed len: 142918
local uncompressed hash: 028540dd92e1982bcb65c29d32e9617e
local compressed len: 132333
local compressed hash: 3157583223dc1a53e1a3386d6abc312d
document-78608
couch len: 2180
couch hash: e613ab6d7f884b835142979489170499 (md5)
local uncompressed len: 2180
local uncompressed hash: 0ab2516c820f5d7afb208e3be7b924dd
local compressed len: 1382
local compressed hash: d9e79232662f57e6af262fc9f867eaf2
This is the script I used to do the comparison:
import couchdb
import snappy
import md5
import base64

server = couchdb.Server('http://localhost:9999')
db = server['program1']

for doc_id in ['document-25977', 'document-78608']:
    print doc_id
    doc = db[doc_id]
    # Each document has one attachment named after the document id
    att_stub = doc['_attachments'][doc_id]
    # The digest field looks like '<hash type>-<base64 digest>'
    hash_type, tmpdigest = att_stub['digest'].split('-', 1)
    att = db.get_attachment(doc, doc_id)
    data = att.read()
    # CouchDB is using snappy compression
    compressed_data = snappy.compress(data)
    print 'couch len: ', att_stub['length']
    print 'couch hash: ', base64.b64decode(tmpdigest).encode('hex'), '(%s)' % hash_type
    print 'local uncompressed len: ', len(data)
    print 'local uncompressed hash: ', md5.md5(data).digest().encode('hex')
    print 'local compressed len: ', len(compressed_data)
    print 'local compressed hash: ', md5.md5(compressed_data).digest().encode('hex')
    print
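The digest handling in the script can also be factored into a small helper, which makes the comparison easier to repeat on other attachments. This is only a minimal sketch, assuming the digest field always has the form <hash type>-<base64 digest> as it does in the stubs above. The names parse_digest and matches are made up for illustration, and it uses hashlib instead of the old md5 module so that it also runs on Python 3.

```python
import base64
import binascii
import hashlib


def parse_digest(stub_digest):
    """Split a digest string of the assumed form '<algo>-<base64>' into (algo, hex)."""
    algo, b64 = stub_digest.split('-', 1)
    raw = base64.b64decode(b64)
    return algo, binascii.hexlify(raw).decode('ascii')


def matches(stub_digest, data):
    """Return True if hashing data with the stub's algorithm reproduces the stub's digest."""
    algo, expected_hex = parse_digest(stub_digest)
    return hashlib.new(algo, data).hexdigest() == expected_hex


if __name__ == '__main__':
    # Build a stub digest locally so the example runs without a CouchDB server.
    sample = b'hello attachment'
    stub = 'md5-' + base64.b64encode(hashlib.md5(sample).digest()).decode('ascii')
    print(parse_digest(stub))
    print(matches(stub, sample))
    print(matches(stub, sample + b'!'))
```

In the real script, the stub would come from att_stub['digest'] and data from att.read().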
I've verified that the documents are uncorrupted when fetched. So what am I missing? I'm not versed enough in Erlang to read the CouchDB source and figure out what is going on. Why would a document's attachment have a digest that does not match its contents, compressed or otherwise?
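One way to widen the test is to hash the content under several encodings instead of only snappy. The sketch below tries the raw bytes plus zlib and gzip at each compression level, using only the Python 3 standard library. Whether CouchDB computes the digest over any of these is purely an assumption here, and the function names are hypothetical. A miss would also not rule an encoding out, since header bytes and compressor settings can differ between implementations.

```python
import gzip
import hashlib
import zlib


def candidate_encodings(data):
    """Yield (label, encoded_bytes) pairs for a few stdlib encodings of data."""
    yield 'identity', data
    for level in range(1, 10):
        yield 'zlib level %d' % level, zlib.compress(data, level)
        # mtime=0 keeps the gzip header deterministic (requires Python 3.8+).
        yield 'gzip level %d' % level, gzip.compress(data, compresslevel=level, mtime=0)


def find_matching_encodings(data, expected_md5_hex):
    """Return the labels of every candidate encoding whose MD5 equals expected_md5_hex."""
    return [label for label, encoded in candidate_encodings(data)
            if hashlib.md5(encoded).hexdigest() == expected_md5_hex]


if __name__ == '__main__':
    # Stand-in data; in practice this would be the fetched attachment body
    # and the hex digest decoded from the attachment stub.
    data = b'some text attachment content\n' * 50
    expected = hashlib.md5(data).hexdigest()
    print(find_matching_encodings(data, expected))
```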