I had problems unpacking tar and zip files I receive from Windows users. While I do not answer the question "how to create an archive that will work", the scripts below help to unpack tar and zip files correctly regardless of the original OS.
WARNING: you have to tune the source encoding manually (cp1251 and cp866 in the examples below). Command-line options may be a good solution in the future.
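To see why the source encoding matters, here is a minimal Python 3 sketch (with a made-up filename) showing the same raw bytes decoded with the wrong and the right codec:

```python
# A filename as a Windows archiver would store it: raw cp1251 bytes.
raw = "Привет.txt".encode("cp1251")

# Decoding with the wrong codec yields mojibake; the right codec recovers it.
mojibake = raw.decode("latin-1")
correct = raw.decode("cp1251")
print(mojibake)
print(correct)  # Привет.txt
```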
Tar:
#!/usr/bin/env python

import tarfile
import codecs
import sys


def recover(name):
    return codecs.decode(name, 'cp1251')


for tar_filename in sys.argv[1:]:
    tar = tarfile.open(name=tar_filename, mode='r', bufsize=16*1024)
    updated = []
    for m in tar.getmembers():
        m.name = recover(m.name)
        updated.append(m)
    tar.extractall(members=updated)
    tar.close()
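For what it's worth, Python 3's `tarfile` can do this recoding natively via its `encoding` argument, so no manual renaming pass is needed. A sketch that builds a cp1251-named archive in memory and reads it back (GNU format is forced so the name is stored as raw cp1251 bytes, like a legacy Windows archive):

```python
import io
import tarfile

# Build a small tar in memory whose member name is stored as cp1251 bytes
# (GNU format keeps names as raw bytes, like a Windows-created archive).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT,
                  encoding="cp1251") as tar:
    data = b"hello"
    info = tarfile.TarInfo(name="Привет.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Reopen it, telling tarfile which encoding the names actually use.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r", encoding="cp1251") as tar:
    name = tar.getnames()[0]
print(name)  # Привет.txt
```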
Zip:
#!/usr/bin/env python

import zipfile
import os
import codecs
import sys


def recover(name):
    # The raw name bytes are cp866, as stored by Windows archivers.
    return codecs.decode(name, 'cp866')


for filename in sys.argv[1:]:
    archive = zipfile.ZipFile(filename, 'r')
    infolist = archive.infolist()
    for i in infolist:
        f = recover(i.filename)
        print f
        if f.endswith("/"):
            if not os.path.isdir(os.path.dirname(f)):
                os.makedirs(os.path.dirname(f))
        else:
            # Binary mode: archive members are raw bytes.
            open(f, 'wb').write(archive.read(i))
    archive.close()
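As a side note, Python 3's `zipfile` decodes legacy (non-UTF-8-flagged) names as cp437, so the usual recovery trick there is to round-trip through cp437 before decoding with the real codec. A sketch with an in-memory archive (the write step merely simulates how a legacy name appears after reading):

```python
import io
import zipfile

# Simulate a legacy archive: the cp866 bytes of the real name, seen through
# cp437 (which is how Python 3 presents non-UTF-8 names when reading).
real_name = "Привет.txt"
legacy_view = real_name.encode("cp866").decode("cp437")

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr(legacy_view, b"hello")

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as z:
    info = z.infolist()[0]
    # Round-trip through cp437 to recover the raw bytes, then decode correctly.
    fixed = info.filename.encode("cp437").decode("cp866")
print(fixed)  # Привет.txt
```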
UPD 2018-01-02: I use the chardet package to guess the correct encoding of a raw chunk of data. Now the script works out of the box on all my bad archives, as well as on the good ones.
Things to note:
- All filenames are extracted and merged into a single string, to give the encoding-guessing engine a bigger piece of text to work with. It means that a few filenames, each mangled in a different way, may spoil the guess.
- A special fast path handles names that are already unicode (chardet does not work with a normal unicode object).
- Doctests are added to test and to demonstrate that the normalizer recognizes any encoding on reasonably short strings.
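The fast path and the probe merging can be sketched in Python 3 as follows (the `detect` argument is a stand-in for `chardet.detect`, injected so the sketch runs without the package; the stub is hypothetical):

```python
def make_normalizer(probe, detect):
    # Fast path: the probe is already decoded text, so names need no work
    # (a charset detector cannot handle an already-decoded string anyway).
    if isinstance(probe, str):
        return lambda name: name
    # Otherwise guess the encoding once, from the merged probe, and reuse it.
    enc = detect(probe)["encoding"]
    return lambda name: name.decode(enc)


# A stub detector standing in for chardet.detect.
stub_detect = lambda raw: {"encoding": "cp1251", "confidence": 1.0}

names = ["Привет.txt".encode("cp1251"), "Пока.txt".encode("cp1251")]
probe = b"\n".join(names)  # one big sample gives the guesser more to work with
normalize = make_normalizer(probe, stub_detect)
decoded = [normalize(n) for n in names]
print(decoded)  # ['Привет.txt', 'Пока.txt']
```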
Final version:
#!/usr/bin/env python2
# coding=utf-8

import zipfile
import os
import codecs
import sys

import chardet


def make_encoding_normalizer(txt):
    u'''
    Takes raw data and returns a function to normalize encoding of the data.
        * `txt` is either unicode or raw bytes;
        * `chardet` library is used to guess the correct encoding.

    >>> n_unicode = make_encoding_normalizer(u"Привет!")
    >>> print n_unicode(u"День добрый")
    День добрый

    >>> n_cp1251 = make_encoding_normalizer(u"Привет!".encode('cp1251'))
    >>> print n_cp1251(u"День добрый".encode('cp1251'))
    День добрый
    >>> type(n_cp1251(u"День добрый".encode('cp1251')))
    <type 'unicode'>
    '''
    if isinstance(txt, unicode):
        # Fast path: the names are already unicode, nothing to decode.
        return lambda text: text

    enc = chardet.detect(txt)['encoding']
    return lambda file_name: codecs.decode(file_name, enc)


for filename in sys.argv[1:]:
    archive = zipfile.ZipFile(filename, 'r')
    infolist = archive.infolist()
    # Merge all names into one probe so chardet sees a longer text sample.
    probe_txt = "\n".join(i.filename for i in infolist)
    normalizer = make_encoding_normalizer(probe_txt)

    for i in infolist:
        print i.filename
        f = normalizer(i.filename)
        print f
        dirname = os.path.dirname(f)
        if dirname:
            assert os.path.abspath(dirname).startswith(os.path.abspath(".")), \
                "Security violation"
            if not os.path.exists(dirname):
                os.makedirs(dirname)
        if not f.endswith("/"):
            # Binary mode: archive members are raw bytes.
            open(f, 'wb').write(archive.read(i))
    archive.close()

if __name__ == '__main__' and len(sys.argv) == 1:
    # Hack for Python 2.x to support unicode source files as doctest sources.
    reload(sys)
    sys.setdefaultencoding("UTF-8")

    import doctest
    doctest.testmod()
    print "If there are no messages above, the script passes all tests."
Great, thanks! Unfortunately, most users know nothing about 7z, and rar is proprietary :( – kolypto – 2009-10-26T00:29:16.540
Yeah, it's a problem. ZIP is by far the most usable solution for users, as all modern OSes have nice native UI support for it. Unfortunately the charset problem is not really solvable today in ZIP (and even in other archive formats it's still troublesome). – bobince – 2009-10-28T19:51:41.967