I had problems unpacking tar and zip files I receive from Windows users. While I do not answer the question "how to create an archive that will work", the scripts below help to unpack tar and zip files correctly regardless of the original OS.
WARNING: you have to tune the source encoding manually (cp1251 and cp866 in the examples below). Command-line options may be a good solution in the future.
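To see why the source encoding matters, here is a minimal Python 3 sketch (with a made-up filename) showing the same raw bytes decoded with the wrong and the right codec:

```python
# A filename as a Windows archiver would store it: raw cp1251 bytes.
raw = "Привет.txt".encode("cp1251")

# Decoding with the wrong codec yields mojibake; the right codec recovers it.
mojibake = raw.decode("latin-1")
correct = raw.decode("cp1251")
print(mojibake)
print(correct)  # Привет.txt
```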
Tar:
#!/usr/bin/env python

import tarfile
import codecs
import sys


def recover(name):
    return codecs.decode(name, 'cp1251')


for tar_filename in sys.argv[1:]:
    tar = tarfile.open(name=tar_filename, mode='r', bufsize=16*1024)
    updated = []
    for m in tar.getmembers():
        m.name = recover(m.name)
        updated.append(m)
    tar.extractall(members=updated)
    tar.close()
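For what it's worth, Python 3's `tarfile` can do this recoding natively via its `encoding` argument, so no manual renaming pass is needed. A sketch that builds a cp1251-named archive in memory and reads it back (GNU format is forced so the name is stored as raw cp1251 bytes, like a legacy Windows archive):

```python
import io
import tarfile

# Build a small tar in memory whose member name is stored as cp1251 bytes
# (GNU format keeps names as raw bytes, like a Windows-created archive).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT,
                  encoding="cp1251") as tar:
    data = b"hello"
    info = tarfile.TarInfo(name="Привет.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Reopen it, telling tarfile which encoding the names actually use.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r", encoding="cp1251") as tar:
    name = tar.getnames()[0]
print(name)  # Привет.txt
```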
Zip:
#!/usr/bin/env python

import zipfile
import os
import codecs
import sys


def recover(name):
    # The raw name bytes are cp866, as stored by Windows archivers.
    return codecs.decode(name, 'cp866')


for filename in sys.argv[1:]:
    archive = zipfile.ZipFile(filename, 'r')
    infolist = archive.infolist()
    for i in infolist:
        f = recover(i.filename)
        print f
        if f.endswith("/"):
            if not os.path.isdir(os.path.dirname(f)):
                os.makedirs(os.path.dirname(f))
        else:
            # Binary mode: archive members are raw bytes.
            open(f, 'wb').write(archive.read(i))
    archive.close()
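As a side note, Python 3's `zipfile` decodes legacy (non-UTF-8-flagged) names as cp437, so the usual recovery trick there is to round-trip through cp437 before decoding with the real codec. A sketch with an in-memory archive (the write step merely simulates how a legacy name appears after reading):

```python
import io
import zipfile

# Simulate a legacy archive: the cp866 bytes of the real name, seen through
# cp437 (which is how Python 3 presents non-UTF-8 names when reading).
real_name = "Привет.txt"
legacy_view = real_name.encode("cp866").decode("cp437")

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr(legacy_view, b"hello")

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as z:
    info = z.infolist()[0]
    # Round-trip through cp437 to recover the raw bytes, then decode correctly.
    fixed = info.filename.encode("cp437").decode("cp866")
print(fixed)  # Привет.txt
```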
UPD 2018-01-02: I use the chardet package to guess the correct encoding of a raw chunk of data. Now the script works out of the box on all my bad archives, as well as on the good ones.
Things to note:
- All filenames are extracted and merged into a single string, to give the encoding-guessing engine a bigger piece of text to work with. It means that a few filenames, each mangled in a different way, may spoil the guess.
- A special fast path handles names that are already unicode (chardet does not work with a normal unicode object).
- Doctests are added to test and to demonstrate that the normalizer recognizes any encoding on reasonably short strings.
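The fast path and the probe merging can be sketched in Python 3 as follows (the `detect` argument is a stand-in for `chardet.detect`, injected so the sketch runs without the package; the stub is hypothetical):

```python
def make_normalizer(probe, detect):
    # Fast path: the probe is already decoded text, so names need no work
    # (a charset detector cannot handle an already-decoded string anyway).
    if isinstance(probe, str):
        return lambda name: name
    # Otherwise guess the encoding once, from the merged probe, and reuse it.
    enc = detect(probe)["encoding"]
    return lambda name: name.decode(enc)


# A stub detector standing in for chardet.detect.
stub_detect = lambda raw: {"encoding": "cp1251", "confidence": 1.0}

names = ["Привет.txt".encode("cp1251"), "Пока.txt".encode("cp1251")]
probe = b"\n".join(names)  # one big sample gives the guesser more to work with
normalize = make_normalizer(probe, stub_detect)
decoded = [normalize(n) for n in names]
print(decoded)  # ['Привет.txt', 'Пока.txt']
```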
Final version:
#!/usr/bin/env python2
# coding=utf-8

import zipfile
import os
import codecs
import sys

import chardet


def make_encoding_normalizer(txt):
    u'''
    Takes raw data and returns a function to normalize encoding of the data.
        * `txt` is either unicode or raw bytes;
        * `chardet` library is used to guess the correct encoding.

    >>> n_unicode = make_encoding_normalizer(u"Привет!")
    >>> print n_unicode(u"День добрый")
    День добрый

    >>> n_cp1251 = make_encoding_normalizer(u"Привет!".encode('cp1251'))
    >>> print n_cp1251(u"День добрый".encode('cp1251'))
    День добрый
    >>> type(n_cp1251(u"День добрый".encode('cp1251')))
    <type 'unicode'>
    '''
    if isinstance(txt, unicode):
        # Fast path: the names are already unicode, nothing to decode.
        return lambda text: text

    enc = chardet.detect(txt)['encoding']
    return lambda file_name: codecs.decode(file_name, enc)


for filename in sys.argv[1:]:
    archive = zipfile.ZipFile(filename, 'r')
    infolist = archive.infolist()
    # Merge all names into one probe so chardet sees a longer text sample.
    probe_txt = "\n".join(i.filename for i in infolist)
    normalizer = make_encoding_normalizer(probe_txt)

    for i in infolist:
        print i.filename
        f = normalizer(i.filename)
        print f
        dirname = os.path.dirname(f)
        if dirname:
            assert os.path.abspath(dirname).startswith(os.path.abspath(".")), \
                "Security violation"
            if not os.path.exists(dirname):
                os.makedirs(dirname)
        if not f.endswith("/"):
            # Binary mode: archive members are raw bytes.
            open(f, 'wb').write(archive.read(i))
    archive.close()

if __name__ == '__main__' and len(sys.argv) == 1:
    # Hack for Python 2.x to support unicode source files as doctest sources.
    reload(sys)
    sys.setdefaultencoding("UTF-8")

    import doctest
    doctest.testmod()
    print "If there are no messages above, the script passes all tests."
Great, thanks! Unfortunately, most users know nothing about 7z, and rar is proprietary :( – kolypto – 2009-10-26T00:29:16.540
Yeah, it's a problem. ZIP is by far the most usable solution for users, as all modern OSes have nice native UI support for it. Unfortunately the charset problem is not really solvable today in ZIP (and even in other archive formats it's still troublesome). – bobince – 2009-10-28T19:51:41.967