decompress ZIP with given encoding

26

17

I have ZIP files containing files whose names are stored in some encoding. I know which encoding was used for those filenames, but I still don't know how to decompress the archives so that the names come out correctly.

Here is an example file; it contains one file, "【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12.ass".

I know the encoding used is GB18030 (Chinese).

The question is: how do I unpack that file on FreeBSD using unzip or another CLI utility so that the filename is correctly encoded? I tried everything I could, but the result was never right. Please help.

I tried on OSX:

MBP1:test 2ge$ bsdtar xf gb18030.zip
MBP1:test 2ge$ ls
%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12/      gb18030.zip
MBP1:test 2ge$ cd %A1%BESSK%D7%D6Ļ%D7顿The\ Vampire\ Diaries\ %CE%FCѪ%B9%ED%C8ռ%C7S06E12/
MBP1:%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12 2ge$ ls
%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12.ass*
MBP1:%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12 2ge$ find . | iconv -f gb18030 -t utf-8
.
./%A1%BESSK%D7%D6L抬%D7椤縏he Vampire Diaries %CE%FC血%B9%ED%C8占%C7S06E12.ass 
MBP1:%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12 2ge$ convmv -r -f gb18030 -t utf-8 --notest .
Skipping, already UTF-8: ./%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12.ass
Ready!

I tried something similar with unzip, but I got a similar problem.

Thanks. I am now trying on FreeBSD, which I connect to over SSH from OS X (Terminal):

# locale
LANG=
LC_CTYPE="C"
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=C

The first thing I would like to do is display the Chinese names properly. I changed

setenv LC_ALL zh_CN.GB18030
setenv LANG zh_CN.GB18030

Then I downloaded the file and tried "ls" to see the proper characters, but no luck. So I think I first have to get the Chinese locale working, so that I can verify the result by comparing it with the expected name. Can you please help me with this as well?

2ge

Posted 2015-02-02T10:46:20.857

Reputation: 261

Answers

22

Here's what I do on Ubuntu 16.04 to unzip a zip in any encoding, as long as I know what that encoding is. The same method should work on FreeBSD because it only relies on the widely available unzip tool.

  1. I double-check the exact name of the encoding, so as not to misspell it: https://www.iana.org/assignments/character-sets/character-sets.xhtml

  2. I simply run

    $ unzip -O <encoding> <filename> -d <target_dir>
    

    or

    $ unzip -I <encoding> <filename> -d <target_dir>
    

    choosing between -O or -I according to instructions here:

    $ unzip -h
    UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
      ...
      -O CHARSET  specify a character encoding for DOS, Windows and OS/2 archives
      -I CHARSET  specify a character encoding for UNIX and other archives
      ...
    

    which means that I simply try -O and it should work, because not a lot of people would create a .zip file in Unix...


So, for your specific example:

  1. The exact encoding name is GB18030.

  2. I use the -O flag and:

    $ unzip -O GB18030 gb18030.zip -d target_dir
    Archive:  gb18030.zip
       creating: target_dir/【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12/
      inflating: target_dir/【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12/【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12.ass
    

    ... it works.
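
To see why naming the charset matters, here is a minimal Python sketch of the underlying idea. It is not how unzip is implemented. It assumes the archive stores the name as raw GB18030 bytes and that a charset-unaware tool falls back to CP437, the ZIP format's historical default; the function name `redecode` is made up for this illustration.

```python
def redecode(garbled: str, actual: str, assumed: str = "cp437") -> str:
    """Undo a wrong decoding: recover the raw bytes, then decode them properly."""
    return garbled.encode(assumed).decode(actual)


name = "【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12.ass"
raw = name.encode("gb18030")     # bytes as a GB18030 system would store them
garbled = raw.decode("cp437")    # what a tool assuming CP437 would display

# The raw bytes are untouched by the wrong guess, so the name is recoverable.
print(redecode(garbled, "gb18030") == name)  # True
```

The point is that the bytes in the archive are never wrong; only the charset used to interpret them is, which is what the -O and -I options let you correct.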

mbdevpl

Posted 2015-02-02T10:46:20.857

Reputation: 321

For zips created by Greek Windows I had success with this method and encoding CP737 – ndemou – 2017-09-21T09:01:18.967

Bravo! I double-checked the man page: the option actually works but is totally undocumented, and none of the zsh completions include this parameter. – ttimasdf – 2018-03-29T06:46:46.533

unzip does not have this option in Mac OS X and always creates percent-encoded filenames. @javacom's unar suggestion worked like a charm. – Phil Krylov – 2018-04-10T18:48:42.447

Looks like Debian-specific functionality. My unzip reports itself as UnZip 6.00 of 20 April 2009, by Info-ZIP. Maintained by C. Spieler and doesn't provide such options. – L29Ah – 2019-04-11T19:06:11.917

@L29Ah My unzip in Debian 9 is exactly the same version and has no such options. Probably Ubuntu specific? – Arnie97 – 2019-04-16T14:20:48.907

@Arnie97 and L29Ah: The unzip on CentOS 7.6.1810 (not Debian family) reports itself as UnZip 6.00 of 20 April 2009, by Info-ZIP. Maintained by C. Spieler. and it has these options. – mbdevpl – 2019-04-18T01:31:57.377

Why is this not the accepted answer? – Wang – 2019-04-25T10:33:45.523

You can use -O option on any distributions. First, download the source by apt source unzip on Ubuntu (live environment is enough). Second, copy the unzip-6.0 directory to your system. Third, cd into the directory. Finally, execute sudo make --file=unix/Makefile generic && sudo make --file=unix/Makefile install to compile and install. The default prefix is /usr/local (not just /usr). For the detailed explanation, read README and INSTALL. This procedure is confirmed on Arch Linux, whose original unzip doesn't supply -O option. – ynn – 2019-09-19T16:33:33.710

@ynn Or you can pick only unzip-6.0/debian/patches/20-unzip60-alt-iconv-utf8.patch and apply it to an official source by Info-ZIP and then compile and install. This procedure is also confirmed on Arch Linux. (On Arch, you can asp checkout unzip and then makepkg -o and then apply the patch and then makepkg -ei.) – ynn – 2019-09-19T18:04:25.800

11

On most POSIX filesystems the filename is just a series of bytes and it's up to userspace to make any sense of it. You can use this to your advantage.

  1. First, extract the archive using bsdtar, since the unzip tool seems to mangle the file names, while bsdtar will extract them raw. (I'm testing this on Linux. I guess FreeBSD just calls it tar.)

    $ bsdtar xf gb18030.zip
    
  2. Verify that tools like iconv can successfully decode the names:

    $ find . | iconv -f gb18030 -t utf-8
    

    (Note that this only affects the find output, not files themselves.)

  3. Finally use convmv to convert the file names to UTF-8:

    $ convmv -r -f gb18030 -t utf-8 --notest .
    

    (Note: I had to install Encode::HanExtra from CPAN for the GB18030 support, and manually add use Encode::HanExtra; to /usr/bin/convmv even though it's supposed to

  4. In case convmv is unavailable, script it:

    $ find . -depth | while IFS= read -r old; do
        old=./$old;
        head=${old%/*};
        tail=${old##*/};
        new=$head/$(printf '%s' "$tail" | iconv -f gb18030 -t utf-8);
        [ "$old" = "$new" ] || mv "$old" "$new";
    done
    

    (At least on Linux, this has an advantage in that iconv is almost always available, and it always supports gb18030.)
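
The rename step can also be sketched in Python, which may help where neither convmv nor a suitable iconv is at hand. This is a rough equivalent under stated assumptions, not a replacement for convmv: the function name `rename_tree` is made up here, the filesystem is assumed to store names as plain bytes (as on Linux), and unlike convmv the sketch makes no attempt to detect names that are already UTF-8, so it should be run only once on a freshly extracted tree.

```python
import os


def rename_tree(root: bytes, source_encoding: str = "gb18030") -> list:
    """Rename every entry below root from source_encoding to UTF-8.

    Returns a list of (old_name, new_name) byte pairs that were renamed.
    """
    renamed = []
    # topdown=False visits children before their parent directory, like
    # `find -depth`, so parent paths are still valid when entries are renamed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            try:
                new_name = name.decode(source_encoding).encode("utf-8")
            except UnicodeDecodeError:
                continue  # not valid in the source encoding; leave it alone
            if new_name != name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, new_name))
                renamed.append((name, new_name))
    return renamed
```

Passing the root as bytes (for example `rename_tree(b".")`) makes `os.walk` return raw byte names, so nothing is decoded with the locale's charset before the conversion happens.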

user1686

Posted 2015-02-02T10:46:20.857

Reputation: 283 655

Thanks, grawity, for looking into this. I am testing right now on OSX (but that's really close to FreeBSD, and I think the result will be similar). I'm adding a comment to my question, since I cannot edit here... – 2ge – 2015-02-02T17:08:04.233

@2ge: Ah, OSX might actually be quite different, as HFS+ internally forces file names into NFD UTF-16 rather than storing bytestrings, so there's a possibility that it'll corrupt the GB18030 names before you get a chance to convert them. – user1686 – 2015-02-02T17:25:49.280

I edited the original question and added some more comments. – 2ge – 2015-02-05T17:21:52.453

Yeah, I tried it on macOS Sierra and bsdtar reported lots of "Failed to create xxx" errors (because the parent directory names are corrupt). Had to copy my archive to a Linux VPS, use unzip -O to extract it, and copy the result back to my Mac using ssh -C. – Chang Qian – 2017-09-25T09:57:36.070

10

Method 1 : use unar utility

sudo apt-get install unar

unar -e gb18030 gb18030.zip

Method 2 : Use a Python 3 script to unzip the file (adapted from https://gist.github.com/usunyu/dfc6e56af6e6caab8018bef4c3f3d452#file-gbk-unzip-py )

#!/usr/bin/env python3
# unzip-gbk.py: extract a ZIP whose filenames are stored in a legacy encoding

import argparse
import os
import zipfile

parser = argparse.ArgumentParser()
parser.add_argument("--encoding", default="gbk",
                    help="encoding for filename, default gbk")
parser.add_argument("-l", action="store_true",
                    help="list filenames in zipfile, do not unzip")
parser.add_argument("file", help="process file.zip")
args = parser.parse_args()
print("Processing File " + args.file)
print("Encoding " + args.encoding)

with zipfile.ZipFile(args.file, "r") as archive:
    for info in archive.infolist():
        if info.flag_bits & 0x800:
            # The entry is flagged as UTF-8, so Python decoded it correctly.
            name = info.filename
        else:
            # Python 3 decodes unflagged names as CP437. Undo that to get
            # the raw bytes back, then decode them with the real encoding.
            name = info.filename.encode("cp437").decode(args.encoding)
        if args.l:
            print("Filename " + name)
            continue
        print("Extracting " + name)
        if name.endswith("/"):
            # Directory entry: create it and move on.
            os.makedirs(name, exist_ok=True)
            continue
        pathname = os.path.dirname(name)
        if pathname:
            os.makedirs(pathname, exist_ok=True)
        if not os.path.exists(name):
            # Binary mode keeps the file contents byte for byte.
            with open(name, "wb") as fo:
                fo.write(archive.read(info))

The example gb18030.zip will extract the following file

【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12
【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12/【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12.ass

javacom

Posted 2015-02-02T10:46:20.857

Reputation: 101

Thank you, the unar method is the most hassle-free at least on Mac OS X. – Phil Krylov – 2018-04-10T18:44:15.007

4

7z supports charset ID with a switch -scs, e.g.:

7z x -scs903 some.zip

where 903 is the Simplified Chinese (中文簡體) charset. A longer list of charset IDs can be found here.

ohho

Posted 2015-02-02T10:46:20.857

Reputation: 2 356

The 7z -scs switch chooses only the encoding of the @-defined file list. – Phil Krylov – 2018-04-10T18:51:41.107

4

On OS X, you can use a GUI application called The Unarchiver. It can be installed from the Mac App Store or with Homebrew Cask:

brew install --cask the-unarchiver

When you open a ZIP file with it, the application lets you choose the appropriate encoding using a preview of a filename from the archive.

Melebius

Posted 2015-02-02T10:46:20.857

Reputation: 1 145

1

Use 7z to extract the file

7z x yourfile.zip

After that, convert the encoding of those filenames yourself:

convmv --notest -f from_encoding -t utf-8 -r your_extracted_folder/

This works for me. In my case from_encoding is tis-620 (a Thai encoding); you need to find the appropriate encoding for your language. A popular one usually solves the problem, but if the filename is still unreadable, try changing from_encoding to something else, such as windows-1252 or shift-jis (Japanese). You can list the available encodings with these commands:

convmv --list
iconv --list

For me, this is a very simple way to solve the problem.
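
When the source encoding is unknown, the trial-and-error step can be narrowed down before touching any files. The following is only a sketch: `try_encodings` and its candidate list are made up for this illustration, and a name that decodes without error is not necessarily correct, so the results still need to be checked by eye.

```python
def try_encodings(raw: bytes,
                  candidates=("utf-8", "gb18030", "shift_jis",
                              "tis-620", "cp1252")) -> dict:
    """Return {encoding: decoded_text} for each candidate that decodes raw."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Either the bytes are invalid in this encoding or Python does
            # not know the encoding name; skip it in both cases.
            continue
    return results
```

Feed it the raw bytes of one extracted filename (for instance an entry of `os.listdir(b".")`), print each candidate decoding, and pass the encoding whose output is readable to convmv as from_encoding.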

off99555

Posted 2015-02-02T10:46:20.857

Reputation: 111

0

I just used 7-Zip and it managed to pick the right encoding (something that the standard zip tool couldn't do).

However, I used it on Windows, with the GUI tool. Maybe the command-line 7z will work for you, too.

Berry Tsakala

Posted 2015-02-02T10:46:20.857

Reputation: 1 049

There is an answer recommending 7z and your answer adds nothing more to it. – Melebius – 2018-06-01T06:30:28.567

Yes, there is *now* another answer recommending 7z.  You can hardly expect Berry’s answer to “add more” to an answer that was posted almost five months later. – Scott – 2018-06-01T07:15:32.867

@Scott My apologies, I failed to read the English month abbreviations correctly. – Melebius – 2018-06-04T14:30:43.253

OK. You might want to know that, if you put your mouse pointer over any date on the page (and “hover” there), it will show you the date as numbers. (At least this works on computers; people say it doesn’t work well on phones.) Also, below the bottom right corner of the question, you will see “active  oldest  votes”. This is answer sort order. If you click on “oldest”, then you will get the answers in order from oldest to newest. – Scott – 2018-06-04T15:59:33.390