Recursively extract ZIP archives while skipping files of a specific format

0

UPDATE: Many people are viewing this thread, which suggests the situation is not so rare after all. I also asked a related question on SO here; it has some decent solutions that may solve the problem in a better way.

On my Windows 7 machine, I have a directory full of downloaded dumps in ZIP archives. Each archive contains a few text files, PDFs and, rarely, XML files. I want to extract the contents of each ZIP archive into its own folder (which must be created during the process) while skipping the PDFs. After the required files have been extracted, the processed ZIP must not be deleted (or I would like to know how I can control this in different situations).

If it helps to know, the directory holds roughly 60k-70k archives. I also need separate output directories because files in one archive may have the same names as files in another.

For example,

  • All my archives (one.zip, two.zip, ...) are in, say, D:\data
  • I create a new folder for the processed data, say, D:\extracted
  • The data from D:\data\one.zip should go to D:\extracted\one. Here, D:\extracted\one should be created automatically.
  • Throughout the extraction, any PDFs encountered should be skipped rather than extracted. There's no point in extracting them and then deleting them.
  • (Optional) A log file should be maintained at, say, D:\extracted. The idea is to use this file to resume processing where it left off in case of an error.
  • (Optional) The script should let me decide whether to keep the source archives or delete them after processing.

I searched for a solution but couldn't find one. I came across a few questions like these

  1. Recursively unzip files where they reside, then delete the archives
  2. 7 zip extract recursively
  3. Is it possible to recursively list zip file contents with 7 zip without extracting

but they were not of much help (I'm not a pro with Windows, by the way). I'm open to installing safe, ad-free, open-source third-party software such as 7-Zip.

EDIT: Is there a readily available tool that does what I need? I already tried Multi Unpacker. It doesn't create new directories and it can't ignore *.pdf files. It's also slow to start; I think it first reads all the archives at the source before starting to process them.

Thanks in advance!

Fr0zenFyr

Posted 2014-06-18T06:41:55.133

Reputation: 111

related: http://superuser.com/q/321829/243637

– Fr0zenFyr – 2017-01-25T10:01:59.913

I don't see any way around this without a batch or PowerShell script; as far as I know, there is no out-of-the-box solution for something like this. – private_meta – 2014-06-18T06:54:02.687

@private_meta Thanks for your response. I had already guessed as much, but it's good to be sure. Can you point me in the right direction for writing a PowerShell script for this? I also understand that ignoring PDFs during extraction is a huge challenge, so I'm ready to let the script extract everything and then delete the PDFs. – Fr0zenFyr – 2014-06-18T07:34:12.977

Answers

1

Modified from the answer found here, this PowerShell script should do what you want. Save it as a file with the extension ".ps1" and call it as ./filename.ps1. It will extract the files to separate folders, delete the zip files and remove all files with the .pdf extension. I have not tested whether it works properly with recursive paths, but it should; please test it.

Edit: If you don't want your zip files to be deleted, remove or comment out (#) the line rmdir -Path $_.FullName -Force

Requirements: PowerShell and 7-Zip. You also need to set the 7-Zip path in the file.

param([string]$folderPath="D:\Blah\files")

# Walk the folder tree and handle every item whose name ends in .zip
Get-ChildItem $folderPath -recurse | %{ 

    if($_.Name -match '\.zip$')
    {
        $parent="$(Split-Path $_.FullName -Parent)";    
        write-host "Extracting $($_.FullName) to $parent\$($_.BaseName)"

        # 7-Zip "e" command: extract into <parent>\<archive name without extension>
        $arguments=@("e", "`"$($_.FullName)`"", "-o`"$($parent)\$($_.BaseName)`"");
        $ex = start-process -FilePath "`"C:\Program Files\7-Zip\7z.exe`"" -ArgumentList $arguments -wait -PassThru;

        if( $ex.ExitCode -eq 0)
        {
            write-host "Extraction successful, deleting $($_.FullName)"
            # Delete the source archive (rmdir is an alias of Remove-Item)
            rmdir -Path $_.FullName -Force
            # Delete the PDFs that were just extracted
            $arguments1="$($parent)\$($_.BaseName)\*.pdf"
            rmdir -Recurse -Path $arguments1
        }
    }
}
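
A variation on the script above, offered as a sketch rather than a tested drop-in: 7-Zip's x command keeps the folder structure inside each archive, and its -xr!*.pdf switch excludes PDFs while extracting, so nothing has to be deleted afterwards. The names Get-ExtractArguments and Expand-ZipsWithoutPdf, their parameters, and the default paths (borrowed from the question's example) are assumptions made for illustration.

```powershell
# Builds the 7-Zip argument list for one archive.
#   x          extract with full paths (keeps sub-folders inside the archive)
#   -o<dir>    output directory (7-Zip creates it if it is missing)
#   -xr!*.pdf  exclude *.pdf at any depth, so PDFs are never written to disk
#   -y         answer "yes" to prompts so a long batch run does not stall
function Get-ExtractArguments([string]$ZipPath, [string]$DestDir) {
    return @('x', $ZipPath, "-o$DestDir", '-xr!*.pdf', '-y')
}

# Hypothetical wrapper: extracts every .zip under $SourceDir into
# $TargetDir\<archive name>. Source archives are kept unless -DeleteSource is given.
function Expand-ZipsWithoutPdf {
    param(
        [string]$SourceDir = 'D:\data',
        [string]$TargetDir = 'D:\extracted',
        [string]$SevenZip  = 'C:\Program Files\7-Zip\7z.exe',  # adjust to your install
        [switch]$DeleteSource
    )

    Get-ChildItem -Path $SourceDir -Recurse |
        Where-Object { -not $_.PSIsContainer -and $_.Extension -eq '.zip' } |
        ForEach-Object {
            $dest    = Join-Path $TargetDir $_.BaseName
            $zipArgs = Get-ExtractArguments $_.FullName $dest

            # Call 7-Zip directly; PowerShell passes each array element as one argument
            & $SevenZip $zipArgs | Out-Null

            if ($LASTEXITCODE -eq 0) {
                Write-Host "Extracted $($_.FullName) to $dest"
                if ($DeleteSource) { Remove-Item -LiteralPath $_.FullName -Force }
            } else {
                Write-Warning "7-Zip returned exit code $LASTEXITCODE for $($_.FullName)"
            }
        }
}
```

Calling Expand-ZipsWithoutPdf -SourceDir D:\data -TargetDir D:\extracted keeps the archives; adding -DeleteSource removes each archive after it has been extracted successfully. Archives with the same base name in different sub-folders would share one output folder in this sketch.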

private_meta

Posted 2014-06-18T06:41:55.133

Reputation: 2 204

Thanks mate, I tip my hat to you. This script achieved almost everything I wanted (except the log file). Since there has been no better answer, I accept yours as the solution. By the way, my system's PowerShell refused to run the script by default, saying that script execution is disabled. I had two choices: sign the script, or run Set-ExecutionPolicy Unrestricted in PowerShell as Administrator. I tried both and both worked; the first is the better choice, but explaining why is beyond the scope of this comment. – Fr0zenFyr – 2014-06-19T06:22:56.643

Hi again, the script worked beautifully except in one case. A few of my zip files had subfolders; the script extracted each subfolder but placed its contents alongside it (outside the subfolder). Can this be fixed somehow? Also, a few archives had .tar and .zip files inside them, so what should I replace the if($_.Name -match ...) check with to process them recursively? Thanks in advance. – Fr0zenFyr – 2014-06-20T09:44:26.923
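
The log file from the question could be handled roughly like this: record each archive's path once it has been processed, and skip anything already recorded on the next run. This is a minimal sketch; the function names and the log location D:\extracted\processed.log are hypothetical.

```powershell
# Returns a lookup table of the archive paths already recorded in the log file.
# A missing log file simply yields an empty table (first run).
function Get-ProcessedSet([string]$LogPath) {
    $done = @{}
    if (Test-Path -LiteralPath $LogPath) {
        Get-Content -LiteralPath $LogPath | ForEach-Object {
            if ($_) { $done[$_] = $true }   # ignore blank lines
        }
    }
    return $done
}

# Appends one archive path to the log; call it only after a successful extraction.
function Add-ProcessedEntry([string]$LogPath, [string]$ZipPath) {
    Add-Content -LiteralPath $LogPath -Value $ZipPath
}

# Usage inside the extraction loop (sketch):
#   $log  = 'D:\extracted\processed.log'
#   $done = Get-ProcessedSet $log
#   if ($done.ContainsKey($_.FullName)) { return }   # already done, skip this archive
#   ... extract the archive ...
#   Add-ProcessedEntry $log $_.FullName
```

Writing the entry only after 7-Zip reports success means an interrupted run re-extracts at most the archive it was working on.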

If you replace $arguments=@("e", with $arguments=@("x", it should preserve the directory structure; please test that.

As for recursive extraction, I don't know whether it works properly like that, but you could have the script call itself with a new directory, in this case every subdirectory. If there is a zip file in the root of that folder, it will unpack it. Otherwise, it gets a lot more complicated, and I'm not good enough with PowerShell for that. – private_meta – 2014-06-21T07:18:06.903
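
One way to handle archives nested inside archives, without the script calling itself, is to rescan the output folder until no unprocessed archive is left. The sketch below assumes 7-Zip at its default install path; the function name Expand-NestedArchives and its parameters are hypothetical, and the approach is untested against the asker's data.

```powershell
# Extracts .zip/.tar files found anywhere under $Root, each into a sibling folder
# named after the archive, and repeats until a pass finds nothing new.
# Outputs the number of inner archives it attempted.
function Expand-NestedArchives {
    param(
        [string]$Root,
        [string]$SevenZip = 'C:\Program Files\7-Zip\7z.exe',
        [string[]]$Extensions = @('.zip', '.tar')
    )

    $seen = @{}   # full paths already attempted, so every archive is tried only once
    do {
        $pending = @(Get-ChildItem -Path $Root -Recurse | Where-Object {
            -not $_.PSIsContainer -and
            $Extensions -contains $_.Extension -and
            -not $seen.ContainsKey($_.FullName)
        })

        foreach ($archive in $pending) {
            $seen[$archive.FullName] = $true
            $dest = Join-Path $archive.DirectoryName $archive.BaseName

            # x keeps sub-folders; -xr!*.pdf skips PDFs at any depth
            & $SevenZip x $archive.FullName "-o$dest" '-xr!*.pdf' -y | Out-Null
            if ($LASTEXITCODE -ne 0) {
                Write-Warning "7-Zip returned exit code $LASTEXITCODE for $($archive.FullName)"
            }
        }
    } while ($pending.Count -gt 0)

    $seen.Count
}
```

Run it on an output folder after the top-level extraction, for example Expand-NestedArchives -Root D:\extracted\one. The inner archives are left in place; delete them afterwards if they are not needed.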

I've started to dislike PowerShell; it seems confusing and complicated. I'm trying to manage this with a batch script now, and I already got much of it done in just one line. Thanks for the reply though, mate. I just posted a question on SO; you can see my progress there.

– Fr0zenFyr – 2014-06-21T07:30:55.630

I was thinking of asking you to help me modify the code from that same answer; you are a mind reader. I will try this code and report my progress here. I'm really glad you took the time to read my question carefully and covered almost every aspect of it. – Fr0zenFyr – 2014-06-18T07:49:13.673

You can use it as a basis and modify it as needed. Not extracting the PDF files in the first place is a major challenge; I don't think it would work with normal tools. – private_meta – 2014-06-18T07:52:19.373

Also, if you use more than one "param", you need to call them like this: "./script.ps1 -folderPath path -delete" and so on. For switches, refer to this

– private_meta – 2014-06-18T07:54:01.203
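
To illustrate the comment above, here is a small sketch of a string parameter combined with a switch. It is written as a function so it can be pasted into a console; the same param(...) block works at the top of a .ps1 file. The function name Get-ExtractionPlan and the message text are made up for the example.

```powershell
function Get-ExtractionPlan {
    param(
        [string]$folderPath = 'D:\Blah\files',  # named parameter with a default value
        [switch]$delete                         # switch: $true only when -delete is passed
    )

    if ($delete) { $fate = 'deleted' } else { $fate = 'kept' }
    "Archives under $folderPath will be extracted; zip files will be $fate."
}

Get-ExtractionPlan -folderPath 'D:\data'          # switch absent, zip files are kept
Get-ExtractionPlan -folderPath 'D:\data' -delete  # switch present, zip files are deleted
```

A switch takes no value on the command line, so its presence alone can decide whether the source archives are removed.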