count lines in a compressed file

48

7

I have a .gz file on Unix that contains a certain number of lines. How can I count the lines without uncompressing it?

Vijay

Posted 2010-04-27T07:37:14.370

Reputation: 711

See http://stackoverflow.com/questions/846062/wc-gzipped-files

– sancho.s Reinstate Monica – 2015-11-07T12:48:10.243

Without extracting the archive you can't count the lines. – zoli2k – 2010-04-27T07:38:32.757

Answers

66

Obviously, you cannot count newlines while the file is still compressed.

But you can decompress to a stream, and count the newlines in that stream, without ever writing the (decompressed) file to disk. That would go something like so:

zcat file.gz | wc -l

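zcat decompresses the file and writes the result to standard output; wc -l counts the newline characters it reads (wc is short for "word count"). See the man pages for both if you want to know more.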

EDIT

If you do not have zcat, use gunzip -c instead; zcat is just another name for it.
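As a minimal sketch, the streaming approach can be wrapped in a small function. The name count_gz_lines is hypothetical, and the sketch uses gzip -dc (equivalent to gunzip -c) reading from standard input, so it does not depend on which zcat is installed. The tr call only strips the padding that some wc implementations print.

```shell
#!/bin/sh
# count_gz_lines is a hypothetical helper name, not a standard command.
# It decompresses to a pipe and counts newline characters;
# the decompressed data is never written to disk.
count_gz_lines() {
    gzip -dc < "$1" | wc -l | tr -d ' '
}

# Demo on a throwaway three-line file.
tmp=$(mktemp -d)
printf 'a\nb\nc\n' | gzip > "$tmp/demo.gz"
count_gz_lines "$tmp/demo.gz"
rm -rf "$tmp"
```

Reading the file through a redirect rather than passing it as an argument also avoids trouble with file names that begin with a dash.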

extraneon

Posted 2010-04-27T07:37:14.370

Reputation:

7 On Unices where gzip is distinct from compress, you want gzcat. – coneslayer – 2010-04-27T21:56:05.393

8

This also seems to work - grep for the number of line-endings in the file

zgrep -Ec "$" file.gz
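One case where the two approaches can legitimately disagree is a file whose last line has no trailing newline: wc -l counts newline characters, while grep -c counts lines, including an unterminated final one. The sketch below illustrates this on a throwaway file; it uses gzip -dc piped into grep instead of zgrep so that it runs where zgrep is not installed. It does not claim to explain every discrepancy people may see.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Two complete lines plus a third line with no trailing newline.
printf 'one\ntwo\nthree' | gzip > "$tmp/demo.gz"

# wc -l counts newline characters.
wc_count=$(gzip -dc < "$tmp/demo.gz" | wc -l | tr -d ' ')
# grep -c '$' counts lines, including an unterminated last line.
grep_count=$(gzip -dc < "$tmp/demo.gz" | grep -c '$')

echo "wc -l:   $wc_count"
echo "grep -c: $grep_count"
rm -rf "$tmp"
```

Here wc -l reports 2 and grep -c reports 3.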

Patrick Wright

Posted 2010-04-27T07:37:14.370

Reputation: 81

This gives a different (much higher) answer for me than piping to wc -l – OrangeDog – 2018-03-09T17:03:42.213

6

If you want to do it quickly, I recommend using 'pigz' (which IIRC stands for "Parallel Implementation of GZip"). I just had a similar situation where I wanted to count the number of lines in a bunch of gzip'ed files and here was my solution:

for x in *.gz; do unpigz -p 8 -c "$x" | wc -l && echo "$x"; done

Which gave me the number of lines and the file it counted from on alternating lines, using 8 processors. It ran quickly!
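If the count and the file name are easier to read on one line, a loop along these lines should work. This is a sketch only: count_all_gz is a hypothetical name, and it uses gzip -dc so that it runs without pigz installed; unpigz -c can be substituted for the decompression step.

```shell
#!/bin/sh
# count_all_gz is a hypothetical helper: for each file given as an
# argument it prints "<line count><TAB><file name>" on a single line.
count_all_gz() {
    for f in "$@"; do
        # Decompress to a pipe and count newlines; tr strips wc's padding.
        n=$(gzip -dc < "$f" | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$n" "$f"
    done
}
```

It would be called as count_all_gz *.gz. Quoting "$f" keeps file names containing spaces intact.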

peter

Posted 2010-04-27T07:37:14.370

Reputation: 61

1 Or, if unpigz is not available, simply use for x in *.fastq.gz; do zcat "$x" | wc -l && echo "$x"; done – Calimo – 2015-11-20T22:34:56.127

2

Use this command:

gzgrep -c '$' filename.gz

The command gzgrep behaves the same as grep, but on gzip-compressed files. It decompresses the file on the fly for the regex matching.

In this case -c instructs the command to output the number of matched lines, and the regex $ matches the end of a line, so it matches every line of the file.

The final result is identical to that of gzip -dc filename.gz | grep -c '$'.

Ravi K M

Posted 2010-04-27T07:37:14.370

Reputation: 21

Is gzgrep available on other systems than Solaris? – pabouk – 2014-11-21T09:43:46.183

1 No. On other systems, the command would be zgrep -c '$' filename.gz – Ravi K M – 2016-05-11T08:08:14.280

1 Although one might intuitively think this is better than zcat+wc, when I time them, they take the same amount of time. – ngọcminh.oss – 2018-05-10T12:05:33.910

2

If you're okay with a rough estimate rather than an exact count, and actually extracting the whole file or zgrepping it for line endings would both take much too long (which was my situation just now), you can:

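zcat "$file" | head -1000 | gzip > 1000-line-sample.txt.gz
ls -l 1000-line-sample.txt.gz "$file"

then the approximate line count is 1000 * (size of $file) / (size of 1000-line-sample.txt.gz), as long as your data is fairly homogeneous per line. The sample is recompressed so that both sizes are compressed sizes; comparing an uncompressed sample against the compressed file would underestimate the count by roughly the compression ratio.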
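The estimate rests on the assumption that every line takes roughly the same number of compressed bytes, so the line count scales in proportion to the compressed size. A sketch of the arithmetic follows; estimate_gz_lines and its default sample size are hypothetical, and the sample is piped back through gzip so that both byte counts refer to compressed data.

```shell
#!/bin/sh
# estimate_gz_lines is a hypothetical helper.
# Usage: estimate_gz_lines FILE [SAMPLE_LINES]   (SAMPLE_LINES defaults to 1000)
estimate_gz_lines() {
    file=$1
    sample_lines=${2:-1000}
    # Compressed size of the whole file, in bytes.
    total_bytes=$(wc -c < "$file" | tr -d ' ')
    # Compressed size of the first SAMPLE_LINES lines, recompressed so that
    # it is comparable with total_bytes. Only the start of the file is read.
    sample_bytes=$(gzip -dc < "$file" | head -n "$sample_lines" | gzip | wc -c | tr -d ' ')
    # Integer arithmetic: lines ~ sample_lines * total_bytes / sample_bytes.
    echo $(( sample_lines * total_bytes / sample_bytes ))
}
```

The result is only as good as the homogeneity assumption: if early lines are shorter or compress better than later ones, the estimate will be too high, and the reverse case makes it too low.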

James

Posted 2010-04-27T07:37:14.370

Reputation: 41

Can you explain why this works? – Alex Moore-Niemi – 2020-02-24T02:09:00.143

0

gzip -cd file.gz | wc -l

This worked for me.

prashanth

Posted 2010-04-27T07:37:14.370

Reputation: 101