
I'm looking for a way to zgrep HDFS files,

something like:

hadoop fs -zcat hdfs://myfile.gz | grep "hi"

or

hadoop fs -cat hdfs://myfile.gz | zgrep "hi"

It does not really work for me. Is there any way to achieve that from the command line?

Jas
gzip is a simple linear compressor; it doesn't contain any index or similar. Thus AFAIK what you want is impossible, in Hadoop just as in any other setting. – peterh Jan 22 '15 at 10:54
  • If I have a single gzip file in HDFS, I would have expected `hadoop fs` to be able to uncompress it and do the `zless`/`zcat` for me... instead I need to do this work myself... – Jas Jan 22 '15 at 11:44

3 Answers


This command line will automatically find the right decompressor for any simple text file and print the uncompressed data to standard output:

hadoop fs -text hdfs:///path/to/file [hdfs:///path/to/another/file]

I have used this for .snappy & .gz files. It probably works for .lzo and .bz2 files.

This is an important feature because Hadoop uses a custom file format for Snappy files. This is the only direct way to uncompress a Hadoop-created Snappy file, and there is no standalone 'unsnappy' command-line tool like there is for the other compressors. I also don't know of any direct command that creates one; I've only created them as Hive table data.

Note: hadoop fs -text is single-threaded and runs the decompression on the machine where you run the command.
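
To answer the original question with this, you can pipe -text straight into grep (the path here is just a stand-in for your actual file):

hadoop fs -text hdfs:///path/to/myfile.gz | grep "hi"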

jackparsons

zless/zcat/zgrep are just shell wrappers that make gzip output the decompressed data to stdout. To do what you want, you'll just have to write a wrapper around the hadoop fs commands.
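
For example, a minimal wrapper might look like this (hzgrep is a made-up name; it assumes the file is a plain .gz and that hadoop and gzip are on your PATH):

#!/bin/bash
# hzgrep: grep through a gzip-compressed file stored in HDFS
# usage: hzgrep PATTERN hdfs:///path/to/file.gz
pattern="$1"
file="$2"
hadoop fs -cat "$file" | gzip -c -d | grep "$pattern"

Then something like hzgrep "hi" hdfs:///user/jas/myfile.gz behaves roughly like zgrep on a local file.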

Aside: The reason this probably didn't work for you is that you're missing an additional slash in your hdfs URI.

You wrote:

hadoop fs -cat hdfs://myfile.gz | zgrep "hi"

This attempts to contact the host or cluster called myfile.gz. What you really want is either hdfs:///myfile.gz or (assuming your config files are set up correctly) just myfile.gz, which the hadoop command will prefix with the correct cluster/namenode path defined by fs.defaultFS.
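
If you're unsure what your default filesystem is set to, you can ask the client for the configured value (the output below is just an illustration of what a typical value looks like):

$ hdfs getconf -confKey fs.defaultFS
hdfs://namenode.example.com:8020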

The following works for me.

$ hadoop fs -ls hdfs:///user/hcoyote/foo.gz
Found 1 items
-rw-r--r--   3 hcoyote users    5184637 2015-02-20 12:17 hdfs:///user/hcoyote/foo.gz

$ hadoop fs -cat hdfs:///user/hcoyote/foo.gz | gzip -c -d | grep -c Authorization
425893

$ hadoop fs -cat hdfs:///user/hcoyote/foo.gz | zgrep -c Authorization
425893
Travis Campbell

I usually use HDFS FUSE mounts, so I can use almost any regular Unix command (some commands may not work, since HDFS is not a POSIX-compliant filesystem).

gunzip/zcat

$ gunzip /hdfs_mount/dir1/somefile.gz
$ grep hi /hdfs_mount/dir1/somefile

(note: gunzip strips the .gz extension, so the grep runs against the decompressed file)

This works just fine on HDFS FUSE mounts. It's faster to type too :) and easier to read if, for example, you want to script it.

To mount Hadoop as a "regular" filesystem: http://www.cloudera.com/content/cloudera/en/documentation/cdh4/latest/CDH4-Installation-Guide/cdh4ig_topic_28.html
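
For reference, with the Cloudera packages the mount step looks roughly like this (namenode hostname, port, and mount point are placeholders; see the linked guide for the exact steps for your version):

$ sudo mkdir -p /hdfs_mount
$ sudo hadoop-fuse-dfs dfs://namenode.example.com:8020 /hdfs_mount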

Tagar