16

I have a 100GB file and I want to split it into 100 files of 1GB each, breaking only at line boundaries.

e.g.

split --bytes=1024M /path/to/input /path/to/output

For the 100 files generated, I want to apply gzip/zip to each of these files.

Is it possible to use a single command?

Ryan
    For up to 1GB per file (less if the next line would put it over) use `--line-bytes=1024M`. – Brian May 26 '14 at 19:04

4 Answers

39

Use "--filter":

split --bytes=1024M --filter='gzip > $FILE.gz' /path/to/input /path/to/output
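To see the idea on a small scale, here is a minimal sketch that splits a generated file on line boundaries, compresses each piece through the filter, and checks that the pieces decompress back to the original. It assumes GNU split from a reasonably recent coreutils (--filter and --line-bytes are GNU options); the scratch directory, the part_ prefix and the 16K piece size are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch directory and file names are illustrative assumptions.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Build a small line-oriented input file.
seq 1 20000 > "$workdir/input"

# Split into pieces of at most 16 KiB without breaking lines, and gzip each
# piece as it is produced. The single quotes matter: split sets $FILE to the
# current output name and runs the filter through the shell itself.
split --line-bytes=16K --filter='gzip > "$FILE.gz"' "$workdir/input" "$workdir/part_"

# Count the pieces and check that decompressing them in name order
# reproduces the input byte for byte.
pieces=$(ls "$workdir"/part_*.gz | wc -l)
orig_sum=$(cksum < "$workdir/input")
roundtrip_sum=$(cat "$workdir"/part_*.gz | gzip -dc | cksum)

echo "pieces:     $pieces"
echo "original:   $orig_sum"
echo "round trip: $roundtrip_sum"
```

No uncompressed piece is ever written to disk this way, which is the main difference from running gzip after split has finished.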

Skyhawk
Peter
    This does not work for me: it keeps overwriting the same file because $FILE is not defined, and it does not even write into the destination folder. – splaisan Sep 16 '19 at 08:03
    My mistake, apologies, and thanks for the help: the filter needs single quotes so that $FILE gets replaced. This final command worked for me to save fastq data, which comes in blocks of 4 lines: `zcat ERR3152365.fastq.gz | split -a 3 -d -l 1200000 --numeric-suffixes --filter='pigz -p 8 > $FILE.fq.gz' - splitout/part_` – splaisan Sep 16 '19 at 08:14
0

A bash function to compress on the fly with pigz

function splitreads() {
    # Add this function to your .bashrc or similar.
    # Splits a large compressed read file into chunks of a fixed number of reads.
    # The output suffix is a three-digit counter starting at 000.
    # Takes compressed input and compresses the output with pigz.
    # Keeps the read-in-pair suffix in the output names.
    # Requires pigz; modify the filter to use gzip if pigz is not installed.

    local usage="# splitreads <reads.fastq.gz> <reads per chunk; default 10000000>"
    if [ $# -lt 1 ]; then
        echo
        echo "${usage}"
        return 1
    fi

    # threads for pigz (adapt to your needs)
    local thr=8

    local input=$1

    # extract prefix and read number in pair
    # this code is adapted to paired reads
    local base pref readn
    base=$(basename "${input%.f*.gz}")
    pref=$(basename "${input%_?.f*.gz}")
    readn="${base#"${base%%_*}"}"

    # lines per chunk: 10M reads by default, 4 lines each
    local binsize=$(( ${2:-10000000} * 4 ))

    # split in bins of ${binsize} lines
    echo "# splitting ${input} in chunks of $(( binsize / 4 )) reads"

    # -d already selects numeric suffixes, so --numeric-suffixes is not repeated.
    # \$FILE is escaped so that split, not this shell, expands it.
    zcat "${input}" \
      | split \
        -a 3 \
        -d \
        -l "${binsize}" \
        --additional-suffix "${readn}" \
        --filter="pigz -p ${thr} > \$FILE.fq.gz" \
        - "${pref}_"
}
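
The name handling in the function is the least obvious part, so here is a small sketch of what the three parameter expansions yield. The input path is hypothetical and only illustrates a paired-end file named like prefix_1.fastq.gz; nothing is read from disk.

```shell
#!/usr/bin/env bash
# Hypothetical paired-end file name, used only for illustration.
input=/data/run1/sample_1.fastq.gz

# Strip the shortest trailing ".f*.gz" (here ".fastq.gz"), then the directory.
base=$(basename "${input%.f*.gz}")
# Strip the pair marker as well ("_1.fastq.gz"), then the directory.
pref=$(basename "${input%_?.f*.gz}")
# Remove from base everything before its first underscore.
readn="${base#"${base%%_*}"}"

echo "base=$base pref=$pref readn=$readn"
# prints: base=sample_1 pref=sample readn=_1

# split builds each name as prefix + counter + additional suffix,
# and the filter appends .fq.gz, so the first chunk would be:
first_chunk="${pref}_000${readn}.fq.gz"
echo "$first_chunk"
# prints: sample_000_1.fq.gz
```

Note that the readn expansion assumes the only underscore in the name is the one before the pair number; a prefix that itself contains underscores would end up partly in readn.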
splaisan
0

Without --filter, a one-liner that chains the commands with && is as close as you can come.

cd /path/to/output && split --bytes=1024M /path/to/input/filename && gzip x*

gzip runs only if split succeeds, because of the && between them. The && between cd and split likewise makes sure the cd succeeded first. Note that, with no prefix argument, split writes its pieces (xaa, xab, ...) to the current directory and gzip compresses them in place, which is why the command changes directory first. You can create the directory if needed:

mkdir -p /path/to/output && cd /path/to/output && split --bytes=1024M /path/to/input/filename && gzip x*

To put it all back together:

gunzip /path/to/files/x* && cat /path/to/files/x* > /path/to/dest/filename
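
If you would rather keep the pieces compressed, gzip can decompress them to standard output in name order and the result can be written in one pass. The following is a minimal sketch on a tiny generated file; the file names original and restored and the 64K piece size are assumptions made for the demonstration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory (illustrative, removed on exit).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Small stand-in for the large input file.
seq 1 50000 > original

# Same split-then-gzip sequence as in the one-liner, on a tiny scale.
split --bytes=64K original && gzip x*

# Decompress every piece to stdout, in glob (name) order, into one file.
# The .gz pieces themselves are left untouched.
gzip -dc x*.gz > restored

# Compare the result with the input.
if cmp -s original restored; then status=identical; else status=different; fi
echo "$status"
```

This relies on the shell expanding x*.gz in sorted order, which matches the order in which split named the pieces.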
Dennis Williamson
0

Adding the -d option makes split generate numeric suffixes:

split -d -b 2048m "myDump.dmp" "myDump.dmp.part-" && gzip myDump.dmp.part*

Files generated by split (gzip then appends .gz to each name):

    myDump.dmp.part-00
    myDump.dmp.part-01
    myDump.dmp.part-02
    ...
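
Numeric suffixes can also be combined with GNU split's --filter option so that each piece is compressed as it is written. This is only a sketch under assumptions: it needs GNU split, and the 300000-byte stand-in file and 100K piece size are invented so that the example runs quickly.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway directory for the demonstration (removed on exit).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical small stand-in for the real dump file.
head -c 300000 /dev/zero > myDump.dmp

# -d gives numeric suffixes; the filter gzips each piece under "$FILE.gz".
split -d -b 100K --filter='gzip > "$FILE.gz"' myDump.dmp myDump.dmp.part-

# List the resulting piece names on one line.
names=$(echo myDump.dmp.part-*)
echo "$names"
```

With these sizes the input spans three pieces, so the listing shows the part-00, part-01 and part-02 names, each with a .gz extension.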
Iván