Merge small files into larger ones of a specific size while preserving lines

0

I have a bunch of files with line-based content. They have different sizes, but I need a set of files that all have the same size.

What I have:

  • File 1, 70 Lines, 5MB
  • File 2, 113 Lines, 15MB

What I want:

  • File 1, 10MB
  • File 2, 10MB

I thought about merging the files and splitting the result with the "split" command, but when splitting by size it breaks lines. I need to preserve the lines and split only after a line break. Splitting by line count with "split" would not work either, because the size of the individual lines varies a lot.

PascalTurbo

Posted 2015-03-03T07:56:17.483

Reputation: 356

1 If the line sizes vary a lot, then please answer this question: if a file is 10239 KB (just under 10 MB) and adding the next line makes it 10241 KB (just over 10 MB), do you want that line to be included or not? – Master-Guy – 2015-03-03T07:59:35.947

It's better if the file is bigger than 10MB - so I want the line to be included – PascalTurbo – 2015-03-03T08:35:43.520

Answers

0

It isn't the fastest but it does what you've asked:

#!/bin/bash
# Split "$1" into pieces of at least $minimumsize bytes, breaking only after a line.
minimumsize=10000    # threshold in bytes; 10485760 would be 10 MiB
actualsize=0
infile=$(basename "$1")
filenum=1
outdir=/home/user/bin/testing/tmp
outfile=$infile.out$filenum

mkdir -p "$outdir"

# IFS= and -r keep leading whitespace and backslashes intact;
# the || test also handles a last line without a trailing newline
while IFS= read -r line || [ -n "$line" ]
do
    if [ "$actualsize" -ge "$minimumsize" ]; then
        # the current piece has reached the threshold, so start the next one
        (( filenum++ ))
        outfile=$infile.out$filenum
    fi
    # printf with a quoted variable avoids word splitting and glob expansion
    printf '%s\n' "$line" >> "$outdir/$outfile"
    actualsize=$(wc -c < "$outdir/$outfile")
done < "$1"

Set the minimumsize (in bytes) and outdir variables, then call the script with the path to the file that you want to split at line boundaries.

I'm sure that there is a command for doing this which is much faster though.
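As a rough sketch of a faster variant, the same rule (close a piece only after the line that pushes it to the minimum size) can be done in a single awk pass, without running wc for every line. The function name split_min_size and its argument order are made up for this illustration; awk reads all the listed files in sequence, which also takes care of the merging step.

```shell
#!/bin/bash
# Hypothetical helper (name and argument order are assumptions):
#   split_min_size MINBYTES PREFIX FILE...
# Writes PREFIX1, PREFIX2, ... made of whole lines only. A piece is closed
# once it has reached MINBYTES, so the line that crosses the limit stays in it.
split_min_size() {
    local min=$1 prefix=$2
    shift 2
    # LC_ALL=C makes length() count bytes rather than multibyte characters
    LC_ALL=C awk -v min="$min" -v prefix="$prefix" '
        BEGIN { n = 1; out = prefix n }
        {
            print > out
            size += length($0) + 1      # +1 for the newline
            if (size >= min) {          # threshold reached: move to next piece
                close(out)
                n++
                out = prefix n
                size = 0
            }
        }
    ' "$@"
}
```

For the files in the question this could be called as split_min_size 10485760 part file1 file2, which should give part1, part2, ... of slightly more than 10 MiB each, with a smaller last piece.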

krowe

Posted 2015-03-03T07:56:17.483

Reputation: 5 031

0

A small shell script should solve the problem.

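#!/bin/bash
file="part"
ext=".txt"
n=1
# IFS= and -r keep whitespace and backslashes in each line intact
while IFS= read -r line || [ -n "$line" ]
do
  fname=$file$n$ext
  # quoting prevents word splitting and glob expansion of the line
  printf '%s\n' "$line" >> "$fname"
  bytes=$(wc -c < "$fname")
  # move on to the next part once this one has reached 10 MiB
  if [ "$bytes" -ge 10485760 ]
  then
    n=$((n+1))
  fi
done < input.txt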

input.txt is your input file. The script produces output files named part1.txt, part2.txt, part3.txt, and so on, each holding slightly more than 10 MB of data (the last one may be smaller).
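If pieces of at most a given size are acceptable instead of at least that size, GNU split can do the job without a loop: its -C (--line-bytes) option fills each output file with as many complete lines as fit into the limit. The sketch below assumes GNU coreutils; the wrapper name split_max_size is made up for this example. Note that a single line longer than the limit would still be broken up.

```shell
#!/bin/bash
# Hypothetical wrapper (name is an assumption); requires GNU split.
#   split_max_size MAXBYTES PREFIX FILE...
# Concatenates the input files and writes PREFIX00, PREFIX01, ...,
# each holding at most MAXBYTES bytes of complete lines.
split_max_size() {
    local max=$1 prefix=$2
    shift 2
    # -C: at most MAXBYTES of whole lines per file; -d: numeric suffixes;
    # "-" makes split read the merged data from standard input
    cat -- "$@" | split -C "$max" -d - "$prefix"
}
```

A call such as split_max_size 10M part file1 file2 would then merge both files and cut the result into pieces of no more than 10 MiB.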

Ayan

Posted 2015-03-03T07:56:17.483

Reputation: 2 476