
I have a ~40 GB file, and a filter command that for some reason breaks when I try to run it on the file (even when the input is passed via a pipe).

But it doesn't fail when I split the input file into many small files, pass each of them through the filter, and concatenate the outputs.

So, I'm looking for a way to do:

  • split file into small blocks (10MB?)
  • for each block run some command on it
  • concatenate output in correct order

but without splitting the whole file out to disk first (I don't want to use that much disk space).

I can write such program myself, but perhaps there is already something that would do what I need?
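Roughly, the loop I have in mind looks like this in plain shell (just a sketch: the file names are made up, the tiny sizes are for the demo only, and `tr` stands in for the real filter):

```shell
# Process a file block by block without writing split files to disk:
# dd reads exactly one block per iteration, the filter sees only that
# block, and outputs are appended in order.
printf 'abcdefghij' > input.txt   # tiny stand-in for the 40GB file
blocksize=4                       # would be 10485760 for 10MB blocks
size=$(wc -c < input.txt)
: > output.txt
skip=0
while [ $((skip * blocksize)) -lt "$size" ]; do
    dd if=input.txt bs="$blocksize" skip="$skip" count=1 2>/dev/null \
        | tr 'a-j' 'A-J' >> output.txt   # tr stands in for the filter
    skip=$((skip + 1))
done
```

One caveat: byte-sized blocks can cut a multibyte UTF-8 sequence in half at a block boundary, so for something like iconv a line-based chunking would be safer.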

  • Have you considered posting your current filter command? Maybe someone has a better solution instead of splitting the input file. – Arjan Aug 06 '09 at 14:12
  • @Arjan: sure, it's iconv -c -f utf8 -t utf8. It bails out on the 40+ GB file, but works great on the same file split into parts. Not sure how that's relevant, but hey, it's no secret :) –  Aug 06 '09 at 15:42
  • Is your version of iconv large-file aware? See http://serverfault.com/questions/24803/why-cant-i-create-files-larger-than-2-gb-on-linux it may be a related problem. – romandas Aug 06 '09 at 19:40
  • @romandas: might not be, but I'm not in position to change iconv/system. –  Aug 06 '09 at 19:43

6 Answers


If you do decide to write it yourself and you are talking about text files, you could use Perl with the Tie::File module. It allows you to work on a large file a line at a time, in place. It is meant for just this sort of thing.

If the file is not text, you could try Tie::File::AnyData instead.

Kyle Brandt

Edit: I just noticed you don't want to split the file in advance because of disk space, so this probably won't work for you.

Use split:

$ man split

NAME
   split - split a file into pieces

SYNOPSIS
   split [OPTION] [INPUT [PREFIX]]

DESCRIPTION
   Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT is -, read standard input.

   Mandatory arguments to long options are mandatory for short options too.

   -a, --suffix-length=N
          use suffixes of length N (default 2)

   -b, --bytes=SIZE
          put SIZE bytes per output file

   -C, --line-bytes=SIZE
          put at most SIZE bytes of lines per output file

   -d, --numeric-suffixes
          use numeric suffixes instead of alphabetic

   -l, --lines=NUMBER
          put NUMBER lines per output file

   --verbose
          print a diagnostic to standard error just before each output file is opened

   --help display this help and exit

   --version
          output version information and exit

   SIZE may have a multiplier suffix: b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024, GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
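A quick round-trip demo of the flags above (made-up names; a 6-byte file and 2-byte pieces stand in for the 40GB file and 10MB blocks):

```shell
# Split into numbered 2-byte pieces, then concatenate them back.
# -d gives numeric suffixes so the shell glob sorts them correctly.
printf 'hello\n' > demo.txt
split -d -b 2 demo.txt piece.      # creates piece.00 piece.01 piece.02
cat piece.* > rejoined.txt
cmp -s demo.txt rejoined.txt && echo "round-trip OK"
```

In the real case each `piece.*` would be run through the filter before concatenation, which is exactly the disk-space cost noted in the edit above.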
Robert Swisher

I suggest using sed to extract just the parts you want and piping the output into your command:

sed -n '1,1000p' yourfile | yourcommand

will pipe the first 1000 lines to yourcommand

sed -n '1001,2000p' yourfile | yourcommand

will pipe the next 1000 lines.

etc.

You could put this in a loop in a script if you want, e.g.:

#!/bin/bash
size=1000
lines=$(wc -l < "$1")
first=1
last=$size

while [ "$last" -lt "$lines" ]; do
    # the trailing "${last}q" tells sed to quit after printing the chunk,
    # so it does not scan the rest of the file on every pass
    sed -n "${first},${last}p;${last}q" "$1" | yourcommand
    first=$((last + 1))
    last=$((last + size))
done

last=$lines
sed -n "${first},${last}p" "$1" | yourcommand
bmb

Try this:

#!/bin/bash

FILE=/var/log/messages
CHUNKSIZE=100

LINE=1
TOTAL=$(wc -l < "$FILE")
while [ "$LINE" -le "$TOTAL" ]; do
  ENDLINE=$((LINE + CHUNKSIZE))
  # -n is essential here: without it sed also echoes every input line,
  # so each chunk would come out twice along with the rest of the file
  sed -n "${LINE},${ENDLINE}p;${ENDLINE}q" "$FILE" | grep -i "mark"
  LINE=$((ENDLINE + 1))
done
Justin Ellison

Well, to everybody suggesting I write my own solution: I can. And I can even do it without multiple scans of the input file. But the problem/question is: is there already a ready-made tool?

The simplest Perl-based approach might look like this:

#!/usr/bin/perl -w
use strict;

my ( $lines, $command ) = @ARGV;

open my $out, '|-', $command or die "Cannot start $command: $!";

my $i = 0;
while (<STDIN>) {
    $i++;
    if ($i > $lines) {
        close $out;
        open $out, '|-', $command or die "Cannot start $command: $!";
        $i = 1;
    }
    print $out $_;
}

close $out;

exit;

and now I can:

=> seq 1 5
1
2
3
4
5

=> seq 1 5 | ./run_in_parts.pl 3 tac
3
2
1
5
4
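For what it's worth, the same chunk-and-reopen trick fits in a single awk invocation (again just a sketch, with `tac` standing in for the real filter): awk pipes lines into the command and closes/reopens the pipe every `size` lines, so the input is still read only once.

```shell
# every `size` lines, close(cmd) flushes the current filter process and
# the next print reopens a fresh one; chunks come out in input order
printf '1\n2\n3\n4\n5\n' > in.txt
awk -v size=3 -v cmd=tac '{ print | cmd } NR % size == 0 { close(cmd) }' in.txt > out.txt
cat out.txt
```

This prints the same 3 2 1 5 4 sequence as the Perl demo above.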

You are not the first person to run into this problem with iconv. Someone has written a Perl script to solve it.

iconv doesn't handle large files well. From the glibc source code, in iconv/iconv_prog.c:

/* Since we have to deal with
   arbitrary encodings we must read the whole text in a buffer and
   process it in one step.  */

However, for your particular case, it might be better to write your own UTF-8 validator. You could easily distill iconv -c -f utf8 -t utf8 down to a small C program, with a loop that calls iconv(3). Since UTF-8 is modeless and self-synchronizing, you can process it in chunks.

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFSIZE 4096

/* Copy STDIN to STDOUT, omitting invalid UTF-8 sequences */
int main() {
    char ib[BUFSIZE], ob[BUFSIZE], *ibp, *obp;
    ssize_t bytes_read;
    size_t iblen = 0, oblen;
    unsigned long long total;
    iconv_t cd;

    if ((iconv_t)-1 == (cd = iconv_open("utf8", "utf8"))) {
        perror("iconv_open");
        return 2;
    }

    for (total = 0;
         bytes_read = read(STDIN_FILENO, ib + iblen, sizeof(ib) - iblen);
         total += bytes_read - iblen) {

        if (-1 == bytes_read) {     /* Handle read error */
            perror("read");
            return 1;
        }
        ibp = ib; iblen += bytes_read;
        obp = ob; oblen = sizeof(ob);
        if (-1 == iconv(cd, &ibp, &iblen, &obp, &oblen)) {
            switch (errno) {
              case EILSEQ:          /* Invalid input multibyte sequence */
                fprintf(stderr, "Invalid multibyte sequence at byte %llu\n",
                        1 + total + sizeof(ib) - iblen);
                ibp++; iblen--;     /* Skip the bad byte next time */
                break;
              case EINVAL:          /* Incomplete input multibyte sequence */               
                break;
              default:
                perror("iconv");
                return 2;
            }
        }
        write(STDOUT_FILENO, ob, sizeof(ob) - oblen);

        /* There are iblen bytes at the end of ib that follow an invalid UTF-8
           sequence or are part of an incomplete UTF-8 sequence.  Move them to  
           the beginning of ib. */
        memmove(ib, ibp, iblen);
    }
    return iconv_close(cd);
}
200_success