
I have a ~40 GB file, and a filter command that for some reason breaks when I try to run it on the file (even when the input is passed via a pipe).

But it doesn't fail when I split the input file into many small files, pass each of them through the filter, and concatenate the outputs.

So, I'm looking for a way to do:

  • split file into small blocks (10MB?)
  • for each block run some command on it
  • concatenate output in correct order

but without splitting the whole file out to disk first (I don't want to use that much disk space).

I can write such program myself, but perhaps there is already something that would do what I need?
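Roughly, the loop I have in mind looks like this in plain shell (just a sketch: the file names are made up, the tiny sizes are for the demo only, and `tr` stands in for the real filter):

```shell
# Process a file block by block without writing split files to disk:
# dd reads exactly one block per iteration, the filter sees only that
# block, and outputs are appended in order.
printf 'abcdefghij' > input.txt   # tiny stand-in for the 40GB file
blocksize=4                       # would be 10485760 for 10MB blocks
size=$(wc -c < input.txt)
: > output.txt
skip=0
while [ $((skip * blocksize)) -lt "$size" ]; do
    dd if=input.txt bs="$blocksize" skip="$skip" count=1 2>/dev/null \
        | tr 'a-j' 'A-J' >> output.txt   # tr stands in for the filter
    skip=$((skip + 1))
done
```

One caveat: byte-sized blocks can cut a multibyte UTF-8 sequence in half at a block boundary, so for something like iconv a line-based chunking would be safer.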

  • Have you considered posting your current filter command? Maybe someone has a better solution instead of splitting the input file. – Arjan Aug 06 '09 at 14:12
  • @Arjan: sure, it's iconv -c -f utf8 -t utf8. It bails out on the 40+ GB file, but works great on the same file split into parts. Not sure how that's relevant, but hey, it's no secret :) –  Aug 06 '09 at 15:42
  • Is your version of iconv large-file aware? See http://serverfault.com/questions/24803/why-cant-i-create-files-larger-than-2-gb-on-linux it may be a related problem. – romandas Aug 06 '09 at 19:40
  • @romandas: might not be, but I'm not in position to change iconv/system. –  Aug 06 '09 at 19:43

6 Answers


If you do decide to write it yourself and you are talking about text files, you could use Perl with the Tie::File module. It allows you to work on a large file a line at a time, in place. It is meant for just this sort of thing.

If the file is not text, you could try Tie::File::AnyData instead.

Kyle Brandt

Edit: I just noticed you don't want to split the file in advance because of disk space, so this probably won't work for you.

Use split:

$ man split

NAME
   split - split a file into pieces

SYNOPSIS
   split [OPTION] [INPUT [PREFIX]]

DESCRIPTION
   Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT is -, read standard input.

   Mandatory arguments to long options are mandatory for short options too.

   -a, --suffix-length=N
          use suffixes of length N (default 2)

   -b, --bytes=SIZE
          put SIZE bytes per output file

   -C, --line-bytes=SIZE
          put at most SIZE bytes of lines per output file

   -d, --numeric-suffixes
          use numeric suffixes instead of alphabetic

   -l, --lines=NUMBER
          put NUMBER lines per output file

   --verbose
          print a diagnostic to standard error just before each output file is opened

   --help display this help and exit

   --version
          output version information and exit

   SIZE may have a multiplier suffix: b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024, GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
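A quick round-trip demo of the flags above (made-up names; a 6-byte file and 2-byte pieces stand in for the 40GB file and 10MB blocks):

```shell
# Split into numbered 2-byte pieces, then concatenate them back.
# -d gives numeric suffixes so the shell glob sorts them correctly.
printf 'hello\n' > demo.txt
split -d -b 2 demo.txt piece.      # creates piece.00 piece.01 piece.02
cat piece.* > rejoined.txt
cmp -s demo.txt rejoined.txt && echo "round-trip OK"
```

In the real case each `piece.*` would be run through the filter before concatenation, which is exactly the disk-space cost noted in the edit above.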
Robert Swisher

I suggest using sed to extract just the parts you want and piping the output into your command:

sed -n '1,1000p' yourfile | yourcommand

will pipe the first 1000 lines to yourcommand

sed -n '1001,2000p' yourfile | yourcommand

will pipe the next 1000 lines.

etc.

You could put this in a loop in a script if you want, e.g.:

#!/bin/bash
size=1000
lines=$(wc -l < "$1")
first=1
last=$size

while [ "$last" -lt "$lines" ]; do
    # the trailing "${last}q" tells sed to quit after printing the chunk,
    # so it does not scan the rest of the file on every pass
    sed -n "${first},${last}p;${last}q" "$1" | yourcommand
    first=$((last + 1))
    last=$((last + size))
done

last=$lines
sed -n "${first},${last}p" "$1" | yourcommand
bmb

Try this:

#!/bin/bash

FILE=/var/log/messages
CHUNKSIZE=100

LINE=1
TOTAL=$(wc -l < "$FILE")
while [ "$LINE" -le "$TOTAL" ]; do
  ENDLINE=$((LINE + CHUNKSIZE))
  # -n is essential here: without it sed also echoes every input line,
  # so each chunk would come out twice along with the rest of the file
  sed -n "${LINE},${ENDLINE}p;${ENDLINE}q" "$FILE" | grep -i "mark"
  LINE=$((ENDLINE + 1))
done
Justin Ellison

Well, to everybody suggesting I write my own solution: I can. And I can even do it without multiple scans of the input file. But the problem/question is: is there already a ready-made tool?

The simplest Perl-based approach might look like this:

#!/usr/bin/perl -w
use strict;

my ( $lines, $command ) = @ARGV;

open my $out, '|-', $command or die "Cannot start $command: $!";

my $i = 0;
while (<STDIN>) {
    $i++;
    if ($i > $lines) {
        close $out;
        open $out, '|-', $command or die "Cannot start $command: $!";
        $i = 1;
    }
    print $out $_;
}

close $out;

exit;

and now I can:

=> seq 1 5
1
2
3
4
5

=> seq 1 5 | ./run_in_parts.pl 3 tac
3
2
1
5
4
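For what it's worth, the same chunk-and-reopen trick fits in a single awk invocation (again just a sketch, with `tac` standing in for the real filter): awk pipes lines into the command and closes/reopens the pipe every `size` lines, so the input is still read only once.

```shell
# every `size` lines, close(cmd) flushes the current filter process and
# the next print reopens a fresh one; chunks come out in input order
printf '1\n2\n3\n4\n5\n' > in.txt
awk -v size=3 -v cmd=tac '{ print | cmd } NR % size == 0 { close(cmd) }' in.txt > out.txt
cat out.txt
```

This prints the same 3 2 1 5 4 sequence as the Perl demo above.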

You are not the first person to run into this problem with iconv. Someone has written a Perl script to solve it.

iconv doesn't handle large files well. From the glibc source code, in iconv/iconv_prog.c:

/* Since we have to deal with
   arbitrary encodings we must read the whole text in a buffer and
   process it in one step.  */

However, for your particular case, it might be better to write your own UTF-8 validator. You could easily distill iconv -c -f utf8 -t utf8 down to a small C program, with a loop that calls iconv(3). Since UTF-8 is modeless and self-synchronizing, you can process it in chunks.

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFSIZE 4096

/* Copy STDIN to STDOUT, omitting invalid UTF-8 sequences */
int main() {
    char ib[BUFSIZE], ob[BUFSIZE], *ibp, *obp;
    ssize_t bytes_read;
    size_t iblen = 0, oblen;
    unsigned long long total;
    iconv_t cd;

    if ((iconv_t)-1 == (cd = iconv_open("utf8", "utf8"))) {
        perror("iconv_open");
        return 2;
    }

    for (total = 0;
         bytes_read = read(STDIN_FILENO, ib + iblen, sizeof(ib) - iblen);
         total += bytes_read - iblen) {

        if (-1 == bytes_read) {     /* Handle read error */
            perror("read");
            return 1;
        }
        ibp = ib; iblen += bytes_read;
        obp = ob; oblen = sizeof(ob);
        if (-1 == iconv(cd, &ibp, &iblen, &obp, &oblen)) {
            switch (errno) {
              case EILSEQ:          /* Invalid input multibyte sequence */
                fprintf(stderr, "Invalid multibyte sequence at byte %llu\n",
                        1 + total + sizeof(ib) - iblen);
                ibp++; iblen--;     /* Skip the bad byte next time */
                break;
              case EINVAL:          /* Incomplete input multibyte sequence */               
                break;
              default:
                perror("iconv");
                return 2;
            }
        }
        write(STDOUT_FILENO, ob, sizeof(ob) - oblen);

        /* There are iblen bytes at the end of ib that follow an invalid UTF-8
           sequence or are part of an incomplete UTF-8 sequence.  Move them to  
           the beginning of ib. */
        memmove(ib, ibp, iblen);
    }
    return iconv_close(cd);
}
200_success