Splitting up a file: substituting `egrep` output into a `sed` range

0

I want to split my $file, which contains x lines, in half, and log how many lines in each half contain "dead". I started off with the following:

half=`expr $(egrep -c . $file) / 2`

sed -n 1,${half}p $file | 
    xargs echo $file $half $(egrep -c dead $I) > log_1
sed -n ${half},${egrep -c . $file}p | 
    xargs echo $file $half $(egrep -c dead $I) > log_2

The output of the first sed command is OK, but when I substitute the egrep command into the range of the second sed it goes wrong:

DeadOrAlive 5 2
-bash: ${half},${egrep -c . $file}p: bad substitution

Is there a more efficient way of splitting the file in bash?

ChemMod

Posted 2018-04-10T19:15:12.957

Reputation: 1

$(...) and ${...} are different constructs. The former is command substitution, the latter is parameter expansion. – choroba – 2018-04-10T19:42:48.393
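
A minimal illustration of the two constructs (the variable name here is just an example):

    total=$(egrep -c . "$file")   # $(...) is command substitution: runs egrep and captures its output
    echo "${total}"               # ${...} is parameter expansion: expands the variable total
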

Your first egrep counts non-empty lines. grep -c ^ file would produce the total line count, including empty lines. (If your file doesn't contain any empty lines, then of course both are equivalent.) wc -l <file is probably faster because it doesn't need to do any regex matching. If you want to specifically count non-empty lines, then of course you do have to check for matches. – tripleee – 2018-04-11T06:20:25.953
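
For comparison, the three counting commands mentioned here (a quick sketch; "$file" stands for whatever file you are counting):

    wc -l < "$file"      # total line count, no filename in the output
    grep -c ^ "$file"    # total line count (every line matches ^)
    grep -c . "$file"    # non-empty lines only
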

What do you expect $I to contain? – tripleee – 2018-04-11T06:28:45.503

sed "$half,\$" will select lines from $half through to the end of file, though your code will include the middlemost file (line number $half) in both the first and the second half. – tripleee – 2018-04-11T06:30:08.513

sed "1,${half}d" file will delete the first $half lines, and print the rest. With that, you can get the file properly split into two non-overlapping partitions. – tripleee – 2018-04-11T10:12:21.997

Answers

0

  1. Using wc, head and tail:

    half=$(( $(wc -l < "$file") / 2 ))    # redirect stdin so wc prints only the number
    head -n "$half" "$file" | egrep -c dead | xargs echo "$file" "$half" > log_1
    tail -n "$half" "$file" | egrep -c dead | xargs echo "$file" "$half" > log_2
    
  2. Using split:

    split -a1 --numeric-suffixes=1 -n 'l/2' "$file" "$file"_
    echo "$file" "$file"_1 $(egrep -c dead "$file_1") > log_1
    echo "$file" "$file"_2 $(egrep -c dead "$file"_2) > log_2
    rm "$file"_[12]
    

agc

Posted 2018-04-10T19:15:12.957

Reputation: 587

0

Here's an Awk solution.

awk '/dead/ { a[++n] = NR }
    END { for (i=1; i<=n; i++) if (a[i] > NR/2) break
        print ARGV[1], int(NR/2), i-1 >"log_1";
        print ARGV[1], int(NR/2)+(int(NR/2)!=NR/2), n-i+1 >"log_2" }' file

We collect into the array a the line numbers of the matching lines. In the END block we then figure out how many of those line numbers fall at or before the middle of the file; that count belongs to the first partition. (We have to use i-1 because we are already one position past the partitioning point when we break out of the loop.)

In general, you want to avoid rereading the same file many times, especially if it might be big, and, secondly, to minimize the number of processes you start.

It's not clear what you expect the middle output field to contain. If the file contains an odd number of lines, the first "half" will contain one line less than the second partition. (This is not hard to change, but you have to decide one way or the other.)

tripleee

Posted 2018-04-10T19:15:12.957

Reputation: 2 480

Strictly speaking, we should close() the files we open, but as long as there are only two, I didn't bother. – tripleee – 2018-04-11T06:27:59.343
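
Combining the answer's note about the odd-line decision with the close() remark above, one possible variant (just a sketch, not part of the original answer) puts the extra line of an odd-length file into the first partition instead, and closes both output files:

    awk '/dead/ { a[++n] = NR }
        END { h = int((NR + 1) / 2)                    # first partition gets the extra line
            for (i = 1; i <= n; i++) if (a[i] > h) break
            print ARGV[1], h, i - 1 > "log_1";          close("log_1")
            print ARGV[1], NR - h, n - i + 1 > "log_2"; close("log_2") }' file
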