Merge by date multiple log files that also include un-dated lines (e.g. stack traces)

6

1

How can I merge log files, i.e. files that are sorted by time but contain multi-line entries, where only the first line of each entry carries a timestamp and the remaining lines do not?

log1

01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar

log2

01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3

Expected result

01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar

If it weren't for the non-timestamp lines starting with a digit, a simple sort -nm log1 log2 would do.

Is there an easy way on a unix/linux cmd line to get the job done?

Edit: As these log files are often gigabytes in size, merging should be done without re-sorting the (already sorted) log files, and without loading the files completely into memory.

Eugene Beresovsky

Posted 2014-10-07T23:59:13.530

Reputation: 707

Is this an actual UNIX or do you mean Linux? Do you have the GNU tools? – terdon – 2014-10-08T12:30:39.377

Answers

10

Tricky. While it is possible using date and bash arrays, this really is the kind of thing that would benefit from a real programming language. In Perl for example:

$ perl -ne '$d=$1 if /(.+?),/; $k{$d}.=$_; END{print $k{$_} for sort keys(%k);}' log*
01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar

Here's the same thing uncondensed into a commented script:

#!/usr/bin/env perl

## Read each input line, saving it 
## as $_. This while loop is equivalent
## to perl -ne 
while (<>) {
    ## If this line has a comma
    if (/(.+?),/) {
        ## Save everything up to the 1st 
        ## comma as $date
        $date=$1;
    }
    ## Add the current line to the %k hash.
    ## The hash's keys are the dates and the 
    ## contents are the lines.
    $k{$date}.=$_;
}

## Get the sorted list of hash keys
@dates=sort(keys(%k));
## Now that we have them sorted, 
## print each set of lines.
foreach $date (@dates) {
    print "$k{$date}";
}

Note that this assumes that all date lines and only the date lines contain a comma. If that's not the case, you can use this instead:

perl -ne '$d=$1 if /^(\d+:\d+:\d+\.\d+),/; $k{$d}.=$_; END{print $k{$_} for sort keys(%k);}' log*

The approach above needs to keep the entire contents of the files in memory. If that is a problem, here's one that doesn't:

$ perl -pe 's/\n/\0/; s/^/\n/ if /^\d+:\d+:\d+\.\d+/' log* | 
    sort -n | perl -lne 's/\0/\n/g; printf'
01:02:03.6497,2224,0022 foo
foo1
2foo
foo3    
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3    
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar

This one simply joins all the lines between successive timestamps into a single line by replacing their newlines with \0 (if \0 can occur in your log files, use any sequence of characters you know will never be there). The result is passed to sort, and the final perl command turns the \0s back into newlines.
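The same fold/sort/unfold idea can be sketched in Python. This is a minimal in-memory illustration, not the answer's actual pipeline; the timestamp test is a deliberately crude assumption:

```python
def fold(lines, sep="\0"):
    """Join each timestamped record (header line + its continuation
    lines) into one physical line, using `sep` in place of newlines."""
    record = None
    for line in lines:
        line = line.rstrip("\n")
        # crude timestamp test: starts with two digits and has a ':' early on
        if line[:2].isdigit() and ":" in line[:8]:
            if record is not None:
                yield record
            record = line
        elif record is not None:
            record += sep + line
    if record is not None:
        yield record

log1 = ["01:02:03.6497,2224,0022 foo\n", "foo1\n", "2foo\n"]
log2 = ["01:03:03.6497,2224,0022 FOO\n", "FOO1\n"]

# fold both logs, sort the folded one-line records, then unfold
merged = sorted(list(fold(log1)) + list(fold(log2)))
for rec in merged:
    print(rec.replace("\0", "\n"))
```

Sorting the folded records lexicographically works here because the HH:MM:SS prefix sorts correctly as a string; a real external `sort` would keep the memory use bounded, as in the shell pipeline above.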


As the OP very correctly pointed out, all of the above solutions re-sort the data and don't take advantage of the fact that the already-sorted files only need to be merged. Here's one that does, but which, unlike the others, will only work on two files:

$ sort -m <(perl -pe 's/\n/\0/; s/^/\n/ if /^\d+:\d+:\d+\.\d+/' log1) \
            <(perl -pe 's/\n/\0/; s/^/\n/ if /^\d+:\d+:\d+\.\d+/' log2) | 
    perl -lne 's/[\0\r]/\n/g; printf'

And if you save the perl command as an alias, you can get:

$ alias a="perl -pe 's/\n/\0/; s/^/\n/ if /^\d+:\d+:\d+\.\d+/'"
$ sort -m <(a log1) <(a log2) | perl -lne 's/[\0\r]/\n/g; printf'

terdon

Posted 2014-10-07T23:59:13.530

Reputation: 45 216

The problem with this is that it loads the contents of all files into memory first. As the files are already sorted, this should be easy to avoid and to do in a streaming fashion (with a general-purpose language such as Perl, with which I'm not familiar, so it's not easy for me :) – Eugene Beresovsky – 2014-10-09T07:26:45.713

@EugeneBeresovsky I don't see how. The files are not sorted, that's the whole problem. They're only sorted within each file; you can find line1 and line3 in fileA while line2 is in fileB. I don't see how you can sort that without either using temp files or storing in memory. If you would rather have tmp files, let me know and I'll give you an example. – terdon – 2014-10-09T10:59:25.247

It's not that complicated: you read in the first line of every file, then you output the oldest line plus any undated multi-lines of that file that follow, and forget all those lines. Then you move on to the file that now has the oldest line and do the same. You repeat the process, alternating between files, until they are exhausted. If that doesn't make sense to you, you can verify that something like this must be possible by e.g. monitoring ps -eo comm,vsz|grep perl while perl is running vs. sort -nm log*. I tried it with 3 files totaling 650 MB. Max mem for your solution: 863 MB, for sort: 8 MB – Eugene Beresovsky – 2014-10-09T23:45:08.337
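The incremental merge described in this comment can be sketched in Python: Python's `heapq.merge` performs exactly this lazy k-way merge, so at any moment only one record per input file is buffered. The record splitter below is a sketch with a simplistic timestamp test, not production code:

```python
import heapq

def records(lines):
    """Yield (timestamp_key, full_record) pairs from an iterable of
    lines. Continuation lines (no leading timestamp) are attached to
    the current record, so only one record per file is held at a time."""
    key, buf = None, []
    for line in lines:
        head = line.split(",", 1)[0]
        # simplistic timestamp test on the part before the first comma
        if ":" in head and head[:2].isdigit():
            if buf:
                yield key, "".join(buf)
            key, buf = head, [line]
        else:
            buf.append(line)
    if buf:
        yield key, "".join(buf)

log1 = ["01:02:03.6497,2224,0022 foo\n", "foo1\n",
        "01:04:03.6497,2224,0022 bar\n", "1bar\n"]
log2 = ["01:03:03.6497,2224,0022 FOO\n", "FOO1\n"]

# heapq.merge consumes the generators lazily, alternating between the
# inputs by smallest timestamp, as described in the comment above.
for _, rec in heapq.merge(records(log1), records(log2)):
    print(rec, end="")
```

With `open(path)` file objects in place of the in-memory lists, the same code streams arbitrarily large files with bounded memory.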

@EugeneBeresovsky yes, as I said, it is possible using temporary files (that's what sort does). I see what you're describing, but implementing a full sort algorithm is way beyond the scope of this site. – terdon – 2014-11-10T14:52:59.007

1) I wasn't asking for the implementation of a sort algorithm. 2) In order to merge already sorted files, full sorting is not necessary, as explained in my last comment. 3) Even if sorting were indeed necessary, it still would not mean someone had to reinvent the wheel here. There might e.g. be some little-known flag in some unix command. – Eugene Beresovsky – 2014-11-10T23:02:43.560

@EugeneBeresovsky I still don't understand how you could merge without either storing the data in memory or in a temporary file (which is what sort does). Perhaps there is a way but, if so, I don't know it. You will need to be able to insert one file into the middle of another, so some kind of data structure will be needed. Anyway, have a look at the updated answer; I've posted a new approach that only ever keeps a single line in memory. – terdon – 2014-11-11T02:24:13.787

Thanks terdon - the newline replacement was the idea I needed! I've posted an answer using your idea, avoiding the unnecessary re-sort of your answer. – Eugene Beresovsky – 2014-11-11T04:23:33.057

I will accept your answer. I only edited your answer to add a caret to your regex and make it more specific to prevent mismatches. Thanks for your help - a good lesson for me. – Eugene Beresovsky – 2014-11-13T04:30:46.620

@EugeneBeresovsky your edit was rejected before I had the chance to review it, so I did it myself. – terdon – 2014-11-13T12:42:00.057

Thanks, I too have seen unwarranted rejections of helpful edits from other people to my answers before. My version btw was even more strict: ^\d\d:\d\d:\d\d\.\d{3} . But that's not crucial to this question any more. – Eugene Beresovsky – 2014-11-13T23:20:45.843

1

One way to do it (thanks @terdon for the newline-replacement idea):

1. Concatenate each multi-line entry into a single line by replacing its inner newlines with e.g. NUL, in each input file
2. Do a sort -m on the transformed files
3. Replace the NULs back to newlines

Example

As the multiline concatenation is used more than once, let's alias it away:

alias a="awk '{ if (match(\$0, /^[0-9]{2}:[0-9]{2}:[0-9]{2}\\./, _))\
    { if (NR == 1) printf \"%s\", \$0; else printf \"\\n%s\", \$0 }\
    else printf \"\\0%s\", \$0 } END { print \"\" }'"

Here's the merge command, using the above alias:

sort -m <(a log1) <(a log2) | tr '\0' '\n'

As shell script

In order to use it like this

merge-logs log1 log2

I put it into a shell script:

x=""
for f in "$@";
do
 x="$x <(awk '{ if (match(\$0, /^[0-9]{2}:[0-9]{2}:[0-9]{2}\\./, _)) { if (NR == 1) printf \"%s\", \$0; else printf \"\\n%s\", \$0 } else printf \"\\0%s\", \$0 } END { print \"\" }' $f)"
done

eval "sort -m $x | tr '\0' '\n'"

Not sure if I can support a variable number of log files without resorting to evil eval.

Eugene Beresovsky

Posted 2014-10-07T23:59:13.530

Reputation: 707

Nice, +1. Though I have to say that awk seems needlessly complex. Why not use the much shorter perl version in your alias? – terdon – 2014-11-12T15:34:36.520

I tried to use your perl, but the problem is it always adds a newline at the beginning. Fixing that same issue is what makes the awk solution so complex, and I guess your perl would end up looking similar. The reason I switched to awk is just that I'm not a perlist. – Eugene Beresovsky – 2014-11-12T22:50:44.420

Ah, to get rid of the blank line, just pass it through grep .: perl -ne 's/\n/\0/; s/^/\n/ if /\d+:\d+:\d+\.\d+/ ; print' log* | tr -s '\0' '\n' | grep . Or even awk 'NR>1'. – terdon – 2014-11-12T23:09:42.573

You are the man, terdon. You've got all the good ideas. Except, grep . also swallows blank lines that existed originally. Your awk is safe though, as would be tail -n+2. – Eugene Beresovsky – 2014-11-12T23:13:48.997

Your perl is significantly faster than my awk. However, the tr -s that has to come with your current perl solution removes empty lines that existed in the original. In general that's good enough, but the result is not a perfect merge. – Eugene Beresovsky – 2014-11-13T00:10:05.160

Ah, true. OK, I posted a new version that i) uses sort -m ii) doesn't eat blank lines and iii) doesn't add extra ones. And then it will turn out that something like logrotate can do all this automatically :) – terdon – 2014-11-13T03:51:24.460

0

When using Java is an option for you, try log-merger:

java -jar log-merger-0.0.3-jar-with-dependencies.jar -f 1 -tf "HH:MM:ss.SSS" -d "," -i log1,log2
01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar

siom

Posted 2014-10-07T23:59:13.530

Reputation: 101

If only Java didn't take that long to start up, I would immediately switch to writing shell scripts in some decent JVM language! – Eugene Beresovsky – 2015-09-02T05:11:39.430