awk/perl statement to validate and return fixed-structure data lines at start of file

1

I have a file that starts like this

## CONFIG-PARAMS-START ##
##
## text1 text2 NNNNNNNNN (arbitrary_comment) ##
## text1 text2 NNNNNNNNN (arbitrary_comment) ##
## text1 text2 NNNNNNNNN (arbitrary_comment) ##
##
## CONFIG-PARAMS-END ##
<arbitrary rest of file>

Output:
I'd like to validate the file with awk or perl, to check that it starts this way.

If yes, output just the data lines (not the start/end, or "bare" lines, or anything after this section), and if no, return a nonzero rc [$?] or some other easily testable condition such as [empty string].
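Both failure signals mentioned above (a nonzero rc and empty output) can be checked together from a POSIX shell. The following is only a minimal sketch: `get_params` is a hypothetical helper name, and the validator is whatever command you pass to it as arguments.

```shell
# get_params: hypothetical helper (the name is an assumption, not part of
# any package). Runs the validator command given as its arguments and
# fails if that command exits nonzero OR prints nothing at all.
get_params() {
    out=$("$@") || return 1          # nonzero rc from the validator
    [ -n "$out" ] || return 1        # rc was 0 but no data lines came out
    printf '%s\n' "$out"             # pass the data lines through
}
```

A caller would then write something like `if params=$(get_params some_validator input.txt); then ...; fi`, where `some_validator` stands in for the real awk or perl command.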

File spec:
In modern (PCRE) regex terms, the data lines format is:

^##[[:space:]]*                    - starts with ## and optional spaces
  (([a-zA-Z0-9_-]+\.)+)            - >=1 repetition of [text_string][dot] (no spaces)
    [[:space:]]+                   - spaces
      ([^[:space:]]+)              - block of non-spaces
        [[:space:]]+               - spaces
          ([0-9]+)                 - block of digits
            [[:space:]]+           - spaces
              \(.*                 - '(' + any text
                ##[[:space:]]*$    - 2 hashes, optional spaces + line end

( so a typical line might be ## abc.3ef. w;4o8c-uy3tu!ae 9938 (good luck!)##  )

There mustn't be any other lines (including empty/whitespace lines) before the first line, or anywhere else in the data block. Within each line, consecutive white space effectively acts as a single delimiter. White space after the first ## and before+after the last ## are all optional. There will typically be <15 lines in the section so size/speed/efficiency will be negligible considerations.

(The greedy capture in the second-to-last part of the regex isn't an issue: it backtracks just far enough to let the final part match the closing '##'.)
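To try the data-line format against sample lines from a shell, here is a minimal sketch using awk. `is_data_line` is a hypothetical helper name. As a labeled assumption, whitespace is written as `[ \t]` (space or tab), which is narrower than `[[:space:]]`; swap the class back in where the target awk is known to support POSIX character classes.

```shell
# is_data_line: hypothetical helper. Succeeds (rc 0) when the string in $1
# matches the data-line format, fails (rc 1) otherwise.
# Assumption: whitespace means space or tab only, hence [ \t].
is_data_line() {
    printf '%s\n' "$1" | awk '
        /^##[ \t]*([a-zA-Z0-9_-]+\.)+[ \t]+[^ \t]+[ \t]+[0-9]+[ \t]+\(.*##[ \t]*$/ { ok = 1 }
        END { exit !ok }'
}
```

For example, `is_data_line '## abc.3ef. w;4o8c-uy3tu!ae 9938 (good luck!)##'` succeeds, while a bare `##` line or a line with no digit block fails.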

Compatibility:
Wide compatibility is important, as the code will eventually need to be runnable on default/standard builds of different Linux, FreeBSD + other BSDs, maybe even other modern *nix platforms. (It's part of a patch for a widely used open-source package). Perhaps basic POSIX would provide a level playing field rather than assuming only some specific awk/perl variant? Maintainability/ease of understanding is also useful for the same reason. (I originally added that I was hoping greatly to avoid perl ;-) but I've removed that, see comments.)

I haven't got the hang of using any text processing method for this sort of forward-and-backward referencing and checking, and I have even less idea how to manage compatibility / slight differences between implementations.

Awk/perl skills would be appreciated to get a working version of this snippet!

Stilez

Posted 2019-03-02T22:51:00.117

Reputation: 1 183

Judging from how competently you expose the problem, it seems you already are knowledgeable enough to solve this. Please tell us exactly where you got stuck and post what you have so far. – simlev – 2019-05-10T09:01:38.687

-1 for "Hoping greatly to avoid perl ;-)". Perl is more consistent than, say, AWK between e.g. Linux and FreeBSD. For this kind of job, I'd go with Perl any day (or Python, or PHP, or any language that you are comfortable with and that provides solid built-in PCRE support while allowing you to write clear code). – simlev – 2019-05-10T09:08:14.013

@simlev - I'm competent with PCRE regex, and understanding of the problem + its requirements. But I've never used awk or perl in my life, and have zero knowledge - literally - of either. (Which is kinda where I'm stuck, to answer your question). I'm happy to accept your advice on perl, but the syntax appears incomprehensible in examples. But maybe I prejudged - I guess regex must have seemed that way, once, long ago. So scrap that concern, and thank you. Maybe this is where I first dabble in perl? But the question remains, how do I solve this problem? – Stilez – 2019-05-10T15:30:14.273

I also upvoted your comment, I think on reflection you're right to haul me up if there's an appropriate and widely used tool that through ignorance and newness, I've excluded from my thinking. Question edited. – Stilez – 2019-05-10T15:37:45.567

Answers

0

Here's a simple script that should prove easy to follow and effortlessly port to other languages such as AWK, Python, Bash... Usage: perl validate.pl input.txt

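use strict;
use warnings;

my @data;
# $state: 0 = no data line yet, 1 = inside the data block, 2 = closing "##" seen
# (named $state rather than $a, because $a and $b are sort's special variables)
my $state = 0;
my ($filename) = @ARGV;
my $expr = '^##[[:space:]]*([a-zA-Z0-9_-]+\.)+[[:space:]]+[^[:space:]]+[[:space:]]+[0-9]+[[:space:]]+(\(.*)##[[:space:]]*$';
open my $fh, "<:encoding(utf8)", $filename or die "Could not open $filename: $!";

while (my $line = <$fh>) {
    chomp $line;
    # Line 1 must be the START marker, line 2 a bare "##".
    if ($. == 1 and $line ne '## CONFIG-PARAMS-START ##') {
        exit 1;
    }
    if ($. == 2 and $line ne '##') {
        exit 1;
    }
    # A bare "##" after at least one data line closes the data block.
    if ($state == 1 and $line eq '##') {
        $state = 2;
        next;
    }
    # Until the block is closed, every line must be a valid data line.
    if ($. > 2 and $state < 2) {
        if ($line =~ /$expr/) {
            push @data, $line;
            $state = 1;
            next;
        } else {
            exit 1;
        }
    }
    # Right after the closing "##", only the END marker is acceptable.
    if ($state == 2) {
        if ($line eq '## CONFIG-PARAMS-END ##') {
            print join("\n", @data), "\n";
            exit 0;
        } else {
            exit 1;
        }
    }
}
exit 1;    # EOF before the END marker (or an empty file): not valid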

I also wrote a slightly different version that feels more native:

use strict;
use warnings;

my @data;
my ($filename) = @ARGV;
my $expr = '^##\s*([\w-]+\.)+\s+\S+\s+\d+\s+\(.*##\s*$';
open my $fh, "<:encoding(utf8)", $filename or die "Could not open $filename: $!";

while (<$fh>) {
    chomp;
    if ($. == 1 and !/^## CONFIG-PARAMS-START ##$/) {exit 1}
    if ($. == 2 and !/^##$/) {exit 1}
    if ($. > 2) {
        if (/^##$/) {
            # A closing "##" needs at least one data line before it ...
            if (@data == 0) {exit 1}
            # ... and the END marker right after it (EOF here is a failure).
            my $next = <$fh>;
            if (defined $next and $next =~ /^## CONFIG-PARAMS-END ##$/) {
                print join("\n", @data), "\n"; exit 0;
            } else {exit 1}
        }
        if (/$expr/) {push @data, $_} else {exit 1}
    }
}
exit 1;    # EOF before the END marker (or an empty file): not valid

Explanation:

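  • The regexp in the second version takes advantage of Perl shorthands that make it easier for me to read. The bracketed forms are the ASCII equivalents; on decoded UTF-8 input, \d, \w and \s also match non-ASCII digits, letters and spaces unless the /a modifier (Perl 5.14+) is added:

    • \d for any digit ([0-9])
    • \w for a word character ([a-zA-Z0-9_])
    • \s for a space ([\r\n\t\f\v ])
    • \S for a non-space ([^\r\n\t\f\v ])
  • <$fh> reads a line from the $fh filehandle
  • chomp removes the trailing \n from the current line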
  • $_ represents the current element (line).
    It is implied when missing, so that e.g. if (/^##$/) actually means if($_ =~ /^##$/).
  • $. contains the current line number
  • scalar @data is the number of elements in the @data array
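
Since the question also asks about awk, here is a rough sketch of how the same line-by-line checks might look as an awk program wrapped in a shell function. It's a sketch under stated assumptions rather than a tested port: `validate_params` is a hypothetical name, the START/END and bare "##" lines are compared as exact strings, and whitespace in the data-line regex is written as `[ \t]` (space or tab) instead of `[[:space:]]` to sidestep any doubt about character-class support in older awk builds.

```shell
# validate_params: hypothetical helper. Prints the data lines of file $1
# and returns 0 if the file starts with a valid block, else returns nonzero.
# Assumption: whitespace inside data lines is space or tab only.
validate_params() {
    awk '
        # Any failure simply jumps to END with ok still unset.
        NR == 1 { if ($0 != "## CONFIG-PARAMS-START ##") exit; next }
        NR == 2 { if ($0 != "##") exit; next }
        # The line right after the closing "##" must be the END marker.
        closing { if ($0 == "## CONFIG-PARAMS-END ##") ok = 1; exit }
        # A bare "##" closes the block, but only after some data line.
        $0 == "##" { if (n == 0) exit; closing = 1; next }
        # Valid data lines are stored until the whole block is confirmed.
        /^##[ \t]*([a-zA-Z0-9_-]+\.)+[ \t]+[^ \t]+[ \t]+[0-9]+[ \t]+\(.*##[ \t]*$/ {
            data[++n] = $0; next
        }
        # Anything else inside the block is an error.
        { exit }
        END {
            if (!ok) exit 1
            for (i = 1; i <= n; i++) print data[i]
        }
    ' "$1"
}
```

Usage would be along the lines of `validate_params input.txt || echo "invalid header"`. Buffering the data lines and printing them only in END means nothing is output unless the END marker was actually reached, so a truncated file gives both empty output and a nonzero rc.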

simlev

Posted 2019-03-02T22:51:00.117

Reputation: 3 184