Linux Extract Matched Text Field from File

1

I have a file which has many lines of the format:

bc("STG1/Phone") = {type=bana_pub; cbb=12.354; abb=0.0}

I'm looking to extract cbb=12.354;. Currently, I'm doing the following:

cat input_file.txt | grep cbb | awk -F " " '{ print $4 }'

The problem is that my approach is position-specific, i.e. it assumes the value is always the 4th field. I want to extract text of the form cbb=<value>, knowing that after the = it could be any length and the semi-colon ; is optional. The only guarantee I have is that the term cbb=12.354; will be surrounded by whitespace, if that helps. In future the file may contain lines in either of these formats:

bc("STG1/Phone") = {type=bana_pub; cbb=12.354; abb=0.0}
bc("STG1/Phone") = {type=bana_pub;  abb=0.0; cbb=12.354}

My gut tells me regex is probably the way to go, but I generally try and avoid it if I can as I prefer simple matching tools (which I understand better).

Thanks in anticipation for your help.

fswings

Posted 2017-10-27T12:54:27.927

Reputation: 666

Is a one-liner mandatory, or is a bash script allowed? – Alessandro Carini – 2017-10-27T12:59:59.227

Preference is one liner but I'm looking to learn - so yes bash scripts are allowed. – fswings – 2017-10-27T13:08:00.273

You should add a more complete input file snippet that includes lines where the desired string is at different positions. – simlev – 2017-11-02T09:17:17.280

Answers

2

Solution:

grep -Eo 'cbb=[^;}]+'

Let's test it:

$ grep -Eo 'cbb=[^;}]+' <<<'bc("STG1/Phone") = {type=bana_pub; cbb=12.354; abb=0.0}'
cbb=12.354

Explanation:

When you run ... | grep cbb | ... you are already using a basic regular expression. Extended regular expressions aren't much more complicated.

Option -E enables extended regular expressions, which saves you from escaping some metacharacters (such as +). Option -o prints only the part of the line that grep matched instead of the whole line.

The regex cbb=[^;}]+ would be the same in other tools that support extended regular expressions, not just grep.

cbb= is a fixed string with no metacharacters (c followed by b, and so on).

[^;}]+ is a negated character set followed by a quantifier. Square brackets delimit a character set that matches a single position. A caret at the beginning negates the set. The plus sign means one or more of the preceding item. Together they match at least one character and keep going until a ; or } (or the end of the line) is reached.
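
To see the pattern handle both line layouts from the question, here is a small sketch. The function name extract_cbb and the sample variable are made up for illustration, and the cut step is an optional extra (not part of the answer above) for when only the number is wanted. It assumes a grep that supports -o, which GNU and BSD grep do.

```shell
# Hypothetical helper: reads lines on stdin and prints each cbb=<value> match.
extract_cbb() {
  grep -Eo 'cbb=[^;}]+'
}

# The two line layouts from the question.
sample='bc("STG1/Phone") = {type=bana_pub; cbb=12.354; abb=0.0}
bc("STG1/Phone") = {type=bana_pub;  abb=0.0; cbb=12.354}'

# Prints cbb=12.354 once per line, with or without the trailing semicolon.
printf '%s\n' "$sample" | extract_cbb

# Optional: keep only the text after the "=".
printf '%s\n' "$sample" | extract_cbb | cut -d= -f2
```

The match stops at the ; in the first layout and at the } in the second, so the position of the field no longer matters.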

Here's a good link to learn more about regexes: https://www.regular-expressions.info/characters.html

Paulo

Posted 2017-10-27T12:54:27.927

Reputation: 606

Can you add a brief explanation of how it works? – fswings – 2017-10-28T08:48:19.847

It certainly does work and it's easy enough for me to remember. – fswings – 2017-10-28T08:55:52.257

Thanks, selected because I like its simplicity and it's easy enough to remember. – fswings – 2017-10-28T19:30:07.287

2

This works and is position-independent:

grep cbb input_file.txt | awk -F "cbb=" '{ print $2 }'| awk -F ";" '{print "cbb=" $1}'

First, grep selects only the lines containing cbb. The first awk then splits each line on the string cbb= and keeps what follows it. The second awk splits that remainder on ; and prints the first piece with the string cbb= prepended.
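
When the value is not followed by a ; (as in the second layout in the question), splitting on ; alone leaves the closing } attached. One possible variant, offered as a sketch rather than as the answerer's method, splits the remainder on either ; or } and does everything in a single awk call. The function name extract_cbb_awk is hypothetical.

```shell
# Hypothetical helper: reads lines on stdin, prints cbb=<value> for each line containing cbb=.
extract_cbb_awk() {
  awk -F 'cbb=' '/cbb=/ {
    split($2, rest, /[;}]/)   # cut the text after cbb= at the first ; or }
    print "cbb=" rest[1]
  }'
}

# Both layouts from the question; each prints cbb=12.354.
printf '%s\n' \
  'bc("STG1/Phone") = {type=bana_pub; cbb=12.354; abb=0.0}' \
  'bc("STG1/Phone") = {type=bana_pub;  abb=0.0; cbb=12.354}' |
  extract_cbb_awk
```

The /cbb=/ pattern replaces the separate grep, so lines without the field are skipped.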

jcbermu

Posted 2017-10-27T12:54:27.927

Reputation: 15 868

Confirmed that this works. For line 1 I get cbb=12.354 and for line 2 I get cbb=12.354} as your trick in using ; is not applicable (doesn't have one). Thanks for the quick response. – fswings – 2017-10-27T13:30:47.087

0

You can also use sed. Since sed is called only once, it should be faster.

sed -n 's/^.*\(cbb=[0-9.]*\).*$/\1/p' sample.txt

Here sample.txt is your input file. The pattern accepts only digits and dots ([0-9.]) after cbb=, which sidesteps the issue of the optional semicolon.

Alessandro Carini

Posted 2017-10-27T12:54:27.927

Reputation: 66

Can you add a brief explanation of how it works? – fswings – 2017-10-28T08:47:58.187

I used sed to replace the whole line matched by the regular expression with the group captured by that same expression.

The regular expression cbb=[0-9.]*, wrapped in an escaped capture group \( ... \), searches for cbb= followed by any number of digits and '.' characters, and \1 returns the text captured by that first group.

Option -n is needed to suppress the default output, while the p at the end prints the result of the substitution.

I chose sed over awk because capture groups are not available in awk (gawk not considered).

Note 2: a capture group is the part of the regular expression delimited by '(' and ')'. In sed's default (basic) syntax the parentheses must be escaped.

Note 3: No sanity check is performed (e.g. 123.45.67 will be parsed as a number). – Alessandro Carini – 2017-10-28T09:35:48.867

Thanks, feel free to add it to your answer for posterity. – fswings – 2017-10-28T19:29:33.510

0

In this case, grep is the right tool for the job. However, I thought I'd add:

  • Perl

    perl -lane 'print $1 if /(cbb=[^;}]+)/' input_file.txt
    
  • AWK (GNU awk; the three-argument match() is a gawk extension)

    awk 'match($0,/cbb=[^;}]+/,m) {print m[0]}' input_file.txt
    
  • Sed

    sed -rn 's/.*(cbb=[^;}]+).*/\1/p' input_file.txt
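
For awk implementations other than GNU awk, a possible equivalent of the AWK one-liner above uses the two-argument match() together with the RSTART and RLENGTH variables that it sets. This is a sketch under that assumption, and the function name extract_cbb_posix_awk is made up for illustration.

```shell
# Hypothetical helper: reads lines on stdin and prints the first cbb=<value> match on each line.
extract_cbb_posix_awk() {
  # match() sets RSTART (start position) and RLENGTH (match length) on success.
  awk 'match($0, /cbb=[^;}]+/) { print substr($0, RSTART, RLENGTH) }'
}

# Both layouts from the question; each prints cbb=12.354.
printf '%s\n' \
  'bc("STG1/Phone") = {type=bana_pub; cbb=12.354; abb=0.0}' \
  'bc("STG1/Phone") = {type=bana_pub;  abb=0.0; cbb=12.354}' |
  extract_cbb_posix_awk
```

Because match() returns 0 when nothing is found, lines without cbb= print nothing.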
    

Credits to Paulo for understanding what the OP meant with:

after the = it could be any length and the semi-colon ; is optional. The only guarantee I have is that the term cbb=12.354; will be surrounded by whitespace

simlev

Posted 2017-10-27T12:54:27.927

Reputation: 3 184