30

I simply need to get the match from a regular expression:

$ cat myfile.txt | SOMETHING_HERE "/(\w).+/"

The output has to be only what was matched inside the parentheses.

I don't think I can use grep, because it matches the whole line.

Please let me know how to do this.

Alex L

7 Answers

25

Use the -o option in grep.

Eg:

$ echo "foobarbaz" | grep -o 'b[aeiou]r'
bar
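
One caveat worth checking against the question: -o prints the whole match, not just a parenthesised group, and it prints every match on a line separately. A small sketch (the sample strings are made up for illustration):

```shell
# -o prints each non-overlapping match on its own line.
matches=$(echo "foobarbaz bir" | grep -o 'b[aeiou]r')
echo "$matches"

# Grouping does not narrow the output: the whole match is still printed.
grouped=$(echo "foobarbaz" | grep -oE 'b([aeiou])r')
echo "$grouped"
```

So -o alone is enough only when the whole match is exactly the text you want.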
Amandasaurus
24

Two things:

  • As stated by @Rory, you need the -o option, so that only the match is printed (instead of the whole line).
  • In addition, you need the -P option to use Perl regular expressions, which include useful elements like look-ahead (?= ) and look-behind (?<= ). These check for surrounding text but don't include it in the match, so it is not printed.

If you want only the part inside the parentheses to be matched, do the following:

grep -oP '(?<=\/\()\w(?=\).+\/)' myfile.txt

If the file contains the string /(a)5667/, grep will print 'a', because:

  • /( are found by \/\(, but because they are in a look-behind (?<= ) they are not reported
  • a is matched by \w and is thus printed (because of -o )
  • )5667/ are found by \).+\/, but because they are in a look-ahead (?= ) they are not reported
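
The walkthrough above can be tried on the sample string directly. This is a sketch that assumes GNU grep built with PCRE support; the sed fallback is my own substitution for systems where -P is unavailable, not part of the answer:

```shell
sample='/(a)5667/'

# Probe for -P support first, since not every grep build has it.
if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
    # Look-behind and look-ahead check the surrounding text without printing it.
    inner=$(printf '%s\n' "$sample" | grep -oP '(?<=\/\()\w(?=\).+\/)')
else
    # Assumed fallback: a POSIX sed capture group doing the same job.
    inner=$(printf '%s\n' "$sample" | sed -n 's|^.*/(\([[:alnum:]_]\)).\{1,\}/.*$|\1|p')
fi
echo "$inner"
```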
David
DrYak
18
    sed -n "s/^.*\(captureThis\).*$/\1/p"

-n      don't print lines
s       substitute
^.*     matches anything before the captureThis 
\( \)   capture everything between and assign it to \1 
.*$     matches anything after the captureThis 
\1      replace everything with captureThis 
p       print it
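
As a concrete sketch of the same pattern, here is the command applied to a made-up line (the `id=... name=...` format is only an illustration, not something from the question):

```shell
line='id=42 name=foo status=ok'

# Capture the lowercase word that follows "name=".
name=$(printf '%s\n' "$line" | sed -n 's/^.*name=\([a-z]*\).*$/\1/p')
echo "$name"

# A line without the pattern prints nothing, thanks to -n plus p.
none=$(printf '%s\n' 'no match here' | sed -n 's/^.*name=\([a-z]*\).*$/\1/p')
echo "[$none]"
```

Because the leading .* is greedy, a line containing the pattern more than once would yield the last occurrence.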
Joshua
8

Because you tagged your question as bash in addition to shell, there is another solution besides grep:

Bash has had its own regular expression engine since version 3.0, using the =~ operator, just like Perl.

Now, given the following code:

#!/bin/bash
DATA="test <Lane>8</Lane>"

if [[ "$DATA" =~ \<Lane\>([[:digit:]]+)\<\/Lane\> ]]; then
        echo "$BASH_REMATCH"        # the whole match
        echo "${BASH_REMATCH[1]}"   # the first capture group
fi
  • Note that you have to invoke it as bash and not just sh in order to get all extensions
  • $BASH_REMATCH will give the whole string as matched by the whole regular expression, so <Lane>8</Lane>
  • ${BASH_REMATCH[1]} will give the part matched by the 1st group, thus only 8
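
To apply this to a stream of lines, as in the original question, one option is a small loop around =~. This is a sketch only: `extract_group` is a hypothetical helper name and the input lines are made up. It needs bash, not plain sh.

```shell
#!/bin/bash
# extract_group REGEX (hypothetical helper): reads stdin and prints
# capture group 1 for every line that matches REGEX.
extract_group() {
    local re=$1 line
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            printf '%s\n' "${BASH_REMATCH[1]}"
        fi
    done
    return 0
}

lanes=$(printf '%s\n' 'test <Lane>8</Lane>' 'no match here' '<Lane>12</Lane> end' |
    extract_group '<Lane>([[:digit:]]+)</Lane>')
echo "$lanes"
```

Keeping the regex in a variable and leaving it unquoted on the right-hand side of =~ avoids having to backslash-escape the angle brackets.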
DrYak
  • Dear @DrYak, I hope you're not parsing XML with regex here.. :) – joonas.fi Jun 16 '16 at 12:43
  • It's even worse. I'm parsing a horrible mix of XML and FASTA data (which both use the `>` symbol for entirely different purposes) as spewed out by the [SANSparallel](http://ekhidna2.biocenter.helsinki.fi/sans/) fast large-scale alignment software. Of course both formats are spewed interlaced without any escaping, so it's impossible to throw some standard XML library at this. I'm using Bash regex at this point of the code because I only need to extract a couple of values, and 2 regexes do the job much better for me than writing a dedicated parser for this mess. #LifeInBioinformatics – DrYak Jun 18 '16 at 10:24
  • In other words: there's a point where extracting 1 single number is simpler to do with a regex rather than dancing the whole XML tango – DrYak Jun 18 '16 at 10:25
  • Hah, gotcha! :) – joonas.fi Jul 04 '16 at 10:07
  • +1 This was absolutely fantastic- I was looking for a way to `match but not capture` as in PCRE lookahead and could not find a way. This *grouping* just worked the way I wanted to capture my results. – Rajib Oct 26 '20 at 19:06
  • @Rajib: As far as I know, Bash only supports _POSIX_ regex, not _PCRE_, so a few features are missing. Indeed, look-ahead and look-behind aren't available, and using grouping is the next best thing. – DrYak Oct 30 '20 at 19:10
4

If you want only what is in the parentheses, you need something that supports capturing sub-matches (named or numbered capturing groups). I don't think grep or egrep can do this, but perl and sed can. For example, with perl:

If a file called foo contains the following line:

/adsdds      /

And you do:

perl -nle 'print $1 if /\/(\w).+\//' foo

The letter a is returned. That might not be what you want, though. If you tell us what you are trying to match, you might get better help. $1 is whatever was captured in the first set of parentheses, $2 would be the second set, and so on.

Kyle Brandt
  • I was just trying to match what is in parenthesis. Seems like passing it to a perl or a php script might be the answer. – Alex L Aug 06 '09 at 18:01
  • Are you really sure sed can do named capturing groups ? I was unable to find anything related... – ssc May 11 '20 at 16:01
4

Assuming the file contains:

$ cat file
Text-here>xyz</more text

And you want the character(s) between > and </ , you can use either:

grep -oP '.*\K(?<=>)\w+(?=<\/)' file
sed -nE 's:^.*>(\w+)</.*$:\1:p' file
awk '{print(gensub("^.*>(\\w+)</.*$","\\1","g"))}' file
perl -nle 'print $1 if />(\w+)<\//' file

All will print the string "xyz".

If you want to capture the digits of this line:

$ cat file
Text-<here>1234</text>-ends

grep -oP '.*\K(?<=>)[0-9]+(?=<\/)' file
sed -E 's:^.*>([0-9]+)</.*$:\1:' file
awk '{print(gensub(".*>([0-9]+)</.*","\\1","g"))}' file
perl -nle 'print $1 if />([0-9]+)<\//' file
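
The commands above lean on GNU extensions (grep -P, sed -E, gawk's gensub). Where those are not available, a sketch using only POSIX basic regular expressions in sed might look like this:

```shell
line='Text-<here>1234</text>-ends'

# POSIX basic regex: \( \) for groups, \{1,\} instead of +,
# and [[:digit:]] instead of \d.
digits=$(printf '%s\n' "$line" | sed -n 's:^.*>\([[:digit:]]\{1,\}\)</.*$:\1:p')
echo "$digits"
```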

  • To me it was crucial to realize that \d doesn't work with sed. There's a reason you use [0-9]+ there. :) – user27432 May 07 '19 at 16:42
  • @user27432 It does not, but POSIX character classes ([painful reading](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html), [pleasant reading](https://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html)) do: `echo 'Text-<here>1234</text>-ends' | sed -E 's|.*>([[:digit:]]+)<.*|\1|'`. In some cases (e.g. `[0-9]` vs. `[[:digit:]]`) they don't help legibility, in others I think they do (e.g. `[ \t\n\r\f\v]` vs. `[[:space:]]`). – Samuel Harmer Jan 21 '20 at 08:51
  • @SamuelHarmer Could you please clarify what is it that you mean with: *It does not*? –  Mar 11 '20 at 15:52
  • @Isaac I was referring to @user27432's comment about the `\d` character group not working, and drawing their attention to POSIX character classes. – Samuel Harmer Mar 11 '20 at 18:17
0

This will accomplish what you are requesting, but I don't think it is what you really want. I put the .* in the front of the regex to eat up anything before the match, but that is a greedy operation, so this only matches the penultimate \w character in the string.

Note that you need to escape the parens and the +.

sed 's/.*\(\w\).\+/\1/' myfile.txt
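
The effect of the greedy .* is easy to see on a made-up string. This sketch swaps the GNU-only \w and \+ for their POSIX equivalents ([[:alnum:]_] and \{1,\}), so it should behave the same on other sed implementations. The anchored variant is one possible way to get the first word character instead:

```shell
s='hello world'

# Greedy .* leaves only one trailing character for the final "one or more",
# so the group captures the second-to-last character.
greedy=$(printf '%s\n' "$s" | sed 's/.*\([[:alnum:]_]\).\{1,\}/\1/')

# Anchoring at the start and skipping non-word characters captures
# the first word character instead.
first=$(printf '%s\n' "$s" | sed 's/^[^[:alnum:]_]*\([[:alnum:]_]\).*/\1/')
echo "$greedy $first"
```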
Chad Huneycutt