Don't need the whole line, just the match from regular expression

Question

I simply need to get the match from a regular expression:

$ cat myfile.txt | SOMETHING_HERE "/(\w).+/"

The output has to be only what was matched inside the parentheses.

Don't think I can use grep because it matches the whole line.

Please let me know how to do this.

score 25 · Answer 1 · answered Aug 06 '09 at 16:36

Use the -o option in grep.

Eg:

$ echo "foobarbaz" | grep -o 'b[aeiou]r'
bar
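
A small sketch of how -o behaves with several matches and with a group in the pattern. The sample strings are made up for illustration:

```shell
#!/bin/sh
# -o prints each match on its own line, even several matches from one input line.
matches=$(printf 'foobarbaz bir\n' | grep -o 'b[aeiou]r')
printf '%s\n' "$matches"

# With a group in the pattern, -o still prints the whole match, not just the group.
whole=$(printf 'foobarbaz\n' | grep -oE 'b([aeiou])r')
printf '%s\n' "$whole"
```

So -o narrows the output from the whole line to the whole match, but it does not narrow it further to a parenthesised group.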

Amandasaurus

4

Good grief... Do you have any idea how many times I wrestled with `sed` backreferences to do that? – Insyte Aug 06 '09 at 17:36
12

The -o option to grep/egrep returns only what matched the entire regular expression, not just what is in () like he asked for. – Kyle Brandt Aug 06 '09 at 17:59
2

However, that is a very good thing to know anyways :-) – Kyle Brandt Aug 06 '09 at 18:00
2

@KyleBrandt: To match only one part (e.g. the parentheses) it's possible to mark the rest with a look-ahead or look-behind: (?<= ) and (?= ) – DrYak Jan 20 '15 at 13:09
This only works for me using `egrep`, though it's been 12 years, so probably a lot has changed. – Maximilian Press Jul 09 '21 at 17:29

score 24 · Accepted Answer · edited Jan 19 '21 at 12:59

Two things:

As stated by @Rory, you need the -o option, so that only the match is printed (instead of the whole line).
In addition, you need the -P option to use Perl regular expressions, which include useful elements like look-ahead (?= ) and look-behind (?<= ). These check for surrounding text but don't include it in the match, so it isn't printed.

If you want only the part inside the parentheses to be matched, do the following:

grep -oP '(?<=\/\()\w(?=\).+\/)' myfile.txt

If the file contains the string /(a)5667/, grep will print 'a', because:

/( are found by \/\(, but because they are in a look-behind (?<= ) they are not reported
a is matched by \w and is thus printed (because of -o )
)5667/ are found by \).+\/, but because they are in a look-ahead (?= ) they are not reported
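
To try the breakdown above end to end, here is a minimal sketch. It assumes GNU grep built with PCRE support (-P), and the file contents are invented for the test:

```shell
#!/bin/sh
# Assumes GNU grep with -P support; the sample lines are made up for illustration.
tmpfile=$(mktemp)
printf '%s\n' 'prefix /(a)5667/ suffix' 'no match here' > "$tmpfile"

# Look-behind and look-ahead assert the surrounding text without including it in the match.
captured=$(grep -oP '(?<=\/\()\w(?=\).+\/)' "$tmpfile")
printf '%s\n' "$captured"

rm -f "$tmpfile"
```

The second line has no /( ... )/ structure, so grep prints nothing for it.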

score 18 · Answer 3 · answered Apr 22 '16 at 15:58

    sed -n "s/^.*\(captureThis\).*$/\1/p"

-n      don't print lines
s       substitute
^.*     matches anything before the captureThis 
\( \)   capture everything between and assign it to \1 
.*$     matches anything after the captureThis 
\1      replace everything with captureThis 
p       print it
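
A concrete run of the same pattern, as a sketch: here `id=` followed by digits stands in for captureThis, and the input lines are made up:

```shell
#!/bin/sh
# Sample input is invented; the digits after "id=" play the role of captureThis.
input='user=alice id=42 shell=/bin/sh
this line has no match'

# -n suppresses the default output; the trailing p prints only lines where the substitution happened.
id=$(printf '%s\n' "$input" | sed -n 's/^.*id=\([0-9][0-9]*\).*$/\1/p')
printf '%s\n' "$id"
```

Without -n and p, sed would also echo the non-matching line unchanged, which is usually not what you want when extracting.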

Joshua

This should be the accepted answer, IMHO. – EdwardG Apr 30 '20 at 18:07

score 8 · Answer 4 · answered Jan 20 '15 at 13:47

Because you tagged your question as bash in addition to shell, there is another solution besides grep:

Bash has its own regular expression engine since version 3.0, using the =~ operator, just like Perl.

Now, given the following code:

#!/bin/bash
DATA="test <Lane>8</Lane>"

if [[ "$DATA" =~ \<Lane\>([[:digit:]]+)\<\/Lane\> ]]; then
        echo "$BASH_REMATCH"        # whole match: <Lane>8</Lane>
        echo "${BASH_REMATCH[1]}"   # first group: 8
fi

Note that you have to invoke it as bash and not just sh in order to get all extensions
$BASH_REMATCH will give the whole string as matched by the whole regular expression, so <Lane>8</Lane>
${BASH_REMATCH[1]} will give the part matched by the 1st group, thus only 8
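
The same idea can be wrapped in a small function. This is only a sketch: `first_group` is a hypothetical helper name, and it needs bash, not plain sh:

```shell
#!/bin/bash
# Hypothetical helper: print the first capture group of an ERE matched against a string.
first_group() {
    local text=$1 pattern=$2
    # The pattern variable is left unquoted on the right of =~ so it is treated as a regex.
    if [[ $text =~ $pattern ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

lane=$(first_group 'test <Lane>8</Lane>' '<Lane>([[:digit:]]+)</Lane>')
printf '%s\n' "$lane"
```

Keeping the pattern in a variable avoids most of the backslash escaping needed when the regex is written inline inside [[ ]].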

DrYak

Dear @DrYak, I hope you're not parsing XML with regex here.. :) – joonas.fi Jun 16 '16 at 12:43
It's even worse. I'm parsing a horrible mix of XML and FASTA data (which both use the `>` symbol for entirely different purposes) as spewed out by the [SANSparallel](http://ekhidna2.biocenter.helsinki.fi/sans/) fast large-scale alignment software. Of course both formats are spewed interlaced without any escaping. So it's impossible to throw some standard XML library at this. And I'm using Bash regex at this point of the code because I only need to extract a couple of data, and 2 regex do the job much better for me than writing a dedicated parser for this mess. #LifeInBioinformatics – DrYak Jun 18 '16 at 10:24
In other words: there's a point where extracting 1 single number is simpler to do with a regex rather than dancing the whole XML tango – DrYak Jun 18 '16 at 10:25
Hah, gotcha! :) – joonas.fi Jul 04 '16 at 10:07
+1 This was absolutely fantastic- I was looking for a way to `match but not capture` as in PCRE lookahead and could not find a way. This *grouping* just worked the way I wanted to capture my results. – Rajib Oct 26 '20 at 19:06
1

@Rajib: As far as I know, BASH only supports _POSIX_ RegEx but not _PCRE_. Thus there are a few features missing. Indeed, the "look ahead/look behinds" aren't available and using grouping is the next best thing. – DrYak Oct 30 '20 at 19:10

score 4 · Answer 5 · edited Aug 06 '09 at 18:19


If you want only what is in the parentheses, you need something that supports capturing sub-matches (named or numbered capturing groups). I don't think grep or egrep can print just a group; perl and sed can (sed supports numbered groups only). For example, with perl:

If a file called foo has a line as follows:

/adsdds      /

And you do:

perl -nle 'print $1 if /\/(\w).+\//' foo

The letter a is returned. That might not be what you want though. If you tell us what you are trying to match, you might get better help. $1 is whatever was captured in the first set of parentheses. $2 would be the second set, etc.

edited Aug 06 '09 at 18:19

answered Aug 06 '09 at 17:38

Kyle Brandt

I was just trying to match what is in parenthesis. Seems like passing it to a perl or a php script might be the answer. – Alex L Aug 06 '09 at 18:01
Are you really sure sed can do named capturing groups ? I was unable to find anything related... – ssc May 11 '20 at 16:01

score 4 · Answer 6 · edited Jul 22 '17 at 20:10


Assuming the file contains:

$ cat file
Text-here>xyz</more text

And you want the character(s) between > and </ , you can use either:

grep:  grep -oP '.*\K(?<=>)\w+(?=<\/)' file
sed:   sed -nE 's:^.*>(\w+)</.*$:\1:p' file
awk:   awk '{print(gensub("^.*>(\\w+)</.*$","\\1","g"))}' file
perl:  perl -nle 'print $1 if />(\w+)<\//' file

All will print the string "xyz".

If you want to capture the digits of this line:

$ cat file
Text-<here>1234</text>-ends

grep:  grep -oP '.*\K(?<=>)[0-9]+(?=<\/)' file
sed:   sed -E 's:^.*>([0-9]+)</.*$:\1:' file
awk:   awk '{print(gensub(".*>([0-9]+)</.*","\\1","g"))}' file
perl:  perl -nle 'print $1 if />([0-9]+)<\//' file
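
For the grep variant, \K on its own can stand in for the look-behind. A minimal sketch, assuming GNU grep with PCRE support (-P):

```shell
#!/bin/sh
# Assumes GNU grep with -P support.
line='Text-<here>1234</text>-ends'

# \K discards everything matched so far, so only the digits end up in the reported match.
digits=$(printf '%s\n' "$line" | grep -oP '>\K[0-9]+(?=</)')
printf '%s\n' "$digits"
```

Unlike a look-behind, the part before \K may have variable length, which helps when the prefix is not a fixed string.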

edited Jul 22 '17 at 20:10

answered Jul 22 '17 at 08:01

1

For me the crucial thing was to realize that \d doesn't work with sed. There's a reason you use [0-9]+ there. :) – user27432 May 07 '19 at 16:42
1

@user27432 It does not, but POSIX character classes ([painful reading](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html), [pleasant reading](https://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html)) do: `echo 'Text-<here>1234</text>-ends' | sed -E 's|.*>([[:digit:]]+)<.*|\1|'`. In some cases (e.g. `[0-9]` vs. `[[:digit:]]`) they don't help legibility; in others I think they do (e.g. `[ \t\n\r\f\v]` vs. `[[:space:]]`). – Samuel Harmer Jan 21 '20 at 08:51
@SamuelHarmer Could you please clarify what you mean by *It does not*? – Mar 11 '20 at 15:52
@Isaac I was referring to @user27432's comment about the `\d` character group not working, and drawing their attention to POSIX character classes. – Samuel Harmer Mar 11 '20 at 18:17

score 0 · Answer 7 · answered Aug 06 '09 at 18:02

This will accomplish what you are requesting, but I don't think it is what you really want. I put the .* at the front of the regex to eat up anything before the match, but .* is greedy, so the group captures only the last \w character that still has at least one character after it (the penultimate character when the line ends in word characters).

Note that you need to escape the parens and the +.

sed 's/.*\(\w\).\+/\1/' myfile.txt
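
A quick sketch of the greedy effect, and one way around it. It assumes GNU sed (for \w and \+ in basic regular expressions), and the sample line is made up:

```shell
#!/bin/sh
# Assumes GNU sed, where \w and \+ are available in basic regular expressions.
line='-- hello world'

# The leading .* swallows "-- hello wor", so the group captures "l".
greedy=$(printf '%s\n' "$line" | sed 's/.*\(\w\).\+/\1/')
printf '%s\n' "$greedy"

# Skipping only non-word characters at the start captures the first \w instead.
first=$(printf '%s\n' "$line" | sed -n 's/^[^[:alnum:]_]*\(\w\).\+/\1/p')
printf '%s\n' "$first"
```

The second form is closer to what the unanchored /(\w).+/ in the question would capture.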
