non greedy grep command on ubuntu?

Question

I'm building a script which filters out all our translatables from our template system. The problem I'm facing is the case where two translatables are on one line.

These are two example lines from a template file, each holding one or more translatables:

<img src="/captcha/generate.jpg" alt="[#Captcha#]" />
<span>[#Velden met een * zijn verplicht in te vullen#]</span> <button type="submit" name="frm_submit" class="right">[#Verzend#]

When I set loose the following regexp

egrep "\[#(.*)#\]" . -Rohis

I get this output

[#Captcha#]"
[#Velden met een * zijn verplicht in te vullen#]</span> <button type="submit" name="frm_submit" class="right">[#Verzend#]

While the desired output is

[#Captcha#]
[#Velden met een * zijn verplicht in te vullen#]
[#Verzend#]

Kyle Brandt · Answer 1 · 2010-02-04T14:27:28.613

2

Trying to parse HTML with regular expressions leads to pain; just don't do it. Jeff wrote about it in "Parsing Html The Cthulhu Way". "But I only want to ______" doesn't matter; seriously, don't do it. Take a little time and learn something designed to parse HTML.

I personally would recommend a Perl module for this (such as HTML::TreeBuilder), but what languages do you know? Other options are: tidy + xmlstarlet, xmlgawk, xpath, xml2.

If you just want the "rendered" html in text, you can use something like links: links -dump www.google.com

edited Feb 04 '10 at 14:27

answered Feb 04 '10 at 14:20

Kyle Brandt


I'm not trying to parse HTML, I'm trying to compile a list of all translation strings used in our framework. Whether they are in an html, tpl, txt, csv or php file doesn't matter, nor does the context. – ChrisR Feb 04 '10 at 14:46
1

Ah, well the negated character class is the general way to get non-greedy like operations as others stated. – Kyle Brandt Feb 04 '10 at 15:03

score 1 · Accepted Answer · answered Feb 04 '10 at 14:38

1

Agree with Kyle. However, in this case, you could do this: egrep "\[#([^]]*)#\]" . -Rohis and get what you're looking for.

The [^]]* matches non-] characters, so it'll stop at the first ] it hits.
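
As a minimal sketch of the negated character class, the two template lines from the question can be written to a temporary file and searched with grep -Eo (equivalent to egrep -o). The temporary file and the variable names are assumptions made for illustration only.

```shell
# Hypothetical sample file holding the two template lines from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<img src="/captcha/generate.jpg" alt="[#Captcha#]" />
<span>[#Velden met een * zijn verplicht in te vullen#]</span> <button type="submit" name="frm_submit" class="right">[#Verzend#]
EOF

# -E: extended regex, -o: print only the matched part, one match per line.
# [^]]* matches any run of characters other than ']', so a match cannot
# run past the first closing bracket.
matches=$(grep -Eo '\[#[^]]*#\]' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

With this input, each translatable should come out on its own line, which is the desired output listed in the question.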

answered Feb 04 '10 at 14:38

Bill Weiss


As a side note, this trick (in the 2nd paragraph) works here but is not a solution to the general problem; it fails on `[#Verze]nd#]` – Edward Torvalds Feb 09 '19 at 18:59

score 1 · Answer 3 · answered Feb 04 '10 at 14:52

I second Kyle's comment.

Anyway, if the pattern is sufficiently uniform, you can achieve what you want by matching anything but the closing character, i.e.

egrep "\[#([^]]*)#\]" ...

This pattern will not work if you have closing ] inside the text, e.g. [#xyz]#] will not match.
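
If a ] inside the text has to be tolerated, one possible workaround that stays within plain extended regexps is to forbid only the two-character closing sequence #] rather than every ]. This is a sketch, not part of the answer above; the sample input and variable names are hypothetical.

```shell
# Hypothetical input: the first two strings contain a ']' inside their text.
input='a [#xyz]#] b [#Verze]nd#] c [#Verzend#]'

# Body of a match: either a character other than '#', or a run of '#'
# followed by a character that is neither ']' nor '#'. The body therefore
# can never swallow the closing "#]".
pattern='\[#([^#]|#+[^]#])*#+\]'

matches=$(printf '%s\n' "$input" | grep -Eo "$pattern")
printf '%s\n' "$matches"
```

The trade-off is readability: the pattern is harder to follow than [^]]*, so it is probably only worth it when strings with an embedded ] really occur.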

score 0 · Answer 4 · answered May 03 '10 at 17:07

Use Perl, which lets you do non-greedy matches. For your case, where you want multiple matches per line, loop over them with the /g modifier. We add the -l switch because print does not append linefeeds on its own:

perl -lne'while(/\[#(.*?)#\]/g) {print $&}' filename
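
The same non-greedy pattern can also be handed to grep itself, assuming a GNU grep built with PCRE support so that the -P option is available (this is an assumption about the installed grep, and -P is not part of POSIX). The sample input below is hypothetical.

```shell
# Hypothetical input line with two translatables, one containing a ']'.
input='<span>[#Velden met een * zijn verplicht in te vullen#]</span> <button>[#Verze]nd#]'

# -P: Perl-compatible regex (GNU grep extension), -o: print only the matches.
# .*? is the non-greedy form of .* and stops at the first "#]" it can reach.
matches=$(printf '%s\n' "$input" | grep -oP '\[#.*?#\]')
printf '%s\n' "$matches"
```

Where -P is not available, the Perl one-liner above remains the fallback.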
