
I'm building a script which filters out all the translatables from our template system. The problem I'm facing is the case where two translatables are on one line.

These are two example lines from a template file, each holding one or more translatables:

<img src="/captcha/generate.jpg" alt="[#Captcha#]" />
<span>[#Velden met een * zijn verplicht in te vullen#]</span> <button type="submit" name="frm_submit" class="right">[#Verzend#]

When I run the following regexp

egrep "\[#(.*)#\]" . -Rohis

I get this output

[#Captcha#]"
[#Velden met een * zijn verplicht in te vullen#]</span> <button type="submit" name="frm_submit" class="right">[#Verzend#]

While the desired output is

[#Captcha#]
[#Velden met een * zijn verplicht in te vullen#]
[#Verzend#]
ChrisR

4 Answers


Trying to parse HTML with regular expressions leads to pain; just don't do it. Jeff wrote about it in "Parsing Html The Cthulhu Way". "But I only want to ______" doesn't matter; seriously, don't do it. Take a little time and learn a tool designed to parse HTML.

I personally would recommend a Perl module for this (such as HTML::TreeBuilder), but what languages do you know? Other options are tidy + xmlstarlet, xmlgawk, xpath, and xml2.

If you just want the "rendered" html in text, you can use something like links: links -dump www.google.com

Kyle Brandt
  • I'm not trying to parse HTML; I'm trying to compile a list of all translation strings used in our framework. Whether they are in an html, tpl, txt, csv or php file doesn't matter, nor does the context. – ChrisR Feb 04 '10 at 14:46
  • Ah, well, the negated character class is the general way to get non-greedy-like behaviour, as others stated. – Kyle Brandt Feb 04 '10 at 15:03

I agree with Kyle. However, in this case you could do this: egrep "\[#([^]]*)#\]" . -Rohis and get what you're looking for.

The [^]]* matches non-] characters, so it'll stop at the first ] it hits.
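To see the effect, here is a minimal sketch run against the two sample lines from the question. The scratch file and the variable names are made up for the illustration, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical scratch file holding the two sample lines from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<img src="/captcha/generate.jpg" alt="[#Captcha#]" />
<span>[#Velden met een * zijn verplicht in te vullen#]</span> <button type="submit" name="frm_submit" class="right">[#Verzend#]
EOF

# [^]]* cannot run past a closing bracket, so every [#...#] is matched on its own.
# -o prints each match on a separate line.
result=$(grep -oE '\[#[^]]*#\]' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Each translatable should come out on its own line, including the two that share a line in the input.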

Bill Weiss
  • As a side note, this trick (in the 2nd paragraph) works here but is not a solution to the general problem; it fails on `[#Verze]nd#]` – Edward Torvalds Feb 09 '19 at 18:59

I second Kyle's comment.

Anyway, if the pattern is sufficiently uniform, you can achieve what you want by matching anything but the closing character, i.e.

egrep "\[#([^]]*)#\]" ...

This pattern will not work if you have a closing ] inside the text, e.g. [#xyz]#] will not match.
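The limitation is easy to reproduce. The sketch below feeds that string through the negated-class pattern and then through a non-greedy pattern for comparison. The -P (PCRE) option is an assumption: GNU grep built with PCRE support has it, other greps may not. The variable names are only for illustration.

```shell
input='[#xyz]#]'

# Negated class: the run stops at the inner ], which is not followed by "#]",
# so nothing matches (grep exits non-zero, hence the "|| true").
negated=$(printf '%s\n' "$input" | grep -oE '\[#[^]]*#\]' || true)

# Non-greedy .*? (assumes grep -P is available): expands only until the first "#]".
lazy=$(printf '%s\n' "$input" | grep -oP '\[#.*?#\]' || true)

printf 'negated class: "%s"\n' "$negated"
printf 'non-greedy:    "%s"\n' "$lazy"
```

So the negated class is fine as long as translatables never contain a ] themselves; if they can, a non-greedy match is the safer choice.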

Dan Andreatta

Use Perl, which lets you do non-greedy matches. For your case, where you want multiple matches per line, loop over them with the /g modifier. We add the -l switch because we have to handle linefeeds ourselves:

perl -lne'while(/\[#(.*?)#\]/g) {print $&}' filename
Andy Lester