Your regexp
sed 's#&*;##g' <file>
does not do what you think it does. The * character is a multiplier that says that the preceding character is repeated 0 or more times. The preceding character is &, so this would match e.g. &&&; and also a bare ; (with & written 0 times before the ;, which is what is matching in your test cases), but that is not what you want in this case.
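To make the zero-repetition case concrete, here is a small sketch (the sample string is invented for illustration, not taken from the question): since &* is allowed to match nothing at all, the pattern removes every bare ; as well as any run of & followed by ;.

```shell
# '&*;' means "zero or more '&', then ';'", so a lone ';' is a match too.
# The input string is a made-up sample.
out=$(printf '%s\n' 'a;b &&&;c' | sed 's#&*;##g')
printf '%s\n' "$out"   # prints: ab c
```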
You need to specify "any character" before the multiplier, which is represented by a single dot (.):
$ echo 'Text&#58;3' | sed 's#&.*;##g'
Text3
That was the first problem. The second is the concept of so-called "greedy" matching: sed will see the first & and then try to match the longest string it can. If you have several HTML entities on a single line, this is a problem:
$ echo 'Text&#58;3 and some more text &aring; and end' | sed 's#&.*;##g'
Text and end
If you want to see a fix in the sed context, you could look for the end of the entity by matching any number of "not ;" characters before a closing ;, like so:
$ echo 'Text&#58;3 and some more text &aring; and end' | sed 's#&[^;]*;##g'
Text3 and some more text  and end
You will still have problems with legitimate uses of the ampersand sign (&) in the text (well, &amp; is the real "legitimate" use, but the real world is not always as parsable as the ideal one), where the pattern will match too much, but this explains why sed is behaving the way it does.
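If stray ampersands are a concern, one possible tightening is to match only text that is shaped like an entity: & followed by a name, a decimal number, or a hexadecimal number, and then ;. The following is a sketch under assumptions, not a complete solution: it relies on the -E (extended regex) flag, which GNU and BSD sed both accept; the helper name strip_entities is made up here; and the pattern only checks the shape of an entity, not whether the name actually exists.

```shell
# Hypothetical helper: remove only well-formed-looking entities from stdin.
# '/' is used as the delimiter because '#' now appears inside the pattern.
strip_entities() {
  sed -E 's/&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);//g'
}

# Invented sample input with a stray '&' followed later by a ';'.
printf '%s\n' 'Tom & Jerry; Text&#58;3 &aring; end' | strip_entities
```

Here the stray & in "Tom & Jerry;" is left alone because it is followed by a space, whereas &[^;]*; would have removed everything from that & up to the ; after "Jerry". Input such as AT&T; would still be removed, since it has the same shape as a named entity.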
sed 's#&[^;]*;##g' works flawlessly. – Peter – 2013-01-12T14:31:51.817
@Peter: Good to hear! Note though, as I said: if you have a stray single & in a line, the pattern might clear too much. If the input is well behaved, it won't be a problem. If not, more rigor is needed in the pattern, and sed's limits would quickly make themselves known and other tools would be preferred. – Daniel Andersson – 2013-01-12T15:37:24.093