sed regex remove special characters

Your regexp

sed 's#&*;##g' <file>

does not do what you think it does. The * character is a multiplier that says that the preceding character is repeated 0 or more times. The preceding character here is &, so the pattern matches e.g. &&&; and also a bare ; (with & written 0 times before the ;). The bare ; is what is matching in your test cases, and that is not what you want.
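To see this for yourself, here is a small sketch. The sample line is made up for illustration and is not from your file; it contains a bare ;, a run of &&&; and one real entity.

```shell
# Hypothetical sample line: a bare ';', a run '&&&;', and a real entity.
sample='a;b &&&;c Text&#58;3'

# '&*;' means "zero or more '&' followed by ';'", so every ';' matches on
# its own, while the '&#58' part of the entity is left in place.
star_result=$(printf '%s\n' "$sample" | sed 's#&*;##g')
printf '%s\n' "$star_result"
# prints: ab c Text&#583
```

Note that only the ; of the entity is removed, which is probably what you are seeing in your own output.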

You need to specify "any character" before the multiplier, which is represented by a single dot (.).

$ echo 'Text&#58;3' | sed 's#&.*;##g'
Text3

That was the first problem. The second is the concept of so-called "greedy" matching: sed starts at the first & and then .* matches the longest string it can, which runs up to the last ; on the line. If you have several HTML entities on a single line, this is a problem:

$ echo 'Text&#58;3 and some more text &aring; and end' | sed 's#&.*;##g'
Text and end

If you want to see a fix in the sed context, you could look for the ending character of the entity by matching any number of "not ;" before a closing ; by doing:

$ echo 'Text&#58;3 and some more text &aring; and end' | sed 's#&[^;]*;##g'
Text3 and some more text  and end

You will still have problems with legitimate uses of the ampersand sign (&) in the text (well, &amp; is the real "legitimate" use, but the real world is not always as parsable as the ideal one) and matching too much, but this explains why sed is behaving the way it does.
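If stray ampersands are a concern, one possible tightening is to only accept things that look like an entity between the & and the ;. The following is a minimal sketch, not a complete HTML parser. It assumes an entity is either a name (letters and digits), # plus decimal digits, or #x plus hex digits, and that your sed supports -E for extended regular expressions (GNU and BSD sed do). The function name strip_entities and the sample line are made up for illustration.

```shell
# Assumption: an entity is '&' followed by a name, '#' + decimal digits,
# or '#x' + hex digits, and then ';'. Anything else is left untouched.
strip_entities() {
    sed -E 's/&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);//g'
}

# Hypothetical sample with a stray '&' followed later by a ';'.
sample='Tom & Jerry; Text&#58;3 &aring; end'

# The looser pattern removes everything from the stray '&' to the next ';'.
loose_result=$(printf '%s\n' "$sample" | sed 's#&[^;]*;##g')
printf '%s\n' "$loose_result"
# prints: Tom  Text3  end

# The stricter pattern leaves the stray '&' alone.
strict_result=$(printf '%s\n' "$sample" | strip_entities)
printf '%s\n' "$strict_result"
# prints: Tom & Jerry; Text3  end
```

This still only deletes entities rather than decoding them, and it treats any name-shaped text between & and ; as an entity, so for messy input a real HTML-aware tool would be preferable.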

Daniel Andersson

Posted 2013-01-11T12:25:45.033

Reputation: 20 465

sed 's#&[^;]*;##g' works flawlessly. – Peter – 2013-01-12T14:31:51.817

@Peter: Good to hear! Note though, as I said: if you have a stray single & in a line, the pattern might clear too much. If the input is well behaved, it won't be a problem. If not, more rigor is needed in the pattern, sed's limits would quickly make themselves known, and other tools would be preferred. – Daniel Andersson – 2013-01-12T15:37:24.093
