sed regex remove special characters

0

I have a file with several strings that come from HTML-formatted text, so they contain some HTML sequences that don't look good in a console interface. Here's an example:

Text1&#8482;
&#91;Text&#174;2&#93;
Text&#58;3

What I'm trying to do is remove everything from & through ; so the text is readable again, like the following:

Text1
Text2
Text3

I'm currently trying to use sed to remove the extra characters:

sed 's#&*;##g' <file>

The problem is that it only removes the ; from the text strings.

The question then is: how should the regular expression be written in order to remove the extra sequence, i.e. &#[0-9]+;

Peter

Posted 2013-01-11T12:25:45.033

Reputation: 1 157

Answers

1

Your regexp

sed 's#&*;##g' <file>

does not do what you think it does. The * character is a quantifier that says the preceding character may be repeated zero or more times. The preceding character here is &, so the pattern matches e.g. &&&; and also a lone ; (with & repeated zero times before the ;, which is what matches in your test cases), but it does not match what you want in this case.
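A quick way to check this yourself is the small sketch below. The function name strip_amp_star is made up for illustration; it only wraps the command from the question.

```shell
# Hypothetical helper wrapping the pattern from the question.
strip_amp_star() { sed 's#&*;##g'; }

# '&*' matches the three ampersands, so '&&&;' is removed as a whole.
printf '%s\n' 'a&&&;b' | strip_amp_star      # prints: ab

# Here '&*' matches zero ampersands, so only the ';' is removed.
printf '%s\n' 'Text&#58;3' | strip_amp_star  # prints: Text&#583
```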

You need to specify "any character" before the quantifier, which is represented by a single dot (.):

$ echo 'Text&#58;3' | sed 's#&.*;##g'
Text3

That was the first problem. The second is the concept of so-called "greedy" matching: sed will start at the first & and then match the longest string it can, up to the last ; on the line. If you have several HTML entities on a single line, this is a problem, since:

$ echo 'Text&#58;3 and some more text &aring; and end' | sed 's#&.*;##g'
Text and end
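To make the greedy match visible rather than deleting it, one option is to wrap the matched text in brackets. This is only a sketch: show_greedy_match is a hypothetical name, and the unescaped & in the replacement part is sed's standard way of referring to "whatever the pattern matched".

```shell
# Hypothetical helper: wrap whatever '&.*;' matches in [ ] instead of deleting it.
# In the replacement, an unescaped '&' stands for the whole matched text.
show_greedy_match() { sed 's#&.*;#[&]#'; }

printf '%s\n' 'Text&#58;3 and some more text &aring; and end' | show_greedy_match
# prints: Text[&#58;3 and some more text &aring;] and end
```

Everything inside the brackets, including the ordinary text between the two entities, is one single match.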

If you want to see a fix in the sed context, you could look for the ending character of the entity by matching any number of "not ;" before a closing ; by doing:

$ echo 'Text&#58;3 and some more text &aring; and end' | sed 's#&[^;]*;##g'
Text3 and some more text  and end

You will still have problems with legitimate uses of the ampersand sign (&) in the text (well, &amp; is the real "legitimate" use, but the real world is not always as parsable as the ideal one) and matching too much, but this explains why sed is behaving the way it does.
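As a rough sketch of how the pattern could be tightened, you can require that the text between & and ; actually looks like an entity (decimal, hexadecimal or named). This assumes a sed that understands -E for extended regular expressions (GNU and BSD sed do); the function name strip_entities is invented for the example.

```shell
# Hypothetical helper: remove only things shaped like HTML entities:
#   &#123;  (decimal)   &#x1F;  (hexadecimal)   &name;  (named)
strip_entities() {
  sed -E 's/&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);//g'
}

# The stray '&' followed by a space is left alone; the two entities are removed.
printf '%s\n' 'Tom & Jerry; Text&#58;3 &aring; end' | strip_entities
# prints: Tom & Jerry; Text3  end
```

With the looser &[^;]*; pattern, the same line would lose "& Jerry;" as well. The stricter version can still misfire on text such as AT&T; so it reduces the problem rather than eliminating it.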

Daniel Andersson

Posted 2013-01-11T12:25:45.033

Reputation: 20 465

sed 's#&[^;]*;##g' works flawlessly. – Peter – 2013-01-12T14:31:51.817

@Peter: Good to hear! Note though as I said: if you have a stray single & in a line, the pattern might clear too much. If the input is well behaved, it won't be a problem. If not: more rigor is needed in the pattern, and quickly sed's limits would make themselves known and other tools would be preferred. – Daniel Andersson – 2013-01-12T15:37:24.093

0

Isn't it better to replace the codes with the actual characters?

echo 'Text1&#8482;
&#91;Text&#174;2&#93;
Text&#58;3' | perl -C -pe 's/&#([^;]*);/chr($1)/eg'

Output:

Text1™
[Text®2]
Text:3

choroba

Posted 2013-01-11T12:25:45.033

Reputation: 14 741