How to avoid Sed changing the file format?

The problem is that sed's regexp engine sees neither your input file nor your […] bracket expression as a sequence of Unicode characters; instead it sees each character as several independent bytes. For example, it sees • as the three bytes \xe2 \x80 \xa2 and tries to match each of them individually against [ \xe2 \x80 \x98 \xe2 \x80 \x99 \x22 \xe2 \x80 ... ].

So in the example you've shown in your post, the regex matches and deletes only the last byte of each punctuation character and leaves the other two in place. That's what gives you an invalid (non-UTF-8) output file.
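To see this at the byte level, here is a minimal sketch, assuming GNU sed and a POSIX `od`. The variable names `bullet`, `bullet_bytes` and `out_hex` are only for illustration. The bullet is built from octal escapes so the result does not depend on the terminal encoding, and `LC_ALL=C` forces the byte-wise behaviour described above.

```shell
# Build • (U+2022) from its UTF-8 bytes e2 80 a2, written as octal escapes.
bullet=$(printf '\342\200\242')

# Show the three bytes that sed sees in a non-UTF-8 locale.
bullet_bytes=$(printf '%s' "$bullet" | od -An -tx1 | tr -d ' \n')
echo "$bullet_bytes"    # e280a2

# In the C locale, [•] is a set of three single bytes, so only the last
# byte (a2) in front of "_" is matched and replaced.
out_hex=$(printf '%s_test\n' "$bullet" | LC_ALL=C sed "s/[$bullet]_/X_/" | od -An -tx1 | tr -d ' \n')
echo "$out_hex"         # e280585f746573740a: e2 80 are left behind, followed by "X_test\n"
```

The leftover e2 80 pair is not a complete UTF-8 sequence, which is why the output stops being valid UTF-8.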

With GNU sed (tested on 4.5), this can be avoided by making sure that the system locale (the $LANG or at least the $LC_CTYPE environment variable) is set to a UTF-8 locale. For example:

$ export LANG='C'
$ echo '‘test’ “test”' | sed 's/[“”•]/X/g'
XX�testXX� XXXtestXXX
$ echo '•_test' | sed 's/[•‡]_/X_/'
��X_test

$ export LANG='en_US.UTF-8'
$ echo '‘test’ “test”' | sed 's/[“”•]/X/g'
‘test’ XtestX
$ echo '•_test' | sed 's/[•‡]_/X_/'
X_test

(The locale language does not matter. Any UTF-8 locale will work.)
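If you would rather not change the locale for the whole shell, it can be set for a single sed invocation. The following is a sketch under the assumption that `locale -a` is available and lists at least one UTF-8 locale; the variable names are hypothetical. Note that LC_ALL, when set, takes precedence over both LANG and LC_CTYPE, so it is worth checking that it is not pinned to C.

```shell
# Pick the first UTF-8 locale installed on this machine.
# (The exact name varies: C.utf8, en_US.utf8, en_US.UTF-8, ...)
utf8_locale=$(locale -a 2>/dev/null | grep -i -E 'utf-?8' | head -n 1)

if [ -n "$utf8_locale" ]; then
    # LC_ALL overrides LANG and LC_CTYPE, but here only for this one sed process.
    result=$(echo '•_test' | LC_ALL="$utf8_locale" sed 's/[•‡]_/X_/')
    echo "using $utf8_locale: $result"
else
    echo "no UTF-8 locale found; consider avoiding [...] in the pattern" >&2
fi
```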

If this does not work for you, avoid […] completely and use \(…\|…\|…\) (or (…|…|…) in sed -r), which is a multi-character alternation and will work regardless of how those characters end up being interpreted.

$ export LANG='C'
$ echo '‘test’ “test”' | sed 's/\(“\|”\|•\)/X/g'
‘test’ XtestX
$ echo '•_test' | sed 's/\(•\|‡\)_/X_/'
X_test
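The same alternation in extended syntax might look like the sketch below. It uses -E, which GNU sed accepts as a synonym for -r and which BSD sed also understands; `result` is just an illustrative variable name, and LC_ALL=C is set on purpose to show that the locale no longer matters.

```shell
# Each alternative is matched as a whole byte sequence, so a partial
# match on a single byte of ‘ or ’ cannot happen.
result=$(echo '‘test’ “test”' | LC_ALL=C sed -E 's/(“|”|•)/X/g')
echo "$result"
```

As in the basic-syntax version, the single quotes ‘ and ’ pass through untouched because neither is listed as an alternative.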

user1686

Posted 2018-08-10T09:01:10.847

Reputation: 283 655

I don't think sed supports Unicode... what is it you're trying to do? (please include the full content of process.sed) – Attie – 2018-08-10T09:07:01.737

What are your LC_ALL / LANG / LANGUAGE environment variables set to? – Attie – 2018-08-10T09:09:38.543

@Attie I'm trying to remove all the Chinese punctuation symbols with tags. – Luca – 2018-08-10T09:37:17.777

@Attie like "This is a 、/DunHao punc" --> "This is a punc". But only remove punctuation marks that have tags – Luca – 2018-08-10T09:39:13.120
