The problem is that sed's regexp engine sees neither your input file nor your […] match as a list of Unicode characters; instead it sees each character as multiple independent bytes. For example, it sees • as three bytes \xe2 \x80 \xa2 and tries to match each of them individually against [ \xe2 \x80 \x98 \xe2 \x80 \x99 \x22 \xe2 \x80 ... ].
So in the example you've shown in your post, the regex only matches and deletes the last byte of each punctuation character and leaves the other two in place. That's what gives you an invalid (non-UTF-8) output file.
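To see the bytes sed is actually working with, you can dump a character with od. This is only a small sketch: hex_bytes is a made-up helper name, and it assumes your terminal and script file are themselves UTF-8 encoded.

```shell
# hex_bytes is a hypothetical helper: print the bytes of a string as hex pairs.
hex_bytes() {
    # od prints one hex pair per byte; tr and sed only tidy up the whitespace.
    printf '%s' "$1" | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

hex_bytes '•'   # the bullet from the example above
echo
hex_bytes '‡'   # shares its first two bytes with the bullet
echo
```

Both characters start with the same two bytes, which is why a byte-wise bracket expression can only tell them apart by the last byte.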
With GNU sed (tested on 4.5), this can be avoided by making sure that the system locale ($LANG, or at least $LC_CTYPE) is set to a UTF-8-compatible locale. Note that $LC_ALL, if set, overrides both. For example:
$ export LANG='C'
$ echo '‘test’ “test”' | sed 's/[“”•]/X/g'
XX�testXX� XXXtestXXX
$ echo '•_test' | sed 's/[•‡]_/X_/'
��X_test
$ export LANG='en_US.UTF-8'
$ echo '‘test’ “test”' | sed 's/[“”•]/X/g'
‘test’ XtestX
$ echo '•_test' | sed 's/[•‡]_/X_/'
X_test
(The locale language does not matter. Any UTF-8 locale will work.)
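If you would rather not change the locale of your whole shell, you can set it for a single sed invocation. The sketch below assumes GNU sed and a system where `locale -a` lists the installed locales; the variable names utf8_locale and result are only for illustration, and the locale name it finds will differ between systems.

```shell
# Assumption: `locale -a` exists and lists installed locales (names vary by system).
# Pick the first installed UTF-8 locale, e.g. C.UTF-8 or en_US.utf8.
utf8_locale=$(locale -a 2>/dev/null | grep -i 'utf-\{0,1\}8' | head -n 1)

if [ -n "$utf8_locale" ]; then
    # LC_ALL is set for this one command only; the shell's own locale is untouched.
    result=$(printf '%s\n' '•_test' | LC_ALL="$utf8_locale" sed 's/[•‡]_/X_/')
    printf '%s\n' "$result"
else
    echo 'no UTF-8 locale installed' >&2
fi
```

Setting LC_ALL rather than LANG here is deliberate, since LC_ALL takes precedence over any LC_CTYPE or LANG already present in the environment.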
If this does not work for you, avoid […] completely and use \(…\|…\|…\) (or (…|…|…) in sed -r), which is a multi-character alternation and will work regardless of how those characters end up being interpreted.
$ export LANG='C'
$ echo '‘test’ “test”' | sed 's/\(“\|”\|•\)/X/g'
‘test’ XtestX
$ echo '•_test' | sed 's/\(•\|‡\)_/X_/'
X_test
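For comparison, here is the same pair of substitutions in the extended syntax. This is a sketch assuming your sed accepts -E (GNU sed treats it like -r); LC_ALL=C is set on purpose to force byte-wise matching, and out1 and out2 are just illustrative variable names.

```shell
# Assumption: a sed that accepts -E for extended regular expressions.
# LC_ALL=C forces byte-wise matching, i.e. the case where [...] breaks.
out1=$(printf '%s\n' '‘test’ “test”' | LC_ALL=C sed -E 's/(“|”|•)/X/g')
out2=$(printf '%s\n' '•_test' | LC_ALL=C sed -E 's/(•|‡)_/X_/')

# Each alternative is matched as a whole byte sequence, so the output stays valid UTF-8.
printf '%s\n' "$out1" "$out2"
```

Because every alternative is a complete three-byte sequence, the single quotes in the first input are left alone even though they share two bytes with the characters being replaced.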
I don't think sed supports Unicode... what is it you're trying to do? (please include the full content of process.sed) – Attie – 2018-08-10T09:07:01.737
What are your LC_ALL / LANG / LANGUAGE environment variables set to? – Attie – 2018-08-10T09:09:38.543
@Attie I'm trying to remove all the Chinese punctuation symbols with tags. – Luca – 2018-08-10T09:37:17.777
@Attie like "This is a 、/DunHao punc" --> "This is a punc". But only remove punctuations with tags – Luca – 2018-08-10T09:39:13.120