How to stop sed from changing the file encoding?

I was trying to use a sed script to preprocess a file, but the output from sed seems to have a different encoding. How can I avoid this?

file A.txt
A.txt UTF-8 Unicode English text, with very long lines

sed -f process.sed < A.txt > B.txt

head -2 process.sed
#!/bin/sed -f
s/[‘’"“”•·・、。《》™®\.★☆]\\[a-z\-]\+ //g

file B.txt
Non-ISO extended-ASCII English text, with very long lines, with LF, NEL line terminators

Because B.txt is no longer valid UTF-8, I cannot do the subsequent processing.

vim B.txt
è·¯æ<98><93>æ<96>¯ Âç½<97>å¾·é<87><8c>æ ¼æ<96>¯ //è·¯æ<98><93>æ<96>¯Â·ç½<97>å¾·é<87><8c>æ ¼æ<96>¯ ]

Luca

Posted 2018-08-10T09:01:10.847

Reputation: 1

I don't think sed supports Unicode... what is it you're trying to do? (please include the full content of process.sed) – Attie – 2018-08-10T09:07:01.737

What are your LC_ALL / LANG / LANGUAGE environment variables set to? – Attie – 2018-08-10T09:09:38.543

@Attie I'm trying to remove all the Chinese punctuation symbols with tags. – Luca – 2018-08-10T09:37:17.777

@Attie like "This is a 、/DunHao punc" --> "This is a punc". But only remove punctuation marks that have tags – Luca – 2018-08-10T09:39:13.120

Answers

The problem is that sed's regexp engine sees neither your input file nor your […] bracket expression as a sequence of Unicode characters; instead it sees each character as multiple independent bytes. For example, it sees • as three bytes \xe2 \x80 \xa2 and tries to match each of them individually against [ \xe2 \x80 \x98 \xe2 \x80 \x99 \x22 \xe2 \x80 ... ].

So in the example you've shown in your post, the regex matches and deletes only the last byte of each punctuation character and leaves the other two in place. That's what gives you an invalid (non-UTF-8) output file.
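
To see the bytes that sed is actually working with, you can dump a character with od. This is only a small sketch; show_bytes is a hypothetical helper name, not something sed provides:

```shell
# show_bytes (hypothetical helper): print the bytes of its argument as hex.
# awk '{ $1 = $1; print }' only normalizes the spacing of od's output.
show_bytes() {
    printf '%s' "$1" | od -An -tx1 | awk '{ $1 = $1; print }'
}

show_bytes '•'   # e2 80 a2
show_bytes '‘'   # e2 80 98
```

Both characters share the first two bytes, which is why a byte-wise bracket expression ends up matching pieces of characters it was never meant to touch.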

With GNU sed (tested on 4.5), this can be avoided by making sure that the system locale (the $LANG or at least $LC_CTYPE environment variable) is set to a UTF-8 locale. Note that $LC_ALL, if set, overrides both. For example:

$ export LANG='C'
$ echo '‘test’ “test”' | sed 's/[“”•]/X/g'
XX�testXX� XXXtestXXX
$ echo '•_test' | sed 's/[•‡]_/X_/'
��X_test

$ export LANG='en_US.UTF-8'
$ echo '‘test’ “test”' | sed 's/[“”•]/X/g'
‘test’ XtestX
$ echo '•_test' | sed 's/[•‡]_/X_/'
X_test

(The locale language does not matter. Any UTF-8 locale will work.)
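
After rerunning the script, it may be worth checking that the output is still valid UTF-8. A minimal sketch, assuming iconv is installed; is_valid_utf8 is a hypothetical helper name, and en_US.UTF-8 is only an example locale (pick one that `locale -a` lists on your system):

```shell
# is_valid_utf8 (hypothetical helper): succeed only if the file decodes as UTF-8.
# iconv exits with a non-zero status when it hits an invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Usage with the file names from the question. Setting LC_ALL on the command
# line affects only that one sed invocation:
#   LC_ALL=en_US.UTF-8 sed -f process.sed < A.txt > B.txt
#   is_valid_utf8 B.txt && echo 'B.txt is still valid UTF-8'
```

If the check fails even with the locale set, the locale may not be installed, in which case the alternation approach is probably the safer route.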

If this does not work for you, avoid […] completely and use \(…\|…\|…\) (or (…|…|…) in sed -r or sed -E). This is an alternation between whole multi-byte sequences, so it works regardless of how those characters end up being interpreted.

$ export LANG='C'
$ echo '‘test’ “test”' | sed 's/\(“\|”\|•\)/X/g'
‘test’ XtestX
$ echo '•_test' | sed 's/\(•\|‡\)_/X_/'
X_test
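
Applied to the task described in the comments (dropping a punctuation mark together with its tag), the alternation form might look like the sketch below. It assumes tags look like /DunHao followed by a space, as in the comment's example, and it lists only a few of the punctuation marks. The tag pattern and the helper name strip_tagged_punct are assumptions, not the asker's actual script:

```shell
# strip_tagged_punct (hypothetical helper): delete "<punctuation>/<Tag> " sequences.
# Each alternative is matched as a whole byte sequence, so this works even
# under LC_ALL=C. Extend the list of alternatives as needed.
strip_tagged_punct() {
    LC_ALL=C sed -E 's#(‘|’|“|”|•|·|、|。)/[A-Za-z-]+ ##g'
}

printf '%s\n' 'This is a 、/DunHao punc' | strip_tagged_punct
# This is a punc
```

Using # as the s command delimiter avoids having to escape the / in the tag.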

user1686

Posted 2018-08-10T09:01:10.847

Reputation: 283 655