How can I delete U+200B (zero-width space) using sed?

15

5

I have a very large file that has zero-width spaces scattered throughout. It takes too long to open and edit using vi, so I'd like to delete all instances of the character using sed. The problem is, I can't figure out how to match the character! I've tried \u200B and \x{200b}, but neither matched. Any ideas?

I'm running CentOS 5 if that helps at all.

thetaiko

Posted 2010-11-04T20:33:11.510

Reputation: 265

Does your copy of sed support the Unicode encoding that the file is encoded with? If not, there is probably no good way to do it properly with sed, and you'd be better off using a Python script or something like that... – JanC – 2010-11-04T21:38:37.617

@JanC - indeed, I've gone with Python. The file is encoded as UTF-8, which seems standard enough that anything should be able to process it. I've added my Python script below, in case it's useful to anyone. – thetaiko – 2010-11-04T21:47:50.807

Answers

11

This seems to work for me:

sed 's/\xe2\x80\x8b//g' inputfile

Demonstration:

$ /usr/bin/printf 'X\u200bY\u200bZ' | hexdump -C
00000000  58 e2 80 8b 59 e2 80 8b  5a                       |X...Y...Z|
$ /usr/bin/printf 'X\u200bY\u200bZ' | sed 's/\xe2\x80\x8b//g' | hexdump -C
00000000  58 59 5a                                          |XYZ|
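
The three bytes in that pattern are simply the UTF-8 encoding of U+200B. A minimal Python 3 sketch (the variable names are only illustrative) that checks this and mimics the byte-level substitution on the same sample:

```python
# U+200B ZERO WIDTH SPACE and its UTF-8 encoding
zwsp = "\u200b"
encoded = zwsp.encode("utf-8")
print(encoded)  # b'\xe2\x80\x8b'

# Same sample string as in the printf demonstration
sample = "X\u200bY\u200bZ".encode("utf-8")

# Remove every occurrence of the exact three-byte sequence
cleaned = sample.replace(encoded, b"")
print(cleaned)  # b'XYZ'
```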

Edit:

Based partially on Gilles' answer:

tr -d $(/usr/bin/printf "\u200b") < inputfile
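
One caveat, assuming a byte-oriented tr (GNU coreutils tr is, as far as I know, not multibyte-aware): tr -d treats its argument as a set of single bytes, so it would delete every 0xe2, 0x80 and 0x8b byte rather than only the three-byte sequence. The Python 3 sketch below imitates both behaviours on a sample containing a euro sign (U+20AC, bytes e2 82 ac). It only illustrates the difference and is not a trace of what any particular tr does:

```python
ZWSP = "\u200b".encode("utf-8")             # b'\xe2\x80\x8b'
sample = "5\u20ac\u200bX".encode("utf-8")   # "5", euro sign, zero-width space, "X"

# Byte-set deletion (what a byte-oriented `tr -d` would do):
# every 0xe2, 0x80 or 0x8b byte is dropped, wherever it appears
per_byte = sample.translate(None, ZWSP)

# Sequence deletion (what the sed substitution does):
# only the exact three-byte sequence is dropped
per_sequence = sample.replace(ZWSP, b"")

print(per_byte)      # b'5\x82\xacX'      -> euro sign has lost its lead byte
print(per_sequence)  # b'5\xe2\x82\xacX'  -> euro sign intact
```

If the file contains other non-ASCII characters, the sed form is therefore the safer of the two on such a system.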

Paused until further notice.

Posted 2010-11-04T20:33:11.510

Reputation: 86 075

Perfect - this is exactly what I was looking for. In fact, I noticed that same set of characters (\xe2\x80\x8b) when looking at some sample strings in Python. Thank you! – thetaiko – 2010-11-04T23:07:41.127

4

GNU sed's behavior with UTF-8 doesn't seem to be very well-defined. Experimentally, you can make it replace the bytes of the UTF-8 representation:

<old sed 's/\xe2\x80\x8b//g' >new

Alternatively, you can type the character into your shell and use any of the standard commands in a UTF-8 locale:

<old tr -d '​' >new
<old sed 's/​//g' >new

In zsh, you can also enter the character through an escape sequence:

<old tr -d $'\u200B' >new

Gilles 'SO- stop being evil'

Posted 2010-11-04T20:33:11.510

Reputation: 58 319

As of Bash 4.2, Unicode escape sequences are supported by echo -e, printf format strings and ANSI-C quoted strings (e.g. echo -e '\u1E4F', printf '\u01DD %s\n' 'X', mkdir $'\u0250'). – Paused until further notice. – 2018-10-02T17:36:57.460

0

Well, unless anyone has any ideas for how to get sed to do this (which I'm still interested in, by the way), it's Python to the rescue...

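import sys, re
pattern = re.compile(u"\u200b")
# sys.stdin is already an open file object, so iterate over it directly
# (Python 2: each line is a byte string and must be decoded first)
for line in sys.stdin:
    a = pattern.sub(u"", line.decode("utf8"))
    # trailing comma: the line already ends with its own newline
    print a.encode("utf8"),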
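
The script above is written for Python 2. For anyone on Python 3, a roughly equivalent filter might look like the sketch below. The function name strip_zwsp is only illustrative, and the sketch assumes the input is valid UTF-8:

```python
import io

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

def strip_zwsp(src, dst):
    """Copy the binary stream src to dst, dropping every U+200B."""
    # Splitting on b"\n" is safe: no UTF-8 multi-byte sequence contains 0x0a
    for line in src:
        text = line.decode("utf-8").replace(ZWSP, "")
        dst.write(text.encode("utf-8"))

# Small in-memory demonstration
src = io.BytesIO("X\u200bY\nZ\u200b\n".encode("utf-8"))
dst = io.BytesIO()
strip_zwsp(src, dst)
print(dst.getvalue())  # b'XY\nZ\n'
```

To use it as a command-line filter, call strip_zwsp(sys.stdin.buffer, sys.stdout.buffer) after importing sys, and redirect the file in and out as with the sed commands.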

thetaiko

Posted 2010-11-04T20:33:11.510

Reputation: 265

+1 to Gilles's suggestion, which also works on Mac OS X. perl -C -pi.bak -e 's/\x{200B}//g' yourfile leaves yourfile fixed and a backup in yourfile.bak – MarkHu – 2014-11-08T01:37:03.260

If you're going to reach for the big guns, how about the much simpler perl -C -pe 's/\x{200B}//g'? – Gilles 'SO- stop being evil' – 2010-11-04T22:53:33.790