Can GNU sed (for Windows) handle Unicode? If so, is it a code-page/locale issue, or a switch?

10

I've been using GNU sed on and off for a couple of years now. It spins me out a bit sometimes, but it does a good job... for single-byte character sets!
Now and then I notice references to GNU sed being Unicode-aware, but the closest thing I've seen to that is its "binary" mode, and binary is not Unicode.
Can GNU sed process a Unicode text file at code-point resolution, including (and especially) Windows \r\n line endings? If it can, does it expect UTF-8, UTF-16, or something else, and how does sed detect the encoding?

Peter.O

Posted 2010-08-04T20:57:34.697

Reputation: 2 743

1

Usually Unicode is specified with the \uXXXX option. Try this Japanese guy's build: http://sky.geocities.jp/hp_gabo200x/room_tool.html

– Mikhail – 2012-11-01T02:03:28.517

Answers

1

I don't know a ton about sed, but after some hard Googling it seems to support a variety of encodings through the locale environment variables (LANG, LC_ALL, LC_CTYPE). As far as I can tell, when none of them is set the default is the POSIX "C" locale, which treats input as single bytes, so UTF-8 has to be requested explicitly (for example LANG=en_US.UTF-8). I don't know how the Windows port is set up, though. I do have a strong suspicion that sed performs no encoding detection at all on the input stream.

Sources: https://stackoverflow.com/questions/67410/why-does-sed-fail-with-international-characters-and-how-to-fix http://omgili.com/mailinglist/cygwin/cygwin/com/20100520123926GA1432onderneming10xs4allnl.html

You could also try escape characters, as mentioned here: http://forums.whirlpool.net.au/forum-replies-archive.cfm/841095.html That seems very cumbersome, though.
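If sed really does no detection, the encoding has to be worked out before the file reaches it. A minimal Python sketch of one common approach, sniffing the byte-order mark, follows. The helper name `sniff_encoding` and the UTF-8 fallback are assumptions made for illustration; this is not something sed itself does, and a file without a BOM cannot be identified this way.

```python
import codecs

# Byte-order marks, longest first: the UTF-32 LE mark (FF FE 00 00) begins
# with the UTF-16 LE mark (FF FE), so it has to be checked before it.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(raw: bytes, fallback: str = "utf-8") -> str:
    """Guess a codec name from a leading BOM (hypothetical helper).

    The returned codecs ("utf-16", "utf-32", "utf-8-sig") consume the BOM
    while decoding. Without a BOM the fallback is returned, which is only
    an assumption about the data.
    """
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return fallback
```

Used as `raw.decode(sniff_encoding(raw))`, this yields text in which `\r\n` is still present exactly as it was stored, because decoding bytes directly involves no newline translation.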

Vanessa Phipps

Posted 2010-08-04T20:57:34.697

Reputation: 302

@Mathew: Thanks. It's starting to seem that the issue may not lie with sed.exe itself, but with the way Windows handles code pages so poorly in its cmd.exe console. Maybe it works in PowerShell, but if I have to go there, I'd rather focus on Python instead. As far as I can see, Windows' own pride and joy, UTF-16 (code page 1200, http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx), is available only to managed applications, whatever that means, and it surely doesn't work in the console. Convert to UTF-8 and back? No way! Python, here I come (or 'Nix and Bash).

– Peter.O – 2010-08-07T02:11:03.943
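For readers weighing the Python route mentioned in the comment above, here is a minimal sketch of a sed-style substitution on UTF-16 data that works per code point and leaves \r\n alone. The function name `substitute_utf16` is hypothetical, and the sketch assumes the input carries a BOM, which is typical for UTF-16 files written by Windows tools but not guaranteed.

```python
import re


def substitute_utf16(raw: bytes, pattern: str, repl: str) -> bytes:
    """Apply a regex substitution to UTF-16 encoded bytes (illustrative).

    Decoding turns the bytes into a str of Unicode code points, so "."
    in the pattern matches one code point rather than one byte.
    """
    # The "utf-16" codec reads the BOM to pick the byte order. Without a
    # BOM it assumes the platform's native order.
    text = raw.decode("utf-16")
    text = re.sub(pattern, repl, text)
    # Re-encoding writes a BOM again, in the platform's native byte order.
    return text.encode("utf-16")


# Working on bytes means no newline translation happens, so CRLF survives.
sample = "naïve café\r\n".encode("utf-16")
result = substitute_utf16(sample, "é", "e")
```

When reading from and writing to real files in text mode instead, passing `newline=""` to `open()` keeps Python from translating `\r\n` on the way in or out.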

Sounds good to me. I'm kind of confused about how cmd.exe is causing the problem and how Python solves it (besides the well-known magic powers of Python (http://xkcd.com/353/)), but I'm not going to pretend to understand the vagaries of cmd.exe. Good luck to you!

– Vanessa Phipps – 2010-08-18T20:56:13.953

@fred: Read: .NET. – Hello71 – 2011-05-04T22:34:38.737

2

I've solved the problem: I dropped MS completely about 6 months ago and now use Ubuntu. It is as smooth as silk (and I'm not looking back).

– Peter.O – 2011-05-05T10:33:26.503