Bake me some moji

26

4

Given a string, character list, byte stream, sequence… which is both valid UTF-8 and valid Windows-1252 (most languages will probably want to take a normal UTF-8 string), convert it from (that is, pretend it is) Windows-1252 to UTF-8.

Walked-through example

The UTF-8 string
I ♥ UTF-8
is represented as the bytes
49 20 E2 99 A5 20 55 54 46 2D 38
Looking up these byte values in the Windows-1252 table gives us the Unicode equivalents
49 20 E2 2122 A5 20 55 54 46 2D 38
which render as
I â™¥ UTF-8
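The walk-through above can be reproduced in a few lines of Python. This is only an illustrative sketch; the helper name `bake` is made up here and does not come from the challenge.

```python
def bake(text: str) -> str:
    """Hypothetical helper: reinterpret the UTF-8 bytes of `text` as Windows-1252."""
    raw = text.encode("utf-8")    # the bytes the string is stored as
    return raw.decode("cp1252")   # pretend those bytes are Windows-1252


source = "I \u2665 UTF-8"  # "I ♥ UTF-8"
raw = source.encode("utf-8")

# Step 1: the UTF-8 byte values, in hex.
print(" ".join(f"{b:02X}" for b in raw))

# Step 2: the Unicode code points after the Windows-1252 lookup, in hex.
print(" ".join(f"{ord(c):X}" for c in bake(source)))

# Step 3: how those code points render.
print(bake(source))
```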

Examples

£ → Â£

Â£ → Ã‚Â£

Ã‚Â£ → Ãƒâ€šÃ‚Â£

I ♥ UTF-8 → I â™¥ UTF-8

árvíztűrő tükörfúrógép → Ã¡rvÃ­ztÅ±rÅ‘ tÃ¼kÃ¶rfÃºrÃ³gÃ©p
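Applying the conversion repeatedly makes the text grow with every pass. A minimal Python sketch, starting from `£`, illustrates this; the function names are hypothetical and chosen only for this example.

```python
def bake(text: str) -> str:
    # Hypothetical helper: encode as UTF-8, then decode those bytes as Windows-1252.
    return text.encode("utf-8").decode("cp1252")


def bake_repeatedly(text: str, times: int) -> list:
    """Return the result of each successive pass, first pass first."""
    results = []
    for _ in range(times):
        text = bake(text)
        results.append(text)
    return results


for step in bake_repeatedly("\u00a3", 3):  # "\u00a3" is the pound sign
    print(len(step), step)
```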

Adám

Posted 2018-06-21T10:35:26.733

Reputation: 37 779

What's the title about? – user202729 – 2018-06-21T10:37:56.540

9@user202729 See the "convert it" link. It's a pun. – Erik the Outgolfer – 2018-06-21T10:38:24.483

So in summary, is it possible to take a list/array of integers? – user202729 – 2018-06-21T11:46:07.223

@user202729 I'm pretty sure you can. It states "character list, byte stream, sequence...", which are integer lists/arrays by default in some languages. – Kevin Cruijssen – 2018-06-21T12:00:06.817

@KevinCruijssen Is there any difference? – Adám – 2018-06-21T12:06:08.727

5For convenience: The Windows 1252 character set is the same as Unicode, except in 0x80..0x9F, where the characters are € ‚ƒ„…†‡ˆ‰Š‹Œ Ž ‘’“”•–—˜™š›œ žŸ. (space = unused) – user202729 – 2018-06-21T12:48:20.737

Is a string valid as output rather than a list of bytes or code points? And if so, does it need to be encoded in UTF-8 internally? – Jakob – 2018-06-21T17:08:35.920

@Jakob A string out is what's being asked for. I don't care about internal encoding, only what comes out in the end. – Adám – 2018-06-21T19:40:01.967

@user202729 https://en.wikipedia.org/wiki/Mojibake – viraptor – 2018-06-21T22:19:35.230

@viraptor It is the first link in the OP. – Adám – 2018-06-21T22:28:44.497

3@user202729 Uh, I'm not sure what you were trying to say, but that isn't remotely close to being true. Unicode has millions of characters, Windows-1252 only 256. – David Conrad – 2018-06-21T23:48:13.660

@DavidConrad "in the 0x00-0xFF range". – user202729 – 2018-06-21T23:52:41.670

@Adám I missed it, because I expected "convert it" to point to some article about valid character conversion. It looks like the first commenter missed it as well. It's not super clear :( – viraptor – 2018-06-22T00:56:04.380

+1 for mentioning the árvíztűrő tükörfúrógép. – zovits – 2018-06-22T11:28:56.287

1@DavidConrad, "Unicode has millions of characters" is exaggerated. Unicode defines 1,114,112 codepoints. Out of that 136,690 codepoints are currently used. – Wernfried Domscheit – 2018-06-22T11:42:26.163

1@Wernfried the point is comparing that to a 256-character charset. – David Conrad – 2018-06-22T17:59:16.573

Answers

23

bash, 14 bytes

iconv -fCP1252

Try it online!

Doorknob

Posted 2018-06-21T10:35:26.733

Reputation: 68 138

Upvoted, but if I'm not mistaken, that assumes that the system encoding is UTF-8 – GiM – 2019-02-21T10:22:13.303

19

Java 8, 72 66 36 25 bytes

s->new String(s,"cp1252")

Try it online.

s->  // Method with a byte-array parameter (the UTF-8 input) and a String return type
  new String(s,"cp1252")
     //  Decode the input bytes as if they were Windows-1252
     //  and return the result as a String

cp1252 is an alias for Windows-1252. The alias cp1252 is the canonical name for the java.io and java.lang APIs, while the full name Windows-1252 is the canonical name for the java.nio API. See here for a full list of supported Java encodings; for code golf we always want the shorter of the two names.

Kevin Cruijssen

Posted 2018-06-21T10:35:26.733

Reputation: 67 575

13Java winning code golf‽ That can't be right. – Adám – 2018-06-21T12:38:06.023

1@Adám Hehe, I'm actually pleasantly surprised as well to see all these longer answers. ;) But I'm pretty sure Jelly, 05AB1E, etc. will beat me pretty soon. – Kevin Cruijssen – 2018-06-21T12:45:19.813

1I doubt that. They probably don't have built-in translate tables. Dyalog APL does though… – Adám – 2018-06-21T12:46:52.767

"Canonial Name for the java.nio API" :P – ASCII-only – 2018-06-22T17:39:34.547

8

R 3.5.0 or higher, 32 20 bytes

scan(,"",e="latin1")

Try it online!

Oddly short for a challenge in R... thanks to JayCe for golfing down 12 more bytes!

scan optionally takes an encoding argument to set the encoding of the input string. According to the documentation of Encoding, latin1 is handled as follows:

There is some ambiguity as to what is meant by a ‘Latin-1’ locale, since some OSes (notably Windows) make use of character positions used for control characters in the ISO 8859-1 character set. How such characters are interpreted is system-dependent but as from R 3.5.0 they are if possible interpreted as per Windows codepage 1252 (which Microsoft calls ‘Windows Latin 1 (ANSI)’) when converting to e.g. UTF-8.
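The distinction the quoted documentation draws between ISO 8859-1 and codepage 1252 can be seen with Python's own codecs. The sketch below says nothing about R's internals; it only compares Python's `latin-1` and `cp1252` decoders byte by byte, and the variable names are made up for the example.

```python
raw = bytes([0x99])

# ISO 8859-1 maps 0x99 to a C1 control character (U+0099).
print(repr(raw.decode("latin-1")))

# Windows-1252 maps the same byte to U+2122 TRADE MARK SIGN.
print(repr(raw.decode("cp1252")))

# Collect every byte value on which the two decoders disagree.
# Bytes that cp1252 leaves undefined decode to U+FFFD with errors="replace",
# so they count as differing too.
differing = [
    b for b in range(256)
    if bytes([b]).decode("latin-1") != bytes([b]).decode("cp1252", errors="replace")
]
print(f"0x{differing[0]:02X}..0x{differing[-1]:02X}, {len(differing)} bytes")
```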

Giuseppe

Posted 2018-06-21T10:35:26.733

Reputation: 21 077

3

I followed the link to the documentation of Encoding... and learned that scan also has an encoding argument O_O... 20 bytes – JayCe – 2018-06-22T18:02:11.683

@JayCe whoda thunk it! Very nice! – Giuseppe – 2018-06-22T18:22:12.910

6

Python 2, 40 38 bytes

-2 bytes thanks to Erik the Outgolfer.

lambda s:s.decode('1252').encode('u8')

Try it online!

u8 is an alias for utf-8.
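The short codec names can be checked against Python's codec registry. This sketch runs under Python 3 (the answer itself targets Python 2) and simply asks `codecs.lookup` which codec each alias resolves to; the dictionary name `canonical` is an illustration only.

```python
import codecs

# Map each alias to the name of the codec it resolves to.
aliases = ("u8", "utf8", "1252", "cp1252", "windows-1252")
canonical = {alias: codecs.lookup(alias).name for alias in aliases}

for alias, name in canonical.items():
    print(f"{alias} -> {name}")
```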

ovs

Posted 2018-06-21T10:35:26.733

Reputation: 21 408

Perhaps you could "cheat" a little with this: input().decode(...).encode(...) :) also I think you might be able to use some windows console encoding if in powershell (but I'm totally unsure about this). – KeyWeeUsr – 2018-06-21T15:32:25.807

See https://codegolf.stackexchange.com/a/167238/52609 – KeyWeeUsr – 2018-06-21T15:43:08.180

@KeyWeeUsr the problem with your suggestion is that it doesn't actually output anything, as opposed to the answer you linked. R does output the value of a bare expression while Python does not. – ovs – 2018-06-21T16:02:40.587

4

Python 3, 38 36 34 bytes

lambda s:s.encode().decode('1252')

Try it online!

Note: After I had a working function, I used ovs's Python 2 answer to learn about the header and footer fields on TIO, so the header and footer are the same.

edit: Trimmed it a little thanks to python3 defaulting to utf8 and a tip from ovs's submission :)
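An ungolfed version may make the assumption about the input easier to see. The sketch below is not part of the submission; `bake` and `is_valid_input` are hypothetical names. It relies on Python's `cp1252` codec raising `UnicodeDecodeError` for the byte values that Windows-1252 leaves unassigned, which is why the challenge restricts input to strings that are valid in both encodings.

```python
def bake(text: str) -> str:
    # Same steps as the lambda: UTF-8 is the default for str.encode() in Python 3.
    return text.encode().decode("cp1252")


def is_valid_input(text: str) -> bool:
    """True if the UTF-8 bytes of `text` are all assigned in Windows-1252."""
    try:
        bake(text)
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_input("\u00e1"))  # á encodes as C3 A1, both assigned in Windows-1252
print(is_valid_input("\u00c1"))  # Á encodes as C3 81, and 0x81 is unassigned
```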

GammaGames

Posted 2018-06-21T10:35:26.733

Reputation: 995

3

JavaScript, 64 bytes

x=>new TextDecoder('cp1252').decode(new TextEncoder().encode(x))

f =
x=>new TextDecoder('cp1252').decode(new TextEncoder().encode(x))
<p><input id=i oninput="o.value=f(i.value)" style="width:100%" /></p>
<p><output id=o></output></p>

Even longer than the Java answer. So sad. :(

tsh

Posted 2018-06-21T10:35:26.733

Reputation: 13 072

3

Ruby, 31 bytes

->s{s.encode("UTF-8","CP1252")}

Try it online!

Test cases are included in the TIO link.

crashoz

Posted 2018-06-21T10:35:26.733

Reputation: 611

3

C#, 81 bytes

using e=System.Text.Encoding;s=>e.GetEncoding(1252).GetString(e.UTF8.GetBytes(s))

Try it online!

Thanks to Schmalls for 3 bytes

Mego

Posted 2018-06-21T10:35:26.733

Reputation: 32 998

Can it be using e=System.Text.Encoding;s=>e.GetEncoding(1252).GetString(e.UTF8.GetBytes(s)) to get it down to 81? – Schmalls – 2018-06-22T19:46:26.417

@Schmalls Looks like yes, thanks! – Mego – 2018-06-22T22:31:22.123

2

180 bytes, machine code (16-bit x86)

I noticed most answers use builtin encode/decode (which I believe is perfectly fine), but I thought I'll continue my 16-bit quest.

As with previous ones, this was done without compiler using mostly HT hexeditor and ICY's hexplorer.

00000000: eb40 ac20 0000 1a20 9201 1e20 2620 2020  .@. ... ... &                     
00000010: 2120 c602 3020 6001 3920 5201 0000 7d01  ! ..0 `.9 R...}.                  
00000020: 0000 0000 1820 1920 1c20 1d20 2220 1320  ..... . . . " .                   
00000030: 1420 dc02 2221 6101 3a20 5301 0000 7e01  . .."!a.: S...~.                  
00000040: 7801 89f7 4646 89fa 89d9 4143 4bb4 3fcd  x...FF....ACK.?.                  
00000050: 2185 c074 288a 053c 8073 05e8 1700 ebec  !..t(..<.s......                  
00000060: 3ca0 721a d440 0d80 c050 86c4 e806 0058  <.r..@...P.....X                  
00000070: e802 00eb d7b4 4088 05b3 01cd 21c3 2c80  ......@.....!.,.                  
00000080: d0e0 89c3 8b00 89cb 85c0 74c0 3dff 0773  ..........t.=..s                  
00000090: 08c1 c002 c0e8 02eb cd50 c1e8 0c0c e0e8  .........P......                  
000000a0: d3ff 5825 ff0f c1c0 02c0 e802 0d80 8050  ..X%...........P                  
000000b0: 86c4 ebb8                                ....                              

bake.com < input.txt > out.dat

Dissection

Implementation is pretty straight-forward, although I haven't given much thought to flow upfront so there is SOME spaghetti there.

I'll mix order a bit, to make it easier to follow...

0000 eb40               jmp         0x42

Skip over the table that maps chars >= 0x80 and < 0xa0 to Unicode code points.

data db ACh,20h, 00h,00h, 1Ah,20h, ...

Invalid ones are encoded as 0; they are not mapped to anything.
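As a cross-check, the table can be read straight out of the hex dump: the 64 bytes at file offsets 0x02..0x41 are 32 little-endian 16-bit words, one per byte value from 0x80 to 0x9F. The Python sketch below (names such as `TABLE_HEX` are made up for this example) decodes them and prints each entry.

```python
import struct

# Bytes 0x02..0x41 of the hex dump above, copied verbatim.
TABLE_HEX = (
    "ac20 0000 1a20 9201 1e20 2620 2020 2120"
    " c602 3020 6001 3920 5201 0000 7d01 0000"
    " 0000 1820 1920 1c20 1d20 2220 1320 1420"
    " dc02 2221 6101 3a20 5301 0000 7e01 7801"
)

# 32 unsigned 16-bit little-endian words, indexed by (byte value - 0x80).
table = struct.unpack("<32H", bytes.fromhex(TABLE_HEX))

for offset, code_point in enumerate(table):
    byte_value = 0x80 + offset
    if code_point == 0:
        print(f"{byte_value:02X}: unused")
    else:
        print(f"{byte_value:02X}: U+{code_point:04X} {chr(code_point)}")
```

Comparing the non-zero entries with Python's `cp1252` codec is a quick way to confirm that the table was copied correctly.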

0075 b440               mov         ah, 0x40   
0077 8805               mov         [di], al   
0079 b301               mov         bl, 0x1    
007b cd21               int         0x21       
007d c3                 ret                    

Helper function that prints the char in al; it is called a few times.

0042 89f7               mov         di, si     
0044 46                 inc         si         
0045 46                 inc         si         
0046 89fa               mov         dx, di     
0048 89d9               mov         cx, bx     
004a 41                 inc         cx         
004b 43                 inc         bx         

Prepare registers. Data will be read into 0x100, let si point into translation table above.

004c 4b                 dec         bx         
004d b43f               mov         ah, 0x3f   
004f cd21               int         0x21       
0051 85c0               test        ax, ax     
0053 7428               jz          0x7d       

Read char from stdin, jump to 0x7d if EOF.

Sidenote: This is actually a small (but pretty well-known) trick. 0x7d contains ret, which pops the return address off the stack. At program start sp points to the end of the segment, where there is 00 00, so execution continues at cs:0, which in DOS contains CD 20 and causes the application to exit.

0055 8a05               mov         al, [di]   
0057 3c80               cmp         al, 0x80   
0059 7305               jnc         0x60       
005b e81700             call        0x75       
005e ebec               jmp         0x4c       

If the char is < 0x80, just print it and go back to the beginning of the loop. Because the helper function sets BX to 1 (stdout), the jumps land on dec bx, which sets BX back to 0 (stdin).

0060 3ca0               cmp         al, 0xa0   
0062 721a               jc          0x7e       
0064 d440               aam         0x40       
0066 0d80c0             or          ax, c080   
0069 50                 push        ax         
006a 86c4               xchg        ah, al     
006c e80600             call        0x75       
006f 58                 pop         ax         
0070 e80200             call        0x75       
0073 ebd7               jmp         0x4c       

This part deals with chars >= 0xa0: it splits the char code into the "high" two bits and the "low" 6 bits, applies the UTF-8 mask c080 for a two-byte sequence, then prints both bytes.
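The same split can be written out in Python. This is a sketch of the two-byte UTF-8 arithmetic only, not a translation of the machine code; `encode_two_bytes` is a hypothetical name, and `divmod` by 0x40 plays the role of `aam 0x40` (quotient in one half, remainder in the other).

```python
def encode_two_bytes(code_point: int) -> bytes:
    """Encode a code point in 0x80..0x7FF as a two-byte UTF-8 sequence."""
    high, low = divmod(code_point, 0x40)     # upper bits, lower 6 bits
    return bytes([0xC0 | high, 0x80 | low])  # apply the C0 / 80 masks


print(encode_two_bytes(0xA3).hex())  # the pound sign, U+00A3
```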

007e 2c80               sub         al, 0x80   
0080 d0e0               shl         al, 0x1    
0082 89c3               mov         bx, ax     
0084 8b00               mov         ax, [bx+si]
0086 89cb               mov         bx, cx     
0088 85c0               test        ax, ax     
008a 74c0               jz          0x4c       
008c 3dff07             cmp         ax, 07ff   
008f 7308               jnc         0x99       
0091 c1c002             rol         ax, 0x2    
0094 c0e802             shr         al, 0x2    
0097 ebcd               jmp         0x66       

This part deals with chars >= 0x80 and < 0xa0. It looks up the Unicode code point in the table at the top; if the entry equals 0, it just skips to the beginning of the loop. If the code point is below 0x7ff (ergo: fits in two UTF-8 bytes), it adjusts the value and re-uses the previous code at 0x66.

0099 50                 push        ax         
009a c1e80c             shr         ax, 0xc    
009d 0ce0               or          al, e0     
009f e8d3ff             call        0x75       
00a2 58                 pop         ax         
00a3 25ff0f             and         ax, 0fff   
00a6 c1c002             rol         ax, 0x2    
00a9 c0e802             shr         al, 0x2    
00ac 0d8080             or          ax, 8080   
00af 50                 push        ax         
00b0 86c4               xchg        ah, al     
00b2 ebb8               jmp         0x6c       

Final part, which deals with codes that are above 0x7FF: shift away the low 12 bits, apply 0xE0 (see the UTF-8 encoding description for reference) and print the result; then take the low 12 bits, adjust them, apply the 8080 mask and again reuse the part that prints two bytes.
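For reference, the three-byte case looks like this in Python. Again this is only a sketch of the bit arithmetic described above, with a made-up function name, and it assumes a code point in the range 0x800..0xFFFF.

```python
def encode_three_bytes(code_point: int) -> bytes:
    """Encode a code point in 0x800..0xFFFF as a three-byte UTF-8 sequence."""
    lead = 0xE0 | (code_point >> 12)           # top 4 bits with the E0 mask
    mid = 0x80 | ((code_point >> 6) & 0x3F)    # middle 6 bits with the 80 mask
    tail = 0x80 | (code_point & 0x3F)          # low 6 bits with the 80 mask
    return bytes([lead, mid, tail])


print(encode_three_bytes(0x2122).hex())  # U+2122 TRADE MARK SIGN
```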

GiM

Posted 2018-06-21T10:35:26.733

Reputation: 310

1

PHP+mbstring, 63 49 bytes

<?=mb_convert_encoding($argv[1],'UTF8','CP1252');

It doesn't work on TIO due to the lack of mbstring. The third parameter forces mbstring to interpret the string as Windows-1252 encoded.

-14 bytes thanks to Ismael Miguel

Sefa

Posted 2018-06-21T10:35:26.733

Reputation: 582

<?=mb_convert_encoding($argv[1],'UTF8','CP1252'); <-- even shorter! – Ismael Miguel – 2018-06-22T15:21:42.693

0

C (gcc) + libiconv, 119 117 bytes

*f(s,t,u)void*s,*t,*u;{long i=strlen(s),j=i*4;u=t=malloc(j);iconv(iconv_open("UTF8","CP1252"),&s,&i,&u,&j);return t;}

Try it online!

ErikF

Posted 2018-06-21T10:35:26.733

Reputation: 2 149

You should change the language to "C (gcc) + libiconv" in this case – ASCII-only – 2018-06-22T17:37:34.763

103 bytes – ceilingcat – 2018-12-29T01:05:32.630