Convert CP437 to UTF-8

7

4

The goal is to read bytes from input encoded in Code Page 437 and output the same characters encoded as UTF-8. That is, bytes should be translated into Unicode codepoints per this table (a particular revision of the Wikipedia article). Bytes 0-31 and 127 are to be treated as graphics characters.

Rules

Input and output are via stdin/stdout, or something equivalent (a function that only performs the conversion doesn't qualify). Input is 0 or more bytes before the input stream is closed.

Encoding with UTF-8 is part of the task (as is decoding the input as CP437), so to that end built-in functions or language features that perform conversion between character encodings are not allowed. Submissions should implement the character encoding conversion themselves.

Scoring is pure code-golf, shortest code wins. Code length is measured in bytes assuming the source code is encoded using UTF-8.

Example

A byte containing 0x0D received on input, representing '♪', should output bytes 0xE2, 0x99, 0xAA (in that order), which represents codepoint U+266A encoded in UTF-8.
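
The encoding step in the example can be sketched as follows. This is a minimal, ungolfed illustration of standard UTF-8 encoding, not any submission's code; the names `utf8Encode` and `hex` are made up for this sketch.

```javascript
// Encode a single Unicode code point as an array of UTF-8 byte values.
// (utf8Encode is a hypothetical helper name, used only for illustration.)
function utf8Encode(cp) {
  if (cp < 0x80) return [cp];                                   // 1 byte: 0xxxxxxx
  if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]; // 2 bytes
  if (cp < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (covers every CP437 character)
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  // 4 bytes: not needed for CP437, included for completeness
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

// Format bytes as upper-case hex for display.
const hex = bytes =>
  bytes.map(b => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');

console.log(hex(utf8Encode(0x266A))); // E2 99 AA
```

U+266A lies between 0x800 and 0xFFFF, so it takes the three-byte form, matching the bytes given in the example.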

FireFly

Posted 2013-10-04T09:19:46.340

Reputation: 7 107

Perl (for example) stores Unicode strings internally as UTF-8, but treats them logically as a sequence of Unicode characters. Does telling Perl to print such strings in their internal representation (e.g. using the -CO switch) violate the rule against built-in conversion features? Personally, I can see valid arguments either way. – Ilmari Karonen – 2013-10-04T14:32:26.583

Good question. I think I'd go with "no" since it bypasses the UTF-8 encoding step (and I intended the code doing the encoding to be part of one's solution), but I agree that it's not very obvious with how the rules are currently stated. – FireFly – 2013-10-04T17:58:09.797

Erm, I just noticed I didn't pay attention to how the question was formulated... that'd be a "yes", such a flag does violate the rule (at least in its updated form). – FireFly – 2013-10-04T20:27:13.543

OK, thanks. Another question: if the program is written in Unicode text, is the length measured in characters or in bytes? And can we require that the code be stored in a particular Unicode encoding (specifically, UTF-8), if (or if not) that's the default source code encoding for the language used? Just trying to close (or at least define) the loopholes here... – Ilmari Karonen – 2013-10-04T20:48:13.697

Thanks, clarified in the task description. Maybe this isn't such an interesting task to golf as I imagined it to be at first... – FireFly – 2013-10-04T20:58:37.313

No, it is not. You need lookup tables (ok, you can probably use logic for a smaller result), and the only challenge is to keep them small. – Johannes Kuhn – 2013-10-04T22:22:45.990

Answers

4

Tcl, 675

eval [zlib i Ò×sgÆáB\nPÃïÿòpÁ\f0\fv®4\fBc!j¶%cluYbãdÃw¨)+½MB()zàõÍóÝ=\{vÏì®ÿ¢­Ö¶\rk×­áóUë\tÌÀOÖ)'*ð,ò¦Ì1\[¦y\v?e~oÀÛ\;¼3)ï*S!ïMgÀÜé¬0OVy_sÌÇ侀@ýa>ÀoïæCMl°7ÎG¸«£Æ'º6Ä\"e±³DÆX*#,Ãïd¹êª²Á&ü®Al±!¬Ì,K˳KÓØèÀ*tbU¶cCìÀöÓíc'VcæKl+¬(V\$µÇFI`ãtc#ô`ôbHbS¤°ú°~vã²ôãrìÁÀäD4!í%\$ÕÉcð(E\\®În²vPÁÕ¨âSÃ5ØË°We?nÚ>Ê°1\"ãÄOyÆð»3Ë,²¯d7dº2)\{iÈ=ý|-w3¥\tÈGå(NÖ0ÙÏ1ä¸lpBý\}\t¾e¾E¾i¾<Y£)KCy~ÎÉçe\v2ÆEç¬rYV¸¢ç&¹*»¸¦­\;¸.ÃÜ\]ÜnÉ\ ·ñùÄÇóÍ\ ?ãåøæîàÆøf»xEîá%¹WæÞ(¿é_àwíü7Æ:.ó¿3Ã_¸a©,ð·Ìó>Veÿ°\tëtOpxªS<ÃMð\\e/ð7Oñ7ÂÿzÍ4¯pé´lølÕjZÛV¯\]·råk]

Just a simple mapping between input and output. Decompressed it looks like

puts [string map { ⺠ â»  ⥠ ⦠ ⣠ â   ⢠ â {  } â {
} â {} â {} â {
} ⪠ â«  â¼  ⺠ â  â  â¼  ¶  §  ⬠ ⨠ â  â  â  â  â  â  â²  â¼  â  Ã  ü  é  â  ä  à  Ã¥  ç  ê  ë  è  ï  î  ì  à  à  à  æ  à  ô  ö  ò  û  ù  ÿ  à  à  ¢  £  Â¥  ⧠ Æ   á ¡ í ¢ ó £ ú ¤ ñ ¥ à ¦ ª § º ¨ ¿ © â ª ¬ « ½ ¬ ¼ ­ ¡ ® « ¯ » ° â ± â ² â ³ â ´ ⤠µ â¡ ¶ ⢠· â ¸ â ¹ ⣠º â » â ¼ â ½ â ¾ â ¿ â À â Á ⴠ ⬠à â Ä â Å â¼ Æ â Ç â È â É â Ê â© Ë â¦ Ì â  Í â Î â¬ Ï â§ Ð â¨ Ñ â¤ Ò â¥ Ó â Ô â Õ â Ö â × â« Ø âª Ù â Ú â Û â Ü â Ý â Þ â ß â à α á à â Î ã Ï ä Σ å Ï æ µ ç Ï è Φ é Î ê Ω ë δ ì â í Ï î ε ï â© ð â¡ ñ ± ò ⥠ó ⤠ô â  õ â¡ ö ÷ ÷ â ø ° ù â ú · û â ü â¿ ý ² þ â  ÿ  } [read stdin]]

Johannes Kuhn

Posted 2013-10-04T09:19:46.340

Reputation: 7 122

The other approach is to use a cp437 -> codepoint table and encode the codepoints yourself, but this would require much more logic. – Johannes Kuhn – 2013-10-04T19:49:46.190

2

JavaScript (ES6), 668 bytes

console.log(prompt()[p='replace'](/[\S\s]/g,s=>String.fromCharCode(...(z=`BjuBjvBl1Bl2BkzBkwAciBh4BgrBh5Bk2Bk0Bl6Bl7BjwBgaBgkAmtAd8D2CnBfwAncAmpAmrAmqAmoAqnAmsBg2Bgc${','.repeat(96)}DjB0AhAaAcA8AdAfAiAjAgAnAmAkDgDhDlAeDiAsAuAqAzAxB3DyA4CiCjClAg7,b6A9AlArAyApDtCqD6DbAxcCsD9D8ChCrD7Bf5Bf6Bf7Bb6Bc4GtGuGiGhGvGdGjGpGoGnBbkBboBckBccBbwBb4BcsGqGrGmGgBe1GyGsGcBe4GzBe0GwGxGlGkGeGfBe3Be2BbsBbgBewBesBf0Bf4BeoE9A7FfEoFvErD1EsFyFkE1EcAqmEuEdAqxAshCxAslAskAxsAxtAvArsCwAqhD3AqiAf3CyBfkCg`[p](/G/g,'Bd')[p](/[A-F]/g,s=>','+'745qp6'[s[o='charCodeAt']()%6]).split`,`.map(i=>parseInt(i,36))[s=s[o]()]||s)<(x=128)?[z]:z<2048?[0|192+z/64,x+z%64]:[0|224+z/4096,0|x+z%4096/64,x+z%64])))

prompt()s for input and console.log()s the result. Tested in Firefox; it uses the ES6 features of arrow functions, template strings and the spread operator (...). The bulk of the data here is the string, which expands to a comma-separated list of base-36 numbers giving the Unicode code points of the characters to replace (1-31, 127-255); the other positions are padded with empty entries that parse as NaN. The code iterates over each char in the source string, replaces it if necessary and encodes the result as UTF-8. I'm sure it should be possible to shave off more bytes, but I'm done for now! Here's a function for easier testing:

c=t=>t[p='replace'](/[\S\s]/g,s=>String.fromCharCode(...(z=(`BjuBjvBl1Bl2BkzBkwAciBh4BgrBh5Bk2Bk0Bl6Bl7BjwBgaBgkAmtAd8D2CnBfwAncAmpAmrAmqAmoAqnAmsBg2Bgc${','.repeat(96)}DjB0AhAaAcA8AdAfAiAjAgAnAmAkDgDhDlAeDiAsAuAqAzAxB3DyA4CiCjClAg7,b6A9AlArAyApDtCqD6DbAxcCsD9D8ChCrD7Bf5Bf6Bf7Bb6Bc4GtGuGiGhGvGdGjGpGoGnBbkBboBckBccBbwBb4BcsGqGrGmGgBe1GyGsGcBe4GzBe0GwGxGlGkGeGfBe3Be2BbsBbgBewBesBf0Bf4BeoE9A7FfEoFvErD1EsFyFkE1EcAqmEuEdAqxAshCxAslAskAxsAxtAvArsCwAqhD3AqiAf3CyBfkCg`[p](/G/g,'Bd')[p](/[A-F]/g,s=>','+'745qp6'[s[o='charCodeAt']()%6])).split`,`.map(i=>parseInt(i,36))[s=s[o]()]||s)<(x=128)?[z]:z<2048?[0|192+z/64,x+z%64]:[0|224+z/4096,0|x+z%4096/64,x+z%64]))

Run the above and call c() to extract data:

c('\x0d').split('').map(s=>`\\x${s.charCodeAt(0).toString(16)}`).join``
"\xe2\x99\xaa"
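
For readers who want to see how the packed string turns into a lookup table, here is an ungolfed sketch of just the unpacking step as I read it from the code above. The name `decodeTable` and the short sample input are mine, not part of the answer.

```javascript
// Expand the packed table format into an array of code points.
// (decodeTable is a hypothetical name; the logic mirrors the golfed code.)
function decodeTable(packed) {
  const prefixDigits = '745qp6'; // base-36 leading digit chosen by letter code % 6
  return packed
    .replace(/G/g, 'Bd')                       // 'G' abbreviates the common prefix 'Bd'
    .replace(/[A-F]/g, ch => ',' + prefixDigits[ch.charCodeAt(0) % 6])
    .split(',')                                // first entry is empty, so index 0 is NaN
    .map(token => parseInt(token, 36));
}

// A few tokens taken from the answer's string: bytes 1, 2 and 7, plus one 'G' token.
const sample = decodeTable('BjuBjvAciGt').slice(1);
console.log(sample.map(n => n.toString(16))); // [ '263a', '263b', '2022', '2561' ]
```

Each upper-case letter therefore doubles as a separator and as the first base-36 digit of the next entry, which is where the string saves its commas.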

JavaScript (ES6), 618 bytes

There is a method in JavaScript for easily converting Unicode chars to their UTF-8 bytes (shared with me by Mathias Bynens on a conversion tool I'd written!) that involves URL-encoding and decoding the string. It saves bytes, but I feel it isn't in the spirit of the original challenge:

console.log(unescape(encodeURIComponent(prompt()[p='replace'](/[\S\s]/g,s=>String.fromCharCode((`BjuBjvBl1Bl2BkzBkwAciBh4BgrBh5Bk2Bk0Bl6Bl7BjwBgaBgkAmtAd8D2CnBfwAncAmpAmrAmqAmoAqnAmsBg2Bgc${','.repeat(96)}DjB0AhAaAcA8AdAfAiAjAgAnAmAkDgDhDlAeDiAsAuAqAzAxB3DyA4CiCjClAg7,b6A9AlArAyApDtCqD6DbAxcCsD9D8ChCrD7Bf5Bf6Bf7Bb6Bc4GtGuGiGhGvGdGjGpGoGnBbkBboBckBccBbwBb4BcsGqGrGmGgBe1GyGsGcBe4GzBe0GwGxGlGkGeGfBe3Be2BbsBbgBewBesBf0Bf4BeoE9A7FfEoFvErD1EsFyFkE1EcAqmEuEdAqxAshCxAslAskAxsAxtAvArsCwAqhD3AqiAf3CyBfkCg`[p](/G/g,'Bd')[p](/[A-F]/g,s=>','+'745qp6'[s[o='charCodeAt']()%6])).split`,`.map(i=>parseInt(i,36))[s=s[o]()]||s)))))

and as a function:

c=t=>unescape(encodeURIComponent(t[p='replace'](/[\S\s]/g,s=>String.fromCharCode(z=(`BjuBjvBl1Bl2BkzBkwAciBh4BgrBh5Bk2Bk0Bl6Bl7BjwBgaBgkAmtAd8D2CnBfwAncAmpAmrAmqAmoAqnAmsBg2Bgc${','.repeat(96)}DjB0AhAaAcA8AdAfAiAjAgAnAmAkDgDhDlAeDiAsAuAqAzAxB3DyA4CiCjClAg7,b6A9AlArAyApDtCqD6DbAxcCsD9D8ChCrD7Bf5Bf6Bf7Bb6Bc4GtGuGiGhGvGdGjGpGoGnBbkBboBckBccBbwBb4BcsGqGrGmGgBe1GyGsGcBe4GzBe0GwGxGlGkGeGfBe3Be2BbsBbgBewBesBf0Bf4BeoE9A7FfEoFvErD1EsFyFkE1EcAqmEuEdAqxAshCxAslAskAxsAxtAvArsCwAqhD3AqiAf3CyBfkCg`[p](/G/g,'Bd')[p](/[A-F]/g,s=>','+'745qp6'[s[o='charCodeAt']()%6])).split`,`.map(i=>parseInt(i,36))[s=s[o]()]||s))))
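
In isolation, the URL-encoding trick looks roughly like this. It is a sketch assuming an environment where the legacy global `unescape` is still available (browsers and Node.js provide it); `toUtf8ByteString` is a name made up for this illustration.

```javascript
// encodeURIComponent percent-encodes the UTF-8 bytes of each character;
// unescape then turns every %XX back into a single char with that code,
// leaving a "byte string" whose char codes are the UTF-8 bytes.
const toUtf8ByteString = s => unescape(encodeURIComponent(s));

const bytes = [...toUtf8ByteString('\u266A')].map(c => c.charCodeAt(0));
console.log(bytes.map(b => b.toString(16))); // [ 'e2', '99', 'aa' ]
```

Because the built-in does the UTF-8 encoding here, this shortcut runs into the same rule discussed in the comments on the question.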

Dom Hastings

Posted 2013-10-04T09:19:46.340

Reputation: 16 415

1

Tcl, 183

chan con stdin -en cp437
chan con stdout -en utf-8
puts [string map [concat {*}[lmap c [split {☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼} {}] {list [format %c [incr i]] $c}]  ⌂] [read stdin]]

You probably have to set your shell's input/output to binary so that it does not mangle the data.

Johannes Kuhn

Posted 2013-10-04T09:19:46.340

Reputation: 7 122

I think this probably breaks the prohibition of "built-in functions or language features specifically meant for converting from/to CP437 or UTF-8". – Ilmari Karonen – 2013-10-04T13:58:30.877

It is not specifically meant for converting from/to cp437/utf-8, just any supported encoding. (Much broader usage) – Johannes Kuhn – 2013-10-04T15:15:14.277

I did mean to prohibit this, but in retrospect I didn't word my rules very well. I wanted to prohibit any built-in mechanics to convert between character encodings, while not prohibiting basic string manipulation mechanics. I'll try to update the task to reflect this. I do feel a bit bad for changing a task after-the-fact to deny a solution though... – FireFly – 2013-10-04T18:02:04.860