JavaScript (ES6), 668 bytes
console.log(prompt()[p='replace'](/[\S\s]/g,s=>String.fromCharCode(...(z=`BjuBjvBl1Bl2BkzBkwAciBh4BgrBh5Bk2Bk0Bl6Bl7BjwBgaBgkAmtAd8D2CnBfwAncAmpAmrAmqAmoAqnAmsBg2Bgc${','.repeat(96)}DjB0AhAaAcA8AdAfAiAjAgAnAmAkDgDhDlAeDiAsAuAqAzAxB3DyA4CiCjClAg7,b6A9AlArAyApDtCqD6DbAxcCsD9D8ChCrD7Bf5Bf6Bf7Bb6Bc4GtGuGiGhGvGdGjGpGoGnBbkBboBckBccBbwBb4BcsGqGrGmGgBe1GyGsGcBe4GzBe0GwGxGlGkGeGfBe3Be2BbsBbgBewBesBf0Bf4BeoE9A7FfEoFvErD1EsFyFkE1EcAqmEuEdAqxAshCxAslAskAxsAxtAvArsCwAqhD3AqiAf3CyBfkCg`[p](/G/g,'Bd')[p](/[A-F]/g,s=>','+'745qp6'[s[o='charCodeAt']()%6]).split`,`.map(i=>parseInt(i,36))[s=s[o]()]||s)<(x=128)?[z]:z<2048?[0|192+z/64,x+z%64]:[0|224+z/4096,0|x+z%4096/64,x+z%64])))
Uses prompt() for input and console.log() for the result. Tested in Firefox; uses the ES6 features of arrow functions, template strings and the spread operator (...). The bulk of the data is the string, which expands to a comma-separated list of base-36 numbers giving the Unicode code points of the characters to replace (1-31, 127-255); the other positions are padded with empty entries, which parse to NaN and leave the character unchanged. The code iterates over each character of the source string, replacing it if necessary, and then encodes the result as UTF-8 bytes. I'm sure it should be possible to shave off more bytes, but I'm done for now! Here's a function for easier testing:
c=t=>t[p='replace'](/[\S\s]/g,s=>String.fromCharCode(...(z=(`BjuBjvBl1Bl2BkzBkwAciBh4BgrBh5Bk2Bk0Bl6Bl7BjwBgaBgkAmtAd8D2CnBfwAncAmpAmrAmqAmoAqnAmsBg2Bgc${','.repeat(96)}DjB0AhAaAcA8AdAfAiAjAgAnAmAkDgDhDlAeDiAsAuAqAzAxB3DyA4CiCjClAg7,b6A9AlArAyApDtCqD6DbAxcCsD9D8ChCrD7Bf5Bf6Bf7Bb6Bc4GtGuGiGhGvGdGjGpGoGnBbkBboBckBccBbwBb4BcsGqGrGmGgBe1GyGsGcBe4GzBe0GwGxGlGkGeGfBe3Be2BbsBbgBewBesBf0Bf4BeoE9A7FfEoFvErD1EsFyFkE1EcAqmEuEdAqxAshCxAslAskAxsAxtAvArsCwAqhD3AqiAf3CyBfkCg`[p](/G/g,'Bd')[p](/[A-F]/g,s=>','+'745qp6'[s[o='charCodeAt']()%6])).split`,`.map(i=>parseInt(i,36))[s=s[o]()]||s)<(x=128)?[z]:z<2048?[0|192+z/64,x+z%64]:[0|224+z/4096,0|x+z%4096/64,x+z%64]))
Run the above and call c() to extract data:
c('\x0d').split('').map(s=>`\\x${s.charCodeAt(0).toString(16)}`).join``
"\xe2\x99\xaa"
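To make the golfed expression easier to follow, here is an ungolfed sketch of its two steps: expanding one compressed table entry into a code point, and encoding a code point as UTF-8 bytes by hand. The names `decodeEntry` and `utf8Bytes` are made up for illustration and do not appear in the answer; like the golfed code, the sketch only handles code points below U+10000.

```javascript
// Each capital letter A-F in the table stands for a comma followed by one
// base-36 digit, picked from this string by (char code % 6).
const DIGITS = '745qp6';

// Hypothetical helper: expand a single compressed entry such as "Bju".
// The golfed code expands the whole table at once and splits on commas;
// here the comma is dropped because only one entry is handled.
function decodeEntry(entry) {
  const expanded = entry
    .replace(/G/g, 'Bd') // "G" abbreviates the common prefix "Bd"
    .replace(/[A-F]/g, c => DIGITS[c.charCodeAt(0) % 6]);
  return parseInt(expanded, 36);
}

// Hypothetical helper: manual UTF-8 encoding for code points up to U+FFFF,
// mirroring the 1-, 2- and 3-byte branches of the golfed ternary.
function utf8Bytes(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
}

const hex = bytes => bytes.map(b => b.toString(16)).join(' ');

// "Bju" is the first entry of the table (input byte 1).
console.log(decodeEntry('Bju').toString(16)); // 263a
console.log(hex(utf8Bytes(decodeEntry('Bju')))); // e2 98 ba
console.log(hex(utf8Bytes(0x266A))); // e2 99 aa
```

The last line reproduces the bytes shown above for the input '\x0d', which the table maps to U+266A.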
JavaScript (ES6), 618 bytes
JavaScript has an easy way to get the UTF-8 bytes of Unicode characters (shared with me by Mathias Bynens on a conversion tool I'd written!): URL-encode the string, then unescape it. This saves bytes, but I feel it isn't in the spirit of the original challenge:
console.log(unescape(encodeURIComponent(prompt()[p='replace'](/[\S\s]/g,s=>String.fromCharCode((`BjuBjvBl1Bl2BkzBkwAciBh4BgrBh5Bk2Bk0Bl6Bl7BjwBgaBgkAmtAd8D2CnBfwAncAmpAmrAmqAmoAqnAmsBg2Bgc${','.repeat(96)}DjB0AhAaAcA8AdAfAiAjAgAnAmAkDgDhDlAeDiAsAuAqAzAxB3DyA4CiCjClAg7,b6A9AlArAyApDtCqD6DbAxcCsD9D8ChCrD7Bf5Bf6Bf7Bb6Bc4GtGuGiGhGvGdGjGpGoGnBbkBboBckBccBbwBb4BcsGqGrGmGgBe1GyGsGcBe4GzBe0GwGxGlGkGeGfBe3Be2BbsBbgBewBesBf0Bf4BeoE9A7FfEoFvErD1EsFyFkE1EcAqmEuEdAqxAshCxAslAskAxsAxtAvArsCwAqhD3AqiAf3CyBfkCg`[p](/G/g,'Bd')[p](/[A-F]/g,s=>','+'745qp6'[s[o='charCodeAt']()%6])).split`,`.map(i=>parseInt(i,36))[s=s[o]()]||s)))))
And as a function:
c=t=>unescape(encodeURIComponent(t[p='replace'](/[\S\s]/g,s=>String.fromCharCode(z=(`BjuBjvBl1Bl2BkzBkwAciBh4BgrBh5Bk2Bk0Bl6Bl7BjwBgaBgkAmtAd8D2CnBfwAncAmpAmrAmqAmoAqnAmsBg2Bgc${','.repeat(96)}DjB0AhAaAcA8AdAfAiAjAgAnAmAkDgDhDlAeDiAsAuAqAzAxB3DyA4CiCjClAg7,b6A9AlArAyApDtCqD6DbAxcCsD9D8ChCrD7Bf5Bf6Bf7Bb6Bc4GtGuGiGhGvGdGjGpGoGnBbkBboBckBccBbwBb4BcsGqGrGmGgBe1GyGsGcBe4GzBe0GwGxGlGkGeGfBe3Be2BbsBbgBewBesBf0Bf4BeoE9A7FfEoFvErD1EsFyFkE1EcAqmEuEdAqxAshCxAslAskAxsAxtAvArsCwAqhD3AqiAf3CyBfkCg`[p](/G/g,'Bd')[p](/[A-F]/g,s=>','+'745qp6'[s[o='charCodeAt']()%6])).split`,`.map(i=>parseInt(i,36))[s=s[o]()]||s))))
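In isolation, the URL-encoding trick looks roughly like the sketch below. The wrapper name `toUtf8ByteString` is an assumption made for illustration; `encodeURIComponent` and `unescape` are the standard built-ins the answer relies on (`unescape` is a legacy function, so treat this as a golfing trick rather than recommended practice).

```javascript
// encodeURIComponent percent-encodes the UTF-8 bytes of every non-ASCII
// character (e.g. U+266A becomes "%E2%99%AA"); unescape then turns each %XX
// back into one character whose char code is that byte value.
// Hypothetical wrapper name, not part of the answer.
function toUtf8ByteString(s) {
  return unescape(encodeURIComponent(s));
}

const byteString = toUtf8ByteString('\u266a');
const hexBytes = [...byteString].map(ch => ch.charCodeAt(0).toString(16));

console.log(hexBytes.join(' ')); // e2 99 aa
```

The result is a string with one character per UTF-8 byte, which is the same shape of output that the manual encoding in the 668-byte version builds with String.fromCharCode.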
Perl (for example), stores Unicode strings internally as UTF-8, but treats them logically as a sequence of Unicode characters. Does telling Perl to print such strings in their internal presentation (e.g. using the -CO switch) violate the rule against built-in conversion features? Personally, I can see valid arguments either way. – Ilmari Karonen – 2013-10-04T14:32:26.583
Good question. I think I'd go with "no" since it bypasses the UTF-8 encoding step (and I intended the code doing the encoding to be part of one's solution), but I agree that it's not very obvious with how the rules are currently stated. – FireFly – 2013-10-04T17:58:09.797
Erm, I just noticed I didn't pay attention to how the question was formulated... that'd be a "yes", such a flag does violate the rule (at least in its updated form). – FireFly – 2013-10-04T20:27:13.543
OK, thanks. Another question: if the program is written in Unicode text, is the length measured in characters or in bytes? And can we require that the code be stored in a particular Unicode encoding (specifically, UTF-8), if (or if not) that's the default source code encoding for the language used? Just trying to close (or at least define) the loopholes here... – Ilmari Karonen – 2013-10-04T20:48:13.697
Thanks, clarified in the task description. Maybe this isn't such an interesting task to golf as I imagined it to be at first... – FireFly – 2013-10-04T20:58:37.313
No, it is not. You need lookup tables (ok, you can probably use logic for a smaller result), and the only challenge is to keep them small. – Johannes Kuhn – 2013-10-04T22:22:45.990