Substitute Unprintable ASCII Characters

20

6

Sandbox

Have y'all ever written an answer with unprintable ASCII characters in it and wished that there was an easy way to represent those characters in a printable way? Well, that's why the Control Pictures Unicode block was invented.

However, manually substituting these characters into one's answer is time-consuming, so that's what today's challenge is about: swapping out the nasty invisible characters for nice, readable characters.

Input

You will be given strings that contain only ASCII characters (i.e. the code point of each character will be in the range: \$0 \lt char \le 127\$).

Output

Replace each unprintable character with its corresponding character in the Control Pictures Unicode range.

In other words:

  • Characters in the range \$0 \lt char \lt 9\$ are replaced with their corresponding character
  • Horizontal tabs and newlines (9 and 10) aren't replaced
  • Characters in the range \$11 \le char \lt 32\$ are replaced with their corresponding character
  • Spaces (32) aren't replaced
  • The delete character (127) is replaced with its corresponding character: ␡
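The rules above can be sketched as a non-golfed reference implementation. This is only an illustration in Python; the function name `substitute` is made up here and is not part of the challenge. It assumes the Control Pictures block starts at U+2400, so control character n maps to U+2400 + n, with delete mapping to U+2421.

```python
def substitute(s: str) -> str:
    """Replace unprintable ASCII characters with their Control Pictures glyphs."""
    out = []
    for ch in s:
        n = ord(ch)
        if n == 127:
            # Delete has its own glyph at U+2421.
            out.append("\u2421")
        elif n < 32 and n not in (9, 10):
            # Control characters other than tab and newline map to U+2400 + n.
            out.append(chr(0x2400 + n))
        else:
            # Tabs, newlines, spaces and printable characters pass through.
            out.append(ch)
    return "".join(out)


print(substitute("Hello\x07World!"))
```

Tabs (9), newlines (10) and spaces (32) are deliberately left alone, matching the list above.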

Tests

Characters are given as escapes for nice formatting here, but in the actual input each escape will be the literal unprintable character.

In -> Out
\x1f\x1c\x1f\x1e\x1f\x1e\x1f\x1f\x1e\x1f\x1e\x1f -> ␟␜␟␞␟␞␟␟␞␟␞␟
Hello\x07World! -> Hello␇World!
\x01\x02\x03\x04\x05\x06\x07\x08\x0c\x0b\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x7f -> ␁␂␃␄␅␆␇␈␌␋␎␏␐␑␒␓␔␕␖␗␘␙␚␛␜␝␞␟␡
\r -> ␍

Rules

  • All standard loopholes are forbidden
  • Character substitutions must be made according to the Control Pictures Unicode block
  • Input will be given with the literal unprintable characters

Scoring

This is code-golf so the answer with the fewest bytes wins.

Test-Case Generator

I have provided a test case generator for y'all. It prints inputs in the way they will be passed to your program and outputs the expected result.

Try it here!

Leaderboards

Here is a Stack Snippet to generate both a regular leaderboard and an overview of winners by language.

To make sure that your answer shows up, please start your answer with a headline, using the following Markdown template:

# Language Name, N bytes

where N is the size of your submission. If you improve your score, you can keep old scores in the headline, by striking them through. For instance:

# Ruby, <s>104</s> <s>101</s> 96 bytes

If you want to include multiple numbers in your header (e.g. because your score is the sum of two files or you want to list interpreter flag penalties separately), make sure that the actual score is the last number in the header:

# Perl, 43 + 2 (-p flag) = 45 bytes

You can also make the language name a link which will then show up in the leaderboard snippet:

# [><>](http://esolangs.org/wiki/Fish), 121 bytes

var QUESTION_ID=196014;
var OVERRIDE_USER=78850;
var ANSWER_FILTER="!t)IWYnsLAZle2tQ3KqrVveCRJfxcRLe",COMMENT_FILTER="!)Q2B_A2kjfAiU78X(md6BoYk",answers=[],answers_hash,answer_ids,answer_page=1,more_answers=!0,comment_page;function answersUrl(d){return"https://api.stackexchange.com/2.2/questions/"+QUESTION_ID+"/answers?page="+d+"&pagesize=100&order=desc&sort=creation&site=codegolf&filter="+ANSWER_FILTER}function commentUrl(d,e){return"https://api.stackexchange.com/2.2/answers/"+e.join(";")+"/comments?page="+d+"&pagesize=100&order=desc&sort=creation&site=codegolf&filter="+COMMENT_FILTER}function getAnswers(){jQuery.ajax({url:answersUrl(answer_page++),method:"get",dataType:"jsonp",crossDomain:!0,success:function(d){answers.push.apply(answers,d.items),answers_hash=[],answer_ids=[],d.items.forEach(function(e){e.comments=[];var f=+e.share_link.match(/\d+/);answer_ids.push(f),answers_hash[f]=e}),d.has_more||(more_answers=!1),comment_page=1,getComments()}})}function getComments(){jQuery.ajax({url:commentUrl(comment_page++,answer_ids),method:"get",dataType:"jsonp",crossDomain:!0,success:function(d){d.items.forEach(function(e){e.owner.user_id===OVERRIDE_USER&&answers_hash[e.post_id].comments.push(e)}),d.has_more?getComments():more_answers?getAnswers():process()}})}getAnswers();var SCORE_REG=function(){var d=String.raw`h\d`,e=String.raw`\-?\d+\.?\d*`,f=String.raw`[^\n<>]*`,g=String.raw`<s>${f}</s>|<strike>${f}</strike>|<del>${f}</del>`,h=String.raw`[^\n\d<>]*`,j=String.raw`<[^\n<>]+>`;return new RegExp(String.raw`<${d}>`+String.raw`\s*([^\n,]*[^\s,]),.*?`+String.raw`(${e})`+String.raw`(?=`+String.raw`${h}`+String.raw`(?:(?:${g}|${j})${h})*`+String.raw`</${d}>`+String.raw`)`)}(),OVERRIDE_REG=/^Override\s*header:\s*/i;function getAuthorName(d){return d.owner.display_name}function process(){var d=[];answers.forEach(function(n){var o=n.body;n.comments.forEach(function(q){OVERRIDE_REG.test(q.body)&&(o="<h1>"+q.body.replace(OVERRIDE_REG,"")+"</h1>")});var 
p=o.match(SCORE_REG);p&&d.push({user:getAuthorName(n),size:+p[2],language:p[1],link:n.share_link})}),d.sort(function(n,o){var p=n.size,q=o.size;return p-q});var e={},f=1,g=null,h=1;d.forEach(function(n){n.size!=g&&(h=f),g=n.size,++f;var o=jQuery("#answer-template").html();o=o.replace("{{PLACE}}",h+".").replace("{{NAME}}",n.user).replace("{{LANGUAGE}}",n.language).replace("{{SIZE}}",n.size).replace("{{LINK}}",n.link),o=jQuery(o),jQuery("#answers").append(o);var p=n.language;p=jQuery("<i>"+n.language+"</i>").text().toLowerCase(),e[p]=e[p]||{lang:n.language,user:n.user,size:n.size,link:n.link,uniq:p}});var j=[];for(var k in e)e.hasOwnProperty(k)&&j.push(e[k]);j.sort(function(n,o){return n.uniq>o.uniq?1:n.uniq<o.uniq?-1:0});for(var l=0;l<j.length;++l){var m=jQuery("#language-template").html(),k=j[l];m=m.replace("{{LANGUAGE}}",k.lang).replace("{{NAME}}",k.user).replace("{{SIZE}}",k.size).replace("{{LINK}}",k.link),m=jQuery(m),jQuery("#languages").append(m)}}
body{text-align:left!important}#answer-list{padding:10px;float:left}#language-list{padding:10px;float:left}table thead{font-weight:700}table td{padding:5px}
 <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script> <link rel="stylesheet" type="text/css" href="https://cdn.sstatic.net/Sites/codegolf/primary.css?v=f52df912b654"> <div id="language-list"> <h2>Winners by Language</h2> <table class="language-list"> <thead> <tr><td>Language</td><td>User</td><td>Score</td></tr></thead> <tbody id="languages"> </tbody> </table> </div><div id="answer-list"> <h2>Leaderboard</h2> <table class="answer-list"> <thead> <tr><td></td><td>Author</td><td>Language</td><td>Size</td></tr></thead> <tbody id="answers"> </tbody> </table> </div><table style="display: none"> <tbody id="answer-template"> <tr><td>{{PLACE}}</td><td>{{NAME}}</td><td>{{LANGUAGE}}</td><td><a href="{{LINK}}">{{SIZE}}</a></td></tr></tbody> </table> <table style="display: none"> <tbody id="language-template"> <tr><td>{{LANGUAGE}}</td><td>{{NAME}}</td><td><a href="{{LINK}}">{{SIZE}}</a></td></tr></tbody> </table> 

Lyxal

Posted 2019-11-18T00:15:53.863

Reputation: 5 253

Your Hello,\x07World test case disagrees with itself over whether it should contain a comma. – Lynn – 2019-11-18T00:45:33.493

@JoKing, Lynn, both issues fixed – Lyxal – 2019-11-18T00:49:18.833

You say input will be limited to 0-127 but you have ¿⊙ in your test cases? – totallyhuman – 2019-11-18T05:41:40.157

Why not include vertical tabs, newlines, or spaces? The newlines and spaces in particular make this challenge worth it. – ouflak – 2019-11-18T08:49:44.320

Maybe add the idea that the environment is in the LC_ALL=C locale, so that some ranges are readily available ( [ -~] for example). The locale is often compatible with that range... but not always. (for ex see: https://stackoverflow.com/a/3208902/1841533 )

– Olivier Dulac – 2019-11-18T11:26:25.677

Does the output need to be UTF-8, or can it be any valid Unicode encoding? – Brian Minton – 2019-11-18T15:35:30.123

Oof. Runic (or the C# interpreter) does not like many of the input bytes. I spent probably 20 minutes trying to figure out why it was dying when trying to handle 0b11 and it turns out that it never gets into the string-on-the-stack.

– Draco18s no longer trusts SE – 2019-11-19T15:27:55.603

@Grimmy, fixed. I've got no clue how I missed that. – Lyxal – 2019-11-20T21:52:41.917

Answers

7

Zsh, 83 79 78 bytes

-4 bytes by changing to a recursive solution, (-0 from bugfix,) -1 by abusing unquoted empty parameters expanding to 0 words

<<<${1+"${(#)$((x=#1,x>126?x+9122:(x<9)^(x<11)^(x<32)?x+9216:x))}`$0 ${1:1}`"}

Try it online! Try it online! Try it online!

In arithmetic contexts, #1 is the code of the first character of $1. We use it 7 times, so x=#1, saves us 2 bytes. There may be a more compact way to do (x<9)^(x<11)^(x<32).
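As a sanity check on the arithmetic, here is a small Python sketch that mirrors the expression from the answer. The name `zsh_codepoint` is hypothetical and only exists for this illustration; it models the Zsh ternary chain, not the shell itself.

```python
def zsh_codepoint(x: int) -> int:
    """Mirror x>126?x+9122:(x<9)^(x<11)^(x<32)?x+9216:x from the answer."""
    if x > 126:
        # 127 + 9122 == 9249 == U+2421, the glyph for delete.
        return x + 9122
    if (x < 9) ^ (x < 11) ^ (x < 32):
        # The XOR chain is true for 0..8 and 11..31, false for 9, 10 and >= 32.
        return x + 9216
    return x


# List the code points for which the XOR chain selects a replacement.
selected = [x for x in range(1, 127) if zsh_codepoint(x) != x]
print(selected)
```

The three comparisons flip one at a time as x grows, so their XOR toggles at 9, 11 and 32, which is what carves the tab and newline out of the control range.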

The ${1+foo} expands to foo unless $1 is undefined, providing our recursive base case. The "`$0 ${1:1}`" is the actual recursive call. When an empty parameter is unquoted, it is removed from the word list, thus causing the final recursion to be run with zero arguments.

Running strings longer than about 35 characters with the recursive solution causes TIO to complain: "fork failed: resource temporarily unavailable". However, with usual resource limits on a desktop, there shouldn't be a problem.

Potential changes:

  • With the flag -oCPRECEDENCES, < binds more tightly than ^, which would save 6 bytes by removing parentheses.
  • With the flag -oEXTENDEDGLOB, we gain access to backreferences using (#m) and the $MATCH parameter. An approach like Arnauld's JavaScript answer could be used, but it ends up being longer than the purely arithmetic approach.
  • I also tried a transliterate solution, but the larger codepoints of the control pictures grew the bytecount too quickly.

GammaFunction

Posted 2019-11-18T00:15:53.863

Reputation: 2 838

7

JavaScript (Node.js), 67 66 bytes

Saved 1 byte thanks to @Grimmy

s=>s.replace(/[^ -~ \n]/g,c=>(B=Buffer)([226,144,B(c)[0]%94+128]))

includes a literal TAB

Try it online!

How?

We match all characters that are:

   +-------> neither printable
   |  +----> nor a tabulation
   |  | +--> nor a linefeed
  / \ | |
[^ -~\t\n]

and replace them with the UTF-8 sequence \$[226, 144, (n\bmod94)+128]\$, where \$n\$ is the ASCII code of the original character.
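The byte sequence and the code point formula can be checked against each other with a short Python sketch. The helper names `utf8_sequence` and `code_point` are made up for this illustration; the check only covers the code points the challenge actually replaces.

```python
def utf8_sequence(n: int) -> bytes:
    """The three bytes emitted by the Buffer version for ASCII code n."""
    return bytes([226, 144, n % 94 + 128])


def code_point(n: int) -> int:
    """The code point used by the String.fromCharCode version."""
    return n % 94 + 9216


# Code points that the challenge replaces: 1..8, 11..31 and 127.
REPLACED = [*range(1, 9), *range(11, 32), 127]

# Both formulas should describe the same character for every replaced code.
agree = all(chr(code_point(n)).encode("utf-8") == utf8_sequence(n) for n in REPLACED)
print(agree)
```

The modulo does nothing for codes below 32, and it folds 127 down to 33, which lands on U+2421, the delete glyph.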

This is equivalent to generating the code point \$(n\bmod94)+9216\$. But given the length of the corresponding JS method names, that would be 74 73 bytes in plain ES6:

s=>s.replace(/[^ -~ \n]/g,c=>String.fromCharCode(c.charCodeAt()%94+9216))

Try it online!

Arnauld

Posted 2019-11-18T00:15:53.863

Reputation: 111 334

This is a good approach. However it must be noted that the [^ -~] range depends on the locale. To be sure it works it should be used along with a LC_ALL=C environment ... (or LC_COLLATE, but LC_ALL is shorter) – Olivier Dulac – 2019-11-18T11:24:32.607

@OlivierDulac As far as I know, regular expressions in Node (or JS) are not locale-aware. But I'd be interested in a counter-example. Also, note that the input characters are guaranteed to be in the range \$[1,127]\$. – Arnauld – 2019-11-18T11:54:43.393

I can't help here, as I don't know yet those 2 languages... But I'd be surprised if they didn't respect the user's locale and decided to overwrite it with their own "internal locale", even though in this case it seems more "logical". for other programs (perl, grep, awk, etc) on unix systems, the regex interpretation depends on the locale. In some locales, for ex, [a-z] can be "a" or "B" or "b" or "C" or "c" ... or "Y" or "y" or "Z" or "z", ie all letters caps and non caps except for "A". In others, "e" may be followed by "é", etc. Good shell scripts need to ensure proper locale setting. – Olivier Dulac – 2019-11-18T12:17:16.900

That \t can be a literal tab character for -1 (TIO).

– Grimmy – 2019-11-20T17:27:13.440

@Grimmy Nice catch. Thanks. – Arnauld – 2019-11-20T17:33:30.680

6

C (gcc), 119 109 108 89 75 bytes

Thanks to ceilingcat and Arnauld for the suggestions.

Like my original submission, the UTF-8 code sequence is largely hard-coded, but it is now stored in an integer rather than a string literal. This takes advantage of the fact that the LSB of an integer (on a little-endian processor) is the first byte stored in memory, and the zero MSB bytes act as the NUL that terminates the string. For unprintable characters that map to the control pictures, three bytes are used to represent the value; one byte is used for everything else.
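To make the integer trick concrete, here is a Python sketch of the byte layout. It assumes a 32-bit little-endian int, as the answer does; `as_c_string` and `control_picture_word` are hypothetical helper names used only for this illustration.

```python
PREFIX = 8425698  # 0x8090E2: stored little-endian this is the bytes E2 90 80


def as_c_string(word: int) -> bytes:
    """Bytes a C string function would read from a 32-bit little-endian int."""
    raw = word.to_bytes(4, "little")
    # Reading stops at the first NUL byte, just like printf would.
    return raw[: raw.index(0)]


def control_picture_word(d: int) -> int:
    """Mirror d%94<<16|8425698 from the answer."""
    return (d % 94 << 16) | PREFIX


print(as_c_string(control_picture_word(7)).decode("utf-8"))
print(as_c_string(ord("A")))
```

Because d%94 is below 64 for every replaced character, OR-ing it into the third byte (0x80) is the same as adding it, which gives the correct UTF-8 continuation byte.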

One trick I use is to subtract the lower bound of a contiguous range so that the range starts at zero, which lets me use an unsigned comparison. Values below the range wrap around to large values, so a single compare suffices (d-9U>1 is true for d<9 or d>10).
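The wrap-around comparison can be modelled in Python by masking to 32 bits. This is a sketch under the assumption that unsigned int is 32 bits wide; `needs_replacing` is a made-up name for this illustration.

```python
def needs_replacing(d: int) -> bool:
    """Mirror the C condition d>126|d<32&d-9U>1."""
    # Masking to 32 bits models unsigned wrap-around: values below 9
    # become huge, so one compare rejects only 9 and 10.
    outside_tab_lf = ((d - 9) & 0xFFFFFFFF) > 1
    return d > 126 or (d < 32 and outside_tab_lf)


# Codes flagged for replacement across the whole input range 1..127.
flagged = [d for d in range(1, 128) if needs_replacing(d)]
print(flagged)
```

In C, & binds more tightly than |, so the condition groups as d>126 | (d<32 & d-9U>1), which is the grouping used here.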

d;f(char*s){for(;d=*s++;printf(&d))d=d>126|d<32&d-9U>1?d%94<<16|8425698:d;}

Try it online!

Original version (89 bytes)

As C doesn't handle UTF-8 natively, I output the hard-coded prefix bytes and compute the trailing byte (which is the only byte that requires changing) if I need to print a control picture glyph. Other than that, it's a pretty standard string-processing function.

d;f(char*s){for(;d=*s++;printf(d>126|d<32&d-9U>1?"\xE2\x90%c":"%2$c",d%94+128,d));}

Try it online!

ErikF

Posted 2019-11-18T00:15:53.863

Reputation: 2 149

83 bytes – Arnauld – 2019-11-18T11:06:02.287

75 bytes by expanding on @ceilingcat's last version. – Arnauld – 2019-11-18T22:12:21.633

5

Retina 0.8.2, 27 24 bytes

T`-`␁-␟␡`[^	\n]

Try it online! Transliterates the unprintables to the appropriate control pictures, but the tabs and newlines are excluded (for a 3-byte saving). Note: No CRs in the test cases because I don't know how to demonstrate them in TIO. Code would probably work with nulls with the obvious adjustments for the same byte count, but I don't know how to test that either.

Neil

Posted 2019-11-18T00:15:53.863

Reputation: 95 035

5

Python 3, 69 bytes

lambda s:s.translate({i%128:i%34+9216for i in{*range(-1,32)}-{9,10}})

Try it online!

Jitse

Posted 2019-11-18T00:15:53.863

Reputation: 3 566

3

Red, 94 92 bytes

func[s][n: charset[not{	
}#" "-#"~"]parse s[any[p: change n(to sp p/1 % 94 + 9216)| skip]]s]

Try it online!

A port of @Arnauld's 74-byte JavaScript solution. Don't forget to upvote his answer!

Galen Ivanov

Posted 2019-11-18T00:15:53.863

Reputation: 13 815

2

Perl 5 (-pC -Mutf8), 27 bytes

Same as Neil's Retina solution.

y;--;␁-␈␋-␟␡

Try it online!

Nahuel Fouilleul

Posted 2019-11-18T00:15:53.863

Reputation: 5 582

2

Jelly, 24 bytes

32R;Ø⁷’ḟ9ḟ⁵ż%94+⁽ ḥƊ$ỌFy

Try it online!

A monadic link taking a Jelly string and returning a Jelly string with the desired substitutions.

Nick Kennedy

Posted 2019-11-18T00:15:53.863

Reputation: 11 829

1

Ruby -p, 36 bytes

\0 should really be a null byte but I was having trouble getting it in the code, oh well.

Input in STDIN should be the direct bytes but I put in some boilerplate code to make it more readable and obvious that they were the test cases.

$_.tr!"\0--","␀-␈␋-␟␡"

Try it online!

Value Ink

Posted 2019-11-18T00:15:53.863

Reputation: 10 608