HE COMETH NOT - a zalgo challenge

25

2

Write a program or function that, given a string, will strip it of zalgo, if any exists.

Zalgo

For this post, zalgo is defined as any character from the following Unicode ranges:

  • Combining Diacritical Marks (0300–036F)
  • Combining Diacritical Marks Extended (1AB0–1AFF)
  • Combining Diacritical Marks Supplement (1DC0–1DFF)
  • Combining Diacritical Marks for Symbols (20D0–20FF)
  • Combining Half Marks (FE20–FE2F)

https://en.wikipedia.org/wiki/Combining_character#Unicode_ranges
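
For reference, the definition above can be written down as a plain, ungolfed membership test. This is only a sketch: the names ZALGO_RANGES, is_zalgo and strip_zalgo are made up for illustration and are not part of the challenge.

```python
# Code point ranges copied from the challenge's definition of "zalgo".
# The names below are hypothetical helpers, not part of the challenge.
ZALGO_RANGES = [
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1AB0, 0x1AFF),  # Combining Diacritical Marks Extended
    (0x1DC0, 0x1DFF),  # Combining Diacritical Marks Supplement
    (0x20D0, 0x20FF),  # Combining Diacritical Marks for Symbols
    (0xFE20, 0xFE2F),  # Combining Half Marks
]


def is_zalgo(ch):
    """Return True if the single character ch falls in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ZALGO_RANGES)


def strip_zalgo(text):
    """Drop every character that the challenge counts as zalgo."""
    return "".join(ch for ch in text if not is_zalgo(ch))


# "C" + two combining marks + "ode"
print(strip_zalgo("C\u0349\u030aode"))  # prints "Code"
```

Anything outside the five ranges, including other non-ASCII characters, passes through unchanged.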

Input

  • May be passed via command line arguments, STDIN, or any other standard method of input supported by your language
  • Will be a string that may or may not contain zalgo or other non-ASCII characters

Output

Output should be a string that does not contain any zalgo.

Test Cases

Input -> Output

HE̸͚ͦ ̓C͉Õ̗͕M͙͌͆E̋̃ͥT̠͕͌H̤̯͛ -> HE COMETH
C͉̊od̓e͔͝ ̆G̀̑ͧo͜l͔̯͊f͉͍ -> Code Golf
aaaͧͩa͕̰ȃ̘͕aa̚͢͝aa͗̿͢ -> aaaaaaaaa
ññ        -> ñn
⚡⃤       -> ⚡
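
The ñ test case only makes sense if the two ñ differ at the code point level. A small sketch of what is presumably going on, assuming the first ñ is the precomposed U+00F1 and the second is n followed by U+0303 (the pattern name ZALGO is made up for illustration):

```python
import re
import unicodedata

# Character class built from the five ranges in the challenge,
# written with escapes so the source stays readable.
ZALGO = re.compile(
    "[\u0300-\u036f\u1ab0-\u1aff\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]"
)

precomposed = "\u00f1"  # ñ as a single code point, outside the ranges
decomposed = "n\u0303"  # n followed by COMBINING TILDE, inside the ranges

print(ZALGO.sub("", precomposed) == precomposed)  # True: nothing to strip
print(ZALGO.sub("", decomposed))                  # n

# Normalizing first would change the result, since NFD splits the
# precomposed form into the decomposed one.
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```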

Scoring

As this is code-golf, shortest answer in bytes wins.

totallyhuman

Posted 2017-05-06T18:44:35.673

Reputation: 15 378

Is the string guaranteed to only contain ASCII and/or Zalgo? Or may it contain other unicode? – James – 2017-05-06T19:00:06.903

What about legitimate uses of those characters? Zalgo is pretty much only when those characters stack with each other in a way that was never intended. – Draco18s no longer trusts SE – 2017-05-06T19:16:08.777

@DJMcMayhem The input string may have other non-ASCII characters that must not be removed. – totallyhuman – 2017-05-06T19:16:16.293

@Draco18s Any character in those Unicode ranges must be removed. Besides, I don't think golfing code that recognizes valid words with combining characters would be fun. – totallyhuman – 2017-05-06T19:19:21.513

Is an encoding mandated or can any encoding be used? – Doorknob – 2017-05-06T19:21:35.433

@Doorknob Any encoding can be used but the definition of zalgo for this question still stands. – totallyhuman – 2017-05-06T19:24:37.360

@totallyhuman I was thinking a more generic approach: only stripping if more than one occurs after a "standard" character. That is is fine but a͕̰ gets stripped to a. (Also now, thanks to the emoji detector, I want to put diacritics on emoji...̘͕̑ pfft, that looks silly) – Draco18s no longer trusts SE – 2017-05-06T19:25:12.440

@Draco18s That... might actually be a good idea but isn't it too late? Won't I be disrupting current progress? – totallyhuman – 2017-05-06T19:27:41.013

No idea, honestly. I don't have a good idea of how things work around here. If this is deemed a good challenge, then my idea might make a good second challenge. But it's why we have a sandbox. – Draco18s no longer trusts SE – 2017-05-06T19:35:35.097

I did put it in the sandbox but I got different questions there. – totallyhuman – 2017-05-06T19:36:29.667

Then this question is probably fine as is. :) – Draco18s no longer trusts SE – 2017-05-06T19:40:52.600

Related. – Martin Ender – 2017-05-06T19:53:40.153

You should add some test cases with non-ASCII output. – xnor – 2017-05-06T22:02:33.803

I would be grateful if somebody could do that as I am unable to do so for a while. (Preferably with the same length as the others because that just works. :P) – totallyhuman – 2017-05-06T22:07:58.313

Answers

13

Retina, 35 bytes

T`̀-ͯ᪰-᫿᷀-᷿⃐-⃿︠-︯

Try it online!

Simply removes all characters in the ranges given in the challenge from the input. The code is unreadable, of course, but it is conceptually no different from something like T`0-9A-Za-z, which would delete all alphanumeric characters.
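
For readers who don't know Retina, the closest Python analogue to a delete-only transliteration is probably str.translate with a deletion table. This is a rough sketch of the same idea, not a translation of the Retina code; RANGES, DOOMED, DELETE_TABLE and strip_zalgo are names made up for illustration.

```python
# The five ranges from the challenge, as inclusive (first, last) pairs.
RANGES = [
    (0x0300, 0x036F),
    (0x1AB0, 0x1AFF),
    (0x1DC0, 0x1DFF),
    (0x20D0, 0x20FF),
    (0xFE20, 0xFE2F),
]

# Expand the ranges into one string holding every character to delete.
DOOMED = "".join(chr(cp) for lo, hi in RANGES for cp in range(lo, hi + 1))

# The third argument of str.maketrans lists characters that map to None,
# i.e. characters that translate() removes.
DELETE_TABLE = str.maketrans("", "", DOOMED)


def strip_zalgo(text):
    return text.translate(DELETE_TABLE)


print(len(DOOMED))  # 320 characters across the five ranges
```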

Martin Ender

Posted 2017-05-06T18:44:35.673

Reputation: 184 808

Seems unbeatable enough to me. – Erik the Outgolfer – 2017-05-06T20:12:20.187

@EriktheOutgolfer I don't know, I think Jelly might be able to generate the code point ranges more efficiently than just listing the characters. – Martin Ender – 2017-05-06T20:13:45.553

Actually I don't think it's able to. – Erik the Outgolfer – 2017-05-06T20:27:33.140

I'm surprised there is no Jelly solution yet. – totallyhuman – 2017-05-07T23:00:16.587

@icrieverytim here, and rip it's longer. I haven't figured out how to generate codepoints more effectively than this :P – HyperNeutrino – 2017-09-14T12:20:35.303

7

Python 3, 73 69 bytes

-4 bytes thanks to L3viathan.

Not sure if participating in your own challenge is ok or not but... Stole the regex and essentially the idea as well >< straight from the JS and Retina answers.

lambda s:re.sub('[̀-ͯ᪰-᫿᷀-᷿⃐-⃿︠-︯]','',s)
import re

Try it online!

totallyhuman

Posted 2017-05-06T18:44:35.673

Reputation: 15 378

Save 4 bytes by making that a normal import statement. – L3viathan – 2017-05-06T21:54:06.963

You forgot to update the byte count. – xnor – 2017-05-06T22:15:15.953

@xnor Huh? Seems right to me. – totallyhuman – 2017-05-07T00:26:06.133

@totallyhuman My mistake, missed that those char are multibyte. – xnor – 2017-05-07T00:43:20.747
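
The byte count is easy to double-check, assuming the answer is scored as UTF-8. The sketch below rebuilds the submission from escapes (the variable names are made up for illustration) and compares character count with byte count:

```python
# Inclusive (first, last) code points of the five ranges in the regex.
RANGES = [
    (0x0300, 0x036F),
    (0x1AB0, 0x1AFF),
    (0x1DC0, 0x1DFF),
    (0x20D0, 0x20FF),
    (0xFE20, 0xFE2F),
]

# Each range appears in the regex as three characters: first, "-", last.
char_class = "".join(chr(lo) + "-" + chr(hi) for lo, hi in RANGES)

# The two-line submission: the lambda, a newline, then the import.
source = "lambda s:re.sub('[" + char_class + "]','',s)\nimport re"

print(len(source))                  # 51 characters
print(len(source.encode("utf-8")))  # 69 bytes
```

The first range's endpoints take 2 bytes each in UTF-8 and the other eight endpoints take 3 bytes each, which accounts for the 18 extra bytes.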

Well, it's fair to say that I stole the character range from the Retina answer. (With care though, since my editor wanted to remove the zalgo along with the ```.) – Neil – 2017-05-07T09:40:09.843

4

JavaScript (ES6), 55 bytes

f=
s=>s.replace(/[̀-ͯ᪰-᫿᷀-᷿⃐-⃿︠-︯]/g,'')
<textarea oninput=o.textContent=f(this.value)></textarea><pre id=o>

Neil

Posted 2017-05-06T18:44:35.673

Reputation: 95 035

4

PHP, 67 Bytes

Shorter than the written-out version below, because the characters are used literally:

<?=preg_replace("#[̀-ͯ᪰-᫿᷀-᷿⃐-⃿︠-︯]#u","",$argn);

Try it online!

PHP, 115 Bytes

<?=preg_replace("#[\u{300}-\u{36f}\u{1ab0}-\u{1aff}\u{1dc0}-\u{1dff}\u{20d0}-\u{20ff}\u{fe20}-\u{fe2f}]#u","",$argn);

Try it online!

PHP, 35 Bytes

Valid for the given test cases: it removes every character in the Unicode Mark category (\pM), not only those in the listed ranges.

<?=preg_replace("#\pM#u","",$argn);

Try it online!
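
To see where the \pM shortcut and the range-based versions can disagree, here is a small Python sketch. Python's re module has no \p classes, so the category check via unicodedata is only a rough stand-in for PCRE's \pM; the function names are made up for illustration.

```python
import re
import unicodedata

# The five ranges from the challenge as a character class.
RANGE_RE = re.compile(
    "[\u0300-\u036f\u1ab0-\u1aff\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]"
)


def strip_by_range(text):
    """Remove only characters in the challenge's ranges."""
    return RANGE_RE.sub("", text)


def strip_by_category(text):
    """Remove every character whose general category starts with M
    (Mn, Mc, Me). Stand-in for PCRE \\pM, not an exact equivalent."""
    return "".join(
        c for c in text if not unicodedata.category(c).startswith("M")
    )


# Devanagari KA followed by VOWEL SIGN I (U+093F, category Mc).
sample = "\u0915\u093f"

print(strip_by_range(sample) == sample)       # True: U+093F is not in the ranges
print(strip_by_category(sample) == "\u0915")  # True: the vowel sign is a Mark
```

So the 35-byte version passes the listed test cases but would also strip marks that the challenge says to keep.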

Jörg Hülsermann

Posted 2017-05-06T18:44:35.673

Reputation: 13 026

@FelixDombek No, it only replaces marks in the given ranges with nothing. – Jörg Hülsermann – 2017-05-07T12:31:51.413

4

Japt, 37 bytes

r"[̀-ͯ᪰-᫿᷀-᷿⃐-⃿︠-︯]

Try it online!

Luke

Posted 2017-05-06T18:44:35.673

Reputation: 4 675

3

Python 3, 127 118 bytes

Just a straightforward answer for now, let's see how golfable it is.

lambda y:"".join(chr(x)for x in map(ord,y)if not(767<x<880or 6831<x<6912or 7615<x<7680or 8399<x<8448or 65055<x<65072))

Changelog:

  • When will I ever learn that comprehensions are shorter than functional stuff (-9 bytes).
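
The decimal bounds in the answer are exclusive, so each one sits one below or one above the hex endpoints listed in the challenge. A quick sketch to confirm the two forms agree (both function names are made up for illustration):

```python
def in_ranges_decimal(x):
    """The exclusive decimal comparisons used in the golfed answer."""
    return (767 < x < 880 or 6831 < x < 6912 or 7615 < x < 7680
            or 8399 < x < 8448 or 65055 < x < 65072)


def in_ranges_hex(x):
    """The inclusive hex ranges as listed in the challenge."""
    return (0x0300 <= x <= 0x036F or 0x1AB0 <= x <= 0x1AFF
            or 0x1DC0 <= x <= 0x1DFF or 0x20D0 <= x <= 0x20FF
            or 0xFE20 <= x <= 0xFE2F)


# Compare both predicates over every Unicode code point.
mismatches = [x for x in range(0x110000)
              if in_ranges_decimal(x) != in_ranges_hex(x)]
print(mismatches)  # []
```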

L3viathan

Posted 2017-05-06T18:44:35.673

Reputation: 3 151

0or is not a thing, so you have to fix it or it will raise SyntaxError. – Erik the Outgolfer – 2017-05-06T20:01:31.627

@EriktheOutgolfer Did you actually test it? Doesn't throw an error for me on neither Python 3 nor 2. – L3viathan – 2017-05-06T20:09:57.040

Oh right. I was confused for a bit. – Erik the Outgolfer – 2017-05-06T20:11:28.253

3

Bash + coreutils, 41

tr -d '̀-ͯ᪰-᫿᷀-᷿⃐-⃿︠-︯'

Simply strips out characters in the given ranges.

Try it online.

Digital Trauma

Posted 2017-05-06T18:44:35.673

Reputation: 64 644

2

APL (Dyalog Unicode), 43 bytes

'[̀-ͯ᪰-᫿᷀-᷿⃐-⃿︠-︯]'⎕R''

Try it online!

PCRE Replace all those with nothing


44 byte version not using RegEx or strange character literals (and thus single byte per character):

⍞~⎕UCS∊65055 8399 7615 6831 767+⍳¨16×2 6~⍨⍳7

Try it online!

⍳7 1…7 (1 2 3 4 5 6 7)

2 6~⍨ except 2 and 6 (1 3 4 5 7)

16× multiply by 16 (16 48 64 80 112)

⍳¨ 1… each (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16, 1 2 3…, …110 111 112)

+ add offset to each list (65056 65057 65058…, …877 878 879)

∊ enlist (flatten)

⎕UCS convert to corresponding Unicode character

⍞~ get text input and remove all such characters
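
The same arithmetic, step by step, as a Python sketch for anyone who doesn't read APL. It mirrors the explanation above rather than the APL evaluation order exactly, and all names are made up for illustration.

```python
# Offsets from the APL code: one less than the first code point of each range.
offsets = [65055, 8399, 7615, 6831, 767]

# 1..7 without 2 and 6, then multiplied by 16: the length of each range.
multipliers = [n for n in range(1, 8) if n not in (2, 6)]  # 1 3 4 5 7
lengths = [16 * n for n in multipliers]                    # 16 48 64 80 112

# For each range, add 1..length to its offset, then flatten.
code_points = [
    offset + i
    for offset, length in zip(offsets, lengths)
    for i in range(1, length + 1)
]

# Convert to characters and remove them from the input.
zalgo_chars = {chr(cp) for cp in code_points}


def strip_zalgo(text):
    return "".join(c for c in text if c not in zalgo_chars)


print(lengths)           # [16, 48, 64, 80, 112]
print(len(code_points))  # 320
```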

Adám

Posted 2017-05-06T18:44:35.673

Reputation: 37 779

2

Jelly, 32 bytes

“Żȷ'⁺¦60ƭṖ_WTɦ7Ụ|ṫYɠF’bȷ5r2/FỌḟ@

Try it online!

Explanation

“Żȷ'⁺¦60ƭṖ_WTɦ7Ụ|ṫYɠF’bȷ5r2/FỌḟ@  Main link
“Żȷ'⁺¦60ƭṖ_WTɦ7Ụ|ṫYɠF’            Base 250 compressed integer; 768008790683206911076160767908400084476505665071
                      bȷ5         Convert into base 100000; [768, 879, 6832, 6911, 7616, 7679, 8400, 8447, 65056, 65071]
                         r2/      Inclusive range on non-overlapping slices of length 2
                            F     Flatten
                             Ọ    chr; cast to character from codepoints
                              ḟ@  Filter; remove all characters from input that are in the characters generated before
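
The decoding steps after the compressed integer can be reproduced in Python. This sketch starts from the integer given in the explanation (it does not implement Jelly's base 250 decompression), and the helper names are made up for illustration.

```python
# The integer that the base 250 literal decompresses to, per the explanation.
N = 768008790683206911076160767908400084476505665071


def to_base(n, base):
    """Return the digits of n in the given base, most significant first."""
    digits = []
    while n:
        n, d = divmod(n, base)
        digits.append(d)
    return digits[::-1]


# Convert into base 100000: the ten range endpoints.
bounds = to_base(N, 10**5)

# Non-overlapping slices of length 2, each an inclusive (first, last) pair.
pairs = [bounds[i:i + 2] for i in range(0, len(bounds), 2)]

# Inclusive ranges, flattened, then converted to characters.
zalgo_chars = {chr(cp) for lo, hi in pairs for cp in range(lo, hi + 1)}


def strip_zalgo(text):
    """Remove all characters from the input that were generated above."""
    return "".join(c for c in text if c not in zalgo_chars)


print(bounds)
# [768, 879, 6832, 6911, 7616, 7679, 8400, 8447, 65056, 65071]
```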

HyperNeutrino

Posted 2017-05-06T18:44:35.673

Reputation: 26 575

o0 Didn't realize I bumped this question up lol. Is that b65072 what I think it is? :o – totallyhuman – 2017-09-14T12:21:11.567

@icrieverytim yes numerical list compression :D – HyperNeutrino – 2017-09-14T12:23:10.170

jelly is definitely the most zalgo language. i wonder what would happen if you ran the program on its own code? edit: unfortunately nothing – space junk – 2017-09-14T13:15:16.837

1

Java 8, 57 bytes

s->s.replaceAll("[̀-ͯ᪰-᫿᷀-᷿⃐-⃿︠-︯]","")

Try it here.

Kevin Cruijssen

Posted 2017-05-06T18:44:35.673

Reputation: 67 575

1

05AB1E, 32 bytes

•3xIαEλ¤’ä₆Ćkмм`0Â9"´•žHв2ôvyŸçK

Try it online!

Emigna

Posted 2017-05-06T18:44:35.673

Reputation: 50 798