I'm sure you're all familiar with Z̃͗̇̚͟Ḁ̬̹̈̊̂̏̚L̜̼͊ͣ̈́̿̚G̱̮ͩ̃͑̆ͤ̂̚Õ̷͇͉̺̜̲ͩ́ͪͬͦ͐ ̪̀ͤͨ͛̍̈͢ĝ̭͇̻̊ͮ̾͂e̬̤͔̩̋ͮ̊̈ͭ̓̃n͖͎̘̭̯̳͎͒͂̏̃̾ͯe͕̖̋ͧ͑ͪ̑r̛ͩa̴͕̥̺̺̫̾ͭ͂ͥ̄ͧ͆t͍̻̘̆o͓̥ͤͫ̃̈̂r̹̤͇̰̻̯̐ͮ̈́ͦ͂͞. If not, you can play with the classical generator a bit. Zalgo text is Unicode text that started as English text and had a bunch of combining characters added to make it ~~hard to read~~ artistic.

This is Operation Unzalgo. The objective is to undo the zalgo text generation. Since only combining characters were added, only combining characters have to be removed. We're only going to be operating on English, because of course we are. However, all proper English text must come through unmutated. That is, if you are fed proper English, you must output your input unchanged. Sounds easy, right? Not so fast. English has combining characters, just not very many, and people are slowly forgetting they exist.

The rules:

1) You may either strip combining characters or filter out all invalid characters, at your discretion. This doesn't have to work for all languages because the operation itself makes no sense if we started with something that wasn't English text. Abandon hope all ye who enter here.

2) You may assume that Unicode normalization was not executed after zalgo generation; that is, all added combining characters are still combining characters.

3) All printable ASCII characters (codepoints between 0x20 and 0x7E inclusive) must survive, as well as tab (0x09), and newline (0x0A).

4) English has diaeresis over vowels. These must survive whether expressed as natural characters or combining characters.

Vowel table (each group of three is in the form: unmodified character, single character, combining characters):

a ä ä A Ä Ä
e ë ë E Ë Ë
i ï ï I Ï Ï
o ö ö O Ö Ö
u ü ü U Ü Ü
y ÿ ÿ Y Ÿ Ÿ

The combining diaeresis character is code point 0x308.

5) English does not have diaeresis over consonants. You must dump them when constructed as combining characters.

6) British English has four ligatures. These must survive:

æ Æ œ Œ

7) The following symbols must survive:

Editor substitution table (includes smart quotes in both directions):

… ¼ ½ ¾ ‘ ’ “ ” ™

Symbol table:

$ ¢ £ ¬ † ‡ • ‰ · ° ± ÷

8) If you get something like ö̎̋̉͆̉ö͒̿̍ͨͦ̽, both oo and öö are valid answers.

9) Input may contain both zalgo and non-zalgo characters; and the non-zalgo characters should be unmodified: if somebody sends 'cöoperate with dͧḯ̍̑̊͐sc͆͐orͩ͌ͮ̎ͬd̄̚', they should still get back 'cöoperate with discord' not 'cooperate with discord'.

10) If any character is not specified, it doesn't matter what you do with it. Feel free to use this rule to lossily compress the keep-drop rules.

11) Your program must handle all unicode codepoints as input. No fair specifying a code-page that trivializes the problem.
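As a rough, un-golfed reading of rules 1 to 11, the sketch below strips combining marks and keeps U+0308 only when it directly follows a vowel. The drop set (U+0300 to U+036F plus U+0489) is an assumption borrowed from the regex most answers use, not something the rules mandate, and the names `is_zalgo_mark` and `unzalgo` are hypothetical.

```python
VOWELS = set("aeiouyAEIOUY")
DIAERESIS = "\u0308"


def is_zalgo_mark(ch: str) -> bool:
    # Assumed drop set: the Combining Diacritical Marks block plus U+0489.
    return "\u0300" <= ch <= "\u036f" or ch == "\u0489"


def unzalgo(text: str) -> str:
    out = []
    prev = ""  # previous input character, kept or not
    for ch in text:
        if ch == DIAERESIS and prev in VOWELS:
            out.append(ch)  # rule 4: diaeresis on a vowel survives
        elif not is_zalgo_mark(ch):
            out.append(ch)  # everything outside the drop set passes through
        prev = ch
    return "".join(out)


print(unzalgo("r\u0308e\u0308e\u0308n\u0308t\u0308r\u0308y\u0308"))
```

Precomposed characters such as ö (U+00F6) are never touched because they are outside the drop set. A diaeresis is kept only if it is the first mark after the vowel, which rule 8 allows.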

Additional test cases:

"z̈ ỏ"        "z o"
"r̈ëën̈ẗr̈ÿ"    "rëëntrÿ"  (don't outsmart yourself)

I have been informed this case is also excellent, but its starting point contains a few characters with unspecified behavior. If you maul Θ, Ό, Ɲ, or ȳ I don't care.

A pretty comprehensive test input:

    cöoperate with dͧḯ̍̑̊͐sc͆͐orͩ͌ͮ̎ͬd̄̚ æ Æ œ Œ…¼½¾‘’“”™$¢£¬†‡•‰·°±÷

a ä ä A Ä Ä
e ë ë E Ë Ë
i ï ï I Ï Ï   Z̃͗̇̚͟Ḁ̬̹̈̊̂̏̚L̜̼͊ͣ̈́̿̚G̱̮ͩ̃͑̆ͤ̂̚Õ̷͇͉̺̜̲ͩ́ͪͬͦ͐ ̪̀ͤͨ͛̍̈͢ĝ̭͇̻̊ͮ̾͂e̬̤͔̩̋ͮ̊̈ͭ̓̃n͖͎̘̭̯̳͎͒͂̏̃̾ͯe͕̖̋ͧ͑ͪ̑r̛ͩa̴͕̥̺̺̫̾ͭ͂ͥ̄ͧ͆t͍̻̘̆o͓̥ͤͫ̃̈̂r̹̤͇̰̻̯̐ͮ̈́ͦ͂͞
o ö ö O Ö Ö
u ü ü U Ü Ü
y ÿ ÿ Y Ÿ Ÿ
Unz̖̬̜̺̬a͇͖̯͔͉l̟̭g͕̝̼͇͓̪͍o̬̝͍̹̻


ö̎̋̉͆̉ö͒̿̍ͨͦ̽
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅ

Joshua

Posted 2019-07-17T02:11:42.787

Reputation: 3 043

Comments are not for extended discussion; this conversation has been moved to chat.

– James – 2019-07-18T18:32:28.003

No! The title z̖̬̜̺̬a͇͖̯͔͉l̟̭g͕̝̼͇͓̪͍o̬̝͍̹̻! Wherefore art thou removed? – tjjfvi – 2019-07-18T20:06:47.373

@tjjfvi: Don't worry overmuch. I'll put it back as soon as it's off HNQ. – Joshua – 2019-07-18T20:15:18.777

To be fair, the zalgo showing up in HNQ is what brought me here. – Kenneth K. – 2019-07-18T21:30:03.143

@KennethK.: Take it up with the mods if you want it back sooner. – Joshua – 2019-07-18T21:30:53.973

It's at the bottom now; lower z̖̬̜̺̬a͇͖̯͔͉l̟̭g͕̝̼͇͓̪͍o̬̝͍̹̻ like you had should be fine, right? :) – tjjfvi – 2019-07-19T11:56:20.880

@tjjfvi The ordering isn't consistent across refreshes. – wizzwizz4 – 2019-07-19T12:45:37.500

This challenge is similar to this one but is way more comprehensive. Should we close the other one as dupe?

– totallyhuman – 2019-07-20T21:58:36.260

Answers

JavaScript (Node.js), 45 bytes

s=>s.replace(/([aeiouy]̈)|[̀-ͯ҉]/ig,'$1')

Try it online!
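For readers who cannot see the combining characters inside the regex, here is a sketch of the same pattern written with escapes and comments. It is a Python transliteration rather than the answer's own code; `UNZALGO` and `unzalgo_regex` are hypothetical names, and both letter cases are spelled out instead of using the `i` flag.

```python
import re

UNZALGO = re.compile(
    r"""
    ([aeiouyAEIOUY]\u0308)      # group 1: a vowel directly followed by U+0308
    | [\u0300-\u036f\u0489]     # otherwise: any single mark in the drop set
    """,
    re.VERBOSE,
)


def unzalgo_regex(text: str) -> str:
    # Matches of group 1 are put back unchanged; lone marks are replaced by "".
    return UNZALGO.sub(lambda m: m.group(1) or "", text)


print(unzalgo_regex("Z\u0303\u0357A\u0308\u030a"))
```

Because the first alternative consumes the vowel together with its diaeresis, the second alternative never gets a chance to delete that U+0308.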

Perl 5 (`-pC -Mutf8`), 30 28 bytes

Uses tsh's regex; upvote him. -2 bytes thanks to Dom Hastings.

s/[aeiouy]̈\K|[̀-ͯ҉]//gi

TIO

Nahuel Fouilleul

Posted 2019-07-17T02:11:42.787

Reputation: 5 582

(+1 byte), adding /i flag, to fix for uppercase – Nahuel Fouilleul – 2019-07-18T07:47:47.210

You can save a couple of bytes using \K instead of $1: Try it online!

– Dom Hastings – 2019-07-19T08:04:08.023

Thank you. How could I not think of it? I used it in another answer of mine.

– Nahuel Fouilleul – 2019-07-19T09:14:14.737

Japt, 22 bytes

Another port of tsh's solution

r"(%ÿ)|[̀-ͯ҉]""$1

Try it or run all test cases

Shaggy

Posted 2019-07-17T02:11:42.787

Reputation: 24 623

Hey. We're having trouble figuring out what's up with % in the regex. – Joshua – 2019-07-17T23:43:32.290

@Joshua, % is the RegEx escape character in Japt v1.x and y is its character class for vowels+y, so %y -> \y -> [AEIOUYaeiouy]. You can read more about Japt's RegEx implementation in the docs.

– Shaggy – 2019-07-18T09:37:37.767

Retina, 64 49 bytes (44 chars)

Removes a combining diaeresis only if it does not follow a vowel, then removes any character not in the character class. \p{L} is a handy class for Unicode letters.

i`(?<![aeiouy])\u0308

[^	-÷\p{L}‘-™\u0308]

Try it online!

This ensures no dotted consonants remain, and it preserves the dots over the vowels in a way that passes rules #4 and #9.
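The two stages can be sketched in Python as follows. This is only an approximation of the Retina program: Python's `re` has no `\p{L}`, so the sketch assumes that a Unicode general category starting with "L" is an acceptable stand-in, and `keep` and `unzalgo_two_stage` are hypothetical names.

```python
import re
import unicodedata


def keep(ch: str) -> bool:
    cp = ord(ch)
    return (
        0x09 <= cp <= 0xF7             # tab through ÷
        or 0x2018 <= cp <= 0x2122      # ‘ through ™
        or cp == 0x0308                # diaereses that survived stage 1
        or unicodedata.category(ch).startswith("L")  # stand-in for \p{L}
    )


def unzalgo_two_stage(text: str) -> str:
    # Stage 1: drop U+0308 unless the character right before it is a vowel.
    stage1 = re.sub(r"(?<![aeiouyAEIOUY])\u0308", "", text)
    # Stage 2: drop every character outside the whitelist.
    return "".join(ch for ch in stage1 if keep(ch))


print(unzalgo_two_stage("z\u0308 o\u0309"))
```

This is the "filter out all invalid characters" option from rule 1: anything not whitelisted disappears, including characters the challenge leaves unspecified.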

Test on a real-world example: Try it online!

If I copy the regex in @tsh's answer, I get this for 27 chars (31 bytes):

i`(?!(?<=[aeiouy])̈)[̀-ͯ҉]

Try it online!

mbomb007

Posted 2019-07-17T02:11:42.787

Reputation: 21 944

Maybe you can take the regexp in my JavaScript answer. (I don't speak Retina, so maybe) – tsh – 2019-07-17T04:01:15.593

Copying the replacement from @tsh's answer would save you another 4 bytes. – Neil – 2019-07-17T08:29:07.330

Actually, those are characters, not bytes, so you'd have to recalculate in UTF8. (TIO counts characters as bytes because it's assuming you're using ISO 8859-1.) – Neil – 2019-07-17T08:30:23.360
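The difference between the two counts is easy to check. The sketch below uses a hypothetical helper, `golf_score`, that returns the character count and the UTF-8 byte count of a piece of source text.

```python
def golf_score(source: str) -> tuple[int, int]:
    # (number of code points, number of bytes when encoded as UTF-8)
    return len(source), len(source.encode("utf-8"))


# "a" is one byte in UTF-8, while U+0308 takes two.
chars, utf8_bytes = golf_score("a\u0308")
print(chars, utf8_bytes)
```

Every combining mark in the U+0300 to U+036F range costs two bytes in UTF-8, which is why the regex-based answers report more bytes than characters.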

Stax, 27 25 24 bytes

Ç╣>ñ↓í$▐wø⌐∞≤Ö1e╖ÖÅ╤Δ╩+º

Run and debug it

Now handles the Y, and uses no regex.

Unpacked, ungolfed, and commented, it looks like this:

400Vk|r push range array [400, 401, ..., 998, 999]
776X-   store 776 in the X register, and remove it from the array
-       remove all the *remaining* codepoints from the input
{       start a block for filtering the remaining string
  x-    subtract 776 from the codepoint; this will be truthy for all other values
  M     bitwise-or; if there's only a single value on the stack, no-op
  Vv'y+ "aeiouy"
  _]v#  current character in lowercase; is it in the vowels?
  s     swap top two stack entries so the lower value can be used in next iteration
f       complete the filter and output implicitly

Run this one
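The commented steps can be mirrored without a regex in Python. This is a sketch of the algorithm as the comments describe it, not a translation of the Stax program, and `unzalgo_filter` is a hypothetical name.

```python
VOWELS_LOWER = set("aeiouy")
DIAERESIS = chr(776)  # U+0308


def unzalgo_filter(text: str) -> str:
    # Step 1: remove every code point in [400, 999] except 776.
    drop = {chr(cp) for cp in range(400, 1000)} - {DIAERESIS}
    stage1 = [ch for ch in text if ch not in drop]

    # Step 2: keep a 776 only when the previous character is a vowel.
    out = []
    prev = ""
    for ch in stage1:
        if ch != DIAERESIS or prev.lower() in VOWELS_LOWER:
            out.append(ch)
        prev = ch
    return "".join(out)


print(unzalgo_filter("z\u0308 o\u0309"))
```

Note that the range [400, 999] as described stops short of U+0489 (decimal 1161), so this sketch leaves that character in place.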

recursive

Posted 2019-07-17T02:11:42.787

Reputation: 8 616

Explanation would be appreciated – data – 2019-07-18T06:24:22.113

@data: I added some comments. – recursive – 2019-07-18T21:18:34.683

Jelly, 29 bytes

768r2ȷỌ
ẹ9ị¢¤©fẹⱮØyF‘ƊœPḟ€¢j®

Try it online!

A full program that unzalgos its argument. Filters out all characters with code points between 768 and 2000 except diaereses on vowels.

Explanation

Helper link: characters from code point 768 to 2000 inclusive

768r2ȷ  | Range from 768 to 2000 inclusive
      Ọ | Convert from code points to characters

Main link

ẹ                     | All indices in input string of:
 9ị¢¤©                | - the ninth character in the helper link (i.e.  ̈) (which is also copied to the register for use at the end)
      f      Ɗ        | Filter keeping only those in the result of the following, applied to the original input as a monad:
       ẹⱮ             | - Indices of each of the following:
         Øy           |   - Vowels plus y
           F          | - Flatten
            ‘         | - Increment by 1 (giving the position right after each vowel)
              œP      | Now split the original input at each of these indices, discarding the character at the index itself (which will be a diaeresis following a vowel)
                ḟ€¢   | Filter each of these lists discarding any characters in the helper link
                   j® | Join using the stored diaeresis
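The split-and-join idea reads more easily in a conventional language. The sketch below follows the explanation step by step in Python; it is an illustration of the approach, not a port checked against the Jelly program, and `unzalgo_split` is a hypothetical name.

```python
DIAERESIS = "\u0308"
VOWELS = set("AEIOUYaeiouy")
# Helper link: every character from code point 768 to 2000 inclusive.
MARKS = {chr(cp) for cp in range(768, 2001)}


def unzalgo_split(text: str) -> str:
    # Indices of each diaeresis that sits right after a vowel.
    keep_at = [
        i for i, ch in enumerate(text)
        if ch == DIAERESIS and i > 0 and text[i - 1] in VOWELS
    ]
    # Split the input at those indices, discarding the diaeresis itself.
    pieces, start = [], 0
    for i in keep_at:
        pieces.append(text[start:i])
        start = i + 1
    pieces.append(text[start:])
    # Strip every helper-link character from each piece, then rejoin.
    cleaned = ["".join(ch for ch in p if ch not in MARKS) for p in pieces]
    return DIAERESIS.join(cleaned)


print(unzalgo_split("r\u0308e\u0308e\u0308n\u0308t\u0308r\u0308y\u0308"))
```

Since U+0308 is itself inside the 768 to 2000 range, any diaeresis left in a piece (one on a consonant, for example) is stripped, and only the ones used as split points are restored by the join.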

Nick Kennedy

Posted 2019-07-17T02:11:42.787

Reputation: 11 829

1+1 f͠͡o͝r̕ ̟̗̼̹̉̐̐ͪͬ̑͌͊̄̃̉̑̒̒͟ḃ̠͉̬̳̪͓͕͎ŏ̘̳̬̗͎̆ͦ͛ỵ͈͗̋̇͗̒̒̀̑̎c̘͎͚̞͈̘ͬ̽̊o̼̳͉̟ͮ̏͌̃̀͒ͧ̃t͎̬̼͛͛̋͆͌t͙̥͎̘͔͕̋͑i̲̮͙͕̫ͯ̊̄̈́̍̉̿̈̌n̫̥͕̞͖̺ͮ̋̏̆̒͗ͅg̘̻̜̟̩̺̈̌̈́͆̿ ̠͙̙̣̆̿͋ͫͅͅw̸̟̽̿̾ͬ̎͢ỉ̵̝͖̜̼̺͡͞t̩̳̫͙̋ͮ̏ͤͅh̴̥̬͓̭͙͊̇̔ ̵̖̥̩̟͌̐̅̎̄͗̈́͑́m̵̑͑̊ͨͮ͊͛ͦ̉ͪ̍̆́҉̝͖̥̕ͅe̶̛͔͍̹̰͚̰̯͈ͭ̌͛̋ͣ̂̓̃͂͆̚̕͟!̵̻̯͍͕̦̰̥̬͚͍ͨͥ̉̈́ͨ͂͗̎͌͊ͧ̈ͮ͑̅̅̇̔͞ – L. F. – 2019-07-18T03:16:44.170

PowerShell can do this too, 39 chars/43 bytes

$args-ireplace'([aeiouy]̈)|[̀-ͯ҉]','$1'

Try it online!

+1 to @tsh.

Readable versions:

$args-ireplace'([aeiouy]\u0308)|[\u0300-\u036f\u0489]','$1'

$args-ireplace"([aeiouy]$([char]0x308))|[$([char]0x300)-$([char]0x36f)$([char]0x489)]",'$1'

Andrei Odegov

Posted 2019-07-17T02:11:42.787

Reputation: 939

JavaScript (ES6), 51 chars/60 bytes

Complete golfness:

x=>x.replace(/([^   -ʷ -₿™̈]|(?![^aeiouy])̈)/giu,"")

Escaped golfness: ~~(As Unicode escape sequence doesn't exist, I write u+xxxx as <xxxx>.)~~

x=>x.replace(/([^\u0009-\u02b7\u2000-\u20bf\u2122\u0308]|(?![^aeiouy])\u0308)/giu,"")

Naruyoko

Posted 2019-07-17T02:11:42.787

Reputation: 459

Swift, 115 chars, 119 bytes (UTF-8)

import UIKit
let f={(s)in(""+s).replacingOccurrences(of:"([aeiouy]̈)|[̀-ͯ҉]",with:"$1",options:.regularExpression)}

(based on tsh regex)

Cœur

Posted 2019-07-17T02:11:42.787

Reputation: 401

Java (JDK), 120 50 bytes

s->s.replaceAll("(?i)([aeiouy]̈)|[̀-ͯ҉]","$1")

Try it online!

Based on Holger's comment and Benjamin Urquhart's comment, themselves based on tsh's regex.

Cœur

Posted 2019-07-17T02:11:42.787

Reputation: 401

The question explicitly says “You may assume that unicode normalization was not executed after zalgo generation”, so you don’t need to decompose the characters, just fix the input (as all other solutions also assume decomposed characters). See also this comment

– Holger – 2019-07-18T15:32:29.080

@Holger you're right, so I took your solution and made the post a community wiki. – Cœur – 2019-07-18T16:20:56.880

So it did work after all. Unicode is a strange thing. PS: I saw the comment – Benjamin Urquhart – 2019-07-18T17:22:13.647
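To make the normalization point concrete, the sketch below shows the two spellings of ö using Python's standard `unicodedata` module. The helper name `spellings` is hypothetical.

```python
import unicodedata


def spellings(ch: str) -> tuple[str, str]:
    # (composed form, decomposed form) of the same text
    return unicodedata.normalize("NFC", ch), unicodedata.normalize("NFD", ch)


composed, decomposed = spellings("\u00f6")
# composed is the single code point U+00F6; decomposed is "o" + U+0308.
print([hex(ord(c)) for c in composed], [hex(ord(c)) for c in decomposed])
```

A filter that only removes combining marks never sees anything to remove in the composed form, so no decomposition step is needed before filtering.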

05AB1E, 32 bytes

žAžHŸ776©KçKεÇ¤®Qi¬žOsçlå≠iн]˜çJ

This took longer than it should have, and I'm still not very happy with the result. There is definitely room for improvement.

Removes all characters with code-points in the range \$[512,65536]\$, except for vowels with diaereses.

Try it online.

Explanation:

žA                # Push builtin 512
  žH              # Push builtin 65536
    Ÿ             # Create a list with these ranges: [512,513,514,...,65534,65535,65536]
     776          # Push 776
        ©         # Store it in variable `®` (without popping)
         K        # Remove it from this list
          ç       # Convert all these integers to characters with these code-points
           K      # And remove all those characters from the (implicit) input-string
ε                 # Then map all remaining characters to:
 Ç                #  Get the code-points of the current character
  ¤               #  Get the last code-point (without popping the code-points themselves)
   ®Qi            #  And if it's equal to variable `®` (776):
      ¬           #   Get the first code-point (without popping the code-points themselves)
       žO         #   Push builtin "aeiouy"
         s        #   Swap to get the first code-point again
          ç       #   Convert it to a character
           l      #   Then to lowercase
            å≠i   #   And if it's NOT in the vowel-string:
               н  #    Pop the code-points, and only leave the first code-point
                  #  (implicit elses: keep the code-point list)
]                 # Close both if-statements and the map
 ˜                # Flatten the list of lists of code-points
  ç               # Convert these code-point integers back into characters
   J              # And join the character-list back together to a single string
                  # (after which the result is output implicitly)

Kevin Cruijssen

Posted 2019-07-17T02:11:42.787

Reputation: 67 575

Python, 63 bytes

Uses the tsh regex

lambda s:re.sub("(?i)([aeiouy]̈)|[̀-ͯ҉]","\\1",s)
import re

TIO

Benjamin Urquhart

Posted 2019-07-17T02:11:42.787

Reputation: 1 262

Operation Unz̖̬̜̺̬a͇͖̯͔͉l̟̭g͕̝̼͇͓̪͍o̬̝͍̹̻
