23

This is truly crazy. I received a spam email containing a URL crafted from Unicode characters that, surprisingly, exist for italic/bold letters. When I reported it to Google's spam collector using Thunderbird's Report Spam Email feature, the characters had already been converted to ASCII letters, so the URL was not properly reported.

Here is the Unicode version: <base href="http://𝙪𝙯𝙣𝙙𝙧𝙚𝙨𝙨.COM">

Notice! These characters are bold/italic NOT because I selected to make them so, but because Unicode bizarrely contains bold/italic letters.

See the hex values here:

0011660   e   >   <   /   t   i   t   l   e   >   <   b   a   s   e  sp
       3e65    2f3c    6974    6c74    3e65    623c    7361    2065
      e   >   <   /   t   i   t   l   e   >   <   b   a   s   e    
0011700   h   r   e   f   =   "   h   t   t   p   :   /   /   p  gs  em
       7268    6665    223d    7468    7074    2f3a    f02f    999d
      h   r   e   f   =   "   h   t   t   p   :   /   / 360 235 231
0011720   *   p  gs  em   /   p  gs  em   #   p  gs  em  em   p  gs  em
       f0aa    999d    f0af    999d    f0a3    999d    f099    999d
    252 360 235 231 257 360 235 231 243 360 235 231 231 360 235 231
0011740   '   p  gs  em sub   p  gs  em   (   p  gs  em   (   .   C   O
       f0a7    999d    f09a    999d    f0a8    999d    2ea8    4f43
    247 360 235 231 232 360 235 231 250 360 235 231 250   .   C   O
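For anyone who wants to check the dump by hand: each run of 360 235 231 xxx in the octal row is one four-byte UTF-8 sequence (f0 9d 99 xx in hex). A minimal Python sketch, with the byte values transcribed from the dump above (the variable names are only for illustration), decodes them back to code points:

```python
# Bytes transcribed by hand from the octal row of the dump
# (360 235 231 xxx), written here in hex. This is an assumption-free
# decode, but the transcription itself is manual, not tool output.
raw = bytes.fromhex(
    "f09d99aa f09d99af f09d99a3 f09d9999 "
    "f09d99a7 f09d999a f09d99a8 f09d99a8"
)

# Each four-byte UTF-8 sequence decodes to a single code point.
decoded = raw.decode("utf-8")
code_points = [f"U+{ord(ch):04X}" for ch in decoded]
print(code_points)
```

The eight resulting code points all fall in the U+1D6xx range, well outside ASCII.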

Can a URL actually contain these Unicode characters, or will all browsers convert them to ASCII?

Whether ASCII or Unicode, ping resolves this to 185.86.76.164.

Why do these Unicode characters exist in the first place? Who ever requested bold/italic letters?

asker13
  • 341
  • 2
  • 6
  • 4
    Unicode has skin-tone modifiers for emojis. Why is font style so surprising? https://unicode.org/emoji/charts/full-emoji-modifiers.html – MonkeyZeus Aug 11 '21 at 13:31

3 Answers

45

Previous answers both tell part of the story here, but there are a few different aspects to understand.

Firstly, why do these code points exist? Unicode has the ambition to replace all previous ways of encoding text, which means it contains a lot of different types of script and symbol. Among those are things which look like letters (because they are) but are treated as symbols by mathematicians. For instance, U+211D DOUBLE-STRUCK CAPITAL R is the "ℝ" symbol used to represent "the set of all real numbers".

The code points used in your spam e-mail are from a block of these called Mathematical Alphanumeric Symbols.

Secondly, why do they get treated as "normal" letters in some contexts? Unicode defines a set of "normalization forms", because some natural characters can be represented more than one way with Unicode code points. For instance, "â" is code point U+00E2, but it can also be represented with "a" (U+0061) + the modifier U+0302 COMBINING CIRCUMFLEX ACCENT. "NFC" is a mapping which converts characters into "composed" forms where possible (e.g. [U+0061, U+0302] becomes U+00E2); "NFD" converts them into "decomposed" forms where possible (e.g. U+00E2 becomes [U+0061, U+0302]).
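As a quick illustration of the two forms, here is a small sketch using Python's standard `unicodedata` module; the variable names are only for illustration:

```python
import unicodedata

composed = "\u00e2"      # "â" as a single code point
decomposed = "a\u0302"   # "a" followed by COMBINING CIRCUMFLEX ACCENT

# The two strings render the same but are different code point sequences.
print(composed == decomposed)

# NFC composes where possible, NFD decomposes where possible.
nfc_result = unicodedata.normalize("NFC", decomposed)
nfd_result = unicodedata.normalize("NFD", composed)
print(nfc_result == composed, nfd_result == decomposed)
```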

In this case, there is no difference in representation between "NFD" and "NFC", but there is an additional normalisation called "NFKC", which uses "compatibility" mappings. These are one-way mappings that select more common code points which are equivalent in usage, such as "ffi" (U+0066, U+0066, U+0069) as a replacement for the combined ligature "ﬃ" (U+FB03) - or in the current case, a standard Latin "u" (U+0075) for the mathematical symbol "𝙪" (U+1D66A).
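The difference between the canonical and compatibility forms can be checked with the same `unicodedata` module. This is a sketch; the escapes below spell out the eight code points from the spam e-mail, and the variable names are mine:

```python
import unicodedata

# The eight Mathematical Alphanumeric Symbols from the e-mail, as escapes.
math_label = (
    "\U0001D66A\U0001D66F\U0001D663\U0001D659"
    "\U0001D667\U0001D65A\U0001D668\U0001D668"
)
ligature = "\uFB03"  # LATIN SMALL LIGATURE FFI

# NFC leaves these code points alone; only the "K" forms apply
# the compatibility mappings.
nfc_label = unicodedata.normalize("NFC", math_label)
nfkc_label = unicodedata.normalize("NFKC", math_label)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(nfc_label == math_label)
print(nfkc_label, nfkc_ligature)
```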

How does this relate to URLs? The standard for handling Unicode in domain names is called "IDNA", and is quite a complicated standard. The relevant parts I was able to find are these:

  • RFC 5890 specifies that all strings should be normalised according to NFC before use in domains. This would be relevant for some URLs, but not the code points we're looking at here.
  • RFC 5892 lists a number of code points as "DISALLOWED": a domain name containing those code points is simply not allowed to exist. That list includes the code points we're looking at ("1D552..1D6A5; DISALLOWED").
  • RFC 5894 clarifies that the disallowed code points are those which would change if they were normalised according to NFKC. It therefore suggests that user agents (e.g. browsers) might want to apply NFKC mappings on user input prior to treating it as a domain name.

So, as far as I can make out:

  • "𝙪𝙯𝙣𝙙𝙧𝙚𝙨𝙨.COM" is not a valid domain name
  • a browser encountering it is allowed to transform it to "uzndress.com" rather than displaying an error (just as it transforms the "COM" to lower-case "com")
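To make those two conclusions concrete, here is a rough sketch of what a user agent might do. It is a simplification of my reading above, not the actual RFC 5892 derivation (which has more rules than NFKC stability), and `unstable_under_nfkc` is a hypothetical helper name:

```python
import unicodedata

def unstable_under_nfkc(host):
    """Return the characters in host that NFKC would change."""
    return [ch for ch in host if unicodedata.normalize("NFKC", ch) != ch]

# The hostname from the e-mail, written with escapes.
raw_host = (
    "\U0001D66A\U0001D66F\U0001D663\U0001D659"
    "\U0001D667\U0001D65A\U0001D668\U0001D668.COM"
)

# Characters that change under NFKC mark the input as not directly usable.
offending = unstable_under_nfkc(raw_host)

# A lenient user agent may map the input instead of rejecting it.
mapped = unicodedata.normalize("NFKC", raw_host).lower()

print(len(offending), mapped)
```

After mapping, nothing in the result changes under NFKC any more, which is why applying it to user input cannot break a lookup of a valid name.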

As a final note, which you didn't ask about, but is worth discussing: why did the spam e-mail use this domain, if it's not valid? The reason is that if a spam filter looks only at the text of messages, without applying a mapping such as NFKC, different "spellings" of the same domain may not trip the filter. So using these code points is the same as writing "uZnDreSs.cOm" and hoping that the spam filter doesn't apply case folding.
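A toy version of such a filter shows the effect. Everything here is hypothetical (the blocklist, the function names, the substring matching); real spam filters are far more involved:

```python
import unicodedata

BLOCKLIST = {"uzndress.com"}  # hypothetical blocklist entry

def naive_match(text):
    """Substring match on the raw text, with no normalisation."""
    return any(domain in text for domain in BLOCKLIST)

def normalized_match(text):
    """Apply NFKC and case folding before matching."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return any(domain in folded for domain in BLOCKLIST)

spam_body = (
    '<base href="http://'
    "\U0001D66A\U0001D66F\U0001D663\U0001D659"
    "\U0001D667\U0001D65A\U0001D668\U0001D668"
    '.COM">'
)

print(naive_match(spam_body), normalized_match(spam_body))
print(naive_match("uZnDreSs.cOm"), normalized_match("uZnDreSs.cOm"))
```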

Note that this is a different issue than that of IDN homograph attacks where visually similar code points can be used in valid domain names, such as "еbаy.com", which looks like "ebay.com" but is actually a different domain, mixing Latin and Cyrillic letters. (NFKC does not convert Cyrillic to Latin, as they are different alphabets which happen to have some visually similar letters.)
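The difference is easy to see in code. In this sketch the first word of each character's Unicode name is used as a rough stand-in for its script; that is a shortcut of mine, not the real Unicode Script property:

```python
import unicodedata

# Cyrillic IE, Latin b, Cyrillic A, Latin y, then ".com"
lookalike = "\u0435b\u0430y.com"

# NFKC has no mapping from Cyrillic letters to Latin ones.
unchanged = unicodedata.normalize("NFKC", lookalike) == lookalike

# Rough script guess: first word of each letter's Unicode name.
scripts = {unicodedata.name(ch).split()[0] for ch in lookalike if ch.isalpha()}

print(unchanged, sorted(scripts))
```

A mix of scripts within one label is the kind of signal the browser heuristics look for, since normalisation alone cannot catch it.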

IMSoP
  • 3,780
  • 1
  • 15
  • 19
  • 1
    Great addition, combining and paraphrasing my answer and @Steffen Ullrich's… and a good deep dive into the motivation and definitions… ;) – LvB Aug 10 '21 at 14:19
  • 2
    It should also be noted that NFKC normalization was only mandated by the old IDNA2003 spec. This was changed in IDNA2008. Now normalization is "done by the applications themselves, possibly in a local fashion, before invoking the protocol." I'd actually consider it a security issue if an application still uses NFKC. – nwellnhof Aug 11 '21 at 10:35
  • @nwellnhof Ah, I didn't know it used to be mandated. My answer does reference the current standard though, which makes clear that applying NFKC is not considered harmful, because any input that it would change is not a valid domain to start with: "if an application chooses to perform NFKC normalization before lookup, that operation is safe since this will never make the application unable to look up any valid string". As far as IDNA is concerned, it's no different from decoding HTML entities or other context-relevant encodings - just an interpretation of user input. – IMSoP Aug 11 '21 at 11:42
  • An actual example of the IDN homograph attack: https://www.аррӏе.com/ (proof-of-concept demo created by a researcher, it's safe) - this one still works in Firefox, though it's fixed in most other browsers. – Jonas Czech Aug 11 '21 at 20:31
  • @JonasCz interestingly Firefox here (on Linux) looks like `appIe.com` (mocked up with Latin I, actually `www.аррӏе.com/` if I paste it) with serifs on the palochka in both the link preview and the address bar. It does the same with capital `I`. I don't know whether that's a careful choice in an otherwise sans serif font, or where it comes from, but it shows the difference better than in the font used here for code samples, and much better than the default body text font. The other Cyrillic characters look just like their Latin counterparts – Chris H Aug 12 '21 at 09:23
  • @JonasCz That's an interesting example, although I find the wording on the blog post somewhat misleading. Displaying that domain was never really a "bug" in any browser, since it's a perfectly valid set of Cyrillic letters, no more invalid than "xkcd.com", or "app1e.com". Saying that it's "fixed" is a bit of an exaggeration - what Chromium (on which pretty much everything other than Firefox and Safari is based) have done is implement some heuristics which catch that example, at the expense of some innocent Cyrillic domains not being displayed; but they will never be able to "solve" the issue. – IMSoP Aug 12 '21 at 10:31
26

What you have here are mathematical symbols, see output from unicode text analyzer:

Browser Codepoint Name # Fonts Script
U+1D66A MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL U 12 Common
U+1D66F MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL Z 12 Common
U+1D663 MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL N 12 Common
U+1D659 MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL D 12 Common
U+1D667 MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL R 12 Common
U+1D65A MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL E 12 Common
U+1D668 MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL S 12 Common
U+1D668 MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL S 12 Common

As far as Unicode is concerned, these symbols are considered equivalent to the respective "normal" characters, i.e. u, z, n, ... . When dealing with a URL containing Unicode, clients first perform such a normalization step; if the result still contains non-ASCII characters (not the case here), they convert it to Punycode.
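Roughly, that two-step process looks like the sketch below. It assumes NFKC as the normalization and uses Python's built-in "idna" codec, which implements the older IDNA2003 rules; `to_lookup_form` is a hypothetical name and this is not how any particular browser is implemented:

```python
import unicodedata

def to_lookup_form(host):
    """Normalize a hostname, then Punycode-encode it only if needed."""
    normalized = unicodedata.normalize("NFKC", host).lower()
    if normalized.isascii():
        # Nothing non-ASCII left, so no Punycode step is required.
        return normalized
    # Python's "idna" codec follows IDNA2003, used here only to illustrate.
    return normalized.encode("idna").decode("ascii")

spam_host = (
    "\U0001D66A\U0001D66F\U0001D663\U0001D659"
    "\U0001D667\U0001D65A\U0001D668\U0001D668.COM"
)

print(to_lookup_form(spam_host))
print(to_lookup_form("b\u00fccher.example"))
```

The spam domain comes out as plain ASCII after the first step, while a name with a genuine non-ASCII letter ends up with an "xn--" label.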

... it had already been converted to ASCII letters, therefore the URL was not properly reported

Since it was correctly normalized it is the actual relevant URL as a browser would access it. Thus it was properly reported.

But it is even more complicated than this fairly simple explanation. For the details, see the answer from IMSoP.

Wai Ha Lee
  • 113
  • 1
  • 7
Steffen Ullrich
  • 184,332
  • 29
  • 363
  • 424
  • 4
    Hmmm. I didn’t know browsers did the normalize operation…. Good to know. – LvB Aug 09 '21 at 22:23
  • 3
    There's something slightly more to this than Unicode equivalence. [This online tool](https://dencode.com/en/string/unicode-normalization) shows those code points as being converted to ASCII equivalents under NFKC, but not NFC and [RFC 5890](https://datatracker.ietf.org/doc/html/rfc5890#section-2.3.2.1) appears to require NFC. Instead, these code points are explicitly marked "DISALLOWED" in [RFC 5892](https://datatracker.ietf.org/doc/html/rfc5892). [RFC 5894](https://datatracker.ietf.org/doc/html/rfc5894) suggests that user agents are _allowed_ to use NFKC, but doesn't recommend it. – IMSoP Aug 10 '21 at 10:34
0

Unicode contains character sets for all kinds of reasons, like the Unicode italic r:

   1D667 MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL R Lowercase Letter

So in this case it's for a mathematical use case.

These will not be resolved as their ASCII values, but probably via RFC 5890 (IDNA) or with just URL encoding.

They wouldn't be translated to the nearest ASCII code point.

As to the why, you can read the minutes from the Unicode Consortium meetings regarding their acceptance.

LvB
  • 8,217
  • 1
  • 26
  • 43