
I'm trying to detect homograph attacks and other attacks where an attacker uses a spoof domain name that looks visually similar to a trusted domain name (e.g., bankofthevvest.com instead of bankofthewest.com).

Is there a dictionary or database of visually similar characters that's suitable for programmatic use?

For example, if I look up "l", I'd like to get back a list indicating that "l" is visually similar to "1" and "i" (at least in some fonts). If I look up "w", it might tell me that it's visually similar to "vv" (in some fonts). If I look up "d", it might tell me that it's visually similar to "cl" (in some fonts). At least for now, my focus is on visual similarity between ASCII characters. It's fine to ignore Unicode. (However, it's an extra bonus if there's a list that also knows about which Unicode characters are visually similar to each ASCII character.)

If such a thing already exists, I'd like to avoid re-inventing the wheel. Does such a list already exist?

Here's what I've found so far:

D.W.

2 Answers


There are several approaches to homograph attacks, and how well each one works depends on the font in use. For example, in some fonts the lowercase letter l looks very much like the uppercase letter I; in others the two are easy to tell apart.

Similarities

Substitute a visually similar character for the real one.

  • b ⇔ 6
  • c ⇔ (
  • g ⇔ q, 9
  • C ⇔ (
  • G ⇔ 6
  • L ⇔ l, I, 1, |
  • O ⇔ 0
  • S ⇔ 5
  • V ⇔ U
  • Z ⇔ 2
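For programmatic use, the pairs above can be loaded into a symmetric lookup table. This is only a minimal sketch: the names `PAIRS`, `build_lookup` and `similar_to` are made up for illustration, and the table contains nothing beyond the entries listed in this answer.

```python
# Pairs copied from the list above; extend as needed for the fonts you care about.
PAIRS = [
    ("b", "6"), ("c", "("), ("g", "q"), ("g", "9"), ("C", "("), ("G", "6"),
    ("L", "l"), ("L", "I"), ("L", "1"), ("L", "|"),
    ("O", "0"), ("S", "5"), ("V", "U"), ("Z", "2"),
]


def build_lookup(pairs):
    """Map every character to the set of characters it can be confused with."""
    lookup = {}
    for a, b in pairs:
        # Register the pair in both directions so lookups are symmetric.
        lookup.setdefault(a, set()).add(b)
        lookup.setdefault(b, set()).add(a)
    return lookup


SIMILAR = build_lookup(PAIRS)


def similar_to(ch):
    """Return the known look-alikes of ch as a sorted list (empty if none)."""
    return sorted(SIMILAR.get(ch, set()))
```

Because the table is built in both directions, looking up "0" returns "O" just as looking up "O" returns "0".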

Sound Alteration Characters

Some languages, such as German, have characters with diacritics (e.g. umlauts). Under some circumstances these can be mistaken for the plain character:

  • a ⇔ ä, à, á
  • e ⇔ ë, è, é
  • i ⇔ ï, ì, í
  • o ⇔ ö, ò, ó
  • u ⇔ ü, ù, ú
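One possible way to catch this class of substitution is to strip the diacritics and compare the result with the original string. The sketch below uses Python's standard `unicodedata` module; the function names `strip_diacritics` and `has_diacritics` are hypothetical, and the approach only covers characters that decompose into a base letter plus combining marks.

```python
import unicodedata


def strip_diacritics(text):
    """Return text with combining marks (umlauts, accents, ...) removed."""
    # NFD splits e.g. "ä" into "a" followed by U+0308 COMBINING DIAERESIS.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def has_diacritics(text):
    """True if removing diacritics changes the string."""
    return strip_diacritics(text) != text
```

A name such as "bänk.example" would then reduce to "bank.example", which can be compared against a list of trusted names.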

Multi-letter

In some fonts, replacing a single letter with a sequence of letters is very effective.

  • a ⇔ ci
  • d ⇔ cl
  • g ⇔ cj
  • m ⇔ rn
  • A ⇔ fi
  • W ⇔ VV
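A rough way to check for multi-letter substitutions is to reduce both the candidate and the trusted name to a common "skeleton" and compare those. The sketch below is an assumption-laden illustration: `MULTI`, `skeleton` and `looks_like` are invented names, everything is lowercased because domain names are case-insensitive, and the case-dependent pair A ⇔ fi is left out for that reason.

```python
# Lowercased multi-letter sequences from the list above and the letter they imitate.
MULTI = {"ci": "a", "cl": "d", "cj": "g", "rn": "m", "vv": "w"}


def skeleton(name):
    """Collapse every known multi-letter sequence into the letter it imitates."""
    s = name.lower()
    for seq, single in MULTI.items():
        s = s.replace(seq, single)
    return s


def looks_like(candidate, trusted):
    """True if candidate differs from trusted but collapses to the same skeleton."""
    if candidate.lower() == trusted.lower():
        return False  # identical names are not spoofs of each other
    return skeleton(candidate) == skeleton(trusted)
```

Applying the same reduction to both names matters: a legitimate name may itself contain one of the sequences (the skeleton of "modern" is "modem"), so comparing a skeleton against an unreduced trusted name would miss matches.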

Constructions

Construct a single character from several other characters. A very popular example is vv in place of w.

  • A ⇔ /\
  • B ⇔ |3
  • D ⇔ |)
  • G ⇔ (¬
  • H ⇔ |-|
  • K ⇔ |<, |{
  • L ⇔ |_
  • M ⇔ |v|
  • N ⇔ |\|
  • V ⇔ \/

Injections

Injection involves inserting meaningless characters into a string, especially within a domain/url.

http://somewebsite.example ⇔ http://some-website.example

Whitespace is often overlooked in this case. The zero-width space (U+200B, &#8203;) is a useful tool under some circumstances.
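Injected characters can be found by scanning for invisible code points and by comparing names after the injected characters are removed. The following is a sketch only: `INVISIBLE`, `find_invisible` and `strip_injected` are hypothetical names, and the set of invisible characters is an assumption that goes beyond the single zero-width space mentioned above.

```python
# Assumed set of invisible code points; only U+200B is named in the answer itself.
INVISIBLE = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
}


def find_invisible(text):
    """Return (index, code point) pairs for every invisible character in text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ch in INVISIBLE]


def strip_injected(host):
    """Remove hyphens and invisible characters so injected variants compare equal."""
    return "".join(ch for ch in host if ch not in INVISIBLE and ch != "-")
```

Two host names whose `strip_injected` results match but whose raw forms differ, such as somewebsite.example and some-website.example, are candidates for closer inspection. Note that hyphens are also common in legitimate names, so a match is a hint rather than proof.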

Marc Ruef

Try looking under the term "Homoglyph" instead of "homograph".

For instance, this might be what you wanted:

https://codebox.net/pages/homoglyph-detection

It contains code and dictionaries.

J Kimball