List of visually similar characters, for detecting spoofing and social engineering attacks

Question

I'm trying to detect homograph attacks and other attacks where an attacker uses a spoof domain name that looks visually similar to a trusted domain name (e.g., bankofthevvest.com instead of bankofthewest.com).

Is there a dictionary or database of visually similar characters that's suitable for programmatic use?

For example, if I look up "l", I'd like to get back a list indicating that "l" is visually similar to "1" and "i" (at least in some fonts). If I look up "w", it might tell me that it's visually similar to "vv" (in some fonts). If I look up "d", it might tell me that it's visually similar to "cl" (in some fonts). At least for now, my focus is on visual similarity between ASCII characters. It's fine to ignore Unicode. (However, it's an extra bonus if there's a list that also knows about which Unicode characters are visually similar to each ASCII character.)

If such a thing already exists, I'd like to avoid re-inventing the wheel. Does such a list already exist?

Here's what I've found so far:

I found the question "Is there a dictionary of visibly similar Unicode characters for Spam processing?", but it is focused on Unicode, and the answers there don't really solve this problem: they propose an alternate detection mechanism.
The following two research papers devise UC-SimList, a list of visually similar characters. However, it focuses on Unicode characters, and doesn't have similarity between ASCII letters (e.g., l vs 1, vv vs w).

Anthony Y. Fu, Xiaotie Deng, Liu Wenyin, Greg Little. The Methodology and an Application to Fight against Unicode Attacks. SOUPS 2005.

Anthony Y. Fu, Wan Zhang, Xiaotie Deng, Liu Wenyin. Safeguard against Unicode Attacks: Generation and Applications of UC-SimList. WWW 2006.

This is a website that finds visually similar characters. https://unicode.org/cldr/utility/confusables.jsp — TigerYT, Jul 24 '19 at 10:08

score 12 · Answer 1 · edited Aug 19 '18 at 16:24

There are several approaches to homograph attacks, and how well each one works depends on the font in use. For example, in some fonts the lowercase letter l looks very much like the uppercase letter I, while in others the two are easy to tell apart.

Similarities

Replace the real character with a single character that looks similar:

b ⇔ 6
c ⇔ (
g ⇔ q, 9
C ⇔ (
G ⇔ 6
L ⇔ l, I, 1, |
O ⇔ 0
S ⇔ 5
V ⇔ U
Z ⇔ 2
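
For programmatic use, the pairs above can be loaded into a symmetric lookup table. This is only a minimal sketch: the names SIMILAR, build_lookup and look_alikes are made up for illustration, and the table holds nothing beyond the pairs listed in this answer.

```python
# Single-character look-alikes, transcribed from the list above.
SIMILAR = {
    "b": ["6"],
    "c": ["("],
    "g": ["q", "9"],
    "C": ["("],
    "G": ["6"],
    "L": ["l", "I", "1", "|"],
    "O": ["0"],
    "S": ["5"],
    "V": ["U"],
    "Z": ["2"],
}


def build_lookup(table):
    """Make the relation symmetric: each character maps to everything it is paired with."""
    lookup = {}
    for char, alternatives in table.items():
        for alt in alternatives:
            lookup.setdefault(char, set()).add(alt)
            lookup.setdefault(alt, set()).add(char)
    return lookup


LOOKUP = build_lookup(SIMILAR)


def look_alikes(char):
    """Return the sorted list of characters paired with `char`, or [] if none are known."""
    return sorted(LOOKUP.get(char, set()))
```

Looking up "6" this way returns both "G" and "b", because the reverse direction is filled in automatically.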

Sound Alteration Characters

Some languages, such as German, have letters with diacritics (e.g. umlauts). In some circumstances these can pass for the plain letter:

a ⇔ ä, à, á
e ⇔ ë, è, é
i ⇔ ï, ì, í
o ⇔ ö, ò, ó
u ⇔ ü, ù, ú
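
One way to catch this class is to strip the diacritics before comparing names. The sketch below uses Python's standard unicodedata module; the function name strip_diacritics is made up for illustration. It only handles letters that decompose into a base letter plus combining marks, so characters without such a decomposition pass through unchanged.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose accented letters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With this, a name such as "bänk" folds to "bank" and can then be compared against the trusted name.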

Multi-letter

In some fonts, replacing one letter with a sequence of letters is very effective.

a ⇔ ci
d ⇔ cl
g ⇔ cj
m ⇔ rn
A ⇔ fi
W ⇔ VV
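
A possible way to check for these is to fold every multi-letter sequence to its single-letter look-alike in both the suspicious name and the trusted name, and then compare the results. This is a sketch under assumptions: MULTI covers only a lowercase subset of the pairs above, and skeleton and looks_like are hypothetical names. Folding both sides matters, because legitimate names also contain sequences such as "rn".

```python
# Multi-letter sequence -> single letter it can imitate (lowercase subset of the list above).
MULTI = {"vv": "w", "rn": "m", "cl": "d", "ci": "a", "cj": "g"}


def skeleton(name):
    """Fold each multi-letter sequence to the single letter it resembles."""
    folded = name.lower()
    for sequence, single in MULTI.items():
        folded = folded.replace(sequence, single)
    return folded


def looks_like(candidate, trusted):
    """True if the names differ but become identical once folded."""
    return (candidate.lower() != trusted.lower()
            and skeleton(candidate) == skeleton(trusted))
```

For example, looks_like("bankofthevvest.com", "bankofthewest.com") is true, while comparing a name with itself is false.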

Constructions

Build a single character out of several other characters. A popular example is vv in place of w.

A ⇔ /\
B ⇔ |3
D ⇔ |)
G ⇔ (¬
H ⇔ |-|
K ⇔ |<, |{
L ⇔ |_
M ⇔ |v|
N ⇔ |\|
V ⇔ \/
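
Constructions like these mostly show up in free text rather than in domain names, since characters such as | and \ are not allowed in hostnames. A rough sketch for decoding them is a regular expression that tries the longest constructions first, so that |\| is read as N rather than as a stray | followed by something else. The names CONSTRUCTIONS and decode are made up for illustration, and the table contains only the entries listed above.

```python
import re

# Construction -> letter it imitates, transcribed from the list above.
CONSTRUCTIONS = {
    "/\\": "A", "|3": "B", "|)": "D", "(¬": "G", "|-|": "H",
    "|<": "K", "|{": "K", "|_": "L", "|v|": "M", "|\\|": "N", "\\/": "V",
}

# Longest alternatives first, so three-character constructions win over two-character ones.
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(CONSTRUCTIONS, key=len, reverse=True))
)


def decode(text):
    """Replace every known construction in `text` with the letter it imitates."""
    return _PATTERN.sub(lambda match: CONSTRUCTIONS[match.group(0)], text)
```

Text that contains none of the constructions is returned unchanged.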

Injections

Injection involves inserting meaningless characters into a string, especially within a domain name or URL.

http://somewebsite.example ⇔ http://some-website.example

Whitespace is often overlooked in this case. The zero-width space (U+200B) can be useful to an attacker in some circumstances.
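
A simple check for injections is to strip the characters that are easy to overlook and see whether the result matches a trusted name. This is a minimal sketch: the set of invisible characters is a small, non-exhaustive selection, and the names INVISIBLE, strip_injected and same_after_stripping are hypothetical.

```python
# A few invisible code points: zero-width space, zero-width non-joiner,
# zero-width joiner, word joiner, zero-width no-break space.
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_injected(host):
    """Drop hyphens and the invisible characters listed above."""
    return "".join(ch for ch in host if ch != "-" and ch not in INVISIBLE)


def same_after_stripping(candidate, trusted):
    """True if the names differ but match once injected characters are removed."""
    return (candidate != trusted
            and strip_injected(candidate) == strip_injected(trusted))
```

Stripping is applied to both names, so a trusted name that legitimately contains a hyphen still compares correctly.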

score 11 · Answer 2 · answered Jun 27 '16 at 14:08

Try looking under the term "Homoglyph" instead of "homograph".

For instance, this might be what you wanted:

https://codebox.net/pages/homoglyph-detection

It contains code and dictionaries.

J Kimball
