Is there a dictionary of visibly similar Unicode characters for Spam processing?

Question

I have spam that looks like this:

мy вυddy'ѕ мoм мαĸeѕ $74/нoυr oɴ тнe lαpтop. ѕнe нαѕ вeeɴ lαιd oғғ ғor ѕeveɴ мoɴтнѕ вυт lαѕт мoɴтн нer pαy cнecĸ wαѕ $19420 jυѕт worĸιɴɢ oɴ тнe lαpтop ғor α ғew нoυrѕ. нere'ѕ тнe ѕιтe тo reαd мore something.com

Obviously this is using Unicode to avoid spam detection. Are there any ASCII -> Unicode dictionaries that can assist in finding similar letters?

This is a variation of this Stack Overflow answer, but is specific to spam.

So now I realize that my previous comment wasn't exactly related to this topic. While searching, I found an [interesting paper about exactly what you're facing](http://research.sidstamm.com/papers/unicode-spam.pdf). — Adi, May 21 '13 at 11:58
Once you've applied the mechanism described in the SO answer, surely anything else is most easily managed via Bayesian filtering? — symcbean, May 21 '13 at 12:44
If you are specifically looking for English language filtering, why do you need a dictionary at all? Isn't the presence of non-Latin characters evidence of SPAM by itself? — AJ Henderson, May 21 '13 at 13:06
@AJHenderson Good observation, and yes, even mixed locales would be an indicator as well. I removed the English requirement to make this question useful to other people. — makerofthings7, May 21 '13 at 13:40
@makerofthings7 That's cool. Is there a generator that turns English letters into text like that example? — Nikos, Apr 15 '14 at 14:26

dr jimbob · Answer 1 · 2013-05-21T20:27:07.157

I would add to my spam classification algorithm something that detects multiple encodings (strictly speaking, multiple scripts) in the same word or sentence. E.g., lαѕт, which has a Latin l, a Greek/Coptic alpha, a Cyrillic dze, and a Cyrillic te, seems very suspicious. A very quick and dirty check can be done in Python using unicodedata:

>>> unicode_str = u"""мy вυddy'ѕ мoм мαĸeѕ $74/нoυr oɴ тнe lαpтop. ѕнe нαѕ 
вeeɴ lαιd oғғ ғor ѕeveɴ мoɴтнѕ вυт lαѕт мoɴтн нer pαy cнecĸ wαѕ $19420 jυѕт 
worĸιɴɢ oɴ тнe lαpтop ғor α ғew нoυrѕ. нere'ѕ тнe ѕιтe тo reαd мore something.com"""

>>> import unicodedata
>>> for uc in unicode_str:
        # name() raises ValueError for unnamed characters such as '\n', so pass a default
        print(uc, unicodedata.name(uc, '?').split(' ')[0], unicodedata.category(uc))
м CYRILLIC Ll
y LATIN Ll
  SPACE Zs
в CYRILLIC Ll
υ GREEK Ll
d LATIN Ll
d LATIN Ll
y LATIN Ll
' APOSTROPHE Po
ѕ CYRILLIC Ll
(...)

Here Ll means "lowercase letter", and we use only the first word of each letter's Unicode name as a rough label for its script. So then we can do something like

>>> def count_letter_encodings(unicode_str):
        unicode_block_set = set()
        for uc in unicode_str:
            category = unicodedata.category(uc)
            if category[0] == 'L':
                unicode_block = unicodedata.name(uc).split(' ')[0]
                unicode_block_set.add(unicode_block)  # set.add ignores duplicates
        return len(unicode_block_set)

>>> list(map(count_letter_encodings, unicode_str.split(' ')))
[2, 3, 2, 3, 3, 1, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3, 0, 3, 2, 1, 2, 3, 2, 1, 2, 3, 2, 2, 3, 2, 2, 2, 1]

The fact that there are several words with 3 different letter encodings is highly suspicious (even two is suspicious). There are other suspicious things as well, e.g., the soft hyphens (codepoint 0xad) alternating with single characters in the .com.

Pulling some brief foreign text samples off the web, you find that most of the time words should only have one encoding:

>>> greek_text = u'Ελληνας ϰαὶ δὴ ϰαὶ γράμματα'
>>> spanish_text = u'¿Cómo estas? Espero que todo esté bien. Hace bastantes años que estoy'
>>> russian_text = u'Все люди рождаются свободными и равными в своем достоинстве и правах. Они наделены'
>>> for text in (greek_text, spanish_text, russian_text):
        print(list(map(count_letter_encodings, text.split(' '))))
[1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Obviously, you would have to thoroughly test with machine learning as there may be benign circumstances where encodings are mixed, but this could be useful input to your spam classifier that's trained on a large dataset.
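
One way to turn the per-word counts into a single classifier input is the fraction of words that mix scripts. The following is a minimal Python 3 sketch, not part of the original answer: `script_of` and `mixed_script_ratio` are hypothetical names, and the first word of the Unicode character name is only a rough stand-in for the script.

```python
import unicodedata


def script_of(ch):
    """Rough script label: first word of the Unicode name, or None for non-letters."""
    if not unicodedata.category(ch).startswith("L"):
        return None
    name = unicodedata.name(ch, "")  # default avoids ValueError on unnamed characters
    return name.split(" ")[0] if name else None


def mixed_script_ratio(text):
    """Fraction of letter-containing words that mix two or more scripts."""
    words_with_letters = 0
    mixed = 0
    for word in text.split():
        scripts = {script_of(ch) for ch in word} - {None}
        if scripts:
            words_with_letters += 1
            if len(scripts) > 1:
                mixed += 1
    return mixed / words_with_letters if words_with_letters else 0.0
```

A ratio near zero is what the Greek, Spanish and Russian samples above would produce. A threshold for "suspicious" is something to learn from labelled data rather than hard-code.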

score 2 · Answer 2 · edited Jun 25 '16 at 20:55

I'd check for the use of multiple encodings in the same word (or sentence). That would be a dead giveaway for this kind of thing.

Otherwise, something like this could help you - unfortunately it's only a partial table, and you'd have to use it in reverse. The hex codes are UTF-8, so c3a0 means U+00E0 here.

a:  c3a0,c3a1,c3a2,c3a3,c3a4,c3a5,c3a6,c481,c483,c485
c:  c2a2,c3a7,c487,c489,c48b,c48d
d:  c48f,c491
e:  c3a8,c3a9,c3aa,c3ab,c493,c495,c497,c499,c49b
g:  c49d,c49f,c4a1,c4a3
h:  c4a5,c4a7
i:  c2a1,c3ac,c3ad,c3ae,c3af,c4a9,c4ab,c4ae,c4b0,c4ba
j:  c4b5
k:  c4b7,c4b8
l:  c4ae,c4af,c4ba,c4bc
n:  c3b1,c584,c586,c588,c589,c58b
o:  c3b0,c3b2,c3b3,c3b4,c3b5,c3b6,c3b8,c58d,c58f,c591,c593
p:  c3be
s:  c29a
u:  c2b5,c3b9,c3ba,c3bb,c3bc
x:  c397
y:  c3bd,c3bf
z:  c29e
A:  c380,c381,c382,c383,c384,c385,c386,c480,c482,c484
B:  c39f
C:  c387,c486,c488,c48a,c48c
D:  c390,c48e,c490
E:  c388,c389,c38a,c38b,c492,c494,c496,c498,c49a,c592
G:  c49c,c49e,c4a0,c4a2
H:  c4a4,c4a6
I:  c38c,c38d,c38e,c38f,c4a8,c4aa,c4ac
J:  c4b4
K:  c4b6
L:  c4b9,c4bb,c4bd,c4bf
N:  c391,c583,c585,c587
O:  c392,c393,c394,c395,c396,c398,c58c,c58e,c590,c592
P:  c39e
R:  c594
r:  c595
S:  c28a
U:  c399,c39a,c39b,c39c
Y:  c29f,c39d
Z:  c28e
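
Using the table "in reverse" could look roughly like the Python 3 sketch below. It decodes each UTF-8 hex code into its character and maps that character back to the ASCII letter. `TABLE_EXCERPT`, `build_reverse_map` and `fold_to_ascii` are hypothetical names, and only a few rows of the table are copied in for illustration.

```python
# Excerpt of the table above: ASCII letter -> UTF-8 hex codes of lookalikes.
TABLE_EXCERPT = {
    "a": ["c3a0", "c3a1", "c3a2"],
    "d": ["c48f", "c491"],
    "k": ["c4b7", "c4b8"],
}


def build_reverse_map(table):
    """Map each lookalike character back to its ASCII letter."""
    reverse = {}
    for ascii_letter, hex_codes in table.items():
        for code in hex_codes:
            lookalike = bytes.fromhex(code).decode("utf-8")  # e.g. 'c3a0' -> U+00E0
            # Some codes are listed under two letters (e.g. c4ae); keep the first.
            reverse.setdefault(lookalike, ascii_letter)
    return reverse


def fold_to_ascii(text, reverse):
    """Replace every known lookalike with its ASCII letter; leave the rest alone."""
    return "".join(reverse.get(ch, ch) for ch in text)
```

With the excerpt above, `fold_to_ascii("m\u00e0\u0138e", build_reverse_map(TABLE_EXCERPT))` yields "make".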

On second thought, you'd probably also need a list of "ignore-me" characters that can be inserted into a string to make it different while looking the same, for example U+0082. And now that I think about it, this could be used to defeat the "at most two encodings in each sentence" rule. A word such as "déjà vu" can be used legitimately (I remember seeing it come out of a Mac editor), but combining accents such as U+0300 or U+0301 can be used to make "Víágŕa" look like something else altogether to a filter.

So first all "combinings" should be removed, then some legitimate characters must be ignored (e.g. the ellipsis - Word processors adore it... and the various styles of quotes). Finally encodings can be counted, or you can replace characters with their OCR lookalikes as above.
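
The first two steps could be sketched in Python 3 as follows. This is only an illustration under stated assumptions: `IGNORED` is a hypothetical, far from complete allowlist, and `strip_combining` and `prepare` are made-up names.

```python
import unicodedata

# Hypothetical allowlist of typographic characters that word processors insert:
# ellipsis, single curly quotes, double curly quotes.
IGNORED = {"\u2026", "\u2018", "\u2019", "\u201c", "\u201d"}


def strip_combining(text):
    """Decompose to NFD, then drop combining marks, so 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def prepare(text):
    """Remove combining marks, then drop the allowlisted typographic characters."""
    return "".join(ch for ch in strip_combining(text) if ch not in IGNORED)
```

After this step "Víágŕa" compares equal to "Viagra". Letters with no canonical decomposition, such as ø or đ, pass through unchanged, so they still need a lookalike table.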

c3a0 is the hex encoding for the character otherwise known as U+00E0 (http://www.fileformat.info/info/unicode/char/00e0/index.htm). In the same way c48f is 0xC4 0x8F, or U+010F http://www.fileformat.info/info/unicode/char/10f/index.htm and is visually similar to "d" (except for the caron). I don't think there's a "general pattern" to letter assignments, unfortunately. — LSerni, Jun 25 '16 at 20:30
You're welcome. I'm following your originating question; if I can I'll try and contribute. — LSerni, Jun 25 '16 at 21:00

score 0 · Answer 3 · answered Nov 26 '21 at 07:56

The Unicode Consortium publishes such a collection as part of their database. https://util.unicode.org/UnicodeJsps/confusables.jsp

E.g. https://pypi.org/project/homoglyphs/ simply downloads this data in machine-readable form and creates an in-memory mapping which allows you to do things like

Python 3.8.2 (default, May 18 2021, 11:47:11) 
[Clang 12.0.5 (clang-1205.0.22.9)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import homoglyphs
>>> hg = homoglyphs.Homoglyphs(strategy=homoglyphs.STRATEGY_LOAD)
>>> hg.to_ascii("ⅰΙе")
['trip1eee', 'tripIeee', 'tripleee', 'trip|eee']

Unfortunately, out of the box, it does not work very well at all with your sample data. I have found that obfuscations in the wild often seem to use code points which are not in the Unicode confusables mapping.

>>> hg.to_ascii("мy вυddy'ѕ мoм")
[]
>>> hg.to_ascii("something.com")
[]
>>> hg.to_ascii("lαpтop")
[]
>>> hg.to_ascii("oғғ")
[]

The Python library I found is retired, though there is a fork which attempts to revive it. Perhaps they will be able to improve the performance on real-world obfuscated data (or better yet convince Unicode to extend their database of confusables).
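
If the library gets in the way, the data file itself is simple enough to read directly. Below is a minimal Python 3 sketch, assuming the `source ; target ; type # comment` layout of Unicode's confusables.txt with hexadecimal code points. The two lines in `SAMPLE` are illustrative entries written in that layout, and `parse_confusables` and `skeleton` are hypothetical names. This is a simplification, not the full skeleton algorithm from the Unicode security specification.

```python
SAMPLE = """\
# Illustrative lines in the confusables.txt layout: source ; target ; type # comment
0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
0441 ; 0063 ; MA # CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C
"""


def parse_confusables(lines):
    """Build a {source string: target string} mapping from confusables-style lines."""
    mapping = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping


def skeleton(text, mapping):
    """Replace each character that has a known confusable target."""
    return "".join(mapping.get(ch, ch) for ch in text)
```

When reading a downloaded copy of the real file, opening it with `encoding="utf-8-sig"` should take care of a leading byte order mark, if one is present.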

Actually e.g. U+0138 _ĸ_ is listed as confusable with regular _k_ in the Unicode database, so I guess there's something wrong with (my use of?) the Python library implementation. — tripleee, Nov 26 '21 at 08:09
However, that mapping is also not visible in https://util.unicode.org/UnicodeJsps/character.jsp?a=0138 ... Perhaps I just need to understand the database better. — tripleee, Nov 26 '21 at 08:19
