I'm trying to detect homograph attacks and similar attacks in which an attacker uses a spoofed domain name that looks visually similar to a trusted domain name (e.g., bankofthevvest.com instead of bankofthewest.com).
Is there a dictionary or database of visually similar characters that's suitable for programmatic use?
For example, if I look up "l", I'd like to get back a list indicating that "l" is visually similar to "1" and "i" (at least in some fonts). If I look up "w", it might tell me that it's visually similar to "vv" (in some fonts). If I look up "d", it might tell me that it's visually similar to "cl" (in some fonts). At least for now, my focus is on visual similarity between ASCII characters, so it's fine to ignore Unicode. (However, it would be a bonus if the list also recorded which Unicode characters are visually similar to each ASCII character.)
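To make the kind of lookup I have in mind concrete, here's a minimal sketch in Python. The table is hand-written and contains only the examples above, and the names `SIMILAR`, `lookalikes`, and `spoof_variants` are made up for illustration; they don't come from any existing library or dataset.

```python
# Hypothetical, hand-written table: only the examples from this question.
# A real list would be much larger and probably font-dependent.
SIMILAR = {
    "l": ["1", "i"],
    "w": ["vv"],
    "d": ["cl"],
}


def lookalikes(ch):
    """Return the strings listed as visually similar to ch (empty list if none)."""
    return SIMILAR.get(ch, [])


def spoof_variants(domain):
    """Yield each string made by replacing one character of domain with a lookalike."""
    for i, ch in enumerate(domain):
        for alt in lookalikes(ch):
            yield domain[:i] + alt + domain[i + 1:]


if __name__ == "__main__":
    print(lookalikes("l"))
    print(list(spoof_variants("bankofthewest.com")))
```

With a table like this, I could generate the lookalike variants of each trusted domain and check whether a domain I'm examining is one of them. This sketch only substitutes one character at a time, and only in one direction (trusted character to lookalike).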
If such a thing already exists, I'd like to avoid re-inventing the wheel. Does such a list already exist?
Here's what I've found so far:
I found "Is there a dictionary of visibly similar Unicode characters for Spam processing?", but that question is focused on Unicode, and the answers there don't really answer this question: they propose an alternate detection mechanism.
The following two research papers devise UC-SimList, a list of visually similar characters. However, it focuses on Unicode characters and doesn't cover similarity between ASCII characters (e.g., l vs. 1, vv vs. w).
Anthony Y. Fu, Xiaotie Deng, Liu Wenyin, Greg Little. The Methodology and an Application to Fight against Unicode Attacks. SOUPS 2005.
Anthony Y. Fu, Wan Zhang, Xiaotie Deng, Liu Wenyin. Safeguard against Unicode Attacks: Generation and Applications of UC-SimList. WWW 2006.