I'd check for the use of multiple encodings in the same word (or sentence). That would be a dead giveaway for this kind of thing.
Otherwise, something like this could help you. Unfortunately it's only a partial table, and you'd have to use it in reverse (look up the odd character to find the plain letter it imitates). The hex codes are UTF-8 byte sequences, so c3a0 means U+00E0 here.
a: c3a0,c3a1,c3a2,c3a3,c3a4,c3a5,c3a6,c481,c483,c485
c: c2a2,c3a7,c487,c489,c48b,c48d
d: c48f,c491
e: c3a8,c3a9,c3aa,c3ab,c493,c495,c497,c499,c49b
g: c49d,c49f,c4a1,c4a3
h: c4a5,c4a7
i: c2a1,c3ac,c3ad,c3ae,c3af,c4a9,c4ab,c4ae,c4b0,c4ba
j: c4b5
k: c4b7,c4b8
l: c4ae,c4af,c4ba,c4bc
n: c3b1,c584,c586,c588,c589,c58b
o: c3b0,c3b2,c3b3,c3b4,c3b5,c3b6,c3b8,c58d,c58f,c591,c593
p: c3be
s: c29a
u: c2b5,c3b9,c3ba,c3bb,c3bc
x: c397
y: c3bd,c3bf
z: c29e
A: c380,c381,c382,c383,c384,c385,c386,c480,c482,c484
B: c39f
C: c387,c486,c488,c48a,c48c
D: c390,c48e,c490
E: c388,c389,c38a,c38b,c492,c494,c496,c498,c49a,c592
G: c49c,c49e,c4a0,c4a2
H: c4a4,c4a6
I: c38c,c38d,c38e,c38f,c4a8,c4aa,c4ac
J: c4b4
K: c4b6
L: c4b9,c4bb,c4bd,c4bf
N: c391,c583,c585,c587
O: c392,c393,c394,c395,c396,c398,c58c,c58e,c590,c592
P: c39e
R: c594
r: c595
S: c28a
U: c399,c39a,c39b,c39c
Y: c29f,c39d
Z: c28e
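To give an idea of what "using it in reverse" could look like, here is a minimal Python sketch. It only copies a few rows of the table; the names `LOOKALIKES`, `build_reverse_map` and `fold_lookalikes` are made up for illustration. Where a code appears under more than one letter (c4ae is listed under both i and l), this sketch simply keeps the first letter it sees, which is an arbitrary choice.

```python
# Excerpt of the table above: plain letter -> UTF-8 hex codes of lookalikes.
# (Hypothetical name; only a few rows are copied here.)
LOOKALIKES = {
    "a": "c3a0,c3a1,c3a2,c3a3,c3a4,c3a5",
    "e": "c3a8,c3a9,c3aa,c3ab",
    "i": "c3ac,c3ad,c3ae,c3af",
    "r": "c595",
}


def build_reverse_map(table):
    """Turn 'letter -> hex codes' into 'lookalike character -> letter'."""
    reverse = {}
    for base, codes in table.items():
        for code in codes.split(","):
            code = code.strip()
            if not code:
                continue  # tolerate stray trailing commas
            # The hex digits are UTF-8 bytes, e.g. c3a0 -> U+00E0.
            char = bytes.fromhex(code).decode("utf-8")
            # First letter wins if a code is listed under two letters.
            reverse.setdefault(char, base)
    return reverse


def fold_lookalikes(text, reverse):
    """Replace every known lookalike with the plain letter it imitates."""
    return "".join(reverse.get(ch, ch) for ch in text)


REVERSE = build_reverse_map(LOOKALIKES)

# "Víágŕa" written with precomposed characters (U+00ED, U+00E1, U+0155).
print(fold_lookalikes("V\u00ed\u00e1g\u0155a", REVERSE))
```

Characters that are not in the map pass through unchanged, so a partial table still does something useful.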
On second thought, you'd probably have to add a list of "ignore-me" characters that can be inserted into a string to make it different while looking the same, for example U+0082. And now that I think about it, this could be used to defeat the "at most two encodings in each sentence" check. A word such as "déjà vu" can be used legitimately (I remember seeing it come out of a Mac editor), but a combining accent such as U+0300 can be used to make "Víágŕa" look like something else altogether to a filter.
So first all combining characters should be removed, then some legitimate characters must be ignored (e.g. the ellipsis, which word processors adore, and the various styles of quotes). Finally the encodings can be counted, or you can replace characters with their OCR lookalikes as above.
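Those steps could be sketched roughly like this in Python. Several things here are my own assumptions: the `IGNORABLE` set is just a small sample, control and format characters stand in for the "ignore-me" list, and since the standard library has no script property I approximate the "encoding" of a letter by the first word of its Unicode name (LATIN, CYRILLIC, GREEK...). Treat it as a heuristic, not a finished filter.

```python
import unicodedata

# Assumption: a tiny sample of legitimate characters to ignore
# (ellipsis and curly quotes); a real list would be longer.
IGNORABLE = {"\u2026", "\u2018", "\u2019", "\u201c", "\u201d"}


def clean(text):
    """Drop combining marks, ignorable punctuation and invisible characters."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("M"):
            continue  # combining marks such as U+0300 or U+0301
        if ch in IGNORABLE:
            continue  # legitimate typographic characters
        if category in ("Cc", "Cf") and not ch.isspace():
            continue  # "ignore-me" characters such as U+0082
        kept.append(ch)
    return "".join(kept)


def scripts_in(word):
    """Approximate the set of scripts in a word from Unicode names."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])  # e.g. "LATIN", "CYRILLIC"
    return scripts


def suspicious_words(text):
    """Return the words that mix more than one script."""
    return [w for w in clean(text).split() if len(scripts_in(w)) > 1]


# U+0430 is the Cyrillic letter that looks like a Latin "a".
print(suspicious_words("cheap Vi\u0430gra d\u00e9j\u00e0 vu"))
```

Note that "déjà vu" with precomposed accents is not flagged, because its letters are all Latin; only the word that mixes Latin and Cyrillic is. The lookalike replacement could then be run on the output of `clean`.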