Say we have server-side code that does whitelist validation of allowed file extensions when a user uploads a file:
// Java code
if (allowedExtensionsInLowercase.contains(fileExtension.toLowerCase())) {
    // File allowed
} else {
    // Booh, bad file
}
I was wondering whether an extension containing Unicode characters could be lowercased into a valid extension without actually being whitelisted, so I wrote a Java test and brute-forced it.
Running it, I found that the Unicode code point U+212A (K, decimal 8490), which is the Kelvin sign ( https://www.codetable.net/decimal/8490 ), is lowercased to the ASCII letter k. So I have a match! If I name a file evil.pK3 (with the Kelvin sign in place of the K) and .pk3 is an allowed extension, then my file will be accepted even though its extension is not actually whitelisted.
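To make the collision concrete, here is a minimal sketch of the bypass in isolation. The class name KelvinBypassDemo, the isAllowed helper and the whitelist contents are made up for illustration. It also passes Locale.ROOT to toLowerCase so the result does not depend on the machine's default locale, which is an assumption that differs from the check shown earlier.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class KelvinBypassDemo {
    // Hypothetical whitelist, stored in lowercase like allowedExtensionsInLowercase.
    static final Set<String> ALLOWED = new HashSet<>(Arrays.asList("pk3", "zip"));

    // Same logic as the server-side check, with an explicit locale (assumption).
    static boolean isAllowed(String fileExtension) {
        return ALLOWED.contains(fileExtension.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String ascii = "pK3";        // ordinary ASCII K (U+004B)
        String kelvin = "p\u212A3";  // KELVIN SIGN (U+212A) in place of the K

        // The two extensions are different strings...
        System.out.println("equal: " + ascii.equals(kelvin));
        // ...but both pass the lowercase whitelist check.
        System.out.println("ascii allowed: " + isAllowed(ascii));
        System.out.println("kelvin allowed: " + isAllowed(kelvin));
    }
}
```

Whether this matters in practice probably depends on what the rest of the pipeline does with the original, non-lowercased file name after the check has passed.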
But that's the only match I found, so I am wondering: are there other known Unicode characters that turn into ASCII when lowercased?
Is there a list of such "lowercasing collisions", maybe for different languages, charsets, or collations?
Java test used:
import org.junit.Test;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class UnicodeTest {

    @Test
    public void testLowercase() {
        // Single-character ASCII strings we hope to hit after lowercasing.
        final Set<String> lookingFor = new HashSet<>(
                // Arrays.asList("docdocxdwgpdfpptrartxtxlsxlsxzip".split("")));
                Arrays.asList("abcdefghijklmnopqrstuvwxyz0123456789".split("")));
        // Scan the non-ASCII part of the Basic Multilingual Plane.
        for (int i = 128; i < 0xffff; i++) {
            final String utf8 = new String(Character.toChars(i));
            // Uses the default locale, like the check being tested.
            final String lowercased = utf8.toLowerCase();
            // PenLog is my own trace logger (not shown); commented out so the test compiles standalone.
            // PenLog.trace(utf8 + " => " + lowercased);
            if (lookingFor.contains(lowercased)) {
                System.out.println(
                        "Unicode code point " + i + " (" + utf8 + ") is " + lowercased + " when lowercased!");
            }
        }
    }
}