Is there a list of Unicode symbols that match ASCII when lowercased (in Java)?

Question

Say we have server-side code that does whitelist validation of allowed file extensions when a user uploads a file:

// Java code
if (allowedExtensionsInLowercase.contains(fileExtension.toLowerCase())) {
 // File allowed
} else {
 // Booh, bad file
}

I was wondering whether a Unicode extension could be lowercased into a valid extension without actually being whitelisted, so I wrote a Java test and brute-forced it.

Running it, I found that U+212A (K, decimal 8490), the Kelvin sign ( https://www.codetable.net/decimal/8490 ), is lowercased to the ASCII letter k. So I have a match! If I name a file evil.pK3 (with the Kelvin sign in place of the ASCII K) and .pk3 is an allowed extension, my file is accepted even though its extension is not literally on the whitelist.
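To make the collision concrete, here is a minimal sketch of the check above being bypassed. The class name `KelvinBypassDemo`, the helper `isAllowed`, and the whitelist contents are made up for illustration; `Locale.ROOT` is passed explicitly only so the result does not depend on the machine's default locale, which differs from the no-argument `toLowerCase()` call in the snippet above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class KelvinBypassDemo {
    // Hypothetical whitelist, for illustration only.
    static final Set<String> ALLOWED = new HashSet<>(Arrays.asList("pk3", "pdf"));

    // Same shape as the check in the question: lowercase, then look up.
    static boolean isAllowed(String fileExtension) {
        return ALLOWED.contains(fileExtension.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String ascii = "pK3";        // plain ASCII capital K (U+004B)
        String kelvin = "p\u212A3";  // KELVIN SIGN (U+212A) in place of K

        System.out.println(ascii.equals(kelvin)); // false: different code points
        System.out.println(isAllowed(kelvin));    // true: U+212A lowercases to 'k'
    }
}
```

The two strings look the same in most fonts but are not equal, so anything downstream that compares the raw extension would treat them as different.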

But that's the only match I found, so I am wondering: are there other known Unicode characters that turn into ASCII when lowercased?

Is there a list of such "lowercasing collisions", maybe for different languages/charsets/collations?
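On the "different languages" part, one thing that can be probed directly is the locale passed to `toLowerCase`. The Javadoc for `String.toLowerCase(Locale)` lists U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) as lowercasing to a plain `i` under the Turkish locale, so the result of a scan may depend on the JVM's default locale. A minimal sketch, where the class name `LocaleLowercaseDemo` and the helper `lower` are made up for illustration:

```java
import java.util.Locale;

public class LocaleLowercaseDemo {
    // Lowercases a single code point under an explicit locale.
    static String lower(int codePoint, Locale locale) {
        return new String(Character.toChars(codePoint)).toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // U+0130 under the Turkish locale: documented to become ASCII 'i'.
        System.out.println("i".equals(lower(0x0130, turkish)));

        // U+0130 under the root locale: printed for comparison.
        System.out.println("i".equals(lower(0x0130, Locale.ROOT)));

        // U+212A (KELVIN SIGN) becomes ASCII 'k' under the root locale.
        System.out.println("k".equals(lower(0x212A, Locale.ROOT)));
    }
}
```

Running the brute-force test once per locale of interest, with the locale passed explicitly, would show whether any further collisions are locale-specific.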

Java test used:

import org.junit.Test;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class UnicodeTest {
    @Test
    public void testLowercase() {
        final Set<String> lookingFor = new HashSet<>(
//              Arrays.asList("docdocxdwgpdfpptrartxtxlsxlsxzip".split("")));
                Arrays.asList("abcdefghijklmnopqrstuvwxyz0123456789".split("")));

        // Scan every code point above ASCII, including the supplementary planes.
        for (int i = 128; i <= Character.MAX_CODE_POINT; i++) {
            final String original = new String(Character.toChars(i));
            // toLowerCase() uses the JVM default locale, like the code under test.
            final String lowercased = original.toLowerCase();
            if (lookingFor.contains(lowercased)) {
                System.out.println(String.format(
                        "Unicode code point U+%04X (%s) is %s when lowercased!", i, original, lowercased));
            }
        }
    }
}

Related question: why would you use an extension whitelist to determine allowed files? — Conor Mancone, Feb 07 '20 at 18:50
@ConorMancone First, to prevent users from mistakenly uploading a `docx` when the webapp can only deal with `pdf`, and in some other cases, to also avoid a user uploading an unsafe `xlsm` — Xenos, Feb 09 '20 at 12:36
The issue is that extensions are not a reliable way of determining file type. You can make a .exe file and rename it .pdf — Conor Mancone, Feb 09 '20 at 16:57
@ConorMancone Yes, but your exe won't be run as an exe (it will be shown as an invalid PDF). Anyway, the goal here is to find bypasses of Java's `toLowerCase`, no matter how good/bad you consider the context. Having found only 1 Unicode letter that does so feels very disappointing to me :c so I expected people to have found more : ) — Xenos, Feb 10 '20 at 09:30
