
Suppose there is a file-hosting server where most filenames are strings of mostly CJK characters. Transmitting those characters in HTTP GET requests requires encoding them as UTF-8 (roughly a 3x overhead per character) and then percent-escaping the bytes for the URL (another 3x, so 9x in total).
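For concreteness, this is roughly what the blow-up looks like (the filename is just a made-up example):

```typescript
// Hypothetical filename, mostly CJK, to illustrate the size blow-up.
const name = "技術報告書.pdf";                 // 9 characters
const escaped = encodeURIComponent(name);      // UTF-8 bytes, percent-escaped
console.log(escaped);                          // "%E6%8A%80%E8%A1%93...%E6%9B%B8.pdf"
console.log(name.length, escaped.length);      // 9 vs. 49: each CJK code point
                                               // becomes 3 UTF-8 bytes, i.e. 9 characters
```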

These filenames are processed by both the front end and the back end to identify resources and access their metadata.

The solution I've thought of is to convert the filenames to UTF-16 and then Base64-encode the result for transfer.
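A minimal sketch of what I have in mind (Node.js Buffer API; the function names are placeholders, not our actual code):

```typescript
// Proposed transport encoding: UTF-16LE, then URL-safe Base64.
// "base64url" needs a reasonably recent Node.js; otherwise plain base64
// plus manual '+', '/', '=' substitution would do the same job.
function encodeFilename(name: string): string {
  return Buffer.from(name, "utf16le").toString("base64url");
}

function decodeFilename(token: string): string {
  return Buffer.from(token, "base64url").toString("utf16le");
}

const token = encodeFilename("技術報告書.pdf");
console.log(token);                                     // goes into the GET request
console.log(decodeFilename(token) === "技術報告書.pdf"); // true
```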

Currently, the filepaths are verified in our back end against path-traversal attacks (the `..` special path component) and against characters in the C0 control block (including tab, newline, etc.), and that's it. If verification fails, an error notice is displayed.
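Roughly, the existing check looks like this (illustrative sketch only, not our real code):

```typescript
// Existing back-end verification, as described above (sketch only).
function verifyFilepath(path: string): boolean {
  // reject path traversal: any ".." path component
  if (path.split("/").includes("..")) return false;
  // reject characters in the C0 control block (NUL, tab, newline, ...)
  if (/[\u0000-\u001F]/.test(path)) return false;
  return true;
}
```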

My question is: assuming the result of decoding the "compressed" filename strings always goes through that back-end verification, what additional verification, sanitization, or other processing should I apply in the front end and the back end?

I'm asking here because I believe the subtleties of Base64 transcoding, UTF-16 surrogates, and Unicode normalization could bite me at some point.
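For example, I suspect I need something like the following after decoding, but I'm not sure this is the right or complete set of checks, or whether rejecting (rather than normalizing) is the right policy:

```typescript
// Possible post-decode checks (what I'm unsure about). isWellFormed() is
// ES2024 / Node 20+; an older environment would need a lone-surrogate regex.
function postDecodeChecks(decoded: string): boolean {
  // a malformed UTF-16 payload can decode to lone (unpaired) surrogates,
  // which break later re-encoding to UTF-8
  if (!decoded.isWellFormed()) return false;
  // require NFC so precomposed and decomposed forms map to a single key,
  // e.g. "é" as U+00E9 vs. "e" followed by U+0301
  return decoded === decoded.normalize("NFC");
}
```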

DannyNiu
  • Why is your title and some of your question about overhead? That's not a security concern. – schroeder May 28 '21 at 13:34
  • OWASP has guides for how to verify and sanitise paths. Have you looked up the OWASP options? – schroeder May 28 '21 at 13:38
  • @schroeder I followed the OWASP cheat sheets and determined that the checks in the "Currently ..." paragraph are what we need for now. What I'm seeking advice on is the additional considerations that may arise when transcoding from UTF-16 to UTF-8 and decoding Base64. – DannyNiu May 29 '21 at 01:14

0 Answers