3

Suppose that we have this code (in TypeScript syntax):

function one(str: string): string {
  // do something with the string
  return str
}

function two() {
  let s = getSomeString() // returns some unknown string that may contain surrogates
  s = one(s)
  // ...
}

two()

Now suppose that when the string s is passed into the one(s) call, the runtime, instead of returning the str string with its WTF-16 content intact, returns a copy of str with some parts replaced (emphasis: it is not the implementation of one or two that does this replacement; it happens at the return str statement, where the runtime intervenes). In particular, if the string contains WTF-16 isolated surrogates (invalid UTF-16), these are replaced in the returned copy with the Unicode replacement character. It is important to note that this is not common behaviour, and most developers will not be aware of it until it happens.
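
For concreteness, here is a rough sketch (in the same TypeScript syntax) of the replacement I am describing. The sanitizeAtBoundary helper is my own stand-in for what the runtime would do internally at the return statement; it is not something the code above actually calls.

// Stand-in for the hypothetical runtime behaviour at the boundary: unpaired
// surrogates are replaced with U+FFFD, well-formed strings pass through as-is.
// (With the `u` flag, a valid surrogate pair is matched as a single astral
// code point, so only *lone* surrogates fall inside this character class.)
function sanitizeAtBoundary(str: string): string {
  return str.replace(/[\uD800-\uDFFF]/gu, '\uFFFD')
}

const wellFormed = 'abc\u{1F600}' // contains a valid surrogate pair
const illFormed = 'abc\uD800def'  // contains a lone (unpaired) high surrogate

sanitizeAtBoundary(wellFormed) === wellFormed // true: the string is untouched
sanitizeAtBoundary(illFormed) === illFormed   // false: a modified copy comes back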

Imagine that the runtime previously never performed any format conversions between function calls, but has recently started converting function arguments or return values in this way (without the function implementation knowing how its strings will be handled).

In such a runtime, could there be a security issue after the runtime has switched from never changing strings to now doing it sometimes? If so, what could happen?


In particular, the runtime I am thinking about is JavaScript interfacing with WebAssembly in the browser, in the potential near future, if the new "Interface Types" proposal passes a vote that would give string passing WTF-to-UTF sanitization.

What I could imagine as an example is a third-party library updating the implementation of one or more of its functions from JavaScript to WebAssembly. That would cause strings passed from JS to WebAssembly (or vice versa) to be replaced by sanitized copies that differ from their original form, causing unexpected errors or, in the worst case, a vulnerability.
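
As a concrete sketch of what I am worried about (normalizeKey, the session map, and getSession are all made up for illustration):

type Session = {userId: string}

// Hypothetical third-party helper. In version 1 it is plain JavaScript and the
// input string comes back byte-for-byte identical; in version 2 the same logic
// is compiled to WebAssembly, so the string crossing the boundary comes back
// as a sanitized copy.
function normalizeKey(key: string): string {
  return key.trim()
}

const sessions = new Map<string, Session>()

function getSession(rawKey: string): Session | undefined {
  // If the WebAssembly version returns "abc\uFFFD" where the JS version used
  // to return "abc\uD800", entries stored under the old form are silently
  // missed -- and two distinct raw keys ("abc\uD800" and "abc\uDC00") now
  // collapse onto the same sanitized key, potentially handing one caller
  // another caller's session.
  return sessions.get(normalizeKey(rawKey))
}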

Is there a potential problem here?

trusktr

2 Answers

0

It's a bit difficult to reason about occasional string conversion at arbitrary points of a program: it's important to have a clear language specification with no undefined behavior.

Undefined behavior itself may lead to a security issue.

WTF-16 is not a real encoding; it's "maybe ill-formed UTF-16": either legacy UCS-2 (plain 16-bit values) or well-formed UTF-16 (16-bit code units, with proper surrogate pairs used to encode the whole of Unicode).

(Surrogates are the 16-bit integers in the range 0xD800-0xDFFF.)

JavaScript "treats" strings as well-formed UTF-16 by default, and if the string contains unpaired surrogates (ill-formed) then it treats such 16-bit integers as UCS-2 (as just some 16-bit Unicode code points).

"Conversion of WTF-16 to UTF-16" would probably mean conversion of those 16-bit unpaired surrogates ("plain" 16-bit values) to properly encoded UTF-16 surrogate pairs, in that case, the string interpretation is not changed, but the length of the string changes.

From here, "WTF-to-UTF" narrows down into 4 cases:

  1. String isn't changed. String interpretation is not changed either.
  2. String content changes having the same length. We could imagine replacing an unpaired surrogate on 0xFFFD (the Unicode Replacement Character).
  3. Making a longer string: conversion of some 16-bit integers from UCS-2 to UTF-16. (Converting to 16-bit surrogate pairs.) String visual interpretation won't be changed either as well.
  4. Making a shorter string: dropping unpaired surrogate out.

And from here, the following issues arise from the string modification:

  1. If you have code that is bound to the expected length of the string (or to its expected content), then it may lead to a security issue (unsafe code in the WebAssembly, or an unsafe condition in JavaScript, e.g. "if string length >= 12 then authorize"). A sketch of such a mismatch follows this list.
  2. ELSE: no security issue.
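
A minimal sketch of that first case, assuming the "dropping" conversion (case 4 above); dropLoneSurrogates and storeInWasmModule are made up here to simulate the hypothetical boundary:

// Simulates the hypothetical boundary conversion, case 4: unpaired surrogates
// are dropped, so the copy that crosses the boundary is shorter.
function dropLoneSurrogates(str: string): string {
  return str.replace(/[\uD800-\uDFFF]/gu, '')
}

function storeInWasmModule(password: string): void {
  // stand-in for the call into the WebAssembly module
}

function setPassword(raw: string) {
  // The policy check runs on the string the caller sees...
  if (raw.length < 12) throw new Error('password too short')
  // ...but the copy that actually crosses the boundary is shorter/different,
  // so the stored credential no longer satisfies the policy, and no longer
  // matches what the user will type next time.
  storeInWasmModule(dropLoneSurrogates(raw))
}
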
Alexander Fadeev
    “ELSE: no security issue” — why? There are other reasons why string conversion could be problematic. I don't know if those other reasons apply in this context but if they don't it requires explanation. For example, do WTF and UTF have different invalid byte strings? Could the conversion trick input sanity filters, the way UTF-8 can bypass SQL parameter validation? Can the non-canonicity cause unintended collisions, such as user registration treating AE and Æ as distinct user names and therefore allowing Æ as a new account, which is then granted the permissions of the AE account? – Gilles 'SO- stop being evil' Jul 29 '21 at 22:06
  • I extended my answer – Alexander Fadeev Jul 30 '21 at 10:03
  • You can't, by definition, convert an unpaired surrogate to a properly encoded UTF-16 surrogate pair - there is no "correct" interpretation of them, they are as meaningless as a string that isn't a multiple of 16 bits long. As the question says (but possibly didn't when you wrote this?) the only meaningful thing to do is replace them with the Unicode Replacement Character (U+FFFD) which does not need surrogates to represent, so would not change the length of the string. – IMSoP Jul 30 '21 at 12:44
  • @IMSoP There is one possible interpretation of the unpaired surrogate value: interpret it as a Unicode code point and encode it as a proper UTF-16 surrogate pair. Let's say, take an unpaired value 0xD999 and convert it to a pair of values (0xD8xx 0xDCxx). I wouldn't judge it as "correct" or "not correct" though. – Alexander Fadeev Jul 30 '21 at 14:18
  • @AlexanderFadeev That would just be taking one ill-formed UTF-16 string and producing a different ill-formed UTF-16 string: there is no such code point as U+D999, and never will be; just as there will never be a code point U+123456, the highest assignable being U+10FFFF. Specifically, [the Unicode Standard](https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf) would designate any attempt to represent either of those as "a code unit sequence that would correspond to a code point outside the defined range of Unicode scalar values". – IMSoP Jul 30 '21 at 14:58
  • Can string length actually change? Is it possible to make an example of this in JS, where passing an invalid string to a built-in API that sanitizes it will result in `str.length` being different from the original? – trusktr Jul 30 '21 at 17:36
  • @IMSoP You're basically right :) I can see only 3 options now: 1) no conversion at all, keeping the string "as is" 2) replacing the unpaired surrogate with FFFD 3) dropping the unpaired surrogate, shrinking the string length. I wouldn't personally expect either of the last 2 options in the real world. Actually, unless the author specifies the concrete conversion rule, the question cannot be answered properly. – Alexander Fadeev Jul 30 '21 at 17:37
  • The replacement method is the one in question in particular. – trusktr Aug 05 '21 at 23:48
0

This is similar to much older issues of whether a function or API is "8-bit clean" (as opposed to assuming 7-bit ASCII, and potentially mangling or resetting the high bit) or "binary-safe" (as opposed to interpreting a NUL byte as a terminator, or similar control-character behaviours).

In order to process a sequence of binary "words" (most commonly, we work with bytes of 8 bits, but iterating a string 16 bits at a time is no different in principle), you can do one of several things:

  • Treat the input as an opaque binary sequence, which must not be manipulated. This is obviously important when it actually represents non-textual data, such as an image or executable code. The main security risk here is accidentally passing it to an operation which interprets it differently, since you can't make any guarantees about its content.
  • Assume that the input forms a valid text string according to some encoding convention (e.g. "the high bit of every byte will be zero", "there will be exactly one NUL character, which marks the end of the string", "the sequence of 16-bit words will be a well-formed UTF-16 code unit sequence"). The security concern here should be obvious: if the assumption is incorrect, the resulting behaviour is undefined, and might lead to exploitable side effects.
  • Validate that the input forms a valid text string according to some encoding convention, and signal the caller if it is not. The security risk here is that the caller may not correctly handle the error signal, particularly if it is added to an existing API, leading to undefined behaviour elsewhere.
  • Sanitise the input so that it conforms to a particular encoding convention. For instance, a 7-bit-only API might clear the high bit on every byte; a UTF-16-only API might replace invalid code units or sequences with the Unicode Replacement Character (U+FFFD). As with validation, the risk is that the caller is not expecting this sanitisation to take place; it might for instance try to detect differences between the input and output, and see false positives when the sanitisation takes place. (A sketch of these last two strategies follows this list.)
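
A rough sketch of those last two strategies in TypeScript (the function names are invented for illustration):

// Strategy 3: validate and signal, leaving the input untouched.
// (With the `u` flag, only *unpaired* surrogates match this class.)
function assertWellFormedUtf16(units: string): void {
  if (/[\uD800-\uDFFF]/u.test(units)) {
    throw new RangeError('lone surrogate in input') // the caller must handle this
  }
}

// Strategy 4: sanitise, silently returning a conforming copy.
function sanitiseUtf16(units: string): string {
  return units.replace(/[\uD800-\uDFFF]/gu, '\uFFFD')
}

// The risk described above: a caller that compares input and output sees a
// "tampering" false positive precisely when sanitisation did its job.
function looksTampered(input: string): boolean {
  return sanitiseUtf16(input) !== input
}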

The risks in each case all basically boil down to a mismatch between the expectations of the API and its caller. If the original implementation of a function incidentally left the binary sequence intact, a caller may rely on that in a context where it matters; a "fix" to the implementation that sanitises the sequence would then break that caller, in potentially dangerous ways.

IMSoP