Why is it dangerous when an attacker can control the `n` parameter to `memcpy()`?

Question

I was reading a paper that says the following piece of code has an information leakage vulnerability: it will leak memory layout information to attackers.

Could somebody please explain to me how this leaks information?

struct userInfo{
    char username[16];
    void* (*printName)(char*);
} user;
...
user.printName = publicFunction;
...
n = attacker_controllable_value; //20
memcpy(buf, user.username, n);   //get function ptr
SendToServer(buf);

I can see memcpy will give exception, but why should it return a memory address to the attacker (or whatever it is returning)?

Thanks in advance

This is also essentially the implementation bug that led to the big Heartbleed bug. The attacker controlled a length parameter that wasn't checked in any way, allowing the attacker to make the server send back big chunks of memory potentially containing secret keys, confidential data, passwords, and so on. — Linuxios, Jan 29 '15 at 16:59
"I can see `memcpy` will give exception" is exactly where your intuition fails. `memcpy` won't give any exceptions, it will happily copy as many bytes as you tell it to. (You might get an exception from your OS if you run into a non-memory-mapped address range, but that has nothing to do with the original array size). — Guntram Blohm, Jan 29 '15 at 17:04
@GuntramBlohm, yes, I learnt it's not giving any exception, thanks to Michael Kjorling — smttsp, Jan 29 '15 at 20:15
Out of curiosity, why did you previously think that memcpy would throw an exception? (and why did you think that C even had exceptions?) — user253751, Jan 30 '15 at 00:14
As @Linuxios said, this is essentially what led to the Heartbleed bug. xkcd has [an excellent visual representation](http://xkcd.com/1354/) of what happens when you use an untrusted length. — Bob, Jan 30 '15 at 01:16

haze · Accepted Answer · 2015-01-29T07:39:26.543

Assuming buf's size is either controlled by n or larger than 16, the attacker could make n any number he wanted and use that to read an arbitrary amount of memory. memcpy and C in general do not throw exceptions or prevent this from happening. So long as you don't violate any sort of page protections or hit an invalid address, memcpy would continue merrily along until it copies the amount of memory requested.
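As a rough illustration of the over-read, the sketch below reuses the struct from the question. The body of publicFunction and the helpers leaky_copy and extract_leaked_pointer are hypothetical, made up for this example. To stay within defined behaviour, it copies from the start of the struct (the same address as username) and refuses to go past the end of the struct; the vulnerable code has no such limit. The question's n = 20 corresponds to a 4-byte pointer; on a typical 64-bit build the struct is 24 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as the struct in the question. */
struct userInfo {
    char username[16];
    void *(*printName)(char *);
};

/* Hypothetical stand-in for the question's publicFunction. */
void *publicFunction(char *name) { return name; }

/*
 * Copies n bytes starting at the address of user->username, like the
 * vulnerable memcpy() call. Assumption for this sketch only: n is kept
 * inside the struct so the program stays well-defined.
 */
void leaky_copy(unsigned char *buf, const struct userInfo *user, size_t n)
{
    assert(n <= sizeof *user);
    memcpy(buf, user, n);   /* bytes past username[15] belong to printName */
}

/* Hypothetical attacker-side helper: pull the pointer out of the reply. */
uintptr_t extract_leaked_pointer(const unsigned char *buf)
{
    void *(*leaked)(char *);
    memcpy(&leaked, buf + offsetof(struct userInfo, printName), sizeof leaked);
    return (uintptr_t)leaked;
}
```

Calling leaky_copy(buf, &user, sizeof user) and then extract_leaked_pointer(buf) yields the run-time address of publicFunction, which is what SendToServer(buf) would hand to the attacker.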

I assume that user and this vulnerable block of code are in a function somewhere, which likely means user resides on the stack. All local variables of a function, the return address, and other bookkeeping information are stored on the stack. The diagram below shows its structure on systems using Intel (x86) assembly (which most platforms use and I assume your computer does).

[Figure: stack frame diagram]

You would be able to get the return address using this method if you were to make n large enough to cause memcpy to move forward in the stack frame. user would be in the section of the diagram labeled "Locally declared variables". EBP is a 4-byte value, so if we were to read past that and then copy the next 4 bytes with memcpy, we'd end up copying the return address.

Note that the above depends on what architecture the program is running on. This paper is about iOS, and since I don't know anything about ARM, the specifics of this information could be somewhat inaccurate.

Does it also mean that by changing `attacker_controllable_value` the attacker can move anywhere in the stack, as long as s/he doesn't violate any page protection? For example, can s/he get the `callee save registers` if s/he knows the size of the data above it? — smttsp, Jan 29 '15 at 07:23
The attacker would only be able to go up, since memcpy can't go backwards. Anything above where memcpy started reading from would be readable. — haze, Jan 29 '15 at 07:31

score 16 · Answer 2 · edited Mar 17 '17 at 13:14

A good answer has already been given by sasha, but I want to look at this from another angle; specifically, what memcpy actually does (in terms of what code gets executed).

Allowing for the possibility of minor bugs in this quick-and-dirty implementation, a trivial implementation of memcpy() that meets the C89/C99/POSIX function signature and contract might be something not entirely unlike:

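#include <stddef.h>  /* size_t */

/* copy n bytes starting at source+0, to target+0 through target+(n-1), all inclusive */
void *memcpy(void *target, const void *source, size_t n)
{
    /* void pointers cannot be dereferenced, so copy through byte pointers */
    unsigned char *t = target;
    const unsigned char *s = source;

    for (size_t i = 0; i < n; i++)
    {
        *t++ = *s++;
        /* or, without the increments, the here equivalent: t[i] = s[i]; */
    }
    return target;  /* the standard memcpy() returns the target pointer */
}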

Now, a real implementation would probably do the copying in larger chunks than one byte at a time to take advantage of the wide memory (RAM) interconnect buses of today, but the principle remains exactly the same.

For the purposes of your question, the important part to note is that there is no bounds checking. This is by design! There are three important reasons for why this is so:

1. C is often used as an operating system programming language, and it was designed as a "portable assembler". Thus, the general approach to many of the old library functions (of which memcpy() is one), and the language in general, is that if you can do it in assembler, it should also be doable in C. There are very few things you can do in assembler but not in C.
2. There is no way to, given a pointer to a memory location, know how much memory is properly allocated at that location, or even if the memory pointed to by the pointer is allocated at all! (A common trick to speed up software in the old days of early x86 systems and DOS was to write directly to the graphics memory to put text on the screen. The graphics memory, obviously, was never allocated by the program itself; it was just known to be accessible at a specific memory address.) The only way to really find out if it works is to read or write the memory and see what happens (and even then I believe accessing memory outside any allocated object invokes undefined behavior, so basically, the C language standard allows anything to happen).
3. Basically, arrays degenerate to pointers, where the unindexed array variable is the same thing as a pointer to the start of the array. This is not strictly true in every case, but it's good enough for us right now.
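The array-to-pointer point can be seen in a few lines. This is only a sketch; the function names size_seen_by_callee and size_seen_by_caller are invented for the illustration. Once an array has been passed to a function, all the function receives is an address, so sizeof reports the size of a pointer no matter how large the original array was.

```c
#include <assert.h>
#include <stddef.h>

/* Inside a callee, the parameter is just a pointer: the array's size is gone. */
size_t size_seen_by_callee(const char *p)
{
    assert(p != NULL);
    return sizeof p;          /* size of a pointer, not of the array */
}

/* In the scope where the array is declared, its size is still known. */
size_t size_seen_by_caller(void)
{
    char username[16];
    return sizeof username;   /* 16 bytes */
}
```

A function like memcpy() is in the position of size_seen_by_callee(): a 16-byte array and a 4096-byte array look identical to it, which is why the byte count has to be supplied separately.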

It follows from (1) that you should be able to copy any memory you want to, from anywhere to anywhere. Memory protection is Someone Else's Problem. Specifically, these days it's the responsibility of the OS and the MMU (now generally part of the CPU); the relevant portions of the OS are themselves likely written in C...

It follows from (2) that memcpy() and friends need to be told exactly how much data to copy, and they have to trust that the buffer at the target (or whatever else is at the address pointed to by the target pointer) is sufficiently large to hold that data. Memory allocation is The Programmer's Problem.

It follows from (3) that we can't tell how much data is safe to copy. Making sure memory allocations (both source and destination) are sufficient is The Programmer's Problem.

When an attacker can control the number of bytes to copy using memcpy(), (2) and (3) break down. If the target buffer is too small, whatever follows it will be overwritten. If you are lucky, that will result in a memory access violation, but neither the C language nor its standard library guarantees that this will happen. (You asked it to copy memory contents, and it either does that, or it dies trying, but it doesn't know what was intended to be copied.) If you pass a source array that is smaller than the number of bytes you ask memcpy() to copy, there is no reliable way for memcpy() to detect that, and it will happily barge on past the end of the source array as long as reading from the source location and writing to the target location works.

If an attacker can control n in your example code so that n is larger than the size of the array on the source side of the copy, memcpy() will, because of the above points, happily keep copying beyond the end of the intended source array. This is basically the Heartbleed attack in a nutshell.
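Since memcpy() cannot check anything itself, the check has to happen in the calling code, where the buffer sizes are still known. A minimal sketch of that idea follows; checked_copy is a hypothetical helper written for this example, not a standard library function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical wrapper: copy n bytes only if n fits in both buffers.
 * The caller passes the real sizes, because memcpy() cannot discover them.
 */
bool checked_copy(void *dst, size_t dst_size,
                  const void *src, size_t src_size, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (n > dst_size || n > src_size)
        return false;     /* refuse: would read or write out of bounds */
    memcpy(dst, src, n);
    return true;
}
```

With a 16-byte username as the source, a request for n = 16 succeeds, while an attacker-supplied n = 20 is rejected before any byte is copied.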

That is why the code leaks data. Exactly what data is leaked depends both on the value of n and how the compiler lays out the machine language code and data in memory. The diagram in sasha's answer gives a good overview, and every architecture is similar but different.

Depending on how exactly your variable buf is declared, allocated and laid out in memory, you might also have what is known as a stack smashing attack, where data needed for the proper operation of the program is overwritten, and the data that overwrote whatever was there is subsequently referred to. In mundane cases this leads to crashes or nigh-impossible-to-debug bugs; in severe, targeted cases, it can lead to arbitrary code execution fully under the control of the attacker.

As a side comment, the creation of the scripting languages came about because of this constant need for memory management. — munchkin, Jan 29 '15 at 11:34
@munchkin I believe you meant to say *memory managed* languages. — user, Jan 29 '15 at 15:16
@munchkin Wikipedia: "Ada is a structured, statically typed, imperative, wide-spectrum, and object-oriented high-level computer programming language, extended from Pascal and other languages." Sounds quite far from most scripting languages. I doubt I'd call Lisp a "scripting language" either, and I really doubt that anyone did in the 1950s. And it's not like those would be the only counterexamples, either. — user, Jan 29 '15 at 15:50
yes, the concept of scripting probably wasn't invented back then. — munchkin, Jan 29 '15 at 15:54
I doubt many languages were created *specifically because* of memory management (except for Rust), but it's been incorporated into many languages that were created for other primary reasons. — user253751, Jan 30 '15 at 01:51

score 6 · Answer 3 · edited Mar 17 '17 at 13:14

I am posting another answer because the two answers here, although both correct, miss an important point of the question, in my opinion. The question is about the information leak concerning memory layout.

The presented memcpy might always have a correctly sized output buffer, so even if the attacker controls the size, there might be no risk of stack smashing at this point. Leaking information (as in Heartbleed, as already mentioned by Linuxios) is a potential problem, depending on what information is leaked. In this example, you are leaking the address of publicFunction. This is a real problem, because it defeats Address Space Layout Randomization (ASLR). ASLR is discussed, for example, in "How do ASLR and DEP work?". As soon as you publish the address of publicFunction, the addresses of all other functions in the same module (DLL or EXE file) are effectively published too, and can be used in return-to-libc or return-oriented-programming attacks. You need a different hole than the one presented here for those attacks, though.
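The reason one leaked address exposes the whole module is simple arithmetic: ASLR moves the module as a unit, so the distance between two functions inside it is the same in every run and can be read from the attacker's own copy of the binary. A sketch of that calculation follows; the helper names and every number in the usage note are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Load address of the module = leaked run-time address - the function's fixed offset. */
uintptr_t module_base(uintptr_t leaked_addr, uintptr_t known_offset)
{
    assert(leaked_addr >= known_offset);
    return leaked_addr - known_offset;
}

/* Run-time address of any other function in the same module. */
uintptr_t locate(uintptr_t base, uintptr_t other_offset)
{
    return base + other_offset;
}
```

For example, if the leaked address of publicFunction were 0x5a3c1230 and its offset in the file 0x1230, the module would be loaded at 0x5a3c0000, and a function at offset 0x2480 would be found at 0x5a3c2480.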

That is a really good point. Actually, the paper I was reading was about bypassing iOS security (as you might have seen). — smttsp, Jan 31 '15 at 08:42
