135

When I look at implementation-related exploits from the past few years, I see that quite a lot of them involve software written in C or C++, and a lot of them are overflow attacks.

  • Heartbleed was a buffer overflow in OpenSSL;
  • Recently, a bug in glibc was found that allowed buffer overflows during DNS resolution;

those are just the ones I can think of right now, but I doubt that these were the only ones that (a) affect software written in C or C++ and (b) are based on a buffer overflow.

Concerning the glibc bug in particular, I read a comment stating that if this had happened in JavaScript instead of in C, there wouldn't have been an issue. Even if the code had merely been compiled to JavaScript, it wouldn't have been an issue.

Why are C and C++ so vulnerable to overflow attacks?

Ajedi32
Nzall
  • With great power comes great responsibility – Dog eat cat world Feb 23 '16 at 14:45
  • [This answer](http://security.stackexchange.com/questions/95245/security-implications-of-neglecting-the-extra-byte-for-null-termination-in-c-c/95248#95248) and [this answer](http://security.stackexchange.com/questions/82750/why-are-buffer-overflows-executed-in-the-direction-they-are/82846#82846) might be interesting reads. It basically comes down to the design of the language, and the level at which it was implemented. – RoraΖ Feb 23 '16 at 14:45
  • @RoraΖ there are tools for compiling C to JavaScript though, like emscripten. http://dankaminsky.com/2016/02/20/skeleton/, near the bottom is what I am referring to. – Nzall Feb 23 '16 at 15:04
  • Your question is kind of like "Why do only Windows computers get Windows viruses?" Because Windows viruses are only possible on Windows computers. C and C++ get buffer overflow vulnerabilities from their ability to do unchecked pointer arithmetic. Most other languages don't have this capability, and thus can't have buffer overflows. Your question also doesn't consider the popularity of these languages. (Perhaps other languages are MORE problematic, but aren't used as much, so they have fewer total vulnerabilities.) – Alexander Feb 24 '16 at 05:00
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackexchange.com/rooms/36311/discussion-on-question-by-nate-kerkhofs-why-are-programs-written-in-c-and-c-so). – Rory Alsop Feb 27 '16 at 15:08
  • In C++, one of the reasons for buffer overflows is failing to use modern C++ and ignoring safer constructs like the STL. If you use C++ like C, you'll get what you deserve. – juzzlin Feb 28 '16 at 18:45
  • There was a brilliant comment before that seems to have been deleted: "It's because a scalpel cuts more than a safety scissors" – a20 Feb 29 '16 at 03:22
  • C/C++ is also the most likely to be used for software that is exposed to the most risk and attacks. – Ian Ringrose Feb 29 '16 at 15:36
  • Heartbleed was for me the result of two bad practices in programming: 1. use of the goto statement, and 2. lack of a programming standard like "forbid the usage of `if` statements without braces". With those, it would not have happened. A lack of unit testing is probably another reason as well... I agree that C and C++ are more prone to this kind of attack, as they are quite low-level languages and it is often the developer's responsibility to prevent bad usage. – рüффп Dec 24 '16 at 15:44

8 Answers

175

C and C++, contrary to most other languages, traditionally do not check for overflows. If the source code says to put 120 bytes in an 85-byte buffer, the CPU will happily do so. This is related to the fact that while C and C++ have a notion of array, this notion is compile-time only. At execution time, there are only pointers, so there is no runtime method to check an array access against the conceptual length of that array.
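
A minimal sketch of exactly that (the sizes are the hypothetical ones from this answer; this is undefined behavior, shown only to illustrate that nothing stops the write):

```cpp
#include <cstring>

int main() {
    char src[120];
    char dst[85];
    std::memset(src, 'A', sizeof src);

    // Compiles without complaint; at run time this writes 35 bytes
    // past the end of dst, silently corrupting whatever lies beyond.
    // Undefined behavior, but there is no runtime check to stop it.
    std::memcpy(dst, src, sizeof src);
    return 0;
}
```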

By contrast, most other languages have a notion of array that survives at runtime, so that all array accesses can be systematically checked by the runtime system. This does not eliminate overflows: if the source code asks for something as nonsensical as writing 120 bytes into an array of length 85, it still makes no sense. However, this automatically triggers an internal error condition (often an "exception", e.g. an ArrayIndexOutOfBoundsException in Java) that interrupts normal execution and does not let the code proceed. This disrupts execution, and often implies termination of the whole processing (the thread dies), but it normally prevents exploitation beyond a simple denial of service.
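
C++ itself can demonstrate the checked behavior, since `std::vector::at()` performs this kind of runtime check (a sketch; the point is the analogy to Java's exception model):

```cpp
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<char> buf(85);
    try {
        buf.at(120) = 'A';  // at() verifies the index on every call
    } catch (const std::out_of_range& e) {
        // The out-of-bounds write never happens; control is diverted
        // here instead, much like Java's ArrayIndexOutOfBoundsException.
        std::cerr << "caught: " << e.what() << '\n';
    }
    return 0;
}
```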

Basically, buffer overflow exploits require the code to make the overflow (reading or writing past the boundaries of the accessed buffer) and to keep on doing things beyond that overflow. Most modern languages, contrary to C and C++ (and a few others such as Forth or assembly), don't allow the overflow to really occur and instead shoot the offender. From a security point of view this is much better.

Thomas Pornin
  • *"From a security point of view this is much better."* While this is certainly true, it also makes some types of programming -- particularly, operating system programming -- significantly more difficult. Remember that C's heritage traces back to being a programming language designed to implement Unix in a portable manner; for good reason, C is sometimes referred to as **portable assembler**. – user Feb 23 '16 at 15:32
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackexchange.com/rooms/36312/discussion-on-answer-by-thomas-pornin-why-are-programs-written-in-c-and-c-so-f). – Rory Alsop Feb 27 '16 at 15:11
  • @MichaelKjörling True, but then again, there are plenty of Microsoft Research OSes that are built on entirely managed (and in this one way, entirely secure) code, including static verification. Microsoft spends a lot of money on fixing this problem systematically, rather than waiting for people to wisen up. As always :D Performance is always tricky, but then again, you get more opportunities for optimization with managed and reflectable code than you ever get with assembly - for a lot of server software, they even managed to get a sizeable performance increase thanks to that. – Luaan Feb 28 '16 at 11:29
  • @Luaan I doubt the major difference was moving from "assembly" to "managed code". If anything, it seems more likely to have been due to moving from non-JITed to JITed code. With compilation time optimizations, you have to pick a lowest baseline that you are willing to support. With JITed code, you can optimize for the specific machine you are running on. In principle, you probably could JIT code written in C; I'm not sure if anyone has tried that, though... – user Feb 28 '16 at 16:40
  • @MichaelKjörling Actually, many of them aren't JIT compiled. They are still compiled for the specific hardware configuration, though. But to be at all effective at JIT compilation, you need a lot of extra information, and a lot of limits - C is simply way too freeform to allow a lot of meaningful optimizations even from source code, much less the compiled code. Even something as simple as those bounds checks - there's no way for a C compiler to do bounds checking for you, since you're just manipulating some random pointers, as far as the compiler knows. The same goes for safely omitting them. – Luaan Feb 28 '16 at 18:09
58

Note that there is some amount of circular reasoning involved: Security issues are frequently linked to C and C++. But how much of that is due to inherent weaknesses of these languages, and how much of it is because those are simply the languages most of the computer infrastructure is written in?


C is intended to be "one step up from assembler". There is no bounds checking other than what you implement yourself, in order to squeeze the last clock cycle out of your system.

C++ does offer various improvements over C, the most relevant to security being its container classes (e.g. <vector> and <string>), and since C++11, smart pointers, which allow you to handle data without having to manually handle memory as well. However, due to being an evolution of C instead of a completely new language, it still also provides the manual memory management mechanics of C, so if you insist on shooting yourself in the foot, C++ does nothing to keep you from it.
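
A rough sketch of the difference (a hypothetical function, not from the original answer): the C-style version makes every size and lifetime the programmer's problem, while the container and smart-pointer versions track both automatically.

```cpp
#include <cstring>
#include <memory>
#include <string>

// C-style: forget the +1 for the terminator, or forget delete[],
// and you have an overflow or a leak.
char* duplicate_c_style(const char* s) {
    char* copy = new char[std::strlen(s) + 1];
    std::strcpy(copy, s);
    return copy;  // caller must remember delete[]
}

// Modern C++: std::string carries its own length and frees itself.
std::string duplicate_cpp_style(const std::string& s) {
    return s;
}

int main() {
    // Smart pointer: memory is released automatically, even if
    // an exception is thrown somewhere along the way.
    auto owned = std::unique_ptr<char[]>(duplicate_c_style("hello"));
    std::string safe = duplicate_cpp_style("hello");
    return 0;
}
```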


So why are things like SSL, bind, or OS kernels still written in these languages?

Because these languages can modify memory directly, which makes them uniquely suited for a certain type of high-performance, low-level application (like encryption, DNS table lookups, hardware drivers... or Java VMs, for that matter ;-) ).

So, if security-relevant software is breached, the chance of it being written in C or C++ is high, simply because most security-relevant software is written in C or C++, usually for historical and/or performance reasons. And if it's written in C/C++, the primary attack vector is the buffer overrun.

If it were a different language, it would be a different attack vector, but I am sure there would be security breaches just as well.


Exploiting C/C++ software is easier than exploiting, say, Java software. The same way that exploiting a Windows system is easier than exploiting a Linux system: The former is ubiquitous, well understood (i.e. well-known attack vectors, how to find and how to exploit them), and a lot of people are looking for exploits where the reward / effort ratio is high.

That does not mean the latter is inherently safe (safer, perhaps, but not safe). It means that, being the harder target with lower benefits, the Bad Boys aren't wasting as much time on it, yet.

DevSolar
37

Actually, "heartbleed" was not really a buffer overflow. To make things more "efficient", they put many smaller buffers into one big buffer. The big buffer contained data from various clients. The bug read bytes that it wasn't supposed to read, but it didn't actually read data outside that big buffer. A language that checked for buffer overflows wouldn't have prevented this, because someone went out of their way or prevent any such checks from finding the problem.

user1708860
gnasher729
    IIRC, the BSD memory allocation *would* have prevented that bug from going unnoticed, but the implementors actively circumvented that system because they considered it to be "too slow". In a way, that's the kind of choice C/C++ are all about, only that this time it was a *really* poor decision. ;-) – DevSolar Feb 23 '16 at 16:43
  • Indeed. If you did this in C# you could easily introduce the equivalent attack. – Joshua Feb 24 '16 at 19:11
  • Did they profile their code prior to making this design decision? I find it really hard to believe they were bottlenecking on `malloc(3)`. – Kevin Feb 25 '16 at 02:58
  • @Kevin memory allocations are relatively slow operations, particularly compared to allocating a buffer once and reusing it. If you are writing fast code (and stuff for web servers needs to be fast, as people complain), then yes, it could easily be the bottleneck after removing all other bottlenecks! This is very much true if you allocate lots of small buffers. – gbjbaanb Feb 25 '16 at 14:20
  • @Kevin: Using malloc() in response to data received from untrusted sources will make code susceptible to an attacker who triggers allocation/release patterns that will cause fragmentation. Code which uses and recycles memory pools can guard against such problems in ways which code using malloc() cannot. – supercat Feb 25 '16 at 16:36
  • @gbjbaanb: it reaaaally depends what we are comparing. I remember reading that jemalloc's "golden path" was about ~25 cycles, for example. Since we are talking about a crypto library, and crypto is not especially fast (unless assisted by hardware), I do think it would be worth profiling. That said, things changed a lot since this code was written, and I seem to remember they complained about specific platforms being slow (but introduced the buffer for all platforms). – Matthieu M. Feb 26 '16 at 20:25
  • @MatthieuM. Consider EASTL, an implementation of the STL for games, created because the usual STL has an allocation system that is not optimised enough for use in gaming. There's one real-world example where memory allocation was a bottleneck, so it's not quite as "hard to believe" as Kevin thought. OpenSSL might have the same "as fast as possible" requirements, or might be poorly designed WRT memory allocs. – gbjbaanb Feb 27 '16 at 13:55
26

First, as others have mentioned, C/C++ is sometimes characterized as a glorified macro assembler: it is meant to be "close to the iron", as a language for system-level programming.

So for instance, the language allows me to declare an array of zero length as a placeholder when, in fact, it may represent a variable-length section in a data packet or the beginning of a variable-length region in memory that is used to communicate with a piece of hardware.
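
For example (a sketch; the zero-length array form is a GCC/Clang extension, standardized in C99 as the flexible array member `data[]`):

```cpp
#include <cstdint>
#include <cstdlib>

// A fixed header followed by a variable-length payload. The
// zero-length array is just a placeholder for whatever bytes
// actually follow the header in memory.
struct Packet {
    std::uint16_t type;
    std::uint16_t length;
    std::uint8_t  data[0];  // GCC/Clang extension; C99 spells it data[]
};

int main() {
    // Allocate the header plus room for a 100-byte payload.
    auto* p = static_cast<Packet*>(std::malloc(sizeof(Packet) + 100));
    if (!p) return 1;
    p->length = 100;
    p->data[0] = 0x42;  // indexes into the trailing payload
    std::free(p);
    return 0;
}
```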

Unfortunately it also means that C/C++ is dangerous in the wrong hands: if a programmer declares an array of 10 elements and then writes to element 101, the compiler will happily compile it, and the code will happily execute, trashing whatever happens to be at that memory location (code, data, stack, who knows).

Second, C/C++ is idiosyncratic. A good example is strings, which are basically character arrays. But each string constant carries an extra, invisible terminating character. This has been the cause of countless errors as (especially, but not exclusively) novice programmers often fail to allocate that extra byte needed for the terminating null.
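
The classic mistake looks like this (a minimal sketch): `strlen()` does not count the invisible terminator, so the allocation must add one byte for it.

```cpp
#include <cstdlib>
#include <cstring>

int main() {
    const char* s = "hello";  // 5 visible characters + invisible '\0'

    char* too_small = static_cast<char*>(std::malloc(std::strlen(s)));      // 5 bytes
    char* correct   = static_cast<char*>(std::malloc(std::strlen(s) + 1));  // 6 bytes

    // std::strcpy(too_small, s);  // would write '\0' one byte past the end
    std::strcpy(correct, s);       // fits, terminator included

    std::free(too_small);
    std::free(correct);
    return 0;
}
```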

Third, C/C++ is actually quite old. The language came into being at a time when external attacks on a software system were basically non-existent. Users were expected to be trusted and cooperative, not hostile, as their goal was to make the program work, not to crash it.

Which is why the standard C/C++ library contains many functions that are inherently unsafe. Take strcpy(), for instance. It will happily copy anything up until a terminating null character. If it doesn't find a terminating null character, it will keep on copying till hell freezes over, or, more likely, until it overwrites something vital and the program crashes. This wasn't a problem in the good old days, when a user was not expected to enter, into a field reserved for, say, a ZIP code, 16,000 garbage characters followed by a specially constructed set of bytes meant to be executed after the stack was trashed and the processor resumed execution at the wrong address.
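
A sketch of that failure mode next to a bounded alternative (the buffer sizes are hypothetical):

```cpp
#include <cstdio>
#include <cstring>

int main() {
    char zip[6];  // room for "12345" plus the terminator
    const char* hostile = "16000 garbage characters, pretend they are here...";

    // strcpy() copies until it happens to find a '\0', however far
    // away that is; here it would smash far past the end of zip.
    // std::strcpy(zip, hostile);

    // snprintf() is told the buffer size; it truncates and terminates.
    std::snprintf(zip, sizeof zip, "%s", hostile);
    std::puts(zip);  // prints the truncated "16000"
    return 0;
}
```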

To be sure, C/C++ is not the only idiosyncratic language out there. Other systems have different idiosyncratic behavior, but it can be just as bad. Take back-end programming languages like PHP, and how easy it is to write code that allows for SQL injection.

In the end, if we give programmers the powerful tools they need to do their job, but without adequate training and awareness of the security environment, Bad Things will happen no matter which programming language is used.

Viktor Toth
    The powerful tools needed to program *efficiently*. In general, direct memory access is not *necessary*; see almost any other high-level language. – user253751 Feb 24 '16 at 02:33
  • "In the end, if we give programmers the powerful tools they need to do their job, but without adequate training and awareness of the security environment, Bad Things will happen no matter which programming language is used." Bad Things can happen when any programming language is used. But, for reasons like those you superbly describe regarding C & C++, they also tend (*tend*) to happen more easily & frequently when some are used vs. others. – mostlyinformed Feb 24 '16 at 05:34
  • To make matters worse, the C Standards don't actually let programmers write code 'close to the metal'. If a compiler can determine that a certain combination of inputs would lead to situations where the Standard would impose no requirements, it may omit code which would otherwise have handled such inputs, even if there is almost no other plausible consequence for the Undefined Behavior that would be anywhere near as bad as the omission of code based upon inferences about possible inputs. – supercat Feb 24 '16 at 09:06
  • There is no such thing as “C/C++”. Most of what you're talking about here is specific to C. – leftaroundabout Feb 24 '16 at 12:49
  • **All** of it is specific to C, and even then to specific implementations. There is no rule in C that says a compiler must even **accept** an attempt to access the 101st element of a 10-element array. It may abort the compilation if so-called Undefined Behavior is unavoidable. More realistically, it may simply assume the relevant code is unreachable from `main` and simply omit the whole offending function. – MSalters Feb 25 '16 at 12:56
  • @MSalters It is not specific to C, because it also applies to C++. I can't begin to comprehend how you would think C++ code can't have buffer overflows. Even `std::vector::operator[]` does no bounds checking. – user253751 Feb 25 '16 at 23:44
  • idiosyncratic - _adj._ Hack upon hack upon hack and a generous helping of legacy :D – Gusdor Feb 26 '16 at 09:33
  • @immibis: Unlike C arrays, std::vector always carries its own size around. It _can_ do bounds checking, and in fact with `.at(i)` it does. However, precisely because the size is so conveniently available, bounds checking usually is pointless. – MSalters Feb 26 '16 at 10:44
  • @MSalters "It *can* do bounds checking" - Yeah, and `operator[]` (the most natural way to index a vector) doesn't. – user253751 Feb 26 '16 at 11:08
  • @supercat, I've read your comment three times and I still don't get it. Mind explaining with less than ten separate clauses in one sentence? :D – Wildcard Feb 27 '16 at 00:10
  • @Wildcard: If certain inputs would cause a program to invoke Undefined Behavior, the Standard imposes no constraints with regard to what generated code might do if given such inputs. For example, given `int *p,*q`, if a program tests `if (p > q) ...` a compiler would be entitled to infer that the code will never receive input that would cause that test to be executed unless `p` and `q` are part of the same object. Even if the instructions a platform would use for normal pointer comparisons would define a globally consistent ranking for all pointers, there is no Standard-defined way... – supercat Feb 27 '16 at 00:21
  • ...for a program to make use of that. During the 1990s, many compilers would yield consistent and useful behaviors in many circumstances where the Standard imposed no requirements; even though the Standard never acknowledged such behaviors, programmers saw no need to have the Standard mandate that compilers must do the things they were already doing. Unfortunately, a bizarre form of historical revisionism has infested C compiler development, promoting the belief that behaviors which compilers didn't expressly document because they were so widely universal as to not be worth mentioning... – supercat Feb 27 '16 at 00:26
  • ...were never really important. Proponents of that philosophy suggest that if there were a need for directives to allow pointers to alias things of different types there would be a demand for it, ignoring the fact that programmers were writing code that required aliasing and compilers were accepting such code and running it correctly. The idea that there's no demand for the things programmers and compilers routinely do/did is bizarre. – supercat Feb 27 '16 at 00:35
4

I will probably touch on some things the other answers have already stated, but I find the question itself to be erroneous and "vulnerable".

As asked, the question is assuming a lot without understanding the underlying issues. C/C++ are not "more vulnerable" than other languages. Rather, they place as much of the power of computing devices, and the responsibility of using that power, directly in the hands of the programmer. So, the reality of the situation is, many programmers write code that is vulnerable to exploitation, and since C/C++ do not go to great lengths to protect the programmer from themselves like some languages do, their code is more vulnerable. This is not a C/C++ issue, as programs written in assembly language would have the same problems, for example.

The reason such low-level programming can be so vulnerable is that things like array/buffer bounds checking can become computationally expensive, and are very often unnecessary when programming defensively. Imagine, for example, that you are writing code for some major search engine, which has to process trillions of database records in the blink of an eye, so the end user won't get bored or frustrated while "Page loading..." is displayed. You do not want your code to keep checking array/buffer boundaries every single time through the loop; while such a check may take only nanoseconds, which is trivial if you're only processing ten records, it can add up to many seconds or minutes when you're looping through billions or trillions of records.
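
A sketch of the trade-off (hypothetical functions, not from the original answer): paying for a check on every access versus validating once up front and trusting the loop.

```cpp
#include <cstddef>
#include <vector>

// Checked every iteration: at() verifies the index each time through.
long sum_checked(const std::vector<int>& v, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += v.at(i);  // throws std::out_of_range if i >= v.size()
    return total;
}

// Validated once; the loop body then runs with unchecked access.
// The "trust" is concentrated in a single up-front test.
long sum_trusted(const std::vector<int>& v, std::size_t n) {
    if (n > v.size()) return 0;  // one boundary check, outside the loop
    long total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += v[i];           // no per-iteration check
    return total;
}
```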

So instead, you "trust" that the data source (for example, the "web bot" that scans websites and puts the data in to the database) has already checked the data. This should not be an unreasonable assumption; For a typical program, you want to check the data upon input, so the code that processes the data can operate at maximum speed. Many code libraries also take this approach. Some even document that they expect the programmer to have already checked the data before calling the library functions to act upon the data.

Unfortunately, however, many programmers do not program defensively, and just assume that the data must be valid and within safe boundaries/parameters. And this is what gets exploited by attackers.

Some programming languages are designed to try to protect the programmer from such poor programming practices by automatically inserting additional checks into the generated program which the programmer did not explicitly write into their code. Again, this is fine when you're only going to loop through the code a few hundred times or less. But when you're going through billions or trillions of iterations, it adds up to long delays in data processing, which may become unacceptable. So it's a trade-off when choosing which language to use for a particular piece of code, and how often and where you check for potentially dangerous/exploitable conditions within the data.

C. M.
  • tl;dr: There is a tradeoff between potentially unnecessary safety checks and speed. – Wildcard Feb 27 '16 at 00:15
  • "But when you're going through billions or trillions of iterations, it adds up to" - when you're iterating through an array it adds up to exactly *one* single check before the loop, because modern compilers are rather clever. The only time you pay for a bounds check is if the compiler can't figure out whether it's safe which generally means it's a random access. You pay about 1 cycle more in that case, which yes in some situations can add up (e.g. matrix operations), but for 99.9% of all code this is completely negligible. – Voo Feb 27 '16 at 16:15
This isn't necessarily true. Yes, modern compilers are indeed very clever and can optimize a lot of code. But it's *still just a computer program*, not an intelligent being that can look at your code and know with complete certainty exactly what the programmer intended to do. There are still cases where the compiler cannot make the "perfect choice" optimizations and falls back to safer ones, which may be too slow for certain purposes, so programmers turn them off. People's tendency to rely on "smart compilers" to do their work for them is part of why this kind of problem persists. – C. M. Mar 02 '16 at 00:46
To add to this, there are several assumptions being made. First, "99.9%" - where did this number come from? Sounds like "80% of statistics are made up on the spot" to me. Second, the data being processed is in a nice, neat array... which isn't always true. Indeed, the concept of making the data conform to "safe" operations is part of the computationally expensive data manipulation that programmers try to avoid; they just assume that either the data is "safe", or that the compiler will "fix it so it is." And so on. – C. M. Mar 02 '16 at 00:50
2

Basically, programmers are lazy people (myself included). They do things like using gets() instead of fgets(), defining I/O buffers on the stack, and not looking out enough for ways memory could get overwritten unintentionally (well, unintentionally for the programmer; intentionally for the hacker :).
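
For the record (a minimal sketch; `gets()` was considered so unfixable that it was removed outright from C11 and C++14):

```cpp
#include <cstdio>

int main() {
    char buf[32];

    // gets(buf);  // no size argument: any input longer than 31
                   // characters overflows buf, which is exactly why
                   // gets() was removed from the standard library.

    // fgets() is told the buffer size and stops there.
    if (std::fgets(buf, sizeof buf, stdin) != nullptr)
        std::fputs(buf, stdout);
    return 0;
}
```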

Bing Bang
2

There is a large amount of existing C code that does unchecked writing to buffers. Some of this is in libraries. This code is exploitably unsafe if any external state can change the length written, and only very unsafe otherwise.

There is a larger amount of existing C code that does bounded writing to buffers. If the user of said code makes a math error and lets more be written than it should be, this is as exploitable as the above. There is no compile-time guarantee that the math is done right.
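
A sketch of that second case (the helper is hypothetical): the write is bounded, but the bound itself is computed wrong, and nothing at compile time catches it.

```cpp
#include <cstring>

// Concatenate a and b into dst "safely". The math forgets the
// terminating '\0', so the guard is off by one.
void join(char* dst, std::size_t dst_size, const char* a, const char* b) {
    std::size_t need = std::strlen(a) + std::strlen(b);  // should be + 1
    if (need <= dst_size) {   // passes when need == dst_size...
        std::strcpy(dst, a);
        std::strcat(dst, b);  // ...and the '\0' lands one byte past the end
    }
}

int main() {
    char buf[8];
    join(buf, sizeof buf, "abcd", "efgh");  // 8 chars + '\0' = 9 bytes
    return 0;
}
```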

There is also a large amount of existing C code that does reads based on offsets in memory. If the offset is not checked for validity, this can leak information.

C++ code is often used as a high level language for interop with C, so many C conceits are followed, and bugs from communicating with C APIs are common.

C++ programming styles that prevent such overruns exist, but it only takes one mistake to allow them to happen.

In addition, the problem of dangling pointers, where memory resources are recycled and the pointer now points at memory with a different lifetime/structure than it did originally, permits some kinds of exploits and information leaks.
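
A minimal sketch of that case: the pointer outlives the allocation, and the recycled memory now belongs to something else.

```cpp
int main() {
    int* p = new int(42);
    delete p;             // the bytes go back to the allocator

    int* q = new int(7);  // the allocator may hand out the same bytes

    // p may now alias q: reading or writing through p is undefined
    // behavior, and in practice an information leak or corruption.
    // *p = 1337;  // dangling write (commented out on purpose)

    delete q;
    return 0;
}
```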

These kinds of errors -- "fencepost" errors, "dangling pointer" errors -- are so common, and so hard to eliminate completely, that many languages were developed with systems designed explicitly to prevent them from happening.

Not surprisingly, in languages designed to eliminate these errors, these errors do not occur nearly as often. They still sometimes occur: either the engine running the language has the problem, or a manual situation is set up that matches the environment of the C/C++ case (reusing objects in a pool, using a large common buffer subdivided by consumer, etc.). But because those uses are rarer, the problem happens less often.

Every dynamic allocation, every buffer use, in C/C++ runs these risks. And being perfect is not attainable.

Yakk
0

Most commonly used languages (Java and Ruby, for instance) compile to code that runs in a VM. The VM is designed to segregate machine code, data, and usually the stack. This means that regular language operations can't change the code or redirect the flow of control (sometimes there are special APIs that can do this, e.g. for debugging).

C and C++ are usually compiled directly into the native machine language of the CPU - this gives performance and flexibility benefits, but means that erroneous code can overwrite program memory or stack and thus execute instructions not in the original program.

This typically occurs when a buffer is (maybe deliberately) overrun in C++. In Java or Ruby, by contrast, a buffer overrun will immediately cause an exception and can't (excepting VM bugs) overwrite code or change control flow.

Rich
  • This has nothing to do with running on a VM or not. You could have a VM with the same behaviour as C, just as you can have programs that compile directly to machine code that are as safe as, say, Java (e.g. Ada). – Voo Feb 29 '16 at 13:19
  • In theory. In almost all practical cases, Java runs on a VM that prevents code overwriting and C/C++ run on a bare machine that doesn't. – Rich Mar 02 '16 at 03:58
  • Yes. And Ada doesn't run on a VM and also prevents most of these exploits, just as Java does. Having a VM or not is completely irrelevant to this (what do you think is so special about a VM that it can't be done otherwise? Hell, it actually only introduces a possible security vulnerability, because the JIT needs writeable and executable memory!). – Voo Mar 02 '16 at 07:30