No one suggested a definitive answer, so I ran a tiny experiment. Based on that experiment, here is my recommendation so far:
Recommendation. When fuzzing, you might consider setting the environment variables LIBC_FATAL_STDERR_=1 MALLOC_CHECK_=3. These settings had no measurable performance impact in my experiment, and based upon my results, they might slightly increase the number of bugs you detect.
None of the other settings made any detectable difference in my experiment.
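For example, you could export both variables in the shell you use to launch your fuzzer; the target command shown in the comment below is just a placeholder, not the exact invocation from my runs:

```sh
# Make glibc send fatal diagnostics to stderr and enable its strictest malloc checks
# (MALLOC_CHECK_=3 prints a message and aborts on detected heap corruption).
export LIBC_FATAL_STDERR_=1
export MALLOC_CHECK_=3

# ...then start the fuzzing campaign as usual, e.g.:
# zzuf -s 0:5000 -c ffmpeg -i seed.avi -f null -
```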
Optional. If you want, you can compile with -fstack-protector or -fstack-protector-all, with -O2 -D_FORTIFY_SOURCE=2, and/or with mudflap; and you can run with the environment variable G_SLICE=debug-blocks. None of them had any measurable performance impact in my experiment. However, none of them had any impact on the set of bugs found. So while there was no cost in my experiment, there was also no benefit.
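For reference, here is roughly how those compile-time flags could be passed when building ffmpeg. This is a sketch based on ffmpeg's --extra-cflags configure option, not the exact build commands from my runs; adjust it to your build setup:

```sh
# Hardened build sketch: stack protector plus FORTIFY_SOURCE.
./configure --extra-cflags="-fstack-protector-all -O2 -D_FORTIFY_SOURCE=2"
make

# The GLib slice-allocator checks need no rebuild; they are enabled at run time:
#   G_SLICE=debug-blocks ./ffmpeg ...
```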
Experimental methodology and details. In each run, I fuzzed ffmpeg with zzuf, using one seed file, for 5000 iterations. There was one run per setting of compiler flags/environment variables. I ensured that fuzzing would generate exactly the same set of variant-files in each run, so the only difference was the compiler flags/environment variables. To measure the performance impact, I measured the CPU+system time to complete fuzzing. To measure the impact on ability to detect bugs, I recorded which variant-files triggered a detectable crash.
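Concretely, each run looked roughly like the following; seed.avi stands in for my actual seed file and the ffmpeg arguments are illustrative:

```sh
# A fixed seed range makes zzuf generate the same variant-files in every run;
# -c tells zzuf to fuzz only the file named on the command line.
zzuf -s 0:5000 -c ffmpeg -i seed.avi -f null -
```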
None of the options had any detectable effect on performance (the differences were < 1% in all cases, and probably due to random noise).
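If you want to reproduce the timing measurement, GNU time can report CPU and system time for a whole run; I'm not showing my exact command, but it would look something like this:

```sh
# Report user and system CPU time for one complete fuzzing run (requires GNU time).
/usr/bin/time -f "user %U s, sys %S s" \
  zzuf -s 0:5000 -c ffmpeg -i seed.avi -f null -
```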
For bug detection power, MALLOC_CHECK_=3 gave a slight advantage, but none of the other flags or settings made any difference:

MALLOC_CHECK_=3 did have an influence on which variant-files caused a crash. With no flags, 22 of the 5000 iterations caused a crash. Another 2 iterations caused a warning message (*** glibc detected *** ...) that, if you knew to look for it, could be used to detect a bug. So if you were smart enough to grep your fuzzing logs for that message, 24 of the 5000 iterations would provide signs of a bug, whereas if you didn't know to grep the logs for that particular warning message, only 22 of the 5000 iterations provided indications of a bug. In contrast, when I enabled MALLOC_CHECK_=3, 25 of the 5000 iterations caused a crash, and there was no need to grep the logs. Thus, MALLOC_CHECK_=3 is both slightly more effective at uncovering signs of a bug and reduces the need to postprocess your fuzzing logs specially.
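If you do fuzz without MALLOC_CHECK_=3, it is still worth scanning your logs for that glibc diagnostic. Assuming one log file per iteration (the logs/ layout is hypothetical), something like:

```sh
# List log files containing glibc's heap-corruption warning (fixed-string match).
grep -lF '*** glibc detected ***' logs/*.log
```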
Interestingly, there was one variant-file that crashed the program with no settings but did not crash the program with MALLOC_CHECK_=3, confirming @this.josh's hypothesis that additional checking might in some cases cause us to miss some bugs. At the same time, there were 2 variant-files that didn't crash the program with no settings but did crash the program with MALLOC_CHECK_=3. Thus, the benefits of MALLOC_CHECK_=3 outweighed its costs.
Apart from MALLOC_CHECK_, none of the other settings had any influence whatsoever on which variant-files triggered a detectable crash. The set of variant-files that caused the baseline program (no special flags) to crash was exactly the same as the set of variant-files that caused the program to crash when compiled with the special flags. Therefore, at least in this experiment, those other settings didn't cost us anything (in performance), but they also didn't gain us anything (in bug detection power).
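For anyone repeating this comparison, one simple way to check whether two runs crashed on the same variant-files is to record the crashing seeds per run and diff the sorted lists; the file names here are hypothetical:

```sh
# Show entries unique to either list; empty output means the two runs
# crashed on exactly the same variant-files.
comm -3 <(sort baseline-crashes.txt) <(sort malloc-check-crashes.txt)
```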
My experiment is far from authoritative. To do this right, one should really try it out with many different programs (not just one), and multiple different seed files (not just one). So I'd caution you against drawing too many conclusions from this one small experiment. But I thought the results were interesting nonetheless.