
Apologies if this is not as much of a "question" as it is a discussion. I've been thinking about this for a while.

How could I train a machine learning system to identify (new) vulnerabilities in open source code bases? Or even closed binaries? Is it possible?

Here's my proposed solution... I'm curious if anyone is familiar with work along these lines or if you have any thoughts on its feasibility.

Requirements:

A CVE database with the following properties:

1) The source diff of the patch applied to fix the vulnerability, i.e. the "before/after" of the critical section of code

2) The bindiff of the binary before and after patching the vulnerability

Goal:

Use code from previously identified vulnerabilities to train the ML system to recognize "vulnerable" code, and then apply it to critical sections of code in open source projects.

It would look something like this...

DATA COLLECTION PHASE:

1) Collect the before/after code of all previous vulnerabilities

2) Use the before/after code to identify the "critical section" that caused the vulnerability

3) Convert the "critical section" to its AST representation
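
To make step 3 concrete, here's a minimal sketch of converting a "critical section" to an AST, assuming the target code is Python and using only the standard library (for C/C++ you'd need something like Clang's Python bindings instead). The `vulnerable_snippet` is a made-up example:

```python
import ast

# Hypothetical "critical section" extracted from the pre-patch side of a diff.
vulnerable_snippet = """
def read_packet(buf, length):
    data = buf[:length]  # no bounds check on 'length'
    return data
"""

def to_node_sequence(source):
    """Parse source code and flatten its AST into a list of node-type names."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]

print(to_node_sequence(vulnerable_snippet))
# e.g. ['Module', 'FunctionDef', 'arguments', 'Assign', 'Return', 'arg', ...]
```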

TRAINING PHASE:

1) Determine the best ML algorithms to use for comparing AST representations

2) Using labeled inputs of "vulnerable" and "safe" AST representations, train the ML system to recognize a "vulnerable" AST
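
As a minimal sketch of what the training phase might look like: reduce the AST node sequences to bag-of-node-type counts and feed them to an off-the-shelf classifier (scikit-learn here; the two samples and labels are placeholders for a real corpus built from CVE diffs):

```python
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

# Placeholder data: node sequences from to_node_sequence(), labeled
# 1 = pre-patch ("vulnerable") side of a CVE diff, 0 = post-patch ("safe") side.
samples = [
    ["FunctionDef", "Assign", "Subscript", "Return"],               # vulnerable
    ["FunctionDef", "If", "Compare", "Raise", "Assign", "Return"],  # safe
]
labels = [1, 0]

# Bag-of-node-types: count how often each AST node type appears.
vectorizer = DictVectorizer()
X = vectorizer.fit_transform([Counter(seq) for seq in samples])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)
```

A bag of node types throws away structure, of course; tree kernels or graph embeddings over the full AST would be the obvious next step.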

IDENTIFICATION OF NEW VULNERABILITIES PHASE:

1) Download open source code bases

2) Somehow prioritize which code to convert to AST

3) Convert code to AST and feed to ML system to determine likelihood of "vulnerability"

4) Apply some combination of static and manual analysis to verify the vulnerability

5) Use results as further feedback to train the ML system
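
Here's a rough sketch of steps 2 and 3 of this phase, reusing the hypothetical `vectorizer` and `clf` from above: score every function in a downloaded code base, then hand the highest-scoring ones to static/manual analysis first (step 4):

```python
import ast
from collections import Counter
from pathlib import Path

def score_codebase(root, vectorizer, clf):
    """Rank every function in a source tree by predicted vulnerability score."""
    scored = []
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(errors="ignore"))
        except SyntaxError:
            continue  # skip files that don't parse
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                seq = [type(n).__name__ for n in ast.walk(node)]
                X = vectorizer.transform([Counter(seq)])
                prob = clf.predict_proba(X)[0][1]  # P("vulnerable")
                scored.append((prob, f"{path}:{node.lineno}:{node.name}"))
    # Highest-probability functions go to static/manual review first.
    return sorted(scored, reverse=True)
```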

So again, I realize this is not strictly a "question" but I hope it can foster some interesting discussion. It's an idea I've been playing with in my head for a while, but most of it falls way outside my expertise.

There are definitely a lot of challenges with this, mainly false positives (e.g. a doubly nested for-loop with a dozen conditionals might look like a vulnerable AST but sit in a non-critical portion of the code). But I think the central idea of training ML algorithms on existing vulnerabilities could lead to a very efficient way of finding NEW vulnerabilities. At the very least, it could give tools like fuzzers an efficiency boost by directing them to the critical portions of code.

Also, the approach does not necessarily have to be limited to open source code. It could also disassemble the vulnerable binary and the patched binary and compare their ASM instructions. In fact, this might yield an even higher signal than the AST method.
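
For that binary variant, a toy first step might be disassembling the pre- and post-patch function bytes and diffing their mnemonic sequences (sketched here with the Capstone disassembler and difflib; the byte strings are hand-made stand-ins for function bodies that a bindiff tool would flag as changed):

```python
import difflib
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

def mnemonics(code, base=0x1000):
    """Disassemble raw x86-64 bytes into a list of instruction mnemonics."""
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    return [insn.mnemonic for insn in md.disasm(code, base)]

# Toy function bodies: the "after" version adds a null-pointer check.
before = bytes.fromhex("488b07c3")                  # mov rax,[rdi]; ret
after  = bytes.fromhex("4885ff7404488b07c331c0c3")  # test; je; mov; ret; xor; ret

for line in difflib.unified_diff(mnemonics(before), mnemonics(after), lineterm=""):
    print(line)  # the inserted test/je/xor instructions are the patch signal
```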


  • I recommend having a look at the [various publications of Fabian Yamaguchi](https://www.tu-braunschweig.de/sec/team/fabs/publications), which deal with exactly this kind of question, i.e. finding potential vulnerabilities with various kinds of static program analysis. – Steffen Ullrich Apr 26 '16 at 18:34
  • Wired just ran a nice article on an MIT study that uses AI/ML to detect possible attacks. Not exactly your question, but in the same vein of mixing security and AI/ML. http://www.wired.com/2016/04/mits-teaching-ai-help-analysts-stop-cyberattacks/ – beatsbears Apr 26 '16 at 20:14
  • This is quite broad, and as such it is not a good fit for this site. You might try poking around in more informal sections of the site, such as the chat rooms for CS.SE and cstheory.SE. – Ohnana Apr 27 '16 at 11:42
