Assuming the existence of a sufficient number of benign inputs?

Question

I have come across multiple machine-learning-based security solutions that train their detectors/models on "benign" inputs. The assumption is that the operator has access to a sufficiently exhaustive set of benign inputs (that is, benign inputs that provide adequate input and code coverage for typical usage).

Is that a realistic assumption in practice? Are there ways to automatically generate such benign inputs? Or are these solutions still in their academic infancy?

"academic infancy" is a linguistic co-location that I have not run across before — schroeder, Nov 28 '15 at 00:17
Can you provide more details to clarify your question? The tags seem to imply pen-testing tools, but the question itself seems more like you're referring to an adaptive firewall. I'll give you an answer based on what I *think* you're asking, but clarification would help. — Mike Ounsworth, Nov 28 '15 at 01:07
We are trying to evaluate some solutions that would be interfacing with the web. So our first thoughts were that what they are promising is unrealistic, since it's pretty hard to train on sufficient samples of benign behavior given the services we provide. Your answer, I believe, supports our thoughts. — MEE, Nov 30 '15 at 18:49

score 1 · Accepted Answer · answered Nov 28 '15 at 01:22

Is that a realistic assumption in practice? Are there ways to automatically generate such benign inputs? Or are these solutions still in their academic infancy?

That depends heavily on what kind of input data you're trying to simulate. So the short answer is: only someone who's familiar with your domain can decide that.

Here's what I mean: if the "benign inputs" you're trying to simulate are realistic user data from Google Location Services, or typical browsing behaviour on Amazon.com, then yes, the ability to simulate those inputs is "in its academic infancy".

On the other hand, if you're trying to pen test an application that accepts a standardized protocol - for example the Certificate Management Protocol (CMP) - which has a very small number of accepted message types (~30 for CMP), then no, it's actually quite easy to generate a complete and exhaustive set of example inputs.
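To illustrate why a small, closed message set is easy to cover, here is a minimal sketch that enumerates every combination of message type and field value. The message types and field options below are hypothetical placeholders, not the real CMP message set; for a real protocol you would take them from its specification.

```python
from itertools import product

# Hypothetical message types and field values, for illustration only;
# these are NOT the real CMP message set.
MESSAGE_TYPES = ["init_request", "cert_request", "revoke_request"]
FIELD_OPTIONS = {
    "protection": ["signature", "mac"],
    "key_type": ["rsa", "ec"],
}

def enumerate_benign_inputs(message_types, field_options):
    """Yield one dict per combination of message type and field values."""
    names = list(field_options)
    for msg_type in message_types:
        for values in product(*(field_options[name] for name in names)):
            yield {"type": msg_type, **dict(zip(names, values))}

corpus = list(enumerate_benign_inputs(MESSAGE_TYPES, FIELD_OPTIONS))
print(len(corpus))  # 3 message types * 2 * 2 field values = 12
```

The corpus size is the product of the option counts, so this only stays tractable when the protocol's valid input space is small and well specified. That is exactly what free-form user behaviour lacks.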

So what are you trying to do? What type of input data are you trying to simulate? If you edit your question to provide more details, we can give you a better answer.

Answer 2 · John Deters · answered Nov 28 '15 at 00:56

Is it realistic to assume that your clients will provide enough traffic to properly train your heuristic detector? That's implementation-dependent.

Your business may be cyclic. You may have busy sales around a holiday period, and then take inventory the following month. If you don't train the system with both sales and inventory data, it may falsely identify your inventory traffic as hostile. But since it's your business, you should know those cycles and account for them. No heuristic system can predict how your business works, or what kind of traffic your business would consider "normal".

So just as they can't recognize your traffic as normal, their systems are equally incapable of inherently generating the "benign" traffic. That's why you are asked to explicitly provide them with your examples.
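The point about cyclic traffic can be sketched with a toy detector. This is only an illustration: the traffic figures are made up, and a single mean/standard-deviation baseline is a crude stand-in for whatever model a real product uses.

```python
from statistics import mean, stdev

def train_baseline(samples):
    """Summarize training traffic as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    return abs(value - mu) > threshold * sigma

# Made-up requests-per-minute figures, for illustration only.
sales_traffic = [950, 1010, 990, 1020, 980, 1005]
inventory_traffic = [120, 135, 110, 140, 125, 130]

sales_only = train_baseline(sales_traffic)
both_periods = train_baseline(sales_traffic + inventory_traffic)

# A typical inventory-month reading looks hostile to a detector
# that has only ever seen the sales period.
print(is_anomalous(125, sales_only))    # True
print(is_anomalous(125, both_periods))  # False
```

Training on both periods removes the false positive here, but it also widens what this simple model accepts as normal. That trade-off is another reason the operator, not the vendor, has to decide which traffic counts as benign.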
