
A company entrusts a Data Scientist with the mission of processing and extracting value from data in order to investigate and respond to events related to traces of computer attacks. I was wondering how he would get the training data.

I guess he would need to analyze the logs from the clients' various devices and apply statistical, Machine Learning, and visualization techniques in order to better understand attacks in progress and to identify the weak signals of attacks... But how would he get labelled data?

He might get the logs of attacks received before, but those might not have the same signatures as the attacks that are going to come later. So it might be difficult to create a reliable product?

  • I think this question is way too broad. First, there are ML techniques which don't need labeled data in the first place, like anomaly detection, clustering etc. What techniques and features can be used depends a lot on what is available, i.e. kind of data, amount of data, source of data, depth of data, potentially associated information like IOC, ... *"So it might be difficult to create a reliable product?"* - yes, it is difficult. And the product often is not as reliable as one might wish (i.e. high false positive and negative rates), which does not mean that it cannot be useful anyway. – Steffen Ullrich Apr 17 '21 at 15:02
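To illustrate the comment's point that some techniques need no labels at all, here is a minimal sketch of label-free anomaly detection on log-derived event counts. Everything here is hypothetical (the counts, the event type); it just shows the idea of flagging deviations from a statistical baseline:

```python
from statistics import mean, stdev

def find_anomalies(counts, k=3.0):
    """Flag (index, value) pairs lying more than k standard deviations
    from the mean of the series -- no labelled data required."""
    mu = mean(counts)
    sigma = stdev(counts)
    return [(i, c) for i, c in enumerate(counts)
            if abs(c - mu) > k * sigma]

# Hypothetical per-hour counts of failed-login events from a log source;
# hour 8 contains a burst that stands out from the quiet baseline.
print(find_anomalies([12, 9, 11, 10, 13, 8, 12, 11, 95, 10, 9, 12]))
# [(8, 95)]
```

Real detectors would use more robust statistics (the outlier inflates the standard deviation it is measured against), but the principle of learning "normal" without labels is the same.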

2 Answers


The domain problems in cybersecurity are too narrow to apply AI/ML or DL to willy-nilly.

Not saying to throw data science out the window. Saying you need way more domain expertise to make it go.

One excellent application of data science to the cybersecurity field is to understand the indicators, or IOCs, and how they are sighted (last seen, etc.) in memory, on disks, and in network traffic (and positioned where in the network traffic relative to the sources, destinations, and passthroughs -- just like any dataflow). Instead of leveraging ML or DL, I would suggest focusing first on graph algorithms. Understand the relationships of these indicators and their interpretation as time series data:

  • A File "Hash" (e.g., a SHA256 checksum of a file or a section of memory or network traffic of a process). Here is an example of the domain problems associated with cybersecurity: the way our signatures (e.g., Yara rules) work on-disk vs. in-memory vs. in network traffic involves obviously different code and data, and code and data paths. Parameters or arguments to processes also matter, especially for script code
  • An IPv4 or IPv6 address or path, and its associated network attributes such as a BGP-4 ASN if registered with a Regional Internet Registry (RIR). There is often a history associated with these objects, and narrowing into them may require understanding complex RWhois and SWIP registration processes
  • An FQDN, or Fully-Qualified Domain Name -- sometimes a hostname, and with Windows Server Forest/Domain bits, perhaps older naming schemes such as NetBIOS or MS-RPC Named Pipes. Cloud stacks such as Azure AD are changing this nomenclature as well, moving to tenants, subscriptions, resources, et al. A set of Whois records identifying each unique Internet Domain Name can come with its own set of relationships, including a rich history of timestamps, owners, name servers, and email addresses
  • A credential, often an email address, e.g., bertrand.russell@math.onmicrosoft.com but also a cred user/pass pair, i.e., bertrand:MathIsK00lB00ksRul3 if known (often if compromised)
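The graph-algorithm suggestion above can be sketched in a few lines: model each indicator as a node, each co-sighting as an edge, and "pivot" from a seed IOC by walking the graph. All the indicators and sightings below are made up for illustration:

```python
from collections import defaultdict, deque

# Hypothetical sightings linking indicators observed together, e.g.
# a file hash contacting a domain, a domain resolving to an IP.
sightings = [
    ("hash:3a7bd3e2", "fqdn:update.example-cdn.net"),
    ("fqdn:update.example-cdn.net", "ip:203.0.113.7"),
    ("ip:203.0.113.7", "fqdn:mail.example-cdn.net"),
    ("hash:9f86d081", "ip:198.51.100.23"),   # an unrelated cluster
]

# Undirected adjacency list over the indicator graph.
graph = defaultdict(set)
for a, b in sightings:
    graph[a].add(b)
    graph[b].add(a)

def pivot(seed):
    """Return every indicator reachable from the seed IOC (BFS),
    i.e. the connected component an analyst would pivot through."""
    seen, queue = {seed}, deque([seed])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen

print(sorted(pivot("hash:3a7bd3e2")))
```

Starting from the first hash, the pivot surfaces both related FQDNs and the shared IP, while the unrelated hash/IP pair stays outside the component. Adding timestamps to the edges turns this into the time-series view mentioned above.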

As a hint of what would be possible, check out the work here -- https://threathunterplaybook.com/introduction.html -- which pivots nicely off of the fields (and parsing languages) from Azure Sentinel and M365/Azure data models

atdre

You can't.

The attack vector is specific to the combination of hardware and software for the specific company. There are no logging standards. Even logs for the same software can be readily customized (firewall, server events, etc.)

ML isn't going to work very well for unknown attacks. Of course, if you have an ML model trained on known-good data, then it may be able to detect bad actors based on a rogue data stream (e.g.: gigs of data sent out when nobody is logged in). Of course, nobody will ever offer labelled data for private, corporate-specific logs, so you'll have to create your own over time.
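The known-good baseline idea can be sketched simply: learn what "normal" outbound volume looks like from a clean period, then flag observations far outside it. The byte counts and threshold below are hypothetical, just to show the shape of the approach:

```python
from statistics import mean, stdev

# Hypothetical hourly outbound-byte counts from a period known to be clean.
baseline = [2.1e6, 1.8e6, 2.4e6, 1.9e6, 2.2e6, 2.0e6]
mu, sigma = mean(baseline), stdev(baseline)

def looks_rogue(outbound_bytes, k=3.0):
    """True when an observation exceeds the clean baseline by more than
    k standard deviations -- e.g. gigabytes leaving overnight."""
    return outbound_bytes > mu + k * sigma

print(looks_rogue(2.3e6))   # an ordinary hour
print(looks_rogue(4.8e9))   # a multi-gigabyte outbound burst
```

This catches only gross deviations from the learned baseline, which is exactly the answer's point: without labelled attack data you detect "not normal", not "this specific attack".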

Nelson