
I have been looking around the web for large data sets (specifically web-related log files: MySQL, PHP, Apache, and so on) that contain data on attempted intrusions/exploits. I am doing some research on threat intelligence, and I'd like to analyze log files to see what might be different (anomalies, if you will) from the rest of the data.

The data sets I have found online generally aren't geared towards security threats. By large, I mean anything greater than 100MB.

Also, if the data is annotated - great. If it's not, oh well. Anything will help.

Any suggestions would be great!

user0000001
  • 141
  • 2
  • 3
    These log files can only be understood in context. I'm not sure a random dump of logs from an unknown implementation would be helpful. For instance, any connections to my webapp from Elbonia would be considered suspicious, but not so for an Elbonian webapp. You are also relying on the logging level and error handling to be consistent and correctly configured. Other than that, there are known anomalous patterns that would be common to most situations, and you don't need others' datasets for that. – schroeder Jul 22 '15 at 15:41
  • @schroeder Thank you for the response. I understand what you are saying, but the research I'm doing is not limited to what I am able to see. This will be in conjunction with software that has already been written that attempts to serialize the data and distinguish regular data from irregular data. This is why it would be great if the data were annotated. – user0000001 Jul 22 '15 at 15:49
  • 1
    Frankly, I'd host a small web app on a free web service and post the IP/URL in various spots (Pastebin, forums, etc.) and gather the logs from that. I've done the same thing for my honeypot projects. You would have full control over the context, and you can generate valid traffic and behaviour in a controlled manner. – schroeder Jul 22 '15 at 15:54
  • @schroeder Initially this is what I did, without much luck. The data I was receiving was small. A few weeks' worth of data was maybe 100MB in size, and I am looking for data in the 50-100GB range. I do agree, a controlled environment is better. Perhaps I need to work on SEO with the honeypots I currently have. – user0000001 Jul 22 '15 at 16:02
  • I had a similar problem to start with, but my servers got on the "right" hacker lists at some point. – schroeder Jul 22 '15 at 16:04
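
For anyone wondering what the "known anomalous patterns" mentioned in the comments might look like in practice, here is a minimal sketch. It assumes Apache access logs in the common/combined format; the function name `flag_line` and the three signatures are purely illustrative assumptions, not a vetted rule set and not anything the commenters actually used.

```python
import re
from urllib.parse import unquote

# Assumed input: Apache common/combined log format,
#   host ident user [time] "METHOD path PROTOCOL" status size ...
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical signatures for illustration only; a real rule set would be
# far larger and tuned to the application being monitored.
SUSPICIOUS = {
    "path_traversal": re.compile(r"\.\./"),
    "sql_injection": re.compile(r"\bunion\b.+\bselect\b", re.IGNORECASE),
    "script_injection": re.compile(r"<script", re.IGNORECASE),
}


def flag_line(line):
    """Return (host, decoded_path, matched_signature_names) or None if unparseable."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    # Decode percent-encoding so that e.g. %2e%2e%2f is seen as ../
    path = unquote(match.group("path"))
    hits = [name for name, pattern in SUSPICIOUS.items() if pattern.search(path)]
    return match.group("host"), path, hits


if __name__ == "__main__":
    sample = (
        '203.0.113.7 - - [22/Jul/2015:15:41:00 +0000] '
        '"GET /index.php?id=1%20UNION%20SELECT%20password%20FROM%20users HTTP/1.1" 200 512'
    )
    print(flag_line(sample))
```

Signature matching like this only catches what is already known. As the first comment points out, what counts as anomalous beyond that depends on the context of the specific application.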

0 Answers