
I am developing software to monitor network traffic, and I need a database server that is very fast at storing and querying packet header fields. I know there is the .pcap file format, but it is not suitable for me since I am going to store about 10 terabytes of traffic per day. Does some sort of specialized database server for network traffic exist?

Nulik

2 Answers


At that volume you are asking the wrong question. The question you should be asking is: what questions do I need to answer with the information I capture?

From that you can answer the question of storage engines. Do you really need every byte? Do you need it structured to answer ad-hoc questions, or to answer some very structured, specific questions?

Can you shard it across multiple machines, or are you confined to a single system?

Do you need to read and write simultaneously - which will more than double your IOPS - or are those done at separate times? Do you need real-time indexing, or can you build those separately? Do you need indexing at all? On what?

You are talking about storing over 100 MB/s here, but is that reflective of the load? Do you have a bursty stream, or steady-state? Does it matter if you have latency between reception and storage? Do you have to commit in sequence, or can you have out-of-order visibility of data to the query side?
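For context, the sustained rate behind that "over 100 MB/s" figure is easy to check. A quick sketch (using decimal terabytes; the 3x burst multiplier is an illustrative assumption, not a measured number):

```python
# Back-of-envelope: sustained write rate implied by 10 TB of traffic per day.
TB = 10 ** 12                            # decimal terabyte
daily_bytes = 10 * TB
seconds_per_day = 24 * 60 * 60           # 86,400 s

sustained_mb_s = daily_bytes / seconds_per_day / 10 ** 6
print(f"sustained: {sustained_mb_s:.0f} MB/s")   # ~116 MB/s average

# If traffic is bursty (say it peaks at 3x the average), the ingest path
# must absorb spikes well above the mean without dropping packets:
peak_mb_s = sustained_mb_s * 3
print(f"peak (3x burst): {peak_mb_s:.0f} MB/s")
```

The gap between the average and the peak is exactly why the bursty-versus-steady-state question matters: sizing for the mean alone leaves no headroom.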

Anyway, to answer the specific question as well as possible, go look at the various NetFlow storage and analysis tools out there. Those are as close as you are likely to get to a generic answer to this question.

Daniel Pittman
  • NetFlow is pretty much what I am doing myself. Now, any idea what database engine a NetFlow collector uses, for example? – Nulik Feb 05 '12 at 01:16
  • Who cares about that, given that it definitely is not written to capture 10 TB of traffic per day. – TomTom Feb 05 '12 at 07:30
  • Ultimately, I can't, because I don't know the answers to any of the questions I posed above. Specifically, I don't know what questions you need to answer, so I have no idea what engines might suffice. You probably need to take your budget and ask IBM and Oracle if you want a RDBMS, or consider something like VoltDB. If you have more specialised questions, more specialised answers might apply. – Daniel Pittman Feb 05 '12 at 07:57
  • VoltDB seems to be exactly what I was looking for! Thanks! – Nulik Feb 05 '12 at 17:21

Given the volumes of data you're talking about capturing, the source of the data is almost entirely irrelevant.

First, you need to think about how you're going to stream 100MB/s of data across the network to the collection point (or, even better, points, because a distributed system is probably going to be required to handle the load).

Then you need to think about how you're going to architect your database to handle that many incoming records. How are you going to spread the load across multiple disks? How are you going to avoid contention if multiple servers are trying to commit data at once? How much redundancy do you need in order to account for disks failing while you're writing to them, and how will you make sure your system can recover from such a failure without dropping any of the data coming in?
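Spreading load across disks or servers usually comes down to a deterministic partitioning function, so every writer independently agrees on where a record belongs. A minimal sketch (the flow 5-tuple key and the shard count are illustrative choices, not something prescribed by the answer):

```python
import hashlib

def shard_for_flow(src_ip: str, dst_ip: str, src_port: int,
                   dst_port: int, proto: int, num_shards: int) -> int:
    """Map a flow 5-tuple to a shard (a disk or a server) deterministically.

    Every packet of a given flow lands on the same shard, while distinct
    flows spread roughly evenly across num_shards independent write paths,
    which is what avoids multiple writers contending for one disk.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % num_shards
```

A cryptographic hash is overkill for this; it is used here only because its output distributes well regardless of how skewed the input addresses are.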

Then you need to think about how you're going to query the data. Running a query on the same database that's busy trying to append 100MB/s of data to its tables is probably going to cause contention issues. Are you going to do batch processing the next day? If you need realtime analysis, how are you going to handle the extra load it causes without interrupting the writes that are still coming in?
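One common way to keep queries off the hot append path is time-based partitioning: the writer only ever appends to the current partition, and batch queries only ever read partitions that are already closed. A sketch under those assumptions (the `capture_` naming is hypothetical):

```python
from datetime import datetime, timezone

def partition_name(ts: float) -> str:
    """Hour-aligned partition for an epoch timestamp.

    The writer appends only to the partition for the current hour.
    """
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    return t.strftime("capture_%Y%m%d_%H")

def queryable_partitions(all_partitions, now_ts: float):
    """Return only closed (past-hour) partitions for batch queries.

    Reads never touch the partition still being appended to, so they
    never contend with the incoming write stream.
    """
    current = partition_name(now_ts)
    return [p for p in sorted(all_partitions) if p < current]
```

Real-time analysis breaks this neat separation, which is why the answer singles it out: you then need extra read capacity (replicas, a streaming layer, or similar) on the open partition itself.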

You don't need a "specialized database server for network traffic"; you need a specialised high-write-volume database system. Once you've got those challenges sorted out, figuring out the exact schema that's needed to store the data you want will almost be an afterthought.
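To illustrate why the schema really is the easy part: the header fields a NetFlow-style collector keeps typically fit in a small fixed-width record. A hypothetical layout (the field choice and sizes are assumptions for illustration, not from the answer):

```python
import struct

# Hypothetical fixed-width record for a handful of packet header fields:
#   ts (f64), src_ip (u32), dst_ip (u32), src_port (u16), dst_port (u16),
#   proto (u8), tcp_flags (u8), pkt_len (u16)
# "!" = network byte order, no padding => exactly 24 bytes per record.
HEADER_RECORD = struct.Struct("!dIIHHBBH")

def pack_record(ts, src_ip, dst_ip, sport, dport, proto, flags, length):
    """Serialize one packet-header record into its 24-byte wire form."""
    return HEADER_RECORD.pack(ts, src_ip, dst_ip, sport, dport,
                              proto, flags, length)
```

At 24 bytes per packet header, the storage problem is dominated by write throughput and fault tolerance, exactly as the answer argues, not by the record format.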

James Polley