I'm getting ready to set up a server that will be responsible for tracking statistical data from a high-volume traffic source. It will be handling about 6-7 million requests per hour on average, all of which are small GETs. All I need is a simple server setup that can process the parameters of the GET request and write them to a CSV file.

My first thought was to use lighttpd+fastcgi+php, as that's a configuration I'm already familiar with. But, given that I don't get to make these kinds of performance decisions every day, I'd like to explore some other options and see if there might be something even better for this purpose.


7 Answers


You want to do 6-7 million write operations to a CSV file per hour?

Seriously, a database is a better idea. A database is designed to handle concurrent writes, and can be scaled vertically (bigger machine, faster disks) or horizontally (load spread over multiple servers). Writing to a single CSV file (or any file) requires some form of locking to handle concurrency issues, and scales poorly as IO load and concurrency increase.

To work around that you'll probably end up implementing your own caching and buffering layers, then start splitting the load between multiple files, etc, etc. Use some type of database from the outset and save yourself a lot of headaches.

John Dalton

Given that you're going to do about 2000 Requests/sec or 500µs/request on AVERAGE (meaning much higher peaks), CSVs are probably a no-go due to clobbered entries on concurrent writes, since nothing guarantees atomic writes in your files.
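For reference, the arithmetic behind those round numbers can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the function names are made up for illustration, and it says nothing about peak load.

```python
def per_second(requests_per_hour: float) -> float:
    """Average request rate in requests per second."""
    return requests_per_hour / 3600


def budget_us(requests_per_hour: float) -> float:
    """Average time available per request, in microseconds, if handled serially."""
    return 1_000_000 / per_second(requests_per_hour)


if __name__ == "__main__":
    # The question states 6-7 million requests per hour on average.
    for hourly in (6_000_000, 7_000_000):
        print(f"{hourly:,}/hour = {per_second(hourly):.1f} req/s, "
              f"{budget_us(hourly):.1f} µs per request")
```

So the stated average works out to roughly 1,700 to 1,950 requests per second, which "about 2000 requests/sec" rounds up.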

One idea would be per-process/per-writer files which are collected later, another idea would be using a database heavily tuned for high amounts of writes. You could also have a look at Message Queues or Group Communication Protocols (e.g. Spread), but I don't know if they're up for that amount of volume.
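The per-process file idea could look roughly like the sketch below. It is an illustration under assumptions, not a tuned implementation: the file naming scheme, the `record` and `collect` helpers, and the `FIELDS` list (taken from the example URL Tom gives in the comments on another answer) are all hypothetical, and it reopens the file on every write for simplicity, which a real worker would avoid.

```python
import csv
import os
from pathlib import Path

# Hypothetical column order, based on the example tracking URL in the thread.
FIELDS = ["p", "a", "u", "e", "r", "d"]


def writer_path(log_dir: Path) -> Path:
    """Each worker process gets its own file, so no file has two writers."""
    return log_dir / f"track-{os.getpid()}.csv"


def record(log_dir: Path, params: dict) -> None:
    """Append one row to the current process's private CSV file."""
    with open(writer_path(log_dir), "a", newline="") as fh:
        csv.writer(fh).writerow([params.get(name, "") for name in FIELDS])


def collect(log_dir: Path, out_file: Path) -> int:
    """Batch step: concatenate all per-process files; returns the row count.

    out_file must not match the track-*.csv pattern, or it would be read too.
    """
    rows = 0
    with open(out_file, "w", newline="") as out:
        for part in sorted(log_dir.glob("track-*.csv")):
            with open(part, newline="") as fh:
                for line in fh:
                    out.write(line)
                    rows += 1
    return rows
```

Since every file has exactly one writer, no locking is needed while requests are being served; the merge happens later, off the hot path.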

Whatever you do, throw some quick ideas up and benchmark them. Current hardware can do wonders for performance; only optimize when needed. As for PHP, be sure to have an opcode cache installed (e.g. APC), otherwise you'll be burning many cycles on unnecessary recompilation of the scripts.

Also keep in mind what the growth of the service looks like; it hardly makes sense to aim for a solution that is going to be overwhelmed in a few months.

Michael Renner

What sort of parameters are passed through the GET request? Does the data need to be in the CSV/database in real time? Or could you create a dummy HTML (or PHP) file and just have the web logs parsed and dumped into a CSV later as a batch job? (Okay, this sounds convoluted, but it's easy to handle.)

Ram Prasad
  • An example might be like: http://url/track?p=1&a=12&u=en&e=11&r=3433&d=3433 I thought about using web logs and just parsing them to get the data but thought it would be easier to just process it myself as it came in. I'm doubting that now, though. Also, the data doesn't need to be readable while it's being stored. The data will eventually be moved in bulk but 24+ hours after it's collected. – Tom Jun 04 '09 at 04:50
  • This is a good suggestion if you don't need to process the data immediately, and if a static response to the client will be adequate. You can batch process server logs and write that data into a database for later analysis if you need to. – John Dalton Jun 04 '09 at 05:58
  • Server logs need to get written too, and are often synchronized before serving each request for security reasons. That's not necessarily cheap. There's a reason why a major performance practice for e.g. pure filler image servers is to just turn off logging. If you need to do something fast, at least thinking about it and profiling the outcomes yourself is often a safer bet than just using what's there and assuming that built-in functionality is always efficient for your needs. That's not to say that you shouldn't try it first, and if it's fast enough, meh, just use it. ;) – Bernd Haug Jun 04 '09 at 10:26
  • Tom: Then you could (assuming you are on a *nix system): 1. Remove all unnecessary modules from Apache (or lighttpd). 2. Log only the URI (saves space and resources). 3. Use cronolog or logrotate to rotate the log every hour or so (so you don't have to wait for the end of the day to process it). 4. Transfer these logs to another box, and use bash/awk/sed/perl/regex there to parse the logs and get the data into a database for further analysis. – Ram Prasad Jun 05 '09 at 05:38
  • This is very lightweight, and your server could handle more requests. – Ram Prasad Jun 05 '09 at 05:42
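The parsing step in point 4 could be sketched as follows. This assumes the access log has been configured to contain only the request URI, one per line, as in point 2; the `FIELDS` list mirrors Tom's example URL, and the helper names `uri_to_row` and `convert` are hypothetical.

```python
import csv
from urllib.parse import parse_qs, urlsplit

# Column order taken from the example URL in the comments above (an assumption).
FIELDS = ["p", "a", "u", "e", "r", "d"]


def uri_to_row(uri: str) -> list:
    """Turn '/track?p=1&a=12...' into a list of values in FIELDS order."""
    query = parse_qs(urlsplit(uri.strip()).query, keep_blank_values=True)
    # parse_qs maps each name to a list of values; keep the first, default to "".
    return [query.get(name, [""])[0] for name in FIELDS]


def convert(log_lines, out) -> int:
    """Write one CSV row per non-empty log line; returns the number of rows."""
    writer = csv.writer(out)
    count = 0
    for line in log_lines:
        if not line.strip():
            continue
        writer.writerow(uri_to_row(line))
        count += 1
    return count
```

A batch job on the second box could call `convert(open("access.log"), open("out.csv", "w", newline=""))` on each rotated log file; both file names here are placeholders.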

Perhaps this is outside of your control, but is a web server really the right tool for this?

  • I'm getting the traffic from a script tag that's sourced to my server's URL. Is there something else you would suggest for this? – Tom Jun 04 '09 at 03:05
  • There are systems available for collecting high frequency data which can do this sort of thing easily for supported data sources. I'm not aware of an easy way to collect data from a client side script, though. So the suggestions to use a standard DB are probably the most straightforward. – dmo Jun 04 '09 at 04:15

I'd take a look at Windows Server 2008 Web Edition and use ADO.NET to write to the CSV file. You shouldn't have a throughput problem, since ADO.NET will buffer the writes.

Jim B

I don't see how to (even semi-)reliably do this with a single (more-or-less inexpensive) server. If all you ever do is parse GET parameters, your best bet may be to take a high-performance, lightweight open source HTTP server like gatling and hack it to record each request to a fast queue like RabbitMQ.

Then you can have a writer that reads from that queue and writes to the file in a tight loop sequentially.

This way you can make sure that writes are atomic while being able to scale the presumably expensive parts (parsing and queueing) horizontally.

This will certainly be slower in "CPU cycles per request" than having one server just write to a file, but it will stay reliable when the traffic would overwhelm one machine, and you won't even lose data if your final sequential writer gets swamped for a while.
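The shape of that design (many producers, one sequential writer) can be sketched in a single process. To be clear about the assumptions: the standard library's `queue.Queue` and threads stand in here for a real message broker and separate front-end servers, and `writer_loop` and `run_demo` are hypothetical names, so this shows the pattern only and not a deployment.

```python
import csv
import queue
import threading

SENTINEL = None  # pushed once all producers are done, to stop the writer


def writer_loop(q: queue.Queue, path: str) -> None:
    """The only code that touches the file: drain the queue row by row."""
    with open(path, "a", newline="") as fh:
        out = csv.writer(fh)
        while True:
            row = q.get()
            if row is SENTINEL:
                break
            out.writerow(row)


def run_demo(path: str, producers: int = 4, per_producer: int = 100) -> int:
    """Start several producer threads and one writer; return rows enqueued."""
    q = queue.Queue()
    writer = threading.Thread(target=writer_loop, args=(q, path))
    writer.start()

    def produce(producer_id: int) -> None:
        # Stand-in for "parse the GET parameters and enqueue them".
        for i in range(per_producer):
            q.put([producer_id, i])

    threads = [threading.Thread(target=produce, args=(n,)) for n in range(producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    q.put(SENTINEL)
    writer.join()
    return producers * per_producer
```

Because only the writer thread ever opens the file, rows cannot interleave, and a backlog simply accumulates in the queue if the writer falls behind for a while.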

Nota bene: (a) What's intuitively expensive need not be so; write exploratory code and profile it. (b) Are you sure you don't want to ask the fine programming specialists at Stack Overflow? We mostly do systems here.

Bernd Haug

For the web part I would use Nginx (lighttpd is getting older ;)

For the data:

The best approach for this kind of job is something like MapReduce. Hadoop is a free implementation of MapReduce.

Just store the statistics in a simple file and batch-load them into a key/value system like HBase (part of Hadoop).

Then you have a fully redundant (thanks to HDFS) and scalable solution that can handle petabytes of data.