I'm wondering what software the web-scale guys use to monitor the arrays of servers in their server farm(s).

What do Facebook, Twitter, and Digg use? How does Google do it?

I'm looking for a solution to our own monitoring requirements. Our servers sit in the cloud, on AppEngine and EC2. We want to monitor the "application" (which is built from many small services), meaning the end result should be a system that can monitor both response time (plus aliveness and so on) and application correctness: if I do X then Y should happen, then after 2 hours verify that Z was processed and T was appended to the correct log...

The ideal solution would be a system that I can deploy unit tests to: the same unit tests I'm using to test the software while developing.

Recommendations, pointers, and comments are very welcome. I'm looking for directions for attacking this problem.

Thanks, Maxim.

Maxim Veksler
  • They do it two ways. They monitor from the inside: some of that is server performance monitoring (RAM, CPU, IO) and some is instrumentation of the application(s), showing internal bottlenecks. They also monitor from the outside, which sounds like what you're asking about. Monitoring from the outside can be pretty easy: use something like HPOV/HP BAC, or an external hosted service, to run the desired (synthetic) transactions and record the times. Using an internal tool, you could also generate load, against either prod or test targets. – mfinni Sep 21 '10 at 17:37

1 Answer

I watched this a while ago: 'A day in the life of Facebook operations'. They use cfengine2 (deployment), nagios (monitoring), and ganglia (monitoring and trending), plus a lot of in-house tools. It's funny to see some of the tools we use being used at such a massive scale (60,000+ servers).
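For the response-time side of the question, a custom check can be plugged into nagios as long as it follows the plugin convention: print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). Below is a minimal sketch of such a check in Python. The function names and thresholds are hypothetical, and this is not taken from the talk; it only illustrates the convention.

```python
import time
import urllib.error
import urllib.request

# Exit codes defined by the nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(elapsed_seconds, warn_after, crit_after):
    """Map a response time onto a plugin status using two thresholds."""
    if elapsed_seconds >= crit_after:
        return CRITICAL
    if elapsed_seconds >= warn_after:
        return WARNING
    return OK


def check_url(url, warn_after=1.0, crit_after=3.0, timeout=10.0):
    """Fetch the URL once and return (status code, one-line message)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except ValueError as exc:
        # Raised for a malformed URL, which is a usage problem, not an outage.
        return UNKNOWN, f"UNKNOWN - bad URL {url!r}: {exc}"
    except (urllib.error.URLError, OSError) as exc:
        # Covers HTTP error statuses, refused connections, and timeouts.
        return CRITICAL, f"CRITICAL - {url} failed: {exc}"
    elapsed = time.monotonic() - start
    status = classify(elapsed, warn_after, crit_after)
    return status, f"{LABELS[status]} - {url} answered in {elapsed:.3f}s"


def main(argv):
    """Entry point: expects the URL as the only argument."""
    if len(argv) != 2:
        print("UNKNOWN - usage: check_response_time.py URL")
        return UNKNOWN
    status, message = check_url(argv[1])
    print(message)
    return status
```

To use it as a standalone script, end the file with `import sys` and `raise SystemExit(main(sys.argv))`, so the process exit code carries the status back to the monitoring system.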

natxo asenjo