I'd like to set up statsd/graphite so that I can collect metrics from JS apps running on HTML5 devices (i.e. not in a contained LAN environment, and possibly with a large volume of incoming data that I don't directly control).
My constraints:
- entry point must speak HTTP: this is solved by a simple HTTP-to-UDP statsd proxy (e.g. httpstatsd on GitHub; see the sketch after this list)
- must resist the failure of a single server (to battle Murphy's law :)
- must be horizontally scalable: webscale, baby! :)
- architecture should be kept as simple (and cheap) as possible
- my servers are virtual machines
- the data files will be stored on a filer appliance (with NFS)
- I have TCP/UDP hardware load balancers at my disposal
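To make the HTTP entry point concrete, here is a minimal sketch of such a proxy in TypeScript/Node. The listening port, the target statsd address and the "raw statsd lines in the POST body" format are assumptions for illustration, not httpstatsd's actual implementation:

```typescript
import * as http from "http";
import * as dgram from "dgram";

// Assumed target statsd daemon (or one of the sharded instances, see below).
const STATSD_HOST = process.env.STATSD_HOST ?? "127.0.0.1";
const STATSD_PORT = Number(process.env.STATSD_PORT ?? 8125);

const udp = dgram.createSocket("udp4");

// Stateless: any number of these can sit behind the hardware load balancer.
const server = http.createServer((req, res) => {
  if (req.method !== "POST") {
    res.writeHead(405);
    res.end();
    return;
  }
  const chunks: Buffer[] = [];
  req.on("data", (chunk) => chunks.push(chunk));
  req.on("end", () => {
    // Forward the body verbatim; statsd accepts several newline-separated
    // metric lines per UDP datagram. Fire-and-forget: losing the occasional
    // packet is acceptable for metrics.
    udp.send(Buffer.concat(chunks), STATSD_PORT, STATSD_HOST);
    res.writeHead(204);
    res.end();
  });
});

server.listen(8080);
```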
In short, the data path: [client] -(http)-> [http2statsd] -(udp)-> [statsd] -(tcp)-> [graphite] -(nfs)-> [filer]
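From the client side, the first hop of that path could look like the one-liner below. It assumes the proxy sketched above (raw statsd lines in the POST body, a hypothetical metrics.example.com endpoint), not httpstatsd's documented API:

```typescript
// sendBeacon survives page unloads; fetch(url, { method: "POST", body, keepalive: true }) also works.
navigator.sendBeacon("https://metrics.example.com/", "myapp.page.view:1|c");
```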
My findings so far:
- scaling the http2statsd part is easy (stateless daemons)
- scaling the statsd part doesn't seem straightforward (I guess I'd end up with inconsistent values in graphite for aggregate data like sum, avg, min, max ...), unless the HTTP daemons consistently hash the metric keys so that each key is always aggregated by the same statsd instance (see the sketch after this list). Maybe an idea ... (but then there's still the HA question)
- scaling the graphite part can be done through sharding (using carbon-relay; see the config sketch after this list), but that doesn't solve the HA question either. Obviously several whisper instances shouldn't write to the same NFS file.
- scaling the filer part isn't part of the question (but the less IO, the better :)
- scaling the webapp part seems obvious (although I haven't tested it), since the web instances only read the shared NFS data
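To make the statsd sharding idea concrete, here is a minimal sketch of how each http2statsd instance could hash metric names to a fixed list of statsd backends, so that all samples for a given key land on the same aggregator. The backend list is hypothetical, and a plain hash-mod scheme is shown for brevity; a proper hash ring with virtual nodes would move fewer keys when a backend is added or removed:

```typescript
import { createHash } from "crypto";

// Hypothetical statsd backends; every http2statsd instance must use the
// same list in the same order for the sharding to stay coherent.
const BACKENDS = [
  { host: "10.0.0.11", port: 8125 },
  { host: "10.0.0.12", port: 8125 },
  { host: "10.0.0.13", port: 8125 },
];

// Map the metric name (the part before ":") to one backend, so that
// sums/averages/percentiles for that key are computed by a single statsd.
function backendFor(metricName: string) {
  const digest = createHash("md5").update(metricName).digest();
  return BACKENDS[digest.readUInt32BE(0) % BACKENDS.length];
}
```

For the graphite side, carbon-relay has a built-in consistent-hashing mode, and REPLICATION_FACTOR sends each metric to more than one carbon-cache destination, which covers part of the HA concern. The hosts, ports and factor below are placeholders, not a recommendation:

```ini
[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2013
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 2
# pickle receiver ports of the carbon-cache instances (host:port:instance)
DESTINATIONS = 10.0.1.1:2004:a, 10.0.1.2:2004:a
```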
So I was wondering if anyone has experience and best practices to share for a solid statsd/graphite deployment?