Of the folks managing their own clusters (i.e. not using/paying for Amazon Autoscale, Rightscale, Scalr, etc.), how are you managing your instances on EC2 and handling (e.g.) failover? I'm wondering if most folks just end up writing their own boatloads of scripts against the EC2 API, as I suspect.
That's certainly our approach: we've whipped up our own Python Boto-based monitoring/restarting daemon that runs off-site and listens for UDP keep-alives from our instances. On failure, we snapshot volumes, register images, start new instances, delete old volumes, and so on.
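To make the keep-alive part concrete, here is a minimal sketch of the idea, not our actual daemon. It assumes each instance sends its instance ID as the UDP payload; the class name `KeepAliveMonitor`, the port 9999, the 30-second timeout, and the `recover` hook are all made up for illustration. The recovery steps themselves (snapshot, register, launch, clean up) are left as a stub, since that's where the Boto calls would go.

```python
import socket
import time


class KeepAliveMonitor:
    """Remember when each instance last sent a keep-alive."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence before we call it dead
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # instance ID -> time of last keep-alive

    def record(self, instance_id):
        self.last_seen[instance_id] = self.clock()

    def expired(self):
        """Return the IDs that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(i for i, t in self.last_seen.items()
                      if now - t > self.timeout)

    def forget(self, instance_id):
        self.last_seen.pop(instance_id, None)


def recover(instance_id):
    # Hypothetical hook: the snapshot/register/launch/cleanup steps
    # (done with Boto in our case) would go here.
    print("would recover", instance_id)


def serve(monitor, host="0.0.0.0", port=9999, on_failure=recover):
    """Listen for keep-alives forever and fire on_failure for silent instances."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(1.0)            # wake up regularly even with no traffic
    while True:
        try:
            data, _addr = sock.recvfrom(1024)
            monitor.record(data.decode("ascii", "replace").strip())
        except socket.timeout:
            pass
        for instance_id in monitor.expired():
            monitor.forget(instance_id)   # avoid firing twice for one failure
            on_failure(instance_id)

# To run the listener: serve(KeepAliveMonitor(timeout=30.0))
```

Note that an instance only gets tracked after its first keep-alive arrives, so an instance that never comes up at all would go unnoticed by this sketch.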
Every so often, when hacking on our scripts, I think there must be some open-source tools out there that already deal with these issues and that don't have the constraints of (say) Scalr, but I always come back from Google empty-handed. (Things like Scalr are pretty limited in the set, versions, and configurations of software they support, and have specialized and IMO cumbersome ways of manipulating these setups.)
Also, the Linux-HA/Pacemaker ecosystem (Heartbeat, ldirectord, etc.) sounds like it isn't really suited for EC2. (But then I found this, though I'm not sure it's really a high-quality solution.)