This is a kind-of-recurring question, yet the closest one I could find was asked 7 years ago, which was pretty much a different time.
I run a small business and we host multiple small-to-medium client websites (nothing that has ever required more than a few 1 GB DigitalOcean droplets). The current solution (ad-hoc scripts and emails) is starting to show its limits, especially with the business currently growing quickly.
Business problem
Thus, I need to build a new solution. Maybe not all at once, but I certainly don't want to have to redo everything later. The requirements I could think of:
- Simple. Simple. Simple. I don't have staff, I don't have time, I don't digest bullshit well. I'm ready to allocate the resources it needs but not more.
- No SaaS. Over the past few years I've used a lot of SaaS products, and they all eventually get more expensive, discontinue their service, or get bought and then disappear altogether. SaaS is a risk I don't want to take anymore.
- Ultimately, I only care about simple things:
  - Is my site responding without errors and fast enough?
  - Is my site getting overloaded?
  - Are any of my disks getting full?
- There is an automated deployment system based on Ansible; it should be able to take care of configuring the monitoring/alerting for each site.
- I want the person in charge to be woken up at 4am by all possible means if, and only if, it is useful.
- All incidents/issues should be tracked somewhere and easy to move around (something like JIRA boards)
- All data should be stored somewhere for me to check later, including HTTP logs, from which I want to be able to compute stats such as finding the slow or error-prone pages.
- I have dozens of (Debian) servers and need to centralize all information about them
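To make the three "simple things" above concrete, here is a minimal sketch of the checks I have in mind, using only the Python standard library. The function names and every threshold (2 seconds, 1.5 load per CPU, 90% disk usage) are placeholders I made up for illustration, not values taken from any particular tool, and `os.getloadavg` only exists on Unix-like systems.

```python
import os
import shutil
import time
import urllib.error
import urllib.request


def check_http(url, timeout=5.0, max_seconds=2.0):
    """Return (ok, detail). ok is True if the URL answers without error, fast enough."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:  # 4xx/5xx answers end up here
        return False, f"HTTP {exc.code}"
    except OSError as exc:  # DNS failure, refused connection, timeout...
        return False, f"unreachable: {exc}"
    elapsed = time.monotonic() - start
    if elapsed > max_seconds:  # assumed threshold for "fast enough"
        return False, f"slow: {elapsed:.2f}s"
    return True, f"HTTP {status} in {elapsed:.2f}s"


def check_load(max_per_cpu=1.5):
    """Return (ok, detail) based on the 1-minute load average per CPU."""
    load1, _, _ = os.getloadavg()
    per_cpu = load1 / (os.cpu_count() or 1)
    return per_cpu <= max_per_cpu, f"load {per_cpu:.2f} per CPU"


def check_disk(path="/", max_used_fraction=0.9):
    """Return (ok, detail) based on how full the filesystem holding `path` is."""
    usage = shutil.disk_usage(path)
    used = usage.used / usage.total
    return used <= max_used_fraction, f"{path} {used:.0%} used"
```

Something like `check_http("https://example.com/")` (a placeholder URL) would run from the monitoring side, while `check_load()` and `check_disk("/")` would run on each server. Whatever stack I end up with should cover at least this much without me maintaining such scripts myself.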
Research I have done
To that end, I dug around the internet and found a zillion tools that you could plug into each other in circles if you wanted to.
- ELK stack (and "Beats"). Seems perfect to collect & store logs/metrics. You can have pretty dashboards and look at your data, but that's about all you can do.
- X-Pack. Seems to be the perfect thing to go with ELK but looks like it comes with a thick sugar coat around a nice bullshit cake. Plus, a "subscription" model with no published price probably means it's overpriced.
- Shinken/Nagios/Zabbix are the original contenders but are boring and complex, and would require custom code and band-aids all over to work with ELK.
- Riemann looks like an excellent framework to trigger alerts but not to manage them afterwards. Plus you have to write everything yourself. And I'm not sure where to plug it in (I wouldn't want to have several probes measuring the same thing). Probably too complex for me.
- ElastAlert might be a good idea but doesn't seem to come with an actual way of managing alerts
- bosun looks like it is a bit more mature and complete than ElastAlert but with the same drawbacks and possibly more complex config
- openduty sounds interesting but is apparently too immature to be considered viable
- cabot makes nice promises, and is made and used by a company that allocates people to write documentation for it, so it's probably not going to die (though the activity looks a bit faint)
- And of course, there is Prometheus, Grafana, Graylog, Fluentd and probably countless others.
Steps taken so far to solve it
My current understanding of the situation is that I need 2 tools (well, stacks):
- One that collects, stores and allows queries on logs and metrics. That's what is going to get me business stats, post-mortem analysis, debugging insights and so on. ELK seems to be the perfect candidate for that.
- One that constantly analyses the data in order to find irregularities and trigger alerts. Now that's much less clear. I'd go for Cabot, which seems simple and extensible.
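As an illustration of the kind of query I expect the first stack to answer ("which pages are slow or error-prone?"), here is a rough sketch in plain Python. It assumes an access log in the usual combined format with the request duration in seconds appended as the last field (for instance nginx's `$request_time`); that log format, the `page_stats` name and the choice to count only 5xx responses as errors are my own assumptions.

```python
import re
from collections import defaultdict

# Assumed line shape: ... "GET /path HTTP/1.1" 200 512 "-" "agent" 0.123
LINE_RE = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) .* '
    r'(?P<seconds>\d+(?:\.\d+)?)$'
)


def page_stats(lines):
    """Aggregate hits, error rate and mean response time per URL path."""
    totals = defaultdict(lambda: {"hits": 0, "errors": 0, "seconds": 0.0})
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        path = match["path"].split("?", 1)[0]  # ignore the query string
        page = totals[path]
        page["hits"] += 1
        if int(match["status"]) >= 500:  # assumption: only 5xx count as errors
            page["errors"] += 1
        page["seconds"] += float(match["seconds"])
    return {
        path: {
            "hits": p["hits"],
            "error_rate": p["errors"] / p["hits"],
            "mean_seconds": p["seconds"] / p["hits"],
        }
        for path, p in totals.items()
    }
```

Sorting the result by `mean_seconds` or `error_rate` gives the list of pages to look at first. With dozens of servers I would obviously rather have the log stack do this than run scripts by hand, which is why I'm looking at ELK.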
Actual question
Do my requirements make sense? If so, am I right to seek those two tools (one for log storage/access and one for alert management)? If so, are my choices up to the task or do you recommend something else?
Not the question
I'm not asking out of thin air which monitoring solution is the best. I'm simply stating my problem and my proposed solution, and I'd like either a confirmation of it or pointers to where it fails.
Thanks everyone!