Following up on the previous post in our Monitoring series, we want to address the issue of detection and alerting.
Many things can go wrong, from the application level down to the system and even the hardware level. While everything deserves consideration (to some degree), not all events and issues should be treated equally.
We are obviously opinionated in this post: the issue at stake is the detection of problems in our systems. A monitoring solution should offer much more than error detection; it can also be used to gather any metric that is valuable from a business standpoint. Such business logic is not addressed here, and while it could be built in-house, it is worth considering third-party providers that are experts in the area, such as MixPanel or KissMetrics.
Let’s cut to the chase and talk about error detection and alerting logic.
Classifying errors is hard but essential. One way to do it is to separate them into soft and hard errors: hard errors may need to trigger a notification on the spot, while soft errors may require some buffering and multiple checks first.
Note that this classification is in no way related to the criticality of the error. It is up to the thresholding component to decide whether an error is meaningful and should be reported.
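As a minimal sketch of this split (class and parameter names are ours, not from any particular tool): hard errors fire immediately, while soft errors are buffered and only reported once they recur enough times within a sliding window.

```python
from collections import deque
from enum import Enum


class Severity(Enum):
    SOFT = "soft"  # buffer and re-check before alerting
    HARD = "hard"  # alert on the spot


class ErrorClassifier:
    """Buffers soft errors; reports them only once they repeat
    enough times within the recent-events window."""

    def __init__(self, soft_threshold=3, window=10):
        self.soft_threshold = soft_threshold
        self.recent = deque(maxlen=window)  # last N soft-error names seen

    def should_alert(self, error_name, severity):
        if severity is Severity.HARD:
            return True
        self.recent.append(error_name)
        # Soft errors fire only after repeated occurrences.
        return self.recent.count(error_name) >= self.soft_threshold


classifier = ErrorClassifier(soft_threshold=3)
print(classifier.should_alert("disk-latency", Severity.SOFT))  # False: first occurrence
print(classifier.should_alert("oom-kill", Severity.HARD))      # True: hard errors fire now
```

Again, soft/hard here says nothing about how critical the error is; the thresholding step decides whether it is worth reporting.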
Many projects and products specialize in capturing data:
From what we have experienced, there is no best solution; each has pros and cons, but it is worth considering the following when choosing your monitoring system:
Some monitoring products are all-in-one solutions: on top of data gathering they also provide proper management interfaces, alerts / notifications, and escalation, but their tuning can become quite difficult in non-expert hands.
Based on the captured metrics, one needs to be able to set various thresholds, alert types, and priorities; sometimes even cross-metric thresholds are needed for state-of-the-art monitoring.
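To make the cross-metric idea concrete, here is a sketch with invented metric names and threshold values: elevated latency alone may be tolerable, but elevated latency combined with a raised error rate is a much stronger signal.

```python
def latency_alert(metrics):
    # Single-metric threshold: a predicate over one reading.
    return metrics["p95_latency_ms"] > 500


def cross_metric_alert(metrics):
    # Cross-metric threshold: fire only when two signals degrade together,
    # each at a lower bar than it would need to trip on its own.
    return metrics["p95_latency_ms"] > 300 and metrics["error_rate"] > 0.05


metrics = {"p95_latency_ms": 350, "error_rate": 0.08}
print(latency_alert(metrics))       # False: latency alone is within bounds
print(cross_metric_alert(metrics))  # True: combined degradation
```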
Some of the projects mentioned above provide triggers that are "easy" to set up, some are more complex, and some simply do not support alerting and thresholds at all, so you need to integrate with other tools like Cabot, Graphite, etc.
As for the notification part, several blog posts already provide a nice landscape of best practices (I can’t find them - sorry - but happy to add them if a fellow reader suggests a few links).
What to do and when to do it is at stake here: you do not want your monitoring system to relentlessly call an on-call engineer at 3am for an issue that is not critical and can be fixed the next morning. The response must be proportionate to the criticality of the alert.
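One simple way to keep the response proportionate is a routing table from criticality to channels, with an off-hours rule that only the highest level may wake anyone up. The levels, channels, and hours below are illustrative, not prescriptive:

```python
from enum import Enum


class Criticality(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical routing table: which channels to use per criticality level.
ROUTES = {
    Criticality.LOW: ["email"],
    Criticality.MEDIUM: ["email", "chat"],
    Criticality.HIGH: ["sms", "phone"],
}


def channels_for(criticality, hour):
    # Outside working hours, only HIGH alerts get to wake someone up;
    # everything else can wait in an inbox until morning.
    if criticality is not Criticality.HIGH and not (9 <= hour < 18):
        return ["email"]
    return ROUTES[criticality]


print(channels_for(Criticality.MEDIUM, hour=3))  # ['email']: no 3am call
print(channels_for(Criticality.HIGH, hour=3))    # ['sms', 'phone']
```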
Communication channels can range from a simple email, to an IM notification, to SMS / phone calls, to … you name it (a boat horn in the office?).
Escalation is also an important feature, as is acknowledgement of issues. PagerDuty, previously mentioned, is good at defining such workflows. You may not want to keep receiving a Twilio SMS every 2 minutes when you are already working on the issue.
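The acknowledgement workflow can be sketched as an incident that re-notifies on an interval until someone acknowledges it, after which it goes quiet. This is a toy model of the behavior, not any product's actual API:

```python
import time


class Incident:
    """Re-notifies on a fixed interval until acknowledged, then goes quiet."""

    def __init__(self, repeat_every=120):  # seconds, e.g. an SMS every 2 min
        self.repeat_every = repeat_every
        self.acknowledged = False
        self.last_sent = None

    def acknowledge(self):
        # An engineer has taken ownership: stop nagging them.
        self.acknowledged = True

    def should_notify(self, now=None):
        if self.acknowledged:
            return False
        now = time.time() if now is None else now
        if self.last_sent is None or now - self.last_sent >= self.repeat_every:
            self.last_sent = now
            return True
        return False


incident = Incident(repeat_every=120)
print(incident.should_notify(now=0))    # True: first notification
print(incident.should_notify(now=60))   # False: too soon to repeat
print(incident.should_notify(now=120))  # True: interval elapsed
incident.acknowledge()
print(incident.should_notify(now=240))  # False: someone is on it
```

A real escalation policy would add a second timer: if nobody acknowledges within some window, notify the next person in the rotation.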
To summarize, a good monitoring system should be able to:
See you soon for the next post!