I think it’s safe to assume we all agree that you can’t run a reliable production system without a decent alerting layer. For me, it’s one of the main reasons I can sleep calmly at night — I know that if anything goes wrong, our on-call engineer will be alerted, and the system won’t be in an unusable state until someone magically notices it in the morning.
That being said, why am I certain our alerting stack works? In an ever-changing dynamic world, where our system is constantly evolving and being upgraded, and our infrastructure is being migrated, changed, and updated under our feet, how can I be sure our alerts still work?
This is the story of how some of our alerts stopped working, the production mess it caused, the concepts we’ve used to regain confidence in our alerting stack, and the lessons learned on making reliable, forward-compatible alerts.
The Ever-Changing Environment Breaking Our Alerts
The system at hand is a very simple Kafka processing pipeline that can be seen below.
We’re consuming messages from several topics, transforming them to our canonical data type, and then writing them to our system’s internal data store.
Our story starts with one of our users complaining about partial data loss — they were seeing only some of their data in our systems. After a short investigation and escalation, our on-call engineer noticed that our Kafka DLQ (Dead Letter Queue — the topic to which we send the messages our main system failed to process) had millions of messages in it.
It took us a few minutes to figure out that one of our producers introduced a breaking change into the schema we were expecting, making our consumers incapable of parsing messages — which in turn sent the messages to the DLQ.
To paraphrase the common saying: it’s not always DNS, sometimes it’s schema evolution.
But wait — the schema change was made 10 days ago. The DLQ had been non-empty for 10 days; we’d had a production issue raging for 10 days!
WHAT HAPPENED TO OUR ALERTS?????? WHY DIDN’T THEY TRIGGER?!?!?
It took us another few minutes to understand that our alert was stale: a recent Kafka cluster migration had changed the cluster name, which in turn changed the cluster label of our metrics in Prometheus — the alert was querying the wrong label values.
This is obviously easy to fix by setting the correct cluster name, but this got me thinking — what about all of our other alerts? Do they still work, or did they decay too?
Since we are fortunate enough to have very stable systems, those alerts (if they still work) should trigger only rarely, so we had no proof of their health.
My First Production Drill — Gaining Confidence Back in Our Alert Suite
After fixing that one alert, we scratched our heads as to how to regain our confidence in our alert suite — we no longer trusted it to be reliable.
The first solution we came up with was, obviously, to review the alerts (we manage them as code in a GitHub repo), but this is very manual and error-prone, and the confidence gained is limited.
The second suggested solution was for the team to arrange a production drill — to trigger all of our alerts to assert, in production, that they indeed work!
Since we wanted to test the alerts as they are, we couldn’t change the alert queries, and we didn’t want to cause actual production outages (surprising, right?). The chosen course of action was therefore to set the alert thresholds to the values reported by our systems in normal operation — i.e. to make the alerts trigger under regular load.
For example, let’s take the DLQ alert previously mentioned — we just set the threshold to -1. Even an empty queue has more than (-1) messages.
We did the same for throughput (larger than 0), CPU/MEM utilisations (larger than 0), error rate (larger than -1), etc.
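As a concrete sketch of what such a drill PR changes — a hypothetical rule with illustrative metric and label names, where only the threshold moves:

```yaml
# Hypothetical DLQ alert; only the threshold is changed for the drill.
- alert: DLQNotEmpty
  # Normal rule: fire when the DLQ holds any messages ( > 0 ).
  # Drill rule: > -1 fires even when the queue is empty, proving the
  # query, routing, and paging still work end to end.
  expr: sum(kafka_dlq_message_count{topic="ingestion-dlq"}) > -1
  for: 5m
  labels:
    severity: page
```

Reverting the PR restores the real threshold once the page arrives.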
We created a pull request for each service, and waited for the PagerDuty notifications :)
(If you ever try this at home — don’t forget to change the escalation policies not to trigger your backups, to stay on good terms with them ;) )
This process took us about an hour, during which we found 4 more broken alerts and another 4 that were no longer relevant. One hour later, we had regained our confidence in our alert suite.
Lessons Learned — Alert Overfitting
The root cause of our alert decay was “overfitting” of the alerts — i.e. querying labels that are too specific, such that any change (and our system is constantly changing) would break the alert.
Let’s review 3 examples of such overfitting:
Example 1: Fitting to a Specific Kafka Cluster
Circling back to our original DLQ example — we were querying the cluster name in the alert:
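Schematically, the overfitted rule looked something like this (a sketch; the metric and label values here are illustrative, not our actual names):

```yaml
# Overfitted: the hard-coded cluster label silently broke when the
# cluster was migrated and renamed.
- alert: DLQNotEmpty
  expr: sum(rate(kafka_topic_partition_current_offset{cluster="kafka-prod-1", topic="ingestion-dlq"}[5m])) > 0
  for: 10m
```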
While this topic exists in only one cluster, the cluster name is unnecessary here and not forward-compatible. So the updated alert is:
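A sketch of the fixed rule, with the same illustrative names, minus the cluster label:

```yaml
# Forward-compatible: the topic alone identifies the DLQ, so cluster
# renames and migrations no longer break the alert.
- alert: DLQNotEmpty
  expr: sum(rate(kafka_topic_partition_current_offset{topic="ingestion-dlq"}[5m])) > 0
  for: 10m
```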
Example 2: Strict Kafka Topic Names
Another good example is alerting on Kafka consumer lag (i.e. whether a consumer starts to lag behind the messages in the topic):
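A sketch of such a strict, per-topic rule (assuming a kafka_exporter-style `kafka_consumergroup_lag` metric; all names and thresholds are illustrative):

```yaml
# Overfitted: renaming the topic silently drops it from alert coverage.
- alert: ConsumerLagHigh
  expr: sum by (topic) (kafka_consumergroup_lag{topic="ingestion-events"}) > 10000
  for: 15m
```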
Here, any topic name change is not forward-compatible and will cause us to lose alert coverage. So in these cases, we decided to use regular expressions to make sure new topics will also be covered by the alert (obviously, we need to be very conscious here not to cast too wide a net):
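The regex variant might look like this (again, illustrative names; keep the pattern as narrow as the naming convention allows):

```yaml
# The regex covers current and future topics sharing the naming
# convention, instead of one hard-coded topic name.
- alert: ConsumerLagHigh
  expr: sum by (topic) (kafka_consumergroup_lag{topic=~"ingestion-.*"}) > 10000
  for: 15m
```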
Example 3: Redundant k8s Namespaces
Our k8s cluster is divided into namespaces by our team structure — i.e. each team has its own namespace. In some cases, alerts were querying the namespace in addition to the service:
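Schematically, with illustrative metric, namespace, and service names:

```yaml
# Overfitted: pins the service to the namespace of the team that
# currently owns it.
- alert: IngesterHighErrorRate
  expr: sum(rate(http_requests_total{namespace="team-data", service="ingester", status=~"5.."}[5m])) > 5
  for: 10m
```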
Those alerts are not forward-compatible when services are handed over to other teams (and migrated to other namespaces). Similarly to the first example, the namespace filter here is redundant, since the service exists in only one namespace.
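The forward-compatible version simply drops the namespace matcher (same illustrative names):

```yaml
# Forward-compatible: the service label alone is enough, so a namespace
# move during a team handover doesn't break the alert.
- alert: IngesterHighErrorRate
  expr: sum(rate(http_requests_total{service="ingester", status=~"5.."}[5m])) > 5
  for: 10m
```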
Generally, we learned the hard way that alerts should be just strict enough to catch only the relevant covered case, but loose enough to keep covering it as our system evolves and changes.
Is This Drill a One-Time Event?
This alert drill was quite a manual process — creating a PR to change the alerts, then manually checking that they were triggered. How do we scale this?
Firstly — it’s important to admit we’re not planning on scaling this. We do intend to rerun such drills after major infrastructure or architectural changes; we feel comfortable enough with smaller “day-to-day” changes.
We did think, though, about how we could scale this if we decide to — and came up with two possible alternatives:
- Periodic manual drills as part of onboarding a new dev to the on-call rotation — killing two birds with one stone: increasing confidence in our alerts, and giving the on-call-to-be a chance to experience alerts in a safe environment
- Automation — creating a PR and checking for PagerDuty alerts can be automated with enough tooling and infrastructure. We don’t think we need this at this point, due to the large effort required to create such tooling from scratch. But to leave this door open, we’re considering adding a “drill threshold” parameter/configuration to our alerts to make the PR creation safer (i.e. the PR would simply apply the “drill threshold” to the alert)
I’d be happy to hear how you’re doing this in your companies, on your infrastructure!
Although the production drill we ran was still a manual process, it gave us real confidence that our alert suite works! Moreover, this process can be automated if we decide to invest the time (i.e. merge a PR that lowers the alert thresholds, then check via API that the alerts were triggered, and finally revert the PR).
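A minimal sketch of that verification step, assuming the triggered incidents have already been fetched (e.g. from the PagerDuty REST API) into plain dicts; all names here are hypothetical:

```python
def missing_alerts(expected: set[str], incidents: list[dict]) -> set[str]:
    """Return the expected alert names for which no incident was triggered."""
    titles = [inc["title"] for inc in incidents]
    # An alert passes the drill if some incident title mentions its name.
    return {name for name in expected if not any(name in t for t in titles)}

# Example: two alerts fired during the drill, one never did.
incidents = [
    {"title": "[DRILL] DLQNotEmpty on ingestion-dlq"},
    {"title": "[DRILL] HighErrorRate on ingester"},
]
expected = {"DLQNotEmpty", "HighErrorRate", "ConsumerLagHigh"}
print(missing_alerts(expected, incidents))  # → {'ConsumerLagHigh'}
```

The automation would then revert the threshold PR and open an issue for every alert in the returned set.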
We learned a hard lesson about alert overfitting, and we did some knowledge sharing around the alert suite — now even newly onboarded team members are familiar with what we’re alerting on, and how the alerts “feel” in production.
Bottom line — we see a huge ROI on the hour invested in this drill (+ it was really quite fun to execute with the team). If you have an active production system — I really encourage you to try and do it yourselves!
If you are running such production drills or automated them, I’d be super happy to hear how you’ve done it on Twitter! I’m @cherkaskyb, don’t be shy.