Production systems are constantly changing. It can be new code being deployed, a change in usage pattern caused by onboarding a new customer, or a simple degradation in performance due to worldwide network issues.
Those changes are inevitable, so our systems should always be monitored, and we, as their owners, should be alerted on any degradation.
A few weeks ago I got into an argument with one of my colleagues about how to implement an alert that triggers when one of our Kafka consumers is lagging behind. My colleague wanted to implement it using a derivative; I wanted literally any other solution, but wasn't able to convince him otherwise.
It took me almost two weeks to arrange the thoughts and arguments in my head into a coherent case for why we should avoid, or at least look for alternatives to, using derivatives in alerts. This post is my public argument.
Throughout the article we'll look at some Prometheus metrics and Grafana charts, but no knowledge of those systems is required.
An important disclaimer before we dive in: in no way am I saying derivatives on metrics are the wrong way to go. In my opinion, though, they are overly complex, while simpler alternatives exist.
What are derivatives in the context of monitoring?
For those of you who, like me, only vaguely remember Calculus 101, let's recap what a derivative is. From Wikipedia:
…The rate at which the value y of the function changes with respect to the change of the variable x.
And in the context of monitoring, the derivative of a metric is the rate at which the metric's value changes. Since most monitoring is done over time, the derivative of a metric usually means the rate of change of its value over time.
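To make that concrete, here's a minimal sketch (with made-up sample data, not a real Prometheus implementation) of what the derivative of a metric means numerically: the change in value between consecutive samples, divided by the elapsed time.

```python
def rate_of_change(samples):
    """Approximate the derivative between consecutive (timestamp, value) samples."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rates.append((v1 - v0) / (t1 - t0))  # change in value per second
    return rates

# Hypothetical lag samples taken every 60 seconds: (unix_timestamp, lag_in_messages)
samples = [(0, 100), (60, 160), (120, 400), (180, 400)]
print(rate_of_change(samples))  # [1.0, 4.0, 0.0]
```

Note how the last interval, where the lag is stuck at 400 messages, yields a derivative of 0; this detail will come back later.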
Now that we remember what derivatives are, without further ado, here are my top 4 reasons why NOT to use derivatives in alerts:
1. Derivatives are not tangible
Let’s say we use the following alert in Prometheus:
deriv(kafka_group_topic_lag[10m]) > 5

An example of the kafka_group_topic_lag metric, which we'll use throughout the post, looks something like this in Grafana:
As mentioned, our goal is to be alerted when our consumer is lagging behind. So let's say it is lagging behind, the alert is triggered, and you are woken up in the middle of the night (as the on-call for the service) to see said alert.
Do you know what a deriv value of 5 means? Can you picture it? Can you assess how bad things are? Is it a catastrophe? Or is only a deriv of 20+ a catastrophe?
This brings me to the first point: in many cases the derivative is not tangible! It's hard for us to "feel" it.
In contrast, with absolute values we understand what we're dealing with! With an alert of

kafka_message_lag[10m] > 100

we easily understand that we have a lag of 100 messages! What is the state of the topic with the deriv-based alert? I challenge you to figure it out.
By the way, this is how it looks (the green chart is the absolute value of the lag; the yellow one is the derivative):
2. Derivatives are hard to set alerts for
This may not be an argument by itself so much as an extension of the previous one: what counts as a troublesome derivative? A derivative of 500,000 is probably a catastrophe, but a derivative of 7? Honestly, I don't know. How will you set the alert threshold?
This is the chart of the derivative; please help me figure out where everything goes sideways:
3. Derivatives are usually not shown in dashboards
Have you ever seen a chart of deriv(metric) anywhere in your monitoring dashboards? We usually put the metrics themselves in charts and dashboards (requests per second / CPU usage / messages in queues).
So if you've never seen it, do you really expect you'd know how to interpret it correctly in the middle of the night, woken up by an alert? Oh, you do? So what is the topic lag / state of the system given the next chart?
Derivatives are second-class citizens when it comes to visualization and monitoring. We're not used to seeing them, so when we do see them on a chart, it "feels weird" to us.
4. Derivatives show change, not a steady (bad) state
Derivatives show change: they show the deviation from a steady state (note this is the "mathematical" steady state, which might differ from the "system" steady state). If the monitored metric is in a steady bad state, a derivative will probably miss it.
Let's examine the chart from the first point and zoom in on the timeframe from 14:30 onward. Same as before:
- The green line is the absolute lag (left Y-axis)
- The yellow line is the lag derivative (right Y-axis)
You probably see it: there is a constant lag of at least 400 messages (not saying it's good or bad, but it's there), while the average derivative (in this case over 5 minutes) stays almost constant around zero.
Therefore an alert on derivatives will trigger only on extreme one-way changes, not on a steady bad state or on a periodic bad-good state change.
Even if we scale the aggregation window down to 1–2 minutes, the derivative chart still looks very periodic, jumping between positive and negative values:
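This blind spot is easy to demonstrate with a sketch (again with made-up numbers): a consumer stuck at a constant lag of 400 messages has a derivative of roughly zero, so a derivative-based alert stays silent while an absolute-value alert would fire.

```python
def deriv(samples):
    """Rate of change between consecutive (timestamp, value) samples."""
    return [(v1 - v0) / (t1 - t0) for (t0, v0), (t1, v1) in zip(samples, samples[1:])]

# Ten minutes of samples, one per minute, all stuck at a lag of 400 messages.
steady_bad = [(t, 400) for t in range(0, 600, 60)]

derivative_alert = any(d > 5 for d in deriv(steady_bad))  # mirrors deriv(...) > 5
absolute_alert = any(v > 100 for _, v in steady_bad)      # mirrors lag > 100

print(derivative_alert, absolute_alert)  # False True
```

The steady bad state is invisible to the derivative condition but obvious to the absolute one.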
How to better alert on “change” or trend?
First and foremost, if possible: KISS (keep it stupid simple). If you know your "point of degradation", alert on absolute values. For example, when we know the steady state is a lag of 400–800 messages, we can safely assume that a lag of 1,200 messages is a degradation in performance. And if it's not 1,200, it might be 4K or 20K, but we can derive the absolute value at which to declare our performance degraded from previous experience, our required level of service, and other use cases and customers.
When absolute values are unknown, or when by the time the system reaches its degraded state it's too late and we want to catch the trend during the degradation, we can use a comparative metric, i.e. compare current values to the values of the last 10 minutes/hour/day/week.
Let’s take the same chart from the first point, and calculate the 20-minute lag difference using this query:
avg(kafka_messages_in_topic) - avg(kafka_messages_in_topic offset 20m)
It's now much easier to identify the point of degradation! Moreover, going back to my second argument, setting the threshold is much easier now, since we're back to absolute values (of the difference, though). And going even further back to my first argument, the metric is tangible: an alert with a threshold of 400 means a lag of 400 messages was accumulated in the last X minutes.
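The offset-comparison idea can be sketched in a few lines. This is a simplification with hypothetical numbers, not how Prometheus evaluates offset internally: subtract the metric's value from 20 minutes ago from its current value.

```python
def offset_difference(series, offset_steps):
    """series[i] is the metric value at step i; compare each point to offset_steps back."""
    return [now - past for past, now in zip(series, series[offset_steps:])]

# One hypothetical sample per minute: steady at 400, then climbing to 1200
# over the last 20 minutes.
lag = [400] * 20 + [400 + 40 * i for i in range(1, 21)]
diffs = offset_difference(lag, 20)

# Alerting on the difference is tangible: did 400+ messages accumulate in 20 minutes?
print(any(d > 400 for d in diffs))  # True
```

The threshold reads directly in messages, which is exactly what makes this alert easier to reason about at 3 AM.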
And when we put the two charts together (difference and absolute value), it is apparent that we have better visibility into the issue:
And as a bonus, even the scale is the same, so the right Y-axis is no longer necessary.
What is being done with derivatives?
PID controllers and anomaly detection tools do use derivatives to create amazing results, but those are highly complex systems.
Derivatives can be combined, with some weight factor, with the absolute value to generate a "stability estimation" metric (current state + change), but this also drifts away from the point of a simple observability solution.
They can be used to alert on extreme trend changes, as long as the traffic pattern is not periodic in short time intervals.
Lastly, in systems where the traffic pattern is really unknown, we can use a non-negative (or non-positive) derivative over time to estimate whether our trend is generally increasing over time, even when the absolute values are unknown.
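A rough sketch of that last idea, with hypothetical data: when absolute values mean little, checking that the derivative is almost never negative over a window tells us the metric is generally trending upward. The 0.75 fraction here is an arbitrary illustrative threshold.

```python
def deriv(samples):
    """Rate of change between consecutive (timestamp, value) samples."""
    return [(v1 - v0) / (t1 - t0) for (t0, v0), (t1, v1) in zip(samples, samples[1:])]

def generally_increasing(samples, min_fraction=0.75):
    """True when at least min_fraction of intervals have a non-negative derivative."""
    rates = deriv(samples)
    non_negative = sum(1 for r in rates if r >= 0)
    return non_negative / len(rates) >= min_fraction

# Noisy but growing metric: one small dip among mostly rising samples.
noisy_growth = [(0, 10), (60, 30), (120, 28), (180, 50), (240, 80), (300, 110)]
print(generally_increasing(noisy_growth))  # True
```

This tolerates short dips while still flagging a sustained upward trend, without requiring us to know what "too big" means in absolute terms.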
I don’t like being woken up in the middle of the night by metrics I don’t fully understand, let alone ones I have a hard time simulating in my half-asleep mind. From my personal experience, we have alternative solutions to most of the useful things we can do with derivatives, and therefore the cons of using derivatives in alerts outweigh the pros.
One final note, a colleague who reviewed this article pointed me to this cool talk about Kafka-lag alerting.
As always, comments are welcome at @cherkaskyb on Twitter. I would also love to hear what you're doing with derivatives in your monitoring stack.