I Don’t Alert on Apdex. It Confuses Me
There are many ways to build your observability and alerting posture for your production systems. Whether you build it yourself, use open source tools, or leverage one of the many observability vendors out there, you’ll have to make decisions: what to monitor, what to alert on, and how.
It’s common to use out-of-the-box mechanisms offered by vendors, or open standards. One of those standards is the Apdex (Application Performance Index), which is used to monitor application performance.
This is my personal take on something that is considered standard that I just don’t understand. So here we go — the Apdex, what it is, and why I don’t use it!
What is an Apdex
The Apdex is a single number that tries to represent the user’s “satisfaction” with the system.
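It is a standard for measuring performance. It looks at systems from the end user’s perspective, sorting the requests served by a system into three buckets (Satisfied / Tolerated / Frustrated) and weighting them all into a single numerical value called “the Apdex score”.
It does that by defining a threshold called the “T value”, which is the maximal latency that is considered “good”, and the bucketing works as follows: requests served within T are Satisfied, requests that take between T and 4T are Tolerated, and requests that take longer than 4T are Frustrated. The score is the number of Satisfied requests plus half the number of Tolerated requests, divided by the total number of requests.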
It is also implicitly inferred that failed requests are considered Frustrated (regardless of latency).
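To make the definition concrete, here is a minimal Python sketch of the score calculation. It assumes the standard Apdex formula (Satisfied requests plus half of the Tolerated ones, divided by the total, where Tolerated means slower than T but within 4T) and counts failed requests as Frustrated. The `Request` type, the `apdex` function and the sample data are made up for illustration and are not taken from any vendor's implementation.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_s: float      # response time in seconds
    failed: bool = False  # True if the request ended in an error


def apdex(requests, t):
    """Return the Apdex score for a list of requests and a threshold t (seconds)."""
    if not requests:
        raise ValueError("cannot compute Apdex for zero requests")
    satisfied = 0
    tolerated = 0
    for r in requests:
        if r.failed:
            continue  # failed requests count as Frustrated, regardless of latency
        if r.latency_s <= t:
            satisfied += 1
        elif r.latency_s <= 4 * t:
            tolerated += 1
        # anything slower than 4 * t is Frustrated and adds nothing to the score
    return (satisfied + tolerated / 2) / len(requests)


sample = [
    Request(0.2),               # Satisfied
    Request(0.4),               # Satisfied
    Request(0.9),               # Tolerated (between T and 4T)
    Request(3.0),               # Frustrated (slower than 4T)
    Request(0.1, failed=True),  # Frustrated (failed, even though it was fast)
]
score = apdex(sample, t=0.5)
print(score)  # 0.5
```

Five requests in five different states collapse into a single `0.5`, which is exactly the kind of number the rest of this post complains about.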
Sounds quite good, right? I don’t like it.
In the rest of this post I’ll do my best to explain why.
1. No one knows what Apdex is
First and foremost, no one knows what the definition of Apdex is.
I literally learned about it at 3am, when the Apdex score went below its alert threshold. If observability in general is a domain that is hard to master, then the Apdex is usually very far down the learning syllabus.
Furthermore, even when you know its definition, the Apdex score is not tangible! What does a score of 0.87 mean?! Can you imagine it? If you plot a chart of system health next to the Apdex, what will the correlation between them be? Given a score, can you draft a mental model of your system’s state?
It’s an abstraction, or a summary, that simplifies nothing. It just adds confusion!
As @nukemberg said when we discussed it: “people don’t have intuition for it, there’s no concrete example, there’s no simulator”.
2. It’s hard to set a meaningful alert threshold
Since the numeric value of the score is hard to imagine and understand, how exactly will we set the alert threshold? Will it be a score under 0.9? 0.8? I find the alert threshold to be an arbitrary value! (Which we all know is the absolute best way to get your team into alert fatigue).
3. It “feels” arbitrary
It tries to simplify the complexity of understanding a latency distribution. It takes a distribution, which isn’t a trivial concept to begin with, and breaks it down into 3 oversimplified buckets which always feel arbitrary to me (why is anything above 4*T Frustrated? Why not 1.5 or 8?).
4. It’s a “default” not an optimum
It feels like a mechanism to add simple, default visibility into complex systems — it treats all requests the same, all errors the same, while reality is far more complex than that.
Moreover, different use cases or customers might have different T values, i.e. different definitions of what Satisfied and Frustrated requests are. For example, satisfaction for your freemium users can be a 1 second response time, while for your enterprise users it can be 400ms.
I consider it to be an attempt to create a “one size fits all” that actually fits none.
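As a rough illustration of how much the score depends on the chosen T, the sketch below scores the same latencies twice, once with the 1 second freemium threshold and once with the 400ms enterprise threshold from the example above. The helper name and the latency values are hypothetical, and errors are left out to keep it short.

```python
def apdex(latencies_s, t):
    """Apdex score for successful requests only: (satisfied + tolerated / 2) / total."""
    satisfied = sum(1 for latency in latencies_s if latency <= t)
    tolerated = sum(1 for latency in latencies_s if t < latency <= 4 * t)
    return (satisfied + tolerated / 2) / len(latencies_s)


# The same five requests, as seen by two kinds of customers (made-up numbers).
latencies_s = [0.3, 0.5, 0.9, 1.5, 2.0]

freemium_score = apdex(latencies_s, t=1.0)    # T = 1 second
enterprise_score = apdex(latencies_s, t=0.4)  # T = 400ms

print(freemium_score)    # 0.8
print(enterprise_score)  # 0.5
```

The traffic is identical, yet one T says things are mostly fine and the other says half of the experience is bad. A single global T hides that difference.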
5. It has multiple dimensions
Lastly, and here it’s completely my personal preference, it has multiple dimensions: latency and error rate. They are folded into a single number, so getting alerted on a low Apdex score doesn’t say what exactly the issue is. Is it latency? Or errors?
I prefer alerts to have a single dimension. This way it’s very clear what the source of the alert is.
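A small sketch of that ambiguity, assuming the standard Apdex formula and failed requests counted as Frustrated. The bucket counts are invented for the example.

```python
def apdex_from_counts(satisfied, tolerated, frustrated):
    """Apdex score from bucket counts: (satisfied + tolerated / 2) / total."""
    total = satisfied + tolerated + frustrated
    return (satisfied + tolerated / 2) / total


# Scenario A: a latency regression. No errors, but 20% of requests became slow.
latency_incident = apdex_from_counts(satisfied=80, tolerated=20, frustrated=0)

# Scenario B: an error spike. Everything is fast, but 10% of requests fail.
error_incident = apdex_from_counts(satisfied=90, tolerated=0, frustrated=10)

print(latency_incident)  # 0.9
print(error_incident)    # 0.9
```

Two very different incidents produce the same score, so the alert alone cannot tell the on-call engineer which one is happening.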
If not Apdex, what should we alert on?
Apdex has 2 dimensions, errors and latency, so I just explicitly alert on those. It’s much simpler to grasp and understand, and the alert discloses what’s going on.
It is possible to create an effective SLO that combines multiple dimensions, but this is done by defining consistent thresholds and not ambiguous scores. For example: the percentage of requests that either fail or have latency above 500 ms. This is similar to leveraging only the Satisfied or Frustrated Apdex buckets.
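Here is a minimal sketch of both ideas: one alert per dimension, plus a combined "bad request" ratio built on a fixed threshold. The 500 ms cutoff comes from the example above. The alert limits (1% errors, 5% slow requests), the function names and the sample traffic are assumptions for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    failed: bool = False


def error_rate(requests):
    """Fraction of requests that failed."""
    return sum(r.failed for r in requests) / len(requests)


def slow_rate(requests, threshold_ms=500):
    """Fraction of requests slower than the latency threshold."""
    return sum(r.latency_ms > threshold_ms for r in requests) / len(requests)


def bad_rate(requests, threshold_ms=500):
    """Combined SLI: fraction of requests that failed or were too slow."""
    return sum(r.failed or r.latency_ms > threshold_ms for r in requests) / len(requests)


def evaluate_alerts(requests, max_error_rate=0.01, max_slow_rate=0.05):
    """Return one message per violated dimension (the limits are hypothetical)."""
    alerts = []
    if error_rate(requests) > max_error_rate:
        alerts.append("error rate too high")
    if slow_rate(requests) > max_slow_rate:
        alerts.append("too many slow requests")
    return alerts


# Made-up traffic: 96 fast successes, 2 slow successes, 2 fast failures.
traffic = (
    [Request(120)] * 96
    + [Request(900)] * 2
    + [Request(80, failed=True)] * 2
)

print(evaluate_alerts(traffic))  # ['error rate too high']
print(bad_rate(traffic))         # 0.04
```

Each alert names its own dimension, so the page already says whether to look at errors or at latency. The combined ratio is still a plain percentage against a fixed threshold, which is much easier to reason about than a weighted score.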
Final Thoughts
In user facing applications the Apdex score can have some value as a type of “hunch engineering” — getting a general “feeling” of the system’s performance.
In service-to-service applications (i.e. systems that aren’t interacting directly with a human user) I really don’t see any point in using Apdex: there’s no notion of “satisfaction”, the SLA is either met or not. No middle ground.
In my personal experience as an engineer who doesn’t work on user-facing systems, every “low Apdex score” alert I’ve gotten confused me! It was a struggle to understand how the system was behaving, and what exactly the symptoms were.
Some credits:
@nukemberg, with whom I’ve discussed the ideas behind this post and who helped sharpen them.
@Eladleev, whose “the profound problem with terraform” post inspired me to put in writing what I think of Apdex (which was sitting in the drawer for more than a year).
As always, thoughts and comments are welcome on Twitter at @cherkaskyb