It is a known saying in the SRE world that change is the source of all instability. And indeed it is: change is what usually disrupts service and takes applications down.
Engineers usually think of “change” as code change, but the ugly truth is that there are many other types of change that are as dangerous to the production environment.
Unexpected changes can easily take your system down, so in this post I’ll share some hard-learned lessons about such unexpected changes that took my production systems down.
I hope that reading about these less-expected types of change will make you stop for a moment and consider the risks your own production environment is prone to.
Putting Types of Change on a Scale
I like to put “change” on a scale of explicitness, where an explicit change is one with at least one human actor (usually a dev or ops person) who is fully aware that a change is happening. They might not fully understand the change or its implications, but they know a change is in progress.
An implicit change, on the other hand, is one where no human is aware that it is happening, and no human is interacting with the system or production environment while it happens.
There are many ways to put change on a scale, but I’ve found the scale of explicitness to be a very good way to express the complexity of systems, and how wide the variety of changes that modern systems go through really is.
Let’s dive in and cover some common types of change in detail.
Code and Configuration
This is the most obvious one: changing code or configuration is a major source of instability, and it is the most explicit type of change our systems undergo.
Load

The load our systems serve is also constantly changing, and it obviously affects them.
Load is trickier than code, though. It might be an explicit, planned change: on Black Friday, for example, load is going to increase drastically, everyone knows about it, and everyone is prepared for it.
But load can change much more implicitly: a DDoS attack, a halftime break in a Rugby World Cup match you are not aware of, or simply onboarding a large new customer the engineering team doesn’t know about (automatic capacity tuning is the answer, of course, but organizations take time to get there).
Unexpected load is probably the number one factor for system failure after code and configuration changes.
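One common defense against implicit load change is admission control: admit what you planned for, shed the rest. Here is a minimal token-bucket sketch (the class and parameter names are illustrative, not from any particular library):

```python
import time


class TokenBucket:
    """Admit requests at a steady rate; shed the excess during load spikes."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec      # tokens refilled per second
        self.capacity = burst         # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A front-end handler would call `bucket.allow()` per request and return 429 when it comes back `False`, turning an implicit load spike into an explicit, measurable rejection rate.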
Population

Population change is a nuanced topic. It can be considered a kind of load change, but I like to treat it as a separate type due to its more unexpected nature.
Population means the mix of requests your system is serving. Some examples:
- The ratio of GET / POST / PUT requests
- The throughput ratio between endpoints the system serves
- The throughput ratio between customers the system serves
Population change means the ratio of requests served changes. For example, a 20–80 POST-GET split can shift to 40–60, and the system becomes much more write-heavy.
This is entirely implicit in most cases — it’s a change in how users interact with the system, or even a change in who the users are.
This is extremely difficult to predict, and even identifying this change is not always trivial.
What can population change cause? As in the POST-GET example above, it can change the nature of the system, driving it from read-intensive to write-intensive, and those systems have completely different design characteristics!
Another example: the requests in the system become “heavier.” If you’re serving 1K RPS with a median latency of 50ms, onboarding a new customer might change the population so that you’re serving 1.1K RPS with a median latency of 200ms: a 10% increase in throughput that quadruples median latency.
Maybe the requests sent by this new customer are larger in size, or require heavier computations.
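Since population change is implicit, the best you can do is measure it. A small sketch of the idea, assuming request logs carry a `method` field (the function names are mine, not a standard API): compute each method’s share of traffic per window, then alert when the share shifts too far from a baseline.

```python
from collections import Counter


def request_mix(requests):
    """Return each method's share of total traffic, e.g. {'GET': 0.8, 'POST': 0.2}."""
    counts = Counter(r["method"] for r in requests)
    total = sum(counts.values())
    return {method: n / total for method, n in counts.items()}


def mix_drift(baseline, current):
    """Largest absolute change in any method's share between two windows."""
    methods = set(baseline) | set(current)
    return max(abs(baseline.get(m, 0.0) - current.get(m, 0.0)) for m in methods)
```

The same shape of check works for the other ratios listed above (per-endpoint or per-customer throughput); just swap the field you count on, and page someone when `mix_drift` crosses a threshold you trust.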
These things happen! They are hard to identify, and we need to be aware of them!
Time

Time changes, and we don’t even think of it!
That means any time-based logic is prone to errors and bugs that will be manifested only in specific times or dates.
I once saw an

if request.created_at > "2023-04-22": return BAD_REQUEST

in a production system. I didn’t know why someone expected the end of the world in April 2023, but this was a time bomb waiting to go off. We found it by sheer chance before it triggered.
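You don’t have to rely on chance, though. A crude lint over your codebase catches most of these time bombs; here is a minimal sketch (the regex only covers quoted ISO-style dates, which is an assumption about how such literals look in your code):

```python
import re

# Matches ISO-style date literals such as "2023-04-22" inside string quotes.
DATE_LITERAL = re.compile(r'["\'](\d{4}-\d{2}-\d{2})["\']')


def find_hardcoded_dates(source: str):
    """Return all quoted date literals found in a source file's text."""
    return DATE_LITERAL.findall(source)
```

Run it over every file in CI and force the author of a flagged line to justify the date in review; most of the time there is no good justification.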
The operating system / infrastructure applications might have jobs scheduled for specific times, during which the system behaves differently. For example — when the Postgres VACUUM job triggers — performance might be impacted severely.
It is a known issue in general-purpose scheduling systems that human-generated jobs tend to be scheduled on the round hour, which means your production infrastructure will see spiky load at those times.
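A cheap mitigation is deterministic jitter: derive each job’s start minute from a hash of its name, so jobs spread across the hour instead of piling up at minute zero. A sketch, assuming you generate your cron entries from code:

```python
import hashlib


def jittered_minute(job_name: str, window_minutes: int = 60) -> int:
    """Deterministically spread jobs across the hour instead of minute 0.

    Hashing the job name gives each job a stable, pseudo-random offset,
    so the schedule stays reproducible across deploys.
    """
    digest = hashlib.sha256(job_name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % window_minutes
```

A cron line then becomes something like `f"{jittered_minute('daily-report')} * * * * ..."` instead of a hardcoded `0 * * * * ...`.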
Long Running Apps and Instances
On one of my first vacations on the job, the system went down simply because we stopped deploying daily, and an instance ran for 3 days straight for the first time. It had misconfigured log rotation, which let its disk fill up and crash the instance.
The fact that your instance hasn’t been restarted for a long period is a change by itself! It might be the first time it has ever run for that long.
Generalizing the above example: a large-scale system tends to reach various saturation points when running for long periods without proper maintenance.
Some examples I’ve encountered in the wild:
- A database that reaches an internal limit and starts to slow down
- Hitting the limit set by an improper log rotation configuration
- A lurking memory leak that finally reaches its critical point
The only way to check this is to actually run the system for a long time, under load.
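Alongside long soak tests, cheap saturation checks catch the slow burns early. For the disk-filling case above, a minimal watchdog sketch (the function name and threshold are mine; wire it into whatever monitoring you already run):

```python
import shutil


def disk_headroom_ok(path: str = "/", min_free_ratio: float = 0.15) -> bool:
    """Check that at least `min_free_ratio` of the disk at `path` is still free.

    Catching a slowly filling disk early turns a middle-of-the-night crash
    into a routine ticket about log rotation.
    """
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_ratio
```

The same pattern (sample a resource, compare to a ratio, alert on breach) applies to file descriptors, connection pools, and process memory, which covers the other two bullet points as well.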
The number of such changes that can take a production system down is huge, and I’m not even pretending to cover them all.
Thinking through all these possibilities is a time-consuming effort, and until you’ve seen a production system go down because 500 jobs were scheduled at the same moment at midnight, or because an instance with a memory leak ran for a week for the first time, these scenarios are hard to imagine.
I hope this post, and my personal experience, helped open your eyes to the different ways your production system can fail, and nudges you to block some time to think about what’s relevant in your own production environment.
As always, thoughts and comments are welcome on Twitter at @cherkaskyb