Data Puppy — Shrinking Datadog Costs | Part III
The Datadog cost series:
- Part I — The costly elephant in the room is a (Data)dog
- Part II — The magic that is Datadog pricing
- Part III — Data Puppy, shrinking Datadog costs
Welcome back to the series on hacking the Datadog pricing model. In the previous post we broke down how the pricing model works; in this part, we’ll cover the key factors to consider and the optimization potential they unlock.
Without further ado, let’s dive into what can be done.
Use Committed Prices
You’ve already decided that Datadog is your observability platform of choice (a good decision), so you need to be able to manage spend and commitments with your Datadog account manager.
The easiest savings opportunity is to analyze your usage and adjust your commitments or base fee accordingly — this alone can save anywhere between 10% and 30%, without any engineering involvement.
Once you set your commitment levels, you need the tools to monitor them on a daily basis, alert on anomalies, and proactively identify any threat to those levels.
Build a process of cost governance — once you understand the pricing model and how your production system behaves, this is achievable even in a product as complex as Datadog.
This is an ongoing effort that needs to be repeated monthly or annually, depending on your plan.
Transition from End of Month “Bill Shock” to Push-Based Daily Cost Monitoring
One of the main pain points of using the Datadog platform is not the price itself, but the looming bill shock: the billing date might blindside you with unexpected usage.
That being said, the earlier you catch anomalies and cost spikes, the earlier you can address their root cause, and the more predictable your end-of-month payment will be.
There are many reasons for cost spikes in the wild:
- A developer enabling debug logs in production by mistake.
- Unknowingly adding a high cardinality tag to a custom metric.
- Spinning up a testing environment and forgetting it.
- A new feature released to production whose usage increase you didn’t foresee.
So keep an eye on your usage, and preferably set alerts! Within the Datadog platform this can be achieved by setting Monitors on the estimated usage metrics (note that these are usage metrics, not cost metrics), or alternatively by leveraging an anomaly engine such as Finout (full disclosure — I work @ Finout).
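For illustration, here is a minimal sketch of creating such a usage Monitor through the Datadog Monitors API using Python’s requests library. The specific estimated usage metric, the threshold, and the notification handle are placeholder assumptions; swap them for the usage dimension you actually commit to.

```python
import os
import requests

DD_SITE = "https://api.datadoghq.com"  # adjust for your Datadog site

# Assumed example: alert when estimated ingested log events trend above
# the commitment. Replace the metric, threshold, and @-handle with your own.
monitor = {
    "name": "Daily usage guardrail - ingested log events",
    "type": "metric alert",
    "query": "sum(last_1d):sum:datadog.estimated_usage.logs.ingested_events{*}.as_count() > 50000000",
    "message": "Estimated log ingestion is trending above our commitment. @slack-observability-costs",
    "options": {"thresholds": {"critical": 50000000}, "notify_no_data": False},
}

resp = requests.post(
    f"{DD_SITE}/api/v1/monitor",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    json=monitor,
)
resp.raise_for_status()
print("Created monitor", resp.json()["id"])
```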
Unit Economics of Your Observability Platform
Datadog is designed to scale with your business and infrastructure, and it does that amazingly well; just keep an eye on the unit economics of the observed system.
In practice, make sure your observability spend stays at a “reasonable” ratio to your infrastructure spend; anything between 2% and 6% of your infra cost is probably “healthy”.
Monitoring the trend here is key: don’t get caught with your devs adopting Datadog without the ability to verify that the increase makes business sense.
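As a back-of-the-envelope sketch, the check is just a ratio and a threshold. The spend figures below are placeholders; in practice you would pull them from your cloud bill and your Datadog bill (or a FinOps tool).

```python
# Placeholder monthly figures in USD; replace with your real billing data.
infra_spend = 400_000
observability_spend = 19_000

ratio = observability_spend / infra_spend
print(f"Observability is {ratio:.1%} of infra spend")

# The 2%-6% band is the rough "healthy" range suggested above.
if not 0.02 <= ratio <= 0.06:
    print("Outside the healthy band - worth a closer look at the trend.")
```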
Reduce Waste and Idle Monitoring
This is the fun part — eliminate waste.
Logs
Everything in my post on logging optimization applies to Datadog’s logging products too: make sure debug logs are turned off in any system running at scale, log only what you need, and keep it with the correct retention.
Datadog offers a cool “logs as metrics” capability. It can be leveraged to reduce the volume of indexed logs while still keeping some of their visibility as metrics (note that you’ll still pay for log ingestion).
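As a rough sketch of automating this, the request below follows the shape of the v2 log-based metrics API as I understand it; treat the filter query and the generated metric name as hypothetical examples, and verify the payload fields against the current docs.

```python
import os
import requests

DD_SITE = "https://api.datadoghq.com"  # adjust for your Datadog site

# Hypothetical example: count checkout errors as a metric instead of
# relying on every matching log line being indexed.
payload = {
    "data": {
        "type": "logs_metrics",
        "id": "checkout.errors.count",  # name of the generated metric
        "attributes": {
            "compute": {"aggregation_type": "count"},
            "filter": {"query": "service:checkout status:error"},
        },
    }
}

resp = requests.post(
    f"{DD_SITE}/api/v2/logs/config/metrics",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    json=payload,
)
resp.raise_for_status()
```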
No, you don’t need all those Custom Metrics!
We love metrics, and especially custom metrics. A code review ending with “well you are sending too many metrics” is rare.
We are educated to feel that “you can never have enough monitoring”, and that we need each and every one of these metrics in case a production incident happens and that specific metric saves our Friday night.
Three months later we are left with tons of idle metrics that are being used only by the Datadog billing systems, so clean up your infrastructure and remove any unused custom metrics.
I’ve played around with this Python script to identify unused custom metrics that do not appear in any Dashboard or Monitor; I encourage you to take it for a test drive.
Just delete them. You can let go of them.
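The full script is linked above; here is a simplified sketch of the same idea. It uses the public metrics, dashboards, and monitors APIs, but the matching is a naive string search, the metric prefix is a placeholder, and pagination is ignored.

```python
import json
import os
import time

import requests

DD_SITE = "https://api.datadoghq.com"
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}

def get(path, **params):
    resp = requests.get(f"{DD_SITE}{path}", headers=HEADERS, params=params)
    resp.raise_for_status()
    return resp.json()

# 1. Metrics that actively reported in the last 24h ("myapp." is a placeholder prefix).
active = get("/api/v1/metrics", **{"from": int(time.time()) - 86400})["metrics"]
custom = [m for m in active if m.startswith("myapp.")]

# 2. Naive check: does the metric name appear in any dashboard or monitor definition?
# (No pagination handling here; a real script should page through results.)
dashboards = get("/api/v1/dashboard")["dashboards"]
dashboard_blobs = [json.dumps(get(f"/api/v1/dashboard/{d['id']}")) for d in dashboards]
monitor_blobs = [json.dumps(m) for m in get("/api/v1/monitor")]

haystack = " ".join(dashboard_blobs + monitor_blobs)
unused = [m for m in custom if m not in haystack]
print("Candidates for deletion:", unused)
```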
Metrics cardinality — a blessing in disguise
You’ve deleted all your old, stale, unused custom metrics and are left with only what is necessary to keep production alive. But what about tag cardinality? Make sure the cardinality of the tags you send is reasonable! The custom metrics product is billed based on the metric type used (Gauge / Histogram / Counter) and the cardinality of the tags sent.
A good starting point is the list-active-tags-and-aggregations API, which gives you a general understanding of where you stand.
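A hedged sketch of calling it for a single metric; the metric name is a placeholder, and the exact response fields are assumptions worth checking against the API reference.

```python
import os
import requests

DD_SITE = "https://api.datadoghq.com"
metric = "myapp.checkout.latency"  # hypothetical custom metric name

resp = requests.get(
    f"{DD_SITE}/api/v2/metrics/{metric}/active-configurations",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
)
resp.raise_for_status()
attrs = resp.json()["data"]["attributes"]

# Tags and aggregations that are actually queried vs. everything you pay to index.
print("Actively queried tags:", attrs.get("active_tags"))
print("Actively queried aggregations:", attrs.get("active_aggregations"))
```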
Finally, use the right observability primitive for each metric — Histograms and Distributions are more expensive than Counters and Gauges, so use them only when you actually need their added value.
Synthetic Tests
Make sure that your Synthetic tests are actually used and alerted on by at least one active Monitor! Silenced Monitors tend to be forgotten while the Synthetic test keeps running and being billed. In addition, verify that each test is scheduled at a reasonable interval.
Make sure you don’t have multiple tests for a single endpoint — I know this is obvious, but in a large organization with people changing roles, these things tend to happen.
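As an illustrative sketch, the snippet below lists tests via the Synthetics API and flags overly frequent tests and multiple tests hitting the same URL. The tick_every and request URL field paths are assumptions to double-check against your actual responses.

```python
import os
from collections import defaultdict

import requests

DD_SITE = "https://api.datadoghq.com"
resp = requests.get(
    f"{DD_SITE}/api/v1/synthetics/tests",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
)
resp.raise_for_status()
tests = resp.json()["tests"]

by_url = defaultdict(list)
for t in tests:
    # tick_every is assumed to be the run interval in seconds.
    if t.get("options", {}).get("tick_every", 3600) < 300:
        print(f"Runs more often than every 5 minutes: {t['name']}")
    url = t.get("config", {}).get("request", {}).get("url")
    if url:
        by_url[url].append(t["name"])

for url, names in by_url.items():
    if len(names) > 1:
        print(f"Multiple tests hitting {url}: {names}")
```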
Host Monitoring and APM
This is a tricky one, but in general make sure you only monitor the hosts you care about, not your whole fleet by default. Take into account that a highly containerized environment incurs additional container costs, and in highly volatile environments such as batch processing workloads, the container costs can be on the same order of magnitude as the infra cost itself.
Consider using cheaper open source tools for “general monitoring”, and use the full extent of the Datadog platform for your most valuable hosts and services (see my “I Have An APM Addiction” talk [Hebrew], which covers exactly this issue).
One pattern I’ve used in the past is building internal, low-maintenance observability solutions for non-critical workloads (such as dev and staging environments) by leveraging open source tools with very short retention.
Encourage an Effective Monitoring State of Mind
Many engineers, especially less experienced ones, do not necessarily understand the cost of observability — for them it’s simply a single line of code, and lines of code are free!
Create an environment where the engineering organization is aware of observability costs and treats any log line or metric that is added or changed as a change with cost implications.
Focus on custom metric tag cardinality and log throughput.
End Note
I hope this series gave you some food for thought on how to address your Datadog costs and position yourself better in your next pricing negotiation.
Datadog is an amazing observability platform (although I am more of an open source monitoring kind of guy 😉), but its cost can get out of hand quite easily and become a hassle for your FinOps or Engineering/DevOps manager.
It doesn’t have to be.
As always, thoughts and comments are welcome on Twitter at @cherkaskyb.