The Magic That Is Datadog Pricing | Part II

Boris Cherkasky
8 min read · Jul 25, 2023

Welcome back to the series on Datadog pricing. In the previous post we covered some of the reasons why you should care about your Datadog costs. In this post, we’re going to cover how Datadog products get billed in general, and what makes those end-of-month invoices a bit, let’s say, unexpected.

As in the previous post, a disclaimer first — Datadog is an AMAZING product. And although throughout this series I’m covering how costly it is, it’s also very valuable, and the observability it provides is exceptionally good. You pay a lot, but you also get quite a lot in return.

Datadog Products at a High Level

Generally, we can break down Datadog product pricing into 3 major buckets:

  • Host based pricing
  • Volume based pricing
  • User based pricing

Most of these pricing models have exceptions and additional add-ons, but the general scheme of things is the above three.

To best understand the end-of-the-month invoice, it’s important to understand each of those, and how they can be optimized, or forecasted.

Let’s dive into each one of those.

Host Based Products

In host-based products, such as APM, Infrastructure, and CSM, Datadog offers a monitoring platform for hosts (instances), with various capabilities.

In these products’ pricing model, the customer commits to a number of monitored hosts on a monthly basis, at what seems to be a very reasonable price of a few tens of dollars per host. The interesting, and perhaps surprising, part is that Datadog bills by concurrently monitored hosts per hour (hosts concurrently active in 5-minute intervals, averaged over the hour). So when you commit to X monitored hosts, you’re actually committing to X concurrently active hosts in each hour of the month.

Why can that be surprising? Because our systems are usually elastic and might (auto) scale — the 100 hosts you’ve committed to can turn into 170 at the peak hours of your system’s traffic and 70 at its low point, but you have to put a fixed number on your commitment. And here it becomes a game of dynamic optimization — commit to just enough hosts to not be slaughtered on on-demand rates during your peaks, while not overcommitting for your low hours.
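To make that optimization game concrete, here’s a minimal sketch of the committed-host model. The rates and the exact billing mechanics here are hypothetical simplifications (real billing averages 5-minute intervals per hour, and actual prices vary by plan) — the point is the shape of the trade-off:

```python
# Hypothetical committed-host billing: you pay a fixed monthly rate for every
# committed host, plus an on-demand hourly rate for each host above the
# commitment in any given hour. All rates are made up for illustration.

def monthly_host_bill(hourly_active_hosts, committed_hosts,
                      committed_rate_per_host=23.0,
                      on_demand_rate_per_host_hour=0.05):
    """hourly_active_hosts: concurrently active host counts, one per hour."""
    base = committed_hosts * committed_rate_per_host
    overage_host_hours = sum(max(0, h - committed_hosts)
                             for h in hourly_active_hosts)
    return base + overage_host_hours * on_demand_rate_per_host_hour

# A day that scales from 70 hosts at night up to 170 at peak,
# with a commitment of 100 hosts:
usage = [70] * 8 + [120] * 8 + [170] * 8
print(monthly_host_bill(usage, committed_hosts=100))
```

Raising the commitment lowers the on-demand overage but raises the fixed base — finding the sweet spot against your real traffic curve is exactly the game described above.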

Moreover, Datadog pushes you to commit up front in this pricing model with quite an aggressive method: if you don’t commit to any quantity, they bill by your 99th-percentile usage hour and NOT by actual usage. I.e., the number of hosts active during your 99th-percentile busiest hour determines your end-of-month invoice. And as engineers, we know how unpredictable our 99th percentile can be.
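The no-commitment case is easy to simulate too. This sketch picks the p99 hour with a simple sorted-index approximation (the exact percentile method Datadog uses is their own; this just illustrates why a few spiky hours can set the whole bill):

```python
# Sketch of the no-commitment case: the billable host count is driven by
# your 99th-percentile busiest hour, not by typical usage.

def p99_billable_hosts(hourly_active_hosts):
    ordered = sorted(hourly_active_hosts)
    idx = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return ordered[idx]

# A 720-hour month that is mostly quiet, with 20 spiky hours:
month = [100] * 700 + [450] * 20
print(p99_billable_hosts(month))  # the spike level, not the typical 100
```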

On the engineering side, I can understand why this pricing model was chosen — when Datadog has no estimate of your usage, they need to over-provision their system to support any load, and thus carry a larger operational cost. Whether this justifies billing at the p99 point is up to you to decide.

Volume Based Products

In volume-based products, such as Logs and Synthetics, you pay per use — as simple as that.

You send Datadog 100GB of logs — you’ll pay for 100GB of logs. You execute 1M synthetic checks — you’ll pay for 1M synthetic checks.

The commitment model here is quite simple too — you commit to a volume and get a reduced price, and once you exceed your committed volume, you get billed at on-demand rates.
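A sketch of that commit-plus-overage model (rates are hypothetical; the 1.5x on-demand multiplier is borrowed from the “>150% of the committed price” figure discussed later in this post):

```python
# Hypothetical volume-based billing: committed volume at a discounted rate,
# anything above it at on-demand rates (assumed here as 1.5x the committed rate).

def volume_bill(used_gb, committed_gb, committed_rate_per_gb=0.10,
                on_demand_multiplier=1.5):
    committed_cost = committed_gb * committed_rate_per_gb
    overage_gb = max(0.0, used_gb - committed_gb)
    return committed_cost + overage_gb * committed_rate_per_gb * on_demand_multiplier

print(volume_bill(used_gb=100, committed_gb=100))  # exactly on commitment
print(volume_bill(used_gb=150, committed_gb=100))  # 50GB billed at on-demand
```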

This is the basic SaaS business model, and Datadog is no different from many other vendors here.

Logs, being a very popular Datadog product, varies a bit from other volume-based products and is worth diving into a bit deeper.

It’s important to understand that in the Logs product, sending logs to the Datadog platform triggers multiple billing dimensions:

  • Ingestion — you pay for the GB volume of logs you send (usually a fixed price of $0.10 / GB of data)
  • Indexing — you pay for the amount (count) of logs you store on the Datadog platform, and the price varies with the retention period for which the data is kept (the longer you keep the data, the more expensive it is)
  • Rehydration — if, for any reason, you want to make logs that have passed their retention period available again, you need to rehydrate them, and pay for it

Up until this point, no surprises — you send data to Datadog, therefore you pay for it, totally makes sense.
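Here’s how those dimensions stack up in a rough sketch. The ~$0.10/GB ingestion figure comes from the list above; the per-retention indexing rates below are hypothetical placeholders (real rates depend on your plan):

```python
# Rough sketch of stacked log billing: ingestion (per GB) plus indexing
# (per million events, priced by retention). Indexing rates are hypothetical.

INDEXING_RATE_PER_MILLION = {7: 1.70, 15: 2.50, 30: 3.75}  # days -> $/1M events

def monthly_log_bill(ingested_gb, indexed_millions, retention_days,
                     ingestion_rate_per_gb=0.10):
    ingestion = ingested_gb * ingestion_rate_per_gb
    indexing = indexed_millions * INDEXING_RATE_PER_MILLION[retention_days]
    return ingestion + indexing

# 500 GB ingested, 200M events indexed with 15-day retention:
print(monthly_log_bill(500, 200, 15))
```

Note how the indexing term dominates the ingestion term — which is why index exclusion filters are usually the first lever people reach for.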

The interesting part about logs is that you pay regardless of usage — i.e., even if you never open the Datadog log search page, you still pay those amounts. That makes total business sense for Datadog, since they have to process, store, and index all those logs anyhow, and bear the cost of doing so. But as a user, you might be paying for something you never actively use, or use very rarely. I don’t have numbers to support this, but if I had to guess from personal experience, I’d say that less than 0.1% of indexed logs are ever actually looked at. The bottom line here is to think before you log (I’ve written in the past on log cost optimization).

Logs aren’t free, and aren’t cheap.

It’s important to mention that in usage-based products it is very easy to overshoot your commitments and land on on-demand pricing, which is >150% of the committed price (by default).

All it takes is for a synthetic API test to run every 5 minutes instead of every 10 to double its cost, or for a new service launched to production to take another 20% chunk of logs. Those are things that developers can do (and are doing) without giving them much thought.
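The synthetic-test arithmetic is worth spelling out, because the cost tracks run count directly. The $5-per-10K-runs rate here is illustrative (check the pricing page for current rates):

```python
# Why a small schedule change doubles cost: the bill is linear in run count.
# The per-10K-runs rate is illustrative, not an official price.

def monthly_synthetic_cost(interval_minutes, rate_per_10k_runs=5.0, days=30):
    runs = (24 * 60 // interval_minutes) * days
    return runs / 10_000 * rate_per_10k_runs

print(monthly_synthetic_cost(10))  # every 10 minutes: the baseline
print(monthly_synthetic_cost(5))   # every 5 minutes: exactly double
```

One test at these rates is cheap either way — the surprise comes when dozens of tests across dozens of teams all quietly tighten their schedules.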

Active User Based Products

In user-based products, such as Incident Management and CI Visibility, you pay per active user — as simple as that.

And to be honest, in this case I have no complaints against Datadog — they are quite gentlemanly about defining what an “active” user is.

Buy One Get Some for Free

The last piece of the puzzle is products such as containers, custom metrics, and indexed spans.

For those products, you get a “free amount” bundled with some of the “main” Datadog products you use.

For example, for each monitored host you’ll get (on an enterprise account):

  • 4 monitored containers
  • 200 custom metrics
  • 1M APM spans

These products will generally behave like the host-based / usage-based products mentioned above.
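A sketch of the allowance mechanic, using the enterprise per-host allowances listed above. The overage rates are hypothetical — the point is that you only pay for usage beyond hosts × allowance, so the bill jumps once density crosses that line:

```python
# Per-host free allowances (enterprise figures from the list above) and
# hypothetical overage rates. You pay only for usage beyond hosts * allowance.

ALLOWANCE_PER_HOST = {"containers": 4, "custom_metrics": 200, "apm_spans_m": 1}
OVERAGE_RATE = {"containers": 0.002,      # $ per extra container (hypothetical)
                "custom_metrics": 0.01,   # $ per extra metric (hypothetical)
                "apm_spans_m": 1.70}      # $ per extra 1M spans (hypothetical)

def allowance_overage_cost(hosts, usage):
    """usage: dict with the same keys as ALLOWANCE_PER_HOST."""
    total = 0.0
    for item, per_host in ALLOWANCE_PER_HOST.items():
        free = per_host * hosts
        billable = max(0, usage[item] - free)
        total += billable * OVERAGE_RATE[item]
    return total

# 100 hosts running 900 containers and reporting 35,000 custom metrics:
print(allowance_overage_cost(100, {"containers": 900,
                                   "custom_metrics": 35_000,
                                   "apm_spans_m": 80}))
```

Note the Kubernetes trap: 100 hosts only cover 400 containers for free, and a typical cluster packs far more than 4 pods per node.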

These products are the hardest to understand, reason about, and get their cost under control. I won’t get into the dirty details of why they are so difficult to manage, but I’ll try to summarize the general themes below.

The Challenge of Controlling Container Monitoring Cost

In highly containerized and dynamic environments such as Kubernetes, each host (i.e. cloud provider instance) can run a huge number of unique active containers, causing the host-monitoring cost to skyrocket. In our case it increased our cost by ~50%.

To be fair, Datadog tries to average out outliers, but in the age of Kubernetes, containers come and go — and incur additional Datadog costs.

The Challenge of Controlling Custom Metrics Cost

The Custom Metrics product is generally billed with respect to the cardinality of your metrics’ tag values — the more unique tag-value combinations you report, the more expensive it is.

Since the cost is per unique combination, the cost CAN grow non-linearly when adding additional tags! (It depends greatly on the cardinality of each tag and on whether the tags are codependent.)

In addition, using the right observability primitive is critical! Distribution / histogram custom metrics are billed at least 5 times higher than gauges / counters.
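Both effects can be sketched in a few lines. The worst case assumes fully independent tags (a full cross product of values); the 5x factor follows the “at least 5 times” note above:

```python
# Worst-case billable timeseries count for one custom metric: the cross
# product of tag cardinalities, multiplied again for distribution metrics.
# Assumes independent (uncorrelated) tags — the worst case.
from math import prod

def worst_case_timeseries(tag_cardinalities, metric_type="count"):
    combos = prod(tag_cardinalities)  # every tag-value combination is a timeseries
    multiplier = 5 if metric_type == "distribution" else 1
    return combos * multiplier

# endpoint (50 values) x status_code (10) x region (4):
print(worst_case_timeseries([50, 10, 4]))                  # 2000 timeseries
print(worst_case_timeseries([50, 10, 4], "distribution"))  # 10000 timeseries
```

Adding a single 20-value tag to that metric would multiply the whole thing by 20 — that’s the non-linear growth in action.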

If you’ve ever built modern, complex, microservice-based systems, you can probably understand how the cardinality of those tags can get out of hand quite easily.

The Complexity of the Datadog Pricing Mental Model

Datadog has about 20–30 products, sub-products, and additional fine-print add-ons. All of them behave roughly according to the pricing models mentioned above (with some variations): some price per GB, some per 1M events, some per 1K events, some per 10K events; some per host-month, some per host-hour.

See the Synthetics pricing page as an example of how fine-lined the pricing model is:

Browser tests are not 2.4 times more expensive than API tests — they are 24 times more expensive.
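That 24x only appears when you normalize the prices to a per-run basis — the unit sizes differ by 10x, which is exactly the fine print that hides it. The figures below are the approximate list prices implied by the text (roughly $5 per 10K API test runs vs. $12 per 1K browser test runs); verify against the current pricing page:

```python
# Normalizing different billing units ($/10K runs vs $/1K runs) to a
# per-run price. Prices are approximate list prices, for illustration.

api_per_run = 5.0 / 10_000      # -> $0.0005 per API test run
browser_per_run = 12.0 / 1_000  # -> $0.0120 per browser test run

print(browser_per_run / api_per_run)  # ~24x, not 2.4x
```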

Or take the part about non-committed hosts being priced by the p99 usage hour: it isn’t mentioned on the main Infrastructure/APM pricing pages, but in the billing FAQ — a page that took me 5 minutes to find while writing this post, even though I knew it existed and knew what I was looking for.

Someone needs to hold this mental model in their head to understand, provision, and commit to a budget.

It took me more than two weeks to get my head around all those bits and bytes and create a usage-to-cost translation for our application. Those were two weeks of full focus on solving that problem, and I already had previous experience with many other observability platforms.

A week later I had forgotten half of what I learned. Expecting anyone who does this commitment budgeting once a year (at best) to understand those bits and bytes is quite an unreasonable expectation.

This is hard on the company’s FinOps engineers — although they probably understand these pricing models, they usually don’t know the actual technical behaviour of the system, and aren’t involved in the technical roadmap (so they don’t know how many new services will be spawned or scrapped — which affects usage).

This is hard on the dev teams too — budgeting and cost allocation aren’t their core competence, and even for devs it’s hard to understand and forecast volumes.

The bottom line is that this is a guessing game until you get experience with forecasting usage, translating it to cost, and figuring out the right cost-effective commitment.
It shouldn’t be.

Next up

I hope I was able to shed some light on how the Datadog pricing model works, help you understand what you’re paying for, and maybe point to where to start looking for optimization opportunities or adjustments to your commitments to better suit your needs.

In the next part, we’ll cover what can be done to optimize Datadog costs.

As always, thoughts and comments are welcome on Twitter at @cherkaskyb.
