Let's Talk Numbers
These figures come from real procurement data, engineering blog posts, and conference talks where teams actually shared what they spend. Not vendor marketing—actual invoices.
Managed platform (annual):
- Compute credits: $3.2M
- Storage tier: $1.4M
- "Enterprise" add-ons: $1.4M

Self-built equivalent (annual):
- Cloud compute (direct): $950K
- Object storage: $350K
- Engineering time: $500K
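As a sanity check on those line items (grouping them is my reading of the list: the first three as the managed-platform bill, the last three as the self-built equivalent), the arithmetic works out:

```python
# Sketch: checking the figures above. The grouping is an assumption:
# first three items read as the managed-platform bill, last three
# as the self-built equivalent for the same capability.
managed = {
    "compute_credits": 3_200_000,
    "storage_tier": 1_400_000,
    "enterprise_addons": 1_400_000,
}
self_built = {
    "cloud_compute": 950_000,
    "object_storage": 350_000,
    "engineering_time": 500_000,
}

managed_total = sum(managed.values())        # $6.0M/year
self_built_total = sum(self_built.values())  # $1.8M/year
annual_premium = managed_total - self_built_total

print(f"Managed: ${managed_total:,}/yr, self-built: ${self_built_total:,}/yr")
print(f"Premium over 5 years: ${annual_premium * 5:,}")  # $21,000,000
```

That five-year figure is where the "$21 million premium" below comes from.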
Now, the vendors will say this comparison is unfair. They'll argue that their platform includes governance, security, and "operational simplicity" that's impossible to replicate. They're not entirely wrong.
But here's the thing: "not entirely wrong" at a $21 million premium deserves actual scrutiny. Not a sales call. Not a proof-of-concept that mysteriously runs faster than production will. Real scrutiny.
The question isn't whether managed platforms provide value; they do. The question is whether that value justifies paying 3x more for what is, underneath all the polish, compute time and storage.
The most expensive infrastructure decision isn't choosing the wrong tool. It's never asking whether you needed the tool at all.
Where Does the Money Go?
Let's crack open the black box. When you pay $6M/year, what are you actually buying? Understanding the layers reveals where value is created versus where margins are extracted.
This is the part they don't mention in the demo. Proprietary storage formats and query engines create switching costs that compound over time. Each year of data accumulation makes migration more expensive. That's not a bug—it's the business model.
This isn't theoretical. Airbnb, Netflix, Uber, and LinkedIn all built their data platforms on open source. Not because they couldn't afford Snowflake or Databricks. Because they did the math and understood what happens at scale.
Their engineering blogs document everything. The architectures are public. The tools are open source. You can literally copy their homework.
When vendors say "enterprise features," they usually mean: SSO (which your cloud provider has), audit logging (which you can build in a day), governance (which open source tools like DataHub handle), and "support" (which is often just access to documentation you could find anyway). Ask exactly what you're getting. Get specifics.
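To make the "build it in a day" claim concrete, here's a minimal, hypothetical audit-log sketch: append-only entries, each hash-chained to the previous one for tamper evidence. A production version would add persistence and access control; this only illustrates that the core is small.

```python
import hashlib
import json
import time

# Hypothetical minimal audit log: append-only, hash-chained entries.
class AuditLog:
    def __init__(self):
        self.entries = []
        self._prev_hash = "0" * 64  # genesis value

    def record(self, actor, action, resource):
        entry = {
            "ts": time.time(),
            "actor": actor,
            "action": action,
            "resource": resource,
            "prev": self._prev_hash,  # chains this entry to the last one
        }
        # Hash the canonical JSON form; any later edit breaks the chain.
        self._prev_hash = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        entry["hash"] = self._prev_hash
        self.entries.append(entry)
        return entry

log = AuditLog()
log.record("alice", "SELECT", "warehouse.orders")
log.record("bob", "DROP", "warehouse.staging")
print(len(log.entries), log.entries[-1]["action"])
```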
So What's the Alternative?
Building your own doesn't mean starting from scratch. The open-source data ecosystem is mature—battle-tested at companies processing petabytes daily. Here's what actually works.
| Layer | Tool | Battle-Tested At | License |
|---|---|---|---|
| Query Engine | Trino / DuckDB | Netflix, Airbnb, Meta | Apache 2.0 |
| Table Format | Apache Iceberg | Netflix, Apple, Adobe | Apache 2.0 |
| Orchestration | Apache Airflow / Dagster | Everywhere | Apache 2.0 |
| Streaming | Kafka / Redpanda | LinkedIn, Uber, Stripe | Apache 2.0 |
| Catalog | DataHub / Unity Catalog | LinkedIn, Lyft | Apache 2.0 |
| Transformations | dbt | Thousands of companies | Apache 2.0 |
Teams that made the switch report experiences like these:
- Migrated from Snowflake to Trino + Iceberg on S3. Took four months. Two senior engineers. No regrets.
- Built a lakehouse on GCS with Spark + Iceberg from day one. Compliance requirements met with open-source governance.
- Hit the Databricks cost ceiling at growth stage. Moved to DuckDB + Postgres for 80% of queries. Spark for the heavy stuff.
No Lock-in
Open formats mean your data is portable. Swap components without rewriting everything. Leave whenever you want.
Actual Cost Control
Pay cloud providers directly. No 3-5x markup. Use spot instances. Reserved capacity. You decide.
Real Skills
Your team learns industry-standard tools. No proprietary certification treadmill. Skills that transfer.
Open source isn't free. It requires engineering investment and operational maturity. But that investment builds an asset—team expertise, institutional knowledge, and infrastructure you control—instead of paying rent on someone else's margin.
Small Teams, Simple Needs
Under 5TB, basic analytics, small team? Managed services often make sense. The operational overhead of self-hosting isn't worth it yet. Focus on product-market fit, not infrastructure optimization. Pay the premium. Ship faster.
Mid-Scale Operations
This is where the math changes. At 10-50TB with 20+ data users, the cost differential becomes real money. A dedicated data engineer or two pays for themselves in platform savings within the first year. This is where most companies should question defaults.
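A back-of-envelope version of that payback claim. Every figure here is an illustrative assumption, not data from this piece; plug in your own numbers.

```python
# Illustrative break-even sketch; salaries, platform spend, and the
# savings rate are assumptions, not reported figures.
engineer_cost = 2 * 250_000   # two senior data engineers, fully loaded
platform_spend = 2_000_000    # hypothetical mid-scale managed bill
savings_rate = 0.5            # assume self-hosting halves the bill

annual_savings = platform_spend * savings_rate
payback_months = engineer_cost / (annual_savings / 12)
print(f"Payback: {payback_months:.0f} months")  # 6 months under these assumptions
```

Under these assumptions the engineers pay for themselves in half a year; even if the savings rate is half as good, it's still inside year one.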
Hyperscale Challenges
At true hyperscale (petabytes, thousands of concurrent users, complex ML pipelines with dynamic resource allocation), the engineering investment is massive. Platform vendors have spent billions solving these problems. But be honest: are you actually here?
Here's the thing: most companies aren't at the extremes. They're somewhere in the middle, using enterprise tools for mid-scale problems and paying the enterprise tax without getting enterprise-scale value.
The real question isn't whether managed platforms are ever worth it. Sometimes they are. The question is whether you've actually examined the trade-offs for your specific situation, or just accepted the default because evaluating alternatives requires effort and organizational will.
Before You Sign That Renewal
Three questions worth asking. Not rhetorical: actually ask them. Get answers in writing. Do the math yourself.
What's the exit?
Can you export your data tomorrow? In what format? How much would migration actually cost? Get the number.
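For a starting point on that number, raw network egress is the floor of any migration estimate (the per-GB rate below is an assumption; check your provider's current pricing, and remember re-engineering time will dominate the real cost):

```python
# Back-of-envelope export cost; the egress rate is an assumption,
# not a quoted price. Real migration cost is dominated by
# re-engineering, not transfer.
data_tb = 50
egress_per_gb = 0.09  # assumed internet egress rate, $/GB

egress_cost = data_tb * 1024 * egress_per_gb
print(f"One-time export of {data_tb} TB: ~${egress_cost:,.0f}")
```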
What are you using?
Pull your usage metrics. What percentage of "enterprise features" do you actually touch? Be honest.
What's the 5-year math?
How do costs compound as data grows? What could the same investment in internal capability yield?
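A sketch of that compounding. The growth rate and per-TB prices are purely illustrative assumptions; the point is the shape of the curve, not the specific totals.

```python
# Illustrative 5-year cost compounding; growth rate and unit prices
# are assumptions, not reported figures.
data_tb = 20                    # starting data volume, TB
growth = 0.40                   # assumed 40% annual data growth
cost_per_tb_managed = 60_000    # hypothetical blended $/TB/year
cost_per_tb_self = 20_000       # hypothetical $/TB/year

managed_total = self_total = 0.0
for year in range(1, 6):
    managed_total += data_tb * cost_per_tb_managed
    self_total += data_tb * cost_per_tb_self
    data_tb *= 1 + growth       # data grows, so next year costs more

print(f"5-year managed:    ${managed_total:,.0f}")
print(f"5-year self-built: ${self_total:,.0f}")
print(f"Difference:        ${managed_total - self_total:,.0f}")
```

Because spend scales with data, the gap between the two options widens every year; the premium you accept today is the smallest it will ever be.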
Look, this piece isn't anti-managed-platform. It's anti-unexamined-managed-platform. Against the assumption that this is just what infrastructure costs. Against treating vendor complexity as a feature instead of a problem to solve.
The best infrastructure decisions come from teams that understand the trade-offs and decide they're acceptable for their specific context. Sometimes that means enterprise platforms.
Often it doesn't, and the only way to know is to check.