Hey all,
Let me tell you a horror story:
One time, my production engineering team tightened security groups on a VPC. We had a clean plan and a smooth apply. There were high fives all around.
Until two hours later…
The application devs’ load balancer lost connectivity to backend services that quietly depended on the networking assumptions we’d just bulldozed. Nobody broke anything. Everybody broke everything.
I’m not too proud of it. But that’s infrastructure-as-code at scale. And it’s a mess we don't talk about honestly enough.
I see too many platform teams who treat their Terraform repos like independent kingdoms, when the reality is this: Every state file has upstream producers and downstream consumers. That VPC state feeds the load balancer state, which feeds the application state. If those relationships live in someone's head instead of somewhere explicit, you're not doing engineering. You're doing improv.
Improv is great for comedy, but it’s terrible for production infrastructure.
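To make that graph concrete: below is a minimal sketch, in plain Terraform, of what one explicit edge looks like. The backend choice, bucket, state keys, and output names here are all hypothetical stand-ins for your own layout.

```hcl
# --- Upstream stack (producer): the VPC repo publishes its contract ---
# Assumes an aws_subnet.private resource defined elsewhere in this stack.
output "private_subnet_ids" {
  description = "Consumed by the load balancer stack. Treat this as a public API."
  value       = aws_subnet.private[*].id
}

# --- Downstream stack (consumer): the load balancer repo declares the edge ---
data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "acme-terraform-state"        # hypothetical bucket
    key    = "vpc/prod/terraform.tfstate"  # hypothetical key
    region = "us-east-1"
  }
}

resource "aws_lb" "app" {
  name    = "app-lb"
  subnets = data.terraform_remote_state.vpc.outputs.private_subnet_ids
}
```

Once the edge is written down like this, anyone can grep for terraform_remote_state and reconstruct the graph. The dependency stops living in someone's head.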
What Not to Do
The industry wants you to believe the answer is buying the right tool. Stick a shiny orchestrator on top, and your problems will dissolve like sugar in coffee. I created one of those tools, and I'm telling you:
That's nonsense.
Your infrastructure is a graph of dependencies between repos, state files, teams, and environments. If you don't understand that graph, no product on Earth will save you by itself. Not mine or anyone else's.
Your next inclination might be to throw more engineers at the problem. But that just makes it worse, for a couple of reasons:
→ Every new person is a new node generating changes against a dependency graph they can't see. You’re scaling your team, but you’re also scaling your blast radius when something goes wrong.
→ It’s not a capacity issue. You have enough engineers. The trouble is, none of them can point to a whiteboard and show you what depends on what. That’s the problem you need to fix first.
What to Do Before You Even Look at Tools
So, before you touch any tooling, follow these steps:
1 - Map dependencies explicitly. If it's not written down, it doesn't exist.
2 - Establish ownership. Every piece of infrastructure should have a name on it, and downstream consumers should hear about changes before they land, not after the outage.
3 - Enforce ordering. If A must apply before B, that's a constraint, not a suggestion. Violate it and you deserve what happens next. If you have a shallow dependency graph and a few teams, GitHub Actions can get you surprisingly far here, with cross-repo workflow triggers, approval gates, and output passing between workflows (there's a sketch of this gate right after this list).
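As one sketch of step 3, here's the ordering constraint expressed in Terraform itself, assuming an S3 backend and Terraform 1.4+ (all names hypothetical): the downstream stack refuses to even plan until the upstream stack has applied and published its outputs.

```hcl
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "acme-terraform-state"            # hypothetical bucket
    key    = "network/prod/terraform.tfstate"  # upstream stack A's state
    region = "us-east-1"
  }
}

# Hard gate: if stack A hasn't applied yet (no vpc_id output published),
# this plan fails loudly instead of proceeding on stale assumptions.
resource "terraform_data" "ordering_gate" {
  lifecycle {
    precondition {
      condition     = can(data.terraform_remote_state.network.outputs.vpc_id)
      error_message = "Apply the network stack first: this stack requires its vpc_id output."
    }
  }
}
```

For step 2, even something as small as an Owner tag in the AWS provider's default_tags, plus a CODEOWNERS file in each repo, goes a long way. And in CI, the same ordering shows up as the upstream pipeline explicitly triggering the downstream one (for example, via a repository_dispatch event in GitHub Actions) rather than the two racing each other.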
No Magic Bullet
You probably think this is the part where I pitch Spacelift. You’re half right.
I built Spacelift because, back when I was a practitioner, I kept running into these exact problems. The first issue was always the same: Nobody could see the dependency graph. One repo would quietly produce outputs that another repo relied on, but the relationship wasn’t captured anywhere in a way the tooling could understand. Then there was ordering and approval, where A truly had to apply before B, and where a downstream team needed a chance to review a change before it rolled into production (not after it caused an outage). And even when everything looked clean on paper, reality would drift. Someone would click in the console, run a one-off script, or make an emergency change, and suddenly the cloud and the repo were telling two different stories.
But here's the thing I probably shouldn't say: Spacelift alone won't fix these problems, either.
Without doing the work of mapping your dependencies and agreeing on who owns what, you're just adding a very nice dashboard on top of chaos. We've seen what happens next. It's not pretty.
If you’re trying to solve the problems of IaC at scale, you have to go in this order:
Mindset first.
Process second.
Tools last. (Even mine.)
Thanks for reading,
Marcin
P.S.: Spacelift is hosting a webinar on the challenges of multiplayer IaC on March 12, and I’d love for you to join us. You’ll walk away with a practical checklist for scaling infrastructure-as-code. Register here.