Most Kubernetes migrations don’t fail during the migration. They fail about two weeks after, when the first real incident hits and the team realizes nobody actually decided how things work day-to-day on the new platform.

The pattern

It usually goes like this:

  1. A team spends months getting workloads running on Kubernetes.
  2. Day one goes fine — traffic is cut over, dashboards are green.
  3. Two to three weeks later, a pod starts crash-looping at 2am, or a node gets cordoned during a deploy, or a HorizontalPodAutoscaler does something nobody expected under load.
  4. Nobody can answer “who owns this” or “what’s the rollback” with confidence, because those questions were never answered before go-live.

The cluster isn’t the hard part. The hard part is everything that has to be true around the cluster for it to be operable by a team that didn’t build it.
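The autoscaler surprise in step 3 can be the documented scaling rule applied to a CPU request nobody chose deliberately. The sketch below is a simplified version of the replica calculation described in the Kubernetes HorizontalPodAutoscaler documentation (desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a 0.1 tolerance). It ignores min/max replica bounds, stabilization windows, and missing metrics. The function names and the example numbers are illustrative assumptions, not anything from a real cluster.

```python
import math


def cpu_utilization(usage_millicores: int, request_millicores: int) -> float:
    """CPU utilization as the HPA sees it: usage as a percentage of the request."""
    return 100 * usage_millicores / request_millicores


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Simplified HPA rule; real controllers also clamp to min/max replicas."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, no scaling
    return math.ceil(current_replicas * ratio)


# Hypothetical workload: 4 pods, each using 400m CPU, target utilization 80%.
sized_request = desired_replicas(4, cpu_utilization(400, 500), 80)
copied_request = desired_replicas(4, cpu_utilization(400, 100), 80)
print(sized_request, copied_request)  # → 4 20
```

Same pods, same traffic: with a 500m request the autoscaler holds at 4 replicas, and with a 100m request carried over from a template it asks for 20. Neither result is a bug; the second is what the formula does with a request that was never sized.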

What actually needs to exist before day one

A few things that are easy to skip under deadline pressure, and that you will need on day two whether you’ve prepared them or not:

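  • Resource requests and limits on every workload, not just the ones that crashed in staging. The scheduler places pods by their requests, and a HorizontalPodAutoscaler that targets CPU utilization measures usage as a percentage of the request, so without them both are guessing.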
  • A documented rollback path that doesn’t depend on the person who did the migration being awake. If your rollback is “ask Allan,” you don’t have a rollback.
  • Alerting tied to user-facing symptoms, not just node and pod health. kubectl get pods looking fine doesn’t mean users are being served.
  • An owner for cluster-level concerns — node pool sizing, upgrade cadence, PodDisruptionBudgets — that isn’t “whoever’s free.” Cluster upgrades are infrastructure work, not a side effect of an app team’s sprint.
  • A clear answer to “what happens when a node dies.” If the answer involves manual intervention, that’s the gap that will show up in your first on-call rotation.
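The first item on that list is the easiest one to check mechanically. Below is a minimal sketch of such a check, assuming manifests have already been parsed into dictionaries (for example from their YAML) and follow the Deployment layout with containers under spec.template.spec.containers. The function name, the message format, and the policy of requiring both CPU and memory for requests and limits are assumptions made for illustration; init containers and other workload kinds are not covered.

```python
from typing import Iterator

# Assumed policy for this sketch: every container declares all four values.
KINDS = ("requests", "limits")
RESOURCES = ("cpu", "memory")


def missing_resources(manifest: dict) -> Iterator[str]:
    """Yield one message per absent resource setting in a parsed manifest."""
    name = manifest.get("metadata", {}).get("name", "<unnamed>")
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        for kind in KINDS:
            declared = resources.get(kind) or {}
            for resource in RESOURCES:
                if resource not in declared:
                    cname = container.get("name", "<unnamed>")
                    yield f"{name}/{cname}: no {kind}.{resource}"


# Hypothetical workload that sets requests but no limits.
example = {
    "kind": "Deployment",
    "metadata": {"name": "checkout"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "api",
         "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}}},
    ]}}},
}

for problem in missing_resources(example):
    print(problem)
# → checkout/api: no limits.cpu
# → checkout/api: no limits.memory
```

Run against every manifest in the repository rather than the ones that already caused trouble, a check like this turns "on every workload" from an intention into something a pipeline can fail on.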

The actual fix

None of this is exotic. It’s the same operational discipline that any production system needs — Kubernetes just removes enough friction from deploying things that teams sometimes skip the discipline because the platform feels like it’s handling it for them.

It isn’t. The kubelet will happily keep restarting the containers of a pod with no resource limits until the node falls over. Argo CD will happily sync a broken manifest to every environment you’ve wired it to. The platform gives you primitives, not judgment.
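To make "happily keep restarting" concrete: the Kubernetes documentation describes restarts of a crashing container as following an exponential back-off of 10s, 20s, 40s and so on, capped at five minutes. The sketch below reproduces that schedule under those default values, which are an assumption here since some releases allow tuning them. The function name is illustrative.

```python
from itertools import islice
from typing import Iterator


def restart_delays(initial: int = 10, cap: int = 300) -> Iterator[int]:
    """Yield successive restart delays in seconds: doubling, then capped."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * 2, cap)


print(list(islice(restart_delays(), 8)))
# → [10, 20, 40, 80, 160, 300, 300, 300]
```

The generator never terminates, and neither does the real loop: after the fifth restart the container is retried every five minutes indefinitely. Nothing in the primitive decides that this is an incident; an alert on restart counts or on the user-facing symptom has to do that.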

If you’re migrating workloads to Kubernetes, budget real time — not an afterthought sprint — for the operational readiness work above. The migration itself is the easy 80%. The remaining 20% is what determines whether your team trusts the platform after the first real incident, or starts looking for reasons to route around it.