Surviving Node Drains with PDBs, Topology Spread, and Graceful Shutdown
Node drains are ordinary Kubernetes maintenance, but they only stay boring when replicas are spread out, disruptions are budgeted, and pods shut down cleanly.
Kubernetes node drains are supposed to be routine. You cordon a node, evict the pods, wait for replacements, and move on with the upgrade or scale-down. In practice, drains are where a lot of otherwise healthy-looking workloads discover they were only healthy when nobody asked them to move.
The fix is not one setting. It is a combination of three things that need to agree with one another:
- PodDisruptionBudgets tell Kubernetes how much voluntary disruption you can tolerate.
- Topology spread keeps replicas from landing on the same piece of hardware in the first place.
- Graceful shutdown gives in-flight requests time to finish before the pod disappears.
If any one of those is missing, a node drain turns from maintenance into a small incident.
What kubectl drain actually does
It is easy to think of kubectl drain as “delete everything on this node”. That is not quite right. Drain is a coordinated sequence:
- cordon the node so no new pods land there
- evict the pods through the Eviction API rather than deleting them outright
- retry any eviction that a PodDisruptionBudget refuses with “not yet”
- let each pod’s shutdown hooks and grace period run before the process exits
- wait for the evicted pods to terminate while their controllers create replacements elsewhere, if there is room for them
That means the drain is only as smooth as the workload underneath it. If the cluster has room, if the workload is spread out, and if the pod can leave cleanly, nothing dramatic happens. If not, the drain exposes every assumption you forgot to write down.
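Each eviction in that sequence is a request against the pod's eviction subresource, not a plain delete. As a minimal sketch, this is roughly the object a client submits; the pod name and namespace are hypothetical and chosen only for illustration:

```yaml
# Body POSTed to /api/v1/namespaces/default/pods/web-7d9c5b/eviction
# "web-7d9c5b" and "default" are hypothetical names for illustration.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: web-7d9c5b
  namespace: default
```

When a PodDisruptionBudget would be violated, the API server rejects the request with HTTP 429 and the client retries later. That retry loop is what a "hanging" drain usually is.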
A useful starting point for a manual drain looks like this:
kubectl cordon ip-10-0-12-34
kubectl drain ip-10-0-12-34 --ignore-daemonsets --delete-emptydir-data
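The flags matter. --ignore-daemonsets tells drain to skip DaemonSet-managed pods, which are meant to run on every node; without it, drain refuses to proceed when it finds any. --delete-emptydir-data lets drain evict pods that use emptyDir volumes, accepting that the data in those volumes is deleted with the pod. Ephemeral data on the node is ephemeral for a reason.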
PodDisruptionBudgets: useful until they are too useful
A PodDisruptionBudget is the part of the system that says how many replicas must remain available during voluntary disruption. That makes drains safer, but it can also make them hang forever if you ask for more availability than the cluster can currently provide.
A simple PDB might look like this:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
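That says: while Kubernetes is doing a voluntary operation, keep at least two pods available. If you have three replicas across three nodes, that is reasonable. If you have three replicas all sitting on one node, the budget still does exactly what you asked and nothing more: the drain can evict only one pod at a time, and every further eviction waits until a replacement is Ready somewhere else.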
That is the trap.
A PDB does not create capacity. It only blocks the eviction until something else becomes available. If your cluster has no spare nodes, no room to schedule replacements, or no spread across failure domains, the drain can sit there waiting while everyone assumes “Kubernetes must be doing something.”
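One way to keep a budget from demanding more than the cluster can deliver is to express it as a ceiling on disruption instead of a floor on availability. The following is a sketch of that alternative for the same workload, not a recommendation for every service; the value 1 is an assumption:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  # Allow at most one matching pod to be voluntarily disrupted at a time,
  # regardless of how many replicas the Deployment currently runs.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: web
```

minAvailable and maxUnavailable are mutually exclusive, so a budget sets one or the other. Whichever you choose, a minAvailable equal to the replica count (or a maxUnavailable of 0) leaves no evictions allowed, and a drain will wait on it indefinitely.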
The first thing to check when a drain is stuck is the budget itself:
kubectl get pdb -A
kubectl describe pdb web
If the budget says no, it is usually telling you something useful: your rollout assumptions and your placement assumptions do not match.
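The numbers behind that "no" are in the budget's status, which kubectl get pdb web -o yaml prints in full. The excerpt below uses illustrative values, not captured output, for the three-replica workload with one pod not yet Ready:

```yaml
# Illustrative values only: three pods expected, one of them not Ready.
status:
  expectedPods: 3        # pods matched by the selector
  currentHealthy: 2      # matched pods that are currently Ready
  desiredHealthy: 2      # derived from minAvailable: 2
  disruptionsAllowed: 0  # currentHealthy - desiredHealthy, never below zero
```

With disruptionsAllowed at 0, every eviction against app=web is refused until a third pod becomes Ready.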
Topology spread is what keeps drains boring
A PDB answers the question “how many can I lose?” Topology spread answers the more important question: “where will those pods land before I need them?”
If replicas are free to bunch up on one node, then a node drain, a spot reclaim, or a kernel upgrade can take out a whole service at once. You do not want to discover that after the node has already been cordoned.
For most workloads, a topologySpreadConstraints block is a better default than hoping the scheduler will “do the right thing” on its own:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      terminationGracePeriodSeconds: 30
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: ghcr.io/example/web:latest
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            periodSeconds: 5
There are two important details here.
First, topologyKey: kubernetes.io/hostname spreads pods across nodes, which is the level a node drain cares about. If you run across zones, you may also want a zone-level spread constraint. One does not replace the other.
Second, whenUnsatisfiable: DoNotSchedule makes the scheduler refuse a bad placement instead of quietly packing all three replicas onto one node because that was the quickest option. That refusal is the point. Better an unschedulable pod now than an unavailable service later.
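For a cluster that spans zones, the two levels can sit side by side in the same pod spec. This is a sketch under one assumption: the zone-level constraint is made soft (ScheduleAnyway) so that a missing zone does not block scheduling, which may or may not suit your workload:

```yaml
topologySpreadConstraints:
  # Node-level spread, as in the Deployment above.
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
  # Assumed addition: prefer an even spread across zones, but do not
  # refuse to schedule when that is impossible.
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web
```

The scheduler evaluates both constraints together, so the hard node-level rule still holds even when the zone-level preference cannot be met.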
If you want to see whether a workload is already bunched up, this is the quickest check:
kubectl get pods -l app=web -o wide
If the NODE column shows the same node for every pod, you already know what your next drain will look like.
Graceful shutdown is the last chance to avoid a user-visible error
Even when the drain and the scheduler do everything right, the pod still has to leave the node without dropping work in flight.
That means three things need to line up:
- the pod should stop advertising readiness before it is killed
- the process should stop accepting new connections on SIGTERM
- the grace period should be long enough for in-flight requests to finish
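A lot of people only set terminationGracePeriodSeconds and call it done. That is not enough. Kubernetes starts removing a terminating pod from Service endpoints at about the same moment it starts shutting the container down, and that removal takes time to reach every proxy and load balancer. If the pod still looks ready to them when the process exits, traffic keeps arriving right up to the edge of termination.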
A better pattern is to fail readiness first, then drain existing work, then exit:
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: web
      image: ghcr.io/example/web:latest
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "curl -fsS http://127.0.0.1:8080/drain && sleep 5"]
      readinessProbe:
        httpGet:
          path: /readyz
          port: 8080
        periodSeconds: 5
The exact shutdown hook depends on the application. Some services can flip readiness off and drain on SIGTERM with no helper script at all. Others need a small admin endpoint to stop accepting new work. The shape is more important than the exact implementation: get out of the load balancer before you disappear from the node.
If you do nothing else, make sure the pod is not still ready when the process is seconds away from exiting. That is the easiest way to turn a clean restart into a connection reset.
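For a service that already stops accepting connections and finishes in-flight work on SIGTERM, the hook can shrink to a short pause that gives endpoint removal time to propagate. The sketch below assumes the image contains a shell, and every duration in it is an assumed number to be replaced with your own measurements:

```yaml
spec:
  # Assumed budget: 10s preStop pause + up to 30s for in-flight requests + 5s margin.
  # The grace period covers the preStop hook as well as the time after SIGTERM.
  terminationGracePeriodSeconds: 45
  containers:
    - name: web
      image: ghcr.io/example/web:latest
      lifecycle:
        preStop:
          exec:
            # Runs before SIGTERM is sent to the container; requires /bin/sh in the image.
            command: ["/bin/sh", "-c", "sleep 10"]
```

Because the preStop hook runs inside the grace period, a long hook paired with a short terminationGracePeriodSeconds leaves the application little or no time to drain before it is killed.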
A concrete failure mode
A common failure mode looks like this:
- a Deployment has three replicas
- there is no topology spread constraint
- the pods all land on one node after a rollout or reschedule
- the node gets drained during a routine upgrade
- the PDB either blocks the drain or allows just enough disruption to make the service fall below safe capacity
The result is not mysterious. The cluster did exactly what it was told. The workload was never given a reason to spread out, so the drain found all the replicas in one place and asked them to move at the worst possible moment.
That is why the fix is not “make the PDB stricter”. A stricter PDB can turn the drain into a permanent queue. The real fix is to combine all three layers:
- Spread the pods so one node does not hold the whole service.
- Set a sensible PDB so voluntary disruption is limited but not impossible.
- Shut down cleanly so the last pod on the node leaves without dropping traffic on the floor.
Once those are in place, a drain becomes what it should have been from the start: a boring maintenance event.
Putting it together
If you are hardening a workload for node drains, a good checklist is:
- run at least two or three replicas, depending on the service’s tolerance for loss
- spread replicas across nodes with topologySpreadConstraints
- add a PDB that matches the real availability target
- make readiness fail before termination
- give the application enough grace period to finish work
- test the drain path before you need it in anger
The last item is the one teams usually skip. They verify the Deployment, they verify the HPA, they verify the rolling update, and then they assume the node drain will somehow take care of itself. It will not. Node drains are where the scheduler, the disruption budget, and the process lifecycle meet. If they are not designed together, they will reveal that relationship very quickly.
The takeaway
Node drains are ordinary. The outage is what happens when the workload was never built to move.
If you want drains to stay boring, spread the pods, budget the disruption, and let the process shut itself down before Kubernetes does it for you.