KEDA on Kubernetes: Event-Driven Autoscaling and Scale to Zero

Use KEDA on Kubernetes to scale from queue depth, lag, and schedules, and understand when scale to zero is worth the cold-start trade-off.

July 3, 2026 Platform Engineering 9 min read

CPU-based autoscaling works when CPU is the thing that limits throughput. A lot of Kubernetes workloads are not like that. Queue workers wait on I/O, consumers block on upstream APIs, and event processors sit mostly idle until a burst arrives. In those cases, a HorizontalPodAutoscaler watching CPU is usually answering the wrong question.

KEDA fixes that by letting Kubernetes autoscaling follow the signal that actually represents demand. Queue depth, Kafka lag, Prometheus queries, and schedules are all fair game. For platform teams, the useful part is not just that pods scale up sooner. It is that workloads which should be idle can finally scale to zero without a pile of bespoke logic around them.

Why KEDA beats CPU-only HPA for queue workers

The usual failure mode is easy to spot: the worker deployment looks healthy, CPU stays low, and the backlog grows anyway.

That happens because CPU is often a poor proxy for work waiting in the system. A queue consumer uses very little CPU while it is:

  • waiting on network calls
  • sitting in acknowledgement loops
  • blocked by downstream rate limits
  • sleeping between bursts

If the service is mostly I/O-bound, CPU can stay flat while the queue quietly fills up. HPA is still doing exactly what you asked it to do. You just asked the wrong thing.

KEDA changes the question from “how busy are the pods?” to “how much work is waiting?” For queue consumers, that is usually the metric you wanted all along.

A KEDA ScaledObject that follows queue depth

Imagine a worker deployment that drains jobs from SQS. The workload should sit at zero when the queue is empty, then scale up when there is actual backlog.

A simple ScaledObject looks like this:

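apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invoice-worker
  namespace: payments
spec:
  scaleTargetRef:
    name: invoice-worker    # workload to scale; kind defaults to Deployment
  minReplicaCount: 0        # allow scale to zero when the queue is empty
  maxReplicaCount: 10       # upper bound on replicas
  pollingInterval: 15       # seconds between trigger checks (default 30)
  cooldownPeriod: 300       # seconds after the last active trigger before scaling to zero
  triggers:
    - type: aws-sqs-queue
      metadata:
        # account_id is a placeholder for the AWS account ID
        queueURL: https://sqs.eu-west-1.amazonaws.com/account_id/invoice-jobs
        queueLength: '5'    # target number of messages per replica
        awsRegion: eu-west-1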

A few knobs matter more than most first drafts admit.

minReplicaCount: 0 is what allows the worker to scale to zero. pollingInterval controls how often KEDA checks the trigger source; the default is 30 seconds, and here we cut it to 15 because queue latency is visible to users. cooldownPeriod defaults to 300 seconds and only applies when KEDA is scaling the workload back down to zero.

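The queue target is the other part people often wave away too quickly. If one worker can comfortably process about five messages at once, queueLength: '5' is a sensible starting point. If it can clear 20 before latency gets ugly, use 20. KEDA is not doing magic. It is turning backlog into a replica count: with a target of 5 and 42 messages counted in the queue, the HPA asks for roughly ceil(42 / 5) = 9 replicas, and maxReplicaCount caps whatever that division produces.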

Where KEDA ends and HPA begins

KEDA is not a replacement for Kubernetes autoscaling. It sits in front of it.

The flow is roughly:

  1. KEDA polls the trigger source.
  2. When the trigger becomes active, KEDA scales the workload from 0 to 1 replica.
  3. KEDA exposes the external metric to Kubernetes.
  4. The HPA uses that metric to scale from 1 replica upwards.

That split matters operationally. KEDA handles the event source and the jump from zero, while Kubernetes still handles the familiar scheduling and rollout behaviour once replicas exist. The Kubernetes HPA controller normally asks for metrics on its own sync loop, usually every 15 seconds, so the final reaction time is a combination of KEDA polling, HPA sync, image pull, and readiness.

That is why scale-to-zero is never free. It saves idle spend, but the first event in a burst still has to wait for a pod to start.

Scale to zero is the feature people stay for

Teams often discover KEDA because they want smarter scale-up. They usually keep it because scale to zero is operationally useful.

If a worker only runs when there is work in the queue, there is no reason to keep a warm replica around overnight. KEDA lets you shut the deployment down completely and wake it back up when demand returns. For bursty background jobs, that can remove a lot of waste without asking application teams to build their own scheduler.

The trade-off is cold start. If the deployment is at zero replicas, the first few messages wait for a pod to become ready. Whether that is acceptable depends on the workload.

A batch consumer that can tolerate a short delay is usually fine. A user-facing webhook handler or latency-sensitive API is often not. That is why minReplicaCount is a design decision, not a neutral default.

There is another detail worth knowing: idleReplicaCount exists, but KEDA documents that the only supported value today is 0 because of HPA limitations. If you always need one warm pod, the clearer option is usually to skip idleReplicaCount and set minReplicaCount: 1 on purpose.
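
As a minimal sketch of that choice, the fragment below reuses the invoice-worker name from the earlier example and keeps one replica warm. The replica limits are illustrative assumptions, not a recommendation.

```yaml
# Fragment of a ScaledObject spec: keep one warm replica instead of scaling to zero.
spec:
  scaleTargetRef:
    name: invoice-worker
  minReplicaCount: 1    # never drop below one pod, so the first message has no cold start
  maxReplicaCount: 10   # illustrative upper bound
  # idleReplicaCount is deliberately left unset
```

The cost is one pod running around the clock, which is the price of removing the cold start.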

The timing and threshold knobs that matter most

The best KEDA setups are the ones where the team already knows something about workload throughput.

Start with three questions:

  • how many messages can one pod process before latency becomes a problem?
  • how quickly do you need the first extra pod to appear?
  • how expensive is a cold start for this workload?

Those answers drive the settings that matter:

  • queueLength or trigger threshold decides how much work each replica should absorb.
  • pollingInterval decides how quickly KEDA notices a change while the deployment is at zero.
  • cooldownPeriod decides how long KEDA waits after activity stops before dropping back to zero.

If the threshold is too low, you scale aggressively and spend money for little benefit. If it is too high, the backlog grows before extra replicas appear. If the cooldown is too short, the workload flaps. If it is too long, you keep idle replicas around after the burst has gone.
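
Because cooldownPeriod only covers the final step to zero, flapping between higher replica counts is governed by the HPA's own scale-down behaviour, which KEDA lets you set under advanced.horizontalPodAutoscalerConfig.behavior. A minimal sketch follows; the numbers are assumptions chosen for illustration, not tuned values.

```yaml
# Fragment of a ScaledObject spec: slow down HPA scale-down between N and 1 replicas.
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 120   # use the highest recommendation seen in the last 2 minutes
          policies:
            - type: Percent
              value: 50                     # remove at most half of the replicas...
              periodSeconds: 60             # ...per minute
```

A longer stabilization window trades a little idle spend for fewer replica swings during uneven bursts.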

The sharp edges that show up in real clusters

KEDA is straightforward once it is wired in. The painful parts usually appear in the first implementation.

Pick the right trigger

Queue depth is good for queue consumers. Kafka lag is good for Kafka consumers. A Prometheus query is useful when the demand signal already exists in your metrics stack. Cron-based scaling is fine for predictable peaks, but it is not really event-driven autoscaling. It is scheduled capacity planning with fewer steps.

Do not pick a trigger because the scaler exists. Pick the one that genuinely represents demand.
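
For comparison, here is roughly what the other triggers look like. The broker address, consumer group, topic, Prometheus URL, query, and schedule are all hypothetical placeholders. The three entries are listed together only to show their shape side by side; a real ScaledObject would normally carry the one trigger that matches its demand signal.

```yaml
# Fragment of a ScaledObject spec; every value below is a placeholder.
triggers:
  # Kafka consumer lag
  - type: kafka
    metadata:
      bootstrapServers: kafka.example.internal:9092
      consumerGroup: invoice-consumers
      topic: invoice-events
      lagThreshold: '50'        # target lag per replica
  # A demand signal that already exists in Prometheus
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.example.internal:9090
      query: sum(rate(jobs_enqueued_total[2m]))
      threshold: '100'          # target query value per replica
  # Scheduled capacity for a predictable peak
  - type: cron
    metadata:
      timezone: Europe/London
      start: 0 8 * * 1-5        # weekdays at 08:00
      end: 0 18 * * 1-5         # weekdays at 18:00
      desiredReplicas: '5'
```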

Treat authentication as part of the design

The scaler has to read the queue or metrics source, so IAM and credentials are part of the architecture. On AWS, that usually means deciding whether access belongs to the KEDA operator, the workload's own identity, or a dedicated TriggerAuthentication object. If you leave that decision until after the first incident, you have already made it badly.
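
One way to make that decision explicit is a TriggerAuthentication that the trigger references by name. The sketch below assumes a recent KEDA release where the aws pod identity provider is available (older releases used different provider names) and assumes the IAM role binding already exists. The object name invoice-worker-sqs-auth is hypothetical.

```yaml
# Hypothetical TriggerAuthentication using pod identity on AWS.
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: invoice-worker-sqs-auth
  namespace: payments
spec:
  podIdentity:
    provider: aws
---
# Fragment of the ScaledObject spec: the SQS trigger points at the object above.
triggers:
  - type: aws-sqs-queue
    authenticationRef:
      name: invoice-worker-sqs-auth
    metadata:
      queueURL: https://sqs.eu-west-1.amazonaws.com/account_id/invoice-jobs
      queueLength: '5'
      awsRegion: eu-west-1
```

Keeping credentials in a separate object means the scaling rule and the access decision can be reviewed independently.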

Remember that zero-to-one is different from one-to-many

The zero-to-one step is where KEDA earns its keep. Once the deployment is running, you are back in the normal HPA world. That means readiness probes, slow image pulls, and application warm-up still matter. KEDA cannot rescue a worker that takes two minutes to become useful.

When KEDA is a good fit

KEDA is useful when all of these are true:

  • the workload scales with something other than CPU
  • the signal you care about is visible externally
  • the workload can tolerate some scale-up delay
  • you want the workload to go idle when demand disappears

That makes it a good fit for queue workers, event processors, scheduled jobs, and bursty internal services.

It is less compelling when a service must stay hot for latency reasons or when the demand signal is still fuzzy. If you are trying to autoscale a workload because it “feels busy”, KEDA will not save you from choosing the wrong metric.

What changes once you scale on the queue instead of CPU

The immediate gain is not only cost, though scale to zero can help there. The bigger operational win is that replica changes become easier to explain.

Once the scaler follows the queue, the question of “why did replicas go up?” has an obvious answer: because work was waiting. That is much easier to reason about than a CPU graph that never really reflected demand in the first place.

For event-driven workloads, queue depth is usually the control signal. CPU is just noise.