Kubernetes signals for Root Cause Analysis (RCA)

Kubernetes gives you a lot of data—sometimes too much. When performance degrades, the goal is not to stare at every chart,
but to find the handful of Kubernetes signals that reliably explain why latency (especially p95/p99) got worse.
This guide is a practical, tool-agnostic checklist: what to look at first, what it means, and how to connect it back to
traces, logs, and application behavior.

You don’t need to become a full-time platform engineer to use these signals effectively. The trick is learning a small set of
patterns: CPU throttling, memory pressure, restarts, noisy neighbors, network issues, and autoscaling side-effects.

The mental model: Kubernetes signals explain “the environment”

Observability signals fall into layers:

  • Application metrics (latency, error rate, throughput) tell you the symptom.
  • Traces tell you where time is spent (critical path, dominant spans).
  • Kubernetes signals tell you what the environment is doing (saturation, limits, restarts, scheduling).

When p99 spikes, Kubernetes signals are often the difference between “service X is slow” and
“service X is slow because its pods are CPU-throttled on a noisy node after scaling.”

The RCA sequence (recommended order)

When a latency incident hits, use this order. It prevents rabbit holes:

  1. Scope the blast radius (which service, endpoint, region, version?)
  2. Look for saturation (CPU, memory, IO/network symptoms)
  3. Check Kubernetes “events” (restarts, reschedules, failed probes)
  4. Confirm in traces (wait vs work, downstream spans, retries)

1) CPU: usage vs throttling (the #1 Kubernetes performance trap)

CPU is the most common reason Kubernetes performance surprises people. The trap is this:
CPU usage can look “okay” while latency is terrible. Limits are enforced per short CFS scheduling period (100 ms by default),
so a container that bursts can be throttled even when its average usage looks low.

What to check

  • CPU usage per pod/container (is it consistently high?)
  • CPU throttling (are containers frequently throttled? see the query sketch after this list)
  • CPU requests/limits (are limits low relative to real needs?)
  • Node CPU pressure (is the node overcommitted/noisy neighbor?)
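
A minimal sketch of the throttling check, assuming a Prometheus server that scrapes cAdvisor metrics; the endpoint URL and
the "checkout" namespace are placeholders for your own setup:

    # Ratio of throttled CFS periods per container over the last 5 minutes.
    import requests

    PROM_URL = "http://localhost:9090"   # assumption: adjust to your Prometheus endpoint
    QUERY = (
        'sum by (namespace, pod, container) '
        '(rate(container_cpu_cfs_throttled_periods_total{namespace="checkout"}[5m])) '
        '/ '
        'sum by (namespace, pod, container) '
        '(rate(container_cpu_cfs_periods_total{namespace="checkout"}[5m]))'
    )

    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        labels = result["metric"]
        ratio = float(result["value"][1])
        if ratio > 0.25:  # rough rule of thumb: >25% of periods throttled deserves a look
            print(f'{labels.get("pod")}/{labels.get("container")}: {ratio:.0%} of CFS periods throttled')

If the ratio is high while plain CPU usage looks moderate, you are in the “low usage + high throttling” case described below.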

What it means

  • High usage + high throttling: you are CPU-constrained; queues form; p99 rises.
  • Low usage + high throttling: short bursts get throttled; tail latency spikes.
  • High usage + no throttling: you may be near saturation but not hard-limited; still expect queueing.

How it shows up in traces

  • Increased “time in service” (spans longer without downstream cause)
  • More waiting/queue time inside the service (thread pool queues build up)
  • Requests get slower uniformly under load; p95 and p99 drift upward

Practical tip: if p99 spikes after a deploy and CPU throttling rises at the same time, treat that as a strong causal clue.

2) Memory: pressure, OOM kills, and restarts

Memory issues can create two kinds of performance problems:
slow degradation (GC / memory churn / caching growth) and sudden failures (OOM kills and restarts).

What to check

  • Container memory usage vs request/limit
  • OOMKilled events (containers killed by the kernel OOM killer for exceeding their memory limit; see the sketch after this list)
  • Restart count spikes
  • Node memory pressure (evictions, memory pressure conditions)
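
A minimal sketch of the OOM and usage checks, using the official Kubernetes Python client (pip install kubernetes);
the live-usage part assumes metrics-server is installed, and "checkout" is a placeholder namespace:

    from kubernetes import client, config

    config.load_kube_config()                 # or config.load_incluster_config() inside the cluster
    core = client.CoreV1Api()
    NAMESPACE = "checkout"

    # Containers whose last termination was an OOM kill, plus their restart counts
    for pod in core.list_namespaced_pod(NAMESPACE).items:
        for cs in pod.status.container_statuses or []:
            last = cs.last_state.terminated
            if last and last.reason == "OOMKilled":
                print(f"{pod.metadata.name}/{cs.name}: OOMKilled at {last.finished_at}, "
                      f"restarts={cs.restart_count}")

    # Live memory usage per container from the metrics.k8s.io API (values like "256Mi")
    metrics = client.CustomObjectsApi().list_namespaced_custom_object(
        "metrics.k8s.io", "v1beta1", NAMESPACE, "pods")
    for pod_metrics in metrics["items"]:
        for c in pod_metrics["containers"]:
            print(pod_metrics["metadata"]["name"], c["name"], "memory:", c["usage"]["memory"])

Compare the live usage against the limits in the pod spec; containers sitting close to their limit are the ones to watch.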

What it means

  • Approaching memory limit: increased GC pressure, latency spikes, CPU rises, or sudden OOM.
  • OOMKilled: the pod dies mid-traffic, retries happen, errors spike, and tail latency explodes.
  • Node pressure / evictions: pods get evicted and rescheduled; cold starts and uneven load appear.

How it shows up in traces/logs

  • In traces: missing spans due to restarts; retry patterns; increased error/timeout spans.
  • In logs: abrupt termination, startup logs repeating, “killed” messages, readiness failing.

3) Pods restarting: the silent tail-latency amplifier

Even “healthy” restarts can wreck p99. If a subset of pods restarts, the remaining pods take more load (hotter),
and new pods have cold caches and cold JIT (in some runtimes). The user-visible result is often:
p99 spikes while p50 looks okay.

What to check

  • Restart count by pod and deployment (see the sketch after this list)
  • Readiness/liveness probe failures
  • CrashLoopBackOff events
  • Deployment rollout timing (did the spike align with rollout?)
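
A minimal sketch that lists restart counts and recent probe failures so you can line them up with a rollout or a p99 spike;
same assumptions as above (Kubernetes Python client, "checkout" as a placeholder namespace):

    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()
    NAMESPACE = "checkout"

    # Restart counts per pod, highest first
    restarts = []
    for pod in core.list_namespaced_pod(NAMESPACE).items:
        total = sum(cs.restart_count for cs in (pod.status.container_statuses or []))
        if total:
            restarts.append((total, pod.metadata.name))
    for count, name in sorted(restarts, reverse=True):
        print(f"{name}: {count} restarts")

    # Readiness/liveness probe failures show up as Warning events with reason "Unhealthy"
    for ev in core.list_namespaced_event(NAMESPACE).items:
        if ev.type == "Warning" and ev.reason == "Unhealthy":
            print(ev.last_timestamp, ev.involved_object.name, ev.message)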

4) Scheduling and rescheduling: “why are my pods on different nodes?”

Kubernetes moves pods. This can change performance without any application change.
A service can become slower because it landed on a noisier node, a different availability zone, or a node with less CPU headroom.

What to check

  • Pod placement: did pods move to new nodes, availability zones, or regions? (sketch below)
  • Node conditions: CPU pressure, memory pressure, disk pressure.
  • Resource requests/limits: are you under-requesting and getting squeezed?
  • Affinity/anti-affinity: did policy changes affect placement?
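
A minimal sketch of the placement check: which node and zone each pod runs on, and whether that node reports pressure.
It uses the Kubernetes Python client and the well-known topology.kubernetes.io/zone label; "checkout" is a placeholder namespace:

    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    # Zone and pressure conditions per node (MemoryPressure/DiskPressure/PIDPressure should be "False")
    node_info = {}
    for node in core.list_node().items:
        pressure = [c.type for c in node.status.conditions
                    if c.type in ("MemoryPressure", "DiskPressure", "PIDPressure") and c.status == "True"]
        zone = (node.metadata.labels or {}).get("topology.kubernetes.io/zone", "unknown")
        node_info[node.metadata.name] = (zone, pressure)

    for pod in core.list_namespaced_pod("checkout").items:
        zone, pressure = node_info.get(pod.spec.node_name, ("unknown", []))
        print(f"{pod.metadata.name} -> node={pod.spec.node_name} zone={zone} "
              f"pressure={pressure or 'none'} started={pod.status.start_time}")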

What it means

  • Performance becomes uneven: some pods are fast, some slow (watch per-pod latency).
  • Tail latency worsens because the slowest pod now defines p99 for the service.

5) HPA and autoscaling side-effects

Autoscaling can save you—or make you slower temporarily.
Scaling up creates new pods with cold caches, cold connections, and sometimes warm-up time.
If scaling is too slow, you get queueing. If scaling is too aggressive, you can cause downstream overload.

What to check

  • Replica count changes over time (see the sketch after this list)
  • HPA metrics (what is it scaling on—CPU, custom metric?)
  • Scale-up latency (how long from load increase to more ready pods?)
  • Queue depth / pool wait time during scale events
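
A minimal sketch of the autoscaling check using the autoscaling/v1 API via the Kubernetes Python client; "checkout" is a
placeholder namespace, and HPAs that scale on custom metrics will not report a CPU percentage here:

    from kubernetes import client, config

    config.load_kube_config()
    autoscaling = client.AutoscalingV1Api()

    for hpa in autoscaling.list_namespaced_horizontal_pod_autoscaler("checkout").items:
        print(f"{hpa.metadata.name}: current={hpa.status.current_replicas} "
              f"desired={hpa.status.desired_replicas} "
              f"cpu={hpa.status.current_cpu_utilization_percentage}% "
              f"last scale={hpa.status.last_scale_time} "
              f"(min={hpa.spec.min_replicas}, max={hpa.spec.max_replicas})")

A persistent gap between desired and current replicas, or a last-scale timestamp well after the load increase, points to slow scale-up.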

What it means

  • Scale-up too slow: queueing grows → p99 spikes.
  • Scale-up causes downstream overload: retries/timeouts increase → tail gets heavier.
  • Cold start effect: new pods are slower for a while, increasing variance and p99.

6) Network and ingress: the “everything is fine” trap

Network issues often look like “random slowness” because they create variability.
Even small packet loss or retransmits can create long tails, especially for chatty microservice calls.

What to check

  • Ingress metrics: request duration, 4xx/5xx, upstream response times (see the query sketch after this list)
  • Network errors: retransmits, packet loss (if your stack exposes it)
  • DNS latency: slow or failing lookups create intermittent, hard-to-reproduce tails
  • Service mesh metrics (if used): retries, timeouts, upstream latency distributions
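
A minimal sketch of the ingress check, assuming Prometheus scrapes ingress-nginx; the metric name is controller-specific,
so treat it as a placeholder if you run a different ingress or a mesh:

    import requests

    PROM_URL = "http://localhost:9090"   # assumption: adjust to your Prometheus endpoint
    # p99 request duration at the ingress, per host, over the last 5 minutes
    QUERY = (
        'histogram_quantile(0.99, sum by (le, host) '
        '(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])))'
    )

    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        print(result["metric"].get("host", "all"), "p99:", float(result["value"][1]), "s")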

How it shows up in traces

  • Downstream spans become variable with no code change.
  • Retries appear for otherwise healthy dependencies.
  • Cross-zone and cross-node calls are slower than same-node calls.

7) Storage and disk pressure (less common, but brutal when it happens)

If your service depends on disk (local storage, persistent volumes, logging throughput), disk latency can cause huge tail spikes.

What to check

  • Disk pressure node condition
  • Volume latency and usage (if exposed; see the sketch after this list)
  • Log throughput and blocked writes (if your logging pipeline backs up)
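
A minimal sketch of a volume check, assuming Prometheus scrapes the kubelet's volume stats (kubelet_volume_stats_*);
"checkout" is a placeholder namespace:

    import requests

    PROM_URL = "http://localhost:9090"   # assumption: adjust to your Prometheus endpoint
    # Percentage of each persistent volume claim that is already used
    QUERY = (
        '100 * kubelet_volume_stats_used_bytes{namespace="checkout"} '
        '/ kubelet_volume_stats_capacity_bytes{namespace="checkout"}'
    )

    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        pvc = result["metric"].get("persistentvolumeclaim", "?")
        print(f'{pvc}: {float(result["value"][1]):.1f}% used')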

8) Kubernetes events that often correlate with incidents

Kubernetes emits events that are easy to ignore but often extremely informative during an RCA; a quick way to summarize them follows the list:

  • OOMKilled, Evicted, CrashLoopBackOff
  • FailedScheduling (no node satisfies the pod’s resource requests or constraints)
  • Unhealthy (readiness/liveness probe failures)
  • ImagePullBackOff during rollouts
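
A minimal sketch that summarizes recent Warning events by reason, so the informative ones stand out from the noise;
Kubernetes Python client, "checkout" as a placeholder namespace:

    from collections import Counter
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    reasons = Counter()
    for ev in core.list_namespaced_event("checkout").items:
        if ev.type == "Warning":
            reasons[ev.reason] += ev.count or 1   # events carry a count when they repeat

    for reason, count in reasons.most_common():
        print(f"{reason}: {count}")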

Putting it together: a practical Kubernetes RCA checklist

Here’s a short checklist you can run every time. It maps Kubernetes signals to likely cause classes.

Checklist

  1. Is the issue isolated? Compare by service, endpoint, region, version, pod.
  2. CPU throttling? If yes: increase limits, adjust requests, investigate noisy nodes.
  3. Memory pressure/OOM? If yes: confirm limit issues, check GC/churn, stop restart loops.
  4. Restarts or failed probes? If yes: correlate with p99 spikes and retry patterns.
  5. Scheduling changes? If yes: per-node performance variance, placement policies.
  6. Autoscaling? If yes: scale-up delay, cold start effects, downstream overload.
  7. Network variability? If yes: ingress/mesh retries, DNS, cross-zone latency.
  8. Confirm in traces: dominant spans, wait vs work, retries/timeouts.
  9. Explain with logs: error codes, timeouts, retry behavior, business context.
  10. Write the causal chain and define mitigate/fix/prevent actions.

Example: p99 spikes, but the service code didn’t change

A common scenario: p99 increases, but there was no deploy for the affected service.

  • Metrics show p99 worse only for a subset of pods.
  • Kubernetes shows those pods got rescheduled onto a new node pool.
  • CPU throttling is higher on those nodes.
  • Traces show increased “in-service” time (work takes longer), not downstream calls.

The RCA is no longer “mystery latency.” It’s: pod placement + CPU throttling caused slower execution on a subset of pods,
which worsened tail latency.

Conclusion

Kubernetes performance diagnosis becomes manageable when you focus on the signals that actually explain tail latency:
CPU throttling, memory pressure/OOM, restarts, scheduling changes, autoscaling behavior, and network variability.
Use metrics to scope, Kubernetes signals to understand the environment, traces to localize the critical path, and logs to explain
the failure mode. That’s the fastest path from “p99 is bad” to “here’s why.”

Next: continue with the Root Cause Analysis Checklist or go back to Traces, Metrics, Logs.
