Root Cause Analysis Checklist

Most performance work gets stuck at the symptom: “it’s slow.” A good root-cause analysis (RCA) turns that into an evidence-based explanation: where the time goes, what changed, and why.
This checklist gives you a repeatable workflow you can run under pressure—regardless of whether the tooling is Instana, Dynatrace, Datadog, Grafana/Prometheus, or OpenTelemetry.

The goal is not to create a 30-page report. The goal is to produce a short, confident causal chain and a clear next step:
mitigate now, fix next, and improve telemetry so the next incident is faster to diagnose.

The RCA mindset: evidence beats opinions

RCA is not guessing. It’s narrowing uncertainty. You start broad (“what changed?”), then you eliminate possibilities until
you can tell a coherent story backed by telemetry.

  • Symptom: what users/clients experience (timeouts, slow endpoints, errors).
  • Cause: what in the system produced the symptom (queueing, downstream latency, saturation, GC pauses).
  • Root cause: the underlying change or constraint that enabled the cause (deploy, config, capacity, data growth).

Before you start: the minimum inputs

Try to capture these up front. You don’t need perfection—just enough to avoid wandering.

  • What is slow? Endpoint/journey name, API route, job name, or transaction.
  • When did it start? Time window, plus whether it’s a regression or has always been this slow.
  • Under what conditions? Load level, region, tenant, feature flag, deployment version, time of day.
  • Impact definition: p95/p99 threshold, error rate, timeout rate, user-visible impact.

The 60–90 minute Root Cause Analysis checklist

This workflow is designed to be run quickly: budget 60–90 minutes end to end. In mature setups you can often finish in 30–60 minutes;
if telemetry is weak it may take longer, but the structure still helps.

Step 1: Scope the blast radius (metrics first)

Start with metrics because they’re the fastest way to narrow the problem.

  • Confirm the symptom: which SLI changed? p95/p99 latency, throughput, error rate, timeouts.
  • Slice it: by service, endpoint, region, env, version, pod, tenant.
  • Find the start point: when exactly did it begin? (deployment, config change, traffic shift)
  • Compare baseline vs current: same time yesterday / last week / previous version.

Output of Step 1: a precise statement like “p99 for /checkout is worse only in EU after version 1.42.0.”
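
If your metrics live in a Prometheus-compatible backend, you can produce that statement by slicing the latency histogram directly.
The sketch below is a minimal Python example, assuming a Prometheus HTTP API and a conventional http_request_duration_seconds_bucket
histogram with version and region labels; the endpoint, metric, and label names are assumptions to adapt to your own setup.

    import requests  # assumes the 'requests' package is installed

    PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint

    # p99 per version and region over the last 15 minutes.
    # Metric and label names are assumptions; use whatever your instrumentation emits.
    QUERY = (
        'histogram_quantile(0.99, '
        'sum by (le, version, region) '
        '(rate(http_request_duration_seconds_bucket{route="/checkout"}[15m])))'
    )

    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()

    for series in resp.json()["data"]["result"]:
        labels = series["metric"]
        _, value = series["value"]
        print(f"version={labels.get('version')} region={labels.get('region')} p99={float(value):.3f}s")

Running the same query over a baseline window (for example with an offset modifier or a fixed evaluation time) and comparing
slice by slice usually pins down the scope in one pass.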

Step 2: Decide whether the problem is load-related or change-related

This single decision saves time. Many tail latency issues are either “we’re near saturation” or “something changed.”

  • Load-related: gets worse as RPS/concurrency rises; improves when load drops.
  • Change-related: starts sharply after a deploy/config change; persists even at similar load.

Often it’s both: a change pushes you over a capacity edge.
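
One way to separate the two is to compare latency at similar load before and after the suspected change: if p99 is worse even at
matched RPS, the change itself is implicated; if it only degrades at the highest load levels, you’re likely near a capacity edge.
A rough sketch of that comparison, using illustrative numbers in place of real exported samples:

    # Toy comparison: bucket samples by RPS and compare p99 at matched load.
    # The sample data here is illustrative, not from a real system.
    from collections import defaultdict
    from statistics import median

    baseline = [(120, 0.38), (200, 0.41), (310, 0.52)]   # (rps, p99 seconds) before the change
    current  = [(115, 0.95), (205, 1.40), (300, 2.30)]   # after the change

    def by_load_bucket(samples, bucket=100):
        buckets = defaultdict(list)
        for rps, p99 in samples:
            buckets[rps // bucket].append(p99)
        return {b: median(vals) for b, vals in buckets.items()}

    base, cur = by_load_bucket(baseline), by_load_bucket(current)
    for bucket in sorted(set(base) & set(cur)):
        print(f"~{bucket * 100} rps: baseline p99={base[bucket]:.2f}s, current p99={cur[bucket]:.2f}s")

    # Worse at every matched load level -> change-related (possibly both).
    # Worse only at the highest buckets -> load-related / saturation.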

Step 3: Check saturation signals (the “can we breathe?” test)

Before deep tracing, check whether you’re starving a resource. Saturation creates queueing, and queueing creates p99 spikes.

  • CPU: high usage, CPU throttling (Kubernetes), runnable threads, load average.
  • Memory: memory pressure, OOM kills, swapping (if applicable), heap growth trends.
  • GC/runtime: GC pause time, allocation rate, safepoints (managed runtimes).
  • Thread pools: active threads maxed, queue length rising, rejected tasks.
  • Connection pools: DB/HTTP pool saturation, wait time for connections.
  • IO: disk latency, IO wait, network errors/retransmits (as available).

Output of Step 3: a hypothesis class like “queueing due to DB connection pool saturation” or “CPU throttling on pods.”
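
For the Kubernetes CPU throttling signal specifically, the usual check is the fraction of CFS periods in which a container was
throttled. A minimal sketch, assuming cAdvisor metrics scraped into a Prometheus-compatible backend (the endpoint, namespace filter,
and threshold are illustrative):

    import requests

    PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint

    # Fraction of CPU periods throttled per pod over the last 5 minutes.
    THROTTLE_QUERY = (
        'sum by (pod) (rate(container_cpu_cfs_throttled_periods_total{namespace="checkout"}[5m]))'
        ' / '
        'sum by (pod) (rate(container_cpu_cfs_periods_total{namespace="checkout"}[5m]))'
    )

    result = requests.get(
        f"{PROM_URL}/api/v1/query", params={"query": THROTTLE_QUERY}, timeout=10
    ).json()["data"]["result"]

    for series in result:
        ratio = float(series["value"][1])
        if ratio > 0.25:  # arbitrary threshold; sustained throttling above ~25% is worth investigating
            print(f"pod={series['metric'].get('pod')} throttled {ratio:.0%} of CPU periods")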

Step 4: Trace-first localization (find the critical path)

Now take representative slow requests from the bad window (p99 period). The goal is to identify
the dominant span and the critical path—the sequence of operations that determines end-to-end latency.

  • Pick the right traces: don’t look at “average” traces; pick the slow ones.
  • Find the dominant span: what consumes most of the time?
  • Check for fan-out: many parallel calls; the slowest dependency often defines the tail.
  • Look for retries: repeated spans, timeout patterns, fallback logic.
  • Separate wait vs work: waiting on pools/queues vs CPU work.

Output of Step 4: “The critical path is dominated by PaymentService → FraudService → DB; retries add 800ms.”
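
The “dominant span” question can be answered mechanically once you export a slow trace: compute each span’s self time (its duration
minus the time covered by its children) and sort. The sketch below assumes a simplified, hypothetical export format with span_id,
parent_id, name, and start/end timestamps in seconds; the example spans echo the checkout scenario and are purely illustrative.

    # Find the spans with the largest self time (duration not covered by child spans).
    # Span layout is a hypothetical export format: adapt it to what your tracer produces.
    from collections import defaultdict

    spans = [
        {"span_id": "a", "parent_id": None, "name": "POST /checkout",        "start": 0.00, "end": 2.60},
        {"span_id": "b", "parent_id": "a",  "name": "PaymentService.charge", "start": 0.10, "end": 2.50},
        {"span_id": "c", "parent_id": "b",  "name": "FraudService.check",    "start": 0.20, "end": 2.40},
        {"span_id": "d", "parent_id": "c",  "name": "SELECT fraud_rules",    "start": 0.30, "end": 2.30},
    ]

    children = defaultdict(list)
    for s in spans:
        children[s["parent_id"]].append(s)

    def covered(intervals):
        # Total time covered by child spans, merging overlaps (handles parallel fan-out).
        total, last_end = 0.0, None
        for start, end in sorted(intervals):
            if last_end is None or start > last_end:
                total += end - start
                last_end = end
            elif end > last_end:
                total += end - last_end
                last_end = end
        return total

    self_times = []
    for s in spans:
        child_intervals = [(c["start"], c["end"]) for c in children[s["span_id"]]]
        self_times.append((s["end"] - s["start"] - covered(child_intervals), s["name"]))

    for self_time, name in sorted(self_times, reverse=True):
        print(f"{self_time:.2f}s  {name}")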

Step 5: Prove the cause with correlated metrics

Traces tell you “where time went.” Now prove that your suspected bottleneck is real at system scale.

  • Does p99 correlate with queue depth, pool wait time, CPU throttling, or GC pauses?
  • Does the spike align with a deployment/version, config change, or traffic shift?
  • Is it isolated to a subset: region, tenant, pod, node pool?
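
A quick way to make the first of these checks concrete is to pull both time series for the bad window and compute a simple
correlation. It does not prove causation by itself, but combined with the trace evidence it usually settles which candidate to
pursue. A minimal sketch with illustrative numbers (statistics.correlation needs Python 3.10+):

    # Correlate p99 latency with suspected bottleneck signals over the bad window.
    # Values are illustrative; in practice export them from your metrics backend
    # at the same resolution and time range.
    from statistics import correlation  # Python 3.10+

    p99_seconds  = [0.42, 0.45, 0.60, 1.10, 1.90, 2.40, 2.50, 1.30, 0.55]
    pool_wait_ms = [3,    4,    12,   85,   240,  310,  330,  120,  9]
    cpu_throttle = [0.02, 0.03, 0.02, 0.04, 0.03, 0.02, 0.03, 0.02, 0.03]

    print(f"p99 vs pool wait:    r = {correlation(p99_seconds, pool_wait_ms):.2f}")
    print(f"p99 vs CPU throttle: r = {correlation(p99_seconds, cpu_throttle):.2f}")

    # High correlation for one candidate and not the others is supporting evidence,
    # not proof: confirm the mechanism in traces (wait time on that resource).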

Step 6: Use logs to explain the failure mode

Once you have a scoped trace, logs give you the detailed “what.” Filter logs using trace_id (or request_id),
and look for:

  • Timeouts and retries: “timed out waiting for connection,” “retrying request,” backoff behavior.
  • Errors: exception types, DB error codes, downstream error responses.
  • Business context: which tenant, which payment method, which cart size, which feature flag?
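
If your logs are structured as JSON lines with a trace_id field, this filtering is a few lines of scripting whenever the log UI
can’t do it for you. The field names (trace_id, message, timestamp, level) and the example trace ID below are assumptions; use
whatever your log pipeline actually emits.

    import json
    import sys

    TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"  # the trace you localized in Step 4 (example value)
    PATTERNS = ("timed out", "retrying", "connection", "circuit")

    # Usage: python filter_logs.py < service.log
    for line in sys.stdin:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if record.get("trace_id") != TRACE_ID:
            continue
        message = str(record.get("message", ""))
        marker = "  <-- failure signal" if any(p in message.lower() for p in PATTERNS) else ""
        print(f'{record.get("timestamp", "?")} {record.get("level", "?")} {message}{marker}')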

Step 7: Build the causal chain (the actual RCA)

The deliverable of RCA is a causal chain. A good chain looks like:

  • Trigger/change: “Version 1.42.0 introduced FraudService call.”
  • Mechanism: “FraudService DB pool saturates in EU under peak load.”
  • Amplifier: “Retries amplify tail latency (p99) and increase load.”
  • User impact: “Checkout p99 rises from 400ms to 2.5s; timeouts increase.”

Step 8: Decide actions (mitigate now, fix next, prevent repeat)

Every RCA should end with three categories of actions:

  • Mitigation (now): reduce load, disable feature flag, rate-limit, increase pool size, scale pods, roll back.
  • Fix (next): optimize query, remove N+1 calls, add caching, redesign fan-out, tune timeouts/retries.
  • Prevention: add missing telemetry, dashboards, SLOs, and guardrails (load tests / regression checks).

Common RCA “cause classes” (and what to check)

Most performance incidents fall into a small set of patterns. Classifying the pattern helps you move faster.

1) Downstream dependency is slow

  • Traces show one dominant downstream span (DB, cache, external API).
  • Check dependency latency metrics + error rate.
  • Look for capacity limits (connection pool, rate limits) and retries.

2) Saturation / queueing

  • p99 rises under load; queue depth or wait time increases.
  • Check CPU throttling (Kubernetes), pool utilization, thread queue sizes.
  • Look for “waiting time” in traces.
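
The reason p99 rises so sharply under load is basic queueing behavior: waiting time grows non-linearly as utilization approaches
100%. The toy M/M/1 calculation below is a deliberate simplification of real systems, but it shows why the last few percent of
utilization are so expensive:

    # M/M/1 queue: mean waiting time W = rho / (1 - rho) * service_time.
    # Real services are not M/M/1, but the shape of the curve is the point:
    # queueing delay explodes as utilization approaches 1.
    service_time_ms = 20.0

    for utilization in (0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
        wait_ms = utilization / (1.0 - utilization) * service_time_ms
        print(f"utilization {utilization:.0%}: mean queueing delay ~{wait_ms:.0f} ms "
              f"(plus {service_time_ms:.0f} ms of service time)")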

3) Retry storms and timeout amplification

  • Traces show repeated calls; logs show timeouts and retries.
  • Check retry policies, timeouts, backoff, circuit breakers.
  • Confirm whether retries increase load enough to worsen the incident.
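
When a retry storm is confirmed, the fix usually involves bounding attempts and adding jittered backoff so clients don’t
re-synchronize their load. The sketch below is a generic pattern, not any particular library’s API:

    import random
    import time

    def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0, timeout=1.0):
        """Bounded retries with exponential backoff and full jitter.

        'operation' is any callable that raises on failure; 'timeout' is passed
        through so each attempt is individually bounded.
        """
        for attempt in range(1, max_attempts + 1):
            try:
                return operation(timeout=timeout)
            except Exception:
                if attempt == max_attempts:
                    raise  # surface the failure instead of retrying forever
                # Full jitter: sleep a random amount up to the capped exponential delay.
                delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
                time.sleep(random.uniform(0, delay))

Pair this with a retry budget or circuit breaker so retries stop adding load once the dependency is clearly unhealthy; retries that
look cheap per client are expensive in aggregate.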

4) Cache behavior changes (miss storms)

  • p99 spikes align with cache miss rate increases.
  • Check eviction rates, TTL changes, cache key cardinality.
  • Look for DB load increases as a downstream effect.

5) Runtime issues (GC, memory churn, lock contention)

  • GC pause time or allocation rate spikes; CPU may look “fine” but latency jumps.
  • Look for stop-the-world pauses, safepoints, thread contention.
  • Profiling (if available) helps confirm hot paths and allocation sources.
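
A useful first check for this class is how much wall-clock time the runtime spends paused, bucketed over time, because a low average
can hide short bursts that line up exactly with the p99 spikes. The sketch below assumes you have already extracted (timestamp,
pause duration) pairs from GC logs or runtime metrics; the values are illustrative:

    # Fraction of each minute spent in GC pauses; bursts here often line up with p99 spikes.
    from collections import defaultdict

    # (unix_timestamp_seconds, pause_ms) extracted from GC logs / runtime metrics (illustrative values)
    pauses = [(1700000005, 12), (1700000020, 18), (1700000065, 250),
              (1700000070, 310), (1700000072, 280), (1700000130, 15)]

    per_minute = defaultdict(float)
    for ts, pause_ms in pauses:
        per_minute[ts // 60] += pause_ms

    for minute, total_ms in sorted(per_minute.items()):
        share = total_ms / 60000  # fraction of the minute spent paused
        flag = "  <-- investigate" if share > 0.01 else ""
        print(f"minute {minute}: {total_ms:.0f} ms paused ({share:.1%}){flag}")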

The one-page RCA report template

Keep your output short. A one-page format forces clarity. Here’s a simple template you can reuse:

  • Symptom: what is slow, how bad, who is affected.
  • Scope: endpoint/service/region/version/time window.
  • Evidence: key metrics + one representative trace screenshot/summary + relevant log excerpt.
  • Causal chain: trigger → mechanism → amplifier → impact.
  • Actions: mitigate now / fix next / prevent repeat.
  • Confidence: high / medium / low (and what data would increase it).

Observability gaps checklist (what to improve if RCA is hard)

If RCA takes too long, it’s usually because context is missing. These are the most common gaps:

  • Missing trace context propagation across services (broken traces).
  • Logs don’t include trace_id or consistent request identifiers.
  • Key dimensions missing: version, region, endpoint, tenant.
  • No visibility into pools/queues (DB pool wait time, thread pool queue length).
  • No histogram-based latency metrics (percentiles are unreliable or mis-aggregated).
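
Closing that last gap is mostly an instrumentation decision: record latency as a histogram with the dimensions you keep needing
during RCA (version, region, endpoint) rather than as pre-aggregated averages. A minimal sketch using the OpenTelemetry Python
metrics API; the metric and attribute names follow common semantic conventions but are suggestions, and an SDK plus exporter still
have to be configured for the data to go anywhere:

    from opentelemetry import metrics

    # Requires the opentelemetry-api package; an SDK + exporter must be configured
    # elsewhere for the measurements to be exported.
    meter = metrics.get_meter("checkout-service")

    request_duration = meter.create_histogram(
        name="http.server.request.duration",
        unit="s",
        description="End-to-end request duration",
    )

    def record_request(duration_seconds: float, route: str, version: str, region: str) -> None:
        # Keep the dimensions you slice by during RCA; avoid unbounded-cardinality values.
        request_duration.record(
            duration_seconds,
            attributes={"http.route": route, "service.version": version, "deployment.region": region},
        )

    record_request(0.42, route="/checkout", version="1.42.0", region="eu-west-1")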

Conclusion

A solid RCA workflow is mostly about order: scope with metrics, localize with traces, explain with logs, then prove the cause
with correlation. When you consistently produce a causal chain and clear next actions, you move from “it’s slow” to
“it’s slow because of this—and here’s what to do.”

Next: continue with Kubernetes Observability (Signals that matter) or go back to
Traces, Metrics, Logs.
