Start Here

If you’re new to observability (or you’ve used the tools but still feel slow during incidents),
this is the fastest path to building real root cause analysis (RCA) skills.
The idea is simple: learn the fundamentals first, then apply them using repeatable checklists and labs.

Step 1 — Understand the signals

Before tools, learn what each signal is good for. This prevents “dashboard wandering” and makes your investigation
structured from the start.

Traces, Metrics, Logs — what they are and when to use each
p95 vs p99 — tail latency, percentiles, and why averages lie
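
To make "averages lie" concrete, here is a minimal sketch in Python. The latency values are invented for illustration, and the `percentile` helper is a simple nearest-rank implementation rather than what any particular metrics backend uses (many backends interpolate or estimate from histogram buckets).

```python
import statistics

def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample such that at least p% of all
    samples are less than or equal to it.
    """
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

# Hypothetical request latencies in milliseconds:
# 94 fast requests, 4 slow ones, and 2 very slow ones.
latencies_ms = [20] * 94 + [400] * 4 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
```

In this made-up data set the mean lands at a value that no request actually experienced, while p50, p95, and p99 each describe a different group of users. That gap is the reason to look at percentiles before averages.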

Step 2 — Learn the RCA workflow

This is the core skill: going from “it’s slow” to a causal chain with evidence.
Use this checklist in your real day-to-day work.

Root Cause Analysis Checklist — a tool-agnostic, repeatable workflow

Step 3 — Add Kubernetes signals (if you run on K8s)

Kubernetes adds a whole new set of failure modes: throttling, restarts, rescheduling, autoscaling side effects.
These signals often explain why tail latency spikes.

Kubernetes signals for RCA — what to check and what it means
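
As one example of such a signal, the sketch below computes how often a container was CPU-throttled from the contents of a cgroup `cpu.stat` file. It assumes the Linux `nr_periods` and `nr_throttled` counters; the sample text is invented, and where the file lives depends on the cgroup version and container runtime, so the code takes the file contents as a string instead of guessing a path.

```python
def throttle_ratio(cpu_stat_text):
    """Fraction of CPU enforcement periods in which the cgroup was throttled.

    Expects the text of a cgroup cpu.stat file: one "name value" pair per line.
    Returns 0.0 when no periods have been recorded (for example, no CPU limit).
    """
    fields = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            fields[parts[0]] = int(parts[1])

    periods = fields.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return fields.get("nr_throttled", 0) / periods

# Hypothetical cpu.stat contents, not taken from a real system.
sample_cpu_stat = """nr_periods 1200
nr_throttled 300
throttled_usec 45000000
"""

ratio = throttle_ratio(sample_cpu_stat)
print(f"throttled in {ratio:.0%} of periods")
```

These counters are cumulative, so in practice you would compare two readings taken around the latency spike rather than a single snapshot. If you scrape cAdvisor metrics, the same counters are typically exposed as `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`.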

Step 4 — Understand the bigger picture

Once you can use signals and run a clean RCA, zoom out and understand where observability fits compared to classic APM.

APM vs Observability — what’s different, and why it matters

Step 5 — Practice with labs

The fastest way to build intuition is to debug on purpose. Labs are reproducible scenarios where you can practice
moving from metrics to traces to logs, then write a short RCA summary.

Labs — hands-on exercises (coming soon)
OpenTelemetry Demo Lab — the first lab (coming soon)

What to do if you’re in a real incident right now

Start with the RCA checklist and run it in order. It’s designed for speed.

Root Cause Analysis Checklist

Tip: bookmark this page. Over time, it will evolve into a full learning path with more playbooks and labs.