If you’re new to observability (or you’ve used tools but still feel slow during incidents),
this is the fastest path to building real root cause analysis (RCA) skills.
The idea is simple: learn the fundamentals first, then apply them using repeatable checklists and labs.
Step 1 — Understand the signals
Before tools, learn what each signal is good for. This prevents “dashboard wandering” and makes your investigation
structured from the start.
- Traces, Metrics, Logs — what they are and when to use each
- p95 vs p99 — tail latency, percentiles, and why averages lie
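To make "averages lie" concrete, here is a minimal sketch in Python. The latency sample is made up for illustration (980 fast requests and 20 slow ones), and `percentile` is a small nearest-rank helper written for this example, not a function from any particular monitoring tool.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 1000 requests, 980 take 20 ms and 20 take 2000 ms.
latencies_ms = [20] * 980 + [2000] * 20

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}

for name, value in summary.items():
    print(f"{name}: {value} ms")
```

With this sample the mean is 59.6 ms, a latency that no request actually experienced. The p50 and p95 are both 20 ms, so they hide the slow group entirely, while the p99 is 2000 ms. Only 2% of requests are slow here, which is why the problem shows up at p99 and not at p95.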
Step 2 — Learn the RCA workflow
This is the core skill: going from “it’s slow” to a causal chain with evidence.
Use this checklist in your real day-to-day work.
- Root Cause Analysis Checklist — a tool-agnostic, repeatable workflow
Step 3 — Add Kubernetes signals (if you run on K8s)
Kubernetes adds a whole new set of failure modes: CPU throttling, container restarts, pod rescheduling, and autoscaling side effects.
These signals often explain why tail latency spikes.
- Kubernetes signals for RCA — what to check and what it means
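As one example of such a signal, CPU throttling can be estimated from the kernel's CFS counters. The sketch below assumes a container on cgroup v2, where the `cpu.stat` file includes the cumulative counters `nr_periods` and `nr_throttled`. The two snapshots are hypothetical text samples rather than readings from a real pod, and the helper names are made up for this example.

```python
def parse_cpu_stat(text):
    """Parse 'key value' lines from a cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(before, after):
    """Fraction of CFS periods between two snapshots in which the
    container was throttled. The counters are cumulative, so diff them."""
    periods = after.get("nr_periods", 0) - before.get("nr_periods", 0)
    if periods <= 0:
        return 0.0
    throttled = after.get("nr_throttled", 0) - before.get("nr_throttled", 0)
    return throttled / periods


# Hypothetical snapshots taken some time apart.
snapshot_before = parse_cpu_stat("nr_periods 1000\nnr_throttled 50\n")
snapshot_after = parse_cpu_stat("nr_periods 1600\nnr_throttled 350\n")

ratio = throttle_ratio(snapshot_before, snapshot_after)
print(f"throttled in {ratio:.0%} of periods")
```

In this made-up case the container was throttled in half of the periods between the two readings. A ratio that climbs at the same time as p99 latency is a hint worth following up, although it does not prove on its own that throttling caused the spike.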
Step 4 — Understand the bigger picture
Once you can use signals and run a clean RCA, zoom out and understand where observability fits compared to classic APM.
- APM vs Observability — what’s different, and why it matters
Step 5 — Practice with labs
The fastest way to build intuition is to practice debugging on purpose. Labs are reproducible scenarios where you can practice
moving from metrics to traces to logs, then write a short RCA summary.
- Labs — hands-on exercises (coming soon)
- OpenTelemetry Demo Lab — the first lab (coming soon)
What to do if you’re in a real incident right now
Start with the RCA checklist and run it in order. It’s designed for speed.
Tip: bookmark this page—over time, it will evolve into a full learning path with more playbooks and labs.