Average latency can look fine while users complain. That’s because real pain often lives in the tail:
the slowest 5% or 1% of requests. This guide explains p95 and p99 from the ground up—then shows
how tail latency behaves under load, why it spikes, and how to investigate it using metrics and traces.
Tail latency in plain English
When people say “latency,” they often mean an average. But users don’t experience an average—they experience their own request.
If 99% of requests are fast and 1% are very slow, the average can still look “fine” while real users hit timeouts, spinners,
or failed checkouts.
Tail latency describes those slow outliers. It matters because the tail is where incidents start, where timeouts happen,
and where distributed systems amplify small slowdowns into big failures.
What p95 and p99 actually mean
Percentiles answer the question: “How fast are requests for most users?”
- p95 means 95% of requests are faster than this value; 5% are slower.
- p99 means 99% of requests are faster than this value; 1% are slower.
A helpful shorthand:
- p50 (median) = typical request
- p95 = experience for almost everyone
- p99 = the slow tail where user pain and timeouts often appear
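To make the definitions concrete, here is a minimal sketch that computes these percentiles from raw samples using the nearest-rank definition. The numbers are hypothetical, and real monitoring systems often use interpolation or histogram buckets instead, so exact values will differ:

```python
import random

# Hypothetical latency samples (ms): a fast path plus a rare slow path.
random.seed(7)
samples = [random.gauss(100, 10) for _ in range(9_900)] + \
          [random.gauss(900, 100) for _ in range(100)]

def percentile(values, p):
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(samples, p):.0f} ms")
```

With 1% of requests on the slow path, p50 and p95 land near the fast cluster while p99 lands inside the slow one, which is exactly the "average looks fine" trap.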
Why p99 can explode while p50 stays calm
In many systems, most requests follow a fast path. Only a small fraction hits edge conditions:
a cache miss, a slow database shard, a noisy neighbor, a lock wait, a retry, a cold start.
These rare paths barely affect p50—but they dominate p99.
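This effect is easy to reproduce. The sketch below (hypothetical numbers, Python's `statistics.quantiles`) compares an all-fast workload with one where a small fraction of requests hits a slow path such as a cache miss:

```python
import random
import statistics

random.seed(42)

def p(data, q):
    # statistics.quantiles returns 99 cut points for n=100; index q-1 is ~pq.
    return statistics.quantiles(data, n=100)[q - 1]

# All-fast workload: every request takes roughly 100 ms.
fast_only = [random.gauss(100, 10) for _ in range(100_000)]

# Same workload, but 2% of requests hit a hypothetical slow path (~1 s).
with_tail = [random.gauss(1_000, 100) if random.random() < 0.02
             else random.gauss(100, 10) for _ in range(100_000)]

for name, data in [("all fast", fast_only), ("2% slow", with_tail)]:
    print(f"{name}: p50 = {p(data, 50):.0f} ms, p99 = {p(data, 99):.0f} ms")
```

The medians of the two workloads are nearly identical; only the p99 reveals that anything changed.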
Common reasons tails get heavier
- Queuing: requests wait for a worker thread, connection, or queue slot.
- Saturation: CPU throttling, IO contention, memory pressure.
- Downstream variability: one dependency has occasional slow responses.
- Retries/timeouts: one slow call becomes multiple calls (amplification).
- Cache misses: fast hits vs slow lookups create a long tail.
- GC/runtime pauses: managed runtimes (like Java) can pause or churn under pressure.
- Fan-out: one request triggers many calls; the slowest often defines end-to-end time.
The key mental model: queueing makes tails worse
Tail latency is often a queueing story. As a resource approaches capacity, small fluctuations create waiting time.
Even if “work time” stays similar, the waiting time grows non-linearly near saturation.
Practical takeaway: if p99 spikes mainly under load, suspect queues, pools, or saturation.
Tail latency is a sensitive indicator that you’re close to the edge.
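One way to see the non-linearity is the textbook M/M/1 queue, where mean time in the system is T = 1/(μ − λ) for service rate μ and arrival rate λ. This is an idealized model, not a measurement of any real system, but it shows how waiting explodes near saturation:

```python
# M/M/1 queue sketch: mean time in system T = 1 / (mu - lam),
# where mu is the service rate and lam the arrival rate (req/s).
# Idealized textbook model; real systems are messier, but the shape holds.
mu = 100.0  # server can handle 100 req/s

for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
    lam = utilization * mu
    t_ms = 1000.0 / (mu - lam)
    print(f"utilization {utilization:.0%}: mean time in system ~ {t_ms:.0f} ms")
```

Doubling utilization from 50% to 99% does not double latency; it multiplies it fifty-fold, which is why p99 is such a sensitive saturation signal.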
p95 vs p99: when to use which
p95 is useful when
- You want a stable indicator of most users' experience.
- You’re tracking gradual regressions and want less noise.
- Traffic is moderate and p99 looks jumpy.
p99 is useful when
- You’re investigating incidents where a small fraction of requests breaks.
- You care about timeouts, retries, and worst-case experience.
- You have high traffic or strict tail-performance targets.
A practical default is to monitor both: p95 for stability, p99 for early warning and root cause analysis (RCA).
Percentiles can lie (if you don’t know how they’re computed)
Low sample size
If you only have 100 requests in a window, p99 is basically the single slowest request.
That makes it extremely sensitive to outliers. With higher traffic, p99 becomes more stable and more meaningful.
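A quick way to convince yourself is to measure how much p99 jumps around between same-sized windows. The sketch below uses synthetic exponential latencies (mean 100 ms, a hypothetical workload), so the numbers are illustrative only:

```python
import random
import statistics

random.seed(1)

def p99(data):
    return statistics.quantiles(data, n=100)[98]

def spread_of_p99(window_size, windows=200):
    """Compute p99 over many same-sized windows; report how much it varies."""
    values = []
    for _ in range(windows):
        window = [random.expovariate(1 / 100) for _ in range(window_size)]
        values.append(p99(window))
    return max(values) - min(values)

for n in (100, 10_000):
    print(f"windows of {n} requests: p99 spread ~ {spread_of_p99(n):.0f} ms")
```

The underlying distribution never changes, yet p99 over 100-request windows swings wildly while p99 over 10,000-request windows barely moves.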
Bad aggregation across instances
Percentiles don’t aggregate nicely. “p99 of p99s” is not the same as “global p99.”
A safer approach is histogram-based latency metrics, where percentiles are computed from buckets of observations.
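To illustrate why histograms merge safely while precomputed percentiles do not, here is a simplified sketch with hypothetical shared bucket bounds (systems like Prometheus use cumulative buckets, but the merging principle is the same):

```python
import bisect

# Shared bucket upper bounds in ms (a simplified, hypothetical layout).
bounds = [50, 100, 200, 400, 800, 1600]

def to_histogram(samples):
    counts = [0] * (len(bounds) + 1)
    for s in samples:
        counts[bisect.bisect_left(bounds, s)] += 1
    return counts

def merge(h1, h2):
    # Histograms over identical buckets merge by simple addition --
    # this is why histogram metrics aggregate across instances safely.
    return [a + b for a, b in zip(h1, h2)]

def p99_upper_bound(counts):
    """Smallest bucket bound covering at least 99% of observations."""
    total = sum(counts)
    running = 0
    for bound, c in zip(bounds + [float("inf")], counts):
        running += c
        if running >= 0.99 * total:
            return bound

# Instance A is healthy; instance B has a slow tail.
h_a = to_histogram([80] * 990 + [150] * 10)
h_b = to_histogram([90] * 900 + [1000] * 100)

print("global p99 <=", p99_upper_bound(merge(h_a, h_b)))
avg_of_p99s = (p99_upper_bound(h_a) + p99_upper_bound(h_b)) / 2
print("average of per-instance p99s:", avg_of_p99s)  # misleading number
```

Merging the buckets recovers the true global tail, while averaging per-instance p99s produces a number that matches neither instance nor the fleet.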
Why microservices amplify tail latency
Tail latency grows in distributed systems because end-to-end requests depend on many downstream calls.
Even if each dependency is “usually fast,” the chance that at least one is slow increases with fan-out.
End-to-end p99 becomes dominated by the slowest dependency.
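The arithmetic behind this is simple. If each of N parallel calls is independently slow with probability p, the request is slow whenever at least one call is, so P(slow) = 1 − (1 − p)^N. A short sketch with an assumed 1% per-dependency slow rate:

```python
# If each of N parallel calls is independently "slow" with probability p,
# the request is slow whenever at least one call is slow:
#   P(slow request) = 1 - (1 - p)^N
p = 0.01  # assume each dependency is slow 1% of the time

for n in (1, 10, 50, 100):
    print(f"fan-out {n:>3}: P(at least one slow call) = {1 - (1 - p) ** n:.1%}")
```

With a fan-out of 100, a per-dependency slow rate of just 1% makes roughly 63% of end-to-end requests slow: a p99 problem in each dependency becomes a p50 problem for the whole request.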
Retries make tails heavier
Retries are often invisible at p50 but show up at p99. A single slow call triggers a retry, which adds latency and load,
which increases slow calls—a feedback loop.
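The sketch below simulates a single client-side retry after a timeout, with hypothetical latencies and a 2% slow-path rate, to show how retries stretch the tail and add backend load while leaving the median untouched:

```python
import random
import statistics

random.seed(3)

TIMEOUT_MS = 300

def call_once():
    # 2% of calls hit a hypothetical slow path and exceed the timeout.
    return random.gauss(1_000, 100) if random.random() < 0.02 \
        else random.gauss(100, 10)

def call_with_retry():
    """One retry after a timeout: returns (observed latency, calls made)."""
    first = call_once()
    if first <= TIMEOUT_MS:
        return first, 1
    # The slow attempt is abandoned at the timeout and retried once.
    return TIMEOUT_MS + call_once(), 2

results = [call_with_retry() for _ in range(100_000)]
latencies = [r[0] for r in results]
calls = sum(r[1] for r in results)

print("p50:", round(statistics.quantiles(latencies, n=100)[49]), "ms")
print("p99:", round(statistics.quantiles(latencies, n=100)[98]), "ms")
print("load amplification:", calls / len(results))  # extra backend calls
```

The median stays near the fast path, but every timed-out request now pays the timeout plus a full second attempt, and the backend serves about 2% more calls, precisely when it is already struggling.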
How to investigate a p99 spike (tool-agnostic)
Here’s a repeatable workflow you can use in any observability stack.
- Scope it: is it limited to an endpoint, region, version, pod, tenant, or user segment?
- Check saturation: CPU throttling, memory pressure, IO wait, queue depth, pool utilization.
- Open slow traces: pick traces from the p99 window and find the critical path + dominant spans.
- Classify the pattern: downstream-dominant, pool/queue waiting, retries/timeouts, GC pause, fan-out.
- Prove it with correlation: does p99 move with a metric or event (deploy, scaling, config)?
- Turn it into actions: mitigation now, fix next, observability gap to prevent repeat pain.
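The scoping step can be sketched with plain Python: group request latencies by candidate dimensions and compute per-group p99 to find where the tail actually lives. The records and field names here are hypothetical stand-ins for whatever your metrics or trace export provides:

```python
import statistics
from collections import defaultdict

# Hypothetical request records, as exported from metrics or traces.
requests = (
    [{"endpoint": "/checkout", "region": "eu", "ms": 1_800}] * 30
    + [{"endpoint": "/checkout", "region": "eu", "ms": 120}] * 970
    + [{"endpoint": "/checkout", "region": "us", "ms": 130}] * 1_000
    + [{"endpoint": "/search", "region": "eu", "ms": 90}] * 1_000
)

by_group = defaultdict(list)
for r in requests:
    by_group[(r["endpoint"], r["region"])].append(r["ms"])

# Step 1 of the workflow: find which slice owns the tail.
for group, ms in sorted(by_group.items()):
    p99 = statistics.quantiles(ms, n=100)[98]
    print(f"{group}: p99 ~ {p99:.0f} ms over {len(ms)} requests")
```

Here the spike is confined to one endpoint in one region, which immediately narrows where to pull traces from.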
What to look for in traces when p99 is high
- Dominant span: which span accounts for most of the end-to-end time?
- Waiting vs working: time spent waiting on pools/queues vs executing CPU work.
- Retry patterns: repeated spans, timeouts, fallback paths.
- Downstream variability: one dependency occasionally much slower than the rest.
- Cold paths: cache miss, cold containers, JIT warmup, cold DB buffers.
An example: “p99 got worse after a deploy”
You deploy version 1.42.0. p50 stays around 120ms. p99 jumps from 400ms to 2s.
Metrics scope
The spike is concentrated in region=EU and only affects /checkout.
CPU looks okay, but queue time increases in a database connection pool.
Traces localize
Slow traces show the critical path dominated by FraudService → DB. Several traces include retries around 800ms timeouts.
Root cause chain
A code change introduced a new dependency, increasing DB traffic in EU.
The pool saturates during peak load, causing queueing and timeouts.
Retries amplify the tail—p99 explodes while p50 stays stable.
Turning tail latency into an SLO
Percentiles become most useful when tied to reliability targets:
SLOs (Service Level Objectives). For example:
- SLI: “% of requests under 300ms” (a measurable indicator)
- SLO: “99.9% under 300ms over 30 days” (a target)
- Error budget: the allowed unreliability while still meeting the SLO
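The error-budget arithmetic is straightforward. A sketch with illustrative traffic numbers (the request counts are invented for the example):

```python
# Error-budget arithmetic for a latency SLO (illustrative numbers).
slo_target = 0.999           # 99.9% of requests under 300 ms, over 30 days
total_requests = 50_000_000  # hypothetical monthly traffic
slow_requests = 32_000       # requests that exceeded 300 ms

sli = 1 - slow_requests / total_requests    # measured indicator
budget = (1 - slo_target) * total_requests  # allowed slow requests
burn = slow_requests / budget               # fraction of budget used

print(f"SLI: {sli:.4%}")                         # 99.9360%
print(f"error budget: {budget:,.0f} slow requests")
print(f"budget consumed: {burn:.0%}")            # 64%
```

Framing tail latency this way turns "p99 looks bad" into a concrete question: how much of this month's budget has the tail already burned, and at the current rate, when does it run out?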
Common misconceptions
“p99 is always better than p95.”
Not always. p99 can be noisy at low traffic. p95 is often better for operational stability,
while p99 is excellent for investigating tail problems and timeouts.
“If p50 is good, users are fine.”
In distributed systems, a small tail can still trigger timeouts and cascading failures. Averages can hide pain.
“Tail latency is random.”
Tail latency usually has structure: queueing, saturation, retries, or specific downstream variability.
Your goal is to find the pattern and prove it with telemetry.
Conclusion
p95 and p99 help you see the user experience that averages hide. p95 is a stable “most users” indicator;
p99 captures the slow tail where timeouts and retries live. In modern systems, tail latency is often driven by queueing,
saturation, downstream variability, retries, and fan-out—exactly the things observability helps you prove.
Next: read Traces, Metrics, Logs to learn which signals to use when,
or continue with Root Cause Analysis Checklist.