Traces, Metrics, Logs: What They Are—and When to Use Each

The “three pillars” of observability (metrics, traces, and logs) sound simple.
In practice, they are often confused because each signal answers a different kind of question.
Reach for the wrong one and you spend your time sifting through noise instead of closing in on the cause.

This guide explains traces, metrics, and logs from the ground up (beginner-friendly), and shows how to combine them
into a fast, repeatable workflow for root-cause analysis—especially in microservices, Kubernetes, and cloud-native systems.

The quick mental model

  • Metrics tell you that something changed. (Detection, alerting, trends.)
  • Traces show you where time is spent across services. (Localization, critical path.)
  • Logs explain what happened in detail. (Errors, events, business context.)

A practical default workflow is: Metrics → Traces → Logs. Metrics detect, traces localize, logs explain.

Telemetry and observability (plain English)

Telemetry is data your systems emit so you can understand what they’re doing.
Observability is the ability to answer meaningful questions about system behavior from that telemetry—especially
when failures are complex, distributed, or unfamiliar.

The three pillars are the most common telemetry types. Some people also add profiling as a “fourth pillar,”
which we’ll cover later in this article.

Metrics

Metrics are numeric measurements over time. They’re designed to be aggregated and graphed.
Think of them as a system’s “vitals”: request rate, latency, error rate, CPU usage, memory usage, queue depth, and so on.

What metrics are great for

  • Dashboards: quick overview of health, capacity, and trends.
  • Alerting: “something is off” (thresholds or SLO-based alerts).
  • Regression detection: compare baseline vs current after deploys.
  • Capacity planning: understand growth and saturation signals.

Common metric families

  • Traffic: requests per second (RPS), throughput, message rate.
  • Errors: error rate (5xx), timeouts, exception counts.
  • Latency: p50/p95/p99, histograms.
  • Saturation: CPU, CPU throttling, memory pressure, IO wait, queue depth.

Percentiles (p95/p99) in one minute

Percentiles describe tail latency—the slower portion of requests.
p99 means that 99% of requests complete at or below this value, and the slowest 1% take longer.
Tail latency matters because the tail is where timeouts happen and where user pain usually starts.
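
To make the definition concrete, here is a minimal sketch that computes percentiles with the nearest-rank method. The latency values are made up for illustration, and nearest-rank is only one of several percentile definitions; monitoring backends usually estimate percentiles from histograms rather than from raw samples.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [100] * 98 + [2500] * 2

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)

print(f"p50={p50} p95={p95} p99={p99} mean={mean}")
```

In this sample, p50 and p95 stay at 100 ms and the mean moves only to 148 ms, while p99 jumps to 2500 ms. That is why a tail percentile exposes slow requests that an average hides.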

What metrics cannot do well

Metrics are excellent for “what changed” and “when,” but they often struggle to explain “why.”
A p99 chart can show a spike, but it won’t tell you which dependency caused it or which code path triggered retries.
That’s where traces and logs come in.

Logs

Logs are records of events: a request was validated, a retry happened, a job finished, an exception occurred.
Logs are high-detail and extremely useful—especially for understanding failures and edge cases.

Structured logs

Plain text logs are readable but hard to query reliably. Structured logs (often JSON) include fields you can
search, filter, and aggregate: service, env, version, region, tenant_id, order_id, trace_id.
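
As a rough sketch of what this can look like, the example below uses Python's standard logging module with a small custom formatter that emits one JSON object per event. The JsonFormatter class, the logger name, and all field values are hypothetical; many teams use an existing JSON logging library instead of writing their own formatter.

```python
import io
import json
import logging

# Field names come from the list above; which ones a service needs is an assumption.
CONTEXT_FIELDS = ("service", "env", "version", "region", "tenant_id", "order_id", "trace_id")

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        event = {"level": record.levelname, "message": record.getMessage()}
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)  # set via the `extra` argument
            if value is not None:
                event[field] = value
        return json.dumps(event, sort_keys=True)

# Write to an in-memory stream so the result is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("structured-log-example")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

# Hypothetical event: every value here is made up for illustration.
logger.error(
    "payment declined",
    extra={
        "service": "checkout",
        "env": "prod",
        "region": "EU",
        "order_id": "o-123",
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    },
)

event = json.loads(stream.getvalue())
print(event)
```

Because every field is a key rather than part of a sentence, a query such as "all ERROR events for order o-123" becomes a simple filter instead of a fragile text search.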

What logs are great for

  • Error detail: stack traces, exception messages, error codes.
  • Business context: order IDs, payment method, tenant, user segment.
  • Audit trails: “what happened, in what order?”
  • Rare edge cases: issues too uncommon to show clearly in metrics.

Log hygiene (so logs stay useful)

  • Prefer structured logs over long free-text messages.
  • Avoid logging huge payloads by default.
  • Use INFO for meaningful events and ERROR for failures, and keep DEBUG logging short-lived.
  • Include identifiers that help correlation: trace_id, request_id, and domain IDs (e.g. order_id).

Traces

Distributed tracing shows the path of a single request across services and dependencies.
A trace is composed of spans. A span represents one operation—like an HTTP call or a database query—
with timing data and metadata.

What traces are great for

  • Critical path: which operations dominate end-to-end latency.
  • Dependency visibility: which service called which dependency.
  • Localization: where time is spent (DB, downstream API, queue wait, retries).
  • Correlation: jump from a slow trace to related logs using trace_id.

Common trace patterns you’ll see

  • Downstream-dominant: one dependency span dominates (DB/external API is slow).
  • Fan-out: one request triggers many downstream calls; the slowest often defines the tail.
  • Wait vs work: time is spent waiting for a pool/queue vs doing CPU work.
  • Retry amplification: repeated spans indicate retries/timeouts escalating latency.
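
The downstream-dominant and retry patterns above can be sketched in a few lines. The Span type, the span names, and the timings below are hypothetical, and the self-time calculation assumes sibling spans do not overlap (so it would need adjusting for parallel fan-out). Tracing backends do this analysis for you; the sketch only shows the idea.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

def self_time_by_operation(spans):
    """Time spent in each operation itself, excluding time covered by its direct children."""
    child_time = Counter()
    for span in spans:
        if span.parent_id is not None:
            child_time[span.parent_id] += span.duration_ms
    totals = Counter()
    for span in spans:
        totals[span.name] += span.duration_ms - child_time[span.span_id]
    return totals

def repeated_calls(spans):
    """Operations called more than once by the same parent: a hint (not proof) of retries."""
    counts = Counter((span.parent_id, span.name) for span in spans)
    return {name: n for (_, name), n in counts.items() if n > 1}

# Hypothetical slow trace: one request, three attempts against the same dependency.
trace = [
    Span("a", None, "POST /checkout", 0, 2500),
    Span("b", "a", "PaymentService.charge", 50, 2450),
    Span("c", "b", "FraudService.score", 100, 800),
    Span("d", "b", "FraudService.score", 820, 1520),
    Span("e", "b", "FraudService.score", 1540, 2400),
    Span("f", "a", "db.save_order", 2455, 2490),
]

dominant, dominant_ms = self_time_by_operation(trace).most_common(1)[0]
retries = repeated_calls(trace)
print(dominant, dominant_ms, retries)
```

Here the repeated FraudService.score spans account for 2260 ms of the 2500 ms request, which is the downstream-dominant and retry-amplification patterns showing up together.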

Context propagation (why traces sometimes “break”)

Traces only work if services pass trace context along. This is called context propagation:
carrying identifiers across boundaries (usually via HTTP headers or message metadata) so spans connect into one trace.

A practical rule: ensure trace IDs appear in logs. That lets you go from “this trace is slow” to
“show me the exact log lines for this request.”
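
One widely used header format for this is the W3C Trace Context traceparent header. The article does not say which format your stack uses, so treat that choice as an assumption. The sketch below parses an incoming header and builds the header for an outgoing call, keeping the trace ID and generating a new span ID. The function names are hypothetical, and in practice an instrumentation library handles this for you.

```python
import re
import secrets

# traceparent layout: version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")

def parse_traceparent(header):
    """Return the header's fields as a dict, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {"version": version, "trace_id": trace_id, "parent_id": parent_id, "flags": flags}

def outgoing_headers(incoming_headers):
    """Build headers for a downstream call so its spans join the same trace.

    Simplification: header names are looked up case-sensitively in a plain dict.
    """
    context = parse_traceparent(incoming_headers.get("traceparent", ""))
    if context is None:
        # No usable context: start a new trace instead of failing the request.
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, flags = context["trace_id"], context["flags"]
    span_id = secrets.token_hex(8)  # ID of the span making the outgoing call
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}

incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
outgoing = outgoing_headers(incoming)
print(outgoing["traceparent"])
```

If any service in the chain drops or fails to forward this header, the downstream spans start a new trace, which is the usual reason a trace appears to "break" in the middle.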

The fourth pillar: profiling (optional, but powerful)

Profiling collects low-level performance data: where CPU time is spent, where memory is allocated,
and which code paths are hot. Traces can tell you which service or span is slow. Profiling helps explain which code path inside it is slow.

How they work together (the RCA loop)

A fast, repeatable workflow for performance incidents looks like this:

  1. Metrics: detect and scope (which service/endpoint/version/region?).
  2. Traces: localize (critical path, dominant spans, retries).
  3. Logs: explain (errors, exceptions, business context, exact failure mode).

The most common mistake is starting with logs for everything. Logs are high-detail but low-signal if you haven’t narrowed
the scope first.

An incident example: checkout p99 spikes

Checkout latency suddenly gets worse. p99 jumps from 400ms to 2.5s.

Metrics: scope the blast radius

You see that the spike starts after a deployment (version=1.42.0) and is concentrated in region=EU.
It affects /checkout disproportionately. Error rate is slightly up.

Traces: localize the critical path

Slow traces show the critical path dominated by a new call introduced in 1.42.0:
PaymentService → FraudService. The FraudService span is slow and often contains retries.

Logs: explain the failure mode

Filtering logs by trace_id reveals FraudService timeouts on a database call—only in EU.
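
In a log backend this step is a single query on the trace_id field. As an illustration only, the sketch below applies the same filter to a handful of made-up JSON log lines; the messages, trace IDs, and the logs_for_trace helper are all hypothetical.

```python
import json

# Hypothetical log lines; note the one plain-text line that cannot be correlated.
raw_logs = [
    '{"trace_id": "t-100", "service": "PaymentService", "level": "INFO", "message": "calling FraudService"}',
    '{"trace_id": "t-200", "service": "CartService", "level": "INFO", "message": "cart loaded"}',
    'plain text line without any structure',
    '{"trace_id": "t-100", "service": "FraudService", "level": "ERROR", "region": "EU", "message": "database call timed out"}',
]

def logs_for_trace(lines, trace_id):
    """Return the parsed log events that belong to one trace, in their original order."""
    events = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured lines carry no trace_id to match on
        if isinstance(event, dict) and event.get("trace_id") == trace_id:
            events.append(event)
    return events

slow_trace_logs = logs_for_trace(raw_logs, "t-100")
errors = [event for event in slow_trace_logs if event["level"] == "ERROR"]
print(errors)
```

The filter only works because every service wrote the same trace_id into its structured logs. The plain-text line is silently left out, which is the practical cost of unstructured logging.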

Back to metrics: prove the cause class

Metrics show the EU FraudService database connection pool saturating (active connections max out; queue time rises).
Now you have a causal chain: a code change introduced a dependency, which exposed a regional capacity bottleneck,
and retries amplified tail latency.

Instrumentation: auto vs manual

Auto-instrumentation (agents/libraries) quickly captures common spans and metrics (HTTP, DB, outbound calls).
It provides fast time-to-value with minimal code changes.

Manual instrumentation adds domain context: custom spans, business metrics, structured log fields
(tenant, feature flags, cart size, payment method). This is often where the real “why” comes from.

A strong approach is to start with auto-instrumentation and then add targeted manual instrumentation where it matters most.

Cardinality, sampling, and cost (so observability stays sustainable)

More telemetry is not always better. Observability becomes expensive and noisy if unmanaged.

Cardinality (simple definition)

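Cardinality is the number of unique combinations of dimension (label) values your telemetry produces. In most metrics backends, each combination is stored as its own time series.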
High-cardinality fields like user_id can explode cost—especially in metrics.
A practical rule: avoid user-level labels in metrics; use traces/logs for that level of detail.
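
A quick back-of-the-envelope calculation shows why. The label names and counts below are invented for illustration, and the product is an upper bound, since not every combination occurs in practice.

```python
import math

def max_series(label_cardinalities):
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return math.prod(label_cardinalities.values())

# Hypothetical number of distinct values per label on a request-latency metric.
base_labels = {"service": 20, "endpoint": 15, "region": 4, "status_class": 5}
with_user_id = {**base_labels, "user_id": 100_000}

print(f"{max_series(base_labels):,}")   # without user_id
print(f"{max_series(with_user_id):,}")  # with user_id
```

With these numbers, the metric grows from at most 6,000 series to at most 600,000,000 once user_id is added as a label.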

Sampling (especially for traces)

Sampling means storing only a subset of traces. Two common approaches:

  • Head-based sampling: decide at the start whether to keep a trace (simple, can miss rare failures).
  • Tail-based sampling: decide after seeing the outcome (keep errors/slow traces; better for RCA, more processing).
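
The difference between the two approaches can be sketched as follows. The thresholds, rates, and trace records are assumptions chosen for illustration. A real tail-based sampler also has to buffer spans until the trace is complete, which this sketch leaves out.

```python
import hashlib

def head_sample(trace_id, rate):
    """Head-based: decide from the trace ID alone, before anything is known about the outcome.

    Hashing the ID makes the decision deterministic, so every service agrees on it.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < rate * 2**64

def tail_sample(trace, slow_ms=1000, baseline_rate=0.01):
    """Tail-based: decide after the trace finished; keep errors and slow traces, sample the rest."""
    if trace["error"] or trace["duration_ms"] >= slow_ms:
        return True
    return head_sample(trace["trace_id"], baseline_rate)

# Hypothetical traffic: 1000 healthy traces, then one made slow and one made to fail.
traces = [{"trace_id": f"trace-{i}", "duration_ms": 120, "error": False} for i in range(1000)]
traces[10]["duration_ms"] = 2500
traces[20]["error"] = True

head_kept = [t for t in traces if head_sample(t["trace_id"], 0.01)]
tail_kept = [t for t in traces if tail_sample(t)]
print(len(head_kept), len(tail_kept))
```

At a 1% head-based rate, whether the slow trace and the failed trace survive is down to chance, while the tail-based rule keeps both by construction. The cost is that every trace has to be held until its outcome is known.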

OpenTelemetry in one paragraph

OpenTelemetry (OTel) is an open standard and set of libraries/agents for generating and exporting metrics,
logs, and traces. It reduces vendor lock-in (instrument once, send to multiple backends) and encourages consistent metadata
and context propagation across services and runtimes.

Conclusion

Metrics, traces, and logs are complementary. Metrics are best for detection and scope. Traces show where time is spent across
dependencies. Logs provide detailed explanation and business context. When you connect them via consistent metadata and
correlation IDs, you get the fastest path from “it’s slow” to “here’s why.”

Next: read APM vs Observability for the bigger picture, or continue with
p95 vs p99 (Tail Latency).
