Observability · Logs, Metrics, Traces — The Three Pillars

01 What observability actually means

Monitoring tells you something is wrong. Observability tells you why. The distinction matters because the system you can monitor (CPU, memory, error rate) is not the system you need to debug (the specific request that failed for that specific user in that specific code path).

A system is observable when you can answer questions about it that you didn't think to ask before. That requires three different types of signals working together.

02 Logs: high cardinality, expensive at scale

Logs are timestamped records of discrete events. They're flexible (any structure you want), high-cardinality (every log line can include any combination of fields), and expensive at scale (every event is stored).

Use logs for:

Specific events worth recording. User signed up. Payment processed. Background job failed.
Error details with full context. Stack traces, request bodies, user IDs.
Audit trails. Who did what, when. Compliance-driven.

Don't use logs for:

Counting things (use metrics — vastly cheaper)
Measuring distributions of latency (use metrics)
High-volume per-request data on the hot path (kills your log bill)

✓ structured logging

logger.info({
  event: 'payment.failed',
  user_id: 'user_abc',
  amount_cents: 5000,
  reason: 'card_declined',
  stripe_error_code: 'card_declined',
  request_id: 'req_xyz',
});

Structured logs (JSON, key-value) are almost always the better choice: they're queryable, indexable, and machine-readable. Free-form text reads well to humans, but it has to be parsed with brittle regexes before it can drive alerting or analysis.
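
To make "queryable" concrete, here is a minimal sketch of a logger that writes one JSON object per line, then filters those lines by field. createLogger and its write parameter are hypothetical names for this example, not the API of any particular logging library.

```javascript
// Minimal structured logger: one JSON object per line.
// `createLogger` and `write` are hypothetical names for this sketch.
function createLogger(write = (line) => process.stdout.write(line + '\n')) {
  const emit = (level, fields) => {
    const record = { ts: new Date().toISOString(), level, ...fields };
    write(JSON.stringify(record));
    return record;
  };
  return {
    info: (fields) => emit('info', fields),
    error: (fields) => emit('error', fields),
  };
}

// Capture lines in memory so they can be queried the way a log backend would.
const lines = [];
const logger = createLogger((line) => lines.push(line));

logger.info({ event: 'payment.failed', user_id: 'user_abc', reason: 'card_declined' });
logger.info({ event: 'user.signed_up', user_id: 'user_def' });

// Because each line is JSON, filtering by field is a parse plus a comparison.
const failedPayments = lines
  .map((line) => JSON.parse(line))
  .filter((rec) => rec.event === 'payment.failed');
```

The same query against free-form text would need a regex per message format, which is where unstructured logs start to hurt.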

03 Metrics: aggregate, cheap, dashboard-friendly

Metrics are numeric measurements aggregated over time. They're cheap (you store the aggregate, not every event), fast to query, and ideal for dashboards and alerts.

Four types of metrics matter:

Counter: a number that only goes up (it resets to zero when the process restarts). Total requests, total errors, total signups.
Gauge: a number that can go up or down. Current memory usage, connections open, queue depth.
Histogram: distribution of values, stored as counts per bucket. Request latency, payload sizes. Lets you estimate p50, p95, p99.
Summary: similar to a histogram, but quantiles are pre-computed in the client. Cheaper to query, but less flexible: the quantiles can't be aggregated across instances.

✓ Prometheus metrics

http_requests_total{method="POST", route="/api/users", status="500"} 47
http_request_duration_seconds_bucket{le="0.1"} 12453
http_request_duration_seconds_bucket{le="0.5"} 14201
http_request_duration_seconds_bucket{le="1"} 14580
http_request_duration_seconds_bucket{le="+Inf"} 14602
http_request_duration_seconds_count 14602
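
Percentiles come out of those bucket counts by interpolation. The sketch below estimates a quantile from cumulative buckets, using the same idea as Prometheus's histogram_quantile rather than its exact implementation. estimateQuantile is a hypothetical helper, and the +Inf count of 14602 is an assumed total for illustration.

```javascript
// Estimate a quantile from cumulative histogram buckets by linear
// interpolation inside the bucket that contains the target rank.
// Buckets are [upperBound, cumulativeCount], sorted by upperBound.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1][1];
  if (total === 0) return NaN;
  const rank = q * total;
  let lowerBound = 0;
  let lowerCount = 0;
  for (const [upperBound, count] of buckets) {
    if (count >= rank) {
      // The +Inf bucket has no upper edge, so report the last finite bound.
      if (upperBound === Infinity) return lowerBound;
      const width = count - lowerCount;
      const fraction = width === 0 ? 0 : (rank - lowerCount) / width;
      return lowerBound + (upperBound - lowerBound) * fraction;
    }
    lowerBound = upperBound;
    lowerCount = count;
  }
  return lowerBound;
}

const buckets = [
  [0.1, 12453],
  [0.5, 14201],
  [1, 14580],
  [Infinity, 14602], // assumed total for this sketch
];

const p50 = estimateQuantile(0.5, buckets);
const p95 = estimateQuantile(0.95, buckets);
const p99 = estimateQuantile(0.99, buckets);
```

With these numbers p95 lands at roughly 0.42 seconds. The estimate is only as precise as the bucket boundaries, so choose buckets around the latencies you care about.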

Cardinality matters: every unique combination of labels creates a new time series. user_id as a label = millions of time series = your monitoring database falls over. Use IDs in logs, not in metric labels.
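
The arithmetic is worth doing once. A rough sketch, with made-up cardinalities per label, of how the worst-case series count multiplies:

```javascript
// Worst-case series count is the product of distinct values per label.
function seriesCount(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

// Hypothetical cardinalities, for illustration only.
const bounded = seriesCount({ method: 5, route: 40, status: 8 });
const withUserId = seriesCount({ method: 5, route: 40, status: 8, user_id: 1_000_000 });

console.log(bounded);    // 1600
console.log(withUserId); // 1600000000
```

Real series counts are lower than the worst case because not every combination occurs, but one unbounded label is enough to dominate the total.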

04 Traces: the path of a single request

Distributed tracing follows a single request as it moves through your system. Each service it touches creates a "span"; spans nest to form a "trace."

Traces are how you debug "why was this specific request slow?" The dashboard shows you average latency increased; the trace shows you the specific 5-second database call in the middle of a chain of 12 service hops.

✓ trace structure

Trace: req_abc123
├─ http_request POST /checkout  (1.2s)
│  ├─ auth.verify_token  (12ms)
│  ├─ cart.get_items  (45ms)
│  │  └─ db.query SELECT cart  (43ms)
│  ├─ pricing.calculate  (15ms)
│  ├─ payment.charge  (980ms)  ← bottleneck
│  │  └─ stripe.create_charge  (970ms)
│  └─ order.create  (110ms)
│     ├─ db.query INSERT order  (95ms)
│     └─ email.send_confirmation  (12ms)

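Sampling matters: tracing every request is expensive. A common setup samples 1-10% of requests in production, plus 100% of slow or errored requests. Keeping every slow or errored trace requires tail-based sampling (for example Honeycomb's Refinery or the OpenTelemetry Collector's tail sampling processor), which decides what to keep after seeing the whole trace.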
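
A sampling decision along those lines can be sketched in a few lines. This is an illustration of the policy, not how any specific tool implements it: shouldKeepTrace, the trace shape ({ traceId, durationMs, hasError }), and the default thresholds are all assumptions.

```javascript
// Map a trace ID to a stable number in [0, 1) with a 32-bit FNV-1a hash,
// so every evaluation of the same trace ID gives the same answer.
function traceIdToUnitInterval(traceId) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < traceId.length; i++) {
    hash ^= traceId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash / 0x100000000;
}

// Tail-style decision: made once the whole trace has been seen.
// Defaults (5% sample rate, 1000 ms slow threshold) are hypothetical.
function shouldKeepTrace(trace, { sampleRate = 0.05, slowMs = 1000 } = {}) {
  if (trace.hasError) return true;              // keep every errored trace
  if (trace.durationMs >= slowMs) return true;  // keep every slow trace
  return traceIdToUnitInterval(trace.traceId) < sampleRate; // sample the rest
}

const slowCheckout = { traceId: 'req_abc123', durationMs: 1200, hasError: false };
const keepSlow = shouldKeepTrace(slowCheckout);
```

Hashing the trace ID instead of calling Math.random() keeps the decision deterministic, so a trace is never half-kept when several components evaluate it.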

05 OpenTelemetry: the standard

OpenTelemetry (OTel) is the open standard for instrumentation. It defines a single API for emitting logs, metrics, and traces; a single agent, the OpenTelemetry Collector, then collects, processes, and forwards them to your backend of choice (Datadog, Honeycomb, Grafana, Jaeger, and others).

Use OTel even if you're committed to a specific vendor. It future-proofs you: switching backends later is mostly a change to exporter configuration, not a re-instrumentation of every service.

✓ OTel auto-instrumentation in Node

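// One line, but it must run before any other module is loaded,
// so that libraries are patched at import time.
// CLI form: node --require @opentelemetry/auto-instrumentations-node/register app.js
require('@opentelemetry/auto-instrumentations-node/register');

// Now HTTP, Express, Postgres, Redis, fetch and other supported libraries are instrumented.
// The exporter and service name are set through OTEL_* environment variables.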

06 What to actually instrument

The minimum viable instrumentation:

Every HTTP endpoint: request count, error count, latency histogram, request size, response size.
Every database query: latency, rows returned, query type.
Every external API call: latency, error rate. These are often the largest source of slowness.
Every background job: duration, success rate, queue depth.
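
Most of the list above reduces to the same three measurements around an operation: calls, errors, and duration. A minimal sketch of a wrapper that records them, using an in-memory stand-in for a metrics client. createMetrics, increment, observe, and instrument are hypothetical names, and the wrapper is synchronous for brevity; an async version would await the operation inside the try block.

```javascript
// In-memory stand-in for a metrics client; a real one would export these values.
function createMetrics() {
  return { counts: new Map(), durationsMs: new Map() };
}

function increment(metrics, name) {
  metrics.counts.set(name, (metrics.counts.get(name) || 0) + 1);
}

function observe(metrics, name, valueMs) {
  if (!metrics.durationsMs.has(name)) metrics.durationsMs.set(name, []);
  metrics.durationsMs.get(name).push(valueMs);
}

// Wrap an operation so every call records count, errors, and latency.
// `now` is injectable so the example can use a fake clock.
function instrument(metrics, name, operation, now = Date.now) {
  const start = now();
  increment(metrics, `${name}.calls`);
  try {
    return operation();
  } catch (err) {
    increment(metrics, `${name}.errors`);
    throw err; // record the failure, but never swallow it
  } finally {
    observe(metrics, `${name}.duration_ms`, now() - start);
  }
}

// Usage with a fake clock so the durations are deterministic.
const metrics = createMetrics();
let fakeTime = 0;
const clock = () => fakeTime;

const rows = instrument(metrics, 'db.query', () => { fakeTime += 43; return ['row']; }, clock);
try {
  instrument(metrics, 'stripe.charge', () => { fakeTime += 970; throw new Error('timeout'); }, clock);
} catch (err) {
  // The error was counted, then rethrown to the caller.
}
```

Recording the duration in finally means failed calls show up in the latency data too, which matters when failures are slow timeouts.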

Add custom business metrics: signups per hour, checkout conversion rate, search-to-result latency. These are what product managers actually care about; technical metrics alert on health, business metrics alert on value.

07 The killer feature: correlation

The three pillars become powerful when they're correlated. Every log line, every span, every metric exemplar carries the same trace_id and request_id. When something breaks, you can:

See a metric spike
Click into the spike to see the specific traces contributing
Open a slow trace to see which span was bad
Jump from the span to the logs emitted during it

That's the debugging flow that takes you from "something's wrong" to "here's the failing line of code" in under a minute. Without correlation, the same investigation takes hours.
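
The last hop of that flow, from a span to its logs, only works if every log line carries the IDs. A minimal sketch, assuming a hypothetical childLogger helper that binds trace_id and request_id once per request:

```javascript
// Bind correlation fields once; every log line from this logger carries them.
// `childLogger` is a hypothetical helper, and `sink` stands in for a log backend.
function childLogger(sink, context) {
  return {
    info: (fields) => sink.push({ level: 'info', ...context, ...fields }),
    error: (fields) => sink.push({ level: 'error', ...context, ...fields }),
  };
}

const logStore = [];

function handleCheckout(traceId, requestId) {
  const log = childLogger(logStore, { trace_id: traceId, request_id: requestId });
  log.info({ event: 'checkout.started' });
  log.error({ event: 'payment.failed', reason: 'card_declined' });
}

handleCheckout('trace_abc123', 'req_xyz');
handleCheckout('trace_def456', 'req_uvw');

// Jumping from a trace to its logs is then a filter on trace_id.
const logsForTrace = logStore.filter((rec) => rec.trace_id === 'trace_abc123');
```

Binding the context at the start of the request keeps call sites from having to remember the IDs, which is how they end up missing on exactly the line you need.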

08 The cost trap

Observability bills can rival or even exceed compute bills at scale, because every log line, every metric label, and every trace span has a cost. Without discipline, the bill grows faster than your traffic.

Cost-control habits:

Use metrics for high-volume measurements. Don't log every request; emit a metric.
Sample traces. 100% in dev, 1-10% in prod, plus 100% of errors and slow requests.
Set log retention. Most logs aren't useful after 7-30 days.
Audit cardinality. Unbounded label values (user IDs, URL paths with IDs) are the #1 cost driver.
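
For URL paths, the usual fix is to normalize the path before it becomes a label. A rough sketch, assuming IDs are either numeric or UUID-shaped; normalizeRoute is a hypothetical helper, and where your framework exposes the matched route template, that is usually the better label.

```javascript
// Replace ID-like path segments with a placeholder so the label set stays bounded.
const NUMERIC = /^\d+$/;
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function normalizeRoute(path) {
  const pathname = path.split('?')[0]; // query strings never belong in a label
  return pathname
    .split('/')
    .map((segment) => (NUMERIC.test(segment) || UUID.test(segment) ? ':id' : segment))
    .join('/');
}

const rawPaths = ['/api/users/1', '/api/users/2', '/api/users/3/orders/77?page=2'];
const routeLabels = new Set(rawPaths.map(normalizeRoute));
// routeLabels now holds '/api/users/:id' and '/api/users/:id/orders/:id'
```

Three raw paths collapse into two label values here, and the count stays at two no matter how many users exist.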

09 Alert on symptoms, not causes

Alert when users are hurt, not when internal systems are noisy. "CPU at 90%" might mean nothing; "error rate above 1% for 5 minutes" definitely means something.
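
The "for 5 minutes" part is what separates a symptom from a blip. A minimal sketch of that rule over per-minute samples; shouldAlert and the sample shape ({ requests, errors }) are assumptions for illustration, similar in spirit to a for: clause on a Prometheus alerting rule.

```javascript
// Each sample is one minute of traffic: { requests, errors }.
// Fire only if the error rate exceeded the threshold in every one of the
// last `forMinutes` samples.
function shouldAlert(samples, { threshold = 0.01, forMinutes = 5 } = {}) {
  if (samples.length < forMinutes) return false;
  return samples
    .slice(-forMinutes)
    .every(({ requests, errors }) => requests > 0 && errors / requests > threshold);
}

// A one-minute blip does not page anyone.
const blip = [
  { requests: 1000, errors: 2 },
  { requests: 1000, errors: 50 },
  { requests: 1000, errors: 3 },
  { requests: 1000, errors: 2 },
  { requests: 1000, errors: 1 },
];

// A sustained 2.5% error rate does.
const sustained = blip.map(() => ({ requests: 1000, errors: 25 }));

const blipAlerts = shouldAlert(blip);           // false
const sustainedAlerts = shouldAlert(sustained); // true
```

Requiring the condition to hold for the whole window trades a few minutes of detection time for far fewer false pages.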

The four golden signals (from Google SRE) are a great starting point for every service:

Latency: time to serve requests (split by successful vs failed)
Traffic: requests per second
Errors: failed requests per second, or the error rate as a fraction of traffic
Saturation: how full the service is (memory, CPU, queue depth)

10 Dashboards: designed for incidents

Dashboards are read during incidents, not browsed casually. Optimize for that:

Top of dashboard = most important. Health summary, current errors, current latency.
One screen per service. Don't make the on-call scroll through 14 panels to find the symptom.
Time range last 1h by default. Most incidents start recently.
Annotate deploys. Vertical line on every chart when a new version ships. The #1 question during incidents: "did something change?"

∞ The shift

Observability is not a feature you bolt on at the end. It's a property of well-designed systems — emitted naturally by code that respects future debuggers. The teams that ship reliable systems treat observability as first-class engineering work, not as someone else's problem.

Start small. Instrument HTTP. Add database calls. Add external APIs. Correlate everything with trace IDs. Within a quarter you'll have visibility you didn't know was possible — and the next incident will be measured in minutes, not hours.