The On-Call Survival Guide

01 The bar for sustainable on-call

Three numerical thresholds, beyond which on-call is broken regardless of how heroic the engineers are:

No more than 2 pages per night. Beyond that, sleep is destroyed and judgment degrades.
No more than 1 in 4 weeks on-call. Beyond that, recovery time is insufficient.
No more than 10% of pages should be false positives. Beyond that, alert fatigue means real pages get ignored.

These aren't aspirations. They're operational thresholds. A team consistently violating them is a team where the actual problem is the alerting system or system reliability, not the rotation itself.
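The three thresholds are simple enough to check mechanically against a page log. A minimal sketch, assuming a hypothetical `Page` record (the field names and the `check_thresholds` helper are illustrative, not taken from any paging tool) and assuming the list passed in contains only night-time pages:

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Page:
    night: date           # the night the page belongs to (hypothetical field)
    false_positive: bool  # judged false during incident review


def check_thresholds(pages, weeks_on_call, weeks_total):
    """Return the list of violated thresholds (empty means sustainable)."""
    violations = []

    # Threshold 1: no more than 2 pages on any single night.
    per_night = Counter(p.night for p in pages)
    if per_night and max(per_night.values()) > 2:
        violations.append("more than 2 pages in a night")

    # Threshold 2: on-call no more than 1 week in 4.
    if weeks_on_call * 4 > weeks_total:
        violations.append("on-call more than 1 week in 4")

    # Threshold 3: no more than 10% of pages are false positives.
    false_count = sum(p.false_positive for p in pages)
    if pages and false_count * 10 > len(pages):
        violations.append("more than 10% false positives")

    return violations
```

Run it over a quarter of data rather than a single week; one bad night is noise, a repeated violation is the signal.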

02 Before the rotation starts

Twenty minutes the day before saves hours during the rotation:

Check tool access. Can you actually log in to PagerDuty, the monitoring dashboard, the deployment system, the database read replica? Test before the rotation, not at 3am.
Phone setup. PagerDuty/Opsgenie app installed. Notifications bypassing Do Not Disturb. Volume up. Phone on charger.
Family/partner aware. "I'm on-call this week, may get woken up." Sets expectations and prevents resentment.
Read the recent incidents. What happened in the last 7-14 days? What's likely to recur?
Read the runbooks. Don't memorize them; just know what exists. When the page fires, you'll know whether to consult one.
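The tool-access item in this checklist can be turned into a small preflight script. The sketch below is only a skeleton: `run_preflight` and the check names are hypothetical, and the lambdas stand in for whatever real check applies in your environment (a login attempt, a read-only query, and so on).

```python
def run_preflight(checks):
    """Run each named check and return the names of the ones that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that raises counts as a failure
        if not ok:
            failed.append(name)
    return failed


# Placeholder checks; real ones might attempt a login or a read-only query.
checks = {
    "paging tool login": lambda: True,
    "monitoring dashboard": lambda: True,
    "deployment system": lambda: True,
    "database read replica": lambda: False,
}
print(run_preflight(checks))  # prints ['database read replica']
```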

03 When the page fires

The standard sequence:

Acknowledge within the SLA. Even if you can't fix it yet, ack the page within 5 minutes. Stops the escalation chain.
Open the runbook for the alert. Every alert should link to its runbook. If it doesn't, add that to your followups.
Open the relevant dashboards. What does the system look like right now? What's normal? What's not?
Stop the bleeding before diagnosing. Rollback, failover, traffic shift — restore service first, root cause later.
Communicate. In the incident channel: what you're seeing, what you're trying, what you need. Even if you're alone, the channel becomes the timeline for the postmortem.
Escalate if stuck for 30 minutes. Bring in another engineer. Two minds are better than one tired one.
After resolution, fill out the incident summary. Don't wait for tomorrow. You'll forget details.
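The two timers in this sequence (ack within 5 minutes, escalate after 30) can be written down as a small decision function. This is a sketch under one simplifying assumption: "stuck for 30 minutes" is measured from when the page fired. The `next_action` helper and its return strings are illustrative.

```python
from datetime import datetime, timedelta

ACK_SLA = timedelta(minutes=5)          # ack deadline from the sequence above
ESCALATE_AFTER = timedelta(minutes=30)  # "stuck" limit from the sequence above


def next_action(fired_at, now, acked, mitigated):
    """Suggest the next step for a page given its current state."""
    elapsed = now - fired_at
    if mitigated:
        return "write incident summary"
    if not acked:
        return "ack now (SLA missed)" if elapsed > ACK_SLA else "ack"
    if elapsed >= ESCALATE_AFTER:
        return "escalate"
    return "mitigate"
```

The ordering is deliberate: mitigation status is checked first because a restored service changes the job from firefighting to write-up, whatever the clock says.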

04 Runbook discipline: the format that works

Most runbooks are useless. They were written once, never updated, and contain abstract guidance instead of specific commands. The format that actually helps:

✓ runbook template

# Alert: [exact alert name]

## What it means
[Plain English. What's actually happening when this fires.]

## Severity
[Severity level and user impact, e.g. SEV-2: user-facing errors on checkout endpoint]

## First steps
1. Open dashboard: [direct URL to the right dashboard]
2. Check error rate trend over last 30 minutes
3. Check recent deploys: [URL to deploy log]

## Likely causes (most common first)
- Database connection pool exhausted: [link to fix runbook]
- Bad deploy: rollback with [exact command]
- Downstream service down: check [URL]

## How to mitigate
[Exact commands. Not "scale up the service" — the actual kubectl command.]

## Who to escalate to
- Service owner: @name
- Database team: #db-oncall channel

## Last updated
YYYY-MM-DD by name

Runbooks should be updated after every incident, while it's fresh. The on-call who used the runbook is in the best position to improve it.
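One way to keep the template honest is to lint runbooks for missing sections and stale dates. A minimal sketch, assuming runbooks are stored as text files in the template's format; `lint_runbook` is a hypothetical helper and the 180-day staleness limit is an arbitrary assumption, not a recommendation from this guide.

```python
import re
from datetime import date

REQUIRED_SECTIONS = [
    "What it means",
    "Severity",
    "First steps",
    "Likely causes",
    "How to mitigate",
    "Who to escalate to",
    "Last updated",
]


def lint_runbook(text, today, max_age_days=180):
    """Return a list of problems found in a runbook that uses the template."""
    problems = []
    if not re.search(r"^# Alert: \S", text, re.MULTILINE):
        problems.append("missing '# Alert:' title")

    headings = re.findall(r"^## (.+)$", text, re.MULTILINE)
    for section in REQUIRED_SECTIONS:
        # Prefix match so a heading with a trailing note still counts.
        if not any(h.startswith(section) for h in headings):
            problems.append(f"missing section: {section}")

    # Staleness: look for an ISO date right under the "Last updated" heading.
    match = re.search(
        r"^## Last updated\s*\n(\d{4}-\d{2}-\d{2})", text, re.MULTILINE
    )
    if match:
        try:
            updated = date.fromisoformat(match.group(1))
        except ValueError:
            problems.append("invalid date under 'Last updated'")
        else:
            if (today - updated).days > max_age_days:
                problems.append(f"stale: last updated {updated.isoformat()}")
    elif any(h.startswith("Last updated") for h in headings):
        problems.append("no date under 'Last updated'")
    return problems
```

A lint like this catches structure, not quality. It cannot tell whether the mitigation commands still work; only the on-call who last used the runbook can.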

05 The handoff

Most rotation problems compound across handoffs. The incoming engineer doesn't know what's in flight, what's been escalated, what's been deferred. Two pages later, they're behind.

The handoff document, written at the end of each shift:

Open incidents: what's still being worked on, who's involved, what's the current state.
Recurring alerts: what fired multiple times this week, what's been done about it.
Deferred followups: action items from incidents that need to be picked up.
Things to watch: "deploys planned tomorrow," "DB migration mid-week," "new service launching Friday."
Anything weird: patterns you couldn't explain, alerts you suspected were false but couldn't verify.

A 5-minute live conversation beats any document. If schedules allow, do a brief handoff call between the outgoing and incoming on-call.
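If the handoff document tends to get skipped, giving it a fixed shape helps. The sketch below models the five sections as a hypothetical `Handoff` dataclass and renders every section even when it is empty, so "nothing to report" is an explicit statement rather than an omission. The class and field names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    open_incidents: list = field(default_factory=list)
    recurring_alerts: list = field(default_factory=list)
    deferred_followups: list = field(default_factory=list)
    things_to_watch: list = field(default_factory=list)
    anything_weird: list = field(default_factory=list)

    def render(self):
        """Render all five sections, writing 'none' for empty ones."""
        sections = [
            ("Open incidents", self.open_incidents),
            ("Recurring alerts", self.recurring_alerts),
            ("Deferred followups", self.deferred_followups),
            ("Things to watch", self.things_to_watch),
            ("Anything weird", self.anything_weird),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in (items or ["none"]))
        return "\n".join(lines)
```

For example, `Handoff(things_to_watch=["DB migration mid-week"]).render()` produces a five-section note with four sections marked `none`, ready to paste into the team channel.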

06 Recovery: taking it seriously

If you were paged at 2am and worked until 5am, you should not be in standup at 9am. Sleep deprivation degrades judgment significantly, and it produces exactly the kind of errors that cause more incidents.

Healthy teams have explicit policies:

Time-and-a-half off (or comp pay) for after-hours incidents.
Permission to skip non-essential meetings the day after a bad night.
Mandatory comp days after particularly brutal weeks.

If your team doesn't have these, advocate for them. Manager pushback usually means "it's never come up." Make it come up.

07 Alert quality: the highest leverage fix

The single best on-call improvement is fewer, better alerts. Every false alert costs sleep, attention, and trust in the alerting system.

Audit alerts regularly. For each one, ask:

Is a human action required when this fires? If no, demote to dashboard.
Does this fire on a symptom users feel, or on an internal metric? Prefer the former.
How often is this alert a false positive? Above 10% of the times it fires: tune or delete.
Is the runbook usually just "wait, it'll fix itself"? If so, delete the alert.

A pager that only fires when something real is wrong is trusted absolutely. A pager that fires 5 times a day is increasingly ignored. The transition from one to the other is one bad alert away.
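The audit questions translate into a small triage function. A sketch under stated assumptions: `AlertStats` and `audit` are hypothetical names, the counts come from reviewing an audit window by hand, and "usually self-resolves" is taken to mean more than half of fires, a cutoff chosen for illustration. The symptom-versus-internal-metric question is a design judgment and is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertStats:
    name: str
    fires: int                # times the alert fired in the audit window
    false_positives: int      # fires judged false on review
    self_resolved: int        # fires where the only action was to wait
    needs_human_action: bool  # does a human have to do something?


def audit(alert):
    """Return a recommendation for one alert: keep, demote, tune, or delete."""
    if not alert.needs_human_action:
        return "demote to dashboard"
    if alert.fires == 0:
        return "keep"  # no data in this window, nothing to judge
    # Assumption: "usually fixes itself" means more than half of fires.
    if alert.self_resolved * 2 > alert.fires:
        return "delete"
    # More than 10% false positives, the threshold from the audit list.
    if alert.false_positives * 10 > alert.fires:
        return "tune or delete"
    return "keep"
```

Sorting the results so that `delete` and `demote to dashboard` come first gives the audit meeting an agenda: start with the alerts that can go away entirely.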

08 Self-care during the rotation

Plan around the rotation. Don't volunteer for extra projects the same week. Don't make travel plans you can't break.
Set up your environment. Laptop near the bed, on charger. Headphones nearby (for the inevitable 3am Zoom). Notebook by the bed.
Use the swap channel. Most teams allow swaps. Need a Tuesday off? Swap it. Don't suffer through.
Honest reporting. If you didn't sleep, say so in standup. The team can't help you recover if they think you're fine.

09 When to push back

If your rotation consistently violates the thresholds at the top — pages every night, broken sleep weeks in a row, no recovery — that's not "this is how on-call is." That's a broken rotation, and it's a management issue.

The escalation path:

Bring data to your manager. "I've been paged X times this week, at these hours. Here's the pattern." Specific data beats general complaints.
Frame it as a system problem, not a personal complaint. "The team's on-call sustainability metrics are X, and that's below the bar." A manager can act on a team problem.
Propose specific changes. Fewer alerts? More on-call engineers in the rotation? Better runbooks? Engineering investment to fix the root cause of the most common pages? Don't just present the problem.
If nothing changes, that's a signal. Companies that can't fix obviously-broken on-call rarely fix anything else. Look elsewhere.
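For the first step, the data can be as simple as a count of pages by hour. A minimal sketch, assuming you can export page timestamps from your paging tool as `datetime` values; `summarize_pages` is a hypothetical helper, and treating 22:00 to 06:00 as "after hours" is an assumption to adjust for your team.

```python
from collections import Counter
from datetime import datetime


def summarize_pages(page_times, night_start=22, night_end=6):
    """Summarize a list of page timestamps for a conversation with a manager."""
    by_hour = Counter(t.hour for t in page_times)
    # After hours: from night_start until midnight, and midnight to night_end.
    after_hours = sum(
        count
        for hour, count in by_hour.items()
        if hour >= night_start or hour < night_end
    )
    busiest = by_hour.most_common(1)[0][0] if by_hour else None
    return {
        "total": len(page_times),
        "after_hours": after_hours,
        "busiest_hour": busiest,
    }
```

A result such as "14 pages this week, 9 after hours, most of them around 2am" is the kind of specific statement that turns a complaint into a pattern someone can act on.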

10 What on-call teaches you

Despite the difficulty, on-call done well is one of the best engineering teachers. You see:

What actually fails in production (vs. what you thought would)
The cost of bad abstractions and missing observability
How different parts of the system interact under stress
What good runbooks look like — and the discipline to write them
The patterns that separate reliable systems from fragile ones

Engineers who've done substantial on-call build different intuitions. They design for failure modes they've seen. They write logs that help debuggers. They ship slow when shipping fast costs the next on-call their sleep.

∞ The bar

On-call is not heroic. Companies that treat it as heroism are companies that haven't invested in the systems that make it sustainable. The bar is: you can be on call without burning out. If you can't, the rotation is broken — and that's a fixable problem, not an unfortunate reality.

Push for the fix. Build the runbooks. Tune the alerts. Take the recovery time. The compound effect on the team, the systems, and your career is enormous.