Zero-Downtime Deployments

01 Why downtime is now optional

Twenty years ago, deploying meant stopping the server, replacing the binary, restarting. Users got an error page. Now we have load balancers, container orchestrators, and infrastructure that can run multiple versions side by side. The technical capability exists; the patterns just need to be applied.

Three strategies cover almost every real-world case. Each has trade-offs in cost, complexity, and risk.

02 Rolling deployment: the default

Replace instances one (or a few) at a time, gradually shifting traffic to the new version. While instance #1 is being replaced, instances #2–#10 serve traffic. When #1 is healthy with new code, replace #2. Repeat.

Pros: simple, low resource overhead (don't need a full duplicate fleet), built into every orchestrator (Kubernetes, ECS, Nomad).

Cons: both versions run concurrently during the rollout. Your code must tolerate that. Rollback is also a rolling deploy of the previous version — not instant.

✓ Kubernetes rolling update

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1           # up to 1 extra pod during update
      maxUnavailable: 0     # never lose capacity

maxUnavailable: 0 means you maintain full capacity throughout the rollout. maxSurge: 1 means you may briefly run one extra pod (the new version) while the old pod is draining.
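
To see how the two settings interact, here is a minimal simulation. It is a simplified model, not the actual Kubernetes controller logic: it assumes new pods become ready immediately, and the names `RolloutStep` and `simulate_rolling_update` are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class RolloutStep:
    old_ready: int
    new_ready: int

    @property
    def total_ready(self) -> int:
        return self.old_ready + self.new_ready


def simulate_rolling_update(
    replicas: int, max_surge: int, max_unavailable: int
) -> list[RolloutStep]:
    """Simplified model of a rolling update (assumption: pods are ready at once)."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    steps = [RolloutStep(old, new)]
    while old > 0 or new < replicas:
        # Surge budget: total pods may not exceed replicas + max_surge.
        can_start = min(replicas + max_surge - (old + new), replicas - new)
        new += can_start
        steps.append(RolloutStep(old, new))
        # Availability budget: ready pods may not drop below
        # replicas - max_unavailable.
        can_stop = min((old + new) - (replicas - max_unavailable), old)
        old -= can_stop
        steps.append(RolloutStep(old, new))
    return steps


if __name__ == "__main__":
    for step in simulate_rolling_update(replicas=3, max_surge=1, max_unavailable=0):
        print(step.old_ready, "old +", step.new_ready, "new")
```

With `max_surge=1` and `max_unavailable=0`, the ready count in this model never drops below the desired replica count and never exceeds it by more than one.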

03 Blue-green: instant cutover

Run two complete environments: blue (current) and green (new). Deploy to green while blue serves traffic. When green is healthy, flip the load balancer. All traffic instantly switches.

Pros: instant cutover, instant rollback (flip the load balancer back), the new version is fully warmed up before users see it.

Cons: 2x resource cost during deployment (both environments running). Database migrations are tricky because both environments share a database.

Best for: high-stakes deploys where you need a fast rollback option and can afford the temporary 2x cost.
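
The cutover logic can be sketched in a few lines. This is a toy stand-in for a load balancer, not any real product's API: `BlueGreenRouter` and its health-check callables are hypothetical names, and a real setup would flip a load balancer target or DNS record instead of a Python attribute.

```python
from typing import Callable, Dict, Optional


class BlueGreenRouter:
    """Toy load balancer that sends all traffic to one environment."""

    def __init__(self, health_checks: Dict[str, Callable[[], bool]], live: str) -> None:
        self._health_checks = health_checks
        self.live = live
        self.previous: Optional[str] = None

    def cut_over(self, target: str) -> bool:
        """Switch all traffic to target, but only if it reports healthy."""
        if target == self.live:
            return True
        if not self._health_checks[target]():
            return False  # keep serving from the current environment
        self.previous, self.live = self.live, target
        return True

    def roll_back(self) -> None:
        """Flip back to the environment that was live before the cutover."""
        if self.previous is None:
            raise RuntimeError("nothing to roll back to")
        self.live, self.previous = self.previous, self.live


if __name__ == "__main__":
    router = BlueGreenRouter({"blue": lambda: True, "green": lambda: True}, live="blue")
    router.cut_over("green")
    print("live:", router.live)
    router.roll_back()
    print("live:", router.live)
```

Rollback is the same operation as the cutover, which is why it is fast: the old environment is still running and still warm.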

04 Canary: gradual confidence

Route a small share of traffic to the new version, then widen it in stages (1%, 5%, 25%, 50%, 100%). Watch metrics at each stage. If errors spike, halt and roll back. If metrics stay clean, expand.

Pros: smallest blast radius for bugs. Real production traffic on the new version with real users, but only a small fraction affected by issues. Lets you catch problems that staging never reproduced.

Cons: complex routing (which users get canary?), a need for metrics and alerting that detect regressions automatically, and patience (a canary rollout takes an hour or more, not minutes).

Best for: high-traffic services where the cost of a bad deploy is high. Stripe, GitHub, and most large platforms canary by default.
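
One common answer to "which users get canary?" is to hash a stable identifier into a bucket. The sketch below assumes a string user ID is available on every request; `bucket_for`, `in_canary`, and the salt value are illustrative names, not part of any particular mesh or flag platform.

```python
import hashlib


def bucket_for(user_id: str, salt: str = "canary-v1") -> int:
    """Map a user to a stable bucket in [0, 10000)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def in_canary(user_id: str, percent: float, salt: str = "canary-v1") -> bool:
    """True if this user falls inside the current canary percentage."""
    return bucket_for(user_id, salt) < percent * 100


if __name__ == "__main__":
    for pct in (1, 5, 25, 50, 100):
        hits = sum(in_canary(f"user-{i}", pct) for i in range(10_000))
        print(f"{pct:>3}% target -> {hits} of 10000 users")
```

Because the bucket depends only on the salt and the user ID, a user stays on the same version across requests, and anyone included at 1% is still included at 5%. Changing the salt reshuffles who goes first on the next rollout.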

✓ canary stages

Stage 1: 1%   for 10 minutes  → check error rate, p99 latency
Stage 2: 5%   for 20 minutes  → check business metrics
Stage 3: 25%  for 30 minutes  → check resource usage
Stage 4: 50%  for 30 minutes
Stage 5: 100% — full deployment

Modern service meshes (Istio, Linkerd) and feature flag platforms (LaunchDarkly, Unleash) make canary routing trivial. The hard part is the metrics evaluation, not the routing.
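
A sketch of what the metrics evaluation might look like at each stage. The thresholds, the `Metrics` fields, and the `evaluate_canary` function are assumptions chosen for illustration; real values depend on the service's normal error rate and latency profile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def evaluate_canary(
    baseline: Metrics,
    canary: Metrics,
    min_requests: int = 500,
    max_error_rate_delta: float = 0.005,
    max_latency_ratio: float = 1.2,
) -> str:
    """Return 'wait', 'rollback', or 'promote' for the current stage."""
    if canary.requests < min_requests:
        return "wait"  # too little traffic to judge either way
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    baseline = Metrics(requests=100_000, errors=100, p99_latency_ms=200.0)
    canary = Metrics(requests=1_000, errors=30, p99_latency_ms=210.0)
    print(evaluate_canary(baseline, canary))
```

Comparing the canary against the baseline fleet over the same time window, instead of against a fixed threshold, keeps an unrelated traffic spike from being blamed on the new version.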

05 Database migrations: where zero-downtime breaks

The deployment strategies above all assume your application code can be deployed independently. Database migrations break that assumption when they're not done carefully.

The principle: both old and new code must work with the database in either schema state. A migration that breaks the old code while the new code is still rolling out causes downtime.

The pattern: expand, migrate, contract.

06 Expand → migrate → contract

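Three phases instead of one big-bang migration (a phase can span more than one deploy). Every phase before contract is safe to roll back on its own; contract is destructive, so run it only once nothing reads or writes the old schema: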

Expand: add the new column / table / index. Old code ignores it. New code writes to both old and new locations.
Migrate: backfill any existing data. Run as a background job, idempotent, restartable.
Contract: remove the old column / table / index. Now only the new schema is used.

✓ renaming a column safely

-- Deploy 1: add new column 'full_name', keep 'name'
ALTER TABLE users ADD COLUMN full_name TEXT;
-- App writes to both name AND full_name on every insert and update
-- App reads from name (unchanged)

-- Deploy 2: backfill, switch reads
-- (on a large table, run this in batches rather than as one statement)
UPDATE users SET full_name = name WHERE full_name IS NULL;
-- App reads from full_name now
-- App still writes to both columns (for rollback safety)

-- Deploy 3: stop writing to old column
-- App writes only to full_name

-- Deploy 4 (later): drop old column
ALTER TABLE users DROP COLUMN name;

Four deploys to rename a column. Yes, it's tedious. It's also the difference between zero downtime and 30 minutes of customers seeing errors.
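
On the application side, the dual-write phase could look roughly like this. The sketch uses an in-memory SQLite database so it runs anywhere; the helper names (`save_name`, `read_name_v1`, `read_name_v2`) are hypothetical, and a production service would go through its own data-access layer.

```python
import sqlite3


def save_name(conn: sqlite3.Connection, user_id: int, value: str) -> None:
    """Deploys 1 and 2: write both columns so either code version can read."""
    conn.execute(
        "UPDATE users SET name = ?, full_name = ? WHERE id = ?",
        (value, value, user_id),
    )


def read_name_v1(conn: sqlite3.Connection, user_id: int) -> str:
    """Old code path: still reads the old column."""
    return conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()[0]


def read_name_v2(conn: sqlite3.Connection, user_id: int) -> str:
    """New code path: reads the new column."""
    return conn.execute("SELECT full_name FROM users WHERE id = ?", (user_id,)).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada'), (2, 'Grace')")

# Expand: add the new column; existing rows have full_name = NULL.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")
save_name(conn, 1, "Ada Lovelace")  # dual write from the app

# Migrate: backfill rows the app has not touched since the expand step.
conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")
conn.commit()

print(read_name_v1(conn, 2), "|", read_name_v2(conn, 2))
```

After the backfill, both read paths return the same value for every row, so rolling the application back to the old version at this point loses nothing.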

07 Locking migrations: the deadlock trap

Some schema changes hold a lock on the whole table while they run. On older database versions (PostgreSQL before 11, for example), ALTER TABLE ADD COLUMN NOT NULL DEFAULT X rewrites every row: minutes on a large table, with the table locked against writes the entire time.

Safe alternatives:

Add nullable columns first. ADD COLUMN NULL is fast (metadata only). Backfill the values, then add the default and the NOT NULL constraint in a later migration.
Add indexes CONCURRENTLY (Postgres) or with online DDL options (MySQL InnoDB). The build takes longer but doesn't block writes, so the service stays up.
Use a job for large updates. Don't run UPDATE users SET x = y as one statement. Batch in chunks of 1,000 to 10,000 rows and commit after each chunk.

✓ index without blocking

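-- Postgres: builds the index without blocking writes
-- (must run outside a transaction block)
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

-- Batched updates (Postgres 11+; run outside an explicit transaction so COMMIT is allowed)
DO $$
DECLARE
  batch_size INT := 1000;
  updated INT := batch_size;
BEGIN
  WHILE updated = batch_size LOOP
    UPDATE users SET normalized_email = lower(email)
    WHERE id IN (
      SELECT id FROM users
      WHERE normalized_email IS NULL
        AND email IS NOT NULL  -- otherwise NULL emails are re-selected forever
      LIMIT batch_size
    );
    GET DIAGNOSTICS updated = ROW_COUNT;
    COMMIT;                 -- release row locks after each batch
    PERFORM pg_sleep(0.1);  -- give other queries breathing room
  END LOOP;
END $$;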
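
The same batching idea can also live in a background job in application code, where each batch gets its own short transaction. A minimal sketch against SQLite, assuming a `users` table with `email` and `normalized_email` columns; `backfill_normalized_email` is a made-up name, and a Postgres version would use a driver such as psycopg with the same loop shape.

```python
import sqlite3
import time


def backfill_normalized_email(
    conn: sqlite3.Connection, batch_size: int = 1000, pause_seconds: float = 0.0
) -> int:
    """Backfill in small committed batches; safe to stop and rerun."""
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE users SET normalized_email = lower(email)
            WHERE id IN (
                SELECT id FROM users
                WHERE normalized_email IS NULL AND email IS NOT NULL
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()  # end the transaction so locks are released per batch
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total
        time.sleep(pause_seconds)  # give other queries breathing room


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, normalized_email TEXT)"
    )
    db.executemany(
        "INSERT INTO users (email) VALUES (?)",
        [(f"User{i}@Example.COM",) for i in range(2500)],
    )
    print(backfill_normalized_email(db, batch_size=1000), "rows updated")
```

The job is idempotent because it only selects rows that still need work, so a crash halfway through just means the next run picks up the remainder.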

08 Rollback: the question to answer first

Before deploying, answer: "if this breaks, how do I roll back?" The deploy strategy should make rollback obvious:

Rolling: rollback = roll out the previous version. Takes the same time as the original deploy.
Blue-green: rollback = flip the load balancer back. Seconds.
Canary: rollback = stop promoting, scale canary back to 0%. Fast and limited blast radius.

Database migrations complicate rollback. If you deploy a migration that changes data, rolling back the application might leave data the old version can't read. This is why expand-contract is essential.

09 Feature flags: decouple deploy from release

The most powerful pattern: feature flags. Deploy code with new features disabled. Enable them gradually after the deploy is stable.

Benefits:

Decouples deploy risk from release risk. A broken feature can be turned off without redeploying.
Enables true canary at the feature level. Roll out to 1% of users for the new feature only, without changing the rest of the app.
Allows "deployed but not released" code. Trunk-based development with safety.

Discipline required: every flag should have an expiration plan. Flags accumulating in code for years become tech debt.
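
A minimal in-process flag with a percentage rollout and a built-in expiration date might look like the following. This is a sketch, not the API of LaunchDarkly, Unleash, or any other platform; `Flag`, `remove_by`, and `stale_flags` are names invented here.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Flag:
    name: str
    rollout_percent: int  # 0 = off for everyone, 100 = on for everyone
    remove_by: date       # the flag's expiration plan

    def is_enabled(self, user_id: str) -> bool:
        """Stable per-user decision, independent across flags."""
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % 100 < self.rollout_percent


def stale_flags(flags: List[Flag], today: date) -> List[str]:
    """Names of flags that have outlived their removal date."""
    return [flag.name for flag in flags if flag.remove_by < today]


if __name__ == "__main__":
    new_checkout = Flag("new-checkout", rollout_percent=1, remove_by=date(2030, 1, 1))
    enabled = sum(new_checkout.is_enabled(f"user-{i}") for i in range(10_000))
    print(f"{enabled} of 10000 users see the new checkout")
```

Running something like `stale_flags` in CI is one way to enforce the expiration plan: the build can warn or fail once a flag passes its removal date.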

10 Connection draining and graceful shutdown

When an instance is replaced, it needs to:

Signal the load balancer "stop sending me new traffic" (readiness probe returns false)
Wait for the load balancer to actually stop (typically 10-30 seconds)
Finish in-flight requests
Close connections cleanly
Exit

Skipping any step means dropped connections, mid-flight database transactions getting killed, half-completed responses. Configure your application to handle SIGTERM with this sequence.
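
The sequence above can be sketched as a small class. `GracefulServer` and `run_until_sigterm` are hypothetical names, the 15-second drain delay is an assumed default, and a real service would wire `readiness_probe` to its health endpoint and close its actual connection pools in the marked step.

```python
import signal
import threading
import time


class GracefulServer:
    """Tracks readiness and in-flight requests for a hypothetical server."""

    def __init__(self, drain_delay_seconds: float = 15.0) -> None:
        self.drain_delay_seconds = drain_delay_seconds
        self.ready = True
        self._in_flight = 0
        self._cond = threading.Condition()

    def readiness_probe(self) -> bool:
        return self.ready

    def request_started(self) -> None:
        with self._cond:
            self._in_flight += 1

    def request_finished(self) -> None:
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def shutdown(self, finish_timeout_seconds: float = 30.0) -> bool:
        """Run the drain sequence; return True if all requests finished."""
        self.ready = False                    # 1. fail the readiness probe
        time.sleep(self.drain_delay_seconds)  # 2. let the load balancer notice
        with self._cond:                      # 3. finish in-flight requests
            drained = self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=finish_timeout_seconds
            )
        # 4. close database pools and other connections here
        return drained                        # 5. caller exits the process


def run_until_sigterm(server: GracefulServer) -> bool:
    """Block the main thread until SIGTERM arrives, then drain."""
    stop = threading.Event()
    signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())
    stop.wait()
    return server.shutdown()
```

The timeout on step 3 matters: the orchestrator will eventually send SIGKILL, so the total of drain delay plus finish timeout should stay below its termination grace period.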

∞ The compound

Zero-downtime deployments are a discipline, not a feature you turn on. Each piece (connection draining, expand-contract migrations, canary stages, feature flags) is a small investment that compounds. Teams that put in the work can deploy 50+ times a day with no user-visible impact. Teams that skip it deploy once a week at midnight and pray.

The frequency you can safely deploy at is one of the best predictors of engineering velocity. Build the patterns once; reap the benefits forever.