// engineering culture · prioritization

Tech debt — when to pay, when to live with it.

"We have too much tech debt" is the most common complaint in engineering, and one of the most poorly handled. Some debt compounds dangerously and must be paid down. Some debt is fine forever. Treating them the same is how teams end up either drowning in legacy code or "refactoring" things that didn't need touching. The framework for deciding which is which — and how to make the case for the work that matters.

4 debt categories · 3 tests · real framing · © use freely

01 The metaphor, and where it breaks

The original "tech debt" metaphor (from Ward Cunningham) was specific: you ship something quickly to learn from users, then refactor as you understand the domain better. The debt is the gap between your current understanding and the code's structure.

In practice, "tech debt" has expanded to mean "anything I'd write differently." That's too broad to act on. The useful version requires distinguishing between debt that costs you continuously and debt that's just aesthetic.

02 Four categories of debt

  1. Active interest: the code slows you down every time you touch it. Every feature in this area takes 3x longer than it should. Every change has a 50% chance of breaking something else.
  2. Dormant interest: the code is fine as long as nobody touches it. It's old, it's not pretty, but it works and nobody's modifying it.
  3. Compounding: the code is getting worse on its own. More features being bolted on, more workarounds accumulating, more people scared to touch it. The cost is growing.
  4. Cosmetic: you'd write it differently, but it works fine. Different naming convention, slightly outdated framework, mild inefficiency.

Active interest and compounding debt: pay it down. Dormant interest debt: ignore until you need to touch it. Cosmetic debt: never pay it down explicitly.

03 Three tests to determine which

Test 1: How often does this code change? Code touched weekly with high debt is hemorrhaging time. Code touched once a year, even with terrible internals, costs you almost nothing. Look at git log frequency over the past year before deciding.
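
One way to get that number is to count how many commits touched each file. The sketch below is a minimal version, assuming `git` is on the PATH and the script runs inside the repository; the function names are illustrative, not part of any tool.

```python
import subprocess
from collections import Counter


def count_changes(git_log_output: str) -> Counter:
    """Count how many commits touched each path.

    Expects the output of `git log --name-only --pretty=format:`, which
    lists the files changed by each commit, separated by blank lines.
    """
    paths = (line.strip() for line in git_log_output.splitlines())
    return Counter(path for path in paths if path)


def changes_last_year(repo_dir: str = ".") -> Counter:
    """Run git in repo_dir and count per-file commits over the past year."""
    result = subprocess.run(
        ["git", "log", "--since=1 year ago", "--name-only", "--pretty=format:"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return count_changes(result.stdout)
```

Calling `changes_last_year().most_common(10)` lists the ten most frequently changed files, which is usually enough to see whether a refactor candidate is hot or dormant.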

Test 2: How long does a typical change take? If a one-line behavior change requires reading 500 lines to feel safe, the debt is real. If routine changes are routine, the debt is theoretical.

Test 3: How often does it cause incidents? Code that has shown up in 3 postmortems in 6 months is high-priority debt. Code that's never broken is low-priority debt, however ugly it looks.

Run any candidate refactor through these three tests before committing to it. Most "tech debt" candidates fail at least one — they're cosmetic concerns dressed up as technical urgency.
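
The three tests can be written down as a small checklist. This is a sketch only: the cut-offs reuse the illustrative figures from the tests above (weekly changes, 500 lines of reading, 3 postmortems in 6 months), and both the thresholds and the names are assumptions to tune for your own codebase.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    commits_last_year: int          # Test 1: how often the code changes
    lines_read_per_change: int      # Test 2: reading needed to change it safely
    postmortems_last_6_months: int  # Test 3: how often it causes incidents


# Hypothetical cut-offs taken from the examples in the text.
WEEKLY_COMMITS = 52
HEAVY_READING = 500
REPEAT_INCIDENTS = 3


def real_costs(e: Evidence) -> list[str]:
    """List the tests on which the code shows a measurable cost."""
    costs = []
    if e.commits_last_year >= WEEKLY_COMMITS:
        costs.append("changes often")
    if e.lines_read_per_change >= HEAVY_READING:
        costs.append("changes are slow")
    if e.postmortems_last_6_months >= REPEAT_INCIDENTS:
        costs.append("causes incidents")
    return costs


stable_but_ugly = Evidence(
    commits_last_year=1, lines_read_per_change=40, postmortems_last_6_months=0
)
print(real_costs(stable_but_ugly))  # prints []
```

An empty list is the signal that a candidate is probably cosmetic, however ugly the code looks.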

04 When to pay down debt

Three legitimate moments:

  • When you're about to build something new in that area. Refactoring the legacy module you're going to extend is justified by the work that's about to land. This is the Boy Scout rule: leave the code cleaner than you found it.
  • When the debt has caused real incidents. Two outages on the same fragile module is data, not opinion. Invest.
  • When metrics show ongoing velocity loss. Sprint after sprint, feature work in this area takes 2x longer than estimated. Track this; use it as evidence.
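
Tracking velocity loss can be as simple as comparing estimates with actuals for tickets in one code area. A minimal sketch, assuming you can export both numbers in days per ticket; `overrun_ratio` is a hypothetical helper name.

```python
def overrun_ratio(estimated_days: list[float], actual_days: list[float]) -> float:
    """Total actual time divided by total estimated time for one code area."""
    if len(estimated_days) != len(actual_days):
        raise ValueError("each ticket needs both an estimate and an actual")
    return sum(actual_days) / sum(estimated_days)


# Three tickets in the same module: estimated 3, 5 and 2 days,
# delivered in 6, 11 and 4 days.
ratio = overrun_ratio([3, 5, 2], [6, 11, 4])
```

A ratio that stays near 2 across several sprints for one area, while other areas sit close to 1, is the kind of evidence the third bullet describes.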

Illegitimate moments:

  • "I just don't like how it's written."
  • "It uses an old library."
  • "There's a newer framework available."
  • "It would look better if..."

These are aesthetic preferences. They might be right, but they're not urgent. Don't burn political capital on them.

05 Making the case to leadership

Engineers usually argue for tech debt work in engineering terms ("the code is bad," "we're using Express 4 and 5 is out"). Leadership doesn't care. They care about velocity, risk, and cost.

Reframe in business terms:

✗ engineer framing
The payment service is using callbacks everywhere. We should
refactor to async/await. The code would be much cleaner.
✓ business framing
Last quarter, 3 of 7 payment-related bugs traced to the
callback hell in the payment service. Each took ~2 days
to debug because the control flow is opaque.

A focused 2-week refactor would:
- Cut payment-related debug time by an estimated 60%
- Make it safer to add the new payment methods on the roadmap
- Reduce the on-call burden (payment service generates
  40% of our P1 pages)

Payback within a few quarters on debug-time savings alone, and
sooner once the on-call and roadmap benefits are counted.

Same work, completely different reception. Leadership can approve the second. They can't justify approving the first.
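
Before a payback figure goes in front of leadership, it is worth running the arithmetic. The sketch below reuses the illustrative numbers from the business framing; the assumption that the refactor costs one engineer ten working days is mine, not part of the example.

```python
def payback_weeks(refactor_days: float, days_saved_per_quarter: float,
                  weeks_per_quarter: float = 13.0) -> float:
    """Weeks until cumulative savings equal the cost of the refactor."""
    saved_per_week = days_saved_per_quarter / weeks_per_quarter
    return refactor_days / saved_per_week


# Figures from the example; refactor_days assumes one engineer for two weeks.
bugs_per_quarter = 3
debug_days_per_bug = 2
expected_reduction = 0.60
refactor_days = 10

saved = bugs_per_quarter * debug_days_per_bug * expected_reduction  # about 3.6 days
weeks = payback_weeks(refactor_days, saved)  # roughly 36 weeks
```

With these inputs, debug-time savings alone take most of a year to cover the cost, so a shorter payback claim has to rest on the on-call and roadmap savings as well. Leadership will check the math; do it first.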

06 The 20% rule

Teams that maintain healthy codebases dedicate a fixed portion of capacity to debt work — typically 15-25% of engineering time. Not as a separate "cleanup sprint" but as ongoing background work.

How it plays out:

  • Every sprint, 1-2 dedicated debt tickets alongside feature work.
  • Each engineer expected to leave touched code slightly better than they found it.
  • Quarterly: explicit "tech health" review identifying the next priorities.

Teams that don't reserve this time end up with debt that eventually demands a complete rewrite. Teams that reserve too much (40%+) end up under-shipping. Around 20% tends to be the long-run sweet spot.

07 The big rewrite trap

At some debt levels, engineers start proposing rewrites. "Let's rebuild this from scratch in Rust/Go/whatever." This is almost always wrong.

Rewrites are expensive:

  • You spend 6-18 months building the rewrite while the existing system still needs maintenance.
  • The existing system has features and edge cases you've forgotten about.
  • The cutover is a high-risk event with weeks of dual-running.
  • Your team is working on the rewrite instead of new features. Business stalls.

The Joel Spolsky rule from 2000 still applies: rewriting the code from scratch is "the single worst strategic mistake that any software company can make." Strangler-fig migration (gradually replacing pieces while the old system serves traffic) almost always beats rewrites.

08 Estimating debt-paydown work

Engineers chronically underestimate refactoring work. Reasons:

  • You don't know what edge cases the original code handles until you re-encounter them.
  • Tests probably don't exist; adding them can take longer than the refactor itself.
  • Production has weird inputs and corner cases that staging doesn't.

Rules of thumb:

  • Multiply your initial estimate by 2-3x.
  • Add explicit time for writing tests against current behavior before refactoring.
  • Plan for a staged rollout with feature flags, not a big-bang switch.
  • Allocate 25-50% extra for post-launch debugging of cases you missed.
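
Put together, the rules of thumb give a range rather than a single number. A sketch, assuming the post-launch allowance is applied on top of the padded build and test time (the text does not say which base to use):

```python
def estimate_range(initial_days: float, test_days: float) -> tuple[float, float]:
    """Return (low, high) padded estimates in days.

    low:  2x the initial estimate, plus tests, plus 25% post-launch debugging
    high: 3x the initial estimate, plus tests, plus 50% post-launch debugging
    """
    low = (initial_days * 2 + test_days) * 1.25
    high = (initial_days * 3 + test_days) * 1.50
    return low, high


# A refactor first guessed at 5 days, with 3 days to pin down current behavior in tests.
low, high = estimate_range(initial_days=5, test_days=3)
print(low, high)  # prints 16.25 27.0
```

A 5-day gut estimate turning into 16 to 27 days is uncomfortable, but it is closer to what these projects tend to cost.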

09 Patterns for paying debt without breaking things

Strangler fig. Build the new system alongside the old. New code paths use the new system; existing code paths gradually migrate. The old system shrinks over time until it can be deleted.
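
A minimal sketch of the idea, assuming requests can be routed by path prefix. The handlers and the `StranglerFacade` class are hypothetical stand-ins for the old and new systems.

```python
from typing import Callable

Handler = Callable[[str], str]


def legacy_handler(path: str) -> str:
    return f"legacy:{path}"


def new_handler(path: str) -> str:
    return f"new:{path}"


class StranglerFacade:
    """Routes a request to the new system once its path prefix has migrated."""

    def __init__(self, legacy: Handler, new: Handler) -> None:
        self._legacy = legacy
        self._new = new
        self._migrated: set[str] = set()

    def migrate(self, prefix: str) -> None:
        """Mark a path prefix as served by the new system from now on."""
        self._migrated.add(prefix)

    def handle(self, path: str) -> str:
        if any(path.startswith(prefix) for prefix in self._migrated):
            return self._new(path)
        return self._legacy(path)


facade = StranglerFacade(legacy_handler, new_handler)
facade.migrate("/invoices")
print(facade.handle("/invoices/1"))  # prints new:/invoices/1
print(facade.handle("/users/2"))     # prints legacy:/users/2
```

Once every prefix has been migrated, the legacy handler has no callers left and can be deleted.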

Boy Scout rule. Every PR leaves the touched code slightly better than before. No big-bang refactors; just continuous improvement. Compounds dramatically over years.

Branch by abstraction. Introduce an interface between the consumers and the implementation. Migrate consumers to the interface. Then swap the implementation. Lets you migrate gradually without breaking changes.
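
As a rough sketch, with a hypothetical `UserStore` interface standing in for whatever seam fits your code:

```python
from typing import Protocol


class UserStore(Protocol):
    """The abstraction that consumers depend on."""

    def name_of(self, user_id: int) -> str: ...


class LegacyUserStore:
    """Old implementation, left as-is behind the interface."""

    def name_of(self, user_id: int) -> str:
        return {1: "ada", 2: "grace"}[user_id]


class NewUserStore:
    """Replacement implementation, built while the old one still serves."""

    def __init__(self, names: dict[int, str]) -> None:
        self._names = names

    def name_of(self, user_id: int) -> str:
        return self._names[user_id]


def greeting(user_id: int, store: UserStore) -> str:
    """Consumer code: knows the interface, not the implementation."""
    return f"Hello, {store.name_of(user_id)}"


print(greeting(1, LegacyUserStore()))           # prints Hello, ada
print(greeting(1, NewUserStore({1: "ada"})))    # prints Hello, ada
```

The swap is a one-line change at the place where the store is constructed, and it can be reverted just as easily.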

Feature flag the new path. Build the replacement; route a small percentage of traffic to it; verify; expand. Same principle as canary deploys, applied to internal refactors.
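
One common way to route a stable slice of traffic is to hash a user identifier into a bucket. The sketch below assumes a string user ID and a hypothetical flag name; real flag systems add persistence and kill switches on top.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Deterministically place user_id in one of 100 buckets for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def handle(user_id: str, percent: int) -> str:
    """Route to the refactored path for users inside the rollout percentage."""
    if in_rollout(user_id, "new-billing-path", percent):
        return "new"
    return "old"
```

Because the bucket depends only on the flag name and the user ID, a user who sees the new path at 5% keeps seeing it at 25%, which keeps bug reports reproducible while the percentage grows.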

The discipline

Tech debt is real, but "we have tech debt" isn't a project plan. The discipline is identifying specifically which debt costs you measurably, framing it in business terms, paying down the high-cost items continuously, and ignoring the cosmetic stuff that doesn't pay back.

Teams that do this well maintain healthy codebases for decades. Teams that don't either drown in legacy or burn cycles refactoring things that didn't need touching. The framework above is what separates them.