Debugging Production Issues

Production debugging is a different discipline from local debugging. You usually cannot reproduce the problem, you cannot attach a debugger, and every minute you spend exploring is a minute the issue is live. The instinct to start changing things is exactly the instinct to resist.

Stabilize before you diagnose

The first question is never "what is the root cause." It is "how do we stop the bleeding." Roll back the deploy, flip the feature flag, scale up the pool - whatever buys you a calm system to investigate. A root-cause analysis done under active incident pressure is how you turn one outage into two.
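As a minimal sketch of the "flip the feature flag" option, the risky code path can sit behind a switch that is turned off without a code change. Everything named here is hypothetical: the `FEATURE_` environment-variable convention, the `new_pricing` flag, and the `price` function are illustrations, not a specific flag system.

```python
import os

TRUTHY = {"1", "true", "on", "yes"}

def flag_enabled(name, default=False, env=os.environ):
    """Read a flag from the environment (hypothetical FEATURE_<NAME> convention)."""
    raw = env.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY

def price(amount_cents, env=os.environ):
    """Return the price, using the new path only while its flag is on."""
    if flag_enabled("new_pricing", env=env):
        # Hypothetical new behaviour: a 10% discount.
        return amount_cents * 9 // 10
    # Known-good fallback, used whenever the flag is off or unset.
    return amount_cents
```

The flag defaults to off, so a missing or mistyped setting falls back to the known-good path rather than the new one.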

Form a hypothesis, then test it cheaply

Scattershot changes destroy your ability to reason. Write down what you think is happening, then find the single cheapest observation that would confirm or kill it. A log line, a metric, one targeted query. Each test should split the search space, not wander through it.
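One way to make each test split the search space is bisection. The sketch below assumes the candidates are ordered (for example, releases in deploy order), that the last one is known to be bad, and that once a candidate is bad every later one is bad too. The `first_bad` and `is_bad` names are illustrative.

```python
def first_bad(candidates, is_bad):
    """Return the index of the first bad candidate using binary search.

    Assumes: candidates are ordered, the last one is bad, and badness is
    monotonic (every candidate after the first bad one is also bad).
    """
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid        # first bad candidate is at mid or earlier
        else:
            lo = mid + 1    # first bad candidate is after mid
    return lo
```

With eight candidate releases, this needs three checks instead of up to seven. If the monotonic assumption does not hold (an intermittent failure, say), bisection can point at the wrong candidate, so each check needs to be reliable before it is worth running.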

The data usually already exists

Most production mysteries are solved by data you are already collecting and not looking at. Correlate the spike against your deploy timeline. Check whether error rates track a specific tenant, region, or version. The phrase "it started at 14:32" plus a deploy log is often the entire investigation.
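The two checks described above can be sketched in a few lines. The event and deploy record shapes here are assumptions made for illustration (dicts with `tenant`, `version`, `status`, and `at` keys), as is treating any status of 500 or above as an error. Real logs will need their own parsing.

```python
from collections import Counter
from datetime import datetime, timedelta

def error_rate_by(events, key):
    """Return {value: error rate} for one dimension, e.g. 'tenant' or 'version'."""
    totals, errors = Counter(), Counter()
    for event in events:
        totals[event[key]] += 1
        if event["status"] >= 500:  # assumption: 5xx counts as an error
            errors[event[key]] += 1
    return {value: errors[value] / totals[value] for value in totals}

def last_deploy_before(deploys, moment, window=timedelta(minutes=30)):
    """Return the most recent deploy in the `window` before `moment`, or None."""
    recent = [d for d in deploys if moment - window <= d["at"] <= moment]
    return max(recent, key=lambda d: d["at"], default=None)
```

If the error rate is flat across tenants but concentrated in one version, and a deploy of that version landed a few minutes before the spike, the hypothesis more or less writes itself.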

Write it down while it is fresh

The postmortem is not paperwork. It is how the same incident stops happening to the next engineer. Capture the timeline, the actual cause, and - most importantly - what signal would have caught it sooner. The fix is the part you remember; the missing alert is the part you forget.
