Blameless postmortem doc — 5-whys, action items with owners, customer impact statement, and the email you send affected users.
Production was down for 38 minutes on Wednesday. Stripe webhooks queued up, three enterprise customers slacked you, and the on-call ran the rollback at 9:14pm. It's Thursday morning, the customer-success person wants a comms email, the board observer wants a postmortem, and you've never written one. The blog templates online are either Google SRE-grade 14-page docs or one-paragraph Notion pages that say 'we apologize for the inconvenience'. This tool generates the blameless postmortem a real series-seed company actually publishes internally — clean timeline, 5-whys root cause, contributing factors (because it's never just one thing), customer impact stated in real numbers, action items with owners and deadlines, and the email you send to affected customers in the next two hours. Blameless framing throughout — names of systems, not names of people.
Brief used: “B2B SaaS, AI meeting notes product. Last night ~8:36pm-9:14pm PT, transcription jobs were silently failing — customers' meetings recorded but no transcript generated. Returned to user as 'still processing' indefinitely. Root cause looked like: deployed a new version of the transcription worker at 8:31pm, the new code had a config key change we missed, the worker started but threw on first message and silent-restarted in a crashloop. Recovered by rolling back at 9:14pm. Affected: ~340 meetings, ~120 customers. We re-ran the failed jobs overnight, all recovered. No data loss. Need a postmortem to share with the team and the 8 enterprise accounts who flagged it.”
# Incident Postmortem — Transcription Worker Crashloop
**Date:** [Date]
**Duration:** 38 minutes (20:36 - 21:14 PT)
**Severity:** SEV-2 (silent feature failure, no data loss)
**Author:** [Name]
## Executive Summary
A routine deployment of the transcription worker at 20:31 PT introduced a config key change, and from the first job at 20:36 PT the worker crashlooped on every incoming job. From 20:36 to 21:14, 340 meeting transcriptions failed silently: the meeting was recorded and stored, but no transcript was generated and customers saw an indefinite 'processing' state. Root cause: a config key was renamed in the codebase but the corresponding env-var change wasn't applied to the worker fleet, and our deploy alert only watched for HTTP errors, not for worker exit codes. Recovery was a rollback at 21:14 and a batch re-run of all 340 failed jobs overnight. No data loss. ~120 customers were impacted; 8 noticed and reached out directly.
## Timeline
| Time (PT) | What happened | What we observed |
|---|---|---|
| 20:31 | Deployed transcription-worker v1.42.0 to production | Deploy succeeded, all green |
| 20:36 | First job hits new worker, throws on missing config key, worker exits | No alert (worker restarts on exit) |
| 20:36-21:00 | Workers crashloop continuously; jobs pile up in queue | Queue depth alert silent — threshold was 500, depth hit 340 |
| 21:02 | Customer support pings #eng in Slack ('user says transcript stuck') | First human signal |
| 21:04 | On-call engineer opens incident channel | Investigation starts |
| 21:09 | Identified crashloop in worker logs | Root cause hypothesis: config key |
| 21:14 | Rolled back to v1.41.3 | Queue starts draining, transcriptions complete |
| 21:38 | Queue fully drained, all 340 jobs successfully reprocessed | Confirmed no data loss |
| 23:00 | Customer comms email sent to 120 affected accounts | — |
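The headline numbers in this doc (38 minutes of impact, 26 minutes to detection) can be derived directly from the timeline rows. The following is a minimal sketch of that arithmetic, assuming all timestamps fall on the same calendar day; `minutes_between` is a hypothetical helper, not part of any tooling named in this doc.

```python
from datetime import datetime


def minutes_between(start, end, fmt="%H:%M"):
    """Whole minutes from start to end, both 'HH:MM' on the same day."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps copied from the timeline table above.
first_failure = "20:36"
first_human_signal = "21:02"
rollback = "21:14"

print("impact duration:", minutes_between(first_failure, rollback))          # 38
print("time to detect:", minutes_between(first_failure, first_human_signal))  # 26
print("detect to mitigate:", minutes_between(first_human_signal, rollback))   # 12
```

Keeping these three intervals separate makes it easier to see that most of the 38 minutes was spent before anyone knew there was a problem.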
## Root Cause (5 Whys)
1. **Why did transcriptions fail?** The new worker crashed on every job.
2. **Why did the worker crash?** It threw on a missing config key (`TRANSCRIBE_MODEL_VERSION`).
3. **Why was the key missing?** The codebase renamed it from `MODEL_VERSION` to `TRANSCRIBE_MODEL_VERSION` in PR #847, but the env-var update wasn't applied to the worker deployment manifest.
4. **Why wasn't the env-var update applied?** Our deploy process updates the API service manifest but the worker manifest is in a separate file; PR #847 only touched the API manifest.
5. **Why didn't pre-prod catch it?** Our staging environment doesn't run the full worker fleet — only a single worker as a smoke test. The smoke test was deployed with the old env-var, which happened to still be set in staging.
## Contributing Factors
- **Alerting gap:** worker crashloops don't page. We page on HTTP 5xx (API errors) but worker exits-and-restarts are 'normal' behavior, so we filter them out.
- **Manifest drift:** API and worker deployment configs live in two files; no test enforces consistency between them.
- **Staging coverage:** staging worker is a single instance that doesn't catch fleet-level failure modes.
- **Silent UX:** the 'still processing' state has no timeout — a stuck job looks identical to a slow job for hours.
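To illustrate the silent-UX factor, here is a minimal sketch of a display rule that stops showing 'processing' once a job has been pending too long. The 10-minute cutoff is borrowed from the action items in this doc; the function name, the status strings, and the 'delayed' label are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Cutoff taken from the 'stuck transcription' action item (10 minutes).
STUCK_AFTER = timedelta(minutes=10)


def display_state(status, submitted_at, now):
    """Map a job's backend status to the state shown to the user.

    A job still 'processing' past the cutoff is shown as 'delayed',
    so a stuck job no longer looks identical to a slow one.
    """
    if status == "processing" and now - submitted_at > STUCK_AFTER:
        return "delayed"
    return status


submitted = datetime(2024, 1, 1, 20, 36, tzinfo=timezone.utc)  # hypothetical date
print(display_state("processing", submitted, submitted + timedelta(minutes=5)))
print(display_state("processing", submitted, submitted + timedelta(minutes=11)))
```

The same check could also feed an internal metric, since a rising count of 'delayed' jobs is itself an incident signal.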
## Customer Impact
- 340 meeting transcriptions delayed by 38 minutes to 2 hours (depending on when they were reprocessed).
- 120 unique customer accounts impacted.
- 8 customers reached out directly (3 enterprise, 5 self-serve).
- 0 data loss — all meeting audio was preserved and reprocessed successfully.
- 0 paid refund requests as of this writing.
## Action Items
| # | Action | Owner | Due |
|---|---|---|---|
| 1 | Add alert for worker crashloop (>3 exits/min for >2min) | [Eng-lead] | +5 days |
| 2 | Add 'stuck transcription' UI state — show error after 10min, not 'processing' | [PM + Eng] | +14 days |
| 3 | Add CI check enforcing API + worker manifest config-key consistency | [Eng] | +7 days |
| 4 | Expand staging to run 3-worker fleet, add a synthetic transcription job every 2min | [Eng] | +21 days |
| 5 | Add deploy runbook step: verify worker env-vars match codebase after deploy | [On-call rotation] | +3 days |
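Action item 1 can be read as a rule over worker exit timestamps. The sketch below is one possible interpretation, assuming exits are bucketed into one-minute windows and the alert fires when more than 3 exits occur in each of 2 consecutive windows. `crashloop_detected` and its parameters are hypothetical names; a real monitoring system would express this in its own rule language.

```python
from collections import Counter


def crashloop_detected(exit_times, max_exits_per_min=3, sustained_minutes=2):
    """Return True if more than max_exits_per_min worker exits occurred
    in each of sustained_minutes consecutive one-minute buckets.

    exit_times: iterable of exit timestamps in seconds.
    """
    per_minute = Counter(int(t // 60) for t in exit_times)
    streak = 0
    previous_minute = None
    for minute in sorted(per_minute):
        over = per_minute[minute] > max_exits_per_min
        adjacent = previous_minute is not None and minute == previous_minute + 1
        if over:
            # Extend the streak only if the previous bucket was adjacent and over.
            streak = streak + 1 if adjacent and streak > 0 else 1
            if streak >= sustained_minutes:
                return True
        else:
            streak = 0
        previous_minute = minute
    return False


# A worker exiting every 10 seconds for 3 minutes, as in a crashloop.
print(crashloop_detected([i * 10 for i in range(18)]))  # True
# Occasional restarts spread over several minutes.
print(crashloop_detected([0, 70, 140, 500]))  # False
```

Requiring a sustained rate, not a single burst, is what lets ordinary exits-and-restarts stay quiet while a crashloop pages.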
## Lessons Learned
- Worker failures are silent because we treat exits-and-restarts as normal. They aren't, at the rate this incident produced.
- Two deployment manifests with overlapping config is a structural problem, not a process problem. Adding a checklist is a worse fix than enforcing parity in CI.
- A 'processing' state with no timeout hides outages from users until they Slack us. The UI is part of the incident detection surface.
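The parity check mentioned above could look something like the following. This is a minimal sketch, assuming each manifest's env-var section has already been parsed into a plain dict and that the set of keys the code requires is known. `config_key_gaps` and the `"v3"` values are hypothetical; only the two key names come from the incident.

```python
def config_key_gaps(required_keys, manifests):
    """Map each manifest name to the required keys it does not define.

    required_keys: keys the code reads at runtime.
    manifests: dict of manifest name -> dict of env-var name -> value.
    Manifests with no gaps are omitted from the result.
    """
    gaps = {}
    for name, env in manifests.items():
        missing = sorted(set(required_keys) - set(env))
        if missing:
            gaps[name] = missing
    return gaps


# Hypothetical manifest contents mirroring the state after PR #847:
# the API manifest was updated, the worker manifest was not.
required = {"TRANSCRIBE_MODEL_VERSION"}
manifests = {
    "api": {"TRANSCRIBE_MODEL_VERSION": "v3"},
    "worker": {"MODEL_VERSION": "v3"},
}

print(config_key_gaps(required, manifests))
```

In CI, a non-empty result would fail the build, which is what turns the manifest drift from a checklist item into an enforced invariant.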
## Customer Communication Template
**Subject:** Transcription delay last night — what happened and what we did
Hi [Name],
Last night between 8:36pm and 9:14pm PT, transcriptions on our platform were delayed for about 38 minutes due to a deployment issue on our side. Your meeting was recorded and saved without any loss, but the transcript took longer than usual to appear.
We rolled back the change at 9:14pm and re-ran all delayed transcriptions overnight. Everything is now processed and available in your account.
What happened: a config change in our transcription worker wasn't applied correctly, causing the worker to fail on incoming jobs. We caught it 26 minutes in (later than we want) because the failure mode was silent. We're adding an alert that will catch this within 2 minutes if it ever happens again.
We're sorry for the delay. If any of your transcripts are still missing or look wrong, reply to this email and we'll fix it directly.
[Founder name]

Static example. Your run uses Claude live on your specific brief.
Engineering teams writing their first formal postmortem, founders whose company just had a meaningful outage and whose customer comms are overdue, ops leads at small SaaS companies who want a repeatable template, and anyone whose 'incident response' is currently a Slack thread. Not for: large companies with established SRE programs (you already have your template), or B2C consumer products that don't usually publish postmortems.
A complete blameless postmortem doc:

1. Executive summary (3-5 lines an exec or investor reads first).
2. Clean timeline of events with timestamps and what was observed vs what was happening.
3. 5-whys root cause analysis that gets past the surface.
4. Contributing factors section: the systemic things that made this possible (alerting gaps, runbook missing, deploy process).
5. Customer impact statement with real numbers (how many customers, what they saw, for how long).
6. Action items in a table with owner, due date, and link-to-ticket placeholder.
7. Lessons learned (changes to systems and process, not feelings).
8. Ready-to-send customer comms email tuned to severity.

All in markdown: paste into Notion, Linear, or your status-page incident doc.
Things broke for the first time in a way customers noticed. Get a doc you can publish internally and an email you can send within the SLA your contracts require.
It wasn't down — it was slow or partly broken for 3 days. Get a postmortem structured for ambiguous incidents, not just hard failures.
Your design-partner customer's procurement team wants a postmortem doc as part of their vendor review. Deliver one that doesn't look improvised.
Same alert fired three times this month and nothing's been written down. Treat the near-miss as an incident and get the action items recorded before the real one happens.
Yes — the doc names systems and process, not individuals. 'The deploy process' instead of 'the engineer who deployed'. 'The staging environment' instead of 'whoever set up staging'. The 5-whys explicitly stops at systemic factors, not at human factors.
Then the contributing-factors section asks why the system allowed that human error to cause the outage. A typo in a config is a human error; the absence of validation that catches typos is the systemic factor. Both go in the doc.
Up to you and your contracts. Many seed-stage SaaS companies share postmortems with affected enterprise customers but not publicly. The customer-comms email is the minimum; the full doc is for the team and for any enterprise customer who asks.
Wait until it's resolved to write the postmortem — but write the customer comms email now (the tool can generate a 'we're investigating' version too if you describe the live state). Postmortem is retrospective; comms is real-time.
Templates give you the headers. This gives you a real 5-whys analysis applied to your specific incident, an action items table with realistic due-date framing, contributing factors that go past the surface cause, and a customer comms email tuned to the severity. The hard part of a postmortem is the analysis, not the formatting.
Yes. You get an anonymous preview instantly with no signup. Drop your email and you unlock 3 full-length runs per month for Incident Postmortem Template — no credit card. Unlimited runs are $29 one-time, or $19/mo for every tool.
Paid ($29 one-time) unlocks unlimited runs for Incident Postmortem Template, longer outputs from Claude Sonnet, full exports, and priority generation. $19/mo unlocks every tool on JustNeeda.
Free runs render in-browser and can be copy-pasted. Paid unlocks copy-to-clipboard, Markdown, and plain-text exports — and history of every run tied to your account.
No. Every run hits Claude live with your specific input. We don't reuse outputs across users. Your input stays private to your session and account.