Email providers don’t care whether a send came from production or your test runner — they grade you on bounce rate, spam complaints, and engagement. A single CI run that sends 200 emails to test+1@example.com can drop you into provider review for weeks. Here’s why automated test sends are dangerous, how to detect if your codebase is doing it, and the one-prompt fix.
The Problem Nobody Notices Until They’re in Postmark Jail
A new engineer adds an e2e test that exercises the signup flow. The signup flow sends a welcome email. The test runs in CI 30 times a day. Every run sends a real email through your real Postmark / Resend / SendGrid account, to whatever fixture address Playwright happens to be using — e2e-user-2@example.com, test@yourapp.dev, the address of an account you created during local debugging six months ago.
Some of those addresses bounce. Some get marked as spam (Gmail’s filters don’t distinguish “real customer” from “fixture data”). Some go to addresses you used to own but no longer do, which now belong to spam traps run by deliverability monitoring services. Postmark publishes the thresholds: above a 5% hard-bounce rate or a 0.08% spam-complaint rate, your sending server is paused for review. Your real welcome emails start landing in spam folders, then never arriving at all.
The reason this happens is structural. Most teams integrate their email provider long before they add a serious test suite. The first version of the email code uses real API tokens because that’s the only kind anyone has. Tests inherit the same config. Nobody writes a guard because, individually, each test send looks harmless. The damage isn’t from one send — it’s from 200 of them, week after week, with no feedback loop loud enough to notice until customer-facing email goes quiet.
The questions your setup should answer in 30 seconds:
- If a developer adds a new e2e test that touches email, can it accidentally fire through the real provider?
- Can a misconfigured CI run send 100 test emails before anyone notices?
- Is there a single code path that all email sends go through, or are sends scattered across handlers?
- What stops a real production token from being used in the test runner by mistake?
If your answer to any of these is “we trust the developer to remember” or “we have an allowlist,” you have a latent bug — see the gotchas below for why an allowlist alone is not enough.
What “Protecting Your Deliverability” Actually Looks Like
Three pieces, layered so that each one catches what the others can miss:
1. A single sending chokepoint. Every email send in the codebase goes through one function (e.g. sendTemplateEmail, sendPlainEmail). No raw postmark.send(...) calls scattered through handlers, no curl in a one-off script. Without this prerequisite, guards have to be repeated everywhere and one will inevitably be missed.
2. An env-gated dry-run mode at the chokepoint. When EMAIL_DRY_RUN=1 is set, the function returns a provider-shaped success response (status: sent, message id dryrun_*) without making the API call. All downstream code — your email log table, your retry logic, your test assertions — keeps working. The provider is simply never contacted. Deliverability is unaffected by construction, not by convention.
3. A boot-time guard in the test harness. The test runner (Playwright, Cypress, whatever) forces the dry-run flag on every test process AND the spawned dev server. The only way to override it is an explicitly named escape-hatch env var (e.g. EMAIL_E2E_REAL_SEND=1) that logs a loud warning when set. Silent escape hatches get accidentally enabled in CI.
A recipient-side allowlist (which most teams already have) blocks based on who the email is addressed to. The dry-run guard blocks based on what environment the code is running in. Either alone is incomplete. Both together are effectively bulletproof.
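As a rough illustration of pieces 1 and 2, here is a minimal sketch of a chokepoint with an env-gated dry-run branch. The `ProviderSend` type and the `createSendTemplateEmail` wrapper are hypothetical names invented for this example; in a real codebase the injected function would wrap your provider's SDK call.

```typescript
// Sketch only: a single sending chokepoint with an env-gated dry-run mode.
// ProviderSend is a hypothetical stand-in for the real SDK call.

interface TemplateEmail {
  to: string;
  templateAlias: string;
  model: Record<string, unknown>;
}

interface SendResult {
  status: "sent";
  messageId: string;
}

type ProviderSend = (email: TemplateEmail) => Promise<SendResult>;

let dryRunCounter = 0;

// Explicit opt-in only: NODE_ENV is deliberately not consulted.
function isDryRun(): boolean {
  return process.env.EMAIL_DRY_RUN === "1";
}

// Builds the chokepoint around whatever function performs the real send.
function createSendTemplateEmail(providerSend: ProviderSend) {
  return async function sendTemplateEmail(email: TemplateEmail): Promise<SendResult> {
    if (isDryRun()) {
      dryRunCounter += 1;
      // Provider-shaped success; the provider is never contacted.
      return { status: "sent", messageId: `dryrun_${Date.now()}_${dryRunCounter}` };
    }
    return providerSend(email);
  };
}
```

Because the flag is read inside the function on every call, downstream code (log inserts, retries, assertions) sees the same result shape in both modes.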
The One Prompt That Audits and Fixes It
The hard part is doing the audit correctly — finding every send path, not just the obvious ones, and routing them through the chokepoint. Paste this into Claude Code (or Codex) at the root of any codebase that sends email:
Audit this codebase for test-suite email-deliverability risk and, if any
exists, implement an EMAIL_DRY_RUN guard layered with a test-harness
boot-time assertion. Do not change any business logic — only add the
guard.
═══ PHASE 0 — DISCOVER FIRST ═══
Before writing any code, inspect the codebase and write your findings
to `docs/email-test-safety.md`. Cover:
- Email provider in use (we used Postmark; yours may be Resend,
SendGrid, Mailgun, AWS SES, Brevo, etc.). Identify which SDK
package is imported and where.
- Every send call-site. Grep for the SDK's send methods AND for
direct HTTP/fetch calls to the provider's API. List file paths.
Flag any that bypass the chokepoint.
- The current chokepoint (if any). Identify the single function that
every send SHOULD go through. If there isn't one, flag this — the
guard depends on it existing first.
- The existing recipient-side allowlist (if any). Locate it. Note
that this is NOT a substitute for the env guard — see #1 below.
- The test runner(s) in use (Playwright, Cypress, Vitest, Jest,
Pytest, RSpec). Note any test-time env-var setup files.
- Any sample fixture addresses used in tests (test@example.com,
e2e-user-1@yourapp.dev, etc). These are the bouncing-time-bomb
addresses.
Write findings to `docs/email-test-safety.md`. Then write the
implementation plan in the same doc. Stop. I'll confirm before you
implement.
═══ PHASE 1 — IMPLEMENT (after spec approval) ═══
1. WHY THE ALLOWLIST ISN'T ENOUGH — document this in the doc, not
just the code. Recipient-side allowlists block based on who the
email is addressed to. Tests usually use allowed domains because
they have to authenticate as real-looking users. Even when the
allowlist blocks the recipient, the SDK is still imported, still
parses arguments, and depending on library version may still
make a metadata API call. The env-side guard is what guarantees
ZERO network contact with the provider.
2. CHOKEPOINT — ensure one exists. If sends are scattered, FIRST
centralise them into a single sendEmail() / sendTemplateEmail()
module before adding the guard. Without this prerequisite the
guard would have to be repeated at every call site.
3. DRY-RUN STUB at the chokepoint. The send-client factory function
(e.g. getClient()) checks `process.env.EMAIL_DRY_RUN === '1'` and,
if true, returns a stub object with the same method shape as the
real SDK. Each stub method resolves to a synthetic success response
with a recognisable id like `dryrun_<timestamp>_<counter>`. The
downstream code (email-log row inserts, retry logic, return values)
stays unchanged.
EXCEPTION / GOTCHA: do NOT auto-trigger on NODE_ENV=test. Most
test runners (Vitest, Jest) set this and existing unit tests
commonly mock the SDK module directly (vi.mock('postmark', ...)).
If dry-run fires under NODE_ENV=test, the mock's assertions about
the SDK being called will stop matching, and you break N existing
tests. Use an EXPLICIT flag (EMAIL_DRY_RUN=1) only.
4. CACHE-SAFETY. If the send client is module-cached (singleton
pattern), the cached real client persists across the process
lifetime even after the env flag is set. Either check the env
flag on every getClient() call, or invalidate the cache when
isDryRun() flips. For one-shot processes this matters less; for
long-running dev servers it matters a lot.
5. TEST-HARNESS BOOT GUARD. In the e2e test config file
(playwright.config.ts, cypress.config.ts, or equivalent), set
process.env.EMAIL_DRY_RUN = '1' BEFORE defineConfig() is called.
This ensures the flag is set for both the test runner process AND
the spawned dev server.
EXCEPTION / GOTCHA: Playwright's `webServer.reuseExistingServer:
true` means a long-running locally-started `npm run dev` will NOT
pick up the env var from playwright.config.ts — only a newly-
spawned dev server inherits it. ALSO set `webServer.env: {
EMAIL_DRY_RUN: '1' }` to propagate to the spawned process, and
note in the doc that developers must restart their long-running
dev server before running e2e if they want guaranteed coverage.
6. ESCAPE HATCH. Allow EMAIL_E2E_REAL_SEND=1 to disable the dry-run
for occasional manual deliverability tests (e.g. verifying a new
template renders correctly in a real inbox). When set, log a loud
console.warn naming the override AND the provider. Silent escape
hatches get accidentally enabled in CI and nobody remembers why.
7. REGRESSION TEST — THIS IS WHAT KEEPS THE GUARD ALIVE. Add a unit
test that:
- Sets EMAIL_DRY_RUN=1
- Calls each chokepoint method (sendEmail, sendTemplateEmail, etc.)
- Asserts the SDK's send method was NEVER called (use vi.mock or
equivalent)
- Asserts the return value's message id starts with `dryrun_`
Without this test, the dry-run branch is "a comment" — well-
intentioned, easy to delete during refactoring, invisible damage
when removed. The test makes deletion CI-visible.
8. DOC UPDATE. Add a section to your email-system or testing doc
describing:
- That EMAIL_DRY_RUN exists and what it does
- That the test harness sets it automatically
- That the escape hatch (EMAIL_E2E_REAL_SEND=1) should never be
used in CI
- That dry-run rows in the email log have id `dryrun_*` and
should be filtered out of "what did we send this week" queries
═══ PHASE 2 — VERIFY before shipping ═══
- Unit tests pass, including the new regression test
- Type-check clean
- Run the e2e suite once; confirm via process-network logs that NO
request was made to the email provider's API host (greppable via
the provider's domain, e.g. `api.postmarkapp.com`)
- Verify the new doc section is committed and linked from any
higher-level engineering README
- Confirm the email_log table (or your equivalent) shows new rows
with `dryrun_*` ids during the e2e run
Ship as a single PR. The PR's value is the safety guarantee, not the
LOC count — keep the change small and focused.
Adjust the specifics to your provider (Resend, SendGrid, Mailgun, SES) and your test runner (Cypress, Playwright, Pytest). The structure is what does the work.
Notice What the Prompt Is Doing
- Discovery first. Phase 0 forces the agent to enumerate every send-site before assuming the chokepoint is already in place. If sends are scattered, the agent flags it as a prerequisite — guards on a non-chokepointed system require N repetitions instead of one.
- The allowlist explanation is in the doc, not just the code. Phase 1 #1 makes the agent write down WHY a second layer is needed. This is the part that gets challenged in code review (“we already have an allowlist”) and an in-doc justification prevents the guard from being removed by a well-meaning future engineer.
- Three independent failure modes named. NODE_ENV=test triggering the wrong thing (#3), module-level client caching (#4), and Playwright’s reuseExistingServer (#5) are each footguns that would silently defeat the guard. The prompt calls them out by name with a one-line lesson attached.
- The regression test is the live spec. Phase 1 #7 isn’t optional. Without it the dry-run branch is decorative. With it, anyone who removes the guard breaks CI immediately.
- The escape hatch is loud, not silent. Phase 1 #6 specifies a console.warn, not just a different code path. Silent escape hatches become accidental defaults in CI.
What the agent does from this prompt: greps every send site, identifies the chokepoint (or creates one), adds the env-gated stub, wires the test harness boot guard, writes the regression test, documents the whole thing. Two to three files touched; about 80 lines of code added.
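The module-level caching footgun (#4) is the least obvious of the three, so here is one way it could be handled. This is a sketch under assumptions: `EmailClient`, `RealClientFactory`, and `createGetClient` are hypothetical names, and the real client would come from your provider's SDK rather than an injected factory.

```typescript
// Sketch only: a cache-safe client factory that re-checks the flag on every call.

interface EmailClient {
  mode: "real" | "dry-run";
  sendEmail(to: string, subject: string): Promise<{ messageId: string }>;
}

// Hypothetical factory for the real SDK client.
type RealClientFactory = () => EmailClient;

function createDryRunClient(): EmailClient {
  let counter = 0;
  return {
    mode: "dry-run",
    async sendEmail() {
      counter += 1;
      return { messageId: `dryrun_${Date.now()}_${counter}` };
    },
  };
}

function createGetClient(makeRealClient: RealClientFactory) {
  let cached: EmailClient | null = null;
  return function getClient(): EmailClient {
    const wanted = process.env.EMAIL_DRY_RUN === "1" ? "dry-run" : "real";
    // Rebuild when the flag has flipped since the client was cached,
    // so a cached real client cannot outlive EMAIL_DRY_RUN being set.
    if (cached === null || cached.mode !== wanted) {
      cached = wanted === "dry-run" ? createDryRunClient() : makeRealClient();
    }
    return cached;
  };
}
```

The cache still does its job while the flag is stable; it is only invalidated when the desired mode changes.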
What Actually Goes Wrong (Real Gotchas From a Real Setup)
The prompt above came out of an audit against a production codebase. Four findings worth knowing about before you start.
A recipient-side allowlist alone does not protect deliverability.
Our codebase already had an email_config.allowlist_enabled gate that rejected any non-internal recipient, writing a status='blocked' row to the email log and making no Postmark call. But tests use allowed domains too: test@yourcompany.com is exactly the kind of address a Playwright fixture would use, and once that mailbox is deleted (or full, or marked spam) every test run silently degrades sender reputation. Postmark’s documented thresholds are 5% hard-bounce and 0.08% spam-complaint; an e2e suite running 30x/day with a stale fixture address blows past 0.08% within a week.
Lesson: the allowlist is recipient-side; the dry-run guard is sender-side. Both necessary, neither sufficient. The bug pattern is “the test ran, the email went out, the recipient bounced, and reputation degraded silently because no log line said ‘we damaged something.’”
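To get a feel for the numbers, here is a back-of-the-envelope sketch. It assumes one email per CI run and that every test send to the stale address counts as a bad send; both are simplifications, and the function names are made up for this example. The thresholds are the ones quoted above.

```typescript
// Sketch only: how much healthy volume it takes to dilute a fixed number of bad sends.

// Share of all sends that are bad.
function badRate(badSends: number, realSends: number): number {
  return badSends / (badSends + realSends);
}

// Healthy-send volume at which badSends sit exactly on the threshold:
// solves bad / (bad + real) = threshold for real.
function realVolumeAtThreshold(badSends: number, threshold: number): number {
  return (badSends * (1 - threshold)) / threshold;
}

const runsPerDay = 30; // from the scenario above
const weeklyTestSends = runsPerDay * 7; // 210, assuming one email per run

// 5% hard-bounce threshold: about 3,990 healthy sends per week are needed
// just to dilute 210 bounces back down to the limit.
console.log(realVolumeAtThreshold(weeklyTestSends, 0.05));

// 0.08% complaint threshold: a single complaint needs about 1,249 healthy sends to offset.
console.log(realVolumeAtThreshold(1, 0.0008));
```

Under these assumptions, a small sender with a couple of thousand real emails a week is over the hard-bounce line from test traffic alone, and the complaint threshold is tighter still.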
Auto-triggering on NODE_ENV=test breaks unit tests that mock the SDK.
The first instinct is to make dry-run fire on either EMAIL_DRY_RUN=1 OR NODE_ENV=test. That looks safer — Vitest sets NODE_ENV=test, so unit tests would automatically be protected too. It immediately breaks any existing unit test that already uses vi.mock('postmark', …) and asserts that the SDK was called with specific arguments. With dry-run firing under NODE_ENV=test, the mocked SDK is never reached, and every “did the SDK get called” assertion stops matching.
Lesson: use an explicit opt-in env flag, not an implicit NODE_ENV check. Unit tests that mock the SDK directly are a valid pattern; the dry-run guard is for integration/e2e scenarios where the real SDK is loaded. Don’t let the two layers shadow each other.
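One way to keep that policy in a single place is a small pure function that decides the send mode from explicit flags only. This is a sketch: `resolveSendMode` and its parameters are hypothetical names, and the warning text is illustrative.

```typescript
// Sketch only: decide the send mode from explicit flags, never from NODE_ENV.

type SendMode = "dry-run" | "real";
type Env = Record<string, string | undefined>;

function resolveSendMode(
  env: Env,
  provider: string,
  warn: (msg: string) => void = console.warn,
): SendMode {
  // Escape hatch: loud by design, so an accidental CI setting is visible in logs.
  if (env.EMAIL_E2E_REAL_SEND === "1") {
    warn(`[email] EMAIL_E2E_REAL_SEND=1 overrides EMAIL_DRY_RUN: real sends via ${provider} are enabled`);
    return "real";
  }
  // NODE_ENV is intentionally ignored so unit tests that mock the SDK still reach the mock.
  return env.EMAIL_DRY_RUN === "1" ? "dry-run" : "real";
}

console.log(resolveSendMode({ NODE_ENV: "test" }, "Postmark")); // prints "real"
console.log(resolveSendMode({ EMAIL_DRY_RUN: "1" }, "Postmark")); // prints "dry-run"
```

Taking the environment as a parameter keeps the function trivially testable; production code would pass `process.env`.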
Provider “sandbox” tokens are not free.
The first-instinct fix many teams reach for is “use the provider’s sandbox/test token.” Postmark has POSTMARK_API_TEST; Resend has a test mode; SendGrid has sandbox mode. These are real API calls that hit a real endpoint at the provider. They count against rate limits, they fail when the provider has an outage, and on some providers they consume a sliver of your message quota. They also still log a real request in your provider dashboard, which makes filtering “real sends this month” harder. A local in-process stub is faster (no network), free (no quota), offline-safe (CI runs without provider connectivity), and produces a cleaner provider dashboard.
Lesson: test-mode tokens from the provider are a fallback, not the default. If you control the chokepoint, an in-process stub is strictly better — and your tests run in milliseconds instead of seconds.
Playwright’s reuseExistingServer: true means your env var might not reach the dev server.
Setting process.env.EMAIL_DRY_RUN = '1' at the top of playwright.config.ts puts the flag in the test-runner process. But reuseExistingServer: true means if you already have npm run dev running locally, Playwright skips spawning a new server entirely — and the existing dev server’s env was set when YOU started it, not by Playwright. The result: your test process knows it’s in dry-run mode, but the dev server happily makes real Postmark calls because nobody told it. The fix is twofold: set webServer.env: { EMAIL_DRY_RUN: '1' } so newly-spawned servers inherit the flag, AND document that developers must restart their long-running dev server before running e2e.
Lesson: env-var-based guards depend on the env being set in every process that runs the guarded code, not just the orchestrator. Wherever the boundary between processes is, the env propagation must follow.
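A Playwright config along these lines covers both processes. Treat it as a sketch: the `command` and `url` values are placeholders for your own setup, and the `reuseExistingServer: !process.env.CI` choice is an assumption about how you want local runs to behave.

```typescript
// playwright.config.ts (sketch; command and url are placeholders)
import { defineConfig } from "@playwright/test";

// Set the flag before defineConfig() so the test-runner process sees it.
process.env.EMAIL_DRY_RUN = "1";

export default defineConfig({
  webServer: {
    command: "npm run dev",
    url: "http://localhost:3000",
    // Locally this may attach to an already-running dev server,
    // which keeps whatever env it was started with.
    reuseExistingServer: !process.env.CI,
    // Passed to any server that Playwright spawns itself.
    env: { EMAIL_DRY_RUN: "1" },
  },
});
```

Neither setting reaches a dev server that was already running, which is why the restart note in the docs still matters.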
What This Actually Costs
About two hours of an engineer’s afternoon, with the agent doing the implementation and the human doing the review. Three files touched: the email client module, the test-harness config file, one regression test. Around 80 lines of code added, none changed in business logic.
The asymmetry is what makes this work valuable. Two hours of work prevents a class of incident that, when it lands, takes 2–6 weeks of warming up the sender reputation to recover from. Postmark’s “your account is being reviewed” emails are not fun to receive on a Friday afternoon.
The Broader Point
The deliverability-from-tests bug is structural, not careless. Email integration usually predates the test suite, the chokepoint discipline is added later if at all, and the per-test mental overhead of “did I remember to mock this?” is invisible until the day it costs you. By the time you notice, the damage is already weeks deep.
This is the kind of “small invariant, big blast radius” guarantee that AI agents are good at building. The work is well-defined: find every send call-site, identify or build the chokepoint, add the env-gated stub, wire the test harness, write the regression test. The valuable judgement — should the flag be NODE_ENV-triggered or explicit, should the escape hatch be silent or loud, how should dry-run rows be named in the email log — sits with the human reviewer. The agent does the wiring; the policy is yours.
Built on PlanB, a Bubble.io backup service. Stack: Next.js 16 App Router, Supabase, Postmark for transactional email, Playwright for e2e, Vitest for unit, deployed to AWS ECS. The work was done with Claude Code (Opus 4.7) in a single session — three files changed, twelve tests pass, type-check clean — with about two hours of human review and design discussion on top.