Every production SaaS has 25–40 secrets spread across six or more stores. Most teams could not, if asked at 11pm on a Saturday, enumerate them all from memory — and that’s why secret rotations get deferred indefinitely. Here’s the business case for fixing it and the one-prompt audit that turns “we should document this” into a 4-hour reality.
The Question Every CTO Eventually Hits
It’s a Tuesday. Stripe emails you that one of your API keys has been seen in a public GitHub commit (yours or a contractor’s, it doesn’t matter). You have 24 hours to rotate the key before Stripe expires it automatically. You think you can do it in an afternoon.
You can’t. Here’s the sequence that actually plays out:
- Where’s the live key stored? AWS Secrets Manager? SSM? Vercel env? .env.production checked into a private repo? All of the above?
- Which environment(s) does it live in: prod, staging, admin sub-app?
- Is it ALSO mirrored as a GitHub Actions repo secret for CI?
- Is it baked into the Docker image at build time, or read at runtime?
- Is the STRIPE_TEST_KEY in CI a different value from STRIPE_SECRET_KEY in prod? (Yes, and crossing them once is a customer-data incident.)
- Does any teammate’s .env.local need the new value?
- Does Stripe support an overlap window, or is this an instant cutover?
- What’s the redeploy choreography: SSM update first, then ECS task swap, or the other way around?
This isn’t theoretical. Every SaaS founder I know has had at least one of: “rolled the key, broke webhook signature verification for 4 hours,” “rotated Postmark, two days of welcome emails silently failed,” “found out we had three copies of the Supabase service-role key and only updated one.”
The reason isn’t ignorance. The reason is that the inventory doesn’t exist. Nobody has ever sat down and written a single document that lists every secret, every store, every reader, every rotation procedure. So when a rotation lands at 4pm on Friday, half the work is rediscovering the system from scratch.
What “Solving It” Actually Looks Like
The deliverable is one document — typically docs/secrets-management.md or similar — that answers these five questions cold:
- Where does each secret live, in every store? (SSM, GitHub secrets, local env, application database, third-party dashboards, hardcoded literals — yes, hardcoded literals; more on that below.)
- What classes are they? Platform-issued (Stripe, Postmark — they own the lifecycle), internal-minted (you mint them; you rotate them), per-tenant (customer credentials, encrypted-at-rest in your DB).
- What does each secret unlock? The exact file paths in your codebase that read it. When the rotation breaks something, you want to know which symptom maps to which value.
- How do you rotate each one without downtime? Per-issuer procedure with the overlap-window detail. Stripe gives you 12 hours; Postmark gives you forever (multiple-token-support); your own webhook bearer gives you zero seconds.
- What’s drifted? Which secrets exist in prod SSM but not staging? Which CI secrets are referenced by workflows but missing from the secret store? Which production secrets are older than a year?
This kind of audit looks like it should take a week. It doesn’t, because the work is mechanical: grep the codebase, list the SSM keys, read the deploy workflow’s secrets:[] block, cross-reference with .env.example. It’s exactly the kind of “tedious but well-defined” work AI coding agents handle in an afternoon, if you frame the prompt right.
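To make “grep the codebase, cross-reference with .env.example” concrete, here is a minimal sketch in Python. It assumes the application reads secrets via process.env (as a Node or Next.js app would). The file paths in the sample data are hypothetical, and the function names are illustrative rather than part of any real tool.

```python
import re

# Matches process.env.NAME as well as process.env["NAME"] / process.env['NAME'].
ENV_READ = re.compile(
    r"""process\.env(?:\.([A-Z][A-Z0-9_]*)|\[\s*["']([A-Z][A-Z0-9_]*)["']\s*\])"""
)


def find_env_reads(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each env var name to the 'path:line' locations that read it."""
    readers: dict[str, list[str]] = {}
    for path, text in sorted(sources.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in ENV_READ.finditer(line):
                name = match.group(1) or match.group(2)
                readers.setdefault(name, []).append(f"{path}:{lineno}")
    return readers


def parse_env_example(text: str) -> set[str]:
    """Collect variable names from KEY=value lines, skipping comments and blanks."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        names.add(line.split("=", 1)[0].removeprefix("export ").strip())
    return names


# Hypothetical sample data standing in for real files on disk.
sources = {
    "lib/email/unsubscribe.ts": "const secret = process.env.UNSUBSCRIBE_TOKEN_SECRET;\n",
    "lib/billing/stripe.ts": 'import Stripe from "stripe";\n'
                             'const key = process.env["STRIPE_SECRET_KEY"];\n',
}
env_example = "# billing\nSTRIPE_SECRET_KEY=\n"

readers = find_env_reads(sources)
undocumented = sorted(set(readers) - parse_env_example(env_example))
print(readers)
print("read in code but missing from .env.example:", undocumented)
```

The readers map doubles as the reader-path column of the inventory: each name comes with the exact file and line that will fail if the value is wrong.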
The One Prompt That Does It
Copy this into Claude Code (or Codex) at the root of any production codebase:
Build a complete secrets inventory + rotation procedures + drift
detection for this codebase. Output is a single living document at
`docs/secrets-management.md`.
═══ PHASE 0 — DISCOVER FIRST ═══
Before writing anything, inspect the codebase and write your findings
to `docs/secrets-management.md`. Cover:
- Every secret STORE the system uses. For our reference SaaS this
was six:
- A cloud secret store with per-environment namespaces
(we used AWS SSM Parameter Store: `/planb/`, `/planb-staging/`,
`/planb-admin/`)
- CI-platform repo secrets (we used GitHub Actions repo secrets)
- Local `.env.local` files (developer machines)
- A `private.secrets` table in the application Postgres for
per-tenant customer credentials (we used Supabase with
SECURITY DEFINER RPCs and a `private` schema)
- Third-party issuer dashboards (Stripe, Postmark, Cloudflare,
Supabase, OpenRouter — the actual source of truth)
- Hardcoded source literals in any sibling repo (we found two
in our AWS Lambda repo — see #4 WORST-CASE below)
Your project may have fewer stores or different ones (Vercel env,
Doppler, 1Password, Kubernetes Secrets, AWS Secrets Manager,
HashiCorp Vault). List what's actually there.
- The deploy mechanism that READS the secrets at runtime. For us
this was the ECS task definition's `secrets:[]` block declared in
`.github/workflows/deploy-app.yml`. That YAML block is the de-facto
contract for "what env vars does the running container expect."
Yours might be a Vercel/Netlify env dashboard, a Kubernetes Secret
mount, a Pulumi/Terraform module, an SST stack file.
- The IAM principals (or platform equivalents) that can READ each
store. For us this was the ECS task execution role, the OIDC
deploy role, a `planb-ops` IAM user with `AmazonSSMFullAccess`,
and a legacy IAM user with over-broad permissions still attached
(one of the drift items the audit found).
- Cross-repo coupling. List every secret that is duplicated between
this repo and any sibling repo, Lambda, Worker, Cloud Function,
or external service. This is where the worst drift hides.
Write all findings to `docs/secrets-management.md`. Then write the
implementation plan in the same doc. Stop. I'll confirm before you
implement.
═══ PHASE 1 — IMPLEMENT (after spec approval) ═══
1. INVENTORY MAP — a table at the top of the doc listing every secret
STORE with: what lives there, how secrets are issued INTO it, what
code READS from it. One row per store. Five-to-eight rows. Make it
scannable.
2. CLASS-THE-SECRETS — split every secret into three buckets:
- PLATFORM secrets (one value per environment; lifecycle owned by
the third-party issuer — Stripe, Postmark, Supabase, etc.)
- INTERNAL-MINTED secrets (we generate them; they live entirely
within our stores — webhook bearers, signing secrets,
health-check tokens)
- PER-TENANT secrets (customer-supplied, stored encrypted-at-rest
in the application database; never displayed back; only hints
shown — first 3 + last 2 chars)
3. PER-SECRET INVENTORY — for each secret, a row in a per-class table
listing: cloud-store presence per environment, CI-secret presence,
`.env.example` entry, exact code-reader file paths, and the issuer
(dashboard URL or "internal-minted, see Rotation §"). The reader-
path columns are not optional — when rotating, you need to know
which code will fail.
4. CROSS-REPO COUPLING — THE WORST-CASE PATTERN to find and document.
If any secret is stored both in your cloud secret store AND ALSO
as a hardcoded string literal in another repo (Lambda, Worker, etc),
call it out by name with the exact file path and line number. We
found two: `LAMBDA_CALLBACK_TOKEN` and `LAMBDA_GETSCHEDULES_TOKEN`,
each as a 69-character string-literal default in our Lambda repo's
constructor. Rotation becomes a multi-store, multi-deploy, zero-
overlap-window operation instead of a one-shot SSM update.
Document the rotation choreography step-by-step. Recommend moving
the cross-repo side to read from env-supplied-from-SSM instead of
a string literal, and file a follow-up issue for it.
5. ROTATION PROCEDURES — one subsection per secret CLASS, then per
issuer where it differs. For each, document:
- The dashboard or CLI command that mints the new value
- Whether the issuer supports an OVERLAP WINDOW (Stripe: yes, 12h.
Postmark: yes, indefinite — multiple-token-support. Internal
bearers: NO — instant cutover.)
- Every store that needs updating, in dependency order
- The redeploy required after the SSM update (and which services)
- The verification step BEFORE expiring the old value
6. ACCESS CONTROL — list every IAM principal (or platform equivalent)
with read access to each store. Note any legacy permissions that
should be cleaned up. Cross-reference with the IAM-rationalisation
plan if one exists.
7. DRIFT GAPS — name every inconsistency between stores as a numbered
item (D1, D2, ...). The patterns to watch for:
- "Secret X exists in `/prod/` but not in `/staging/`. The Y code
path will fail-soft (worse than fail-hard — it goes unnoticed)."
- "Secret Y is in SSM but missing from the bootstrap-uploader
script (`aws-push-secrets.sh` in our case) — a fresh environment
provisioning would silently omit it."
- "Secret Z's `LastModified` date is unknown — cloud-store upload
time is NOT the same as issuer rotation time. Manual audit
against the issuer dashboard required."
Each gap gets a severity (High / Medium / Low) and a named owner.
8. OPS CHECK — write a `scripts/ops/check-secret-drift.sh` (or your
project's ops-script equivalent). It should diff:
- `.env.example` ↔ cloud-store keys per environment
- cloud-store keys ↔ deploy-workflow `secrets:[]` block
- CI-secret references in `.github/workflows/*.yml` ↔
`gh secret list`
- Per-tenant secrets older than 180 days (proactive nudge fuel)
Wire into the daily ops report. Any new drift becomes visible
within 24 hours, not 6 months later when a customer reports
it broken.
═══ PHASE 2 — VERIFY before shipping ═══
- Each Dx drift gap in the doc has a clear severity + owner
- `check-secret-drift.sh` runs cleanly and produces actionable
output (no false positives that drown the signal)
- At least one rotation procedure is end-to-end tested in staging
(or, if no staging exists, documented well enough that a
third-party engineer could execute it cold)
- The cross-repo hardcoded-literal items (if any) have a follow-up
issue filed for the source-literal removal
- The doc references the deploy workflow / IaC files by exact path
so future-me can find them
Ship as a single PR. The PRD-style document is the deliverable — don't
split it. The ops-check script can be a separate small PR.
Adjust the specifics to your stack — swap AWS SSM for Doppler or Vault, Stripe/Postmark for whatever issuers you use, Supabase for your DB. The structure of the prompt is what does the work.
Notice What the Prompt Is Doing
- Discovery first. Phase 0 forces the agent to map the actual store topology before assuming anything. It names our reference stores (SSM, GitHub secrets, Postgres, hardcoded literals) so the agent has concrete patterns to match against — but explicitly tells it to adapt to whatever’s actually there. The reader’s stack will differ; the audit shape won’t.
- Three secret classes, not one bucket. Platform / internal-minted / per-tenant secrets have different rotation choreographies and different blast radii. Mixing them into one “secrets” list hides the gradient. Naming the classes in Phase 1 #2 forces the agent to keep them separate.
- The worst-case pattern named upfront. Cross-repo hardcoded literals (#4 in the prompt) are the secret-management failure mode that costs teams hours of downtime when nobody knew they existed. Calling out the exact pattern — “stored both in your cloud secret store AND ALSO as a hardcoded literal in another repo” — guides the agent to look for it explicitly, with file paths and line numbers.
- Reader paths are mandatory, not optional. The single most useful column in the per-secret inventory is “what code reads this value.” Without it, the inventory is decoration. With it, every rotation gets a 30-second pre-flight check.
- Drift detection as a script, not a one-shot scan. Phase 1 #8 wires drift detection into the ongoing ops report. Without that, the audit is only as fresh as the day someone last ran it. Drift is the half-life problem of secrets management.
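The three-class split and the “hints only” rule for per-tenant values can be sketched as a small data model. This is an illustration under assumptions, not the schema the audit produced: the type names, field names, and reader paths below are all hypothetical. Only the class names and the “first 3 + last 2 chars” hint rule come from the prompt.

```python
from dataclasses import dataclass
from enum import Enum


class SecretClass(Enum):
    PLATFORM = "platform"                # issuer owns the lifecycle (Stripe, Postmark)
    INTERNAL_MINTED = "internal-minted"  # we generate it and we rotate it
    PER_TENANT = "per-tenant"            # customer-supplied, encrypted at rest


@dataclass(frozen=True)
class SecretEntry:
    name: str
    secret_class: SecretClass
    readers: tuple[str, ...]  # code paths that read the value


def hint(value: str, head: int = 3, tail: int = 2) -> str:
    """Return a display hint that exposes only the first and last few characters."""
    if len(value) <= head + tail:
        return "*" * len(value)  # too short to reveal anything safely
    return f"{value[:head]}...{value[-tail:]}"


def by_class(entries: list[SecretEntry]) -> dict[SecretClass, list[str]]:
    """Group secret names by class so each class gets its own inventory table."""
    grouped: dict[SecretClass, list[str]] = {c: [] for c in SecretClass}
    for entry in entries:
        grouped[entry.secret_class].append(entry.name)
    return grouped


# Hypothetical reader paths, for illustration only.
entries = [
    SecretEntry("STRIPE_SECRET_KEY", SecretClass.PLATFORM, ("lib/billing/stripe.ts",)),
    SecretEntry("LAMBDA_CALLBACK_TOKEN", SecretClass.INTERNAL_MINTED, ("app/api/callback/route.ts",)),
]
grouped = by_class(entries)
print(grouped[SecretClass.PLATFORM])
print(hint("sk_live_abcdef123456"))
```

Keeping the class on every entry is what lets the rotation section branch per class instead of treating all secrets alike.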
What Actually Goes Wrong (Real Findings From Our Audit)
The prompt above didn’t come out of theory. It came out of an audit that found ten genuine drift items in a production system. The instructive ones:
Two production bearer tokens were string literals in another repo.
Our cross-repo Lambda backend authenticated callbacks to the main app using a bearer token. The token was in AWS SSM on the main-app side; on the Lambda side it was a 69-character string literal at line 9 of bubble-callback/index.js. Every constructor call defaulted to the literal. To rotate, we’d have to: change the literal, redeploy the Lambda stack via SAM, push the new value to SSM, force a new ECS deploy of the main app, and accept a several-second window where callbacks fail auth between Lambda-deploy-complete and ECS-task-swap-complete.
Lesson: the riskiest secrets are the ones where rotation requires coordinated deploys across multiple repos. Audit for hardcoded literals in any sibling repo by name. They’re invisible to every drift-check tool that operates on store-to-store comparisons.
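One way to look for this pattern is to scan the sibling repo for long, whitespace-free string literals. The sketch below is a heuristic under assumptions: the 40-character threshold is arbitrary, the sample source and its path are invented, and it will also flag long URLs or hashes, so every hit needs a human look. It reports location and length only, so the scan output never repeats a secret.

```python
import re


def find_long_literals(path: str, text: str, min_len: int = 40) -> list[str]:
    """Report 'path:line (N chars)' for quoted literals with no whitespace inside."""
    # Same quote character must open and close the literal (\1 backreference).
    pattern = re.compile(r"""(["'`])([^"'`\s]{%d,})\1""" % min_len)
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            literal = match.group(2)
            hits.append(f"{path}:{lineno} ({len(literal)} chars)")
    return hits


# Invented sample: a constructor whose default argument is a 69-character literal.
sample = (
    "class Callback {\n"
    '  constructor(token = "' + "x" * 69 + '") {\n'
    "    this.token = token;\n"
    "  }\n"
    "}\n"
)
hits = find_long_literals("lambda/index.js", sample)
print(hits)
```

A stricter follow-up, if you already know a secret’s value, is to search sibling repos for that exact value. That catches short tokens this heuristic would miss.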
Staging was missing two secrets that production had.
UNSUBSCRIBE_TOKEN_SECRET and GA4_API_SECRET existed in /planb/ SSM but not in /planb-staging/. The email-send code reads UNSUBSCRIBE_TOKEN_SECRET to HMAC-sign one-click unsubscribe URLs (RFC 8058). In staging it would either throw at startup OR — much worse — silently sign with undefined, producing URLs that look valid but fail verification. The GA4 helper was fail-soft and just warned. Both were invisible until the audit ran.
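The “silently sign with undefined” failure can be closed off by refusing to sign when the secret is absent. Here is a minimal sketch in Python; the same guard applies in whatever language the app itself uses. The function names and the choice of HMAC-SHA256 over the recipient address are assumptions for illustration, not the app’s actual token format.

```python
import hashlib
import hmac


class MissingSecretError(RuntimeError):
    """Raised instead of signing with an empty or absent key."""


def require_secret(env: dict[str, str], name: str) -> bytes:
    value = env.get(name, "")
    if not value.strip():
        # hmac.new() accepts an empty key without complaint, so guard explicitly.
        raise MissingSecretError(f"{name} is not set; refusing to sign")
    return value.encode("utf-8")


def sign_unsubscribe(env: dict[str, str], recipient: str) -> str:
    key = require_secret(env, "UNSUBSCRIBE_TOKEN_SECRET")
    return hmac.new(key, recipient.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_unsubscribe(env: dict[str, str], recipient: str, token: str) -> bool:
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(sign_unsubscribe(env, recipient), token)


# In a real process, env would be os.environ; a plain dict keeps the sketch testable.
env = {"UNSUBSCRIBE_TOKEN_SECRET": "example-only-secret"}
token = sign_unsubscribe(env, "user@example.com")
print(verify_unsubscribe(env, "user@example.com", token))

try:
    sign_unsubscribe({}, "user@example.com")  # staging without the secret
    failure = ""
except MissingSecretError as exc:
    failure = str(exc)
print(failure)
```

The point of the guard is to turn a fail-soft path into a fail-hard one: a missing value surfaces on the first send rather than weeks later as unverifiable links.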
Lesson: staging-vs-prod SSM drift is the most common production-secrets bug, and it always favours silent failure over loud failure. Any audit that doesn’t explicitly diff per-environment is missing the most important comparison.
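The per-environment diff itself is a few lines of set arithmetic. The sketch below assumes you have already listed the full parameter names for each prefix, for example with aws ssm get-parameters-by-path --path /planb/ --recursive. The helper names are illustrative, and the sample lists only a subset of keys.

```python
def names_under(prefix: str, parameter_names: list[str]) -> set[str]:
    """Strip an environment prefix such as '/planb/' from full parameter names."""
    return {n[len(prefix):] for n in parameter_names if n.startswith(prefix)}


def env_drift(envs: dict[str, set[str]]) -> dict[str, list[str]]:
    """For each environment, list names that exist elsewhere but are missing here."""
    everything = set().union(*envs.values())
    return {
        env: sorted(everything - names)
        for env, names in envs.items()
        if everything - names
    }


# The trailing slash matters: '/planb/' must not also match '/planb-staging/'.
all_names = [
    "/planb/STRIPE_SECRET_KEY",
    "/planb/UNSUBSCRIBE_TOKEN_SECRET",
    "/planb/GA4_API_SECRET",
    "/planb-staging/STRIPE_SECRET_KEY",
]
drift = env_drift({
    "prod": names_under("/planb/", all_names),
    "staging": names_under("/planb-staging/", all_names),
})
print(drift)
```

An empty result means the environments agree on names. It says nothing about whether the values are correct, which still needs a per-secret check.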
The bootstrap script was three secrets behind the deploy workflow.
Our aws-push-secrets.sh script reads .env.local and pushes to SSM — the entry point for any environment provisioning. It hadn’t been updated when three new secrets (UNSUBSCRIBE_TOKEN_SECRET, GA4_ID, GA4_API_SECRET) were added to the deploy workflow’s secrets:[] block. A fresh environment provisioned with the script would silently come up missing those three values.
Lesson: the bootstrap script and the deploy workflow are two ends of the same contract. They must stay in sync, and the only way they will is if a drift-detector tells you when they don’t.
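A check for this contract can be sketched as follows. It assumes the secrets block has been extracted from the workflow as a JSON list of {"name", "valueFrom"} objects (the shape ECS task definitions use) and that you have the set of names the bootstrap script pushes. Both inputs and the valueFrom values are illustrative.

```python
import json


def contract_names(secrets_json: str) -> set[str]:
    """Names of the env vars the running container expects."""
    return {entry["name"] for entry in json.loads(secrets_json)}


def bootstrap_gaps(secrets_json: str, pushed_names: set[str]) -> dict[str, list[str]]:
    """Compare the deploy contract against what the bootstrap script uploads."""
    expected = contract_names(secrets_json)
    return {
        "missing_from_bootstrap": sorted(expected - pushed_names),
        "not_in_deploy_contract": sorted(pushed_names - expected),
    }


# Illustrative extract of a secrets block; the valueFrom paths are assumptions.
deploy_secrets_json = """
[
  {"name": "STRIPE_SECRET_KEY", "valueFrom": "/planb/STRIPE_SECRET_KEY"},
  {"name": "UNSUBSCRIBE_TOKEN_SECRET", "valueFrom": "/planb/UNSUBSCRIBE_TOKEN_SECRET"},
  {"name": "GA4_ID", "valueFrom": "/planb/GA4_ID"},
  {"name": "GA4_API_SECRET", "valueFrom": "/planb/GA4_API_SECRET"}
]
"""
gaps = bootstrap_gaps(deploy_secrets_json, pushed_names={"STRIPE_SECRET_KEY"})
print(gaps)
```

Reporting both directions is deliberate: a name the bootstrap pushes but no container reads is usually a leftover that can be retired.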
“Last rotated” is a lie everywhere.
SSM’s LastModifiedDate tells you when the value was uploaded TO SSM — not when the issuer rotated the underlying secret. We had Stripe live keys that SSM thought were “rotated 2026-04-23” but Stripe’s dashboard showed the actual key issuance was 2024. The cloud-store-modified-time and the issuer-rotated-time are the same number only on day one.
Lesson: if you care about rotation hygiene, you need to either record the issuer-side rotation timestamp manually, or accept that “last rotated” is a known-unknown. There’s no automated way to bridge this gap that I’m aware of.
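If you take the manual route, the record can be as small as a name-to-date mapping that a script checks against the store’s modified dates. The sketch below assumes such a hand-maintained ledger. The issuer date shown is hypothetical (the audit established only the year), and the one-year threshold is an illustrative default.

```python
from datetime import date


def stale_secrets(issuer_rotated: dict[str, date], today: date,
                  max_age_days: int = 365) -> list[str]:
    """Names whose issuer-side rotation is older than the allowed age."""
    return sorted(
        name for name, rotated in issuer_rotated.items()
        if (today - rotated).days > max_age_days
    )


def store_date_gaps(issuer_rotated: dict[str, date],
                    store_modified: dict[str, date]) -> dict[str, int]:
    """Days by which the store's modified date overstates the secret's freshness."""
    return {
        name: (store_modified[name] - rotated).days
        for name, rotated in issuer_rotated.items()
        if name in store_modified
    }


# Hand-recorded from the issuer dashboard; the exact day here is hypothetical.
issuer_rotated = {"STRIPE_SECRET_KEY": date(2024, 6, 1)}
# What the cloud store reports as its last-modified date.
store_modified = {"STRIPE_SECRET_KEY": date(2026, 4, 23)}

stale = stale_secrets(issuer_rotated, today=date(2026, 5, 1))
gaps = store_date_gaps(issuer_rotated, store_modified)
print(stale, gaps)
```

A large gap does not mean anything is broken. It only marks the secrets where the store’s date should not be read as a rotation date.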
Public values still need rotation discipline.
NEXT_PUBLIC_SUPABASE_ANON_KEY is technically public: it’s baked into the browser bundle. People assume “public” means “no rotation needed.” It doesn’t. The anon key is the credential every browser request presents, and RLS policies decide what that anon role is allowed to do; rotating it without a coordinated bundle rebuild + cache-bust will break every live browser session.
Lesson: “public” and “rotates with no fuss” are different properties. The audit doc should track build-time-public values alongside server-side secrets, with the additional note that browser caches will hold the old value until they reload.
What This Costs
Four hours of an engineer’s afternoon, with the agent doing the mechanical inventory work and the human doing two things: providing access (so the agent can inspect SSM, list IAM principals, read the deploy workflows) and reading the draft to catch the things the agent couldn’t see (issuer dashboards, MFA configuration, team-access reality vs documented).
The output is a permanent document that pays back the next time anyone needs to rotate any secret. Rotation is a continuous activity: the half-life of a production secret is somewhere between six months and never, and “never” is the wrong answer.
The Broader Point
Most production systems accumulate secrets debt the same way they accumulate every other kind of debt: one well-justified addition at a time, with no overall view of what’s been added, where, or how the pieces interact. The day you need that overall view — for a rotation, an incident, an audit, a team handover — is also the day you don’t have time to build it.
An AI coding agent is exactly the right tool for this work. It’s tedious-but-mechanical: grep for process.env.*, list SSM parameters, parse the deploy workflow YAML, cross-reference with .env.example, generate a table. The valuable judgement — what to do about the gaps — sits with the human. The agent does the inventory; you do the policy. Four hours of combined attention turns a permanent “we should document this” backlog item into a permanent reference.
The secrets-debt problem isn’t that it’s hard. It’s that nobody on the team has ever had the four hours to do it cold. Now they do.
Built on PlanB, a Bubble.io backup service. Stack: AWS ECS Fargate + SSM Parameter Store, Supabase Postgres, Stripe Checkout, Postmark, deployed via GitHub Actions OIDC. The audit ran against a system with 27 secrets across six stores and surfaced ten drift items. The work was done with Claude Code (Opus 4.7) in a single afternoon, with about four hours of human review on top to validate findings against the issuer dashboards.