April 11, 2026
10 mins

From $284 to $88: What Happened When We Asked Claude Code to Cut Our AWS Bill

Whiteboard sketch showing AWS Lambda cost investigation with CloudWatch logs, memory analysis, streaming refactor, and cost reduction from $284 to $88

A real case study in AI-assisted infrastructure work — where the interesting part wasn’t the cost saving. It was what we found along the way.

The Question

“Can you analyse my AWS bill — what can I do to reduce my Lambda charges?”

That was the prompt. One sentence, typed into Claude Code on a Tuesday morning. What followed was a four-day investigation that cut our Lambda bill by 73%, uncovered a production reliability issue nobody knew about, and fundamentally changed how we think about AI-assisted infrastructure management.

This isn’t a story about prompt engineering. It’s a story about what happens when you give a capable AI agent direct access to your AWS account, your codebase, and your CloudWatch logs — and then let it follow the evidence wherever it leads.

The System

We run a Bubble.io backup platform on AWS serverless architecture. Lambda functions pull data from customer Bubble apps via the Data API, convert it to CSV, and store it in S3. SQS queues coordinate the work. CloudFormation manages the infrastructure. Hundreds of customer apps, running hourly backup schedules, processing thousands of tables.

The Lambda bill for March 2026 was $284. Not enormous. But we knew it was higher than it needed to be — we just hadn’t had the time to investigate.

What We Expected

The obvious play: find over-provisioned Lambda functions, reduce their memory allocations, save money. This is standard cloud cost optimisation. Run the numbers, find the waste, trim it. We expected Claude Code to identify a few functions with too much memory, suggest lower values, and we’d deploy the changes in an afternoon.

That’s not what happened.

What Actually Happened

Claude Code started where any good engineer would — querying AWS Cost Explorer to see where the money was going. Within minutes, it had the answer: 79% of the Lambda bill was one function, backupFromDateFunction, running at 4000 MB with 82,000 invocations per month. The obvious next step was to check whether 4000 MB was actually needed.

So it ran a CloudWatch Logs Insights query against the function’s REPORT lines to measure real memory usage. This is where the story stopped being about cost optimisation and became something else entirely.
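
A query of that shape looks roughly like this in CloudWatch Logs Insights (a sketch using the standard Lambda REPORT fields, not the exact query from the investigation — `@maxMemoryUsed` and `@memorySize` are reported in bytes):

```
filter @type = "REPORT"
| stats count(*) as invocations,
        avg(@maxMemoryUsed / 1000 / 1000) as avgMemMB,
        pct(@maxMemoryUsed / 1000 / 1000, 99) as p99MemMB,
        max(@memorySize / 1000 / 1000) as allocatedMB
```

Comparing the P99 column against the allocated column is what surfaced the bimodal distribution described next.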

The Discovery Nobody Expected

The memory distribution was bimodal. 77% of invocations used less than 500 MB. But 1.7% were hitting the 4000 MB ceiling — and some were crashing with V8 FatalProcessOutOfMemory errors.

121 crashes in four days. Silent. No alerts. Customer backups failing without anyone knowing.

Claude Code traced the crashes to a specific customer backing up their internal log table with ten years of history. The function was accumulating every paginated row from the Bubble API into a single in-memory array, then serialising the entire thing into one CSV string. For a table with hundreds of thousands of rows, that’s gigabytes of memory. The 4000 MB Lambda allocation wasn’t over-provisioned — it was insufficient.

The cost question had become a reliability question.

And the reliability question had a code-level root cause that required a real engineering fix, not a configuration change.

The Pivot

Claude Code didn’t just report the finding and wait for instructions. It proposed a streaming refactor: replace the buffer-everything-then-PUT pattern with a pipeline that streams each page of 100 rows directly to S3 via multipart upload as it arrives. Memory usage drops from O(total rows) to O(page size) — roughly constant regardless of table size.

We discussed the approach. It wrote failing tests first — TDD, with a byte-equivalence test proving the streaming output exactly matches the original Parser.parse() output, and a 100,000-row memory test proving peak heap stays under 50 MB. Then it implemented the minimal code to pass those tests, integrated it into the 1,075-line production handler, ran the existing test suite to confirm zero regressions, and verified the SAM build completed cleanly.

We deployed. Within five minutes, CloudWatch showed the first successful streaming upload: 20,374 rows, 6.6 MB CSV, 185 MB peak memory. The same function that had been crashing at 4000 MB was now comfortable at 185.

Then the Cost Work Happened

With the streaming fix in production, the memory distribution collapsed. No more bimodal tail. P99 dropped to 301 MB. Now the memory allocation could be safely reduced.

But we didn’t just drop the memory. Claude Code walked through a disciplined sequence:

48-hour soak. We monitored the streaming fix for two full daily workload cycles before touching memory. Zero OOM crashes. Zero streaming failures. 2,625 successful uploads with a perfect open/complete ratio.

Data-driven target. The original plan was 512 MB. The measured P99 of 301 MB meant 512 MB gave only 1.7x headroom. Claude Code revised the recommendation to 1024 MB (3.4x headroom) and explained why — showing the actual production numbers, not theoretical estimates.

Stacked changes. Once the main function was right-sized, three other functions got the same treatment based on their measured P99 values: copyToSupabaseFunction (4000 → 2048 MB), downloadAppJsonFunction (4000 → 512 MB), downloadAppFilesFunction (4000 → 768 MB).

arm64 migration. Every function was moved from x86_64 to Graviton arm64 for a flat 20% compute discount. This had been blocked by a Lambda Insights layer incompatibility that Claude Code discovered during an earlier pilot — the fix required removing the monitoring stack first, which we did by preserving only the authentication error alarm in a standalone CloudFormation stack.

Legacy consolidation. We discovered hundreds of customer apps were still routed through an old v1 backup handler running at 4000 MB on x86_64. A one-line queue routing change moved them all to the v2 streaming handler.

Each change was a separate commit. Each was verified before proceeding to the next. Each had an explicit rollback plan. This wasn’t a big-bang rewrite — it was a disciplined sequence of small, safe changes, each building on the verified success of the previous one.

The Numbers

Before (March 2026):

  • Daily Lambda cost: $10.93 (weekday average)
  • Monthly Lambda cost: $284
  • Peak function memory: 4000 MB (x86_64)
  • Silent OOM crashes: 121 in 4 days

After (April 10, 2026 — first full post-change day):

  • Daily Lambda cost: $2.94
  • Monthly projection: ~$88
  • Peak function memory: 1024 MB (arm64)
  • OOM crashes: zero

73% cost reduction. From $284/month to $88/month.

The compute line item in Cost Explorer literally changed category — from EU-Lambda-GB-Second (x86_64) to EU-Lambda-GB-Second-ARM (Graviton). Every dollar of production compute now runs on arm64.

What Made This Work

It wasn’t the AI writing code. Any competent engineer could write a streaming CSV uploader. What made this effective was the combination of capabilities that Claude Code brings to infrastructure work:

Direct access to production signals. Claude Code queried CloudWatch Logs Insights, Cost Explorer, and Lambda metrics directly. It didn’t rely on secondhand descriptions of the problem — it looked at the actual data, found patterns we hadn’t seen, and revised its approach when the evidence contradicted its assumptions.

Willingness to pivot. The initial hypothesis — “memory is over-provisioned, just reduce it” — was wrong. The AI recognised this when the data showed the opposite (memory was under-provisioned for the tail workload) and pivoted to a code-level fix instead of pushing the original plan.

End-to-end execution. From cost analysis through code investigation, test writing, implementation, deployment, and post-deploy monitoring — the entire arc happened within Claude Code sessions. No context was lost between “what’s the problem?” and “verify the fix is working in production.”

Disciplined verification. Every change was deployed separately, verified with CloudWatch queries, and given soak time before the next change. The AI didn’t rush to deploy everything at once. When we asked it to shorten the soak period from 7 days to 48 hours, it adjusted — but only after confirming the workload pattern justified the shorter window.

Honest revision. The original memory target was 512 MB. When production data showed P99 at 301 MB (higher than the unit-test-based prediction), Claude Code revised the target to 1024 MB and explained the discrepancy — the unit test measured the streaming delta alone, but the production function has runtime overhead from Node.js, imported modules, and per-request state that the test didn’t capture.

What It Got Wrong

Not everything was smooth. Worth documenting:

The arm64 pilot crashed. The first attempt to flip a function to arm64 failed because SAM’s Globals.Function.Layers property merges with function-level layers rather than replacing them. The x86_64-only Lambda Insights extension was still attached despite a function-level Layers: [] override. The function crashed at extension init. Claude Code diagnosed the root cause, reverted the pilot, and filed it as a dependency for later (remove Insights before attempting arm64 again).
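
The merge behaviour looks roughly like this in a SAM template (a sketch — resource names, layer version, and ARN details are illustrative):

```yaml
Globals:
  Function:
    # Applied to every function in the stack...
    Layers:
      - !Sub arn:aws:lambda:${AWS::Region}:580247275435:layer:LambdaInsightsExtension:38

Resources:
  Arm64PilotFunction:
    Type: AWS::Serverless::Function
    Properties:
      Architectures: [arm64]
      Layers: []   # ...and this does NOT clear it: SAM merges list-type
                   # Globals entries with function-level values rather than
                   # replacing them
```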

Misread a Lambda metric. Lambda’s Max Memory Used is reported per-container-lifetime, not per-invocation. Claude Code initially interpreted high memory reports on short invocations as evidence of an ongoing OOM crisis. It was actually warm-container-reuse artifacts — containers reporting peak memory from a previous heavy invocation. Once identified, Claude Code added a duration filter to subsequent queries and documented the gotcha.
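
A duration filter of that shape (threshold illustrative) excludes the short warm-container invocations whose REPORT line carries the peak from an earlier heavy request:

```
filter @type = "REPORT" and @duration > 5000
| stats pct(@maxMemoryUsed / 1000 / 1000, 99) as p99MemMB by bin(1h)
```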

Overestimated memory savings. The unit test showed 30–50 MB peak memory for streaming. Production showed 150–300 MB. The gap was the function’s fixed overhead (Node runtime, module imports, per-request state) which the unit test didn’t capture. Not a bug, but the initial projection of $195/month savings had to be revised down to $167/month.

These weren’t disasters. The pilot crash was reverted in minutes. The metric misread was caught within the same analysis session. The savings estimate was revised before any deployment decision was made. But they’re real examples of where human oversight mattered — recognising when the AI’s initial read was off and guiding it to a better interpretation.

The Broader Point

This wasn’t a moonshot. It was bread-and-butter infrastructure work — the kind that sits on every team’s backlog because it’s important but not urgent, because it requires diving into CloudWatch and reading code and understanding production behaviour, because it’s never quite worth pulling someone off feature work to investigate.

This isn't about replacing engineers. It's about collapsing the investigation-to-fix cycle from "someday when we get to it" to "Tuesday afternoon."

The cost saving is real and material — over $2,000 per year on a single Lambda function group. But the more significant finding was the 121 silent OOM crashes that nobody knew about. Customer backups were failing without alerts. That’s the kind of thing that only surfaces when someone actually looks — and having an AI agent that can query CloudWatch, read code, trace execution paths, and follow evidence across multiple tools makes “actually looking” dramatically more accessible.

Cloud infrastructure accumulates debt quietly.

Functions get provisioned generously and never revisited. Monitoring stacks are built and then ignored. Legacy code paths persist because nobody remembers why they exist. The technical knowledge to fix these things isn’t scarce — what’s scarce is the sustained attention to investigate, plan, test, deploy, verify, and document. That’s what Claude Code provided.

Four days. Six deploys. Seven plans written and executed. 73% cost reduction. And a production reliability fix that we didn’t know we needed.

The backup system runs on AWS Lambda, SQS, and S3, managed with SAM/CloudFormation. Claude Code sessions were conducted using Claude Opus 4.6. The full investigation is documented in the project repository at docs/lambda-cost-investigation-2026-04-07.md.