From idea to income: how I built an AI-powered murder mystery business with Claude

A live product, paying customers, and a seven-stage pipeline that writes 30-page printable mystery kits in five minutes. The architecture, the ten Claude calls, and the one sentence that fixed everything.

I shipped a product with Claude. People are paying for it. The product is murdermysterygameai.com: you type a theme — "1920s Berlin cabaret, seven friends, four women and three men" — and five minutes later you download a 30-page printable kit with character dossiers, a host script, an evidence packet, name cards, and a written solution. No human in the loop. A pipeline writes it.

What follows is the architecture that sits behind those two sentences. Most of it isn't about prompting. The answer to almost every hard problem turned out not to be "a bigger prompt."

Figure: a real Claude-generated character dossier (Sterling Silverton: period photo, case-file framing, secret-keeping language), verbatim from a recent kit, not a mock-up. The case-file framing and the secret-keeping language are the schema the prompts are written against, not a layout template applied after the fact.

The quiet little market most engineers have never looked at

The global escape-room and live-mystery market was about $5.6 billion in 2025, growing at a 14.4% CAGR, projected to reach $14 billion by 2032. About eight in ten US adults played a tabletop or party game in the last year, and mystery and social-deduction titles are among the fastest-growing categories. "Murder mystery dinner" has been a top-five Pinterest party-planning search every quarter for three years running.

A hosted kit on competitor sites sells for $16–$48. My average sale is around $32. My cost per kit, in Claude tokens, is $0.10 to $0.50. Counting token cost alone, that's a gross margin of 98% or better on a product nobody at any AI conference is fighting me for. The status quo is what makes this work: pre-written kits are slow, expensive, and never quite the theme you want. Catalogues hover around 120 themes, refreshed once a year. Player counts are fixed at 6, 8 or 10; if your party is seven, you buy a different game. What people actually want, from the support tickets, is a 1920s Berlin cabaret mystery for the exact seven friends coming over on Saturday.

Why pure-LLM generation breaks

A murder mystery is not a generation problem. It's a graph problem dressed up as one. You have 4 to 32 characters; each one has secrets, motives, and alibis; the relationships are pairwise, so the number of edges that all have to be coherent grows on the order of n² (496 pairs for a cast of 32); and the clues along those edges have to point logically to a single killer the players can deduce. Neither the host nor the players will tolerate a single contradiction.

My first attempt was a giant prompt. Beautiful Claude prose came out the other end. The mysteries didn't actually resolve. Three repeatable failure modes:

•Clue contradictions. Character A's secret said she was in the library at 9pm. Character B's said she was on the lawn at 9pm. Both got printed.
•Lopsided objectives. Two characters got six interaction goals each; three characters got nothing to do for the night.
•Unconnected plot. The victim's lawyer had never met the victim. No motive, no suspicion, no game.
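The first two failure modes are mechanical enough to detect in code. The sketch below is illustrative, not the production checks: the `AlibiClaim` shape and both function names are assumptions, and it presumes alibis have already been extracted into structured claims.

```typescript
// Hypothetical shape: one claim about where a character was at a given time.
interface AlibiClaim {
  source: string;  // whose secret makes the claim
  subject: string; // who the claim is about
  time: string;    // e.g. "21:00"
  place: string;
}

// Returns pairs of claims that put the same subject in two places at once.
function findAlibiConflicts(claims: AlibiClaim[]): Array<[AlibiClaim, AlibiClaim]> {
  const firstClaim = new Map<string, AlibiClaim>();
  const conflicts: Array<[AlibiClaim, AlibiClaim]> = [];
  for (const claim of claims) {
    const key = claim.subject + "@" + claim.time;
    const earlier = firstClaim.get(key);
    if (earlier === undefined) {
      firstClaim.set(key, claim);
    } else if (earlier.place !== claim.place) {
      conflicts.push([earlier, claim]);
    }
  }
  return conflicts;
}

// Returns the characters who were given no interaction goals at all.
function findIdleCharacters(goalCounts: Record<string, number>): string[] {
  return Object.keys(goalCounts).filter((who) => goalCounts[who] === 0);
}
```

Run against the library-versus-lawn example, `findAlibiConflicts` returns the offending pair, so neither claim gets printed unnoticed.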

The instinct is to tighten the prompt. Add more instructions. Use more tokens. That's the wrong move. The structure has to be guaranteed before the prose is written. Otherwise you're asking the model to be both the architect and the dialogue writer in the same pass, and the architect part isn't actually language-shaped.

The one sentence that fixed everything

Use deterministic code for the things that have to be true. Use Claude for the things that have to feel true.

Everything else is a consequence of that. The cast has to be the right gender mix — that's a SQL query and a two-pass match, not a prompt. The killer has to be exactly one person — that's a single tagged JSON field, not a polite request. Each character has to interact with three or four others per phase — that's a matrix built by code. Claude only writes the line of dialogue once code has decided who's talking to whom about what.
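The article does not spell out what the two passes of the cast match are, so the sketch below is one plausible reading, stated as an assumption: pass one fills slots with characters of exactly the requested gender, pass two fills what is left from a gender-flexible pool. The types and the `matchCast` name are hypothetical, and an in-memory array stands in for the SQL query.

```typescript
type Gender = "female" | "male" | "any"; // "any" = playable by anyone (assumption)

interface RosterCharacter {
  id: string;
  gender: Gender;
}

interface GuestCounts {
  female: number;
  male: number;
}

// Returns a cast that fits the party, or null when the roster cannot cover it.
function matchCast(roster: RosterCharacter[], guests: GuestCounts): RosterCharacter[] | null {
  const cast: RosterCharacter[] = [];
  const remaining: GuestCounts = { female: guests.female, male: guests.male };

  // Pass 1: exact gender matches only.
  for (const c of roster) {
    if (c.gender !== "any" && remaining[c.gender] > 0) {
      cast.push(c);
      remaining[c.gender] -= 1;
    }
  }

  // Pass 2: fill any open slots from the gender-flexible pool.
  let open = remaining.female + remaining.male;
  for (const c of roster) {
    if (open === 0) break;
    if (c.gender === "any") {
      cast.push(c);
      open -= 1;
    }
  }
  return open === 0 ? cast : null;
}
```

A `null` result is the point where a real pipeline would have to do something else, such as inventing extra characters. Either way, the gender mix is settled before any prompt runs.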

The model's job is the part where rules don't help: the voice of a manipulative cabaret singer being interrogated by a nervous suspect, the texture of a 1920s newsroom article that names every guest at the party without giving the killer away. That part is genuinely hard for code, and trivial for Claude.

The seven-stage Cloudflare Workflow

The full pipeline is seven stages, each checkpointed by a Cloudflare Workflow so a failure at stage four doesn't burn the previous three. Every stage validates with Zod, writes its output to Postgres, and pushes a progress event over a WebSocket via a Durable Object so the user can watch the kit build in real time. The stages are: match cast (code only), tailor plot (Claude), invitation copy (Claude), core mystery (Claude), per-character secrets and bios (Claude), interaction matrix (code) + objective lines (Claude), and finally clues and the in-world newsletter (Claude, with a self-critique gate).
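To make the checkpointing idea concrete, here is a minimal sketch that does not use the Cloudflare Workflows API. An in-memory `Map` stands in for the durable store, stages are kept synchronous for brevity, and `Stage`, `CheckpointStore` and `runPipeline` are hypothetical names.

```typescript
// Hypothetical stage: reads the state so far, returns its own validated output.
interface Stage {
  label: string;
  run: (state: Record<string, unknown>) => Record<string, unknown>;
}

// In-memory stand-in for the durable checkpoint store.
type CheckpointStore = Map<string, Record<string, unknown>>;

// Runs stages in order, skipping any stage that already has a checkpoint.
function runPipeline(
  stages: Stage[],
  store: CheckpointStore,
  onProgress: (label: string) => void,
): Record<string, unknown> {
  let state: Record<string, unknown> = {};
  for (const stage of stages) {
    const saved = store.get(stage.label);
    if (saved !== undefined) {
      state = { ...state, ...saved }; // reuse finished work
      continue;
    }
    onProgress(stage.label); // e.g. push a progress event to the browser
    const output = stage.run(state);
    store.set(stage.label, output); // checkpoint before moving on
    state = { ...state, ...output };
  }
  return state;
}
```

If a stage throws, the store still holds every earlier stage's output, so calling `runPipeline` again resumes at the failed stage instead of starting over.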

Every stage rebuilds its context from validated database state, not from raw prior output. Claude never reads what Claude just wrote — it reads the cleaned, schema-checked version that has been through the gates. That matters more than it sounds. If a downstream prompt sees raw Claude prose from stage four, it can hallucinate corrections to it, or carry forward subtle mistakes. If it sees a validated JSON object where killerName === 'Lady Pemberton', there's nothing to drift on.

Code wires the conversation; Claude makes each line sound like the speaker

Once the cast is set, every character needs to interact with three or four others per phase. There are two phases: the introduction phase (mingling before the body is found) and the murder phase (questioning after). Per character, per phase, code picks two or three targets that aren't already over-asked, prefers related characters, never picks the same target twice in one phase. That's a 40-line function.
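A function along these lines could look like the sketch below. It is not the production function: the `PairShell` shape, the `"A|B"` relationship key, the default of three targets and the cap of four incoming asks are all assumptions chosen to match the rules in the paragraph above.

```typescript
type Phase = "introduction" | "murder";

interface PairShell {
  from: string;
  to: string;
  phase: Phase;
}

// Hypothetical key format for an undirected relationship: names sorted, joined by "|".
function pairKey(a: string, b: string): string {
  return a < b ? a + "|" + b : b + "|" + a;
}

function buildInteractionShells(
  cast: string[],
  related: Set<string>,
  targetsPerCharacter = 3, // assumption: the article says two or three
  maxIncoming = 4,         // assumption: beyond this a character counts as over-asked
): PairShell[] {
  const shells: PairShell[] = [];
  const phases: Phase[] = ["introduction", "murder"];
  for (const phase of phases) {
    const incoming = new Map<string, number>(); // asks received so far this phase
    for (const from of cast) {
      const candidates = cast
        .filter((to) => to !== from && (incoming.get(to) ?? 0) < maxIncoming)
        .sort((a, b) => {
          // Related characters first, then whoever has been asked the least.
          const relDiff =
            Number(related.has(pairKey(from, b))) - Number(related.has(pairKey(from, a)));
          if (relDiff !== 0) return relDiff;
          return (incoming.get(a) ?? 0) - (incoming.get(b) ?? 0);
        });
      // Candidates are distinct, so a target can never repeat within a phase.
      for (const to of candidates.slice(0, targetsPerCharacter)) {
        incoming.set(to, (incoming.get(to) ?? 0) + 1);
        shells.push({ from, to, phase });
      }
    }
  }
  return shells;
}
```

The greedy ordering keeps the result deterministic for a given cast order. A character late in the list can end up with fewer targets if everyone else is already at the cap, which a production version would presumably handle.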

The output is a stack of interaction-pair shells — one per arrow on the social graph — that name the from-character, the to-character, the phase, and the topic. The topic comes from a prior Claude pass over the secrets so the line has something specific to be about. Then Haiku at temperature 0.3 gets a batch of ten shells and writes ten in-character objective lines in parallel. The system prompt tells Claude to read the speaker's bio, infer their persona — manipulative, direct, nervous — and pick phrasing that fits. "Casually mention" for the manipulator. "Demand answers" for the suspicious one. The model is doing one thing it's good at: writing a single line in a single voice with a single goal. It is not picking who speaks, deciding when, choosing the topic, or deciding whether the line should exist at all. All of that is decided in code before the call goes out.
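The batching step is simple enough to sketch. The `ObjectiveShell` shape, the `chunk` helper and the prompt wording below are assumptions for illustration only. The actual model call is left out, since it would simply receive the rendered text with the temperature set to 0.3.

```typescript
// Hypothetical shell: everything code has already decided about one objective line.
interface ObjectiveShell {
  from: string;
  to: string;
  phase: string;
  topic: string;
}

// Splits a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Renders one batch as a numbered list. Speaker, target, phase and topic are fixed;
// only the wording of each line is left to the model.
function renderBatchPrompt(batch: ObjectiveShell[]): string {
  const rows = batch.map(
    (s, i) => `${i + 1}. [${s.phase}] ${s.from} -> ${s.to}: ${s.topic}`,
  );
  return (
    "Write one in-character objective line per item, in the speaker's voice.\n" +
    rows.join("\n")
  );
}
```

With a batch size of ten, 23 shells become three batches of 10, 10 and 3, and each batch is an independent request that can run in parallel with the others.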

A printed character objectives sheet with bullet-pointed phase-by-phase goals. — A character objectives sheet — the page where the interaction-matrix output lands as printed prose. The bullets under each phase header come from ten parallel Haiku calls, dressed up in case-file frames.

Twelve deterministic gates between Claude and the database

Every Claude call goes through a structured-output parser wired to a Zod schema. Invalid JSON is a retryable error — the workflow catches it and retries the same step without burning previously checkpointed work. The schema is the contract: bios are 100–350 words; secrets are first-person, past tense, no future-intent verbs; eight regex patterns block any secret that says "X is the killer"; the newsletter must mention every cast member.
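The production gates are Zod schemas. The sketch below expresses a few of the same rules as plain TypeScript functions so it runs without dependencies. The regex patterns are illustrative stand-ins, not the eight real ones, and the first-person and future-intent checks are deliberately crude approximations.

```typescript
interface ValidationResult {
  ok: boolean;
  problems: string[];
}

function wordCount(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Illustrative patterns only; the article's eight regexes are not published.
const KILLER_REVEAL_PATTERNS: RegExp[] = [
  /\bis the (killer|murderer)\b/i,
  /\b(killer|murderer) (is|was)\b/i,
];
const FUTURE_INTENT = /\b(I will|I'll|I am going to|I'm going to|I plan to|I intend to)\b/i;

function validateBio(bio: string): ValidationResult {
  const n = wordCount(bio);
  const problems: string[] = n >= 100 && n <= 350 ? [] : [`bio has ${n} words, expected 100-350`];
  return { ok: problems.length === 0, problems };
}

function validateSecret(secret: string): ValidationResult {
  const problems: string[] = [];
  if (!/\b(I|my|me)\b/.test(secret)) problems.push("secret is not first-person");
  if (FUTURE_INTENT.test(secret)) problems.push("secret states future intent");
  if (KILLER_REVEAL_PATTERNS.some((p) => p.test(secret))) problems.push("secret names the killer");
  return { ok: problems.length === 0, problems };
}

function validateNewsletter(newsletter: string, cast: string[]): ValidationResult {
  const problems = cast
    .filter((member) => !newsletter.includes(member))
    .map((member) => `newsletter never mentions ${member}`);
  return { ok: problems.length === 0, problems };
}
```

Each function returns the list of problems rather than throwing, which leaves the caller free to decide whether a failure is worth a retry.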

If a call fails validation hard enough that retrying won't help, the workflow throws a non-retryable error, the game is marked failed, and credits are refunded automatically. Meanwhile a Durable Object keeps the WebSocket open to the browser, streaming "Generating Mrs. Whitcombe's secrets…" / "Wiring the murder phase…" so the user sees motion. Each validator is short. Each one closes off a class of failure I've seen in production.
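The retry-or-refund decision can be sketched without any workflow engine. Everything here is hypothetical: `HardValidationError`, `runStep` and the default of three attempts are illustrative, the step is synchronous for brevity, and treating exhausted retries the same as a hard failure is an assumption.

```typescript
// Hypothetical error type: a validation failure that retrying cannot fix.
class HardValidationError extends Error {
  readonly retryable = false;
}

interface StepOutcome<T> {
  value?: T;
  failed: boolean;
  attempts: number;
}

// Checks the flag rather than using instanceof, so it also works for plain objects.
function isHardFailure(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { retryable?: unknown }).retryable === false
  );
}

function runStep<T>(step: () => T, refund: () => void, maxAttempts = 3): StepOutcome<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { value: step(), failed: false, attempts: attempt };
    } catch (err) {
      // Retry ordinary errors; give up on hard failures or when attempts run out.
      if (isHardFailure(err) || attempt === maxAttempts) {
        refund();
        return { failed: true, attempts: attempt };
      }
    }
  }
  throw new Error("maxAttempts must be at least 1");
}
```

An invalid-JSON error gets up to three tries, while a hard validation failure triggers the refund on the first attempt.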

Ten small calls beat one giant prompt

The pipeline calls Haiku ten times per game. Each call does one thing. Each call has its own temperature and its own token budget. Creative work runs hot (0.7–0.9); structured work runs cold (0–0.5). One giant prompt has to pick one temperature for both and gets neither right.

•Tunable temperature. Per-call, per-stage, never a compromise.
•Targeted prompts. Each call is short, focused, easy to reason about. No 5,000-word mega-prompt where you can't tell which paragraph caused a regression.
•Per-stage evals. One scoring suite per call. You can tune each independently and see the diff.
•Fail small. One bad clue is regenerated, not the entire kit. The workflow checkpoint between every stage means a five-minute generation doesn't restart from scratch when the newsletter call hiccups.
•Right-sized context. Each stage rebuilds context from validated database state, never from raw earlier output. The context window stays small, the relevant facts stay sharp.
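One way to keep per-call settings honest is to hold them in a table and check them against the stated temperature bands. In this sketch only the two bands and the 0.3 for objective lines come from the article. The stage names, the other temperatures and every token budget are placeholders, and objective lines are treated as structured simply because 0.3 sits in the cold band.

```typescript
type WorkKind = "creative" | "structured";

interface CallConfig {
  stage: string;
  kind: WorkKind;
  temperature: number;
  maxTokens: number;
}

// Bands from the article: creative work runs hot, structured work runs cold.
const TEMPERATURE_BANDS: Record<WorkKind, [number, number]> = {
  creative: [0.7, 0.9],
  structured: [0, 0.5],
};

// Placeholder values for illustration, except the 0.3 for objective lines.
const CALL_CONFIGS: CallConfig[] = [
  { stage: "tailor-plot", kind: "creative", temperature: 0.8, maxTokens: 2000 },
  { stage: "objective-lines", kind: "structured", temperature: 0.3, maxTokens: 1000 },
  { stage: "newsletter", kind: "creative", temperature: 0.9, maxTokens: 3000 },
];

// Returns the stages whose temperature falls outside the band for their kind.
function stagesOutOfBand(configs: CallConfig[]): string[] {
  return configs
    .filter((c) => {
      const [low, high] = TEMPERATURE_BANDS[c.kind];
      return c.temperature < low || c.temperature > high;
    })
    .map((c) => c.stage);
}
```

A check like `stagesOutOfBand(CALL_CONFIGS)` could run in CI, so a temperature tweak that drifts out of its band shows up as a failing test rather than a subtle regression.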

Show four good examples and three bad ones

This is the most useful prompt-engineering lesson I learned the whole year. When a theme is short on female characters, Claude has to invent one or two on the fly. I want them to have themed pun names — that's the house style. I tried explaining the rule: "make the name a pun related to the theme." Got a 40% pun-rate. Tightened the rule: "the pun must be unambiguous." 45%. Wrote three more sentences. 50%.

Then I stopped describing the rule and showed examples instead. Good: "Jack Pott," "Mae Day," "Sue Flay," "Russell Sprout." Bad — with the diagnosis attached: "Thaddeus Marrow" (the surname is themed but the joke doesn't snap), "Sienna Fathoms" (moody, not punny), "Nathaniel 'Compass' Ashford" (the pun is hidden in a quoted nickname; it must live in the actual name). The pun-rate jumped above 90%. The model copies what you put in front of it. Show it the failure modes you have already seen, with the diagnosis attached, and it stops producing them.
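In code, the examples-over-rules approach amounts to assembling the prompt from two lists. The names and diagnoses below are the ones quoted above. The function name, the `BadExample` shape and the surrounding prompt wording are assumptions, not the production prompt.

```typescript
interface BadExample {
  candidate: string;
  diagnosis: string; // why this one fails, stated next to the example
}

const GOOD_NAMES: string[] = ["Jack Pott", "Mae Day", "Sue Flay", "Russell Sprout"];

const BAD_NAMES: BadExample[] = [
  { candidate: "Thaddeus Marrow", diagnosis: "the surname is themed but the joke doesn't snap" },
  { candidate: "Sienna Fathoms", diagnosis: "moody, not punny" },
  {
    candidate: "Nathaniel 'Compass' Ashford",
    diagnosis: "the pun is hidden in a quoted nickname; it must live in the actual name",
  },
];

// Builds a prompt that shows examples instead of describing the rule.
function buildNamePrompt(theme: string, count: number): string {
  const good = GOOD_NAMES.map((n) => `- ${n}`);
  const bad = BAD_NAMES.map((b) => `- ${b.candidate}: ${b.diagnosis}`);
  return [
    `Invent ${count} character name(s) for the theme: ${theme}.`,
    "Good examples:",
    ...good,
    "Bad examples, with the reason each one fails:",
    ...bad,
  ].join("\n");
}
```

Keeping the bad examples as data means each new failure seen in production becomes one more entry in `BAD_NAMES`, with its diagnosis, rather than another sentence of rule text.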

Evals are how you tune ten prompts without losing your mind

The eval suite was the single highest-leverage thing I built. It's what let me change a prompt at 11pm without praying. There is one scoring suite per Claude function. The test inputs are the real prompts, run against a fixed Phase-1 mystery and a fixed cast, so the prompt is the only variable. The scorers split into two camps.

Code scorers run on every prompt change in CI — Zod-shape and word-count checks, first-person past-tense format checks, secret counts, and eight regex patterns that catch any "X is the killer" phrasing. These catch about 80% of regressions before you pay LLM-judge cost. LLM-judge scorers are slower and cost real money — they ask things rules can't: "Would a careful player suspect this speaker is the murderer?" "Given the cast and the clues, can you pick the killer?" "For each red herring, who does it point at — and does that point at the actual killer?" Run code-first, judge-second. Don't pay for an LLM to count words when a regex will.
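The code-first, judge-second ordering is easy to express as a short-circuiting runner. This is a sketch under assumptions: `Scorer`, `EvalReport` and `runEval` are hypothetical names, and the judge scorers are modelled as synchronous functions so the example stays self-contained, whereas a real judge would be an LLM call.

```typescript
interface Scorer {
  label: string;
  score: (output: string) => boolean; // true = pass
}

interface EvalReport {
  passed: boolean;
  failedAt: string | null;
  judgeCalls: number; // how many paid judge scorers actually ran
}

// Runs the cheap code scorers first and only reaches the judges if all of them pass.
function runEval(output: string, codeScorers: Scorer[], judgeScorers: Scorer[]): EvalReport {
  for (const s of codeScorers) {
    if (!s.score(output)) {
      return { passed: false, failedAt: s.label, judgeCalls: 0 };
    }
  }
  let judgeCalls = 0;
  for (const s of judgeScorers) {
    judgeCalls += 1;
    if (!s.score(output)) {
      return { passed: false, failedAt: s.label, judgeCalls };
    }
  }
  return { passed: true, failedAt: null, judgeCalls };
}
```

A regression caught by a regex returns with `judgeCalls` at zero, so that failure costs nothing in judge tokens.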

Three orderings I had backwards

If I were starting today, I would do these in this order. I did them in the opposite order the first time and lost months to it.

•PDF first. Design the printed kit before writing any prompt. The layout is the schema. If the page has a Public Secrets section with a 100-word slot, the prompt has to know that shape. Writing the prompt first means rewriting it three months later.
•Evals next. Write the solvability eval before the generator. "Given these clues and this cast, can an LLM pick the right killer?" — that scorer is the thing that tells you whether your generator is even working. Build it before you build the thing it's grading.
•Prompts last. The prompt is the thinnest layer. Build the database schema, the workflow, the validators, the PDF layout, the eval harness first. By the time you write the prompt, half of what it has to do is already enforced by code. The remaining job is so well-scoped that even Haiku does it well.

This pattern travels

Murder mysteries are a specific application. The pattern is general. Anywhere creative output has to obey hard rules, the same architecture works: curricula where prerequisites have to be respected before the lesson is written; compliance docs where plain-language output has to map to clause IDs; tabletop and LARP scripts where quests have to converge but keep player agency; game-show formats with timing constraints and a fixed reveal order; onboarding flows that hit every required gate per persona; recipe systems where allergens, ingredient swaps, and timing all have to fit.

In every one, the trap is the same: hand the whole problem to the model and trust the prose. The way out is the same: deterministic code for what has to be true, Claude for what has to feel true, and a small eval suite measuring each piece in isolation. A murder mystery kit is a stupid product to build at an AI lab. It turned out to be a fine business. Go build something weird and small. Yours will be different and stupider and that is the entire point.