12-Factor Agents: How We Run 12 Agents
The 12-factor agents framework hit GitHub trending with 736 stars. We'd already been running 12 production agents. Here's how each principle maps to our stack.
When Dex at HumanLayer published the 12-factor agents framework, it hit GitHub trending with 736 stars in a single day. The timing wasn't accidental. The community had been building agent systems for two years and nobody had written down the production patterns.
We had been running 12 production AI agents across three droplets for months. The framework landed like a mirror — not because it taught us anything new, but because it validated architecture decisions we'd already made in the dark.
Here's how each factor maps to our actual stack.
The 12 Factors, Mapped to Production
Factor 1: Natural Language to Tool Calls
The core loop. An agent receives a task in natural language, determines the next step, and executes it via a structured tool call.
In our stack, the CEO agent plans in natural language. It writes an objective, identifies which specialists to dispatch, and defines success criteria. The specialists receive those instructions and execute through tool calls — file operations, API queries, database reads. The CEO never touches a tool. It only delegates.
The LangGraph StateGraph underneath handles the routing. CEO node → parallel specialist nodes → synthesis node → final output. The graph is the track. The agents are the train.
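A minimal sketch of the natural-language-to-tool-call step, not the production code. The tool names, the stub bodies, and the JSON shape of the model output are all assumptions made for illustration.

```python
import json

# Hypothetical tools. The bodies are stubs that only echo their arguments.
def read_file(path: str) -> dict:
    return {"tool": "read_file", "path": path, "status": "ok"}

def query_api(endpoint: str) -> dict:
    return {"tool": "query_api", "endpoint": endpoint, "status": "ok"}

TOOLS = {"read_file": read_file, "query_api": query_api}

def execute_tool_call(raw: str) -> dict:
    """Parse the model's structured output and run the matching tool."""
    call = json.loads(raw)
    name, args = call["name"], call.get("arguments", {})
    if name not in TOOLS:
        return {"tool": name, "status": "error", "reason": "unknown tool"}
    return TOOLS[name](**args)

# What a specialist's model might emit for "check the sitemap endpoint".
model_output = '{"name": "query_api", "arguments": {"endpoint": "/sitemap"}}'
result = execute_tool_call(model_output)
```

The model only chooses the name and the arguments. Deterministic code decides whether that tool exists and runs it.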
Factor 2: Own Your Prompts
Prompts are software. They need version control, review, and deterministic behavior.
We maintain LLM presets as version-controlled JSON files at /home/tacavar/bailian/configs/<preset>.json. Four presets: gpt, claude, qwen, best_mix. Each preset overrides per-agent model choices. To switch, set LLM_PRESET=claude in .env.app, then run systemctl restart bailian. No prompts embedded in application code. No hardcoded model strings.
When we need prompt changes, we edit the preset file, not the agent logic. Separation of concerns at the prompt level.
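A sketch of how a preset loader along these lines could look. The JSON schema (default_model, agents), the fallback preset name, and the model identifiers are assumptions; the source only states that presets are JSON files selected by LLM_PRESET.

```python
import json
import os
import tempfile
from pathlib import Path

def load_preset(config_dir: Path, default: str = "best_mix") -> dict:
    """Load the preset named by LLM_PRESET from <config_dir>/<preset>.json."""
    name = os.environ.get("LLM_PRESET", default)  # fallback name is assumed
    with open(config_dir / f"{name}.json", encoding="utf-8") as fh:
        return json.load(fh)

def model_for(preset: dict, agent: str) -> str:
    """Per-agent override if present, else the preset's default model."""
    return preset.get("agents", {}).get(agent, preset["default_model"])

# Demo against a throwaway directory with a made-up schema.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp)
    (cfg / "claude.json").write_text(
        json.dumps({"default_model": "model-a", "agents": {"ceo": "model-b"}}),
        encoding="utf-8",
    )
    os.environ["LLM_PRESET"] = "claude"
    preset = load_preset(cfg)
    ceo_model = model_for(preset, "ceo")
    writer_model = model_for(preset, "content_writer")
```

Agent code asks model_for() for its model and never names one itself, so a preset change needs no code change.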
Factor 3: Own Your Context Window
The context window is the most expensive resource in an agent system. Every token you put in it is a token you're paying for and a token the agent can hallucinate on.
Our handoff protocol between agents uses three fields: what changed, what the next agent needs to know, what risks remain. No free-text handoffs. No "here's what I think" prose. Raw context passing between agents bloated token counts and introduced surface area for errors. Structured handoffs eliminated both problems.
This is the opposite of the instinct to give downstream agents everything. More context is not better. Less context, carefully chosen, is better.
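The three-field handoff can be sketched as a small validated record. The field names below are illustrative renderings of "what changed, what the next agent needs to know, what risks remain", not the actual schema.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class Handoff:
    what_changed: str
    next_agent_needs: str
    risks_remaining: str

    def __post_init__(self):
        # Reject empty or non-string fields so no handoff degrades to free text.
        for name, value in asdict(self).items():
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"handoff field '{name}' must be non-empty")

    def to_context(self) -> str:
        """Serialize exactly the three fields for the next agent's context."""
        return json.dumps(asdict(self), indent=2)

handoff = Handoff(
    what_changed="Rewrote meta descriptions on 3 pages",
    next_agent_needs="Pages are unpublished drafts",
    risks_remaining="Title length not yet checked",
)
context_block = handoff.to_context()
```

Anything that does not fit one of the three fields stays out of the downstream context.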
Factor 4: Tools Are Structured Outputs
A tool call is a function that returns structured data. The LLM selects it. Deterministic code executes it. The result re-enters the context window.
Our tool layer returns parseable JSON. When an agent's files have to be extracted from its output, a prose-recovery step runs. The _used_prose_recovery flag prevents false exits and also logs that recovery was needed. That log is itself a signal: if prose recovery fires too often, the tool's output format is wrong.
Tools that return raw API dumps pollute context. Tools that return exactly what the agent needs keep the loop tight.
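One way a parse-with-recovery step might look. This is a guess at the general shape, not the real recovery logic: only the _used_prose_recovery flag name comes from the text above, and the regex fallback is an assumption.

```python
import json
import re

def parse_tool_output(text: str) -> dict:
    """Return parsed JSON plus a flag recording whether recovery was needed."""
    try:
        return {"data": json.loads(text), "_used_prose_recovery": False}
    except json.JSONDecodeError:
        pass
    # Fallback: pull the outermost {...} span out of the surrounding prose.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return {"data": json.loads(match.group(0)), "_used_prose_recovery": True}
        except json.JSONDecodeError:
            pass
    return {"data": None, "_used_prose_recovery": True}

clean = parse_tool_output('{"files": ["a.py"]}')
messy = parse_tool_output('Sure! Here it is: {"files": ["a.py"]} Hope that helps.')
empty = parse_tool_output("no json here")
```

Counting how often the flag comes back True per tool gives the "fires too often" signal.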
Factor 5: Unify Execution State and Business State
The state of the agent's execution and the state of the business operation should be the same thing. No dual bookkeeping.
Our agents use Postgres with SELECT FOR UPDATE for atomic writes. Agent state — what task it's working on, what step it's at, what it produced — lives in the same database as the business data it operates on. The governor cron pattern (agents wake on schedule, complete a bounded task, sleep) ensures agents don't accumulate stale internal state. Each run starts fresh from the database.
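A sketch of an atomic step update using SELECT FOR UPDATE. The tasks table, its columns, and the function name are hypothetical, and the %s placeholders assume a psycopg-style DB-API driver.

```python
def advance_task(conn, task_id: int, output: str) -> int:
    """Lock the task row, bump its step, and store the output in one transaction."""
    cur = conn.cursor()
    try:
        # The row lock is held until commit or rollback, so a second worker
        # running the same statement waits instead of reading a stale step.
        cur.execute("SELECT step FROM tasks WHERE id = %s FOR UPDATE", (task_id,))
        row = cur.fetchone()
        if row is None:
            raise LookupError(f"task {task_id} not found")
        next_step = row[0] + 1
        cur.execute(
            "UPDATE tasks SET step = %s, output = %s WHERE id = %s",
            (next_step, output, task_id),
        )
        conn.commit()
        return next_step
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Execution state (step) and business output sit in the same row and change in the same transaction, so there is no second ledger to reconcile.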
Factor 6: Launch/Pause/Resume With Simple APIs
Agents should start, stop, and resume through clean interfaces — not through killing processes and praying.
The kanban dispatcher is our agent lifecycle manager. Tasks land on the board. The dispatcher claims them, spawns an agent, monitors it, and collects the result. If an agent times out, the task re-queues. If it crashes, the task re-queues. The agent doesn't need to know about the dispatcher. The dispatcher manages the agent.
This pattern — cron wakes the agent, the agent does one thing, the agent exits — prevents the idle-burn that makes most agent fleets expensive.
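The claim, run, and re-queue behaviour can be sketched with an in-memory queue. This is a simplification, not the dispatcher itself: the retry cap (max_attempts) is an assumption, and timeouts are modelled as exceptions raised by the agent.

```python
from collections import deque

def dispatch(tasks, run_agent, max_attempts=3):
    """Claim tasks one at a time; re-queue on crash or timeout."""
    queue = deque((task, 1) for task in tasks)
    results, abandoned = {}, []
    while queue:
        task, attempt = queue.popleft()
        try:
            results[task] = run_agent(task)
        except Exception:  # a crash and a timeout get the same treatment
            if attempt < max_attempts:
                queue.append((task, attempt + 1))
            else:
                abandoned.append(task)
    return results, abandoned

calls = {}

def flaky_agent(task):
    calls[task] = calls.get(task, 0) + 1
    if task == "seo-audit" and calls[task] == 1:
        raise TimeoutError("agent exceeded its time limit")
    if task == "broken":
        raise RuntimeError("agent crashed")
    return f"{task}: done"

results, abandoned = dispatch(["seo-audit", "brief", "broken"], flaky_agent)
```

The agent function knows nothing about queues or retries. All lifecycle decisions stay in dispatch().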
Factor 7: Contact Humans With Tool Calls
Autonomous agents need a structured way to ask for human input. Not a Slack DM. Not an @mention. A tool call that blocks execution until a human responds.
Our approval gates are mandatory checkpoints between agent output and production. The keywords that trigger them (deploy, production, database, billing, delete) are hardcoded. The CEO agent can also set requires_approval on any task. A rejected approval moves the task to a terminal rejected status.
This isn't caution. It's infrastructure. Each gate answers one question: should this output proceed to the next stage? The answer requires judgment. Judgment doesn't scale to agents.
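A minimal sketch of a keyword-triggered approval gate. The keyword list and the requires_approval flag come from the text above. The substring matching rule, the task shape, and the non-rejected status names are assumptions.

```python
APPROVAL_KEYWORDS = ("deploy", "production", "database", "billing", "delete")

def needs_approval(task_text: str, requires_approval: bool = False) -> bool:
    """Gate on an explicit flag or on any hardcoded keyword (substring match)."""
    lowered = task_text.lower()
    return requires_approval or any(k in lowered for k in APPROVAL_KEYWORDS)

def resolve(task: dict, ask_human) -> dict:
    """Block on a human decision whenever the gate triggers."""
    if not needs_approval(task["text"], task.get("requires_approval", False)):
        return {**task, "status": "not_required"}
    approved = ask_human(task)  # in a real system this call would block
    return {**task, "status": "approved" if approved else "rejected"}

gated = resolve({"text": "Delete stale rows from the leads table"}, lambda t: False)
ungated = resolve({"text": "Draft a blog outline"}, lambda t: False)
```

Substring matching is deliberately broad here: "deployment" and "deleted" trigger the gate too, which errs on the side of asking.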
Factor 8: Own Your Control Flow
Don't outsource the agent's decision loop to a framework you can't debug.
Our control flow runs through LangGraph's StateGraph: we define the nodes, the edges, the parallel fan-out, and the synthesis step. It's a visible graph, not a black-box agent loop. LLM calls are non-streaming (we tried streaming, but OpenClaw returned empty deltas). The path is CEO plan, parallel dispatch to specialists, synthesis node, final report. Every transition is explicit. Every state is inspectable.
When something breaks, we can trace exactly which node produced the bad output. That's not possible with a framework loop.
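To show what "trace exactly which node produced the bad output" means, here is the same plan, fan-out, synthesis shape in plain Python. It deliberately does not use LangGraph, and the node functions and their outputs are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical nodes: each takes a state snapshot and returns an update.
def ceo(state):
    return {"plan": f"objective: {state['task']}", "dispatch": ["seo", "research"]}

def seo(state):
    return {"seo": "3 pages missing meta descriptions"}

def research(state):
    return {"research": "2 competitor changes found"}

def synthesis(state):
    return {"report": f"{state['seo']}; {state['research']}"}

SPECIALISTS = {"seo": seo, "research": research}

def run_graph(task):
    state, trace = {"task": task}, []

    def step(name, fn, snapshot):
        update = fn(snapshot)
        trace.append((name, update))  # record which node produced what
        return update

    state.update(step("ceo", ceo, dict(state)))
    snapshot = dict(state)
    with ThreadPoolExecutor() as pool:  # parallel fan-out
        futures = [pool.submit(step, n, SPECIALISTS[n], snapshot)
                   for n in state["dispatch"]]
        updates = [f.result() for f in futures]
    for update in updates:  # fan-in
        state.update(update)
    state.update(step("synthesis", synthesis, dict(state)))
    return state, trace

final_state, trace = run_graph("weekly growth review")
```

Every entry in trace pairs a node name with the exact update it returned, so a bad field in the final report can be walked back to its source.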
Factor 9: Compact Errors Into the Context Window
When an agent fails, the error should be structured and compact — not a 400-line stack trace that consumes the context window.
We wrap every agent in a budget governor: a hard token cap, a time limit, and a circuit breaker. When the agent hits any boundary, it stops. No exceptions. No negotiated overrides. No "just one more turn."
Failures produce structured error states, not raw tracebacks. The next agent in the chain sees: what failed, why it failed, and what the recovery path is. Three fields. Same as the handoff protocol. Consistency under failure.
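A sketch of a budget governor paired with a compact error record. The class and field names, the failure threshold, and the 200-character cap on the reason are assumptions. Only the three boundaries and the three error fields come from the text.

```python
import time
from dataclasses import dataclass, field

class BudgetExceeded(Exception):
    pass

@dataclass
class Governor:
    max_tokens: int
    max_seconds: float
    max_failures: int = 3  # circuit-breaker threshold (assumed value)
    tokens: int = 0
    failures: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int, failed: bool = False) -> None:
        """Record usage after each turn and stop hard at any boundary."""
        self.tokens += tokens
        self.failures += int(failed)
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token cap reached")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("time limit reached")
        if self.failures >= self.max_failures:
            raise BudgetExceeded("circuit breaker open")

def compact_error(exc: Exception, recovery: str) -> dict:
    """Three fields instead of a traceback."""
    return {
        "what_failed": type(exc).__name__,
        "why": str(exc)[:200],
        "recovery_path": recovery,
    }

gov = Governor(max_tokens=1000, max_seconds=60)
try:
    gov.charge(600)
    gov.charge(600)  # second turn crosses the token cap
except BudgetExceeded as exc:
    error = compact_error(exc, "re-queue with a smaller batch")
```

The next agent receives the small error dict, not the exception object or its stack trace.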
Factor 10: Small, Focused Agents
One agent, one role, one bounded context. Twelve small agents outperform one big one.
Our roster: CEO, Dev Lead, Dev-1, Dev-2, Frontend Engineer, SEO Manager, Marketing Lead, Content Writer, Research Lead, Data Analyst, Grok Researcher, GPT Analyst. Each has a defined role, a concurrency limit, and a batch size. They don't negotiate territory. The boundaries are defined before they start.
Two swarms: Engineering (8 agents) and Growth (8 agents). Some agents serve both swarms but through designated channels only. No agent freelances outside its role.
Factor 11: Trigger From Anywhere, Meet Users Where They Are
Agents shouldn't require someone to open a dashboard and press a button.
Our agents trigger from cron schedules (weekly SEO audits, daily research briefs), webhook events (new signal detected, competitor change), and kanban board state changes (task ready → dispatch). Results deliver via Telegram, Discord, or back to the kanban board.
The trigger mechanism is separate from the agent logic. An agent doesn't know whether it was triggered by cron or webhook. It receives a task and executes. The infrastructure handles distribution.
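The separation between trigger and agent can be sketched as a set of adapters that all produce the same task record. The payload keys and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    agent: str
    instruction: str

# One adapter per trigger source; each hides its own payload shape.
def from_cron(job: dict) -> Task:
    return Task(agent=job["agent"], instruction=job["command"])

def from_webhook(payload: dict) -> Task:
    return Task(agent=payload["route_to"], instruction=payload["event"])

def from_kanban(card: dict) -> Task:
    return Task(agent=card["assignee"], instruction=card["title"])

def run_agent(task: Task) -> str:
    # The agent sees only the task, never the trigger that produced it.
    return f"[{task.agent}] {task.instruction}"

cron_task = from_cron({"agent": "seo_manager", "command": "Run weekly SEO audit"})
board_task = from_kanban({"assignee": "seo_manager", "title": "Run weekly SEO audit"})
```

Because the Task record carries no source field, adding a new trigger means writing one more adapter and touching no agent code.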
Factor 12: Make Your Agent a Stateless Reducer
Each agent run should be a pure function: input → process → output. State lives in the database, not in the agent's memory.
Our governor cron pattern enforces this by design. An agent wakes, reads its task from the database, processes it, writes the result back, and exits. It doesn't maintain a running context between invocations. It doesn't remember the last task it ran. Each run is independent.
This is the hardest principle to maintain as systems grow. The temptation is to let agents accumulate context — let them "learn" across runs. That path leads to state corruption and non-deterministic behavior. Stateless agents are boring. Boring is reliable.
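A stateless run can be sketched as a function from (store, task) to a new store. The dict below stands in for the database, and the record shape is an assumption.

```python
def agent_run(store: dict, task_id: str, process) -> dict:
    """One invocation: read the task, process it, write the result, keep nothing."""
    task = store[task_id]
    result = process(task["input"])
    new_store = dict(store)  # leave the caller's store untouched
    new_store[task_id] = {**task, "output": result, "status": "done"}
    return new_store

before = {"t1": {"input": "summarize signals", "status": "ready"}}
after_first = agent_run(before, "t1", str.upper)
after_second = agent_run(before, "t1", str.upper)
```

Running it twice on the same input gives the same result, because nothing survives between invocations except what was written to the store.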
What This Looks Like in Practice
Twelve agents. Two swarms. Three droplets. A kanban dispatcher that manages lifecycle. Approval gates that block production writes. Budget governors that enforce hard limits. Structured handoffs that keep context clean.
It's not the only architecture. It's the one that survived contact with production. Your stack might look different. But it should share one property: the infrastructure layer is thicker than the agent layer.
The 12-factor agents framework validated something we'd learned the hard way: production agents are mostly infrastructure, not mostly AI. The models are the exciting part. The infrastructure is the part that keeps them from burning your budget at 3 AM.