Most multi-agent AI demos end with a chat log. Ours ends with code deployed to three production domains. Bailian is our agent orchestration system: 12 specialized AI agents running inside Docker, coordinated by LangGraph, backed by a full observability stack, and capable of executing real work — not just planning it.
Architecture: Docker Compose All the Way Down
Bailian runs as a Docker Compose stack with seven services: a FastAPI application server, a LiteLLM proxy for model routing, PostgreSQL for state persistence, Redis for caching and message passing, Prometheus for metrics, Grafana for dashboards, and Jaeger for distributed tracing. One docker compose up --build and the entire system is running.
The coordination layer is built on LangGraph StateGraph. Each agent is a node in the graph. Edges encode who can delegate to whom, what approval gates exist, and where results flow. State is shared through a reducer pattern that makes parallel writes safe — multiple agents can update the shared state simultaneously without overwriting each other.
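The reducer idea can be shown without LangGraph itself. The sketch below is a minimal, library-free stand-in (in the real system this is expressed through LangGraph's reducer annotations; the function and key names here are illustrative): each state key gets a merge function, so two agents writing in the same step both survive.

```python
def merge_lists(existing: list, update: list) -> list:
    """Reducer: append-only merge so parallel writes never clobber each other."""
    return existing + update

def apply_updates(state: dict, updates: list[dict], reducers: dict) -> dict:
    """Fold a batch of agent updates into shared state via per-key reducers."""
    new_state = dict(state)
    for update in updates:
        for key, value in update.items():
            reducer = reducers.get(key)
            if reducer:
                new_state[key] = reducer(new_state.get(key, []), value)
            else:
                new_state[key] = value  # last-write-wins for keys without a reducer
    return new_state

# Two agents write "messages" in the same step; both entries survive the merge.
state = {"messages": []}
updates = [
    {"messages": [{"agent": "dev_1", "text": "API scaffold done"}]},
    {"messages": [{"agent": "seo_manager", "text": "meta tags drafted"}]},
]
merged = apply_updates(state, updates, {"messages": merge_lists})
```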
The CEO Delegation Pattern
Every task starts with the CEO agent. You submit a run — engineering, growth, or hybrid — and the CEO evaluates what needs to happen, breaks the work into subtasks, and dispatches each one to the appropriate specialist. The CEO does not write code. The CEO does not write blog posts. The CEO plans and delegates.
This mirrors how effective human organizations work. A CTO who is writing React components is not doing their job. The CEO agent assigns the Frontend Engineer to build the UI, the Dev Lead to architect the backend, the SEO Manager to optimize the content, and the Content Writer to produce the copy. Each agent works independently and in parallel.
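In code, plan-and-delegate reduces to a routing table. This is a hypothetical sketch (the routing keys, agent names, and task schema are invented for illustration, not taken from Bailian's config): the CEO groups subtasks by the specialist that owns them, and anything it cannot route stays with the CEO.

```python
# Hypothetical routing table: task kind -> owning specialist.
ROUTING = {
    "frontend": "frontend_engineer",
    "backend": "dev_lead",
    "seo": "seo_manager",
    "copy": "content_writer",
}

def delegate(subtasks: list[dict]) -> dict[str, list[dict]]:
    """CEO-style dispatch: group subtasks by the specialist that owns them."""
    assignments: dict[str, list[dict]] = {}
    for task in subtasks:
        agent = ROUTING.get(task["kind"], "ceo")  # unknown work stays with the CEO
        assignments.setdefault(agent, []).append(task)
    return assignments

plan = [
    {"kind": "frontend", "desc": "build the landing page UI"},
    {"kind": "seo", "desc": "add structured data"},
    {"kind": "copy", "desc": "write hero section"},
]
assignments = delegate(plan)
```

Because the result is grouped by agent, each specialist's batch can be dispatched in parallel.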
12 Specialists, 12 Skill Sets
The agent roster is deliberately modeled on a startup team:
- CEO — task planning, delegation, priority management
- Dev Lead — architecture decisions, code review, technical direction
- Dev-1 — hard engineering tasks: complex algorithms, system integration, debugging
- Dev-2 — bulk engineering: scaffolding, boilerplate, repetitive implementation
- Frontend Engineer — UI/UX, HTML/CSS/JS, responsive design
- Marketing Lead — campaign strategy, messaging, positioning
- SEO Manager — technical SEO, keyword research, on-page optimization
- Research Lead — market research, competitor analysis, technology evaluation
- Content Writer — blog posts, landing pages, documentation
- Grok Researcher — real-time information gathering and synthesis
- GPT Analyst — data analysis, pattern recognition, quantitative research
- Data Analyst — metrics, reporting, performance evaluation
Each agent has a dedicated system prompt, a specific set of available tools, and a preferred model. Dev-1 runs on GPT-Codex for maximum code quality. The SEO Manager runs on Qwen 3.5-plus for cost efficiency on text-heavy tasks. The system matches the model to the work.
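A per-agent spec like that is small to express. The sketch below is illustrative only (the prompt text, model identifiers, and tool lists are placeholders, not Bailian's actual configuration): one record per agent binding prompt, tools, and preferred model.

```python
from dataclasses import dataclass, field

@dataclass
class AgentSpec:
    """Per-agent configuration: system prompt, preferred model, available tools."""
    name: str
    system_prompt: str
    model: str
    tools: list[str] = field(default_factory=list)

# Illustrative entries; the real roster lives in Bailian's config.
ROSTER = {
    "dev_1": AgentSpec("dev_1", "You solve hard engineering tasks.",
                       "gpt-codex", tools=["write_file", "read_file", "ssh_exec"]),
    "seo_manager": AgentSpec("seo_manager", "You optimize pages for search.",
                             "qwen-3.5-plus", tools=["write_file", "web_fetch"]),
}

def model_for(agent: str) -> str:
    """Resolve which model an agent should run on."""
    return ROSTER[agent].model
```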
Real Tool Execution: Agents That Do, Not Just Plan
This is where Bailian diverges from most agent frameworks. Our agents have access to workspace tools that affect production systems:
- write_file — create or modify files in the shared workspace
- read_file — inspect existing code and configuration
- cpanel_upload — deploy files directly to production hosting via FTP
- cloudflare_purge — invalidate CDN cache after deployment
- web_fetch — pull live data from URLs for research and validation
- ssh_exec — execute commands on remote servers
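Tools that touch real files need a guardrail. A minimal sketch of a sandboxed write_file, assuming a single workspace root (the path here is hypothetical; Bailian's actual tool implementation is not shown): resolve the target and refuse anything that escapes the workspace.

```python
from pathlib import Path

WORKSPACE = Path("/tmp/bailian_workspace")  # hypothetical root; the real one lives in config

def write_file(rel_path: str, content: str) -> str:
    """Write a file inside the shared workspace, refusing paths that escape it."""
    WORKSPACE.mkdir(parents=True, exist_ok=True)
    target = (WORKSPACE / rel_path).resolve()
    if not target.is_relative_to(WORKSPACE.resolve()):
        raise ValueError(f"path escapes workspace: {rel_path}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return str(target)
```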
When the CEO assigns "deploy an SEO-optimized landing page for avoidtravelscam.com," the workflow is not hypothetical. The Content Writer produces the HTML. The SEO Manager adds structured data and meta tags. The Frontend Engineer reviews the markup. Then cpanel_upload pushes it live and cloudflare_purge clears the cache. The page is on the internet. Real users see it.
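The deploy step itself is a short pipeline. A sketch with stubbed tools (function names match the tool list above; the stubs only record call order rather than touching real hosting):

```python
def deploy_page(html: str, domain: str, tools: dict) -> None:
    """Write -> upload -> purge, in the order the run above describes."""
    path = tools["write_file"](f"{domain}/index.html", html)
    tools["cpanel_upload"](path, domain)
    tools["cloudflare_purge"](domain)

# Stub tools record the call order instead of touching production.
calls: list[str] = []

def fake_write(path: str, content: str) -> str:
    calls.append("write")
    return path

deploy_page("<html>...</html>", "avoidtravelscam.com", {
    "write_file": fake_write,
    "cpanel_upload": lambda path, domain: calls.append("upload"),
    "cloudflare_purge": lambda domain: calls.append("purge"),
})
# calls == ["write", "upload", "purge"]
```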
A prose recovery system handles a common LLM failure mode: sometimes agents describe code in natural language instead of writing it as a file. The recovery system detects this, extracts the code blocks, and writes the files that the agent meant to create.
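The core of such a recovery pass is fenced-block extraction. A minimal sketch (the real system presumably does more, such as inferring file names; this shows only the extraction step):

```python
import re

# Fenced blocks: three backticks, optional language tag, body, closing backticks.
FENCE = re.compile(r"`{3}(\w+)?\n(.*?)`{3}", re.DOTALL)

def recover_code(prose: str) -> list[tuple[str, str]]:
    """Pull fenced code blocks out of an agent's natural-language reply."""
    return [(lang or "text", body) for lang, body in FENCE.findall(prose)]

tick = "`" * 3  # avoids literal fences inside this example
reply = f"Here is the page:\n{tick}html\n<h1>Hi</h1>\n{tick}\nLet me know."
blocks = recover_code(reply)  # [("html", "<h1>Hi</h1>\n")]
```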
The Observability Stack
Prometheus scrapes metrics from every agent execution: latency per agent, tokens consumed, tool calls per run, success and failure rates. Grafana dashboards visualize these in real time — you can watch 12 agents processing a complex task and see exactly where time is being spent.
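Per-agent latency collection comes down to a timing wrapper around agent execution. A stdlib stand-in (in Bailian these would be Prometheus counters and histograms; the metric naming here is illustrative):

```python
import time
from collections import defaultdict

# Stand-in for a Prometheus histogram: metric name -> observed latencies.
METRICS: dict[str, list[float]] = defaultdict(list)

def timed(agent: str):
    """Decorator that records per-agent execution latency."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                METRICS[f"{agent}_latency_seconds"].append(time.perf_counter() - start)
        return inner
    return wrap

@timed("content_writer")
def draft_page() -> str:
    return "<html>...</html>"

draft_page()
```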
Jaeger tracing provides distributed trace visualization across the full request lifecycle. When the CEO delegates to the Dev Lead who delegates to Dev-1, you can follow that chain in Jaeger and see exactly how long each step took, what tools were called, and where failures occurred.
This is not optional instrumentation. When 12 agents are running in parallel, making LLM calls, writing files, and deploying code, you need observability. Without it, debugging is guesswork.
The Model Routing Layer
LiteLLM sits between the agents and the language models, providing a unified API regardless of the underlying provider. The routing logic is layered:
- Primary: GPT-Codex via OpenClaw subscription ($0 per token for 10 of 12 agents)
- Fallback 1: Qwen 3.5-plus via DashScope (preferred for Dev-2 and SEO Manager)
- Fallback 2: GLM-5 or MiniMax-M2.5 via DashScope
If OpenClaw is slow or down, the agent automatically retries on DashScope. The agent does not know or care which model answered. Runs that start on GPT-Codex can finish on Qwen without any loss of context, because the state graph preserves the full conversation history.
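The fallback chain reduces to trying tiers in order until one answers. A sketch with a stubbed provider call (in production LiteLLM handles this; the tier list and error handling here are simplified for illustration):

```python
# Tiers mirror the routing above; the provider call is injected so it can be stubbed.
TIERS = ["gpt-codex", "qwen-3.5-plus", "glm-5"]

def complete(prompt: str, call_model) -> tuple[str, str]:
    """Try each tier in order; return (model, reply) from the first that succeeds."""
    last_error = None
    for model in TIERS:
        try:
            return model, call_model(model, prompt)
        except RuntimeError as exc:  # provider down or timed out
            last_error = exc
    raise RuntimeError("all model tiers failed") from last_error

def flaky(model: str, prompt: str) -> str:
    if model == "gpt-codex":
        raise RuntimeError("provider timeout")  # simulate the primary being down
    return f"{model}: ok"

used, reply = complete("plan the campaign", flaky)  # falls through to qwen-3.5-plus
```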
Bugs That Made It Real
Theory and production are separated by bugs. Three nearly killed the project:
The streaming bug. Agent output was arriving as a stream but being consumed before it was complete. Result: empty strings passed to the tool layer. The fix was buffering the full response before processing — obvious in retrospect, invisible until we traced it.
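The fix is trivial once seen: drain the stream completely before anything downstream reads it. A sketch of the buffering step (names are illustrative):

```python
from typing import Iterable

def buffered(chunks: Iterable[str]) -> str:
    """Drain the whole stream before the tool layer ever sees it."""
    return "".join(chunks)

def fake_stream():
    # Simulates an LLM response arriving in pieces.
    yield "def hello():\n"
    yield "    return 'hi'\n"

full = buffered(fake_stream())
```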
The race condition. Multiple agents writing to PostgreSQL simultaneously were occasionally overwriting each other's state. The LangGraph reducer pattern solved this, but only after we rewrote the state schema to support concurrent updates with proper merge semantics.
The tool loop. Agents would sometimes call a tool, misinterpret the result, and call the same tool again in an infinite cycle. We added a tool-call deduplication layer that detects repeated identical calls and forces the agent to move on.
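A deduplication guard like that can be sketched as a counter keyed on the tool name plus canonicalized arguments (the threshold and class name here are illustrative, not Bailian's actual implementation):

```python
import json

class ToolLoopGuard:
    """Detects repeated identical tool calls so the agent is forced to move on."""
    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.counts: dict[str, int] = {}

    def allow(self, tool: str, args: dict) -> bool:
        # sort_keys makes the key stable regardless of argument order.
        key = f"{tool}:{json.dumps(args, sort_keys=True)}"
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.max_repeats

guard = ToolLoopGuard()
calls = [guard.allow("web_fetch", {"url": "https://example.com"}) for _ in range(3)]
# calls == [True, True, False]: the third identical call is blocked
```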
What It Produced
In a single orchestrated run, Bailian deployed SEO-optimized content to three production domains. The CEO planned the campaign. The Content Writer and SEO Manager produced the pages. The Frontend Engineer validated the markup. The deployment tools pushed everything live. Total human intervention: pressing Enter to start the run.
How It Compares to a Human Team
A 12-person team would take a sprint to do what Bailian does in minutes. But the comparison is not quite fair — humans bring judgment, taste, and the ability to handle truly novel problems. Where Bailian excels is in parallel execution of well-defined tasks: generate 20 SEO pages, deploy to 5 domains, run link audits, produce competitive analyses. Work that is important but repetitive, and that scales linearly with headcount in a human org.
The system runs 24/7. It does not need standups. It does not have context-switching overhead. When one agent finishes, it immediately picks up the next task. The cost is a Docker Compose stack and some API calls.
Bailian is not a demo. It is a production system that deploys real code to real websites. We will keep writing about what it builds, what breaks, and what we learn from giving AI agents real tools and real responsibility.