TACAVAR
Trading Systems

Paper Trading vs. Live: Why We're Waiting 90 Days

Most trading bots go live in week 1. We're waiting 90 days. Here's why paper trading first is the only way to build infrastructure you can actually trust.

Bottom line: Backtests lie. Paper trading lies a lot less. We're finding bugs in week 3 that would have cost real money in live mode. That's the point.

The Problem: Everyone Goes Live Too Fast

Browse any trading bot marketplace. Read the landing pages. They all say the same thing:

"Deploy in 5 minutes. Start earning today."

"Backtested at 87% win rate. Ready for live capital."

"Our users average +15% monthly returns."

Here's what they don't tell you:

  • Backtests don't capture slippage on market orders
  • They don't model API rate limits or downtime
  • They don't account for latency between signal and execution
  • They definitely don't capture the psychological pressure of real money
  • And they can't predict the bugs that only appear under live load

So teams deploy to production. Something breaks. They lose money. Then they figure out what went wrong.

We're doing it differently.

Our Approach: 90 Days of Paper Trading

On Day 1 of our challenge, we made one rule:

No real capital until we complete 90 days of paper trading.

Not 30 days. Not "until we're confident." Ninety days. Minimum.

Here's what we're doing in that time:

1. Running the Full Stack in Production

Everything runs exactly as it would with real money:

  • Real-time market data from 12+ APIs
  • LLM decision layer (Qwen3.5-plus + critic validation)
  • Full risk management stack (circuit breakers, position limits, cooldowns)
  • Telegram alerts for every signal and decision
  • Dashboard updates every 60 seconds

The only difference: dry_run = true at the execution layer. Orders get logged, not sent to the exchange.
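A minimal sketch of how a dry-run switch at the execution layer might look. The `Order` and `ExecutionLayer` names and the `place_order()` method on the exchange client are hypothetical, chosen for illustration; only the idea (log the order instead of sending it) comes from the setup described above.

```python
import json
import logging
from dataclasses import dataclass, asdict

logger = logging.getLogger("execution")


@dataclass(frozen=True)
class Order:
    symbol: str
    side: str        # "buy" or "sell"
    quantity: float
    price: float


class ExecutionLayer:
    """Routes orders to the exchange, or only logs them when dry_run is set."""

    def __init__(self, exchange_client=None, dry_run: bool = True):
        # exchange_client is a hypothetical object exposing place_order(order)
        self.exchange_client = exchange_client
        self.dry_run = dry_run
        self.paper_orders = []

    def submit(self, order: Order) -> str:
        if self.dry_run:
            # Paper mode: record the order exactly as it would have been sent.
            self.paper_orders.append(order)
            logger.info("DRY RUN order: %s", json.dumps(asdict(order)))
            return "logged"
        if self.exchange_client is None:
            raise RuntimeError("live mode requires an exchange client")
        self.exchange_client.place_order(order)
        return "sent"
```

Keeping the switch at the very last step means everything upstream (data, signals, risk checks, alerts) runs the same code path in paper and live mode.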

2. Modeling Slippage Realistically

Paper trading is useless if you assume you get the exact mid-market price. You don't.

Our paper trading engine applies:

  • 0.1% slippage on major pairs (BTC, ETH)
  • 0.3% on mid-cap alts
  • 0.5-2% on thin Polymarket order books
  • Additional slippage during high volatility (ATR > 2× average)

This isn't perfect. But it's closer to reality than assuming zero slippage.
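As a rough illustration, the slippage tiers above could be applied to simulated fills like this. The tier percentages and the ATR condition come from the list; the size of the high-volatility bump (`HIGH_VOL_MULTIPLIER`) and the `book_thinness` score used to pick a point in the 0.5-2% Polymarket range are assumptions, since the post does not specify them.

```python
# Base slippage per asset tier, taken from the figures above.
BASE_SLIPPAGE = {
    "major": 0.001,    # BTC, ETH
    "mid_cap": 0.003,  # mid-cap alts
}
POLYMARKET_RANGE = (0.005, 0.02)  # 0.5% to 2% on thin order books
HIGH_VOL_MULTIPLIER = 2.0         # assumption: the post gives no number


def slippage_rate(tier: str, atr: float, avg_atr: float,
                  book_thinness: float = 0.0) -> float:
    """Return the fractional slippage for one simulated fill.

    book_thinness is a hypothetical 0..1 score, used only for the
    "polymarket" tier to interpolate between 0.5% and 2%.
    """
    if tier == "polymarket":
        low, high = POLYMARKET_RANGE
        thinness = min(max(book_thinness, 0.0), 1.0)
        rate = low + (high - low) * thinness
    else:
        rate = BASE_SLIPPAGE[tier]
    if atr > 2 * avg_atr:  # high-volatility condition from the list above
        rate *= HIGH_VOL_MULTIPLIER
    return rate


def simulated_fill_price(mid_price: float, side: str, rate: float) -> float:
    """Slippage always works against the trader: buys fill above mid, sells below."""
    if side == "buy":
        return mid_price * (1 + rate)
    if side == "sell":
        return mid_price * (1 - rate)
    raise ValueError(f"unknown side: {side!r}")
```

For example, a buy on a major pair at a mid price of 100.00 in normal volatility would be filled at about 100.10 in this sketch.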

3. Finding Bugs Before They Cost Money

Week 2: We discovered a clustering bug. The bot took three correlated ETH positions within 90 minutes — essentially one oversized bet wearing three disguises.

Impact in paper mode: -$8.40 simulated loss.

Impact if live with 10× leverage: ~$80-100 real loss, plus emotional pressure to override the system.

We shipped cluster detection that afternoon. Week 3, the same scenario appeared on BTC. This time:

  • Signal 1: Executed
  • Signal 2: Blocked (cooldown active)
  • Signal 3: Blocked (same underlying)

Loss: $23.10 instead of roughly $70 if all three signals had executed with a similar loss each. The fix paid for itself in one incident.
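The two block reasons above (cooldown active, same underlying) can be sketched as a small guard that every signal passes through before execution. This is an illustration under assumptions, not the shipped code: the `ClusterGuard` name is hypothetical, and the cooldown simply reuses the 90-minute window from the Week 2 incident.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ClusterGuard:
    # Assumption: the cooldown matches the 90-minute window of the original incident.
    cooldown: timedelta = timedelta(minutes=90)
    last_entry: dict = field(default_factory=dict)       # underlying -> last entry time
    open_underlyings: set = field(default_factory=set)   # underlyings with an open position

    def check(self, underlying: str, now: datetime):
        """Return (allowed, reason) for a new signal on `underlying`."""
        last = self.last_entry.get(underlying)
        if last is not None and now - last < self.cooldown:
            return False, "cooldown active"
        if underlying in self.open_underlyings:
            return False, "same underlying"
        return True, "ok"

    def record_entry(self, underlying: str, now: datetime) -> None:
        self.last_entry[underlying] = now
        self.open_underlyings.add(underlying)

    def record_exit(self, underlying: str) -> None:
        self.open_underlyings.discard(underlying)
```

Correlated instruments would need to be mapped to a shared underlying key (for example, every ETH-based market to "ETH") before calling `check`, otherwise each one looks like an independent bet.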

4. Testing the LLM Under Load

We're tracking LLM performance across 1,000+ decisions:

  • Latency: GPT-5.4 averages 7.2s, GLM-5 averages 46.8s
  • Confidence scores: GPT-5.3 highest at 0.694 average
  • Fallback rate: GPT-5.4 has 27% fallback (JSON parsing issues)
  • Critic quality: 42.9% rated acceptable (figure distorted by test data pollution, which is being fixed)

This data is impossible to get from backtests. You need live load to understand how your LLM stack actually performs.
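To show where numbers like latency and fallback rate can come from, here is a hedged sketch of a fallback chain that records both. A "model" is any callable that takes a prompt and returns raw text; real clients, timeouts, and model names are outside this sketch, and the `confidence` field in the JSON is an assumed schema.

```python
import json
import time
from typing import Callable, Optional

ModelFn = Callable[[str], str]  # prompt in, raw response text out


def decide_with_fallback(prompt: str, chain: list, stats: dict) -> Optional[dict]:
    """Try each (name, model) pair in order; fall through on errors or bad JSON."""
    for name, call in chain:
        entry = stats.setdefault(name, {"calls": 0, "fallbacks": 0, "latency_s": 0.0})
        entry["calls"] += 1
        start = time.monotonic()
        try:
            raw = call(prompt)
            decision = json.loads(raw)
            if not isinstance(decision, dict) or "confidence" not in decision:
                raise ValueError("response is missing the confidence field")
        except (ValueError, TimeoutError):
            # json.JSONDecodeError is a ValueError, so parse failures land here too.
            entry["fallbacks"] += 1
            continue
        finally:
            # Latency is recorded for successes and failures alike.
            entry["latency_s"] += time.monotonic() - start
        decision["model"] = name
        return decision
    return None  # every model failed; the caller should treat this as "no trade"
```

The fallback rate for a model is then `fallbacks / calls`, and average latency is `latency_s / calls`.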

5. Validating Risk Controls

Our circuit breakers have triggered multiple times:

  • Consecutive loss pauses (after 3 straight losses)
  • Regime-based suspensions (mean reversion blocked during trending markets)
  • Cluster detection vetoes (correlated signals blocked)
  • Drawdown-based position sizing (automatic reduction at 3% drawdown)

Each trigger is a test. Did the system respond correctly? Did it prevent a bad trade? Or was it a false positive?

We're tuning based on real data, not hypotheticals.
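Two of the controls above, the consecutive-loss pause and drawdown-based sizing, are simple enough to sketch. The thresholds (3 straight losses, 3% drawdown) come from the list; the `RiskControls` name, the 50% size reduction, and the manual `reset_pause()` are assumptions made for illustration.

```python
class RiskControls:
    """Tracks losses and drawdown, and sizes positions accordingly."""

    def __init__(self, starting_equity: float,
                 max_consecutive_losses: int = 3,    # from the list above
                 drawdown_threshold: float = 0.03,   # 3% drawdown, from the list above
                 reduced_size_factor: float = 0.5):  # assumption: not stated in the post
        self.equity = starting_equity
        self.peak_equity = starting_equity
        self.max_consecutive_losses = max_consecutive_losses
        self.drawdown_threshold = drawdown_threshold
        self.reduced_size_factor = reduced_size_factor
        self.consecutive_losses = 0

    def record_trade(self, pnl: float) -> None:
        self.equity += pnl
        self.peak_equity = max(self.peak_equity, self.equity)
        # Any non-losing trade resets the streak.
        self.consecutive_losses = self.consecutive_losses + 1 if pnl < 0 else 0

    @property
    def drawdown(self) -> float:
        """Current decline from peak equity, as a positive fraction."""
        return (self.peak_equity - self.equity) / self.peak_equity

    @property
    def paused(self) -> bool:
        return self.consecutive_losses >= self.max_consecutive_losses

    def reset_pause(self) -> None:
        """Hypothetical manual or timed resume after a pause."""
        self.consecutive_losses = 0

    def position_size(self, base_size: float) -> float:
        if self.paused:
            return 0.0
        if self.drawdown >= self.drawdown_threshold:
            return base_size * self.reduced_size_factor
        return base_size
```

Logging every call where `position_size` returns less than `base_size` gives exactly the kind of trigger record that can later be reviewed for false positives.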

What We've Learned (So Far)

Three weeks in, here are the biggest lessons:

Lesson 1: Regime Detection Matters More Than Strategy

A mediocre strategy in the right regime outperforms a great strategy in the wrong regime.

Week 3, our regime classifier flipped from "ranging" to "trending" for the first time. Mean reversion strategies auto-suspended. Momentum strategies upweighted. Two momentum trades executed — both profitable.

This is the adaptive behavior we designed for. And it worked on the first real test.
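The suspend-and-upweight behavior can be sketched as a lookup from regime to strategy weights. How the regime itself is classified is not covered here, and the specific weights are illustrative assumptions rather than production values.

```python
# Weight per strategy family in each regime. A weight of 0.0 means suspended.
# These numbers are illustrative assumptions.
REGIME_WEIGHTS = {
    "ranging": {"mean_reversion": 1.0, "momentum": 0.5},
    "trending": {"mean_reversion": 0.0, "momentum": 1.5},
}


def strategy_weight(regime: str, strategy_family: str) -> float:
    """Unknown regimes or strategies default to 0.0, i.e. do not trade."""
    return REGIME_WEIGHTS.get(regime, {}).get(strategy_family, 0.0)


def filter_signals(regime: str, signals: list) -> list:
    """Drop signals from suspended strategies and scale the size of the rest."""
    kept = []
    for signal in signals:
        weight = strategy_weight(regime, signal["strategy"])
        if weight > 0:
            kept.append({**signal, "size": signal["size"] * weight})
    return kept
```

Defaulting unknown regimes to zero weight is a deliberate choice in this sketch: if the classifier emits something unexpected, the system stays flat instead of guessing.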

Lesson 2: The LLM Is a Great Brake Pedal

We don't use the LLM to predict price direction. We use it to spot contradictions.

When the LLM says "I'm not sure," that's valuable information. It's better at identifying bad trades than identifying great ones.

Week 1: 14 signals identified, 0 executed. The LLM rejected or queued every single one due to low confidence or conflicting indicators.

That's not a bug. That's the system working as intended.
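In code, the "brake pedal" role might look like a gate that only ever narrows what gets executed. This is a sketch under stated assumptions: the function name, both thresholds, and the choice to reject outright on any conflict are hypothetical, not the values in use.

```python
def gate_signal(confidence: float, conflicts: list,
                execute_threshold: float = 0.75,  # hypothetical
                queue_threshold: float = 0.5) -> str:  # hypothetical
    """Return "execute", "queue", or "reject" for one signal.

    conflicts is a list of contradictions the LLM found between indicators.
    """
    if conflicts:
        return "reject"   # contradictions veto the trade regardless of confidence
    if confidence >= execute_threshold:
        return "execute"
    if confidence >= queue_threshold:
        return "queue"    # hold for re-evaluation when more data arrives
    return "reject"
```

Note that nothing in the gate can create a trade. It can only pass, delay, or block a signal that a strategy already produced.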

Lesson 3: Infrastructure Is 90% of the Work

The trading strategies? Maybe 10% of the codebase.

The rest is:

  • Rate limit management across 12 APIs
  • WebSocket reconnection logic
  • LLM fallback chains with timeout handling
  • Audit logging for every decision
  • Dashboard data pipelines
  • Telegram bot integration

Anyone can write a mean reversion strategy. Building infrastructure that runs 24/7 without breaking? That's the hard part.
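As one small example of that infrastructure work, reconnection logic usually needs a delay schedule so a dropped WebSocket does not hammer the server. The sketch below uses capped exponential backoff with optional jitter; it is a generic technique, not necessarily what this stack uses, and all parameter values are assumptions.

```python
import random


def backoff_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 6,
                   jitter: float = 0.0, rng=random.random):
    """Yield capped, exponentially growing delays in seconds.

    jitter adds up to `jitter` * delay of random extra wait, so that many
    clients reconnecting at once do not retry in lockstep.
    """
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        yield delay * (1 + jitter * rng())
```

A reconnect loop would sleep for each yielded delay between attempts and start a fresh schedule once a connection succeeds.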

Lesson 4: Transparency Creates Discipline

Publishing weekly reports forces us to be honest about what's working and what's not.

When you know you'll have to explain a loss in public, you think harder about risk before pulling the trigger.

Building in public isn't just marketing. It's a discipline mechanism.

The Results (Week 3)

90-Day Challenge — Week 3 Summary

  • Days elapsed: 21
  • Cumulative P&L: +$244.40
  • Win rate: 60%
  • Max drawdown: -0.9%

Paper trading only. Not indicative of future performance.

Is +$244.40 impressive? No. It's 2.44% in 21 days on simulated capital.

What's impressive:

  • Zero days of downtime
  • Cluster detection blocked a correlated loss (~$50 saved)
  • Regime classifier worked on first real test
  • LLM fallback chain held under load
  • Every decision logged and published

We're building infrastructure that might actually survive. That's the goal.

When Do We Go Live?

Day 91 — if we hit these gates:

Live Trading Gates:

  • Positive Sharpe ratio — risk-adjusted returns, not just raw P&L
  • Max drawdown < 10% — prove we can survive a bad stretch
  • 50+ trades executed — statistically significant sample size
  • Zero critical bugs — no unresolved issues that could cause catastrophic loss
  • All risk controls tested — circuit breakers, cooldowns, regime detection all validated

If we don't hit these gates by Day 90? We extend the paper phase. No ego. No "close enough." Either the system earns the right to go live, or it doesn't.
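The quantitative gates can be checked mechanically. Here is a minimal sketch, assuming daily returns and an equity curve are available as plain lists. The Sharpe calculation assumes a zero risk-free rate and 365 trading days a year (crypto markets do not close), and the function names are hypothetical.

```python
import math
import statistics


def sharpe_ratio(daily_returns: list, periods_per_year: int = 365) -> float:
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    if len(daily_returns) < 2:
        return 0.0
    stdev = statistics.stdev(daily_returns)
    if stdev == 0:
        return 0.0
    return statistics.mean(daily_returns) / stdev * math.sqrt(periods_per_year)


def max_drawdown(equity_curve: list) -> float:
    """Largest peak-to-trough decline, as a positive fraction of the peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def live_gates(daily_returns, equity_curve, trade_count,
               open_critical_bugs, risk_controls_validated) -> dict:
    """Evaluate each go-live gate; all must be True to go live."""
    return {
        "positive_sharpe": sharpe_ratio(daily_returns) > 0,
        "max_drawdown_under_10pct": max_drawdown(equity_curve) < 0.10,
        "at_least_50_trades": trade_count >= 50,
        "zero_critical_bugs": open_critical_bugs == 0,
        "risk_controls_tested": bool(risk_controls_validated),
    }
```

Going live would then require `all(live_gates(...).values())`. The last two gates still depend on human judgment; the code only records the answer.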

Why We're Sharing This

Most trading bot teams hide their development process. Black box systems. Unverifiable claims. Screenshot "proof" that could be faked.

We're doing the opposite. Every week:

  • Full trade history (wins, losses, reasoning)
  • Technical breakdowns (what broke, how we fixed it)
  • Performance metrics (win rate, P&L, drawdown)
  • LLM performance data (latency, confidence, fallback rates)

Transparency builds trust. You can't verify our claims yet. But you can watch us build. You can see the decisions. You can judge whether we're taking this seriously.

After 90 days, you'll know whether this system works. Not because we'll tell you. Because you'll have watched every step.


Follow the 90-Day Challenge

Every Monday: new transparency report. Real data, real lessons, zero hype. Watch us build trading infrastructure that might actually survive.