Understanding Snowden
A technical deep-dive for developers. We walk through every piece of this autonomous prediction market trading system, from scanning Polymarket to placing Kelly-sized bets with LLM-generated probability estimates.
What Are We Building?
Snowden is an autonomous trading system for Polymarket, the world's largest prediction market, where binary contracts trade on real-world outcomes: elections, Fed rate decisions, geopolitical events, crypto price targets. Each contract settles at $1 (YES) or $0 (NO), and trades at a price between 0 and 1 that represents the market's implied probability.
Every 15 minutes, Snowden wakes up and runs a full cycle: scan hundreds of live markets, call Claude Opus to estimate true probabilities, size positions with the Kelly criterion, enforce strict risk limits through an automated Sentinel, and execute trades on the Polymarket CLOB (central limit order book). Everything is logged to TimescaleDB and visualized in Grafana.
The Thesis
Prediction markets are efficient in aggregate but have persistent micro-inefficiencies that a systematic approach can exploit. These inefficiencies cluster around a few patterns:
- Stale markets · thin liquidity, wide spreads, prices that haven't moved despite new information
- Longshot bias · small-probability events consistently overpriced (people love lottery tickets)
- Partisan bias · political markets where one side's money distorts the price away from polling data
- News latency · prices adjust slowly to new information, especially in low-volume markets
Snowden doesn't try to beat the market on every question. It runs a funnel to find the 10–15 markets per cycle where edge is most likely, then sizes bets conservatively enough to survive the inevitable losing streaks.
The Pipeline
| Metric | Value |
|---|---|
| Scan interval | 15 minutes |
| Markets scanned | 500+ per cycle |
| Funnel output | 10–15 opportunities |
| Position sizing | Quarter-Kelly (f/4) |
| Risk limits | 80% heat, 10% daily drawdown, 40% correlated |
| Starting bankroll | $2,000 USDC |
| LLM backbone | Claude Opus 4.6 (analyst) + Haiku 4.5 (triage) |
| Database | TimescaleDB (PostgreSQL 16 + hypertables) |
The Foundation
Configuration
All configuration flows through a single Pydantic Settings class that reads from
environment variables and an optional .env file. No YAML, no TOML, no
config hierarchy to debug. Docker Compose sets the infra variables; everything else
has sensible defaults.
The module-level settings = Settings() is the only global state in the
system. Every other module imports it by name.
class Settings(BaseSettings):
    model_config = {"env_prefix": "", "env_file": ".env", "extra": "ignore"}

    # Mode: "paper" or "live"
    mode: str = "paper"

    # Database
    tsdb_host: str = "localhost"
    tsdb_port: int = 5432
    tsdb_db: str = "snowden"
    tsdb_user: str = "snowden"
    tsdb_password: str = "snowden"

    # Polymarket API credentials
    poly_api_key: str = ""
    poly_api_secret: str = ""
    poly_private_key: str = ""

    # Risk parameters
    max_heat: float = 0.80             # 80% of equity deployed
    max_single_position: float = 0.25  # 25% max per market
    max_daily_drawdown: float = 0.10   # 10% triggers kill switch
    max_correlated: float = 0.40       # 40% per category
    kelly_divisor: float = 4.0         # quarter-Kelly
    edge_threshold: float = 0.05       # 5% minimum edge to trade

    @property
    def tsdb_dsn(self) -> str:
        return (
            f"postgresql://{self.tsdb_user}:{self.tsdb_password}"
            f"@{self.tsdb_host}:{self.tsdb_port}/{self.tsdb_db}"
        )

    @property
    def is_paper(self) -> bool:
        return self.mode == "paper"


settings = Settings()  # module-level singleton
Pydantic Settings gives us runtime type validation for free. If someone sets
TSDB_PORT=banana, the app fails immediately at startup with a clear
validation error, not deep in a database connection two hours later.
The "extra": "ignore" flag means unrecognized env vars are silently
dropped, so you can share a .env across services without conflicts.
The Type System
Every data shape in the system lives in a single file: types.py.
Pydantic models for structured data, StrEnum for categorical values,
and Protocol interfaces for swappable backends. This file is the
contract between every module. If you change a type here, the
type checker catches every downstream breakage.
class Regime(StrEnum):
"""Market regime drives strategy selection."""
CONSENSUS = "consensus" # market agrees, little edge
CONTESTED = "contested" # genuine disagreement
CATALYST = "catalyst_pending" # event upcoming that will move price
RESOLVING = "resolution_imminent"
STALE = "stale" # no one is paying attention
NEWS_DRIVEN = "news_driven" # recent news shifted reality
class Strategy(StrEnum):
THETA = "theta_harvest" # near-certain outcomes
LONGSHOT_FADE = "longshot_fade" # overpriced tails
NEWS_LATENCY = "news_latency" # slow price adjustment
PARTISAN_FADE = "partisan_fade" # political bias
CORRELATED_ARB = "correlated_arb" # linked market mispricing
STALE_REPRICE = "stale_reprice" # abandoned markets
@runtime_checkable
class MarketClient(Protocol):
"""Swappable backend: LiveClient for real trading, SimClient for paper."""
async def get_active_markets(self) -> pl.DataFrame: ...
async def get_book(self, token_id: str) -> dict: ...
async def get_midpoint(self, token_id: str) -> float: ...
async def execute(self, signal: TradeSignal) -> OrderResult: ...
The Protocol interfaces are the key abstraction. LiveClient
and SimClient both satisfy MarketClient through structural
typing: no inheritance, no registration, no abstract base class.
If a class has the right methods with the right signatures, it's a valid
MarketClient. This makes testing trivial: any object with the right
shape works.
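To see structural typing in action, here is a minimal sketch with the protocol cut down to a single method for brevity. The test double never mentions the protocol, yet satisfies it:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class MarketClient(Protocol):
    """Cut-down version of the real protocol (one method for brevity)."""
    async def get_midpoint(self, token_id: str) -> float: ...


class FakeClient:
    """A test double: no inheritance, no registration -- just the right shape."""
    async def get_midpoint(self, token_id: str) -> float:
        return 0.42  # canned midpoint for tests


# FakeClient never names MarketClient, yet structurally satisfies it.
assert isinstance(FakeClient(), MarketClient)
```

One caveat: `isinstance` against a `runtime_checkable` protocol only verifies that the methods exist, not their signatures; signature-level verification happens statically in the type checker.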
The Scanner
The scanner is a 5-stage funnel that reduces 500+ active Polymarket markets down to 10–15 tradeable opportunities. Stages 1–4 are pure data operations (Polars and Python). Stage 5 is a single, cheap Haiku call to filter out noise before expensive Opus analysis.
The design principle: each stage is cheap enough to run on every market, and aggressive enough to cut the dataset significantly. By the time we reach the LLM, we've already eliminated 95% of markets through deterministic rules.
Stage 1: Fetch
Paginated fetch of all active markets from Polymarket's Gamma API. Events (which can contain multiple binary markets) are flattened into individual rows. Token IDs, outcome prices, volumes, and resolution dates are normalized into a Polars DataFrame.
This is the only network-heavy stage. A typical fetch takes 1–2 seconds and returns 500–800 markets depending on platform activity.
Stage 2: Liquidity Gate
A pure Polars filter with four hard requirements. Any market that fails any condition is dropped immediately.
def stage_2_liquidity_gate(df: pl.DataFrame) -> pl.DataFrame:
"""Filter markets by minimum liquidity requirements."""
return df.filter(
(pl.col("vol_24h") >= settings.min_liquidity_usd) # >= $5,000
& ((pl.col("bid_depth") + pl.col("ask_depth"))
>= settings.min_book_depth_usd) # >= $500 total
& (pl.col("spread") <= settings.max_spread) # <= 8%
& (pl.col("hours_to_resolve") >= settings.min_hours_to_resolve) # >= 24 hours
& (pl.col("hours_to_resolve") <= settings.max_days_to_resolve * 24) # <= 180 days
)
Why filter on resolution time? Markets resolving within 24 hours are usually efficiently priced: too many eyeballs, too little time for the price to drift. Markets beyond 180 days have too much uncertainty for our edge to compound meaningfully, and capital is tied up too long. The sweet spot is 1–4 weeks: enough time for information asymmetry to exist, short enough that your capital turns over.
Stage 3: Efficiency Score
A composite score estimating how "beatable" each market is. Five weighted components combine into a score between 0 and 1, where lower means less efficient (i.e., more likely to have exploitable mispricing). Markets scoring above 0.4 are dropped.
| Component | Weight | Logic |
|---|---|---|
| Spread | 25% | Wider spread = less liquidity = less efficient |
| Volume | 20% | Lower 24h volume = fewer participants = less efficient |
| Book depth | 15% | Shallow order book = easier to move the price |
| Price extremity | 15% | Prices near 0 or 1 have known tail biases |
| Time window | 25% | 1–4 week resolution = ideal sweet spot for mispricing |
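The composite might be computed as follows. The weights come from the table above; the per-component normalizations (the volume and depth saturation points, the 8% spread ceiling) are illustrative assumptions, not the production code:

```python
# Sketch of the Stage 3 composite score. Weights are from the table; the
# normalizations are assumed values for illustration.
WEIGHTS = {"spread": 0.25, "volume": 0.20, "depth": 0.15,
           "extremity": 0.15, "time_window": 0.25}

def efficiency_score(spread: float, vol_24h: float, depth_usd: float,
                     mid: float, days_to_resolve: float) -> float:
    """Composite in [0, 1]; lower = less efficient = more interesting."""
    components = {
        # Tight spread -> efficient; at an 8% spread the component hits 0
        "spread": 1.0 - min(spread / 0.08, 1.0),
        # High 24h volume -> efficient; saturates at $100k (assumed)
        "volume": min(vol_24h / 100_000, 1.0),
        # Deep book -> efficient; saturates at $10k (assumed)
        "depth": min(depth_usd / 10_000, 1.0),
        # Mid near 0.5 -> fewer tail biases -> more efficient
        "extremity": 1.0 - 2.0 * abs(mid - 0.5),
        # Outside the 1-4 week sweet spot -> treated as more efficient
        "time_window": 0.0 if 7 <= days_to_resolve <= 28 else 1.0,
    }
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)
```

Under these assumptions, a thin, quiet market in the sweet spot scores well under the 0.4 cutoff and survives, while a deep, tight, 50/50 market resolving tomorrow scores near 1 and is dropped.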
Stage 4: Strategy Match
Each surviving market is classified into one or more strategy buckets based on its price, volume, category, and spread characteristics. Markets that don't match any known pattern are dropped. Priority scoring combines estimated edge, confidence modifier, liquidity, and time decay into a single sortable number.
| Strategy | Trigger Condition | Edge Estimate | Why It Works |
|---|---|---|---|
| Theta Harvest | mid ≥ 0.88 or ≤ 0.12 | distance to boundary | Near-certain outcomes trade at a discount to certainty |
| Longshot Fade | mid ≤ 0.08 or ≥ 0.92 | tail × 0.5 | People overpay for lottery-ticket outcomes |
| Stale Reprice | Low vol + wide spread | ~4% | No one is watching; reality has moved on |
| Partisan Fade | Political + mid 0.25–0.75 | ~6% | Partisan money pushes prices away from polling data |
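The trigger rules in the table translate almost directly into code. A sketch, with the "low volume / wide spread" thresholds for Stale Reprice assumed for illustration:

```python
# Illustrative re-implementation of the Stage 4 trigger table; the
# stale-reprice thresholds ($10k volume, 5% spread) are assumed values.
def match_strategies(mid: float, vol_24h: float, spread: float,
                     is_political: bool) -> list[str]:
    matched = []
    if mid >= 0.88 or mid <= 0.12:
        matched.append("theta_harvest")   # near-certain outcomes
    if mid <= 0.08 or mid >= 0.92:
        matched.append("longshot_fade")   # overpriced tails
    if vol_24h < 10_000 and spread > 0.05:
        matched.append("stale_reprice")   # no one is watching
    if is_political and 0.25 <= mid <= 0.75:
        matched.append("partisan_fade")   # contested political market
    return matched  # empty list => market is dropped in Stage 4
```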
Stage 5: Haiku Triage
One batch call to Claude Haiku with all remaining candidates. The model reads each market's question, mid price, matched strategies, and volume, then picks the 10–15 that merit deep analysis. This is a cheap filter (one Haiku call costs ~$0.002) ahead of the expensive Opus calls (~$0.05 each).
async def stage_5_haiku_triage(
candidates: list[ScanResult],
anthropic_client: anthropic.AsyncAnthropic,
) -> list[ScanResult]:
"""Haiku pre-screen: is this worth deep Analyst analysis?"""
batch_text = "\n".join(
f"[{i}] Q: {c.market.question} | Mid: {c.market.mid:.2f} | "
f"Strategies: {', '.join(s.value for s in c.matched_strategies)} | "
f"Vol24h: ${c.market.vol_24h:,.0f}"
for i, c in enumerate(candidates)
)
response = await anthropic_client.messages.create(
model=settings.triage_model, # claude-haiku-4-5
max_tokens=500,
system="Select 10-15 markets worth deep analysis. Skip obviously "
"efficient markets. Respond with comma-separated indices only.",
messages=[{"role": "user", "content": batch_text}],
)
indices = [int(x.strip()) for x in response.content[0].text.split(",")
if x.strip().isdigit()]
return [candidates[i] for i in indices if i < len(candidates)]
The Analyst
The Analyst is the most expensive and most important component: a Claude Opus call for each candidate market, asking the model to estimate the true probability of the event occurring, independent of what the market currently prices. The prompt engineering is the core intellectual property of the system.
The System Prompt
The system prompt is designed to counteract known LLM failure modes in probability estimation. Without explicit guidance, language models anchor on the market price, fall for narrative bias, exhibit recency bias, and cluster their estimates at round numbers (50%, 75%, 90%). The prompt addresses each of these directly.
ANALYST_SYSTEM_PROMPT = (
"You are a professional prediction market analyst. Estimate the TRUE "
"probability of an event, independent of the market price.\n\n"
"CALIBRATION RULES:\n"
"1. When you say 70%, the event should happen ~70% of the time.\n"
"2. Base estimates on EVIDENCE, not narrative. Weight hard data\n"
" (polls, filings, schedules) over soft signals (sentiment, vibes).\n"
"3. Political markets: weight polling aggregates over pundit takes.\n"
"4. Distinguish 'I don't know' (low confidence, near market price)\n"
" from 'the market is wrong' (high confidence, far from market).\n"
"5. Biases to AVOID:\n"
" - Anchoring on the current market price\n"
" - Narrative bias (good story != high probability)\n"
" - Recency bias (last week's news != permanent shift)\n"
" - Round number bias (don't cluster at 50%, 75%, 90%)\n"
"6. If evidence is thin, say so. Set confidence LOW.\n\n"
"IMPORTANT: p_est_raw is YOUR raw estimate BEFORE calibration.\n"
"The system applies Platt scaling separately. Give your honest best."
)
Analysis Flow
For each candidate market, the Analyst executes a four-step pipeline:
1. Fetch news via category-specific RSS feeds (politics, crypto, finance, sports, legal). Feeds are parsed with feedparser, deduplicated by title prefix, sorted by recency, and capped at 15 items. General news feeds are always included alongside the category-specific ones.
2. Build prompt with market data (mid, bid/ask, spread, volume, open interest), 7-day price history, matched strategies from Stage 4, resolution source, and formatted news context. The description is capped at 500 characters to stay focused.
3. Call Claude Opus. The model returns a JSON object with its raw probability estimate (p_est_raw), confidence level, regime classification, reasoning (capped at 3 sentences), and a strategy hint.
4. Apply Platt scaling via calibrator.correct(raw_est) to transform the raw LLM probability into a calibrated estimate. The calibrated value becomes p_est and is used for all downstream decisions.
async def analyze_market(
scan: ScanResult, calibrator: Calibrator,
client: anthropic.AsyncAnthropic | None = None,
) -> EventAnalysis | None:
# Step 1: Fetch fresh news for this market's category
news_items = await fetch_news_for_market(
scan.market.question, scan.market.category.value,
)
scan.news_headlines = [item.title for item in news_items]
# Step 2: Build the analyst prompt
prompt = build_analyst_prompt(scan)
# Step 3: Call Opus, parse JSON response
response = await client.messages.create(
model=settings.analyst_model, # claude-opus-4-6
max_tokens=settings.analyst_max_tokens,
system=ANALYST_SYSTEM_PROMPT,
messages=[{"role": "user", "content": prompt}],
)
data = json.loads(response.content[0].text.strip()
.replace("```json", "").replace("```", ""))
# Step 4: Apply calibration correction
raw_est = float(data["p_est_raw"])
calibrated = calibrator.correct(raw_est)
return EventAnalysis(
market_id=data["market_id"], question=data["question"],
p_market=float(data["p_market"]),
p_est=calibrated, p_est_raw=raw_est,
confidence=float(data["confidence"]),
regime=Regime(data["regime"]),
edge=round(calibrated - float(data["p_market"]), 4),
reasoning=data["reasoning"],
key_factors=data.get("key_factors", []),
data_quality=float(data.get("data_quality", 0.5)),
strategy_hint=Strategy(data["strategy_hint"]) if data.get("strategy_hint") else None,
)
The Analyst outputs p_est_raw, its honest best estimate
before any correction. The system then applies Platt scaling (logistic regression on
historical log-odds vs outcomes) to fix systematic bias. This separation is critical:
the LLM gives its best guess, statistics fix the systematic errors. Over time, the
calibrator learns whether Claude tends to be overconfident in the 60–80% range,
or underconfident near the tails, and corrects for it automatically.
Kelly Criterion
The Kelly criterion determines optimal bet sizing for repeated wagers with a known edge. In prediction markets, each trade is a binary bet: the contract settles at $1 (YES) or $0 (NO). If our estimated probability differs from the market price, we have edge, and Kelly tells us exactly how much of our bankroll to wager.
The Math
For buying YES at market price p_market with estimated true probability
p_est, the implied decimal odds are b = (1/p_market) - 1, and the
Kelly fraction is f = (p_est × b - (1 - p_est)) / b. For buying NO,
we mirror the probabilities. The fraction is then divided by kelly_divisor
(default 4) and clamped to max_single_position (25%).
def kelly_fraction(
p_est: float, p_market: float,
divisor: float | None = None,
max_frac: float | None = None,
) -> float | None:
"""Returns None if no edge or negative Kelly."""
divisor = divisor or settings.kelly_divisor # 4.0
max_frac = max_frac or settings.max_single_position # 0.25
# Minimum edge threshold to avoid noise trades
if abs(p_est - p_market) < settings.kelly_edge_threshold: # 3%
return None
if p_est > p_market:
b = (1.0 / p_market) - 1.0 # decimal odds
f = (p_est * b - (1 - p_est)) / b # Kelly fraction
else:
# Buying NO: mirror the probabilities
p_no_market = 1.0 - p_market
p_no_est = 1.0 - p_est
b = (1.0 / p_no_market) - 1.0
f = (p_no_est * b - p_est) / b
if f <= 0:
return None
return float(np.clip(f / divisor, 0.0, max_frac))
Why quarter-Kelly? Full Kelly maximizes the long-run compound growth rate but has enormous variance: a bad streak can draw down 50%+ before recovering. Betting a fraction f of full Kelly scales the growth rate by f × (2 − f), so quarter-Kelly captures ~44% of the growth rate while staking only a quarter as much per bet (per-bet variance, which scales as f², falls by ~94%). For a $2,000 experiment where survival matters more than speed, this is the right trade-off.
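To make the formula concrete, a quick worked example under the defaults above (quarter-Kelly, $2,000 bankroll). Note that for the YES side the Kelly fraction simplifies algebraically to (p_est − p_market) / (1 − p_market):

```python
# Worked example: calibrated estimate 65%, market price 55 cents (YES side).
p_est, p_market = 0.65, 0.55
b = (1.0 / p_market) - 1.0              # decimal odds ~0.818
f_full = (p_est * b - (1 - p_est)) / b  # full Kelly ~0.222
# Algebraically identical shortcut for the YES side:
#   f_full == (p_est - p_market) / (1 - p_market)
f_quarter = f_full / 4.0                # ~0.056 of bankroll
stake = f_quarter * 2_000               # ~$111 on the $2,000 bankroll
```

A 10-point edge at 55 cents thus risks about 5.6% of the bankroll, comfortably inside the 25% single-position cap.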
Signal Building
The build_signal function ties estimation to execution. It computes a
confidence-weighted edge (|p_est - p_market| × confidence), checks
whether it exceeds the 5% minimum, determines direction (YES or NO), computes the dollar
size, and adds a 1-cent slippage buffer to the limit price.
def build_signal(market_id, yes_token, no_token,
                 p_est, p_market, confidence, bankroll, strategy):
    # Confidence-weighted edge must exceed 5%
    edge = abs(p_est - p_market) * confidence
    if edge < settings.edge_threshold:
        return None
    going_yes = p_est > p_market
    # Mirror the probabilities when buying NO
    effective_p_est = p_est if going_yes else 1.0 - p_est
    effective_p_market = p_market if going_yes else 1.0 - p_market
    size = compute_size(effective_p_est, effective_p_market, bankroll)
    if size is None or size < settings.min_trade_usd:  # $5 minimum
        return None
    return TradeSignal(
        direction="YES" if going_yes else "NO",
        size_usd=round(size, 2),
        limit_price=effective_p_market + settings.slippage_buffer,  # +$0.01
        kelly_frac=size / bankroll,
        edge=round(p_est - p_market, 4),
        ...
    )
Risk Management
The Sentinel is pure math. No LLM reasoning, no ambiguity, no exceptions. Four sequential checks, each with a hard limit. If any check fails, the trade is vetoed. If daily drawdown exceeds 10%, the entire system freezes.
def check_signal(
    signal: TradeSignal, portfolio: PortfolioState,
) -> RiskCheck:
    # CHECK 1: Single position size
    single_exposure = signal.size_usd / portfolio.total_equity
    if single_exposure > settings.max_single_position:  # > 25%
        return RiskCheck(approved=False, reason="Single position too large", ...)
    # CHECK 2: Portfolio heat (total capital deployed)
    new_heat = (portfolio.heat * portfolio.total_equity
                + signal.size_usd) / portfolio.total_equity
    if new_heat > settings.max_heat:  # > 80%
        return RiskCheck(approved=False, reason="Heat limit exceeded", ...)
    # CHECK 3: Daily drawdown (kill switch if > 10%)
    if portfolio.daily_drawdown > settings.max_daily_drawdown:
        return RiskCheck(approved=False, reason="FROZEN: drawdown limit", ...)
    # CHECK 4: Correlated exposure (same category)
    correlated_usd = sum(
        p.size_usd for p in portfolio.positions
        if p.category.value == signal.category.value
    )
    if ((correlated_usd + signal.size_usd) / portfolio.total_equity
            > settings.max_correlated):  # > 40%
        return RiskCheck(approved=False, reason="Correlated limit", ...)
    # All checks passed
    return RiskCheck(approved=True, heat=new_heat, ...)
| Check | Limit | Rationale | On Breach |
|---|---|---|---|
| Single position | < 25% of equity | No one bet should be existential | Veto signal |
| Portfolio heat | < 80% | Always keep 20% cash for opportunities | Veto signal |
| Daily drawdown | < 10% | Losing $200 in a day means something is wrong | FREEZE all trading |
| Correlated exposure | < 40% per category | Don't be all-in on "politics" or "crypto" | Veto signal |
The Sentinel is deliberately simple. Complex risk models (Value-at-Risk, Monte Carlo, copulas) give a false sense of precision. Four hard limits, checked in sequence, with a kill switch. If any limit is hit, trading stops. No exceptions, no "just this once," no manual override. The best risk management is the kind that can't be argued with.
Execution
The Trader is pure execution, with no analysis and no opinion. It receives
a TradeSignal that has already been approved by the Sentinel, checks the
order book for adverse price movement, and places a limit order.
async def execute_signal(
signal: TradeSignal,
client: LiveClient | SimClient,
store: Store,
) -> OrderResult:
# Pre-execution: check order book hasn't moved against us
book = await client.get_book(signal.token_id)
best_ask = float(book["asks"][0]["price"]) if book.get("asks") else 1.0
# Slippage guard: abort if book moved > 3% against us
if signal.direction == "YES" and best_ask > signal.p_market * 1.03:
return OrderResult(status="CANCELLED", ts=datetime.now(UTC))
# Execute through client (live order or paper fill)
result = await client.execute(signal)
# Log every execution to TimescaleDB
await store.log_trade(signal, result, paper=result.status == "PAPER")
return result
Paper vs Live
The system ships with two clients that both satisfy the MarketClient protocol.
SimClient delegates all reads to the real Polymarket API (so you
scan real markets at real prices) but simulates writes: execute()
returns an instant "PAPER" fill at the current midpoint with zero slippage.
LiveClient uses py-clob-client to sign and submit real limit
orders to the Polygon-based CLOB.
Paper mode isn't a separate codepath. It's the same pipeline with a different client injected at startup. The Chief doesn't know or care whether it's paper or live. This means every bug you find in paper mode is a bug you've fixed before going live.
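A minimal sketch of the paper side of this pattern, with simplified stand-in types and a canned order book in place of the real Polymarket reads, so the example is self-contained:

```python
import asyncio
from dataclasses import dataclass


# Simplified stand-ins for the real TradeSignal / OrderResult types.
@dataclass
class TradeSignal:
    token_id: str
    direction: str
    size_usd: float


@dataclass
class OrderResult:
    status: str
    fill_price: float


class SimClient:
    """Paper-client sketch. The real SimClient proxies reads to Polymarket;
    here the midpoints are canned so the example runs anywhere."""

    def __init__(self, midpoints: dict[str, float]):
        self._midpoints = midpoints

    async def get_midpoint(self, token_id: str) -> float:
        return self._midpoints[token_id]

    async def execute(self, signal: TradeSignal) -> OrderResult:
        # Paper fill: instant, at midpoint, zero slippage.
        mid = await self.get_midpoint(signal.token_id)
        return OrderResult(status="PAPER", fill_price=mid)


async def main() -> OrderResult:
    client = SimClient({"tok-1": 0.62})
    return await client.execute(TradeSignal("tok-1", "YES", 50.0))

result = asyncio.run(main())
# result: status "PAPER", filled at the 0.62 midpoint
```

Because the Chief only sees the `MarketClient` shape, swapping this object for a `LiveClient` at startup is the entire paper/live switch.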
Calibration
The calibration engine answers the question: when Claude says 70%, does the event actually happen 70% of the time? If there's systematic bias (e.g., Claude is overconfident in the 60–80% range), Platt scaling corrects it. If there's no bias, the calibrator passes through the raw estimate unchanged.
Platt Scaling
The technique is simple: take all resolved predictions, convert raw probabilities to log-odds, and fit a logistic regression against actual outcomes. The fitted model then maps future raw probabilities to calibrated ones. The calibrator needs at least 50 resolved predictions before it activates; before that, raw estimates pass through unchanged.
class Calibrator:
def __init__(self):
self._scaler = LogisticRegression(C=1.0, solver="lbfgs")
self._fitted = False
async def fit_from_db(self, store: Store, min_samples=50) -> bool:
resolved = await store.get_resolved_predictions()
if len(resolved) < min_samples:
return False # Not enough data yet
preds = resolved["p_est_raw"].to_numpy().astype(np.float64)
actuals = resolved["outcome"].to_numpy().astype(np.int32)
# Clip to avoid log(0), convert to log-odds
preds = np.clip(preds, 0.001, 0.999)
logits = np.log(preds / (1.0 - preds)).reshape(-1, 1)
self._scaler.fit(logits, actuals)
self._fitted = True
return True
def correct(self, raw_prob: float) -> float:
"""Apply Platt scaling. Pass-through if not yet fitted."""
if not self._fitted:
return raw_prob
raw_prob = float(np.clip(raw_prob, 0.001, 0.999))
logit = np.log(raw_prob / (1.0 - raw_prob))
return float(self._scaler.predict_proba([[logit]])[0][1])
Brier score is the gold standard for measuring probability calibration: the mean squared error between predicted probabilities and actual binary outcomes. A score of 0 = perfect calibration. 0.25 = coin-flip performance. 1.0 = consistently wrong. For reference, Tetlock's superforecasters achieve ~0.15 on geopolitical questions. If Snowden stays below 0.20, the Kelly criterion will generate positive expected value over time.
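Computing it is one line; the sample predictions here are illustrative:

```python
import numpy as np

# Brier score: mean squared error between predicted probability and outcome.
p_est = np.array([0.9, 0.7, 0.3, 0.8])  # calibrated estimates (illustrative)
outcome = np.array([1, 1, 0, 1])        # 1 = resolved YES, 0 = resolved NO
brier = float(np.mean((p_est - outcome) ** 2))
# (0.01 + 0.09 + 0.09 + 0.04) / 4 = 0.0575 -- comfortably under the 0.20 bar
```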
Reliability Report
The generate_report() method produces a full calibration diagnostic: Brier
score, decile reliability buckets (predicted probability vs actual outcome rate), and
over/under-confidence bias detection. Predictions above 50% where the actual rate is lower
indicate overconfidence; predictions below 50% where the actual rate is higher indicate
underconfidence. Both are common LLM failure modes.
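The decile bucketing can be sketched as follows; this is an illustration of what generate_report() computes, not the real implementation (which adds Brier score and bias detection on top):

```python
import numpy as np

def reliability_buckets(p_est: np.ndarray, outcome: np.ndarray) -> list[tuple]:
    """Decile buckets of (bucket_start, mean predicted, actual outcome rate).
    Sketch of the bucketing inside generate_report(), not the real code."""
    rows = []
    for lo in np.arange(0.0, 1.0, 0.1):
        mask = (p_est >= lo) & (p_est < lo + 0.1)
        if mask.any():
            rows.append((round(float(lo), 1),
                         float(p_est[mask].mean()),     # what we predicted
                         float(outcome[mask].mean())))  # what actually happened
    return rows

# predicted >> actual in buckets above 0.5 => overconfidence;
# predicted << actual in buckets below 0.5 => underconfidence
```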
Resolution Backfill
A dedicated script (scripts/resolve.py) polls Polymarket for resolved
markets and backfills outcomes into the predictions table. This is how
the calibrator accumulates training data. As more predictions resolve, the Platt
scaling becomes more accurate, creating a positive feedback loop.
Infrastructure
TimescaleDB Schema
Six tables, all configured as hypertables (TimescaleDB's time-partitioned tables that automatically shard by timestamp). Every stage of the pipeline writes structured data. The schema is designed so that every decision is auditable: you can trace any trade back to the prediction, the scan result, and the market tick that triggered it.
| Table | Purpose | Write Frequency | Key Columns |
|---|---|---|---|
| market_ticks | Price snapshots | Per scan per market | mid, spread, vol_24h, depths |
| predictions | Every analyst estimate | 10–15 per cycle | p_est, p_est_raw, confidence, regime, resolved, outcome |
| trades | Paper + live orders | 0–5 per cycle | size, price, status, kelly_frac, strategy |
| portfolio_snapshots | Portfolio state | 1 per cycle | bankroll, heat, daily_pnl, drawdown |
| scanner_metrics | Funnel numbers | 1 per cycle | stage_1..5 counts, duration_ms |
| market_metadata | Cached market info | On first scan | question, category, token IDs |
CREATE TABLE predictions (
ts TIMESTAMPTZ NOT NULL,
market_id TEXT NOT NULL,
question TEXT,
p_market FLOAT8,
p_est FLOAT8, -- calibrated estimate
p_est_raw FLOAT8, -- raw LLM output
confidence FLOAT8,
regime TEXT,
strategy TEXT,
edge FLOAT8,
reasoning TEXT,
data_quality FLOAT8 DEFAULT 0.5,
resolved BOOLEAN DEFAULT FALSE,
outcome SMALLINT -- 1 = YES, 0 = NO
);
SELECT create_hypertable('predictions', 'ts');
CREATE INDEX idx_pred_market ON predictions (market_id, ts DESC);
CREATE INDEX idx_pred_resolved ON predictions (resolved) WHERE resolved = true;
Continuous Aggregates
TimescaleDB's continuous aggregates are materialized views that incrementally refresh as new data arrives. Three views power the Grafana dashboards without requiring any manual rollup logic:
-- Live Brier score, average edge, and confidence over time
CREATE MATERIALIZED VIEW prediction_accuracy_hourly
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', ts) AS bucket,
COUNT(*) AS n_predictions,
COUNT(*) FILTER (WHERE resolved) AS n_resolved,
AVG(CASE WHEN resolved THEN (p_est - outcome)^2 END) AS brier_score,
AVG(edge) AS avg_edge,
AVG(confidence) AS avg_confidence
FROM predictions GROUP BY bucket;
Docker Compose
Two services: TimescaleDB (PostgreSQL 16 with time-series extensions) and Grafana for
visualization. The SQL init scripts are mounted into the container's initdb.d
directory and execute automatically on first boot. A health check ensures the database is
ready before Grafana connects.
The Gymnasium Environment
SnowdenReplayEnv replays historical predictions through a standard Gymnasium
interface for parameter sweeps. It's not RL training; it's a
backtesting harness for finding the optimal Kelly divisor (2, 4, 6, or 8) and bet size (5%, 10%, or 20% of bankroll) against resolved outcomes.
class SnowdenReplayEnv(gym.Env):
def __init__(self, predictions: pl.DataFrame, initial_bankroll=2000.0):
# Obs: [p_est, p_market, edge, confidence, spread, days_to_resolve]
self.observation_space = spaces.Box(-1, 365, shape=(6,), dtype=np.float32)
# Action: [skip, small 5%, medium 10%, large 20%]
self.action_space = spaces.Discrete(4)
def step(self, action):
size_map = {0: 0.0, 1: 0.05, 2: 0.10, 3: 0.20}
bet = size_map[action] * self._bankroll
# PnL computed from resolved outcome + market price
self._bankroll += pnl
self._peak = max(self._peak, self._bankroll)
return obs, pnl, done, False, {"bankroll": ..., "drawdown": ...}
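The same divisor sweep can be sketched without the Gym wrapper at all: replay resolved predictions for each divisor and compare final bankrolls. The sample rows below are synthetic stand-ins for rows from the predictions hypertable, and the replay logic is simplified (YES side only, full settlement at $1/$0):

```python
# Illustrative divisor sweep, not the production harness.
def replay(divisor: float, preds: list[tuple[float, float, int]],
           bankroll: float = 2_000.0) -> float:
    for p_est, p_market, outcome in preds:
        # YES-side Kelly fraction, scaled down by the divisor
        f = max((p_est - p_market) / (1 - p_market), 0.0) / divisor
        stake = f * bankroll
        if stake > 0:
            shares = stake / p_market             # YES shares at market price
            bankroll += shares * outcome - stake  # settle at $1 (YES) / $0 (NO)
    return bankroll

# Synthetic (p_est, p_market, outcome) rows for illustration
preds = [(0.70, 0.55, 1), (0.60, 0.50, 0), (0.90, 0.80, 1)]
results = {d: replay(d, preds) for d in (2, 4, 6, 8)}
# Higher divisors stake less per trade: a smoother but slower equity curve
```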
The Full Journey
Every 15 minutes, the Chief orchestrator wakes up and runs a complete trading cycle. Here is the journey of a single cycle, from wake-up to portfolio snapshot.
1. Chief wakes up. Reloads portfolio state from TimescaleDB: open positions, current bankroll, daily P&L. Checks the kill switch: if daily drawdown exceeds 10%, the cycle is aborted and all trading is frozen.
2. Scanner runs stages 1–5. Fetches 500+ markets, filters through liquidity gate, efficiency scoring, strategy matching, and Haiku triage. ~2–3 seconds total. Funnel metrics (stage counts and duration) are logged to scanner_metrics.
3. Price history enrichment. For each approved candidate, the system fetches 7-day price history from the CLOB timeseries API. This gives the Analyst context on recent price movement and volatility.
4. Analyst calls Claude Opus for each candidate. Calls are sequential to respect rate limits. Each call fetches category-specific news, builds a detailed prompt, and returns a calibrated probability estimate. ~30 seconds for 10–15 markets.
5. Kelly sizes each signal. For each analysis with sufficient confidence, build_signal() computes the confidence-weighted edge, checks the 5% threshold, and calculates the quarter-Kelly position size in USD.
6. Sentinel checks risk limits. Each signal passes through four sequential checks: single position size, portfolio heat, daily drawdown, and correlated category exposure. Any failure vetoes the signal.
7. Trader executes approved trades. Fetches the order book, runs the slippage guard (abort if price moved >3%), and places a limit order through the live or paper client. Every execution is logged.
8. Portfolio snapshot. The Chief computes total equity (cash + mark-to-market position value), updates heat and P&L, and writes a portfolio_snapshots row. Grafana picks it up in real time.
The Orchestrator
async def run_cycle(self, cycle_number: int) -> None:
self._portfolio.cycle_number = cycle_number
# Reload positions from DB
positions_df = await self._store.get_active_positions()
self._portfolio.positions = self._build_positions(positions_df)
# Kill switch check
if check_kill_switch(self._portfolio):
log.critical("FROZEN", reason="kill_switch_active")
return
# Scan stages 1-5
approved, stage_counts, scan_ms = await self._scan()
await self._store.log_scan_metrics(stage_counts, scan_ms)
if not approved:
log.info("no_opportunities"); return
# Enrich with price history, then analyze with Opus
analyses = await analyze_batch(approved, self._calibrator)
# Build signals → Sentinel risk check → Trader execution
for analysis in analyses:
if analysis.confidence < settings.min_confidence: continue
signal = build_signal(
market_id=analysis.market_id,
p_est=analysis.p_est, p_market=analysis.p_market,
confidence=analysis.confidence,
bankroll=self._portfolio.bankroll,
...
)
if signal is None: continue
risk = check_signal(signal, self._portfolio)
if not risk.approved: continue
result = await execute_signal(signal, self._client, self._store)
if result.status in ("FILLED", "PAPER"):
self._portfolio.bankroll -= signal.size_usd
self._portfolio.heat = risk.heat
# Snapshot portfolio state to TimescaleDB
await self._store.log_portfolio_snapshot(self._portfolio)
log.info("cycle_complete", cycle=cycle_number,
bankroll=round(self._portfolio.bankroll, 2))
The system is designed for one thing: disciplined, systematic edge extraction from prediction markets. No heroics, no overrides, no FOMO. Scan, analyze, size, check, execute, log. Every 15 minutes. The thesis isn't that the LLM is always right. It's that over hundreds of bets, a calibrated probability estimator with conservative sizing and strict risk limits has positive expected value. Let the math do the work.