# B.AI 5-Way Race — What This Is

> A local web app that races 5 frontier AI models against the same question
> and visualizes the results live. Three models go through B.AI's unified
> API. Two go direct to their vendor (OpenAI, Google) for head-to-head
> comparison.
>
> **Runs entirely local.** No Netlify, no deploy. `npm start` and you're done.
> **Stack:** Vanilla HTML/JS + tiny Node/Express proxy + Chart.js (CDN).
> **No build step.** No React. No frameworks.

---

# Part 1: What This App Does

User types a question → clicks RACE → 5 lanes light up on a live race track
while 5 streaming answer cards fill below → comparison charts appear at the
bottom showing speed, cost, throughput, and token usage.

```
 USER FLOW
 ═══════════════════════════════════════════════════════════════════

 ┌────────────────────────────────────────────────────────────┐
 │   "Will BTC hit $150K by end of 2026?"      [ ▶ RACE ]     │
 └────────────────────────────────────────────────────────────┘
                              │
                              ▼
 ┌────────────────────────────────────────────────────────────┐
 │ LIVE TRACK                                                 │
 │  🏁 Opus 4.7   ──────────────────────────────────✓  100%  │
 │  🏁 DeepSeek   ────────────💭                       60%   │
 │  🏁 Kimi K2.5  ──────💭                             30%   │
 │  🏁 GPT-5.5    ──────────────────────────────────✓  100%  │
 │  🏁 Gemini 3   ──────────────────────────────────✓  100%  │
 └────────────────────────────────────────────────────────────┘
                              │
                              ▼
 ┌──────────┬──────────┬──────────┬──────────┬──────────┐
 │ Opus 4.7 │ DeepSeek │ Kimi K2.5│ GPT-5.5  │ Gemini 3 │
 │ via B.AI │ V4 Pro   │ via B.AI │ direct   │ direct   │
 │          │ via B.AI │          │          │          │
 │ TIME     │ TIME     │ TIME     │ TIME     │ TIME     │
 │ TOKENS   │ TOKENS   │ TOKENS   │ TOKENS   │ TOKENS   │
 │ COST     │ COST     │ COST     │ COST     │ COST     │
 │ ─────    │ ─────    │ ─────    │ ─────    │ ─────    │
 │ <streamed answer text appears here as tokens arrive>   │
 └──────────┴──────────┴──────────┴──────────┴──────────┘
                              │
                              ▼
 ┌────────────────────────────────────────────────────────────┐
 │  🏆 Fastest    💰 Cheapest    🧠 Most thorough    ⚡ TPS  │
 │                                                            │
 │  📊 Speed bar chart                                        │
 │  📊 Cost bar chart                                         │
 │  📊 Throughput (tokens/sec) bar chart                      │
 │  📊 Token usage stacked bar (input vs output)              │
 └────────────────────────────────────────────────────────────┘
```

---

# Part 2: Architecture

```
                       BROWSER
        ┌────────────────────────────────────────┐
        │   index.html  +  app.js                │
        │   + racer.js  + charts.js              │
        │   + counter.js + pricing.js            │
        │                                        │
        │   Fires 5 parallel fetch() calls,      │
        │   each reads SSE stream                │
        └────┬───┬───┬───┬───┬───────────────────┘
             │   │   │   │   │
             ▼   ▼   ▼   ▼   ▼
        ┌──────────────────────────────────────┐
        │  POST /api/race  (Express, port 3000)│
        │                                      │
        │  Body: { racerId, question }         │
        │  Looks up racer config, holds API    │
        │  keys server-side, pipes SSE stream  │
        │  back to browser unchanged           │
        └────┬───┬───┬───┬───┬─────────────────┘
             │   │   │   │   │
             ▼   ▼   ▼   ▼   ▼
   ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
   │ api.b.ai     │ │ api.openai   │ │ generative-  │
   │ /v1/chat/    │ │ /v1/chat/    │ │ language.    │
   │ completions  │ │ completions  │ │ googleapis   │
   │              │ │              │ │ /openai/     │
   │ 3 racers     │ │ 1 racer      │ │ 1 racer      │
   └──────────────┘ └──────────────┘ └──────────────┘
```

**Why parallel fetches:** if one model is slow, the others still appear
instantly. Each card renders independently. Better UX.

**Why a server proxy:** keeps API keys server-side. NEVER put OpenAI /
Google / B.AI keys in the browser. Smallest possible Node + Express server,
no functions framework needed.

---

# Part 3: Project Structure

```
BAI-MULTICHAT/
├── server.js                      ← Express server + racer config + SSE proxy
├── package.json                   ← deps: express, dotenv
├── smoke.js                       ← CLI smoke test (node smoke.js)
├── .env                           ← BAI_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY
├── .env.example                   ← committed template
├── .gitignore                     ← excludes .env + node_modules
├── docs/
│   ├── bai-five-way-race.md             ← this file
│   └── bai-five-way-race-BAI_DOCS_REF.md ← B.AI API reference
└── static/
    ├── index.html                 ← Shell + CSS (glassmorphism dark theme)
    └── js/
        ├── app.js                 ← orchestrator + global totals + verdicts
        ├── racer.js               ← per-card streaming + lane animation
        ├── pricing.js             ← USD per 1M tokens
        ├── charts.js              ← Chart.js setup, dark theme
        └── counter.js             ← LiveCounter lerp utility for animated numbers
```

---

# Part 4: The Racer Config (the brain)

Inside `server.js`. Add/remove racers by editing this object only.

```javascript
// Per-racer flags:
//   sendTemperature: false   → omit `temperature` (Claude Opus 4.7, GPT-5+ reasoning)
//   maxTokensField: "max_completion_tokens" → use that instead of `max_tokens` (GPT-5+ reasoning)
const RACERS = {
  opus47: {
    name: "Claude Opus 4.7",
    via: "B.AI",
    apiUrl: "https://api.b.ai/v1/chat/completions",
    apiKeyEnv: "BAI_API_KEY",
    model: "claude-opus-4.7",
    color: "#d97757",
    sendTemperature: false        // Opus 4.7 rejects custom temperature
  },
  deepseek: {
    name: "DeepSeek V4 Pro",
    via: "B.AI",
    apiUrl: "https://api.b.ai/v1/chat/completions",
    apiKeyEnv: "BAI_API_KEY",
    model: "deepseek-v4-pro",
    color: "#4d6bfe"
  },
  kimi: {
    name: "Kimi K2.5",
    via: "B.AI",
    apiUrl: "https://api.b.ai/v1/chat/completions",
    apiKeyEnv: "BAI_API_KEY",
    model: "kimi-k2.5",
    color: "#111111"
  },
  gpt55: {
    name: "GPT-5.5",
    via: "OpenAI direct",
    apiUrl: "https://api.openai.com/v1/chat/completions",
    apiKeyEnv: "OPENAI_API_KEY",
    model: "gpt-5.5",
    color: "#10a37f",
    sendTemperature: false,                   // reasoning models reject temperature
    maxTokensField: "max_completion_tokens"   // and require this instead of max_tokens
  },
  gemini3pro: {
    name: "Gemini 3 Pro",
    via: "Google direct",
    apiUrl: "https://generativelanguage.googleapis.com/v1beta/openai/chat/completions",
    apiKeyEnv: "GOOGLE_API_KEY",
    model: "gemini-3-pro-preview",   // real Google model id (not "gemini-3-pro")
    color: "#4285f4"
  }
};
```

**Pricing** lives in `static/js/pricing.js` (USD per 1M tokens). Verify
GPT-5.5 + Gemini 3 Pro pricing at the vendor docs before serious use.

---

# Part 5: The Server (server.js)

The whole proxy is one file: ~110 lines. Highlights:

```javascript
// ESM has no __dirname (package.json sets "type": "module"), so derive
// it from import.meta.url first.
import { fileURLToPath } from "node:url";
const __dirname = path.dirname(fileURLToPath(import.meta.url));

// dotenv then loads .env relative to server.js, so the server works
// regardless of which directory it was launched from.
dotenv.config({ path: path.join(__dirname, ".env") });

// Build the upstream request body, conditionally including temperature
// and choosing between max_tokens / max_completion_tokens.
const body = {
  model: racer.model,
  messages: [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: question }
  ],
  stream: true,
  stream_options: { include_usage: true }   // forces usage in final SSE chunk
};
body[racer.maxTokensField || "max_tokens"] = 1500;
if (racer.sendTemperature !== false) body.temperature = 0.7;

// Fetch upstream and pipe its body straight back to the browser.
const upstream = await fetch(racer.apiUrl, {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify(body)
});

if (!upstream.ok) {
  const errText = await upstream.text();
  return res.status(upstream.status).json({ error: errText.slice(0, 1000) });
}

res.set({
  "Content-Type": "text/event-stream",
  "Cache-Control": "no-cache",
  "X-Accel-Buffering": "no"
});

const nodeStream = Readable.fromWeb(upstream.body);  // Readable from "node:stream"
req.on("close", () => nodeStream.destroy());
nodeStream.pipe(res);
```

**Critical:** `stream_options.include_usage = true` — without this, the
final SSE chunk has no token counts and your cost calc returns $0.
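
On the wire, the usage object rides in the last data chunk before `[DONE]`. A minimal sketch of pulling it out of a buffered stream, assuming the OpenAI-style `data: {...}` framing all five endpoints use here:

```javascript
// Scan an SSE buffer and return the `usage` object from the final
// chunk (null if include_usage wasn't set). Assumes one JSON payload
// per `data:` line, terminated by a `data: [DONE]` sentinel.
function extractUsage(sseText) {
  let usage = null;
  for (const line of sseText.split("\n")) {
    if (!line.startsWith("data: ")) continue;
    const payload = line.slice("data: ".length).trim();
    if (payload === "[DONE]") break;
    const chunk = JSON.parse(payload);
    if (chunk.usage) usage = chunk.usage; // only the last chunk carries it
  }
  return usage;
}
```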

---

# Part 6: The Frontend

Five files in `static/`:

## index.html

A single screen. Top: hero with live total counters (race time, total
tokens, total cost, status). Middle: live race track with 5 lanes →
streaming answer cards → verdict trophies. Bottom: 4 metric charts.

Visual style: glassmorphism dark theme, racer-specific brand color accents,
animated counters, pulsing dots for active racers, blinking cursor on
streaming text, live progress bars.

## app.js — orchestrator

```javascript
import { Racer } from "./racer.js";
import { renderCharts } from "./charts.js";
import { LiveCounter, fmt } from "./counter.js";

const RACER_ORDER = ["opus47", "deepseek", "kimi", "gpt55", "gemini3pro"];
const racersMetaPromise = fetch("/api/racers").then(r => r.json());

document.getElementById("race-btn").addEventListener("click", async () => {
  const question = document.getElementById("question").value.trim();
  if (!question) return;

  prepareForRace();
  const metas = await racersMetaPromise;
  const racers = RACER_ORDER
    .filter(id => metas[id])
    .map(id => new Racer(id, metas[id], question, { onUpdate: refreshTotals }));

  const settled = await Promise.allSettled(racers.map(r => r.run()));
  const completed = settled.filter(s => s.status === "fulfilled").map(s => s.value);

  showVerdicts(completed);
  renderCharts(completed);
});
```

## racer.js — per-card streaming + lane animation

Each `Racer` instance:
- Mounts a card (streaming text + big stats + progress bar) AND a lane
  (horizontal runner that moves with progress)
- Tracks 5 phases: `queued → thinking → writing → done/errored`
- Handles `delta.content` (visible output) AND `delta.reasoning_content`
  (hidden chain-of-thought from DeepSeek + Kimi)
- Runs an 80ms wall-clock ticker so silent racers (GPT-5.5 and Gemini,
  which buffer everything during reasoning) still show time/cost ticking
- Computes lane position as `max(time_progress, token_progress)`:
  - `time_progress = min(elapsed / 30s, 30%)` (drives silent racers
    forward even before any tokens arrive)
  - `token_progress = min(output_tokens / 700, 100%)`
  - Done racers snap to 100%
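
The lane formula above as a pure function (the 30s and 700-token thresholds come from the description; treat them as tunables):

```javascript
// Lane position in [0, 1]: whichever of capped wall-clock drift or
// token progress is further along. Done racers snap to the finish.
function laneProgress({ elapsedMs, outputTokens, done }) {
  if (done) return 1;
  const timeProgress = Math.min(elapsedMs / 30_000, 0.30); // capped drift
  const tokenProgress = Math.min(outputTokens / 700, 1);
  return Math.max(timeProgress, tokenProgress);
}
```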

## pricing.js + cost calculation

```javascript
const PRICING = {
  opus47:     { input: 5.00,  output: 25.00 },  // USD per 1M tokens
  deepseek:   { input: 0.435, output: 0.87  },
  kimi:       { input: 0.23,  output: 3.00  },
  gpt55:      { input: 5.00,  output: 30.00 },  // ⚠ verify
  gemini3pro: { input: 1.25,  output: 10.00 }   // ⚠ verify
};

export function calcCost(racerId, inputTokens, outputTokens) {
  const p = PRICING[racerId];
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}
```

**Output token count gotcha:** Gemini reports `completion_tokens` as
visible-only and hides reasoning tokens — but it bills you for them. We
derive billed output as `total_tokens - prompt_tokens`, which is correct
for every model we tested (DeepSeek, Kimi, Opus, GPT-5.5 all match
`completion_tokens`; only Gemini diverges).
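
That derivation is one line; sketching it out (field names per the OpenAI-style usage object):

```javascript
// Billed output tokens: total minus prompt. Covers Gemini's hidden
// reasoning tokens, and collapses to completion_tokens for models
// that report everything in the visible count.
function billedOutputTokens(usage) {
  return usage.total_tokens - usage.prompt_tokens;
}
```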

## charts.js — the money shot

Four Chart.js bar charts (speed, cost, throughput, token stacked) themed
to match the dark UI: tabular numbers, rounded bars, smooth easing,
muted gridlines.

## counter.js — animated number lerp

`LiveCounter` smoothly tweens displayed numbers from current → target
using `requestAnimationFrame`. Used for time, tokens, cost in cards and in
the global hero totals so numbers never jump abruptly.
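
The core of that tween is a single step function, roughly like this (the 0.18 factor and snap threshold are hypothetical tunables, not the real file's values):

```javascript
// One animation-frame step: move a fixed fraction of the remaining
// distance toward the target, snapping once close enough so the
// counter actually settles instead of approaching forever.
function tweenStep(displayed, target, factor = 0.18, epsilon = 0.5) {
  const next = displayed + (target - displayed) * factor;
  return Math.abs(target - next) < epsilon ? target : next;
}
```

A `requestAnimationFrame` loop would call this each frame and re-render until the displayed value equals the target.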

---

# Part 7: package.json + .env

```json
{
  "name": "bai-race",
  "version": "0.1.0",
  "private": true,
  "type": "module",
  "scripts": {
    "start": "node server.js",
    "dev": "node --watch server.js"
  },
  "dependencies": {
    "dotenv": "^16.4.5",
    "express": "^4.21.2"
  }
}
```

`.env`:
```
BAI_API_KEY=sk-...
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=AIza...

PORT=3000
```

`.gitignore`:
```
node_modules/
.env
.env.local
*.log
.DS_Store
```

---

# Part 8: Running It

```bash
npm install            # one time
npm start              # http://localhost:3000
# or
npm run dev            # auto-restart on server.js changes
```

Smoke test all 5 racers from the CLI without touching the browser:

```bash
node smoke.js          # fires all 5 in parallel; prints per-racer start
                       # offsets, TTFT (time to first token), token
                       # counts, and the raw usage object
```

---

# Part 9: Pre-Flight Checklist

```
☐ B.AI account topped up (any amount unlocks Claude Opus 4.7)
☐ B.AI API key generated and pasted into .env
☐ OpenAI API key with billing enabled, pasted into .env
☐ Google AI Studio API key, pasted into .env
☐ npm install
☐ npm start → http://localhost:3000 loads
☐ Run smoke.js → all 5 racers return ok with usage object
☐ Hit RACE in the browser → all 5 lanes animate to 100%
☐ Verify cost numbers look sane (~$0.02-0.04 per race)
☐ Verify each model's pricing constants in pricing.js are current
```

---

# Part 10: Gotchas (the ones we actually hit)

```
✗ Browser CORS — DO NOT call OpenAI/Google/B.AI from index.html directly.
  Always go through /api/race. Some providers block browser origins.

✗ Token counts arrive in the LAST chunk only, and only when you set
  stream_options: { include_usage: true }. Without that flag, all your
  cost calcs return $0.

✗ Premium B.AI models 403 if your account hasn't been topped up
  (free credits don't unlock Opus). Top up $5 first.

✗ Google's OpenAI-compat endpoint is at /v1beta/openai/, not /v1/.

✗ Gemini's actual model id is `gemini-3-pro-preview`, not `gemini-3-pro`.
  Query GET /v1beta/models?key=... to see real ids on your account.

✗ Claude Opus 4.7 rejects `temperature` outright (deprecated for that
  model). Omit it for opus47.

✗ GPT-5+ reasoning models reject `max_tokens` and `temperature`. Use
  `max_completion_tokens` and skip temperature.

✗ Gemini hides reasoning tokens from `completion_tokens` but bills them.
  Compute billed output as total_tokens - prompt_tokens to match what
  you'll actually be charged for.

✗ Some models (DeepSeek, Kimi) emit `delta.reasoning_content` chunks
  during the thinking phase. These aren't `delta.content` and won't show
  up in the visible answer — handle them separately if you want to show
  live reasoning. Others (GPT-5.5, Gemini) buffer the whole response
  during reasoning and emit nothing on the wire until ready.

✗ dotenv.config() loads .env from process.cwd() by default, which breaks
  if the server is launched from a different directory. Use:
  dotenv.config({ path: path.join(__dirname, ".env") })
  Note: package.json sets "type": "module", so __dirname must first be
  derived via path.dirname(fileURLToPath(import.meta.url)).

✗ Reasoning latency dwarfs throughput. Kimi K2.5 with full reasoning can
  take 30+ seconds before emitting a single visible token. Race UX has
  to compensate (see lane time-based drift component).
```
