# Bonsai LLM - Build Guide

> **Download ready-made builds:**
> - [**Windows EXE**](https://github.com/alphastack1/storage/releases/download/bonsai-llm-v1.0.0/Bonsai-LLM.exe) (1.2 GB)
> - [**Android APK**](https://github.com/alphastack1/storage/releases/download/bonsai-llm-v1.0.0/bonsai-llm.apk) (250 MB)
> - [Release page](https://github.com/alphastack1/storage/releases/tag/bonsai-llm-v1.0.0)
>
> **Or build from source:** Drop this file into an empty folder, open it in
> Claude Code, and say *"Read this file and build everything."*

---

# Part 1: The Big Picture

```
 BONSAI LLM
 ================================================================

 A fully offline AI chat app. Zero cloud. Zero accounts.
 Runs a Large Language Model entirely on your own device.

 Ships as two standalone packages:
 ┌─────────────────────────┐  ┌──────────────────────────┐
 │  Windows EXE (1.2 GB)   │  │  Android APK (250 MB)    │
 │  Double-click to run    │  │  Sideload to install     │
 │  Native window          │  │  Runs on any ARM64 phone │
 │  CUDA GPU + CPU         │  │  CPU inference           │
 └─────────────────────────┘  └──────────────────────────┘

 Both bundle: model + inference engine + UI
 Nothing else to install. No internet after first launch.
```

## Architecture

```
 ┌─────────────────────────────────────────────────────────────┐
 │                     UI LAYER                                 │
 │                                                              │
 │  static/index.html  (~1740 lines, single file)              │
 │  Dark theme, streaming markdown, chat history                │
 │  No framework, no build step                                 │
 │  Shared between desktop and Android                          │
 └──────────────────────┬──────────────────────────────────────┘
                        │ HTTP on localhost
 ┌──────────────────────┴──────────────────────────────────────┐
 │                   SERVER LAYER                               │
 │                                                              │
 │  Desktop: app.py (Flask)     Android: LlamaService.java     │
 │  - Downloads engine/models   - Starts llama-server process  │
 │  - Manages subprocess        - Foreground service           │
 │  - Proxies SSE streams       - WebView talks directly to it │
 └──────────────────────┬──────────────────────────────────────┘
                        │ subprocess
 ┌──────────────────────┴──────────────────────────────────────┐
 │                  INFERENCE ENGINE                             │
 │                                                              │
 │  llama-server (PrismML fork of llama.cpp)                   │
 │  - OpenAI-compatible /v1/chat/completions endpoint          │
 │  - Custom Q1_0 kernel for 1-bit quantization                │
 │  - CUDA on desktop, CPU on Android                          │
 │  - Loads Bonsai-*.gguf model files                          │
 └─────────────────────────────────────────────────────────────┘
```

## Cost

```
 ┌──────────────┬────────────────────────────────────────────┐
 │ Everything   │ $0  (open-source models + engine)          │
 │ Disk space   │ 250 MB min (APK) / 1.2 GB (EXE + CUDA)    │
 │ Internet     │ Not needed. Fully offline.                 │
 │ GPU          │ Optional. CPU works. NVIDIA GPU = faster.  │
 └──────────────┴────────────────────────────────────────────┘
```

---

# Part 2: Project Structure

```
 bonsai-llm/
 ├── app.py ·················· Flask backend + subprocess manager
 ├── requirements.txt ········ flask, flask-cors, requests
 ├── start.bat ··············· Dev launcher (creates venv, runs app)
 ├── static/
 │   ├── index.html ·········· ENTIRE frontend (~1740 lines)
 │   └── fonts/ ·············· Outfit + JetBrains Mono
 │
 ├── EXE packaging:
 │   ├── bonsai-llm.spec ····· PyInstaller spec
 │   ├── build-exe.bat ······· Build script
 │   ├── prepare-exe-bin.py ·· Stages binaries for bundling
 │   ├── make-icon.py ········ Generates bonsai.ico
 │   └── bonsai.ico ·········· Multi-resolution app icon
 │
 └── android/ ················ Android Studio project
     └── app/src/main/
         ├── java/ ··········· MainActivity + LlamaService
         ├── assets/ ········· index.html + bundled model
         └── cpp/ ············ llama.cpp (PrismML fork, NDK)

 Auto-created at runtime (gitignored):
 ├── venv/ ··················· Python virtual environment
 ├── bin/ ···················· llama-server.exe + DLLs
 └── models/ ················· Bonsai-*.gguf files
```

```
 requirements.txt:
 ┌─────────────────────────────────┐
 │ flask>=3.0.0                    │
 │ flask-cors>=4.0.0               │
 │ requests>=2.31.0                │
 └─────────────────────────────────┘
 No PyTorch. No transformers. All inference is in llama-server.
```

---

# Part 3: The Models

```
 BONSAI MODEL FAMILY (PrismML, 1-bit Q1_0 quantization)
 ================================================================

 ┌───────────────┬──────────────────────┬────────┬──────────────┐
 │ Model         │ File                 │ Size   │ Notes        │
 ├───────────────┼──────────────────────┼────────┼──────────────┤
 │ Bonsai 1.7B   │ Bonsai-1.7B.gguf     │ 237 MB │ Bundled      │
 │ Bonsai 4B     │ Bonsai-4B.gguf       │ 546 MB │ On demand    │
 │ Bonsai 8B     │ Bonsai-8B.gguf       │ 1.2 GB │ On demand    │
 └───────────────┴──────────────────────┴────────┴──────────────┘

 All hosted on HuggingFace (prism-ml/). 1.7B ships with APK + EXE.
```

```
 WHY 1-BIT WORKS
 ================================================================

 Standard quantization:  fp16 ──► Q4 ──► Q8
                         3.4 GB   900 MB  1.7 GB   (for 1.7B model)

 PrismML Q1_0:           fp16 ──► Q1_0
                         3.4 GB   237 MB           (14x smaller!)

 Tradeoff: Q1_0 needs a custom kernel (CUDA for GPU, C for CPU).
 Standard llama.cpp doesn't have it. Must use PrismML's fork.
```
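
The size figures above check out with quick arithmetic (parameter count and file sizes taken from the tables in this guide):

```python
params = 1.7e9                # Bonsai 1.7B parameter count
fp16_gb = params * 2 / 1e9    # fp16 = 2 bytes per weight -> 3.4 GB
q1_gb = 0.237                 # shipped Q1_0 file size
shrink = fp16_gb / q1_gb      # ~14.3x -> the "14x smaller" claim
```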

```
 LLAMA-SERVER LAUNCH FLAGS
 ================================================================

 llama-server.exe
   -m models/Bonsai-1.7B.gguf     model file
   --host 127.0.0.1 --port 8080   localhost only
   -c 2048                         context length (tokens)
   -ngl 99                         all layers on GPU
   -t 4                            CPU threads
   --no-webui                      we have our own UI
   --cache-type-k q8_0             ┐ KV cache quantization
   --cache-type-v q8_0             ┘ ~47% memory saved
```
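
These flags translate directly into an argv list for `subprocess.Popen`. A sketch, with the helper name and defaults illustrative rather than app.py's exact code:

```python
def build_llama_cmd(model_path, port=8080, ctx=2048, ngl=99, threads=4):
    """Assemble the llama-server invocation shown above."""
    return [
        "llama-server",
        "-m", model_path,
        "--host", "127.0.0.1", "--port", str(port),
        "-c", str(ctx),              # context length (tokens)
        "-ngl", str(ngl),            # offload all layers to GPU if present
        "-t", str(threads),          # CPU threads
        "--no-webui",                # the app ships its own UI
        "--cache-type-k", "q8_0",    # quantized KV cache
        "--cache-type-v", "q8_0",
    ]

# subprocess.Popen(build_llama_cmd("models/Bonsai-1.7B.gguf"), cwd=BIN_DIR)
```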

---

# Part 4: The Backend (app.py)

```
 app.py  (~750 lines, single file)
 ================================================================

 Three jobs:
 ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐
 │ 1. DOWNLOAD  │  │ 2. MANAGE    │  │ 3. PROXY             │
 │              │  │              │  │                      │
 │ Engine zips  │  │ Start/stop   │  │ Forward /api/chat    │
 │ Model GGUFs  │  │ llama-server │  │ to llama-server as   │
 │ Progress %   │  │ subprocess   │  │ streaming SSE        │
 └──────────────┘  └──────────────┘  └──────────────────────┘
```

## API Routes

```
 ┌────────┬──────────────────────────┬──────────────────────────┐
 │ Method │ Path                     │ What                     │
 ├────────┼──────────────────────────┼──────────────────────────┤
 │ GET    │ /                        │ Serve index.html         │
 │ GET    │ /static/<file>           │ Serve CSS/JS/fonts       │
 │ GET    │ /api/status              │ Full state + heartbeat   │
 │ POST   │ /api/goodbye             │ Browser closing          │
 │ POST   │ /api/setup/binary        │ Start engine download    │
 │ POST   │ /api/setup/model         │ Start model download     │
 │ POST   │ /api/setup/delete_model  │ Delete cached .gguf      │
 │ POST   │ /api/load                │ Start llama-server       │
 │ POST   │ /api/unload              │ Stop llama-server        │
 │ POST   │ /api/chat                │ Stream chat completion   │
 └────────┴──────────────────────────┴──────────────────────────┘

 /api/status returns EVERYTHING: installed models, download
 progress, binary state, loaded model, server status.
 Frontend polls it every 1.5 seconds.
```
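
A hypothetical shape for that `/api/status` payload, just to make the polling contract concrete (field names are illustrative, not necessarily app.py's exact keys):

```python
status = {
    "binary_installed": True,                      # engine downloaded?
    "models": {"Bonsai-1.7B.gguf": 237_000_000},   # GGUFs on disk -> size
    "active_model": "Bonsai-1.7B.gguf",            # loaded by llama-server
    "llama_running": True,                         # server process up?
    "downloads": {},                               # in-flight progress
}
```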

## Engine Download

```
 PrismML's GitHub releases provide two zips:
 ================================================================

 Step 1: Engine binary (~134 MB zip)
 ┌──────────────────────────────────────────────────────────┐
 │ llama-server.exe + ggml.dll + ggml-base.dll + ...       │
 └──────────────────────────────────────────────────────────┘

 Step 2: CUDA runtime (~383 MB zip)
 ┌──────────────────────────────────────────────────────────┐
 │ cublas64_13.dll + cublasLt64_13.dll + cudart64_13.dll   │
 └──────────────────────────────────────────────────────────┘

 CUDA version auto-detected from NVIDIA driver:
 ┌────────────────────────────────┐
 │ nvidia-smi driver major >= 560 │──► CUDA 13.1 binaries
 │ nvidia-smi driver major <  560 │──► CUDA 12.4 binaries
 │ no nvidia-smi (no GPU)         │──► CUDA 12.4 (CPU fallback)
 └────────────────────────────────┘

 Download progress state machine:
 connecting ──► downloading ──► extracting ──► done
                    │               │
                    └───────────────┴──► error
```
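
The driver probe and version pick can be sketched in a few lines of Python. The mapping follows the rule above; the probe itself is an assumption, and app.py's exact code may differ:

```python
import re
import subprocess

def driver_major():
    """Read the NVIDIA driver's major version, or None when nvidia-smi is absent."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10,
        ).stdout
        m = re.match(r"(\d+)", out.strip())
        return int(m.group(1)) if m else None
    except (OSError, subprocess.TimeoutExpired):
        return None                 # no GPU / no driver installed

def pick_cuda_build(major):
    """Driver 560+ pairs with CUDA 13.1 binaries; older or no GPU gets 12.4."""
    return "13.1" if major is not None and major >= 560 else "12.4"
```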

## Chat Streaming

```
 Browser            Flask              llama-server
 ═══════            ═════              ════════════

 POST /api/chat ──► set stream=True ──► POST /v1/chat/completions
                                              │
                    data: {delta} ◄─── data: {"choices":[{"delta":
 data: {delta} ◄───                     {"content":"Hi"}}]}
                                              │
 data: [DONE]  ◄── data: [DONE]  ◄─── data: [DONE]

 Flask uses Response(stream_with_context(generate()),
   content_type="text/event-stream")
 + header X-Accel-Buffering: no  (prevents buffering)
```
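
The proxy's core is a generator over upstream SSE lines; app.py wraps such a generator in `Response(stream_with_context(...), content_type="text/event-stream")`. A sketch (the helper name is illustrative):

```python
def relay_sse(upstream_lines):
    """Re-emit llama-server's SSE lines to the browser, one event per chunk."""
    for raw in upstream_lines:
        if not raw:
            continue                  # skip keep-alive blanks
        yield raw + "\n\n"            # SSE events end with a blank line
        if raw.strip() == "data: [DONE]":
            return                    # upstream finished the completion
```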

## Heartbeat Watchdog

```
 Auto-shutdown when user closes the window:
 ================================================================

 Browser                           Server
 ═══════                           ══════

 GET /api/status ───────────────► last_heartbeat = now()
    (every 1.5 seconds)

   ... user closes window ...

 beforeunload ──────────────────► POST /api/goodbye
                                  → stop_llama_server()
                                  → os._exit(0)

 (if goodbye missed:)
                                  Watchdog thread (every 5s):
                                  ┌──────────────────────────┐
                                  │ now - last > 20s?        │
                                  │ → stop llama-server      │
                                  │ → os._exit(0)            │
                                  └──────────────────────────┘

 os._exit(0) not sys.exit() -- Flask catches SystemExit.
```
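
A minimal sketch of the watchdog, assuming the names used in the diagram (`stop_llama_server` comes from app.py; the rest is illustrative):

```python
import os
import threading
import time

last_heartbeat = time.time()        # refreshed by every GET /api/status

def should_shutdown(last, now, timeout=20.0):
    """True once no heartbeat has arrived for longer than the timeout."""
    return (now - last) > timeout

def watchdog():
    while True:
        time.sleep(5)               # check every 5 s, as in the diagram
        if should_shutdown(last_heartbeat, time.time()):
            # stop_llama_server()   # kill the subprocess tree first
            os._exit(0)             # hard exit; Flask would catch SystemExit

# threading.Thread(target=watchdog, daemon=True).start()
```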

---

# Part 5: The Frontend (static/index.html)

```
 DESIGN SYSTEM
 ================================================================

 Theme:    Dark (zinc palette)
 Accent:   #22c55e (Bonsai green)
 Fonts:    Outfit (UI) + JetBrains Mono (code, tok/s)

 Color tokens:
 ┌──────────────────────────────────────────────────┐
 │  --bg:       #09090b   almost black              │
 │  --surface:  #18181b   cards, input bg           │
 │  --surface2: #27272a   hover, active             │
 │  --border:   #27272a   subtle borders            │
 │  --text:     #fafafa   primary text              │
 │  --text2:    #a1a1aa   secondary                 │
 │  --muted:    #71717a   timestamps, hints         │
 │  --accent:   #22c55e   Bonsai green              │
 └──────────────────────────────────────────────────┘

 Animations: typing dots, streaming cursor pulse,
             progress bar shimmer, side panel slide
```

## Screen Flow

```
 PAGE LOAD
 ================================================================

 pollStatus() fires (repeats every 1.5s)
       │
       ▼
 ┌─────────────────────────────────────┐
 │  Binary installed?                  │
 │  ├── NO  ──► SETUP SCREEN           │
 │  │           "Download engine"       │
 │  └── YES ──► Models on disk?        │
 │              ├── NO  ──► SETUP       │
 │              │   "Pick a model"      │
 │              └── YES ──► CHAT        │
 └─────────────────────────────────────┘

 In packaged EXE/APK: engine + 1.7B are bundled,
 so it skips straight to the chat screen.
```

## Chat UI Layout

```
 ┌────────────────────────────────────────────────────────────┐
 │  Bonsai                                       [+]  [gear] │
 │  Bonsai 1.7B                                              │
 │ ──────────────────────────────────────────────────────── │
 │                                                            │
 │   YOU                                                      │
 │   Explain quantum computing simply                         │
 │                                                            │
 │   BONSAI                                                   │
 │   Quantum computers use qubits which can exist in          │
 │   superposition. Unlike classical bits...                   │
 │                                                            │
 │   ```python                                                │
 │   def grover(n):                                           │
 │       return math.sqrt(n)                                  │
 │   ```                                                      │
 │   7.1 tok/s    [Copy]                                      │
 │                                                            │
 │ ──────────────────────────────────────────────────────── │
 │ [Message Bonsai...                              ] [Send]   │
 └────────────────────────────────────────────────────────────┘

 Side panel (slides from right):
 ┌─────────────────────────────┐
 │ [Chats] [Models]         X  │
 │ ──────────────────────────  │
 │ + New chat                  │
 │ ─────────                   │
 │ * Quantum explanation  Apr5 │
 │   Python fizzbuzz      Apr4 │
 │                             │
 │ MODELS TAB:                 │
 │ * Bonsai 1.7B  237MB  [ok]  │
 │   Bonsai 4B    546MB [Load] │
 │   Bonsai 8B   1.2GB  [DL]   │
 └─────────────────────────────┘
```

## Streaming Token Rendering

```
 TOKEN-BY-TOKEN MARKDOWN
 ================================================================

 fullResponse = ''        ◄── accumulates raw text
       │
       │  SSE chunk arrives: delta = "Hello"
       │
       ▼
 fullResponse += delta    ◄── append
       │
       ▼
 renderMarkdown(fullResponse)  ◄── re-parse entire response
       │
       ▼
 streamContent.innerHTML = result  ◄── DOM updates live

 tokPerSec = tokenCount / elapsed  ◄── speed counter

 Works at 5-60 tok/s. No throttling needed.
```

## Chat History

```
 localStorage (no database):
 ================================================================

 bonsai_chats_list = [
   { id: "abc123", title: "Quantum...", date: 1712345678 },
   { id: "def456", title: "Python...",  date: 1712300000 },
 ]

 bonsai_chats_abc123 = [
   { role: "user",      content: "Explain quantum..." },
   { role: "assistant", content: "Quantum computers..." },
 ]

 Title = first 50 chars of first user message.
 Current chat saves before switching.
```
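
The same two-key layout can be modeled in Python for clarity (the dict stands in for `window.localStorage`; the helper name is illustrative):

```python
import json
import time
import uuid

store = {}  # stands in for window.localStorage

def save_chat(messages):
    """Persist a chat under the bonsai_chats_list / bonsai_chats_<id> layout."""
    chat_id = uuid.uuid4().hex[:6]
    title = messages[0]["content"][:50]      # first 50 chars of first user msg
    chat_list = json.loads(store.get("bonsai_chats_list", "[]"))
    chat_list.insert(0, {"id": chat_id, "title": title, "date": int(time.time())})
    store["bonsai_chats_list"] = json.dumps(chat_list)
    store["bonsai_chats_" + chat_id] = json.dumps(messages)
    return chat_id
```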

## JavaScript State

```
 state = {
   binaryInstalled   ◄── from /api/status
   models: {}        ◄── which GGUFs exist on disk
   activeModel       ◄── which one llama-server loaded
   llamaRunning      ◄── is llama-server up?
   downloads: {}     ◄── progress for each download
   messages: []      ◄── current chat messages
   streaming         ◄── in-flight request?
   abortController   ◄── Stop button
   chatId            ◄── current chat ID
   chatList: []      ◄── [{id, title, date}, ...]
 }

 Polling loop drives everything:
 pollStatus() ──► fetch('/api/status') ──► update state ──► re-render
      │
      └──► setTimeout(1500ms) ──► pollStatus()
```

---

# Part 6: Design Decisions & Gotchas

```
 WHY IT'S BUILT THIS WAY
 ================================================================

 Single HTML file (no build step)
 └── Zero tooling. 1740 lines = CSS + JS + all features.

 Flask proxies to llama-server (not Python bindings)
 └── PrismML's 1-bit kernel only exists in their fork's
     prebuilt binaries. Subprocess = swap any fork easily.

 Auto-detect CUDA from nvidia-smi
 └── Driver 560+ needs CUDA 13.x, older needs 12.4.
     Wrong version = cryptic "DLL load failed" errors.

 KV cache quantization (q8_0)
 └── ~47% memory saved vs f16. No quality loss.
     Critical for phones with limited RAM.

 localStorage for chats (no database)
 └── Zero-config. Chats are tiny (<1KB each).
```
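
The "~47%" figure follows from ggml's q8_0 block layout, which stores 32 int8 values plus one fp16 scale per 34-byte block:

```python
block_bytes = 32 * 1 + 2                # 32 int8 values + one fp16 scale
bits_per_elem = block_bytes * 8 / 32    # 8.5 bits, vs 16 for f16
saving = 1 - bits_per_elem / 16         # 0.46875 -> "~47% memory saved"
```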

```
 GOTCHAS
 ================================================================

 Windows DLL error popup
 ├── ggml.dll calls LoadLibrary("ggml-cuda.dll")
 ├── On CPU-only systems: blocking error dialog
 └── Fix: SetErrorMode(SEM_FAILCRITICALERRORS) before Popen

 llama-server needs cwd=BIN_DIR
 ├── It loads sibling DLLs by relative path
 └── Spawning from elsewhere = silent failure

 taskkill /T to kill process tree
 ├── process.terminate() only kills the parent on Windows
 └── Orphaned children keep holding GPU memory. Use taskkill /F /T /PID

 SSE buffering
 ├── Without X-Accel-Buffering: no header
 └── Browsers/proxies buffer = full response at once

 os._exit(0) not sys.exit()
 ├── Flask catches SystemExit in threaded mode
 └── Call stop_llama_server() first (skips atexit)

 renderMessages() must rewrite innerHTML
 ├── Even for empty state (welcome screen)
 └── Otherwise old chat's DOM stays visible on switch
```
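
Three of these gotchas (error mode, cwd, process tree) meet in the spawn/stop pair. A sketch with assumed names, not app.py's exact code:

```python
import subprocess
import sys

def spawn_llama(cmd, bin_dir):
    """Start llama-server so sibling DLLs resolve and no error dialogs block."""
    if sys.platform == "win32":
        import ctypes
        SEM_FAILCRITICALERRORS = 0x0001
        ctypes.windll.kernel32.SetErrorMode(SEM_FAILCRITICALERRORS)
    return subprocess.Popen(cmd, cwd=bin_dir)   # cwd MUST be the bin dir

def kill_tree_cmd(pid):
    """taskkill args that also reap children (terminate() leaves them alive)."""
    return ["taskkill", "/F", "/T", "/PID", str(pid)]
```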

---

# Part 7: Packaging as a Windows EXE

```
 Bonsai-LLM.exe  (~1.2 GB, single file)
 ================================================================

 Bundles:
 ┌─────────────────────────────────────────────────────────────┐
 │ Python runtime                                              │
 │ Flask + dependencies                                        │
 │ llama-server.exe + all DLLs (incl. CUDA runtime ~540 MB)   │
 │ Bonsai-1.7B.gguf model (237 MB)                            │
 │ HTML/CSS/JS UI + fonts                                      │
 │ Bonsai tree icon                                            │
 └─────────────────────────────────────────────────────────────┘

 Double-click ──► native window (pywebview + Edge WebView2)
 No terminal. No browser. No setup screen.
```

## Runtime Layout

```
 PyInstaller --onefile extracts to temp dir at launch:
 ================================================================

 sys._MEIPASS/  (read-only, deleted on exit)
 ├── bin/               llama-server.exe + DLLs
 ├── static/            index.html + fonts
 ├── models/            Bonsai-1.7B.gguf (bundled copy)
 └── bonsai.ico

 <EXE directory>/  (persistent)
 └── models/            model copied here on first run
```

## Frozen Mode Startup

```
 EXE launches
 ════════════════════════════════════════════════════════
      │
      ├── 1. Set paths: BIN_DIR = _MEIPASS/bin
      │                 MODELS_DIR = <exe dir>/models
      │
      ├── 2. Extract bundled model (first run only)
      │      _MEIPASS/models/*.gguf ──► <exe dir>/models/
      │
      ├── 3. Start Flask in background thread (:7860)
      │
      ├── 4. Start llama-server with 1.7B model
      │      (skips setup screen -- model is bundled)
      │
      ├── 5. Wait for Flask to accept connections
      │
      ├── 6. Open pywebview native window
      │      (Edge WebView2, 1100x800, min 600x500)
      │
      └── 7. On window close:
             stop_llama_server() ──► os._exit(0)
```
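
Steps 1-2 hinge on telling a frozen run from a dev run. A sketch using PyInstaller's standard markers (the function name is illustrative):

```python
import os
import sys

def runtime_dirs():
    """bin/ and models/ locations for dev runs vs the frozen EXE."""
    if getattr(sys, "frozen", False):             # PyInstaller --onefile
        bin_dir = os.path.join(sys._MEIPASS, "bin")
        models_dir = os.path.join(os.path.dirname(sys.executable), "models")
    else:                                         # plain `python app.py`
        here = os.path.dirname(os.path.abspath(sys.argv[0] or "."))
        bin_dir = os.path.join(here, "bin")
        models_dir = os.path.join(here, "models")
    return bin_dir, models_dir
```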

## CUDA Without Installation

```
 ┌─────────────────────────────────────────────────────────────┐
 │  EXE bundles full CUDA runtime:                             │
 │  cublas64_13.dll + cublasLt64_13.dll + cudart64_13.dll     │
 │                                                              │
 │  NVIDIA GPU present?                                        │
 │  ├── YES ──► GPU inference (fast)                           │
 │  └── NO  ──► cudaGetDeviceCount returns 0                   │
 │              llama-server falls back to CPU automatically   │
 │                                                              │
 │  ggml.dll links against ggml-cuda.dll, so the CUDA         │
 │  DLLs must be bundled even for CPU-only operation.         │
 └─────────────────────────────────────────────────────────────┘
```

## Build

```
 build-exe.bat (runs all three steps):
 ════════════════════════════════════════════════════════

 1. python prepare-exe-bin.py    copy binaries to bin_exe/
 2. python make-icon.py          generate bonsai.ico
 3. pyinstaller bonsai-llm.spec  build dist/Bonsai-LLM.exe

 Key spec settings:
 ┌────────────────────────────────────────────────────────┐
 │ console=False          no terminal window              │
 │ icon=bonsai.ico        custom app icon                 │
 │ onefile=True           single EXE output               │
 │ hiddenimports:         webview, edgechromium,           │
 │                        clr_loader, pythonnet            │
 └────────────────────────────────────────────────────────┘
```

---

# Part 8: Packaging as an Android APK

```
 bonsai-llm.apk  (~250 MB, single file)
 ================================================================

 Same UI + same model, but no Python and no internet.
 The Flask layer is replaced by Java services.

 DESKTOP                         ANDROID
 ═══════                         ═══════
 Browser ──► Flask ──► llama     WebView ──► llama-server
             (Python)            LlamaService.java spawns it
                                 as a native subprocess
```

## Project Structure

```
 android/app/
 ├── build.gradle ············ signing, NDK ABI filter
 └── src/main/
     ├── AndroidManifest.xml
     ├── res/values/
     │   └── styles.xml ······ edge-to-edge theme
     ├── cpp/
     │   └── llama.cpp/ ······ PrismML fork (git submodule)
     │                         compiled via NDK (arm64-v8a)
     ├── java/com/bonsai/llm/
     │   ├── MainActivity ···· WebView + inset handling
     │   └── LlamaService ···· spawns llama-server process
     └── assets/
         ├── index.html ······ same file as desktop
         └── Bonsai-1.7B.gguf  bundled model (237 MB)
```

## Android 15+ Edge-to-Edge

```
 API 35+ forces edge-to-edge. Three things must align:
 ================================================================

 (1) THEME (styles.xml)
 ┌─────────────────────────────────────────────────────┐
 │ statusBarColor = transparent                        │
 │ navigationBarColor = transparent                    │
 │ enforceStatusBarContrast = false                    │
 │ enforceNavigationBarContrast = false                │
 └─────────────────────────────────────────────────────┘

 (2) MAINACTIVITY (insets listener)
 ┌─────────────────────────────────────────────────────┐
 │ Attach to android.R.id.content (NOT the WebView)   │
 │ Apply systemBars + displayCutout as padding         │
 │ Include Type.ime() for keyboard handling            │
 │ bottom = max(bars.bottom, ime.bottom)               │
 └─────────────────────────────────────────────────────┘

 (3) HTML
 ┌─────────────────────────────────────────────────────┐
 │ No env(safe-area-inset-*) -- unreliable on Android  │
 │ Native padding handles everything                   │
 └─────────────────────────────────────────────────────┘
```

## Build

```
 cd android
 export JAVA_HOME="/c/Program Files/Android/Android Studio/jbr"
 export ANDROID_HOME="/c/Users/Admin/AppData/Local/Android/Sdk"
 ./gradlew.bat assembleRelease

 ──► app/build/outputs/apk/release/app-release.apk (~249 MB)

 Requires signing config in build.gradle or phones
 will reject with "App not installed".
```

---

# Part 9: .gitignore

```
venv/
bin/
bin_exe/
models/
__pycache__/
*.pyc
*.log
*.part
build/
dist/
*.exe
*.apk
android/.gradle/
android/app/.cxx/
android/app/build/
android/build/
android/local.properties
android/app/src/main/assets/Bonsai-*.gguf
android/app/src/main/cpp/llama.cpp/
```
