System Design
LiveEdge Architecture
How WebLLM, Transformers.js, and ONNX Runtime Web work together to enable browser-based AI inference.
System Overview
[System architecture diagram]
Client-Side Stack
WebLLM
LLM Inference: Runs quantized LLMs directly in the browser via WebGPU. Handles conversational refinement, intent classification, and simple query handling.
Supported Models
- Phi-3-mini (~1.5GB quantized)
- Llama-3.2 (1B, 3B variants)
- Gemma 2 (2B)
Performance
- 15-20 tokens/second (desktop)
- 5-10 tokens/second (mobile)
- 20-30s initial load time
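As a rough illustration, loading a prebuilt model and streaming a completion with WebLLM looks like the sketch below. `CreateMLCEngine` and the OpenAI-style chat API are WebLLM's documented entry points; the specific model ID and prompt are illustrative placeholders.

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Download and compile the quantized model (20-30s on first load;
// subsequent loads are served from the browser cache). The model ID
// below is an illustrative entry from WebLLM's prebuilt model list.
const engine = await CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC", {
  initProgressCallback: (report) =>
    console.log(`Loading model: ${Math.round(report.progress * 100)}%`),
});

// OpenAI-style streaming chat completion, running entirely on-device.
const chunks = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a shopping assistant." },
    { role: "user", content: "Find waterproof trail running shoes under $120." },
  ],
  stream: true,
});

let reply = "";
for await (const chunk of chunks) {
  reply += chunk.choices[0]?.delta?.content ?? "";
}
```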
Transformers.js
Embeddings: Hugging Face's JavaScript library for running transformer models. Provides product embeddings and semantic search capabilities.
Primary Model
- all-MiniLM-L6-v2 (~25MB)
- 384-dimensional embeddings
- Optimized for semantic search
Performance
- <50ms per text embedding
- 2-3s initial load time
- WebGPU acceleration (Transformers.js v3+)
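A minimal embedding sketch with Transformers.js, assuming the `Xenova/all-MiniLM-L6-v2` checkpoint from the Hugging Face Hub:

```typescript
import { pipeline } from "@huggingface/transformers";

// Load the sentence-embedding model (~25MB, cached after first load).
// The device option targets the WebGPU backend introduced in v3;
// omit it to use the default backend.
const embed = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2", {
  device: "webgpu",
});

// Mean-pooled, L2-normalized 384-dim embedding, ready for cosine similarity.
const output = await embed("lightweight waterproof hiking jacket", {
  pooling: "mean",
  normalize: true,
});
const vector = Array.from(output.data as Float32Array); // length 384
```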
ONNX Runtime Web
Vision: Microsoft's cross-platform inference engine for ONNX models. Enables visual search and image classification on-device.
Use Cases
- CLIP visual search (~350MB)
- Image classification
- "Shop this look" features
Performance
- 100-200ms per image
- 5-10s initial load time
- WASM + WebGPU backends
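A sketch of running a CLIP image encoder with ONNX Runtime Web. The session and tensor APIs are ONNX Runtime Web's standard interface; the model path and the input/output tensor names are assumptions that depend on how the model was exported to ONNX.

```typescript
import * as ort from "onnxruntime-web";

// Create an inference session, preferring WebGPU and falling back to WASM.
const session = await ort.InferenceSession.create("/models/clip-vit-b32-image.onnx", {
  executionProviders: ["webgpu", "wasm"],
});

// CLIP ViT-B/32 expects a 224x224 RGB image, normalized and laid out as NCHW.
const pixels = new Float32Array(1 * 3 * 224 * 224); // filled by image preprocessing
const input = new ort.Tensor("float32", pixels, [1, 3, 224, 224]);

// Input/output names ("pixel_values", "image_embeds") are export-specific assumptions.
const results = await session.run({ pixel_values: input });
const imageEmbedding = results["image_embeds"].data as Float32Array;
```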
Local State + IndexedDB
Storage: Persistent local storage for preference models, the cached product catalog, and conversation history. Enables offline functionality.
- On-device preference learning
- Cached product embeddings
- Conversation context persistence
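A minimal IndexedDB sketch for persisting cached product embeddings; the database, store, and key names are illustrative.

```typescript
// Minimal IndexedDB wrapper for caching product embeddings locally.
const DB_NAME = "liveedge";
const STORE = "product-embeddings";

function openDb(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open(DB_NAME, 1);
    req.onupgradeneeded = () => {
      // Create the object store on first run, keyed by product ID.
      req.result.createObjectStore(STORE, { keyPath: "productId" });
    };
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function cacheEmbedding(productId: string, embedding: Float32Array): Promise<void> {
  const db = await openDb();
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction(STORE, "readwrite");
    // Typed arrays are structured-cloneable, so they persist as-is.
    tx.objectStore(STORE).put({ productId, embedding });
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}

async function getEmbedding(productId: string): Promise<Float32Array | undefined> {
  const db = await openDb();
  return new Promise((resolve, reject) => {
    const req = db.transaction(STORE, "readonly").objectStore(STORE).get(productId);
    req.onsuccess = () => resolve(req.result?.embedding);
    req.onerror = () => reject(req.error);
  });
}
```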
Browser Requirements
Minimum
- Chrome 113+ / Firefox 141+ / Safari 26+
- 4GB RAM available
- WebGPU support (or WASM fallback)
Optimal
- Modern device (2020+)
- Dedicated GPU or Apple Silicon
- 8GB+ RAM
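A small capability check along these lines can decide between the WebGPU and WASM paths at startup; the helper name and return convention are illustrative.

```typescript
// Capability check: prefer WebGPU, otherwise fall back to the WASM backends.
// The cast keeps the sketch self-contained without @webgpu/types installed.
async function detectBackend(): Promise<"webgpu" | "wasm"> {
  const gpu = (navigator as any).gpu;
  if (gpu) {
    try {
      const adapter = await gpu.requestAdapter();
      if (adapter) return "webgpu";
    } catch {
      // Some browsers expose navigator.gpu but fail adapter requests.
    }
  }
  return "wasm";
}

// Optional coarse memory check (Device Memory API, Chromium-only).
const approxMemoryGb: number | undefined = (navigator as any).deviceMemory;

const backend = await detectBackend();
```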
Model Selection
| Model | Size | Use Case | Load Time | Performance |
|---|---|---|---|---|
| Phi-3-mini | ~1.5GB | Conversational agent | 20-30s | 15-20 tok/s |
| all-MiniLM-L6-v2 | ~25MB | Product embeddings | 2-3s | <50ms/text |
| CLIP (ViT-B/32) | ~350MB | Visual search | 5-10s | 100-200ms/image |
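One way to encode this table in code is a small registry that gates heavier models on device capability, so the embedding path still works over WASM when WebGPU is unavailable. The IDs, sizes, and loader wiring below are illustrative assumptions, not part of any library API.

```typescript
type ModelRole = "chat" | "embedding" | "vision";

interface ModelSpec {
  role: ModelRole;
  id: string;
  approxSizeMb: number;
  requiresWebGpu: boolean;
}

// Illustrative registry tying the table above to loadable artifacts.
const MODELS: ModelSpec[] = [
  { role: "chat",      id: "Phi-3-mini-4k-instruct-q4f16_1-MLC", approxSizeMb: 1500, requiresWebGpu: true },
  { role: "embedding", id: "Xenova/all-MiniLM-L6-v2",            approxSizeMb: 25,   requiresWebGpu: false },
  { role: "vision",    id: "clip-vit-b32-image.onnx",            approxSizeMb: 350,  requiresWebGpu: false },
];

// Skip WebGPU-only models on devices that can only run the WASM backends.
function selectModels(hasWebGpu: boolean): ModelSpec[] {
  return MODELS.filter((m) => hasWebGpu || !m.requiresWebGpu);
}
```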