System Design
LiveEdge Architecture
How WebLLM, Transformers.js, and ONNX Runtime Web work together to enable browser-based AI inference.
System Overview
[System architecture diagram]
Client-Side Stack
WebLLM
LLM Inference: Runs quantized LLMs directly in the browser via WebGPU. Handles conversational refinement, intent classification, and simple query handling.
Supported Models
- Phi-3-mini (~1.5GB quantized)
- Llama-3.2 (1B, 3B variants)
- Gemma 2 (2B)
Performance
- 15-20 tokens/second (desktop)
- 5-10 tokens/second (mobile)
- 20-30s initial load time
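As a rough illustration, loading a prebuilt model and streaming a completion with WebLLM looks like the sketch below. `CreateMLCEngine` and the OpenAI-style chat API are WebLLM's documented entry points; the specific model ID and prompt are illustrative placeholders.

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Download and compile the quantized model (20-30s on first load;
// subsequent loads are served from the browser cache). The model ID
// below is an illustrative entry from WebLLM's prebuilt model list.
const engine = await CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC", {
  initProgressCallback: (report) =>
    console.log(`Loading model: ${Math.round(report.progress * 100)}%`),
});

// OpenAI-style streaming chat completion, running entirely on-device.
const chunks = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a shopping assistant." },
    { role: "user", content: "Find waterproof trail running shoes under $120." },
  ],
  stream: true,
});

let reply = "";
for await (const chunk of chunks) {
  reply += chunk.choices[0]?.delta?.content ?? "";
}
```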
Transformers.js
Embeddings: Hugging Face's JavaScript library for running transformer models. Provides product embeddings and semantic search capabilities.
Primary Model
- all-MiniLM-L6-v2 (~25MB)
- 384-dimensional embeddings
- Optimized for semantic search
Performance
- <50ms per text embedding
- 2-3s initial load time
- WebGPU acceleration (Transformers.js v3+)
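A minimal embedding sketch with Transformers.js, assuming the `Xenova/all-MiniLM-L6-v2` checkpoint from the Hugging Face Hub:

```typescript
import { pipeline } from "@huggingface/transformers";

// Load the sentence-embedding model (~25MB, cached after first load).
// The device option targets the WebGPU backend introduced in v3;
// omit it to use the default backend.
const embed = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2", {
  device: "webgpu",
});

// Mean-pooled, L2-normalized 384-dim embedding, ready for cosine similarity.
const output = await embed("lightweight waterproof hiking jacket", {
  pooling: "mean",
  normalize: true,
});
const vector = Array.from(output.data as Float32Array); // length 384
```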
ONNX Runtime Web
Vision: Microsoft's cross-platform inference engine for ONNX models. Enables visual search and image classification on-device.
Use Cases
- CLIP visual search (~350MB)
- Image classification
- "Shop this look" features
Performance
- 100-200ms per image
- 5-10s initial load time
- WASM + WebGPU backends
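A sketch of running a CLIP image encoder with ONNX Runtime Web. The session and tensor APIs are ONNX Runtime Web's standard interface; the model path and the input/output tensor names are assumptions that depend on how the model was exported to ONNX.

```typescript
import * as ort from "onnxruntime-web";

// Create an inference session, preferring WebGPU and falling back to WASM.
const session = await ort.InferenceSession.create("/models/clip-vit-b32-image.onnx", {
  executionProviders: ["webgpu", "wasm"],
});

// CLIP ViT-B/32 expects a 224x224 RGB image, normalized and laid out as NCHW.
const pixels = new Float32Array(1 * 3 * 224 * 224); // filled by image preprocessing
const input = new ort.Tensor("float32", pixels, [1, 3, 224, 224]);

// Input/output names ("pixel_values", "image_embeds") are export-specific assumptions.
const results = await session.run({ pixel_values: input });
const imageEmbedding = results["image_embeds"].data as Float32Array;
```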
Local State + IndexedDB
Storage: Persistent local storage for preference models, the cached product catalog, and conversation history. Enables offline functionality.
- On-device preference learning
- Cached product embeddings
- Conversation context persistence
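A minimal IndexedDB sketch for persisting cached product embeddings; the database, store, and key names are illustrative.

```typescript
// Minimal IndexedDB wrapper for caching product embeddings locally.
const DB_NAME = "liveedge";
const STORE = "product-embeddings";

function openDb(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open(DB_NAME, 1);
    req.onupgradeneeded = () => {
      // Create the object store on first run, keyed by product ID.
      req.result.createObjectStore(STORE, { keyPath: "productId" });
    };
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function cacheEmbedding(productId: string, embedding: Float32Array): Promise<void> {
  const db = await openDb();
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction(STORE, "readwrite");
    // Typed arrays are structured-cloneable, so they persist as-is.
    tx.objectStore(STORE).put({ productId, embedding });
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}

async function getEmbedding(productId: string): Promise<Float32Array | undefined> {
  const db = await openDb();
  return new Promise((resolve, reject) => {
    const req = db.transaction(STORE, "readonly").objectStore(STORE).get(productId);
    req.onsuccess = () => resolve(req.result?.embedding);
    req.onerror = () => reject(req.error);
  });
}
```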
Browser Requirements
Minimum
- Chrome 113+ / Firefox 141+ / Safari 26+
- 4GB RAM available
- WebGPU support (or WASM fallback)
Optimal
- Modern device (2020+)
- Dedicated GPU or Apple Silicon
- 8GB+ RAM
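A small capability check along these lines can decide between the WebGPU and WASM paths at startup; the helper name and return convention are illustrative.

```typescript
// Capability check: prefer WebGPU, otherwise fall back to the WASM backends.
// The cast keeps the sketch self-contained without @webgpu/types installed.
async function detectBackend(): Promise<"webgpu" | "wasm"> {
  const gpu = (navigator as any).gpu;
  if (gpu) {
    try {
      const adapter = await gpu.requestAdapter();
      if (adapter) return "webgpu";
    } catch {
      // Some browsers expose navigator.gpu but fail adapter requests.
    }
  }
  return "wasm";
}

// Optional coarse memory check (Device Memory API, Chromium-only).
const approxMemoryGb: number | undefined = (navigator as any).deviceMemory;

const backend = await detectBackend();
```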
Model Selection
| Model | Size | Use Case | Load Time | Performance |
|---|---|---|---|---|
| Phi-3-mini | ~1.5GB | Conversational agent | 20-30s | 15-20 tok/s |
| all-MiniLM-L6-v2 | ~25MB | Product embeddings | 2-3s | <50ms/text |
| CLIP (ViT-B/32) | ~350MB | Visual search | 5-10s | 100-200ms/image |
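One way to encode this table in code is a small registry that gates heavier models on device capability, so the embedding path still works over WASM when WebGPU is unavailable. The IDs, sizes, and loader wiring below are illustrative assumptions, not part of any library API.

```typescript
type ModelRole = "chat" | "embedding" | "vision";

interface ModelSpec {
  role: ModelRole;
  id: string;
  approxSizeMb: number;
  requiresWebGpu: boolean;
}

// Illustrative registry tying the table above to loadable artifacts.
const MODELS: ModelSpec[] = [
  { role: "chat",      id: "Phi-3-mini-4k-instruct-q4f16_1-MLC", approxSizeMb: 1500, requiresWebGpu: true },
  { role: "embedding", id: "Xenova/all-MiniLM-L6-v2",            approxSizeMb: 25,   requiresWebGpu: false },
  { role: "vision",    id: "clip-vit-b32-image.onnx",            approxSizeMb: 350,  requiresWebGpu: false },
];

// Skip WebGPU-only models on devices that can only run the WASM backends.
function selectModels(hasWebGpu: boolean): ModelSpec[] {
  return MODELS.filter((m) => hasWebGpu || !m.requiresWebGpu);
}
```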