
Edge Architecture

How WebLLM, Transformers.js, and ONNX Runtime Web work together to enable browser-based AI inference.

System Overview

[Diagram: system overview of the client-side inference stack]

Client-Side Stack

WebLLM

LLM Inference

Runs quantized LLMs directly in the browser via WebGPU. Handles conversational refinement, intent classification, and simple queries.

Supported Models

  • Phi-3-mini (~1.5GB quantized)
  • Llama-3.2 (1B, 3B variants)
  • Gemma 2 (2B)

Performance

  • 15-20 tokens/second (desktop)
  • 5-10 tokens/second (mobile)
  • 20-30s initial load time
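
A minimal sketch of loading one of these models through WebLLM's OpenAI-style chat API. The model ID and prompts are illustrative; exact IDs depend on the WebLLM release's prebuilt model list.

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function demo() {
  // "Phi-3-mini-4k-instruct-q4f16_1-MLC" is one prebuilt model ID at time
  // of writing; check WebLLM's model list for the IDs your version ships.
  const engine = await CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC", {
    initProgressCallback: (p) =>
      console.log(`Loading model: ${Math.round(p.progress * 100)}%`),
  });

  // OpenAI-style chat completion, here used for intent classification.
  const reply = await engine.chat.completions.create({
    messages: [
      { role: "system", content: "Classify the shopper's intent in one word." },
      { role: "user", content: "Show me waterproof hiking boots under $150." },
    ],
  });
  console.log(reply.choices[0].message.content);
}

demo();
```

The progress callback matters in practice: the 20-30s initial load is dominated by downloading and compiling the quantized weights, so the UI should surface it.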

Transformers.js

Embeddings

Hugging Face's JavaScript library for running transformer models. Provides product embeddings and semantic search capabilities.

Primary Model

  • all-MiniLM-L6-v2 (~25MB)
  • 384-dimensional embeddings
  • Optimized for semantic search

Performance

  • <50ms per text embedding
  • 2-3s initial load time
  • WebGPU acceleration (v3)
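
A minimal embedding sketch with Transformers.js v3. `Xenova/all-MiniLM-L6-v2` is the standard ONNX port of the model on the Hugging Face Hub; the `device: "webgpu"` option assumes a v3 build running in a WebGPU-capable browser.

```ts
import { pipeline } from "@huggingface/transformers";

async function demo() {
  // Feature-extraction pipeline; drop the device option to use the WASM default.
  const embed = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2", {
    device: "webgpu", // v3 WebGPU backend
  });

  // Mean-pool the token vectors and L2-normalize to get one
  // 384-dimensional unit vector per input string.
  const out = await embed("waterproof hiking boots", {
    pooling: "mean",
    normalize: true,
  });
  console.log(out.dims); // [1, 384]
}

demo();
```

With `normalize: true`, cosine similarity between two product vectors reduces to a dot product, which keeps client-side semantic search cheap.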

ONNX Runtime Web

Vision

Microsoft's cross-platform inference engine for ONNX models. Enables visual search and image classification on-device.

Use Cases

  • CLIP visual search (~350MB)
  • Image classification
  • "Shop this look" features

Performance

  • 100-200ms per image
  • 5-10s initial load time
  • WASM + WebGPU backends
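
A sketch of running a CLIP image encoder with ONNX Runtime Web. The model URL is a placeholder, and the `pixel_values` input name is an assumption (typical of Hugging Face CLIP exports); check `session.inputNames` for the actual model.

```ts
import * as ort from "onnxruntime-web";
// Note: depending on the onnxruntime-web version, the WebGPU backend may
// require the "onnxruntime-web/webgpu" bundle instead.

async function demo() {
  // Execution providers are tried in order: WebGPU first, WASM as fallback.
  const session = await ort.InferenceSession.create(
    "/models/clip-image-vit-b32.onnx", // hypothetical model path
    { executionProviders: ["webgpu", "wasm"] },
  );

  // CLIP's image tower expects a normalized 224x224 RGB tensor
  // (NCHW, float32); in practice the pixels come from a <canvas>
  // after resizing and per-channel normalization.
  const pixels = new Float32Array(1 * 3 * 224 * 224);
  const input = new ort.Tensor("float32", pixels, [1, 3, 224, 224]);

  const results = await session.run({ pixel_values: input });
  console.log(Object.keys(results));
}

demo();
```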

Local State + IndexedDB

Storage

Persistent local storage for preference models, the cached product catalog, and conversation history. Enables offline functionality.

  • On-device preference learning
  • Cached product embeddings
  • Conversation context persistence
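
A minimal IndexedDB sketch for caching product embeddings across sessions; the database and store names here are hypothetical. Typed arrays survive IndexedDB's structured clone, so vectors can be stored as-is.

```ts
// Open (or create) the cache database with an embeddings store.
function openDB(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("edge-cache", 1); // hypothetical DB name
    req.onupgradeneeded = () => {
      req.result.createObjectStore("embeddings", { keyPath: "productId" });
    };
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

// Persist one product's embedding; Float32Array is stored directly.
async function cacheEmbedding(productId: string, vector: Float32Array) {
  const db = await openDB();
  const tx = db.transaction("embeddings", "readwrite");
  tx.objectStore("embeddings").put({ productId, vector });
  return new Promise<void>((resolve, reject) => {
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}
```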

Browser Requirements

Minimum

  • Chrome 113+ / Firefox 141+ / Safari 26+
  • 4GB RAM available
  • WebGPU support (or WASM fallback; see the detection sketch below)

Optimal

  • Modern device (2020+)
  • Dedicated GPU or Apple Silicon
  • 8GB+ RAM
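
A small sketch for choosing a backend under these requirements. It assumes `@webgpu/types` is installed so `navigator.gpu` type-checks; the function name is illustrative.

```ts
// Pick an inference backend: WebGPU when the browser exposes a usable
// adapter, otherwise the WASM fallback.
async function pickBackend(): Promise<"webgpu" | "wasm"> {
  if ("gpu" in navigator) {
    const adapter = await navigator.gpu.requestAdapter();
    if (adapter) return "webgpu";
  }
  return "wasm";
}
```

Checking for an actual adapter (not just the `navigator.gpu` property) matters: some browsers expose the API but return no adapter on unsupported hardware.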

Model Selection

| Model | Size | Use Case | Load Time | Performance |
| --- | --- | --- | --- | --- |
| Phi-3-mini | ~1.5GB | Conversational agent | 20-30s | 15-20 tok/s |
| all-MiniLM-L6-v2 | ~25MB | Product embeddings | 2-3s | <50ms/text |
| CLIP (ViT-B/32) | ~350MB | Visual search | 5-10s | 100-200ms/image |