Client Models

Run LLMs and embedding models directly in the browser using WebGPU and WebAssembly, eliminating network latency from inference.

WebLLM

WebLLM brings large language models to the browser by leveraging WebGPU for GPU-accelerated inference. This enables conversational AI without server roundtrips.

import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Initialize the engine with a model
const engine = await CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC", {
  initProgressCallback: (progress) => {
    console.log(`Loading: ${(progress.progress * 100).toFixed(0)}%`);
  }
});

// Generate a response
const response = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a shopping assistant." },
    { role: "user", content: "Show me something warmer" }
  ],
  temperature: 0.7,
  max_tokens: 150
});

console.log(response.choices[0].message.content);
// "I'll filter for warmer items like jackets and sweaters..."

Pros

  • +Zero network latency for generation
  • +Complete privacy (no data leaves device)
  • +Works offline after initial load
  • +OpenAI-compatible API

Cons

  • -Large initial download (1-2GB)
  • -Requires WebGPU support
  • -Lower quality than GPT-4o
  • -Battery/thermal impact on mobile

Transformers.js

Hugging Face's JavaScript library provides access to 2,400+ models for embeddings, classification, and more. Version 3 adds WebGPU acceleration.

import { pipeline, env } from '@huggingface/transformers';

// Configure the WASM backend thread count (fallback when WebGPU is unavailable)
env.backends.onnx.wasm.numThreads = 4;

// Create embedding pipeline with WebGPU acceleration (v3)
const embedder = await pipeline(
  'feature-extraction',
  'Xenova/all-MiniLM-L6-v2',
  { device: 'webgpu' }
);

// Generate embeddings for products
const productEmbeddings = await Promise.all(
  products.map(p => embedder(p.description, {
    pooling: 'mean',
    normalize: true
  }))
);

// Semantic search over the precomputed product embeddings
async function findSimilar(query: string, topK = 5) {
  // Embed the query with the same pooling/normalization as the products
  const queryEmbedding = await embedder(query, {
    pooling: 'mean',
    normalize: true
  });
  return products
    .map((p, i) => ({
      product: p,
      score: cosineSimilarity(queryEmbedding, productEmbeddings[i])
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
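The example assumes a cosineSimilarity helper. Because the embeddings are normalized, cosine similarity reduces to a plain dot product over each tensor's data; a minimal sketch:

// Minimal cosine similarity over Transformers.js tensors (assumed helper).
// With normalize: true both vectors are unit-length, so the dot product
// equals the cosine similarity.
function cosineSimilarity(a: { data: Float32Array }, b: { data: Float32Array }) {
  let dot = 0;
  for (let i = 0; i < a.data.length; i++) {
    dot += a.data[i] * b.data[i];
  }
  return dot;
}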

Performance Characteristics

  • <50ms per embedding
  • ~25MB model size
  • 2-3s initial load
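The per-embedding latency is easy to sanity-check in the console; a rough timing sketch (the product description is a made-up example):

// Rough latency check for a single embedding call
// (run the embedder once beforehand to exclude warm-up from the measurement)
const t0 = performance.now();
await embedder('lightweight waterproof trail jacket', {
  pooling: 'mean',
  normalize: true
});
console.log(`Embedding took ${(performance.now() - t0).toFixed(1)}ms`);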

Commerce Use Cases

Instant Refinement (WebLLM + Transformers.js)

User says "something warmer" → client model classifies intent → embeddings filter products → layout updates in <100ms (sketched below)

Visual Search (ONNX Runtime + CLIP)

User uploads photo → ONNX CLIP extracts features → similarity search finds matching products → "Shop this look" in <500ms

Predictive Cart (WebLLM + Embeddings)

User adds rain jacket → client model infers "outdoor, weather protection" → complementary items (boots, umbrella) surface before user asks
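A minimal sketch of the Instant Refinement flow, reusing the engine, embedder, and findSimilar from earlier; applyFilter is a hypothetical app-level function that re-renders the product grid:

// Instant Refinement: classify intent with WebLLM, then re-rank via embeddings.
async function refine(userUtterance: string) {
  // 1. Map the free-form request to a short search phrase with the local LLM
  const intent = await engine.chat.completions.create({
    messages: [
      { role: "system", content: "Map the user's request to a short product search phrase. Reply with the phrase only." },
      { role: "user", content: userUtterance }
    ],
    temperature: 0,
    max_tokens: 20
  });
  const phrase = intent.choices[0].message.content ?? userUtterance;

  // 2. Re-rank products against the inferred phrase via local embeddings
  const matches = await findSimilar(phrase, 12);

  // 3. Update the layout with the re-ranked products (hypothetical helper)
  applyFilter(matches.map(m => m.product));
}

refine("something warmer"); // → jackets and sweaters surface first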

Progressive Enhancement

V3 uses progressive enhancement to work on all devices while providing the best experience on capable hardware.

// Feature detection for model loading
async function initializeModels() {
  const capabilities = await detectCapabilities();

  if (capabilities.webgpu && capabilities.ram >= 8) {
    // Full experience: WebLLM + embeddings
    await loadWebLLM();
    await loadEmbeddings();
  } else if (capabilities.wasm && capabilities.ram >= 4) {
    // Reduced experience: embeddings only
    await loadEmbeddings();
  } else {
    // Fallback: server-side only (V2 behavior)
    console.log('Using server-side AI');
  }
}

async function detectCapabilities() {
  // 'gpu' in navigator only proves the API exists; requesting an adapter
  // confirms a usable GPU is actually available.
  const adapter = 'gpu' in navigator
    ? await navigator.gpu.requestAdapter()
    : null;
  return {
    webgpu: adapter !== null,
    wasm: typeof WebAssembly !== 'undefined',
    ram: navigator.deviceMemory || 4 // Estimate; the API reports at most 8
  };
}
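
Model initialization should not block first paint; one way to wire it up (a sketch, not prescribed by V3):

// Start model loading once the page is interactive; a failure falls
// back to server-side AI instead of breaking the page.
window.addEventListener('load', () => {
  initializeModels().catch((err) => {
    console.warn('Client model init failed; using server-side AI', err);
  });
});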