Client Models
Running LLMs and embedding models directly in the browser using WebGPU and WebAssembly, eliminating network latency from inference.
WebLLM
WebLLM brings large language models to the browser by leveraging WebGPU for GPU-accelerated inference. This enables conversational AI without server roundtrips.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Initialize the engine with a model
const engine = await CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC", {
  initProgressCallback: (progress) => {
    console.log(`Loading: ${progress.progress * 100}%`);
  }
});

// Generate a response
const response = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a shopping assistant." },
    { role: "user", content: "Show me something warmer" }
  ],
  temperature: 0.7,
  max_tokens: 150
});

console.log(response.choices[0].message.content);
// "I'll filter for warmer items like jackets and sweaters..."
Pros
- Zero network latency for generation
- Complete privacy (no data leaves the device)
- Works offline after the initial load
- OpenAI-compatible API

Cons
- Large initial download (1-2 GB)
- Requires WebGPU support
- Lower quality than GPT-4o
- Battery/thermal impact on mobile
Transformers.js
Hugging Face's JavaScript library provides access to 2,400+ models for embeddings, classification, and more. Version 3 (published as @huggingface/transformers) adds WebGPU acceleration.
import { pipeline, env } from '@huggingface/transformers';

// WASM fallback: use multiple CPU threads when WebGPU is unavailable
env.backends.onnx.wasm.numThreads = 4;

// Create an embedding pipeline with WebGPU acceleration (v3)
const embedder = await pipeline(
  'feature-extraction',
  'Xenova/all-MiniLM-L6-v2',
  { device: 'webgpu' }
);

// Generate embeddings for products
const productEmbeddings = await Promise.all(
  products.map(p => embedder(p.description, {
    pooling: 'mean',
    normalize: true
  }))
);

// Cosine similarity; for normalized embeddings this reduces to the dot product
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}

// Semantic search
async function findSimilar(query: string, topK = 5) {
  // Embed the query with the same options as the products
  const queryEmbedding = await embedder(query, {
    pooling: 'mean',
    normalize: true
  });
  return products
    .map((p, i) => ({
      product: p,
      score: cosineSimilarity(queryEmbedding.data, productEmbeddings[i].data)
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

Performance Characteristics
Commerce Use Cases
Instant Refinement
User says "something warmer" → client model classifies intent → embeddings filter products → layout updates in <100ms
Visual Search
User uploads photo → ONNX CLIP extracts features → similarity search finds matching products → "Shop this look" in <500ms
Predictive Cart
User adds rain jacket → client model infers "outdoor, weather protection" → complementary items (boots, umbrella) surface before the user asks
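To make the instant-refinement flow concrete, here is a minimal sketch that reuses the embedder and findSimilar helpers from the Transformers.js example. The zero-shot model choice, the candidate labels, and renderProducts are all illustrative assumptions, not part of the original design:

// Intent classification via zero-shot NLI (model choice is illustrative)
const intentClassifier = await pipeline(
  'zero-shot-classification',
  'Xenova/mobilebert-uncased-mnli'
);

async function handleRefinement(utterance: string) {
  // Step 1: classify the utterance against hypothetical intent labels
  const intent = await intentClassifier(utterance, [
    'refine results', 'new search', 'checkout'
  ]);

  if (intent.labels[0] === 'refine results') {
    // Step 2: embeddings rank products against the utterance
    const matches = await findSimilar(utterance, 12);
    // Step 3: renderProducts is a hypothetical app-specific layout update
    renderProducts(matches.map(m => m.product));
  }
}

// e.g. wired to a chat input or voice transcript
await handleRefinement("something warmer");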
Progressive Enhancement
V3 uses progressive enhancement to work on all devices while providing the best experience on capable hardware.
// Feature detection for model loading
async function initializeModels() {
  const capabilities = await detectCapabilities();

  if (capabilities.webgpu && capabilities.ram >= 8) {
    // Full experience: WebLLM + embeddings
    await loadWebLLM();
    await loadEmbeddings();
  } else if (capabilities.wasm && capabilities.ram >= 4) {
    // Reduced experience: embeddings only
    await loadEmbeddings();
  } else {
    // Fallback: server-side only (V2 behavior)
    console.log('Using server-side AI');
  }
}

async function detectCapabilities() {
  return {
    webgpu: 'gpu' in navigator,
    wasm: typeof WebAssembly !== 'undefined',
    ram: navigator.deviceMemory || 4 // Estimate
  };
}
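Note that 'gpu' in navigator only confirms the WebGPU API exists; the browser can still fail to provide an adapter (headless environments, blocklisted drivers). A stricter probe, sketched here, actually requests one (the `as any` cast stands in for @webgpu/types):

// Stricter probe: the WebGPU API can be present without a usable adapter
async function hasUsableWebGPU(): Promise<boolean> {
  if (!('gpu' in navigator)) return false;
  try {
    // requestAdapter() resolves to null when no suitable GPU is available
    const adapter = await (navigator as any).gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}

Swapping this into detectCapabilities keeps users with a stubbed-out WebGPU API on the embeddings-only or server-side path instead of failing mid-download.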