Latency in conversational AI is not a soft UX concern. Below 400ms, an assistant feels alive; above 1,000ms, users start typing over the top of it. The path from "feels broken" to "feels instant" is mechanical — edge orchestration, smart caching, and tier-zero routing. None of it is magic.
TL;DR — Where the Latency Actually Hides
A naive RAG-style chat request, measured end-to-end, decomposes like this:
| Stage | Typical latency | Why |
|---|---|---|
| DNS + TLS to origin | 50-120ms | Distance to origin region |
| Auth check (DB lookup) | 60-150ms | Database round-trip |
| Vector retrieval | 80-200ms | Embedding + ANN search |
| Provider network hop | 50-100ms | Origin → LLM provider |
| Model TTFT | 200-700ms | Prefill + first token |
| Total | 440-1,270ms | Compounding |
The lowest-hanging fruit is rarely the model. It is the three round-trips before the model.
Move the Gateway to the Edge
Cloudflare Workers, Vercel Edge Functions, Deno Deploy, and Fastly Compute@Edge all give you the same primitive: a code execution environment that runs in the POP closest to the user, before any request crosses an ocean.
The first three things to move there:
- TLS termination + JWT signature verification. The signature check is offline; no database needed. You verify the user is who they claim to be without leaving the edge.
- Coarse rate limiting. Per-IP and per-user, in a regional KV store (Workers KV, Vercel KV).
- Request shaping — input length cap, language detection, PII regex pre-scan, intent classification.
```ts
// Cloudflare Worker — edge gateway pattern
export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const start = Date.now();

    // 1. Verify JWT at the edge (signature only — no DB)
    const user = await verifyJwt(req.headers.get("authorization"), env.JWT_SECRET);
    if (!user) return new Response("unauthorized", { status: 401 });

    // 2. Rate limit in regional KV
    if (!(await rateLimit(env.KV, user.id, 60, 100))) {
      return new Response("rate_limited", { status: 429 });
    }

    // 3. Request shaping (Wasm-compiled tokenizer for accurate budget)
    const body = await req.json();
    const tokens = tokenizer.countTokens(body.message);
    if (tokens > 4000) return new Response("too_long", { status: 413 });

    // 4. Semantic cache lookup
    const hit = await semanticCache.lookup(env, body.message, user.tenant_id);
    if (hit) {
      return streamFromCache(hit, start);
    }

    // 5. Route to origin or directly to provider
    return await proxyToInferenceOrigin(req, user, body, env);
  },
};
```
This block of code typically saves 150-250ms of round-trip on global traffic. The single biggest win is collapsing the auth lookup into a signature check.
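The `verifyJwt` and `rateLimit` helpers the gateway calls are not shown above. Here are minimal sketches, assuming HS256-signed tokens verified with Web Crypto and a fixed-window counter in Workers KV; the claim names and parameters are illustrative, not a prescribed contract:

```ts
// verifyJwt: signature and expiry check only, no database round-trip.
// Assumes HS256; claim names (sub, tenant_id) are illustrative.
async function verifyJwt(
  authHeader: string | null,
  secret: string,
): Promise<{ id: string; tenant_id: string } | null> {
  if (!authHeader?.startsWith("Bearer ")) return null;
  const [headerB64, payloadB64, sigB64] = authHeader.slice(7).split(".");
  if (!headerB64 || !payloadB64 || !sigB64) return null;

  const enc = new TextEncoder();
  const b64urlToBytes = (s: string) => {
    const b64 = s.replace(/-/g, "+").replace(/_/g, "/");
    return Uint8Array.from(atob(b64.padEnd(Math.ceil(b64.length / 4) * 4, "=")), (c) => c.charCodeAt(0));
  };

  const key = await crypto.subtle.importKey(
    "raw", enc.encode(secret), { name: "HMAC", hash: "SHA-256" }, false, ["verify"],
  );
  const valid = await crypto.subtle.verify(
    "HMAC", key, b64urlToBytes(sigB64), enc.encode(`${headerB64}.${payloadB64}`),
  );
  if (!valid) return null;

  const claims = JSON.parse(new TextDecoder().decode(b64urlToBytes(payloadB64)));
  if (claims.exp && claims.exp * 1000 < Date.now()) return null; // expiry still enforced locally
  return { id: claims.sub, tenant_id: claims.tenant_id };
}

// rateLimit: fixed-window counter in Workers KV. KV is eventually consistent,
// so treat this as a coarse limit, not an exact one.
async function rateLimit(
  kv: KVNamespace,
  userId: string,
  windowSeconds: number,
  maxRequests: number,
): Promise<boolean> {
  const window = Math.floor(Date.now() / 1000 / windowSeconds);
  const key = `rl:${userId}:${window}`;
  const count = parseInt((await kv.get(key)) ?? "0", 10);
  if (count >= maxRequests) return false;
  await kv.put(key, String(count + 1), { expirationTtl: windowSeconds * 2 });
  return true;
}
```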
Semantic Caching Done Right
Hot-query caching cuts the cost and the latency of repeated questions. The naive version is "hash the prompt and reuse." The production version is more nuanced.
The key is the key. Cache keys must include all identity-bound state:
```ts
const cacheKey = await embed(normalizedQuery);
const namespace = `${tenant_id}:${role}:${locale}:${model_tier}`;
const hit = await vectorCache.search(namespace, cacheKey, { threshold: 0.95 });
```
Properties of a safe semantic cache:
- Strict similarity threshold. 0.95+ cosine similarity, not 0.85. A loose threshold returns plausibly-wrong answers, which is worse than no cache.
- Tenant-scoped namespace. Two customers asking the same generic question still pay for full inference if their tenant policies differ.
- Short TTL. 60-300 seconds is plenty for "what's the status of order #X" style queries. Hours for stable definitional content.
- No personalised fragments cached. If the response interpolates user-specific facts, don't cache the rendered response — cache the template and re-fill at serve time.
Hit rates we typically see in production: 15-40% on chat-style traffic, 40-70% on support-FAQ-style traffic.
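A minimal lookup sketch that enforces the properties above. `VectorIndex` here is a hypothetical client interface standing in for whatever vector store you use, not any specific product API:

```ts
// Hypothetical vector-store client; swap in your actual index client.
interface VectorMatch {
  score: number;
  metadata: { answer: string; expiresAt: number };
}
interface VectorIndex {
  query(namespace: string, embedding: number[], opts: { topK: number }): Promise<VectorMatch[]>;
}

async function semanticCacheLookup(
  index: VectorIndex,
  namespace: string, // `${tenant_id}:${role}:${locale}:${model_tier}`
  queryEmbedding: number[],
): Promise<string | null> {
  const [match] = await index.query(namespace, queryEmbedding, { topK: 1 });
  // Strict threshold (0.95+): near-misses fall through to full inference.
  if (!match || match.score < 0.95) return null;
  // Short TTL enforced at read time: expired entries count as misses.
  if (Date.now() > match.metadata.expiresAt) return null;
  return match.metadata.answer;
}
```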
Tier-Zero Routing — The Small Model Filter
Most queries don't need a frontier model. A small model running at the edge can answer 50-70% of them.
The pattern:
- Classify the query with a 1-3B parameter model in the edge runtime (Wasm or WebGPU).
- Decide whether the small model can answer directly, must escalate to a medium provider model, or must escalate to a large model.
- Route accordingly. Track outcomes.
```ts
async function tierZeroRouter(query: string, ctx: Ctx): Promise<Response> {
  const route = await edgeSmallModel.classify(query, {
    classes: ["trivial", "factual_rag", "complex_reasoning"],
  });
  if (route.label === "trivial" && route.confidence > 0.9) {
    return await edgeSmallModel.generate(query, ctx);
  }
  if (route.label === "factual_rag") {
    return await mediumProviderRag(query, ctx);
  }
  return await largeProviderReasoning(query, ctx);
}
```
In production, this typically cuts the share of requests that touch a large model by 60%+, with no measurable quality regression on a well-tuned classifier.
Prefill Optimization
Once you can't avoid hitting the large model, the next lever is prefill time — the time the model spends ingesting your prompt before producing tokens.
What reduces prefill:
- Prompt caching. Anthropic, OpenAI, and Google all offer prompt caching with significant TTFT improvement on the cached portion. Structure prompts so the stable system + instructions come first, the per-request bits last.
- Shorter prompts. Aggressive retrieval (top-4, not top-20) and aggressive memory selection (only the procedurally-relevant docs, not the full library).
- Speculative decoding. Increasingly available on managed APIs. It accelerates decoding rather than prefill, so the win shows up in total response time rather than TTFT, but it noticeably shortens simple completions.
A practical rule: a system prompt that exceeds 4,000 tokens probably contains material that should have been retrieved on-demand, not always-on.
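A sketch of the ordering rule in practice: keep the stable system prompt and instructions as a byte-identical prefix on every call so provider-side prompt caching can reuse it, and append retrieval and the user turn afterward. The message shape below is generic, not any particular provider's API:

```ts
type Msg = { role: "system" | "user"; content: string };

// Stable prefix first (cacheable), volatile material last.
function buildPrompt(stableSystem: string, retrievedDocs: string[], userMessage: string): Msg[] {
  return [
    // Identical on every request: this is the portion prompt caching can reuse.
    { role: "system", content: stableSystem },
    // Per-request tail: top-k retrieval plus the user turn, kept out of the cached prefix.
    { role: "user", content: `${retrievedDocs.join("\n\n")}\n\n${userMessage}` },
  ];
}
```

Any per-request detail that leaks into the stable prefix (a timestamp, the user's name) changes the prefix bytes and defeats the cache.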
Streaming Discipline
Streaming is the difference between perceived TTFT and real TTFT. A streamed response whose first token lands at 700ms feels faster than a buffered response generated with a 400ms TTFT, because the streamed user sees the answer start while the buffered user stares at a loader until the last token arrives.
Concrete patterns:
- Server-Sent Events end-to-end. Don't buffer in your gateway. Don't buffer in your CDN.
- First-token boost. Send an empty `data:` event as soon as the request is accepted, before the model starts streaming. Some browsers begin painting earlier.
- Token coalescing. Don't ship one network frame per token; coalesce on a 10-20ms timer (see the sketch after this list). The frame overhead is otherwise non-trivial.
- Backpressure handling. If the client is slow, throttle generation, don't buffer indefinitely.
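A sketch of the first-token boost and coalescing patterns together, assuming model output arrives as an async iterable of token strings; the 15ms flush interval and the SSE event shape are illustrative choices, and backpressure handling is omitted for brevity:

```ts
function coalescedSseStream(
  modelTokens: AsyncIterable<string>,
  flushMs = 15, // one network frame per ~15ms, not per token
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream({
    async start(controller) {
      let buffer = "";
      const flush = () => {
        if (!buffer) return;
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ delta: buffer })}\n\n`));
        buffer = "";
      };

      // First-token boost: open the stream with an empty event before any tokens exist.
      controller.enqueue(encoder.encode("data: {}\n\n"));

      const timer = setInterval(flush, flushMs);
      try {
        for await (const token of modelTokens) {
          buffer += token; // accumulate; the timer ships at most one frame per interval
        }
      } finally {
        clearInterval(timer);
        flush(); // ship any trailing tokens
        controller.enqueue(encoder.encode("data: [DONE]\n\n"));
        controller.close();
      }
    },
  });
}
```

Return it straight from the gateway with a `text/event-stream` content type so neither the gateway nor the CDN re-buffers it.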
Measurement: The Numbers That Matter
The handful of numbers worth tracking, split by percentile and region, with explicit targets:
| Metric | What | Target |
|---|---|---|
| p50 TTFT global | Median first-token time | <400ms |
| p95 TTFT global | 95th percentile | <900ms |
| p50 TTFT distant region | Median for users far from origin | <600ms |
| Cache hit rate | Semantic + exact | >25% |
| Tier-0 share | % handled without large model | >50% |
| End-to-end p50 | First-token to last-token | depends on length |
Wire OpenTelemetry from the edge. A typical trace should show edge gateway → cache lookup (miss) → router → retrieval → model → first token, and per-span latencies land in your dashboard the same day.
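A minimal tracing sketch with `@opentelemetry/api`, assuming an SDK and exporter are already registered for the edge runtime; the span names and the helpers in the usage comment are illustrative:

```ts
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("edge-gateway");

// Wrap each stage in its own span so per-stage latency shows up in the trace.
async function timed<T>(name: string, fn: () => Promise<T>): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      return await fn();
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}

// Usage inside the gateway handler (cacheLookup, routeQuery, callModel are
// hypothetical stand-ins for the stages named in the trace above):
//   const hit   = await timed("cache.lookup", () => cacheLookup(query));
//   const route = await timed("router.classify", () => routeQuery(query));
//   const resp  = await timed("model.generate", () => callModel(route, query));
```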
The Order to Implement
If you are starting from a 1.2-second TTFT product, the order of optimization matters because each layer changes the size of the problem the next layer is solving.
- Edge gateway with JWT verification and rate limiting. Saves 80-150ms.
- Semantic caching with strict keying. Saves 200-400ms on cache hits.
- Streaming end-to-end. Saves ~0ms of real latency but transforms perceived latency.
- Tier-zero small-model routing. Removes large-model latency from 50-60% of traffic.
- Prompt caching. Saves 100-300ms on cached prefill portions.
- Retrieval tightening. Cuts retrieval from 200ms to 80ms.
- Regional model deployments. Last resort — high cost, real win only for voice-grade SLAs.
Each step is independently measurable, independently revertable, and independently valuable.
When Edge Inference Is the Wrong Answer
Edge is not always right.
- Highly stateful agents with multi-step memory benefit from being close to their state store, not close to the user. If your state lives in `us-east-1`, putting the gateway in `eu-west-1` adds round-trips.
- Multi-tenant clusters with strict isolation are sometimes simpler to operate in a single region.
- Compliance regimes that require all processing in one jurisdiction make edge orchestration tricky — the gateway has to enforce regional routing.
For most products, none of these are show-stoppers. They are constraints to design around, not reasons to skip the optimization.
Frequently Asked Questions
What is TTFT and why does it matter?
Time to First Token is the latency from request to the first character of the model's response. Below ~400ms it feels instant; above ~1s users perceive the system as "thinking too hard." For voice and high-frequency tool calling, TTFT is the dominant UX metric.
How much can edge inference actually save?
Moving auth, routing, and first-pass orchestration to edge functions typically saves 150-250ms of round-trip latency for global traffic. Combined with semantic caching for hot queries, total TTFT reduction lands in the 40-70% range.
Does edge inference run the LLM itself?
Sometimes. Small models (under ~3B parameters) run on edge runtimes via WebGPU or compiled Wasm. Large models still call out to a provider or a regional GPU pool, but the orchestration around the call lives at the edge.
Is semantic caching safe for personalized responses?
Yes, when keyed correctly. The cache key must include any identity-bound state (tenant_id, role, locale). With strict keying and short TTLs (60-300s), semantic caching is safe and cuts hot-query latency to sub-50ms.
Should I optimize TTFT before launch or after?
Before, but only past a threshold. A 1.5-second TTFT product is unusable for chat; ship at 800ms or better, then optimize toward 400ms as a continuous improvement.
Key Takeaways
- TTFT under 400ms is the floor for conversational AI; everything above feels sluggish.
- Moving auth, routing, and orchestration to edge functions reclaims 150-250ms for global users.
- Semantic caching with tenant-scoped keys handles 15-40% of production traffic at sub-50ms.
- First-pass small-model routing pays for itself by handling 60% of queries without ever invoking a large model.
- Optimize in order — each layer changes the size of the problem the next is solving.
