
Server-Side AI Streaming with Next.js: Production Patterns for Resilient Token Flow

Synthara Web Engineering · Engineering Team

Streaming an AI response token-by-token is the easy part. Doing it under real network conditions — at scale, with reconnects, with tool calls interleaved, and with observability the on-call engineer can use at 2am — is where the engineering lives. The patterns below survived production.

TL;DR — The Production Streaming Stack

| Concern | Production answer |
| --- | --- |
| Transport | Server-Sent Events (SSE) |
| Protocol | AI SDK Data Stream Protocol (multiplexed) |
| Reconnection | Resumable streams with last-event-id cursor |
| Backpressure | Token coalescing on 10-20ms timer + client-side throttle |
| Tool calls in stream | Multiplexed as typed events alongside text deltas |
| Observability | Trace ID propagated from first event to last |
| Termination | Graceful with explicit done event; idempotent on retry |

Why SSE Beats WebSockets for AI

The default choice that surprises some teams: AI token streaming should use Server-Sent Events, not WebSockets.

| Property | SSE | WebSockets |
| --- | --- | --- |
| Direction | Server → client only | Bidirectional |
| Proxy / CDN compatibility | Excellent (looks like HTTP) | Often blocked by corporate proxies |
| Reconnection | Native (EventSource retries with last-event-id) | Manual |
| Backpressure | HTTP-native | Application-managed |
| Multiplexed events | Yes, via event types | Yes, via custom protocol |
| HTTP/2 multiplexing | Yes | No |

For AI streaming, the data flow is one-way: tokens flow server → client; only the initial request goes client → server. SSE matches that shape exactly. WebSockets add complexity (handshake, ping/pong, framing) that you do not benefit from.

WebSockets become correct when you genuinely need bidirectional real-time — voice agents, collaborative editing, live multiplayer interactions.
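
To make that shape concrete, here is what a bare SSE response looks like at the HTTP level in a Next.js Route Handler, with no SDK involved. This is a minimal sketch for illustration only, not the production handler shown in the next section; the demo payload and the 100ms delay are placeholders.

```ts
// app/api/sse-demo/route.ts -- minimal SSE response, no AI SDK involved
export async function GET() {
  const encoder = new TextEncoder();

  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      // Each SSE event is a "data:" line followed by a blank line
      for (const word of ["Hello", " ", "world"]) {
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ delta: word })}\n\n`));
        await new Promise((resolve) => setTimeout(resolve, 100)); // simulated token latency
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache, no-transform",
      Connection: "keep-alive",
    },
  });
}
```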

The Server Side, Concretely

A production streaming Route Handler that does the right things by default:

```ts
// app/api/chat/route.ts
import { streamText, type Message } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { headers } from "next/headers";
import { z } from "zod";

export const runtime = "edge";
export const maxDuration = 60;

const RequestSchema = z.object({
  messages: z.array(
    z.object({
      role: z.enum(["user", "assistant", "system", "tool"]),
      content: z.string(),
    })
  ),
  conversationId: z.string().uuid(),
});

export async function POST(req: Request) {
  const traceId = crypto.randomUUID();

  const parsed = RequestSchema.safeParse(await req.json());
  if (!parsed.success) {
    return new Response("invalid_request", { status: 400 });
  }
  const { messages, conversationId } = parsed.data;

  const user = await authenticate(req);

  // Pre-flight budget check
  if (await isOverBudget(user.tenant_id)) {
    return streamErrorResponse("budget_exceeded", traceId);
  }

  const result = await streamText({
    model: anthropic("claude-sonnet-4-6"),
    messages: messages as Message[],
    maxTokens: 2048,
    tools: registeredTools(user),
    experimental_telemetry: {
      isEnabled: true,
      metadata: { traceId, conversationId, tenantId: user.tenant_id },
    },
    onFinish: async (event) => {
      await persistAssistantTurn(conversationId, event, traceId);
      await emitMetrics(event, traceId);
    },
  });

  return result.toDataStreamResponse({
    headers: {
      "X-Trace-Id": traceId,
      "Cache-Control": "no-cache, no-transform",
      "X-Accel-Buffering": "no", // disable nginx buffering
    },
  });
}
```

Three properties worth flagging:

  • X-Accel-Buffering: no. Nginx, Cloudflare, and other proxies buffer responses by default. This header opts your stream out. Skip this and you get the user complaint: "the model finishes typing and the whole answer shows up at once."
  • Schema validation up-front. Catches malformed requests before they hit billing.
  • Trace ID in the response header. Lets support engineers correlate a user's complaint with the exact server-side trace.

The Client Side: Token Receipt and Demux

```tsx
"use client";

import { useChat } from "ai/react";

export function ChatStream({ conversationId }: { conversationId: string }) {
  const {
    messages,
    input,
    handleInputChange,
    handleSubmit,
    isLoading,
    stop,
    reload,
    error,
  } = useChat({
    api: "/api/chat",
    body: { conversationId },
    initialMessages: [],
    onResponse: (response) => {
      const traceId = response.headers.get("X-Trace-Id");
      if (traceId) (window as any).__lastTraceId = traceId;
    },
    onError: (err) => {
      logError(err, { conversationId });
    },
  });

  return (
    <div>
      {messages.map((m) => (
        <Message key={m.id} message={m} />
      ))}
      {error && <ErrorBanner onRetry={() => reload()} />}
      {isLoading && <StreamingIndicator onCancel={stop} />}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
      </form>
    </div>
  );
}
```

The AI SDK's useChat handles SSE consumption, message accumulation, and tool-call demultiplexing. What you have to add: the error UI, the cancel UI, and the trace-id capture for support flows.

Resumable Streams

A dropped connection mid-generation is the most common failure mode under real network conditions. Two patterns handle it.

Client-side reconnect with last-event-id. SSE includes a built-in Last-Event-ID mechanism. When EventSource reconnects, it sends the last event ID it received. Your server resumes from there.

```
// Server emits event IDs per token
data: {"type":"text-delta","delta":"Hello"}
id: 1

data: {"type":"text-delta","delta":" world"}
id: 2

// On reconnect with Last-Event-ID: 1, resume from token 2
```
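
For reference, this is how the browser's native EventSource consumes such a stream: the browser re-opens the connection after a drop and sends the Last-Event-ID header for you. Note that useChat uses fetch streaming rather than EventSource (EventSource can only issue GET requests, with no POST body), so treat this as a sketch of the mechanism; the endpoint URL and the appendToCurrentMessage / setConnectionState helpers are assumptions, not AI SDK APIs.

```ts
// Minimal sketch of native EventSource reconnection (not the AI SDK transport).
const source = new EventSource(`/api/chat/stream?conversationId=${conversationId}`);

source.onmessage = (e: MessageEvent) => {
  // e.lastEventId carries the "id:" field; the browser replays it on reconnect
  const event = JSON.parse(e.data);
  if (event.type === "text-delta") appendToCurrentMessage(event.delta); // hypothetical UI helper
};

source.onerror = () => {
  // EventSource retries automatically; just surface a "reconnecting" state
  setConnectionState("reconnecting"); // hypothetical UI helper
};
```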

The harder bit: you have to store the in-flight generation state server-side so a reconnect can resume.

Server-side checkpointed streams. A more robust pattern stores the in-flight stream state in Redis. Any worker can resume the stream:

```ts
async function* checkpointedStream(streamId: string, initialMessages: Message[]) {
  // How far the previous connection got, if anywhere
  const cursor = Number(await redis.get(`stream:${streamId}:cursor`)) || 0;

  // Replay tokens already generated before the disconnect
  if (cursor > 0) {
    const replay = await redis.lrange(`stream:${streamId}:tokens`, 0, cursor - 1);
    for (const event of replay) yield event;
  }

  // Continue the generation from the saved position
  const remaining = await streamFromProvider(initialMessages, { fromTokenIndex: cursor });
  for await (const token of remaining) {
    await redis.rpush(`stream:${streamId}:tokens`, JSON.stringify(token));
    await redis.incr(`stream:${streamId}:cursor`);
    yield JSON.stringify(token);
  }
}
```

The Vercel AI SDK has experimental support for resumable streams that implements this pattern out of the box. It is becoming standard in 2026.

Multiplexed Data Streams

A production AI response is not just text — it can include tool calls, tool results, metadata, citations, and errors. All on the same SSE channel.

The AI SDK's Data Stream Protocol encodes this:

data: {"type":"text-delta","textDelta":"Looking that up"}
data: {"type":"tool-call","toolCallId":"call_1","toolName":"search","args":{"q":"..."}}
data: {"type":"tool-result","toolCallId":"call_1","result":{"hits":[...]}}
data: {"type":"text-delta","textDelta":". I found three results..."}
data: {"type":"finish","finishReason":"stop","usage":{"promptTokens":340,"completionTokens":58}}

The client switches behavior per event type: text deltas append to the current message; tool calls render a "searching..." indicator; tool results may be hidden or shown; finish updates token usage UI.
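
A minimal client-side demux for those event types might look like the sketch below. The event shapes mirror the wire examples above; the handler functions (appendToAssistantMessage and friends) are hypothetical stand-ins for your UI state updates, and useChat already does the equivalent internally.

```ts
type StreamEvent =
  | { type: "text-delta"; textDelta: string }
  | { type: "tool-call"; toolCallId: string; toolName: string; args: unknown }
  | { type: "tool-result"; toolCallId: string; result: unknown }
  | { type: "finish"; finishReason: string; usage: { promptTokens: number; completionTokens: number } };

function handleStreamEvent(event: StreamEvent) {
  switch (event.type) {
    case "text-delta":
      appendToAssistantMessage(event.textDelta); // grow the current message bubble
      break;
    case "tool-call":
      showToolIndicator(event.toolName); // e.g. render "searching..."
      break;
    case "tool-result":
      recordToolResult(event.toolCallId, event.result); // often hidden from the user
      break;
    case "finish":
      updateUsage(event.usage); // token counters, billing UI
      break;
  }
}
```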

Backpressure and Coalescing

Token-by-token frames are expensive — each SSE event has framing overhead, and clients render in animation frames anyway. Coalesce tokens server-side on a short timer:

```ts
async function* coalescedStream(source: AsyncIterable<string>, windowMs: number = 16) {
  let buffer = "";
  let lastFlush = Date.now();

  for await (const token of source) {
    buffer += token;
    if (Date.now() - lastFlush >= windowMs) {
      yield buffer;
      buffer = "";
      lastFlush = Date.now();
    }
  }
  if (buffer) yield buffer;
}
```

16ms maps onto 60fps display refresh. Coalescing at this window typically reduces network frames by 60-80% with no perceptible latency increase.
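
The client-side half mentioned in the TL;DR is a render throttle: accumulate incoming deltas and flush them once per animation frame instead of re-rendering on every chunk. A minimal sketch, where setDisplayedText stands in for whatever state setter your UI actually uses:

```ts
// Accumulate streamed deltas and flush once per animation frame.
let pending = "";
let frameScheduled = false;

function onTextDelta(
  delta: string,
  setDisplayedText: (updater: (prev: string) => string) => void
) {
  pending += delta;
  if (frameScheduled) return;

  frameScheduled = true;
  requestAnimationFrame(() => {
    const chunk = pending;
    pending = "";
    frameScheduled = false;
    setDisplayedText((prev) => prev + chunk);
  });
}
```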

Observability for Streams

The defining hard part: a single stream produces many events over many seconds. Correlating them at 2am during an incident requires discipline.

The minimum:

  • Trace ID in the response header. The user's support ticket can reference it.
  • Span per stream event type. OpenTelemetry instrumentation with stream.event_type attribute.
  • Token count in the finish event. Bills, alerts, dashboards depend on this.
  • Error type, not just error message. A timeout is a different signal from a content policy refusal.

```ts
import { trace } from "@opentelemetry/api";

async function recordStreamMetrics(streamId: string, event: StreamEvent, traceId: string) {
  // Attach stream attributes to the active OpenTelemetry span for this request
  const span = trace.getActiveSpan();
  span?.setAttributes({
    "stream.id": streamId,
    "stream.event": event.type,
    "stream.tokens": event.tokens ?? 0,
    "stream.tool_calls": event.toolCalls?.length ?? 0,
    "trace.id": traceId,
  });

  // `metrics` here is the app's own metrics facade (histograms, counters)
  if (event.type === "finish") {
    metrics.histogram("stream.duration_ms").record(event.durationMs);
    metrics.histogram("stream.tokens_out").record(event.tokensOut);
  }
}
```

Edge Cases That Bite

A list of failure modes we have shipped fixes for:

  • Proxy buffering. Add X-Accel-Buffering: no and disable buffering at every layer.
  • Connection limits. Browsers cap SSE connections per origin (typically 6). Beyond that, new tabs hang. Use HTTP/2 to multiplex.
  • Mobile Safari background tab. Suspends connections aggressively. Build the UI to resume on visibility-change.
  • Cloudflare 100s timeout. Long generations on Cloudflare Workers can hit the 100-second free-tier limit. Use maxDuration and consider Workers Paid or split-stream patterns.
  • Provider rate limits mid-stream. Generation can fail partway through. Emit a clean error event; the client should render a "Retry" affordance instead of looking frozen.
  • The user clicks Stop. Implement stop() cleanly. The server-side handler should abort the upstream provider request, not just stop forwarding — see the sketch after this list.
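
A hedged sketch of that last point, assuming your AI SDK version exposes the abortSignal option on streamText: forwarding the request's own signal lets a client stop() or disconnect cancel the upstream provider call as well.

```ts
// Inside the POST handler from earlier: propagate client aborts upstream.
const result = await streamText({
  model: anthropic("claude-sonnet-4-6"),
  messages: messages as Message[],
  // req.signal fires when the client disconnects or the UI calls stop();
  // forwarding it cancels the provider generation instead of just the forwarding.
  abortSignal: req.signal,
});
```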

Frequently Asked Questions

Should I use Server-Sent Events or WebSockets for AI streaming?

Server-Sent Events for one-way token streams from server to client — which covers 95% of AI use cases. WebSockets only when you need bidirectional real-time (voice, collaborative editing). SSE is simpler, survives proxies better, and reconnects natively.

How do I handle a dropped connection mid-stream?

Two patterns: client-side reconnect with last-token cursor (the AI SDK's resumable streams), or server-side checkpointing that lets a new connection resume from the saved state. The second is more complex but recovers cleanly from network hiccups without the user noticing.

Can I stream structured tool calls and text in the same response?

Yes. The AI SDK's data stream protocol multiplexes text deltas, tool calls, tool results, and metadata events in a single SSE channel. The client switches behavior per event type.

What's the right buffer policy for streaming AI tokens?

Coalesce tokens on a 10-20ms timer to reduce frame overhead, but never buffer beyond that. Buffer-and-flush patterns add perceived latency. The user-visible delay between a token being generated and a token being rendered should stay under 50ms end-to-end.

How long can an SSE connection stay open?

HTTP/2 servers will hold SSE connections open indefinitely. Practical limits come from your platform (Cloudflare Workers free tier: 100s; Vercel Edge: 25s; Vercel Fluid Compute: 5min). Plan for them.

Key Takeaways

  • SSE is the right protocol for AI token streams in 95% of cases; reserve WebSockets for bidirectional needs.
  • Resumable streams turn network blips from visible failures into invisible retries.
  • Multiplexed data streams (text + tool calls + metadata) belong on one channel, demuxed on the client.
  • Observability per stream requires propagating a trace_id from the first SSE event to the last.
  • Plan for proxy buffering, connection limits, and platform timeouts before they bite in production.

Article Taxonomy
#streaming #nextjs #ai-sdk #server-sent-events #resumable-streams #production-ai