Engineering

How We Built Real-Time AI Streaming Responses

A deep dive into our token-by-token streaming architecture that makes OpusVoice AI feel instant and alive.

Feb 12, 2026 · 8 min read

When you chat with OpusVoice AI, responses appear token by token — just like watching someone type. This isn't just a visual trick. It fundamentally changes how the product feels.

Why Streaming Matters

A typical AI response takes 2–4 seconds to generate fully. Without streaming, users stare at a blank screen for that entire duration. With streaming, the first token appears in under 200ms. The perceived latency drops dramatically.

The Architecture

Our streaming pipeline has three stages:

1. Context Assembly. When a visitor sends a message, we pull the conversation history, run a semantic search against the workspace's knowledge base, and assemble a context window. Knowledge retrieval and history fetch run concurrently, so neither lookup waits on the other (sketched after this list).

2. Token Generation. We send the assembled prompt to our AI model with streaming enabled. Tokens arrive one at a time over a server-sent event stream.

3. Real-Time Delivery. Each token is pushed to the frontend via WebSocket. The UI renders tokens as they arrive with a subtle shimmer effect on the latest token, creating a natural typing feel.
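
To make stage 1 concrete, here is a minimal TypeScript sketch of the parallel context assembly. The helper names (fetchHistory, searchKnowledgeBase) and the shape of the context object are illustrative assumptions, not our internal API.

```typescript
// Stage 1 sketch: assemble the prompt context for a new visitor message.
// fetchHistory and searchKnowledgeBase are hypothetical stand-ins for the
// conversation store and the semantic search over the workspace knowledge base.

interface ContextWindow {
  history: string[];    // recent conversation turns
  knowledge: string[];  // top-matching knowledge base snippets
  userMessage: string;
}

async function fetchHistory(conversationId: string): Promise<string[]> {
  return []; // stub: would load recent turns from the conversation store
}

async function searchKnowledgeBase(workspaceId: string, query: string): Promise<string[]> {
  return []; // stub: would run a semantic search and return matching snippets
}

async function assembleContext(
  conversationId: string,
  workspaceId: string,
  userMessage: string
): Promise<ContextWindow> {
  // The two lookups are independent, so they run concurrently.
  const [history, knowledge] = await Promise.all([
    fetchHistory(conversationId),
    searchKnowledgeBase(workspaceId, userMessage),
  ]);

  return { history, knowledge, userMessage };
}
```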
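
Stages 2 and 3 boil down to reading the model's server-sent event stream and forwarding each token the moment it is decoded. The sketch below is a simplified version under assumed details: the endpoint URL, the "data:" payload format, and the end-of-stream sentinel are illustrative, and onToken is where the WebSocket push to the visitor would happen.

```typescript
// Stages 2 and 3 sketch: consume the model's SSE token stream and relay
// each token as soon as it is decoded. The URL and payload format here are
// assumptions; onToken is where the push to the visitor's WebSocket happens.

async function streamCompletion(
  prompt: string,
  onToken: (token: string) => void
): Promise<void> {
  const response = await fetch("https://model.example.com/v1/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, stream: true }),
  });

  if (!response.body) {
    throw new Error("response body is not a readable stream");
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });

    // SSE frames are newline-delimited; keep any partial line for the next chunk.
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";

    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const payload = line.slice("data: ".length).trim();
      if (payload === "[DONE]") return; // assumed end-of-stream sentinel
      onToken(payload); // stage 3: push to the frontend immediately
    }
  }
}
```

Passing a callback instead of a concrete socket keeps the relay logic testable and independent of whichever WebSocket layer sits in front of it.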

Handling Edge Cases

Streaming introduces complexity. What if the connection drops mid-response? What if the user sends another message before the AI finishes? We handle these with a message status system (sending → sent → delivered → seen) and conversation locking to prevent race conditions.
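
As a rough illustration of those safeguards, the sketch below models the message status progression and a per-conversation queue that holds a new generation request until the in-flight response finishes. The names are hypothetical, and the in-memory promise chain is a single-process simplification rather than our production locking mechanism.

```typescript
// Sketch of the safeguards: explicit message states plus a per-conversation
// queue so a second user message cannot race an in-flight response.

type MessageStatus = "sending" | "sent" | "delivered" | "seen";

interface Message {
  id: string;
  conversationId: string;
  text: string;
  status: MessageStatus; // advances sending -> sent -> delivered -> seen
}

class ConversationLock {
  private tails = new Map<string, Promise<void>>();

  // Runs task only after every previously queued task for the same
  // conversation has settled, so responses never interleave.
  async run<T>(conversationId: string, task: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(conversationId) ?? Promise.resolve();
    const result = previous.then(task);

    // The new tail resolves once this task settles, success or failure.
    this.tails.set(
      conversationId,
      result.then(() => undefined, () => undefined)
    );

    return result;
  }
}

// Usage: each incoming message handler serializes generation per conversation.
const locks = new ConversationLock();

async function handleUserMessage(conversationId: string, text: string) {
  await locks.run(conversationId, async () => {
    // assembleContext(conversationId, workspaceId, text) and
    // streamCompletion(...) would run here
  });
}
```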

The Result

Average time-to-first-token: 180ms. Average full response time: 2.1s. But because users see content appearing immediately, satisfaction scores are significantly higher than with batch responses.
