Real-time bi-directional WebSocket for streaming microphone audio and receiving AI voice responses with barge-in support.
Voice WebSocket Endpoint
Output Audio: Chunked MP3, 22.05 kHz
Quick Start
1. Connect
2. Authenticate
3. Ready Signal
4. Stream Audio
Open WebSocket Connection
Audio Format Specifications
- Input
- Output
Required Format
| Property | Value | Description |
|---|---|---|
| Format | PCM16LE | 16-bit signed integer, little-endian |
| Channels | Mono (1) | Single channel only |
| Sample Rate | 16,000 Hz | Required sampling rate |
| Bit Depth | 16-bit | 2 bytes per sample |
Frame Sizes
| Duration | Bytes | Recommendation |
|---|---|---|
| 20 ms | 640 bytes | ⭐ Recommended - Best for low latency |
| 40 ms | 1,280 bytes | ✅ Good - Balanced performance |
| 50 ms | 1,600 bytes | ⚠️ Max - Higher latency |
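The byte counts in the frame table follow directly from the input format (16,000 samples per second, 2 bytes per sample, one channel). The sketch below shows that arithmetic and one possible way to convert Float32 samples, as produced by the Web Audio API, into PCM16LE bytes. The function names `frameBytes` and `floatToPcm16le` are illustrative and not part of this API.

```typescript
// Input format from the tables above: PCM16LE, mono, 16,000 Hz.
const SAMPLE_RATE_HZ = 16000;
const BYTES_PER_SAMPLE = 2;

/** Bytes in one mono PCM16 frame of the given duration. */
function frameBytes(durationMs: number): number {
  return ((SAMPLE_RATE_HZ * durationMs) / 1000) * BYTES_PER_SAMPLE;
}

/** Convert Float32 samples in [-1, 1] to PCM16LE bytes. */
function floatToPcm16le(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * BYTES_PER_SAMPLE);
  const view = new DataView(out.buffer);
  samples.forEach((sample, i) => {
    // Clamp first so out-of-range input cannot overflow the 16-bit range.
    const s = Math.max(-1, Math.min(1, sample));
    // Scale asymmetrically: -1 maps to -32768 and +1 maps to 32767.
    const v = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
    view.setInt16(i * BYTES_PER_SAMPLE, v, true); // true = little-endian
  });
  return out;
}
```

A 20 ms frame is therefore 320 samples, which `floatToPcm16le` turns into a 640-byte buffer ready to send as a binary WebSocket message.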
Message Types
- Transcripts
- Agent Response
- Control Messages
- Final Answer
Speech Recognition Events
Partial Transcript
Sent continuously while user is speaking:
Display partial text with a visual cue (for example italic or gray) to show that it is not final.
Final Transcript
Sent when user stops speaking:
After a non-empty final transcript, the agent starts processing its response.
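The partial/final behavior described above can be modeled on the client as a small state update. This is only a sketch: the message shape (`type` and `text` fields) and the `TranscriptView` structure are assumptions made for illustration, since the actual message schema is not shown in this section.

```typescript
// Hypothetical message shape; the real field names are not shown in this section.
type TranscriptMessage = { type: "partial" | "final"; text: string };

interface TranscriptView {
  committed: string[];      // final transcripts, rendered normally
  pending: string;          // latest partial, rendered italic or gray
  agentProcessing: boolean; // set once a non-empty final transcript arrives
}

function applyTranscript(view: TranscriptView, msg: TranscriptMessage): TranscriptView {
  if (msg.type === "partial") {
    // Each partial replaces the previous one rather than appending to it.
    return { ...view, pending: msg.text };
  }
  const text = msg.text.trim();
  if (text === "") {
    // Empty final: clear the partial display; the agent does not start.
    return { ...view, pending: "" };
  }
  return { committed: [...view.committed, text], pending: "", agentProcessing: true };
}
```

Keeping the update pure makes it straightforward to plug into whichever UI layer renders the transcript.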
Barge-in Control
Barge-in allows users to interrupt the AI agent mid-speech, creating natural conversation flow.
How It Works
1. User Starts Speaking: While the agent is talking, the user begins new input.
2. Server Detects Speech: The server analyzes whether the speech is meaningful (not filler words).
3. Server Sends Barge Signal
4. Client Stops Old Audio: Stop all audio players with `seq < 12`.
5. Play New Response: Only play audio matching the new `seq: 12`.
Gate Conditions
Server triggers barge-in when ALL conditions are met:
| Setting | Default | Description |
|---|---|---|
| MIN_PARTIAL_CHARS | 10 | Minimum characters in speech |
| MIN_PARTIAL_WORDS | 2 | Minimum number of words |
| MIN_CONFIDENCE | 0.30 | STT confidence threshold (0.0 to 1.0) |
| SPEECH_START_WINDOW | 1.5 s | Time window after VAD detection |
| BARGE_COOLDOWN | 0.8 s | Minimum time between barges |
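To make the "ALL conditions" rule concrete, here is a purely illustrative sketch of how such a gate could be evaluated. The server's actual implementation is not shown in this document. The `PartialObservation` shape, the inclusive comparisons, and the reading of `SPEECH_START_WINDOW` as "the partial arrived within this many seconds of VAD detection" are all assumptions.

```typescript
// Defaults copied from the table above.
const GATE = {
  MIN_PARTIAL_CHARS: 10,
  MIN_PARTIAL_WORDS: 2,
  MIN_CONFIDENCE: 0.3,
  SPEECH_START_WINDOW_S: 1.5,
  BARGE_COOLDOWN_S: 0.8,
};

// Hypothetical input shape; the server's real data model is not documented here.
interface PartialObservation {
  text: string;
  confidence: number;        // STT confidence, 0.0 to 1.0
  sinceSpeechStartS: number; // seconds since VAD detected speech
  sinceLastBargeS: number;   // seconds since the previous barge
}

function shouldBarge(p: PartialObservation, gate = GATE): boolean {
  const text = p.text.trim();
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  // Every condition must hold; a single failure blocks the barge.
  return (
    text.length >= gate.MIN_PARTIAL_CHARS &&
    words.length >= gate.MIN_PARTIAL_WORDS &&
    p.confidence >= gate.MIN_CONFIDENCE &&
    p.sinceSpeechStartS <= gate.SPEECH_START_WINDOW_S &&
    p.sinceLastBargeS >= gate.BARGE_COOLDOWN_S
  );
}
```

Under these assumptions a short filler such as "um" fails the character and word minimums, so it would not interrupt the agent.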
Sequence Number Logic
Check Sequence on Every Message:
Stop Old Audio Function:
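As a rough sketch of the sequence handling described in the barge-in steps, the class below tracks the current sequence number, rejects audio tagged with any other sequence, and stops older players when a new sequence is announced. The names `SeqPlayer` and `SequenceGate` are hypothetical, and the sketch assumes sequence numbers only increase.

```typescript
// Hypothetical player handle; in a browser this might wrap an <audio> element.
interface SeqPlayer {
  seq: number;
  stop(): void;
}

class SequenceGate {
  private currentSeq = 0;
  private players: SeqPlayer[] = [];

  /** Call when a barge signal announces a new sequence number. */
  onBarge(newSeq: number): void {
    if (newSeq <= this.currentSeq) return; // stale or duplicate signal
    this.currentSeq = newSeq;
    this.stopOldAudio();
  }

  /** True if audio tagged with `seq` belongs to the current response. */
  accepts(seq: number): boolean {
    return seq === this.currentSeq;
  }

  /** Track a player so it can be stopped on a later barge. */
  register(player: SeqPlayer): void {
    this.players.push(player);
  }

  /** Stop and forget every player older than the current sequence. */
  private stopOldAudio(): void {
    for (const p of this.players) {
      if (p.seq < this.currentSeq) p.stop();
    }
    this.players = this.players.filter((p) => p.seq >= this.currentSeq);
  }
}
```

Calling `accepts(seq)` on every incoming audio message, before creating a player, is what prevents a late chunk from an interrupted response from being heard.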
Complete Example
Error Handling
Protocol Errors
Make sure the `start` message is sent immediately after the connection opens.
Authentication Errors
Network Errors
Server attempts transparent reconnect for audio sends. Client should implement reconnection logic with exponential backoff.
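One common way to compute reconnection delays with exponential backoff is sketched below. The base delay, the cap, and the use of "full jitter" are illustrative choices, not values specified by this API.

```typescript
/** Delay before reconnect attempt `attempt` (0-based), doubling each time up to a cap. */
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

/**
 * Same schedule with "full jitter": a random delay between 0 and the capped value,
 * so that many clients do not all reconnect at the same instant.
 * `random` is injectable to make the function deterministic in tests.
 */
function backoffWithJitterMs(
  attempt: number,
  random: () => number = Math.random,
  baseMs = 500,
  maxMs = 30000,
): number {
  return Math.floor(random() * backoffDelayMs(attempt, baseMs, maxMs));
}
```

A client could call `backoffWithJitterMs(attempt)` inside the WebSocket `close` handler, schedule the reconnect with `setTimeout`, and reset `attempt` to 0 once a connection succeeds.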
Troubleshooting
No audio playback
Possible causes:
- Not using MSE for MP3 streaming
- Missing user-gesture for autoplay
- Chunks not combined correctly
Solutions:
- Use an `<audio>` element or the Web Audio API
- Require a user click before playing
- Verify that all chunks are collected before playing
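If MP3 chunks are collected before playback, they need to be joined in arrival order into one contiguous buffer. The helper below is a minimal sketch of that step; it assumes each chunk arrives as raw bytes (`Uint8Array`), which this section does not confirm.

```typescript
/** Concatenate binary chunks, in order, into a single buffer. */
function concatChunks(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((n, c) => n + c.byteLength, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset); // copy this chunk at the running offset
    offset += c.byteLength;
  }
  return out;
}
```

In a browser, the result could then be wrapped as `new Blob([bytes], { type: "audio/mpeg" })` and handed to an `<audio>` element through `URL.createObjectURL`, started from a user click handler to satisfy autoplay rules.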
Overlapping audio
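Cause: Not respecting sequence numbers
Solution: Stop audio players from older sequence numbers and play only audio that matches the current `seq`.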
High latency
Solutions:
- Reduce frame size to 20-40 ms
- Disable heavy DSP in `getUserMedia`
- Check network latency
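As an example of the DSP suggestion, a constraints object along these lines can be passed to `navigator.mediaDevices.getUserMedia`. Which processing stages count as "heavy" is an assumption here, and echo cancellation in particular may be worth keeping on when the agent plays through speakers, since the microphone could otherwise pick up the agent's own voice.

```typescript
// Audio constraints requesting mono 16 kHz capture with browser DSP turned off.
// Browsers treat these as preferences and may not honor every one of them.
const micConstraints = {
  audio: {
    channelCount: 1,
    sampleRate: 16000,
    echoCancellation: false,
    noiseSuppression: false,
    autoGainControl: false,
  },
  video: false,
};

// Browser usage (not executed here):
//   const stream = await navigator.mediaDevices.getUserMedia(micConstraints);
//   const settings = stream.getAudioTracks()[0].getSettings();
```

Checking `getSettings()` afterwards shows what the browser actually applied; if the reported sample rate is not 16,000 Hz, the audio needs to be resampled before it is sent.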
Microphone not working
Checklist:
- ✅ Grant microphone permission
- ✅ Use HTTPS (required for getUserMedia)
- ✅ Check browser compatibility
- ✅ Verify audio constraints (16kHz, mono)
Try It Out
See complete working implementation with source code on GitHub