Real-time bi-directional WebSocket for streaming microphone audio and receiving AI voice responses with barge-in support.

Voice WebSocket Endpoint

wss://api.soca.ai/v1/voice-ws
Input Audio: PCM16LE, Mono, 16 kHz
Output Audio: Chunked MP3, 22.05 kHz

Quick Start

  • 1. Connect
  • 2. Authenticate
  • 3. Ready Signal
  • 4. Stream Audio

Open WebSocket Connection

const ws = new WebSocket("ws://your-host/chat/agent/voice-ws");
ws.binaryType = "arraybuffer";
Set binaryType to "arraybuffer" for efficient binary audio streaming
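Before streaming audio (step 4), microphone samples from the Web Audio API arrive as Float32 values in -1..1 and must be converted to PCM16LE before sending. A minimal conversion sketch (the helper name is illustrative, not part of the API):

```javascript
// Convert Web Audio Float32 samples (-1..1) into a PCM16LE ArrayBuffer
// suitable for ws.send(). Illustrative helper, not part of the API.
function floatToPCM16LE(samples) {
  const buf = new ArrayBuffer(samples.length * 2); // 2 bytes per sample
  const view = new DataView(buf);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  }
  return buf;
}
```

Feed it the buffers produced by an AudioWorklet or ScriptProcessor callback, then pass the result to `ws.send()`.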

Audio Format Specifications

  • Input
  • Output

Required Format

| Property | Value | Description |
| --- | --- | --- |
| Format | PCM16LE | 16-bit signed integer, little-endian |
| Channels | Mono (1) | Single channel only |
| Sample Rate | 16,000 Hz | Required sampling rate |
| Bit Depth | 16-bit | 2 bytes per sample |
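Browsers typically capture microphone audio at 44,100 or 48,000 Hz, so samples must be resampled to 16,000 Hz before conversion. A naive linear-interpolation sketch (a production client should low-pass filter first to avoid aliasing):

```javascript
// Resample Float32 audio down to 16 kHz by linear interpolation.
// Illustrative only; real clients should filter before decimating.
function resampleTo16k(samples, inputRate) {
  const ratio = inputRate / 16000;
  const outLength = Math.floor(samples.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;              // fractional position in the input
    const idx = Math.floor(pos);
    const frac = pos - idx;
    const next = Math.min(idx + 1, samples.length - 1);
    out[i] = samples[idx] * (1 - frac) + samples[next] * frac;
  }
  return out;
}
```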

Frame Sizes

| Duration | Bytes | Recommendation |
| --- | --- | --- |
| 20 ms | 640 bytes | ✅ Recommended - best for low latency |
| 40 ms | 1,280 bytes | ✅ Good - balanced performance |
| 50 ms | 1,600 bytes | ⚠️ Max - higher latency |

Formula: bytes = sampleRate × duration × 2
Example: 16,000 × 0.020 × 2 = 640 bytes
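The formula translates directly into code (multiplying integers before dividing avoids floating-point rounding):

```javascript
// bytes = sampleRate × duration × 2 (PCM16 = 2 bytes per sample)
function frameBytes(sampleRate, durationMs) {
  return (sampleRate * durationMs * 2) / 1000;
}
```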

Message Types

  • Transcripts
  • Agent Response
  • Control Messages
  • Final Answer

Speech Recognition Events

Partial Transcript

Sent continuously while user is speaking:
{
  "type": "stt.partial",
  "text": "Halo apa kabar"
}
Display the partial text with a visual indication (italic, gray) to show it is not final.

Final Transcript

Sent when the user stops speaking:
{
  "type": "stt.final",
  "text": "Halo apa kabar hari ini?"
}
After a non-empty final transcript, the agent starts processing its response
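A small, hypothetical reducer shows one way a client UI can track these two events: each partial overwrites the pending line, and a final freezes the utterance.

```javascript
// Hypothetical UI state helper: partial text is provisional and is
// replaced on every stt.partial; stt.final appends to the history.
function applyTranscript(state, msg) {
  if (msg.type === "stt.partial") {
    // Render state.pending in gray/italic, since it may still change
    return { pending: msg.text, finals: state.finals };
  }
  if (msg.type === "stt.final") {
    return { pending: "", finals: state.finals.concat(msg.text) };
  }
  return state; // other message types leave the transcript untouched
}
```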

Barge-in Control

Barge-in allows users to interrupt the AI agent mid-speech, creating natural conversation flow.

How It Works

1. User Starts Speaking: while the agent is talking, the user begins new input.
2. Server Detects Speech: the server checks whether the speech is meaningful (not just filler words).
3. Server Sends Barge Signal:
{
  "type": "barge",
  "seq": 12,
  "reason": "content_partial"
}
4. Client Stops Old Audio: stop all audio players with seq < 12.
5. Play New Response: only play audio matching the new seq: 12.
The server triggers barge-in when ALL of the following conditions are met:

| Setting | Default | Description |
| --- | --- | --- |
| MIN_PARTIAL_CHARS | 10 | Minimum characters in speech |
| MIN_PARTIAL_WORDS | 2 | Minimum number of words |
| MIN_CONFIDENCE | 0.30 | STT confidence threshold (0.0 to 1.0) |
| SPEECH_START_WINDOW | 1.5 s | Time window after VAD detection |
| BARGE_COOLDOWN | 0.8 s | Minimum time between barges |
Filler words do NOT trigger barge-in: uh, um, hmm, eh, ah, ya, yah
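The gating can be pictured as a single predicate combining the thresholds above with the filler-word filter. This logic runs server-side; the sketch below is an illustration with hypothetical names, not the actual implementation:

```javascript
// Illustration of the barge-in conditions above (hypothetical names;
// the real gating runs on the server).
const FILLERS = new Set(["uh", "um", "hmm", "eh", "ah", "ya", "yah"]);

function shouldBarge(partialText, confidence, secsSinceSpeechStart, secsSinceLastBarge) {
  const words = partialText.trim().split(/\s+/).filter(w => w.length > 0);
  const meaningful = words.filter(w => !FILLERS.has(w.toLowerCase()));
  return (
    partialText.trim().length >= 10 && // MIN_PARTIAL_CHARS
    words.length >= 2 &&               // MIN_PARTIAL_WORDS
    confidence >= 0.30 &&              // MIN_CONFIDENCE
    secsSinceSpeechStart <= 1.5 &&     // SPEECH_START_WINDOW
    secsSinceLastBarge >= 0.8 &&       // BARGE_COOLDOWN
    meaningful.length > 0              // filler words alone never barge
  );
}
```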
Check Sequence on Every Message:
// When receiving any message, gate on its sequence number
if (typeof msg.seq === "number" && msg.seq > currentSeq) {
  // Stop everything older than the new sequence
  stopAllAudio(msg.seq);
  currentSeq = msg.seq;
}
Stop Old Audio Function:
function stopAllAudio(beforeSeq) {
  audioPlayers.forEach((player, key) => {
    // Keys are assumed to look like "<seq>_<chunkIndex>"
    const [seq] = key.split('_');
    if (parseInt(seq, 10) < beforeSeq) {
      player.audio.pause();
      player.audio.currentTime = 0;
      audioPlayers.delete(key);
    }
  });
}
Always maintain currentSeq as a global variable to track the latest sequence number.

Complete Example

const ws = new WebSocket("ws://localhost:8100/chat/agent/voice-ws");
ws.binaryType = "arraybuffer";

ws.onopen = () => {
  // Send start message
  ws.send(JSON.stringify({
    type: "start",
    token: "<YOUR_TOKEN>",
    session_id: `sess_${Date.now()}`,
    agent_id: null,
    lang: "id",
    sr: 16000
  }));
};

ws.onmessage = (e) => {
  // Binary frames are audio, not JSON; handle only text messages here
  if (typeof e.data !== "string") return;
  const msg = JSON.parse(e.data);
  
  // Handle different message types
  switch(msg.type) {
    case "stt.partial":
      console.log("Partial:", msg.text);
      break;
      
    case "stt.final":
      console.log("Final:", msg.text);
      break;
      
    case "barge":
      handleBarge(msg.seq);
      break;
  }
  
  // Handle audio chunks
  if (msg.stepType === "sentence") {
    handleAudioChunk(msg);
  }
};

Error Handling

{
  "type": "error",
  "message": "must start with type=start"
}
Solution: Ensure the start message is sent immediately after the connection opens
{
  "type": "error",
  "message": "Invalid or expired token"
}
Solution: Get fresh token from Soca AI dashboard
The server attempts a transparent reconnect for audio sends, but the client should still implement its own reconnection logic with exponential backoff.
ws.onclose = (event) => {
  if (event.code !== 1000) {
    setTimeout(() => reconnect(), getBackoffDelay());
  }
};
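The getBackoffDelay used above is not defined by the API; one common sketch is exponential growth capped at 30 seconds, with a little jitter:

```javascript
// Hypothetical backoff helper for the reconnect snippet: exponential
// delay (1 s, 2 s, 4 s, ...) capped at 30 s, plus jitter so multiple
// clients do not retry in lockstep.
let reconnectAttempts = 0;

function getBackoffDelay() {
  const base = Math.min(30000, 1000 * 2 ** reconnectAttempts);
  reconnectAttempts += 1;
  return base + Math.random() * 250; // milliseconds
}
```

Reset reconnectAttempts to 0 once a connection succeeds, so the next outage starts from the short delay again.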

Troubleshooting

Audio Not Playing

Possible causes:
  • Not using MSE for MP3 streaming
  • Missing user-gesture for autoplay
  • Chunks not combined correctly
Solutions:
  • Use <audio> element or Web Audio API
  • Require user click before playing
  • Verify all chunks collected before playing
Overlapping Audio

Cause: Not respecting sequence numbers.
Solution:
if (msg.seq > currentSeq) {
  stopAllAudio(msg.seq);
  currentSeq = msg.seq;
}
High Latency

Solutions:
  • Reduce frame size to 20-40 ms
  • Disable heavy DSP in getUserMedia
  • Check network latency
Microphone Not Working

Checklist:
  • ✅ Grant microphone permission
  • ✅ Use HTTPS (required for getUserMedia)
  • ✅ Check browser compatibility
  • ✅ Verify audio constraints (16kHz, mono)

Try It Out

See the complete working implementation with source code on GitHub