Voice Stream WebSocket

Real-time bi-directional WebSocket for streaming microphone audio and receiving AI voice responses with barge-in support.

Voice WebSocket Endpoint

wss://api.soca.ai/voice-ws

Input Audio: PCM16LE, Mono, 16 kHz
Output Audio: Chunked MP3, 22.05 kHz

Quick Start

1. Connect
2. Authenticate
3. Ready Signal
4. Stream Audio

Open WebSocket Connection

const ws = new WebSocket("wss://api.soca.ai/voice-ws");
ws.binaryType = "arraybuffer";

Set binaryType to "arraybuffer" for efficient binary audio streaming

Send Start Message

{
  "type": "start",
  "private_key": "<private_key>",
  "session_id": "<session_id>",
  "agent_id": "<agent_id>",
  "lang": "id",
  "sr": 16000
}

type

string

required

Must be exactly "start"

private_key

string

required

Your API key from Soca AI platform

session_id

string

required

Session ID for this message.

agent_id

string

required

The unique identifier of a Soca AI agent created in the Studio.

lang

string

default:"id"

Language: "id" (Indonesian) or "en" (English)

sr

number

default:"16000"

Sample rate (must be 16000)
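
The field rules above can be checked before anything is sent. The sketch below is one possible way to do that; `buildStartMessage` and its camelCase option names are hypothetical and not part of the Soca AI API, and the checks only mirror the constraints listed in this section.

```javascript
// Hypothetical helper: validates the documented fields and returns the JSON
// text to send as the first message on the socket.
function buildStartMessage({ privateKey, sessionId, agentId, lang = "id" }) {
  const required = { privateKey, sessionId, agentId };
  for (const [name, value] of Object.entries(required)) {
    if (typeof value !== "string" || value.length === 0) {
      throw new TypeError(`${name} is required`);
    }
  }
  if (lang !== "id" && lang !== "en") {
    throw new RangeError('lang must be "id" or "en"');
  }
  return JSON.stringify({
    type: "start",
    private_key: privateKey,
    session_id: sessionId,
    agent_id: agentId,
    lang,
    sr: 16000, // the only sample rate the section above allows
  });
}
```

Usage would be `ws.send(buildStartMessage({ privateKey, sessionId, agentId }))` inside `ws.onopen`.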

Server Responses

{
  "type": "ready",
  "session_id": "<session_id>"
}

When you receive the ready message, you can start streaming audio!
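
A client may want to hold back audio until the ready signal has arrived. The following is a minimal sketch of such a gate, assuming the `"type": "ready"` value shown in the example response; `createAudioGate` is a hypothetical helper, and `socket` is any object with a WebSocket-style `send` method.

```javascript
// Hypothetical helper: forwards audio frames only after the server says "ready".
function createAudioGate(socket) {
  let ready = false;
  return {
    // Call this with every parsed JSON message received from the server.
    onServerMessage(msg) {
      if (msg.type === "ready") ready = true;
    },
    // Returns true if the frame was sent, false if it was dropped.
    sendFrame(int16Frame) {
      if (!ready) return false;
      socket.send(int16Frame.buffer);
      return true;
    },
  };
}
```

Dropping early frames, rather than queueing them, keeps the sketch short; an application that must not lose the first syllables could buffer them instead.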

Send PCM16 Frames

// Continuously send binary audio frames
ws.send(int16Array.buffer);

Send binary frames only (not base64). Must be Int16Array buffer.
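
Browser audio APIs usually produce Float32 samples, so they need converting before they can be sent as PCM16. Here is a rough sketch of that conversion plus splitting into 20 ms frames. Both function names are hypothetical, and the code assumes the input has already been resampled to 16 kHz mono.

```javascript
// Convert Web Audio float samples (-1.0..1.0) to 16-bit signed PCM.
function floatTo16BitPCM(float32) {
  const out = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp to avoid overflow
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

// Split PCM samples into fixed-size frames. The trailing partial frame is
// returned separately so it can be prepended to the next capture buffer.
// 320 samples = 20 ms at 16 kHz.
function splitIntoFrames(int16, samplesPerFrame = 320) {
  const frames = [];
  let offset = 0;
  for (; offset + samplesPerFrame <= int16.length; offset += samplesPerFrame) {
    // slice() copies, so each frame owns a buffer of exactly one frame
    frames.push(int16.slice(offset, offset + samplesPerFrame));
  }
  return { frames, remainder: int16.slice(offset) };
}
```

Each frame can then be sent with `ws.send(frame.buffer)`. Typed arrays use the platform byte order, which is little-endian on virtually all current hardware, so the buffer is already PCM16LE there.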

Audio Format Specifications

Input
Output

Required Format

Property	Value	Description
Format	PCM16LE	16-bit signed integer, little-endian
Channels	Mono (1)	Single channel only
Sample Rate	16,000 Hz	Required sampling rate
Bit Depth	16-bit	2 bytes per sample

Frame Sizes

Duration	Bytes	Recommendation
20 ms	640 bytes	⭐ Recommended - Best for low latency
40 ms	1,280 bytes	✅ Good - Balanced performance
50 ms	1,600 bytes	⚠️ Max - Higher latency

Formula: bytes = sampleRate × duration × 2

Example: 16,000 × 0.020 × 2 = 640 bytes
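
The frame-size formula is small enough to express as a helper. `frameBytes` is a hypothetical name used only for illustration, and it takes the duration in milliseconds rather than seconds.

```javascript
// bytes = sampleRate × duration × bytesPerSample, with duration in milliseconds.
function frameBytes(durationMs, sampleRate = 16000, bytesPerSample = 2) {
  const samples = Math.round((sampleRate * durationMs) / 1000);
  return samples * bytesPerSample;
}

console.log(frameBytes(20)); // 640
console.log(frameBytes(40)); // 1280
console.log(frameBytes(50)); // 1600
```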

Response Format

Property	Value
Format	MP3
Sample Rate	22,050 Hz
Bitrate	~32 kbps
Channels	Mono (1)
Delivery	Base64-encoded chunks

Agent speech is delivered as multiple chunks per sentence. Collect all chunks and play when complete.

Message Types

Transcripts
Agent Response
Control Messages
Final Answer

Speech Recognition Events

Partial Transcript

Sent continuously while user is speaking:

{
  "type": "stt.partial",
  "text": "Halo apa kabar"
}

Display with visual indication (italic, gray) to show it’s not final

Final Transcript

Sent when user stops speaking:

{
  "type": "stt.final",
  "text": "Halo apa kabar hari ini?"
}

After non-empty final transcript, agent starts processing response

Audio Response - Chunked MP3

Sentence Chunk

Each sentence arrives as multiple chunks:

{
  "stepType": "sentence",
  "chatId": "9e4d8f7a-1234-5678",
  "sentenceId": 123456,
  "contentStep": "Halo! Senang bertemu dengan Anda.",
  "audioBase64": "<base64 MP3 chunk>",
  "audioMime": "audio/mp3",
  "chunkIndex": 0,
  "isLastChunk": false,
  "seq": 12
}

Collect Chunks

Keep collecting chunks where isLastChunk: false

Last Chunk

Final chunk may have audioBase64: null:

{
  "stepType": "sentence",
  "sentenceId": 123456,
  "audioBase64": null,
  "chunkIndex": 7,
  "isLastChunk": true,
  "seq": 12
}

Play Audio

Combine all chunks and play complete sentence

Maintain separate player per sentenceId AND per seq for proper synchronization
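
The collect-then-play flow above could be sketched as follows. `pending`, `base64ToBytes`, and `collectChunk` are hypothetical names. Chunks are keyed by `seq` and `sentenceId`, and simply concatenating the decoded bytes in `chunkIndex` order is an assumption about how the chunks fit together.

```javascript
// Chunks received so far, keyed by "<seq>_<sentenceId>".
const pending = new Map();

// Decode a base64 string into raw bytes.
function base64ToBytes(b64) {
  const bin = atob(b64);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  return bytes;
}

// Returns the combined MP3 bytes when the last chunk arrives, otherwise null.
function collectChunk(msg) {
  const key = `${msg.seq}_${msg.sentenceId}`;
  const chunks = pending.get(key) ?? [];
  if (msg.audioBase64) chunks[msg.chunkIndex] = base64ToBytes(msg.audioBase64);
  pending.set(key, chunks);
  if (!msg.isLastChunk) return null;

  pending.delete(key);
  const parts = chunks.filter(Boolean); // skips gaps and the null last chunk
  const total = parts.reduce((n, p) => n + p.length, 0);
  const mp3 = new Uint8Array(total);
  let offset = 0;
  for (const p of parts) {
    mp3.set(p, offset);
    offset += p.length;
  }
  return mp3;
}
```

In a browser, the returned bytes can be wrapped in `new Blob([mp3], { type: "audio/mp3" })` and played through an `<audio>` element via `URL.createObjectURL`.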

Optional Control Commands

{
  "type": "stop"
}

Send these commands as JSON text frames (not binary)

Complete Response Summary

{
  "stepType": "final_answer",
  "stepTitle": "Final Answer",
  "fullResponse": {
    "answers": [{
      "output": "Complete response text"
    }]
  },
  "audioBase64": null,
  "audioMime": "audio/mp3",
  "isFinal": true,
  "seq": 12,
  "stepDuration": 1543.2
}

stepDuration

number

Processing duration in milliseconds
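
To pull the response text out of this message, a client might use something like the sketch below. `finalAnswerText` is a hypothetical helper, and joining multiple answers with a newline is an assumption, since the example shows only one entry in `answers`.

```javascript
// Returns the response text of a final_answer message, or null for other messages.
function finalAnswerText(msg) {
  if (msg.stepType !== "final_answer") return null;
  const answers = msg.fullResponse?.answers ?? [];
  return answers.map((a) => a.output).join("\n");
}
```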

Barge-in Control

Barge-in allows users to interrupt the AI agent mid-speech, creating natural conversation flow.

How It Works

User Starts Speaking

While agent is talking, user begins new input

Server Detects Speech

Server analyzes if speech is meaningful (not filler words)

Server Sends Barge Signal

{
  "type": "barge",
  "seq": 12,
  "reason": "content_partial"
}

Client Stops Old Audio

Stop all audio players with seq < 12

Play New Response

Only play audio matching new seq: 12

Gate Conditions

Server triggers barge-in when ALL conditions are met:

Setting	Default	Description
`MIN_PARTIAL_CHARS`	10	Minimum characters in speech
`MIN_PARTIAL_WORDS`	2	Minimum number of words
`MIN_CONFIDENCE`	0.30	STT confidence threshold (0.0 to 1.0)
`SPEECH_START_WINDOW`	1.5s	Time window after VAD detection
`BARGE_COOLDOWN`	0.8s	Minimum time between barges

Filler words do NOT trigger barge-in: uh, um, hmm, eh, ah, ya, yah
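
The gate runs on the server, but a rough model of it can help when reasoning about why a barge did or did not fire. The sketch below is an illustration only, not the server's actual implementation. `passesBargeGate` and its input shape are hypothetical, `SPEECH_START_WINDOW` is left out because its exact semantics are not described here, and treating a partial made up entirely of filler words as non-triggering is an assumption.

```javascript
// Defaults copied from the settings table; the cooldown is expressed in ms.
const GATE = {
  MIN_PARTIAL_CHARS: 10,
  MIN_PARTIAL_WORDS: 2,
  MIN_CONFIDENCE: 0.3,
  BARGE_COOLDOWN_MS: 800,
};
const FILLERS = new Set(["uh", "um", "hmm", "eh", "ah", "ya", "yah"]);

// Returns true only when every modeled condition holds.
function passesBargeGate({ text, confidence, msSinceLastBarge }) {
  const trimmed = text.trim();
  const words = trimmed.toLowerCase().split(/\s+/).filter(Boolean);
  const onlyFillers = words.every((w) => FILLERS.has(w));
  return (
    trimmed.length >= GATE.MIN_PARTIAL_CHARS &&
    words.length >= GATE.MIN_PARTIAL_WORDS &&
    confidence >= GATE.MIN_CONFIDENCE &&
    msSinceLastBarge >= GATE.BARGE_COOLDOWN_MS &&
    !onlyFillers
  );
}
```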

Sequence Number Logic

Check Sequence on Every Message:

// When receiving any message
if (msg.seq && msg.seq > currentSeq) {
  // Stop all old audio
  stopAllAudio(currentSeq);
  currentSeq = msg.seq;
}

Stop Old Audio Function:

function stopAllAudio(beforeSeq) {
  audioPlayers.forEach((player, key) => {
    const [seq] = key.split('_');
    if (parseInt(seq, 10) < beforeSeq) {
      player.audio.pause();
      player.audio.currentTime = 0;
      audioPlayers.delete(key);
    }
  });
}

Always maintain currentSeq as a global variable to track the latest sequence number.

Complete Example

const ws = new WebSocket("wss://api.soca.ai/voice-ws");
ws.binaryType = "arraybuffer";

ws.onopen = () => {
  // Send start message
  ws.send(JSON.stringify({
    type: "start",
    private_key: "<private_key>",
    session_id: "<session_id>",
    agent_id: "<agent_id>",
    lang: "id",
    sr: 16000
  }));
};

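ws.onmessage = (e) => {
  // The messages documented here are JSON text frames; skip anything else
  if (typeof e.data !== "string") return;
  const msg = JSON.parse(e.data);

  // Handle different message types
  switch (msg.type) {
    case "ready":
      // Safe to start streaming PCM16 frames
      break;

    case "stt.partial":
      console.log("Partial:", msg.text);
      break;

    case "stt.final":
      console.log("Final:", msg.text);
      break;

    case "barge":
      handleBarge(msg.seq); // app-defined: stop audio older than msg.seq
      break;

    case "error":
      console.error("Server error:", msg.message);
      break;
  }

  // Handle audio chunks
  if (msg.stepType === "sentence") {
    handleAudioChunk(msg); // app-defined: collect chunks and play them
  }
};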
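
The example calls `handleBarge`, which this page does not define. One way it might look, following the sequence number logic described earlier, is sketched below. The `audioPlayers` map, its `"<seq>_<sentenceId>"` key format, and the `isStale` helper are assumptions made for illustration.

```javascript
let currentSeq = 0;
// key: "<seq>_<sentenceId>", value: { audio } where audio is an HTMLAudioElement
const audioPlayers = new Map();

// Stop and discard every player that belongs to an older sequence.
function handleBarge(seq) {
  for (const [key, player] of audioPlayers) {
    const playerSeq = Number.parseInt(key.split("_")[0], 10);
    if (playerSeq < seq) {
      player.audio.pause();
      player.audio.currentTime = 0;
      audioPlayers.delete(key); // deleting during Map iteration is safe
    }
  }
  currentSeq = Math.max(currentSeq, seq);
}

// True for messages from a sequence that has already been interrupted.
function isStale(msg) {
  return typeof msg.seq === "number" && msg.seq < currentSeq;
}
```

A chunk handler can call `isStale(msg)` first and drop late chunks from an interrupted response instead of playing them.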

Error Handling

Protocol Errors

{
  "type": "error",
  "message": "must start with type=start"
}

Solution: Ensure start message is sent immediately after connection

Authentication Errors

{
  "type": "error",
  "message": "Invalid or expired token"
}

Solution: Get fresh token from Soca AI dashboard

Network Errors

Server attempts transparent reconnect for audio sends. Client should implement reconnection logic with exponential backoff.

ws.onclose = (event) => {
  if (event.code !== 1000) {
    setTimeout(() => reconnect(), getBackoffDelay());
  }
};
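
`getBackoffDelay` and `reconnect` in the handler above are left to the client. A minimal sketch of the delay calculation follows, assuming a 500 ms base, a 30 s cap, and "equal jitter". None of these values come from the Soca AI documentation, and unlike the call above this version takes the attempt number as an argument.

```javascript
// Delay doubles with each attempt, is capped at maxMs, and is randomized
// between half and all of the capped value to avoid synchronized retries.
function getBackoffDelay(attempt = 0, baseMs = 500, maxMs = 30000, random = Math.random) {
  const capped = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(capped / 2 + random() * (capped / 2));
}
```

The caller would keep an attempt counter, pass it as `getBackoffDelay(attempt++)` when scheduling a reconnect, and reset it to 0 once a connection succeeds.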

Troubleshooting

No audio playback

Possible causes:

Not using MSE for MP3 streaming
Missing user-gesture for autoplay
Chunks not combined correctly

Solutions:

Use <audio> element or Web Audio API
Require user click before playing
Verify all chunks collected before playing

Overlapping audio

Cause: Not respecting sequence numbers

Solution:

if (msg.seq > currentSeq) {
  stopAllAudio(currentSeq);
  currentSeq = msg.seq;
}

High latency

Solutions:

Reduce frame size to 20-40 ms
Disable heavy DSP in getUserMedia
Check network latency

Microphone not working

Checklist:

✅ Grant microphone permission
✅ Use HTTPS (required for getUserMedia)
✅ Check browser compatibility
✅ Verify audio constraints (16kHz, mono)

Try It Out

See complete working implementation with source code on GitHub
