# Voice Stream Websocket

> Real-time voice conversations with Speech-to-Text and Text-to-Speech streaming

Real-time bi-directional WebSocket for streaming microphone audio and receiving AI voice responses with barge-in support.

## Voice WebSocket Endpoint

```bash
wss://api.soca.ai/voice-ws
```

**Input Audio:** PCM16LE, Mono, 16 kHz\
**Output Audio:** Chunked MP3, 22.05 kHz

***

## Quick Start

### Open WebSocket Connection

```javascript
const ws = new WebSocket("wss://api.soca.ai/voice-ws");
ws.binaryType = "arraybuffer";
```

Set `binaryType` to `"arraybuffer"` for efficient binary audio streaming.

### Send Start Message

```json
{
  "type": "start",
  "private_key": "",
  "session_id": "",
  "agent_id": "",
  "lang": "id",
  "sr": 16000
}
```

* `type`: Must be exactly `"start"`
* `private_key`: Your API key from Soca AI platform
* `session_id`: Session ID for this message.
* `agent_id`: The unique identifier of a Soca AI agent created in the Studio.
* `lang`: Language: `"id"` (Indonesian) or `"en"` (English)
* `sr`: Sample rate (must be 16000)

### Server Responses

```json Session Ready
{ "type": "ready", "session_id": "" }
```

```json STT Ready
{ "type": "stt.ready" }
```

When you receive `stt.ready`, you can start streaming audio!

### Send PCM16 Frames

```javascript
// Continuously send binary audio frames
ws.send(int16Array.buffer);
```

Send **binary frames only** (not base64). Each frame must be the buffer of an `Int16Array`.
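
The frames sent in this step are easier to keep at a fixed duration if captured samples are re-sliced before sending. The sketch below is one possible way to do that, not part of the Soca AI API: `createFramer` and its constants are hypothetical names. It assumes 16 kHz mono PCM16, where 20 ms is 320 samples (640 bytes), and it carries leftover samples over to the next call.

```javascript
// Hypothetical helper (not part of the Soca AI API): re-slice PCM16 samples
// into fixed-duration frames before sending them over the WebSocket.
const SAMPLE_RATE = 16000; // Hz, the rate required by the endpoint
const FRAME_MS = 20;       // frame duration in milliseconds
const SAMPLES_PER_FRAME = (SAMPLE_RATE * FRAME_MS) / 1000; // 320 samples = 640 bytes

function createFramer(samplesPerFrame = SAMPLES_PER_FRAME) {
  let leftover = new Int16Array(0);

  // Accepts any number of samples and returns zero or more complete frames.
  return function push(samples) {
    // Prepend whatever did not fill a frame on the previous call.
    const merged = new Int16Array(leftover.length + samples.length);
    merged.set(leftover, 0);
    merged.set(samples, leftover.length);

    const frames = [];
    let offset = 0;
    while (offset + samplesPerFrame <= merged.length) {
      // slice() copies, so each frame owns a buffer of exactly one frame.
      frames.push(merged.slice(offset, offset + samplesPerFrame));
      offset += samplesPerFrame;
    }
    leftover = merged.slice(offset);
    return frames;
  };
}
```

A capture callback could then call `for (const frame of push(int16)) ws.send(frame.buffer);`, so every binary message carries exactly one 20 ms frame regardless of the capture buffer size.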

***

## Audio Format Specifications

### Required Format

| Property        | Value     | Description                          |
| --------------- | --------- | ------------------------------------ |
| **Format**      | PCM16LE   | 16-bit signed integer, little-endian |
| **Channels**    | Mono (1)  | Single channel only                  |
| **Sample Rate** | 16,000 Hz | Required sampling rate               |
| **Bit Depth**   | 16-bit    | 2 bytes per sample                   |

### Frame Sizes

| Duration  | Bytes       | Recommendation                           |
| --------- | ----------- | ---------------------------------------- |
| **20 ms** | 640 bytes   | ⭐ **Recommended** - Best for low latency |
| **40 ms** | 1,280 bytes | ✅ Good - Balanced performance            |
| **50 ms** | 1,600 bytes | ⚠️ Max - Higher latency                  |

**Formula:** `bytes = sampleRate × duration × 2` (duration in seconds, 2 bytes per sample)

Example: 16,000 × 0.020 × 2 = 640 bytes

### Response Format

| Property        | Value                 |
| --------------- | --------------------- |
| **Format**      | MP3                   |
| **Sample Rate** | 22,050 Hz             |
| **Bitrate**     | \~32 kbps             |
| **Channels**    | Mono (1)              |
| **Delivery**    | Base64-encoded chunks |

Agent speech is delivered as **multiple chunks per sentence**. Collect all chunks of a sentence and play it when complete.

***

## Message Types

### Speech Recognition Events

Sent continuously while the user is speaking:

```json
{ "type": "stt.partial", "text": "Halo apa kabar" }
```

Display partial transcripts with a visual indication (italic, gray) to show they are not final.

Sent when the user stops speaking:

```json
{ "type": "stt.final", "text": "Halo apa kabar hari ini?" }
```

After a non-empty final transcript, the agent starts processing its response.

### Audio Response - Chunked MP3

Each sentence arrives as multiple chunks:

```json
{
  "stepType": "sentence",
  "chatId": "9e4d8f7a-1234-5678",
  "sentenceId": 123456,
  "contentStep": "Halo! Senang bertemu dengan Anda.",
  "audioBase64": "",
  "audioMime": "audio/mp3",
  "chunkIndex": 0,
  "isLastChunk": false,
  "seq": 12
}
```

Keep collecting chunks where `isLastChunk: false`.

The final chunk may have `audioBase64: null`:

```json
{
  "stepType": "sentence",
  "sentenceId": 123456,
  "audioBase64": null,
  "chunkIndex": 7,
  "isLastChunk": true,
  "seq": 12
}
```

Combine all chunks and play the complete sentence.

Maintain a separate player per `sentenceId` AND per `seq` for proper synchronization.

### Optional Control Commands

```json Stop Utterance
{ "type": "stop" }
```

```json Manual Barge-in
{ "type": "barge" }
```

Send these commands as JSON text frames (not binary).

### Complete Response Summary

```json
{
  "stepType": "final_answer",
  "stepTitle": "Final Answer",
  "fullResponse": { "answers": [{ "output": "Complete response text" }] },
  "audioBase64": null,
  "audioMime": "audio/mp3",
  "isFinal": true,
  "seq": 12,
  "stepDuration": 1543.2
}
```

`stepDuration` is the processing duration in milliseconds.

***

## Barge-in Control

**Barge-in** allows users to interrupt the AI agent mid-speech, creating natural conversation flow.

While the agent is talking, the user begins new input. The server analyzes whether the speech is meaningful (not filler words) and, if it is, sends a barge message:

```json
{ "type": "barge", "seq": 12, "reason": "content_partial" }
```

On receiving this message:

* Stop all audio players with `seq < 12`
* Only play audio matching the new `seq: 12`

The server triggers barge-in when ALL conditions are met:

| Setting               | Default | Description                           |
| --------------------- | ------- | ------------------------------------- |
| `MIN_PARTIAL_CHARS`   | 10      | Minimum characters in speech          |
| `MIN_PARTIAL_WORDS`   | 2       | Minimum number of words               |
| `MIN_CONFIDENCE`      | 0.30    | STT confidence threshold (0.0 to 1.0) |
| `SPEECH_START_WINDOW` | 1.5s    | Time window after VAD detection       |
| `BARGE_COOLDOWN`      | 0.8s    | Minimum time between barges           |

**Filler words do NOT trigger barge-in:** `uh`, `um`, `hmm`, `eh`, `ah`, `ya`, `yah`

**Check Sequence on Every Message:**

```javascript
// When receiving any message
if (msg.seq && msg.seq > currentSeq) {
  // Stop all old audio
  stopAllAudio(currentSeq);
  currentSeq = msg.seq;
}
```

**Stop Old Audio Function:**

```javascript
// audioPlayers is a Map keyed by "<seq>_<sentenceId>"
function stopAllAudio(beforeSeq) {
  audioPlayers.forEach((player, key) => {
    const [seq] = key.split('_');
    if (parseInt(seq, 10) < beforeSeq) {
      player.audio.pause();
      player.audio.currentTime = 0;
      audioPlayers.delete(key);
    }
  });
}
```

Always maintain `currentSeq` as a global variable to track the latest sequence number.
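
To make the "ALL conditions" rule concrete, here is a rough sketch of how the settings in the table could combine into a single check. This is an illustration only: the server's actual logic is not documented here, `shouldBarge` and the fields `confidence`, `msSinceSpeechStart`, and `msSinceLastBarge` are hypothetical, and the reading of `SPEECH_START_WINDOW` as "the partial must arrive within 1.5 s of VAD detecting speech" is an assumption.

```javascript
// Illustrative only: defaults are copied from the settings table, but how the
// server actually evaluates them is an assumption made for this sketch.
const BARGE_DEFAULTS = {
  MIN_PARTIAL_CHARS: 10,
  MIN_PARTIAL_WORDS: 2,
  MIN_CONFIDENCE: 0.3,
  SPEECH_START_WINDOW_MS: 1500, // assumed meaning: max time since VAD speech start
  BARGE_COOLDOWN_MS: 800,
};

const FILLER_WORDS = new Set(["uh", "um", "hmm", "eh", "ah", "ya", "yah"]);

// partial: { text, confidence, msSinceSpeechStart, msSinceLastBarge }
// (hypothetical shape; only `text` appears in the documented stt.partial message)
function shouldBarge(partial, cfg = BARGE_DEFAULTS) {
  const text = partial.text.trim();
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);

  // Strip punctuation before comparing against the filler list.
  const meaningful = words.filter(
    (w) => !FILLER_WORDS.has(w.replace(/[^\p{L}]/gu, ""))
  );

  return (
    text.length >= cfg.MIN_PARTIAL_CHARS &&
    words.length >= cfg.MIN_PARTIAL_WORDS &&
    meaningful.length > 0 && // filler-only speech never triggers barge-in
    partial.confidence >= cfg.MIN_CONFIDENCE &&
    partial.msSinceSpeechStart <= cfg.SPEECH_START_WINDOW_MS &&
    partial.msSinceLastBarge >= cfg.BARGE_COOLDOWN_MS
  );
}
```

Because every term is joined with `&&`, a single failing condition (for example low confidence or a barge inside the cooldown) is enough to suppress the interruption.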

***

## Complete Example

```javascript Basic Setup
const ws = new WebSocket("wss://api.soca.ai/voice-ws");
ws.binaryType = "arraybuffer";

ws.onopen = () => {
  // Send start message
  ws.send(JSON.stringify({
    type: "start",
    private_key: "",
    session_id: "",
    agent_id: "",
    lang: "id",
    sr: 16000
  }));
};

ws.onmessage = (e) => {
  // Server messages are JSON text frames; skip anything else
  if (typeof e.data !== "string") return;
  const msg = JSON.parse(e.data);

  // Handle different message types
  switch (msg.type) {
    case "stt.partial":
      console.log("Partial:", msg.text);
      break;
    case "stt.final":
      console.log("Final:", msg.text);
      break;
    case "barge":
      handleBarge(msg.seq);
      break;
  }

  // Handle audio chunks
  if (msg.stepType === "sentence") {
    handleAudioChunk(msg);
  }
};
```

```javascript Send Audio
// Capture microphone (run inside an async function or an ES module)
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    sampleRate: 16000,
    channelCount: 1,
    echoCancellation: true,
    noiseSuppression: true
  }
});

const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
// 4096 samples is about 256 ms at 16 kHz, larger than the frame sizes
// listed under Frame Sizes; use a smaller buffer if latency matters
const processor = audioContext.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (e) => {
  const float32 = e.inputBuffer.getChannelData(0);
  const int16 = convertFloat32ToInt16(float32);
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(int16.buffer);
  }
};

source.connect(processor);
processor.connect(audioContext.destination);
```

```javascript Float32 to Int16
function convertFloat32ToInt16(float32Array) {
  const int16Array = new Int16Array(float32Array.length);
  for (let i = 0; i < float32Array.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    int16Array[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  return int16Array;
}
```

```javascript Play Audio
// Chunks collected per sentence, keyed by "<seq>_<sentenceId>"
const pendingChunks = new Map();

function handleAudioChunk(msg) {
  const key = `${msg.seq}_${msg.sentenceId}`;

  // Collect chunks
  if (!pendingChunks.has(key)) {
    pendingChunks.set(key, []);
  }
  if (msg.audioBase64) {
    pendingChunks.get(key).push(msg.audioBase64);
  }

  // Play when complete
  if (msg.isLastChunk) {
    const chunks = pendingChunks.get(key);
    const fullBase64 = chunks.join('');

    // Decode and play
    const audioData = atob(fullBase64);
    const buffer = new ArrayBuffer(audioData.length);
    const view = new Uint8Array(buffer);
    for (let i = 0; i < audioData.length; i++) {
      view[i] = audioData.charCodeAt(i);
    }

    const blob = new Blob([buffer], { type: 'audio/mp3' });
    const url = URL.createObjectURL(blob);
    const audio = new Audio(url);
    audio.play();
    audio.onended = () => URL.revokeObjectURL(url);

    pendingChunks.delete(key);
  }
}
```

***

## Error Handling

```json
{ "type": "error", "message": "must start with type=start" }
```

**Solution:** Ensure the `start` message is sent immediately after connection.

```json
{ "type": "error", "message": "Invalid or expired token" }
```

**Solution:** Get a fresh token from the Soca AI dashboard.

The server attempts a transparent reconnect for audio sends. The client should implement reconnection logic with exponential backoff.

```javascript
ws.onclose = (event) => {
  // 1000 is a normal closure; reconnect on anything else
  if (event.code !== 1000) {
    setTimeout(() => reconnect(), getBackoffDelay());
  }
};
```

***

## Troubleshooting

**Possible causes:**

* Not using MSE for MP3 streaming
* Missing user-gesture for autoplay
* Chunks not combined correctly

**Solutions:**

* Use `

**Cause:** Not respecting sequence numbers

**Solution:**

```javascript
if (msg.seq > currentSeq) {
  stopAllAudio(currentSeq);
  currentSeq = msg.seq;
}
```

**Solutions:**

* Reduce frame size to 20-40 ms
* Disable heavy DSP in `getUserMedia`
* Check network latency

**Checklist:**

* ✅ Grant microphone permission
* ✅ Use HTTPS (required for getUserMedia)
* ✅ Check browser compatibility
* ✅ Verify audio constraints (16kHz, mono)

***

See complete working implementation with source code on GitHub
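
The `ws.onclose` handler in the Error Handling section calls `getBackoffDelay()` without defining it. A minimal sketch of such a helper follows, assuming exponential growth with full jitter. The base delay, cap, and the `resetBackoff` helper are illustrative choices, not values documented by Soca AI.

```javascript
// Hypothetical backoff helper for the reconnect logic; the 500 ms base and
// 30 s cap are illustrative assumptions, not documented values.
let attempt = 0;

function getBackoffDelay(baseMs = 500, maxMs = 30000, random = Math.random) {
  // Double the ceiling on every attempt, but never exceed the cap.
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  attempt += 1;
  // Full jitter: pick uniformly below the ceiling so that many clients
  // do not all reconnect at the same instant.
  return Math.floor(random() * ceiling);
}

// Call this once a connection is healthy again (for example on "stt.ready").
function resetBackoff() {
  attempt = 0;
}
```

Resetting the counter after a successful session start keeps a later, unrelated disconnect from inheriting a long delay.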