WebSocket protocol for AI voice bot integration with Voice Gateway. The vendor provides a WebSocket server according to the spec below; the Gateway will connect when the call starts and exchange audio/control events throughout the call. Reference model: Twilio Media Streams.
Version 1.0 · Audience: AI Voice Bot Vendor
Integration consists of 2 parts: REST API for the vendor to request the Gateway to dial, and WebSocket for the Gateway to stream audio/control events with the bot after the caller picks up.
Half-duplex in v1: The bot sends its audio response, then the Gateway forwards the caller's audio. The bot does not need to handle barge-in in this version.
        VENDOR SIDE                                  PLATFORM SIDE                    PSTN
┌──────────────────────────┐              ┌──────────────────────────────────┐     ┌────────┐
│                          │              │                                  │     │        │
│  ┌────────────────────┐  │   1. REST    │  ┌────────────────────────────┐  │ SIP │        │
│  │                    │──┼──────────────┼─►│                            │──┼────►│        │
│  │    REST client     │  │     POST     │  │       Voice Gateway        │  │     │ Caller │
│  │    (vendor app)    │◄─┼──────────────┼──│       (REST server)        │◄─┼─────│        │
│  │                    │  │    200 OK    │  │                            │  │     │        │
│  └────────────────────┘  │              │  └──────────────┬─────────────┘  │     │        │
│                          │              │                 │                │     │        │
│  ┌────────────────────┐  │              │                 │ 2. WS connect  │     │        │
│  │                    │◄─┼──────────────┼────── WSS ──────┘ (after caller  │     │        │
│  │    AI Voice Bot    │  │              │                    picks up)     │     │        │
│  │    (WebSocket      │──┼──────────────┼────── WSS ──────────────────────►│     │        │
│  │     server)        │  │              │                                  │     └────────┘
│  │                    │  │              │                                  │
│  │  STT → LLM → TTS   │  │              │                                  │
│  └────────────────────┘  │              │                                  │
│                          │              │                                  │
└──────────────────────────┘              └──────────────────────────────────┘

| Component | Role |
|---|---|
| Voice Gateway | WebSocket client (to the bot), also a REST server (receiving dial commands from the vendor) |
| AI Voice Bot | WebSocket server (receiving connections from the Gateway), also a REST client (calling the dial API) |
| Step | Action | Description |
|---|---|---|
| 1 | Vendor requests dialing via REST API | Call `POST /v1/voice/callbot` |
| 2 | Gateway dials via SIP/PSTN | The Gateway makes an outbound call to the caller. If the caller does not pick up → no WebSocket is opened. |
| 3 | Gateway opens WebSocket to bot | After the caller picks up, the Gateway connects to `socketUrl` |
| 4 | Exchange audio/control | The bot sends audio response (`media`) |
| 5 | End the call | The caller hangs up / the bot sends `stop` |
Three common scenarios: happy path with transfer, bot actively ends, caller does not pick up.
AI Voice Bot                                 Voice Gateway                                   Caller
════════════                                 ═════════════                                   ══════
      │                                            │                                            │
════ PHASE 0 · Vendor requests dialing (REST API) ═════════════════════════════════════════════════════
      │                                            │                                            │
      │  POST /v1/voice/callbot                    │                                            │
      ├───────────────────────────────────────────►│                                            │
      │ phone, campaignId, transactionId, socketUrl│                                            │
      │                                            │                                            │
      │  200 OK {error_code: "success"}            │                                            │
      │◄───────────────────────────────────────────┤                                            │
      │                                            │                                            │
      │                                            │  Dial (SIP / PSTN)                         │
      │                                            ├───────────────────────────────────────────►│
      │                                            │  Picks up                                  │
      │                                            │◄───────────────────────────────────────────┤
      │                                            │                                            │
════ PHASE 1 · WebSocket handshake & start ════════════════════════════════════════════════════════════
      │                                            │                                            │
      │  WS connect                                │                                            │
      │◄───────────────────────────────────────────┤                                            │
      │  wss://.../ws/voice?api_key=...            │                                            │
      │                                            │                                            │
      │  101 Switching Protocols                   │                                            │
      ├───────────────────────────────────────────►│                                            │
      │                                            │                                            │
      │  connected                                 │                                            │
      │◄───────────────────────────────────────────┤                                            │
      │                                            │                                            │
      │  start                                     │                                            │
      │◄───────────────────────────────────────────┤                                            │
      │  call_sid, metadata.custom                 │                                            │
      │                                            │                                            │
════ PHASE 2 · Conversation loop (repeats every turn) ═════════════════════════════════════════════════
      │                                            │                                            │
  ╭── [ BOT SPEAKS ] ───────────────────────────────────────────────────────────────────────────────╮
      │                                            │                                            │
      │  media                                     │                                            │
      ├───────────────────────────────────────────►│                                            │
      │  audio chunks (PCM 8kHz)                   │                                            │
      │                                            │                                            │
      │                                            │  play audio                                │
      │                                            ├┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈►│
      │                                            │                                            │
      │  mark  name: "turn_N_done"                 │                                            │
      ├───────────────────────────────────────────►│                                            │
      │                                            │  [ wait for playback to finish ]           │
      │  mark (echo)                               │                                            │
      │◄───────────────────────────────────────────┤                                            │
  ╰─────────────────────────────────────────────────────────────────────────────────────────────────╯
      │                                            │                                            │
  ╭── [ CALLER SPEAKS ] ────────────────────────────────────────────────────────────────────────────╮
      │                                            │                                            │
      │                                            │  caller speaks                             │
      │                                            │◄┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤
      │                                            │                                            │
      │  media  track: "inbound"                   │                                            │
      │◄───────────────────────────────────────────┤                                            │
  ╰─────────────────────────────────────────────────────────────────────────────────────────────────╯
      │                                            │                                            │
      │  … bot processes, back to "Bot speaks" …   │                                            │
      │                                            │                                            │
════ PHASE 3 · Transfer the call to an agent ══════════════════════════════════════════════════════════
      │                                            │                                            │
      │  transfer  target: "agent_01"              │                                            │
      ├───────────────────────────────────────────►│                                            │
      │                                            │                                            │
      │                                            │  SIP REFER / bridge → agent                │
      │                                            ├───────────────────────────────────────────►│
      │                                            │                                            │
      │  stop  reason: "transferred"               │                                            │
      │◄───────────────────────────────────────────┤                                            │
      │                                            │                                            │

Regarding `mark`: The bot sends `mark` after each audio segment, and the Gateway echoes `mark` once the caller has finished listening to it. The bot uses this signal to know when to activate ASR to process the caller's response.
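The turn-taking rule above can be sketched as a small state tracker. This is a minimal sketch under assumptions, not part of the spec: the `TurnTracker` class and its method names are hypothetical, and only the `mark` message shape is taken from this document.

```python
import json


class TurnTracker:
    """Tracks un-echoed marks so the bot knows when it may start ASR (hypothetical helper)."""

    def __init__(self):
        self.pending = set()      # mark names sent to the Gateway but not yet echoed
        self.asr_active = False   # True once the caller has heard the whole bot turn

    def mark_message(self, name: str) -> str:
        """Build the mark message to send right after the last media chunk of a turn."""
        self.pending.add(name)
        self.asr_active = False
        return json.dumps({"event": "mark", "mark": {"name": name}})

    def on_gateway_message(self, raw: str) -> None:
        """Enable ASR when the Gateway echoes the last pending mark; ignore other events."""
        msg = json.loads(raw)
        if msg.get("event") != "mark":
            return
        name = msg["mark"]["name"]
        if name in self.pending:
            self.pending.remove(name)
            self.asr_active = not self.pending


tracker = TurnTracker()
outgoing = tracker.mark_message("turn_1_done")          # sent after the turn's audio
echo = json.dumps({"event": "mark", "sequence_number": 80,
                   "mark": {"name": "turn_1_done"}})    # what the Gateway sends back
tracker.on_gateway_message(echo)
print(tracker.asr_active)  # True
```

Keeping a set of pending names, rather than a single flag, lets a bot that sends several marks per turn wait for the last one before listening.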
AI Voice Bot                                 Voice Gateway
════════════                                 ═════════════
      │                                            │
════ Bot actively ends the call ═══════════════════════════
      │                                            │
      │  media (last chunk)                        │
      ├───────────────────────────────────────────►│
      │                                            │
      │  stop  reason: "conversation_complete"     │
      ├───────────────────────────────────────────►│
      │                                            │
      │        [ drain buffer ]                    │
      │        play remaining audio to the caller  │
      │                                            │
      │  stop (ack)  close code 1000               │
      │◄───────────────────────────────────────────┤
      │                                            │
AI Voice Bot                                 Voice Gateway                                   Caller
════════════                                 ═════════════                                   ══════
      │                                            │                                            │
      │  POST /v1/voice/callbot                    │                                            │
      ├───────────────────────────────────────────►│                                            │
      │                                            │                                            │
      │  200 OK {error_code: "success"}            │                                            │
      │◄───────────────────────────────────────────┤                                            │
      │                                            │                                            │
      │                                            │  Dial (SIP / PSTN)                         │
      │                                            ├───────────────────────────────────────────►│
      │                                            │                                            │
                 ╔════════════════════════════════════════════════════════════════════╗
                 ║                                                                    ║
                 ║             No answer · line busy · phone switched off             ║
                 ║                                                                    ║
                 ╚════════════════════════════════════════════════════════════════════╝
                 → No WebSocket is opened

Note on dialing result feedback: In v1, there is no webhook returning the dial result to the vendor. If the caller does not pick up, the vendor is not notified; use `transactionId` for reconciliation, or contact the integration team to obtain the `call_sid` when debugging is needed.
The vendor calls this API when it wants the Gateway to dial out to the caller (e.g., in an outbound campaign). The API is asynchronous: it returns immediately upon receiving the request, and the dialing process and the WebSocket connection happen afterward.
POST {{baseUrl}}/v1/voice/callbot

The `X-Api-Key` header value is provided when the vendor registers the bot.

X-Api-Key: <api_key>
Content-Type: application/json

curl --location '{{baseUrl}}/v1/voice/callbot' \
--header 'X-Api-Key: {{api_key}}' \
--header 'Content-Type: application/json' \
--data-raw '{
  "phone": "0123456789",
  "campaignId": 1,
  "transactionId": "TXN_004",
  "socketUrl": "wss://bot.vendor.com/ws/voice?api_key=xxx",
  "name": "John Doe",
  "email": "john.doe@example.com",
  "address": "123 Main St, Anytown, USA",
  "pField1": "Personalized info 1",
  "pField2": "Personalized info 2",
  "pField3": "Personalized info 3",
  "pField4": "Personalized info 4",
  "pField5": "Personalized info 5",
  "pField6": "Personalized info 6"
}'

const axios = require('axios')
// Top-level await is not available in CommonJS, so the call is wrapped in an async function.
async function main() {
  const response = await axios.post(
    '{{baseUrl}}/v1/voice/callbot',
    {
      phone: '0123456789',
      campaignId: 1,
      transactionId: 'TXN_004',
      socketUrl: 'wss://bot.vendor.com/ws/voice?api_key=xxx',
      name: 'John Doe',
      email: 'john.doe@example.com',
      address: '123 Main St, Anytown, USA',
      pField1: 'Personalized info 1',
      pField2: 'Personalized info 2',
      pField3: 'Personalized info 3',
      pField4: 'Personalized info 4',
      pField5: 'Personalized info 5',
      pField6: 'Personalized info 6'
    },
    {
      headers: {
        'X-Api-Key': '{{api_key}}',
        'Content-Type': 'application/json'
      }
    }
  )
  console.log(response.data)
}

main().catch(console.error)

import requests
response = requests.post(
    '{{baseUrl}}/v1/voice/callbot',
    json={
        'phone': '0123456789',
        'campaignId': 1,
        'transactionId': 'TXN_004',
        'socketUrl': 'wss://bot.vendor.com/ws/voice?api_key=xxx',
        'name': 'John Doe',
        'email': 'john.doe@example.com',
        'address': '123 Main St, Anytown, USA',
        'pField1': 'Personalized info 1',
        'pField2': 'Personalized info 2',
        'pField3': 'Personalized info 3',
        'pField4': 'Personalized info 4',
        'pField5': 'Personalized info 5',
        'pField6': 'Personalized info 6'
    },
    headers={
        'X-Api-Key': '{{api_key}}',
        'Content-Type': 'application/json'
    }
)
print(response.json())

| Field | Type | Required | Description |
|---|---|---|---|
| `phone` | string | Yes | Caller's phone number |
| `campaignId` | int | Yes | ID of the campaign configured on the Gateway |
| `transactionId` | string | Yes | Vendor's transaction ID, used for reconciliation |
| `socketUrl` | string | Yes | Bot's WebSocket URL; the Gateway connects to it after the caller picks up |
| `name` | string | No | Customer name |
| `email` | string | No | Email |
| `address` | string | No | Address |
| `pField1` … `pField6` | string | No | Personalization fields, to be forwarded into `metadata.custom` |
{
  "error_code": "success",
  "message": "OK"
}

Response semantics: HTTP 200 + `error_code = "success"` → the Gateway has received the request and will dial. Any other `error_code` value indicates an error (wrong `api_key`, missing field, campaign does not exist, …).
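The acceptance rule above can be sketched as a small check on the HTTP status and body. This is an illustrative sketch: the function name `dial_accepted` is hypothetical, and the failing `error_code` value in the usage lines is invented for the example, since the document does not list the error codes.

```python
import json


def dial_accepted(http_status: int, body_text: str) -> bool:
    """Return True only for HTTP 200 with error_code == "success"."""
    if http_status != 200:
        return False
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        return False  # a non-JSON body is treated as a failure
    return isinstance(body, dict) and body.get("error_code") == "success"


print(dial_accepted(200, '{"error_code": "success", "message": "OK"}'))  # True
# "some_error" is a made-up value; any error_code other than "success" is a failure.
print(dial_accepted(200, '{"error_code": "some_error", "message": "..."}'))  # False
```

A `True` result only means the request was queued. Because the API is asynchronous, it says nothing about whether the caller picks up.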
The Gateway actively opens a WebSocket to the `socketUrl` provided by the vendor during the call initiation step.
The Gateway connects to the `socketUrl` supplied by the vendor in the API request. The URL must include `api_key` as a query parameter:
wss://bot.vendor.com/ws/voice?api_key=<KEY>

The bot verifies `api_key` during the handshake: if it is incorrect → close the WebSocket with close code `1008` (policy violation).
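The check described above can be sketched with the standard library alone. The function name `check_api_key` and the idea of passing it the request target (path plus query string) are assumptions for illustration; how the key reaches this function depends on the WebSocket server framework the vendor uses.

```python
import hmac
from urllib.parse import parse_qs, urlsplit

POLICY_VIOLATION = 1008  # WebSocket close code named in the spec above


def check_api_key(request_target: str, expected_key: str):
    """Return None if the api_key query parameter matches, else the close code to use."""
    query = parse_qs(urlsplit(request_target).query)
    supplied = query.get("api_key", [""])[0]
    # compare_digest avoids leaking the key through timing differences;
    # an empty expected key never matches.
    if expected_key and hmac.compare_digest(supplied.encode(), expected_key.encode()):
        return None
    return POLICY_VIOLATION


print(check_api_key("/ws/voice?api_key=secret", "secret"))  # None
print(check_api_key("/ws/voice?api_key=wrong", "secret"))   # 1008
```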
| Requirement | Value |
|---|---|
| Protocol | WebSocket (RFC 6455) |
| Scheme | `wss://` |
| Message format | JSON, text frames, UTF-8 |
| Audio transport | Base64 in field `media.payload` |

| Timeout | Default value |
|---|---|
| Connect timeout | 5 seconds |
| Idle timeout (no media) | 30 seconds |
| Max session duration | 900 seconds |
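A bot may want to mirror these defaults locally, for example to log why a session ended. The sketch below assumes exactly that and nothing more: the table does not state which side enforces each timeout, and the `SessionClock` class and its `"idle"` / `"max_duration"` labels are hypothetical local names, not protocol values.

```python
import time

IDLE_TIMEOUT_S = 30    # "Idle timeout (no media)" default from the table above
MAX_SESSION_S = 900    # "Max session duration" default from the table above


class SessionClock:
    """Tracks session age and time since the last media message (hypothetical helper)."""

    def __init__(self, now=time.monotonic):
        self._now = now               # injectable clock, which makes the class testable
        self.started = now()
        self.last_media = self.started

    def on_media(self) -> None:
        """Call for every media message received."""
        self.last_media = self._now()

    def expired(self):
        """Return "max_duration", "idle", or None if neither limit is reached."""
        t = self._now()
        if t - self.started >= MAX_SESSION_S:
            return "max_duration"
        if t - self.last_media >= IDLE_TIMEOUT_S:
            return "idle"
        return None


clock = SessionClock()
print(clock.expired())  # None right after the session starts
```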
End of call: When the call ends, the Gateway sends the `stop` event, then closes the WebSocket with close code `1000` (normal). The vendor does not need to reconnect.
The audio format is fixed in v1. Both the Gateway and the bot must use exactly this format.

| Property | Value |
|---|---|
| Codec | PCM signed 16-bit little-endian (`pcm_s16le`) |
| Sample rate | 8000 Hz |
| Channels | 1 (mono) |
| Frame size | 20 ms / chunk (160 samples = 320 bytes) |
| Transport | Base64 string in JSON |
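The frame-size figures in the table follow directly from the sample rate and sample width. The short sketch below works through that arithmetic and shows how large a 20 ms frame becomes once Base64-encoded; the helper names are illustrative only.

```python
import base64

SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 2   # signed 16-bit
CHANNELS = 1           # mono


def frame_bytes(duration_ms: int) -> int:
    """Raw PCM byte count for a frame of the given duration."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def base64_length(raw_len: int) -> int:
    """Length of the padded Base64 text for raw_len bytes."""
    return 4 * ((raw_len + 2) // 3)


silence = bytes(frame_bytes(20))                      # one 20 ms frame of silence
payload = base64.b64encode(silence).decode("ascii")
print(frame_bytes(20), len(payload))                  # 320 428
```

So each 20 ms frame carries 320 raw bytes, which travel as a 428-character Base64 string inside the JSON message.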
Every message is JSON UTF-8 sent via WebSocket text frame.
`connected` (GATEWAY → BOT): Sent immediately after the WebSocket handshake succeeds.

{
  "event": "connected",
  "protocol": "voice_stream",
  "version": "1.0"
}

`start` (GATEWAY → BOT): Sent once after `connected`. Contains call metadata. The vendor should cache this information for the duration of the session.
{
  "event": "start",
  "sequence_number": 1,
  "start": {
    "stream_sid": "MZxxxxxxxxxxxxxxxxx",
    "call_sid": "call-abc123",
    "media_format": {
      "encoding": "pcm_s16le",
      "sample_rate": 8000,
      "channels": 1
    },
    "metadata": {
      "phone_number": "0900000000",
      "direction": "outbound",
      "custom": {
        "key1": "value1",
        "key2": "value2"
      }
    }
  }
}

| Field | Type | Description |
|---|---|---|
| `stream_sid` | string | Unique ID of the WebSocket stream |
| `call_sid` | string | Call ID, used for logging / tracing |
| `media_format` | object | Always PCM 8kHz mono s16le in v1 |
| `metadata.phone_number` | string | Caller's phone number |
| `metadata.direction` | string | `outbound` |
| `metadata.custom` | object | Dynamic fields depending on each bot's configuration |
`metadata.custom`: Dynamic fields agreed upon by the vendor and the Gateway prior to integration. The schema is defined when registering the bot (e.g., a mapping from `pField1` … `pField6` in the REST API).
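Caching the `start` metadata for the session can be sketched as parsing the event once into an immutable record. The `CallContext` dataclass and `parse_start` function are hypothetical names; the field names and the sample message are taken from the `start` example in this document.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CallContext:
    """Per-call metadata cached from the start event (hypothetical container)."""
    stream_sid: str
    call_sid: str
    phone_number: str
    direction: str
    custom: dict = field(default_factory=dict)


def parse_start(raw: str) -> CallContext:
    """Validate a start event and extract the fields the bot needs for the whole call."""
    msg = json.loads(raw)
    if msg.get("event") != "start":
        raise ValueError("not a start event")
    start = msg["start"]
    fmt = start["media_format"]
    # v1 fixes the audio format, so anything else is treated as an error here.
    if (fmt["encoding"], fmt["sample_rate"], fmt["channels"]) != ("pcm_s16le", 8000, 1):
        raise ValueError(f"unsupported media_format: {fmt}")
    meta = start.get("metadata", {})
    return CallContext(
        stream_sid=start["stream_sid"],
        call_sid=start["call_sid"],
        phone_number=meta.get("phone_number", ""),
        direction=meta.get("direction", ""),
        custom=dict(meta.get("custom", {})),
    )


SAMPLE_START = json.dumps({
    "event": "start", "sequence_number": 1,
    "start": {
        "stream_sid": "MZxxxxxxxxxxxxxxxxx", "call_sid": "call-abc123",
        "media_format": {"encoding": "pcm_s16le", "sample_rate": 8000, "channels": 1},
        "metadata": {"phone_number": "0900000000", "direction": "outbound",
                     "custom": {"key1": "value1", "key2": "value2"}},
    },
})
ctx = parse_start(SAMPLE_START)
print(ctx.call_sid, ctx.custom["key1"])  # call-abc123 value1
```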
`media` (GATEWAY → BOT): Audio from the caller, streamed to the bot. Sent continuously, roughly one chunk every 20 ms.

{
  "event": "media",
  "sequence_number": 42,
  "media": {
    "track": "inbound",
    "chunk": 41,
    "timestamp": 1776326027630,
    "payload": "<base64_pcm_data>"
  }
}

| Field | Type | Description |
|---|---|---|
| `track` | string | Always `inbound` |
| `chunk` | int | Chunk order number, starting from 0 |
| `timestamp` | int | Unix time in milliseconds |
| `payload` | string | Base64 of raw PCM bytes (320 bytes / chunk) |
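On the bot side, handling inbound `media` amounts to decoding the payload and watching the `chunk` counter for gaps. The sketch below assumes the bot simply appends decoded audio to a buffer for its ASR; the `InboundAudio` class is a hypothetical helper, and counting missing chunks is an illustrative choice rather than a spec requirement.

```python
import base64
import json


class InboundAudio:
    """Collects caller audio and counts skipped chunks (hypothetical helper)."""

    def __init__(self):
        self.expected_chunk = 0     # chunk numbers start from 0
        self.missing = 0            # how many chunk numbers were skipped
        self.buffer = bytearray()   # raw PCM s16le, 8 kHz mono

    def on_media(self, raw: str) -> bytes:
        """Decode one media message and append its PCM bytes to the buffer."""
        media = json.loads(raw)["media"]
        chunk = media["chunk"]
        if chunk > self.expected_chunk:
            self.missing += chunk - self.expected_chunk
        self.expected_chunk = max(self.expected_chunk, chunk + 1)
        pcm = base64.b64decode(media["payload"], validate=True)
        if len(pcm) % 2:
            raise ValueError("PCM s16le payload must contain whole 16-bit samples")
        self.buffer += pcm
        return pcm


def sample_media(chunk: int) -> str:
    """Build a media message carrying 20 ms of silence, for demonstration only."""
    payload = base64.b64encode(bytes(320)).decode("ascii")
    return json.dumps({"event": "media", "sequence_number": chunk + 1,
                       "media": {"track": "inbound", "chunk": chunk,
                                 "timestamp": 1776326027630, "payload": payload}})


audio = InboundAudio()
audio.on_media(sample_media(0))
audio.on_media(sample_media(2))          # chunk 1 never arrived
print(len(audio.buffer), audio.missing)  # 640 1
```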
`mark` (GATEWAY → BOT): Echoes the marker that the bot sent. The Gateway sends `mark` back to the bot once the corresponding audio has finished playing for the caller.

{
  "event": "mark",
  "sequence_number": 80,
  "mark": {
    "name": "greeting_done"
  }
}

`stop` (GATEWAY → BOT): The call has ended. The Gateway sends this event, then closes the WebSocket.
{
  "event": "stop",
  "sequence_number": 999,
  "stop": {
    "reason": "caller_hangup",
    "call_sid": "call-abc123"
  }
}

| `reason` | Description |
|---|---|
| `caller_hangup` | Caller hangs up |
|  | Bot requests to stop |
| `transferred` | Call has been successfully transferred |
|  | Idle timeout |
|  | System error |
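The Gateway → bot events above can be routed with a small dispatcher keyed on the `event` field. This is a sketch under assumptions: the `route` function and the handler-dictionary convention are hypothetical, and ignoring unknown events is a design choice made here, not something the spec prescribes.

```python
import json

GATEWAY_EVENTS = ("connected", "start", "media", "mark", "stop")


def route(raw: str, handlers: dict) -> bool:
    """Call the handler for one Gateway message; return False once the call is over."""
    msg = json.loads(raw)
    event = msg.get("event")
    if event not in GATEWAY_EVENTS:
        return True  # unknown event: ignored in this sketch
    handler = handlers.get(event)
    if handler is not None:
        # Most events nest their data under a key named after the event;
        # "connected" keeps its fields at the top level, so the whole message is passed.
        handler(msg.get(event, msg))
    return event != "stop"


reasons = []
handlers = {"stop": lambda stop: reasons.append(stop["reason"])}
keep_going = route(json.dumps({"event": "stop", "sequence_number": 999,
                               "stop": {"reason": "caller_hangup",
                                        "call_sid": "call-abc123"}}), handlers)
print(keep_going, reasons)  # False ['caller_hangup']
```

The boolean return value lets a receive loop exit as soon as `stop` arrives, before the Gateway closes the socket.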
The bot sends audio response, marker, transfer command, or stop to the Gateway.
`media` (BOT → GATEWAY): The audio response the bot plays for the caller.

{
  "event": "media",
  "media": {
    "payload": "<base64_pcm_data>"
  }
}

- Must be PCM 8kHz mono s16le (matching `start.media_format`)
- Chunk size should be 20–100 ms to reduce latency
- No need for `chunk` / `timestamp`: the Gateway sequences the audio automatically
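The rules above can be sketched as a generator that slices synthesized PCM into `media` messages. The function name `media_messages` is hypothetical; the sketch assumes the TTS output is already PCM 8 kHz mono s16le and that a shorter final chunk is acceptable, which the spec does not state explicitly.

```python
import base64
import json

BYTES_PER_MS = 16  # 8000 samples/s * 2 bytes per sample / 1000 ms


def media_messages(pcm: bytes, chunk_ms: int = 20):
    """Yield bot-to-Gateway media messages, one per chunk of chunk_ms milliseconds."""
    if not 20 <= chunk_ms <= 100:
        raise ValueError("chunk_ms should stay within the recommended 20-100 ms")
    if len(pcm) % 2:
        raise ValueError("PCM s16le audio must contain whole 16-bit samples")
    step = chunk_ms * BYTES_PER_MS
    for offset in range(0, len(pcm), step):
        payload = base64.b64encode(pcm[offset:offset + step]).decode("ascii")
        # No chunk or timestamp field: the Gateway sequences the audio itself.
        yield json.dumps({"event": "media", "media": {"payload": payload}})


one_second_of_silence = bytes(16000)
messages = list(media_messages(one_second_of_silence, chunk_ms=100))
print(len(messages))  # 10
```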
`mark` (BOT → GATEWAY): Sets a checkpoint. The Gateway echoes it back once the audio sent before it has finished playing for the caller.

{
  "event": "mark",
  "mark": {
    "name": "question_1_done"
  }
}

The bot uses this to know when the caller has finished listening to the audio, and can then activate ASR to listen for the response.
`transfer` (BOT → GATEWAY): Transfers the call to an agent or a queue.

{
  "event": "transfer",
  "transfer": {
    "target": "agent_extension_or_queue",
    "context": "default",
    "on_complete": "hangup_bot"
  }
}

| Field | Type | Description |
|---|---|---|
| `target` | string | Extension / queue ID / phone number |
| `context` | string | Routing context, received from the Gateway during registration |
| `on_complete` | string | `hangup_bot` |
After receiving `transfer`, the Gateway will:
1. Play the remaining audio in the buffer (drain)
2. Perform the transfer
3. Send `stop` with reason `transferred` to the bot
4. Close the WebSocket
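From the bot's point of view, the sequence above means sending one `transfer` message and then waiting for a `stop` whose reason is `transferred`. The two helpers below are a hypothetical sketch; their names are illustrative, and the default values simply repeat the example message from this document.

```python
import json


def transfer_message(target: str, context: str = "default",
                     on_complete: str = "hangup_bot") -> str:
    """Build the transfer command; defaults mirror the example in this document."""
    if not target:
        raise ValueError("target (extension, queue ID, or phone number) is required")
    return json.dumps({"event": "transfer",
                       "transfer": {"target": target, "context": context,
                                    "on_complete": on_complete}})


def is_transfer_confirmed(raw: str) -> bool:
    """True when the Gateway reports that the transfer succeeded."""
    msg = json.loads(raw)
    return msg.get("event") == "stop" and msg.get("stop", {}).get("reason") == "transferred"


print(transfer_message("agent_01"))
confirmation = json.dumps({"event": "stop", "sequence_number": 120,
                           "stop": {"reason": "transferred", "call_sid": "call-abc123"}})
print(is_transfer_confirmed(confirmation))  # True
```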
`stop` (BOT → GATEWAY): The bot actively ends the call.

{
  "event": "stop",
  "stop": {
    "reason": "conversation_complete"
  }
}

The Gateway plays the remaining audio in the buffer, then ends the call.
Contact the Alohub integration team when support is needed:
- Review implementation: sanity-check the spec and the event-processing logic before going to production.
- Configuration of `api_key` & URL: on the staging / production environment for each bot.
- Debug real calls: provide the `call_sid` or `transactionId` so the logs can be traced.