This document describes the WebSocket protocol for integrating the AI Voice Bot with Alohub's Voice Gateway. The vendor provides a WebSocket server that follows the specification below. The Gateway connects when a call starts and exchanges audio and control events with the bot for the duration of the call.
Note: The reference model follows Twilio Media Streams. Each call corresponds to an independent WebSocket connection, not reused. Contact Alohub for integration support.
The Voice Gateway is a WebSocket client, initiating a connection to the bot when a new call starts.
The AI Voice Bot is a WebSocket server, listening for connections, processing audio, and returning audio responses.
Each call = 1 independent WebSocket connection, not reused.
The vendor provides the WebSocket URL, and the Gateway passes `api_key` as a query parameter:
wss://bot.vendor.com/ws/voice?api_key=<KEY>
The bot verifies `api_key` during the handshake. If it is incorrect, the bot closes the WS with close code 1008.
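The key check can be sketched with the standard library alone. This is a minimal sketch, not the reference bot: `check_api_key` is a hypothetical helper, and it assumes the WebSocket framework in use exposes the request path together with its query string.

```python
import hmac
from typing import Optional
from urllib.parse import parse_qs, urlsplit

POLICY_VIOLATION = 1008  # close code the spec names for a bad api_key


def check_api_key(request_path: str, expected_key: str) -> Optional[int]:
    """Return None when the key matches, else the close code to use.

    `request_path` is the path plus query string of the handshake request,
    e.g. "/ws/voice?api_key=abc" (how you obtain it depends on the framework).
    """
    query = parse_qs(urlsplit(request_path).query)
    supplied = query.get("api_key", [""])[0]
    # compare_digest avoids leaking key contents through timing differences.
    if supplied and hmac.compare_digest(supplied.encode(), expected_key.encode()):
        return None
    return POLICY_VIOLATION
```

A handler would call this once per connection and close the socket with the returned code when it is not `None`.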
| Requirement | Value |
|---|---|
| Protocol | WebSocket (RFC 6455) |
| Scheme | `wss://` (as in the example URL above) |
| Message format | JSON, text frames, UTF-8 |
| Audio transport | Base64 in field `media.payload` |
| Timeout | Default value |
|---|---|
| Connect timeout | 5 seconds |
| Idle timeout (no media) | 30 seconds |
| Max session duration | 900 seconds |
When the call ends, the Gateway sends a `stop` event, then closes the WS with close code 1000 (normal). The vendor does not need to reconnect.
| Property | Value |
|---|---|
| Codec | PCM signed 16-bit little-endian (`pcm_s16le`) |
| Sample rate | 8000 Hz |
| Channels | 1 (mono) |
| Frame size | 20 ms/chunk (160 samples = 320 bytes) |
| Transport | Base64 string in JSON |
Note: The audio format is fixed in v1. Both the Gateway and the bot must use this exact format.
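The frame-size figures follow directly from the format: 8000 samples/s × 0.020 s = 160 samples, and 160 samples × 2 bytes = 320 bytes. A small sketch of that arithmetic, with hypothetical helper names (`duration_ms`, `silence`) that are not part of the protocol:

```python
SAMPLE_RATE_HZ = 8000   # fixed in v1
BYTES_PER_SAMPLE = 2    # signed 16-bit
CHANNELS = 1            # mono
FRAME_MS = 20

SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000            # 160
FRAME_BYTES = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE * CHANNELS    # 320


def duration_ms(pcm: bytes) -> float:
    """Playback duration of a raw PCM buffer in milliseconds."""
    return len(pcm) * 1000 / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def silence(ms: int) -> bytes:
    """Raw PCM silence of the given length (all-zero samples)."""
    n_samples = SAMPLE_RATE_HZ * ms // 1000
    return b"\x00" * (n_samples * BYTES_PER_SAMPLE * CHANNELS)
```

`duration_ms` is handy for checking how much audio a buffer holds before it is sent.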
In v1, the bot operates in a half-duplex model: the bot finishes the audio response before the Gateway forwards the caller's audio. The bot does not need to handle barge-in.
Caller                 Voice Gateway                  AI Voice Bot
   |                         |                              |
   == Handshake and session initialization ==
   |                         | (1) WS connect               |
   |                         |----------------------------->|
   |                         | (2) 101 Switching Protocols  |
   |                         |<-----------------------------|
   |                         | (3) connected                |
   |                         |----------------------------->|
   |                         | (4) start (metadata)         |
   |                         |----------------------------->|
   |                         |                              |
   +-- loop [each conversation turn] ----------------------+
   -- Bot speaks --
   |                         | (5) media (chunks)           |
   |                         |<-----------------------------|
   | (6) play audio          |                              |
   |<------------------------|                              |
   |                         | (7) mark {turn_done}         |
   |                         |<-----------------------------|
   |                         | [wait for playback to end]   |
   |                         | (8) mark echo                |
   |                         |----------------------------->|
   -- Caller speaks --
   | (9) caller speaks       |                              |
   |------------------------>|                              |
   |                         | (10) media (inbound)         |
   |                         |----------------------------->|
   +--------------------------------------------------------+
   |                         |                              |
   == Transfer call to agent ==
   |                         | (11) transfer target         |
   |                         |<-----------------------------|
   | (12) transfer to agent  |                              |
   |<------------------------|                              |
   |                         | (13) stop (transferred)      |
   |                         |----------------------------->|
   |                         | (14) close WS                |
   |                         |--------------X---------------|

Voice Gateway                      AI Voice Bot
      |                                  |
      == Termination phase ==
      | (1) media (last chunk)           |
      |<---------------------------------|
      | (2) stop {conv_complete}         |
      |<---------------------------------|
      | [drain buffer: play remaining    |
      |  audio to the caller]            |
      | (3) stop (ack)                   |
      |--------------------------------->|
      | (4) close WS                     |
      |----------------X-----------------|

Every message is UTF-8 JSON sent in a WebSocket text frame.
`connected` (Gateway → bot) is sent once, when the WS connection is successfully established.
{
  "event": "connected",
  "sequence_number": 0
}
`start` (Gateway → bot) is sent once, after `connected`. It contains the call metadata. The vendor should cache this information for the whole session.
{
  "event": "start",
  "sequence_number": 1,
  "start": {
    "stream_sid": "MZxxxxxxxxxxxxxxxx",
    "call_sid": "call-abc123",
    "media_format": {
      "encoding": "pcm_s16le",
      "sample_rate": 8000,
      "channels": 1
    },
    "metadata": {
      "phone_number": "0900000000",
      "direction": "outbound",
      "custom": {
        "key1": "value1",
        "key2": "value2"
      }
    }
  }
}
| Field | Type | Description |
|---|---|---|
| `stream_sid` | string | Unique ID of the WebSocket stream |
| `call_sid` | string | Call ID, used for logging/tracing |
| `media_format` | object | Always PCM 8 kHz mono s16le in v1 |
| `metadata.phone_number` | string | Caller's phone number |
| `metadata.direction` | string | Call direction, e.g. `outbound` |
| `metadata.custom` | object | Dynamic fields that depend on each bot's configuration; the schema is defined during bot registration |
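Caching the call metadata for the session might look like the sketch below. `CallSession` and `parse_start` are hypothetical names, and rejecting any format other than the v1 one is an assumption about how a bot may want to fail fast, not a documented requirement.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class CallSession:
    """Metadata from the `start` event, kept for the life of the connection."""
    stream_sid: str
    call_sid: str
    phone_number: str
    direction: str
    custom: Dict[str, Any] = field(default_factory=dict)


def parse_start(raw: str) -> CallSession:
    msg = json.loads(raw)
    if msg.get("event") != "start":
        raise ValueError("not a start event")
    start = msg["start"]
    fmt = start["media_format"]
    # v1 fixes the audio format; anything else means the two sides disagree.
    if (fmt["encoding"], fmt["sample_rate"], fmt["channels"]) != ("pcm_s16le", 8000, 1):
        raise ValueError(f"unsupported media_format: {fmt}")
    meta = start.get("metadata", {})
    return CallSession(
        stream_sid=start["stream_sid"],
        call_sid=start["call_sid"],
        phone_number=meta.get("phone_number", ""),
        direction=meta.get("direction", ""),
        custom=meta.get("custom", {}),
    )
```

Keeping `call_sid` and `stream_sid` on one object makes it easy to attach both to every log line.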
`media` (Gateway → bot) carries audio from the caller to the bot. It is sent continuously, one chunk about every 20 ms.
{
  "event": "media",
  "sequence_number": 42,
  "media": {
    "track": "inbound",
    "chunk": 41,
    "timestamp": 1776326027630,
    "payload": "<base64_pcm_data>"
  }
}
| Field | Type | Description |
|---|---|---|
| `track` | string | Always `inbound` |
| `chunk` | int | Chunk sequence number, starting from 0 |
| `timestamp` | int | Unix time in milliseconds |
| `payload` | string | Base64 of raw PCM bytes (320 bytes/chunk) |
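A bot typically decodes each payload and keeps an eye on the `chunk` counter. The following is a sketch under assumptions: `InboundTrack` is a hypothetical class, and counting gaps in `chunk` is one possible way to notice lost audio, not something the protocol mandates.

```python
import base64
import json


class InboundTrack:
    """Decodes caller audio and counts missing chunk numbers."""

    def __init__(self) -> None:
        self.next_chunk = 0   # chunk numbers start from 0
        self.missing = 0      # chunks that never arrived

    def feed(self, raw: str) -> bytes:
        media = json.loads(raw)["media"]
        # validate=True rejects characters outside the base64 alphabet.
        pcm = base64.b64decode(media["payload"], validate=True)
        if len(pcm) % 2:
            raise ValueError("odd byte count: not whole 16-bit samples")
        chunk = media["chunk"]
        if chunk > self.next_chunk:
            self.missing += chunk - self.next_chunk
        self.next_chunk = max(self.next_chunk, chunk + 1)
        return pcm
```

The returned bytes are raw little-endian 16-bit samples, ready to hand to an ASR engine.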
`mark` (Gateway → bot) echoes back the marker that the bot sent: the Gateway sends `mark` back to the bot when the corresponding audio has finished playing for the caller.
{
  "event": "mark",
  "sequence_number": 80,
  "mark": {
    "name": "greeting_done"
  }
}
`stop` (Gateway → bot) signals that the call has ended. The Gateway sends `stop`, then closes the WS.
{
  "event": "stop",
  "sequence_number": 999,
  "stop": {
    "reason": "caller_hangup",
    "call_sid": "call-abc123"
  }
}
| Reason | Description |
|---|---|
| `caller_hangup` | Caller hangs up |
|  | Bot requests to stop |
| `transferred` | The call has been successfully transferred |
|  | Idle timeout |
|  | System error |
`media` (bot → Gateway) carries the audio response that the bot plays for the caller.
{
  "event": "media",
  "media": {
    "payload": "<base64_pcm_data>"
  }
}
Must be PCM 8 kHz mono s16le (matching `start.media_format`)
Chunk size should be 20–100 ms to reduce latency
`chunk` and `timestamp` are not needed; the Gateway sequences automatically
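Turning a synthesized PCM buffer into outbound messages could be sketched as follows. `media_messages` is a hypothetical helper, and treating the 20–100 ms recommendation as a hard range is a choice made for this example only.

```python
import base64
import json
from typing import Iterator

BYTES_PER_MS = 8000 * 2 // 1000  # 8 kHz, 16-bit mono -> 16 bytes per millisecond


def media_messages(pcm: bytes, chunk_ms: int = 20) -> Iterator[str]:
    """Yield one JSON `media` message per chunk of raw PCM audio."""
    if not 20 <= chunk_ms <= 100:
        raise ValueError("chunk_ms should stay within 20-100 ms")
    chunk_bytes = BYTES_PER_MS * chunk_ms
    for offset in range(0, len(pcm), chunk_bytes):
        payload = base64.b64encode(pcm[offset:offset + chunk_bytes]).decode("ascii")
        # Only `payload` is sent; the Gateway handles sequencing.
        yield json.dumps({"event": "media", "media": {"payload": payload}})
```

Each yielded string is ready to be sent as one WebSocket text frame; the final chunk may be shorter than the others.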
`mark` (bot → Gateway) sets a checkpoint. The Gateway echoes it back when the preceding audio has finished playing for the caller.
{
  "event": "mark",
  "mark": {
    "name": "question_1_done"
  }
}
The bot uses this to know when the caller has finished listening to the audio, and can then activate ASR to listen for the caller's reply.
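One way to track this is a small state holder that remembers which marks are outstanding. `TurnTracker` and its `listening` flag are hypothetical; the sketch assumes the bot only starts ASR once every mark it sent has been echoed.

```python
import json
from typing import Set


class TurnTracker:
    """Tracks outstanding marks to decide when ASR may start listening."""

    def __init__(self) -> None:
        self._pending: Set[str] = set()
        self.listening = False  # True once the caller has heard all queued audio

    def mark(self, name: str) -> str:
        """Build the `mark` message to send after the turn's last audio chunk."""
        self._pending.add(name)
        self.listening = False
        return json.dumps({"event": "mark", "mark": {"name": name}})

    def on_echo(self, raw: str) -> bool:
        """Handle a message from the Gateway; True if it echoed a pending mark."""
        msg = json.loads(raw)
        if msg.get("event") != "mark":
            return False
        name = msg["mark"]["name"]
        if name not in self._pending:
            return False
        self._pending.discard(name)
        self.listening = not self._pending
        return True
```

A bot would send `tracker.mark("question_1_done")` right after the audio for that question and enable ASR when `tracker.listening` turns true.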
`transfer` (bot → Gateway) hands the call over to an agent or queue.
{
  "event": "transfer",
  "transfer": {
    "target": "agent_extension_or_queue",
    "context": "default",
    "on_complete": "hangup_bot"
  }
}
| Field | Type | Description |
|---|---|---|
| `target` | string | Extension / queue ID / phone number |
| `context` | string | Routing context, received from the Gateway during registration |
| `on_complete` | string | Action on completion, e.g. `hangup_bot` |
After receiving `transfer`, the Gateway will: play the remaining audio in the buffer (drain) → perform the transfer → send `stop` with reason `transferred` to the bot → close the WS.
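The bot side of this exchange can be sketched as one builder and one check. Both function names are hypothetical, and the default values simply mirror the example message above.

```python
import json


def transfer_message(target: str, context: str = "default",
                     on_complete: str = "hangup_bot") -> str:
    """Build the `transfer` request the bot sends to the Gateway."""
    return json.dumps({
        "event": "transfer",
        "transfer": {"target": target, "context": context,
                     "on_complete": on_complete},
    })


def is_transfer_confirmed(raw: str) -> bool:
    """True when the Gateway reports `stop` with reason `transferred`."""
    msg = json.loads(raw)
    return (msg.get("event") == "stop"
            and msg.get("stop", {}).get("reason") == "transferred")
```

After sending the transfer request, the bot stops producing audio and waits for the confirming `stop` and the Gateway's close.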
`stop` (bot → Gateway) ends the call from the bot side.
{
  "event": "stop",
  "stop": {
    "reason": "conversation_complete"
  }
}
The Gateway will play the remaining audio in the buffer then end the call.
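Because the Gateway still has audio to drain, the bot should not close the socket itself right after sending `stop`. A sketch of that wait, assuming a connection object with awaitable `send()` and `wait_closed()` methods (as the Python `websockets` library provides); `end_call` and the 5-second default are assumptions of this example, not values from the spec:

```python
import asyncio
import json


async def end_call(ws, reason: str = "conversation_complete",
                   close_timeout: float = 5.0) -> bool:
    """Send `stop`, then wait for the Gateway to close the connection.

    Returns True if the Gateway closed in time, False if the wait timed out
    (the caller may then close the socket itself).
    """
    await ws.send(json.dumps({"event": "stop", "stop": {"reason": reason}}))
    try:
        await asyncio.wait_for(ws.wait_closed(), timeout=close_timeout)
        return True
    except asyncio.TimeoutError:
        return False
```

The timeout only guards against a Gateway that never closes; in the normal flow the function returns as soon as the close arrives.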
| Code | Meaning |
|---|---|
| 1000 | Normal closure |
| 1002 | Protocol error: invalid JSON format, missing required fields |
| 1008 | Policy violation: invalid `api_key` |
|  | Internal error |
|  | Audio format mismatch |
|  | Timeout |
The Gateway will close the WS with code 1002 if:
The message is not valid JSON
The `event` field is missing
`event` is not in the whitelist
`media.payload` is not valid base64
Audio decoding fails more than 5 consecutive times
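A vendor can run the same checks locally before sending, for example in unit tests. This is only a sketch: `protocol_error` is a hypothetical function, and `ALLOWED_EVENTS` is an assumed whitelist made up of the bot-to-Gateway events described in this document; the Gateway's actual whitelist is not listed here.

```python
import base64
import json
from typing import Optional

# Assumption: the bot-to-Gateway events documented above.
ALLOWED_EVENTS = {"media", "mark", "transfer", "stop"}


def protocol_error(raw: str) -> Optional[str]:
    """Return a reason the message would be rejected, or None if it looks valid."""
    try:
        msg = json.loads(raw)
    except ValueError:
        return "invalid JSON"
    if not isinstance(msg, dict) or "event" not in msg:
        return "missing event"
    event = msg["event"]
    if not isinstance(event, str) or event not in ALLOWED_EVENTS:
        return "event not allowed"
    if event == "media":
        media = msg.get("media")
        payload = media.get("payload") if isinstance(media, dict) else None
        if not isinstance(payload, str):
            return "missing media.payload"
        try:
            base64.b64decode(payload, validate=True)
        except ValueError:
            return "invalid base64"
    return None
```

The consecutive decode-failure rule is not covered, since it depends on the Gateway's audio decoder.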
Log the complete `call_sid` and `stream_sid` in every log line
Validate JSON before processing; do not let the WS handler crash
Rate-limit audio output: do not send faster than 2x real time
Graceful shutdown: after sending `stop` or `transfer`, wait for the Gateway to close the WS
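The 2x real-time limit can be read as: at any moment, the audio sent so far should not exceed twice the wall-clock time elapsed since sending began. Under that reading (an interpretation, not a definition from the spec), a hypothetical `send_delay` helper computes how long to pause before the next chunk:

```python
def send_delay(audio_ms_sent: float, elapsed_s: float,
               max_speed: float = 2.0) -> float:
    """Seconds to wait so that audio sent / time elapsed stays <= max_speed.

    audio_ms_sent: total duration of audio sent so far, including the chunk
                   about to go out.
    elapsed_s:     wall-clock seconds since the first chunk of this turn.
    """
    earliest_s = (audio_ms_sent / 1000.0) / max_speed
    return max(0.0, earliest_s - elapsed_s)
```

In an asyncio send loop this would be used as `await asyncio.sleep(send_delay(sent_ms, time.monotonic() - started))` before each send.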
| File | Description |
|---|---|
|  | Reference bot: Python asyncio + websockets |
|  | Reference bot: Node.js (ws) |
| `mock_gateway.py` | Mock Gateway for the vendor to test the bot locally |
WS server verifies `api_key` from the query parameters
Handles `connected`, `start`, `media`, `mark`, `stop`
Sends `media` responses in the correct PCM 8 kHz mono s16le format
Implements `mark` to sync on when the caller has finished listening
Implements `transfer` for handing the call to an agent
Logs the complete `call_sid` and `stream_sid`
Graceful shutdown
Tests with `mock_gateway.py` pass
Send URL + `api_key` to the integration team
Smoke test 10 calls on staging
Q: Can the bot send audio in multiple chunks at once?
A: Yes, but the total audio duration must not exceed 500ms/message to avoid jitter.
Q: Can I use binary frames instead of JSON?
A: Not supported in v1.
Q: Expected latency?
A: End-to-end (caller speaks → bot responds): target < 800ms. Network round-trip Gateway↔Bot should be < 100ms.
Q: Is barge-in (caller interrupting the bot) supported?
A: No. v1 is half-duplex: the bot finishes speaking before the Gateway forwards the caller's audio. Barge-in will be supported in v2.
Q: When does the Gateway close WS?
A: After sending stop, or caller hangup, or timeout, or protocol error.
Q: Are sample rates other than 8 kHz or codecs other than PCM supported?
A: Not in v1. The format is fixed at PCM 8 kHz mono s16le.
Q: Does the bot need to handle DTMF?
A: No, v1 does not forward DTMF to the bot.
Contact the integration team to:
Review implementation
Configure `api_key` and URL on staging/production
Debug real calls (need to provide call_sid)
Contact information: please contact Alohub for integration support and to receive credentials.