Voicebot Integration

AI Voice Bot integration via WebSocket

This document describes the WebSocket protocol for integrating the AI Voice Bot with Alohub's Voice Gateway. The vendor provides a WebSocket server that follows the specification below. The Gateway connects to it when a call starts and exchanges audio and control events for the duration of the call.

Note: The reference model follows Twilio Media Streams. Each call uses its own WebSocket connection; connections are never reused. Contact Alohub for integration support.

2. Architecture

The Voice Gateway is a WebSocket client, initiating a connection to the bot when a new call starts.
The AI Voice Bot is a WebSocket server, listening for connections, processing audio, and returning audio responses.
Each call = 1 independent WebSocket connection, not reused.

3. WebSocket Connection

3.1 URL + Authentication

The vendor provides the WebSocket URL, and the Gateway passes api_key as a query parameter:

wss://bot.vendor.com/ws/voice?api_key=<KEY>

The bot verifies api_key during the handshake. If the key is incorrect or missing, the bot closes the WS with close code 1008.
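The check itself can be sketched as follows. This is a minimal illustration, not part of the specification: the function name api_key_is_valid is hypothetical, and how the request target is obtained and how the connection is rejected depend on the WebSocket library in use.

```python
import hmac
from urllib.parse import parse_qs, urlsplit

POLICY_VIOLATION = 1008  # close code listed in section 7.1


def api_key_is_valid(request_target: str, expected_key: str) -> bool:
    """Return True if the handshake request carries exactly the expected api_key.

    request_target is the path plus query string of the handshake request,
    for example "/ws/voice?api_key=<KEY>".
    """
    query = urlsplit(request_target).query
    values = parse_qs(query).get("api_key", [])
    if len(values) != 1:
        # Missing, empty, or repeated api_key parameter.
        return False
    # Constant-time comparison, so timing does not reveal how much of the key matched.
    return hmac.compare_digest(values[0].encode(), expected_key.encode())
```

If the function returns False, the bot would close the connection with POLICY_VIOLATION.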

3.2 Technical Requirements

Requirement	Value
Protocol	WebSocket (RFC 6455)
Scheme	`wss://` (TLS required for production)
Message format	JSON, text frames, UTF-8
Audio transport	Base64 in field `media.payload`

3.3 Timeouts

Timeout	Default value
Connect timeout	5 seconds
Idle timeout (no media)	30 seconds
Max session duration	900 seconds

When the call ends, the Gateway sends a stop event, then closes the WS with close code 1000 (normal). The vendor does not need to reconnect.

4. Audio Format

Property	Value
Codec	PCM signed 16-bit little-endian (`pcm_s16le`)
Sample rate	8000 Hz
Channels	1 (mono)
Frame size	20ms/chunk (160 samples = 320 bytes)
Transport	Base64 string in JSON

Note: The audio format is fixed in v1. Both the Gateway and the bot must use this exact format.
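The frame arithmetic in the table can be checked with a short sketch. The helper names encode_frame and decode_frame are illustrative only; the constants come from the table above.

```python
import base64
import struct

SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 2  # signed 16-bit
FRAME_MS = 20

SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 160 samples
BYTES_PER_FRAME = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE  # 320 bytes


def encode_frame(samples):
    """Pack signed 16-bit samples as little-endian PCM and Base64-encode them."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return base64.b64encode(pcm).decode("ascii")


def decode_frame(payload):
    """Decode a Base64 payload back into a list of signed 16-bit samples."""
    pcm = base64.b64decode(payload, validate=True)
    if len(pcm) % BYTES_PER_SAMPLE:
        raise ValueError("PCM data must contain whole 16-bit samples")
    return list(struct.unpack(f"<{len(pcm) // BYTES_PER_SAMPLE}h", pcm))
```

A 20 ms frame of 320 raw bytes becomes a 428-character Base64 string, so the JSON transport adds roughly one third on top of the raw audio size.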

5. Processing Flow (Sequence Diagrams)

In v1, the bot operates in a half-duplex model: the Gateway forwards the caller's audio to the bot only after the bot's audio response has finished playing. The bot does not need to handle barge-in.

5.1 Happy Path & Transfer Agent

  Caller              Voice Gateway             AI Voice Bot
    |                       |                        |
    |        +---------------------------------------+
    |        |       Handshake & session setup       |
    |        +---------------------------------------+
    |                       |  (1) WS connect        |
    |                       |----------------------->|
    |                       |  (2) 101 Switching     |
    |                       |<-----------------------|
    |                       |  (3) connected         |
    |                       |----------------------->|
    |                       |  (4) start (metadata)  |
    |                       |----------------------->|
    |                       |                        |
    +------------------------------------------------+
    | loop [Each conversation turn]                  |
    |    +---------------------------------------+   |
    |    |              Bot speaks               |   |
    |    +---------------------------------------+   |
    |    |                  |  (5) media (chunks)    |
    |    |                  |<-----------------------|
    |    |  (6) play audio  |                        |
    |    |<-----------------|                        |
    |    |                  |  (7) mark {turn_done}  |
    |    |                  |<-----------------------|
    |    |          +-----------------+              |
    |    |          |Wait for playback|              |
    |    |          +-----------------+              |
    |    |                  |  (8) mark echo         |
    |    |                  |----------------------->|
    |    +---------------------------------------+   |
    |    |             Caller speaks             |   |
    |    +---------------------------------------+   |
    |    | (9) caller talks |                        |
    |    |----------------->|                        |
    |    |                  | (10) media (inbound)   |
    |    |                  |----------------------->|
    +------------------------------------------------+
    |                       |                        |
    |    +---------------------------------------+   |
    |    |        Transfer call to agent         |   |
    |    +---------------------------------------+   |
    |                       | (11) transfer target   |
    |                       |<-----------------------|
    | (12) transfer to agent|                        |
    |<----------------------|                        |
    |                       | (13) stop (transferred)|
    |                       |----------------------->|
    |                       | (14) close WS          |
    |                       |-----------X------------|

5.2 End of Conversation & Drain Buffer

          Voice Gateway             AI Voice Bot
                |                        |
      +------------------------------------------+
      |            Termination phase             |
      +------------------------------------------+
                |  (1) media (last chunk)        |
                |<-------------------------------|
                |  (2) stop {conv_complete}      |
                |<-------------------------------|
                |                                |
      +----------------------------+             |
      |        Drain buffer        |             |
      |  play out remaining audio  |             |
      +----------------------------+             |
                |                                |
                |  (3) stop (ack)                |
                |------------------------------->|
                |  (4) close WS                  |
                |---------------X----------------|

6. Event Specification

Every message is JSON UTF-8 sent via WebSocket text frame.

6.1 Gateway → Bot

6.1.1 connected

Sent once when the WS connection is successfully established.

{
  "event": "connected",
  "sequence_number": 0
}

6.1.2 start

Sent once after connected. Contains call metadata. The vendor should cache this information throughout the session.

{
  "event": "start",
  "sequence_number": 1,
  "start": {
    "stream_sid": "MZxxxxxxxxxxxxxxxx",
    "call_sid": "call-abc123",
    "media_format": {
      "encoding": "pcm_s16le",
      "sample_rate": 8000,
      "channels": 1
    },
    "metadata": {
      "phone_number": "0900000000",
      "direction": "outbound",
      "custom": {
        "key1": "value1",
        "key2": "value2"
      }
    }
  }
}

Field	Type	Description
`stream_sid`	string	Unique ID of the WebSocket stream
`call_sid`	string	Call ID, used for logging/tracing
`media_format`	object	Always PCM 8kHz mono s16le in v1
`metadata.phone_number`	string	Caller's phone number
`metadata.direction`	string	`outbound` or `inbound`
`metadata.custom`	object	Dynamic fields depending on each bot's configuration — schema defined during bot registration
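Since the start metadata should be cached for the whole session, one option is to parse it once into an immutable record. The sketch below assumes the field layout shown in the example above; CallInfo and parse_start are hypothetical names, not part of the protocol.

```python
import json
from dataclasses import dataclass, field

# Format the bot expects in v1 (section 4).
EXPECTED_FORMAT = {"encoding": "pcm_s16le", "sample_rate": 8000, "channels": 1}


@dataclass(frozen=True)
class CallInfo:
    stream_sid: str
    call_sid: str
    phone_number: str
    direction: str
    custom: dict = field(default_factory=dict)


def parse_start(raw: str) -> CallInfo:
    """Parse a start event and return the metadata to cache for the session."""
    message = json.loads(raw)
    if message.get("event") != "start":
        raise ValueError("not a start event")
    start = message["start"]
    media_format = start["media_format"]
    # Compare key by key so that extra fields, if any, do not cause a rejection.
    if any(media_format.get(key) != value for key, value in EXPECTED_FORMAT.items()):
        raise ValueError(f"unsupported media_format: {media_format}")
    metadata = start.get("metadata", {})
    return CallInfo(
        stream_sid=start["stream_sid"],
        call_sid=start["call_sid"],
        phone_number=metadata.get("phone_number", ""),
        direction=metadata.get("direction", ""),
        custom=metadata.get("custom", {}),
    )
```

Keeping call_sid and stream_sid on this record also makes it easy to include both in every log line, as recommended in section 7.3.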

6.1.3 media

Caller audio streamed to the bot, sent continuously in chunks of about 20 ms.

{
  "event": "media",
  "sequence_number": 42,
  "media": {
    "track": "inbound",
    "chunk": 41,
    "timestamp": 1776326027630,
    "payload": "<base64_pcm_data>"
  }
}

Field	Type	Description
`track`	string	Always `"inbound"`
`chunk`	int	Chunk sequence number, starting from 0
`timestamp`	int	Unix ms
`payload`	string	Base64 of raw PCM bytes (320 bytes/chunk)
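A bot might collect inbound audio as sketched below. The class name InboundAudio is hypothetical. The specification does not say whether the Gateway ever skips chunk numbers, so the gap counter is only an assumption-driven diagnostic for logs, not required behaviour.

```python
import base64
import json


class InboundAudio:
    """Collects caller PCM and counts gaps in the chunk numbering."""

    def __init__(self):
        self.next_chunk = 0     # chunk numbers start from 0
        self.missing = 0        # number of chunk numbers never seen
        self.pcm = bytearray()  # raw pcm_s16le bytes received so far

    def feed(self, raw: str) -> bytes:
        """Process one media event and return its decoded PCM bytes."""
        media = json.loads(raw)["media"]
        chunk = media["chunk"]
        if chunk > self.next_chunk:
            self.missing += chunk - self.next_chunk
        self.next_chunk = max(self.next_chunk, chunk + 1)
        data = base64.b64decode(media["payload"], validate=True)
        self.pcm.extend(data)
        return data
```

The accumulated bytes in pcm can then be handed to ASR in whatever block size the recognizer expects.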

6.1.4 mark

Echoes a marker that the bot sent. The Gateway sends mark back to the bot once the audio preceding that marker has finished playing for the caller.

{
  "event": "mark",
  "sequence_number": 80,
  "mark": {
    "name": "greeting_done"
  }
}

6.1.5 stop

Signals that the call has ended. The Gateway sends this event, then closes the WS.

{
  "event": "stop",
  "sequence_number": 999,
  "stop": {
    "reason": "caller_hangup",
    "call_sid": "call-abc123"
  }
}

`reason`	Description
`caller_hangup`	Caller hangs up
`ai_hangup`	Bot requests to stop
`transferred`	The call has been successfully transferred
`timeout`	Idle timeout
`error`	System error

6.2 Bot → Gateway

6.2.1 media

Audio response the bot plays for the caller.

{
  "event": "media",
  "media": {
    "payload": "<base64_pcm_data>"
  }
}

Audio must be PCM 8kHz mono s16le (matching start.media_format).
Chunk size should be 20–100 ms to reduce latency.
chunk and timestamp are not needed; the Gateway sequences the audio automatically.
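Splitting synthesized audio into media messages could look like the sketch below. The generator name media_messages is hypothetical, and it assumes the Gateway accepts a final chunk shorter than the chosen chunk size, which the specification does not state.

```python
import base64
import json

BYTES_PER_MS = 16  # 8000 samples/s * 2 bytes per sample / 1000


def media_messages(pcm: bytes, chunk_ms: int = 20):
    """Yield JSON media messages covering pcm in chunks of chunk_ms milliseconds."""
    if not 20 <= chunk_ms <= 100:
        raise ValueError("chunk_ms should stay within the recommended 20-100 ms")
    if len(pcm) % 2:
        raise ValueError("PCM data must contain whole 16-bit samples")
    step = chunk_ms * BYTES_PER_MS  # always even, so samples are never split
    for offset in range(0, len(pcm), step):
        payload = base64.b64encode(pcm[offset:offset + step]).decode("ascii")
        yield json.dumps({"event": "media", "media": {"payload": payload}})
```

Each yielded string is sent as one WebSocket text frame. Smaller chunks lower the latency before playback starts, at the cost of more messages per second.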

6.2.2 mark

Sets a checkpoint. The Gateway echoes the mark back once all audio sent before it has finished playing for the caller.

{
  "event": "mark",
  "mark": {
    "name": "question_1_done"
  }
}

The bot uses the echo to know when the caller has finished hearing the audio, and can then activate ASR to listen for the caller's reply.
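One way to track outstanding marks is sketched below. MarkTracker is a hypothetical helper, and the decision to start ASR once no marks remain pending is an illustration of the half-duplex turn-taking, not a prescribed implementation.

```python
import json


class MarkTracker:
    """Remembers which marks were sent and recognises their echoes."""

    def __init__(self):
        self._pending = set()

    def make_mark(self, name: str) -> str:
        """Return the JSON mark message to send after the last media chunk of a turn."""
        self._pending.add(name)
        return json.dumps({"event": "mark", "mark": {"name": name}})

    def on_echo(self, raw: str) -> bool:
        """Handle a mark event from the Gateway; True if it matches a mark we sent."""
        name = json.loads(raw)["mark"]["name"]
        if name in self._pending:
            self._pending.discard(name)
            return True
        return False

    @property
    def playback_pending(self) -> bool:
        """True while at least one mark has not been echoed yet."""
        return bool(self._pending)
```

A bot would typically send media chunks, then make_mark("question_1_done"), and enable ASR when on_echo returns True and playback_pending is False.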

6.2.3 transfer

Transfer the call to an agent or queue.

{
  "event": "transfer",
  "transfer": {
    "target": "agent_extension_or_queue",
    "context": "default",
    "on_complete": "hangup_bot"
  }
}

Field	Type	Description
`target`	string	Extension / queue ID / phone number
`context`	string	Routing context, received from the Gateway during registration
`on_complete`	string	`hangup_bot` (close the bot's WS) or `keep_alive` (keep the WS open)

After receiving transfer, the Gateway will: play the remaining audio in the buffer (drain) → perform the transfer → send stop with reason transferred to the bot → close the WS.
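A small builder helps avoid sending a malformed transfer request. The function name transfer_message is hypothetical, and the default values simply mirror the example above.

```python
import json

ON_COMPLETE_VALUES = {"hangup_bot", "keep_alive"}


def transfer_message(target: str, context: str = "default",
                     on_complete: str = "hangup_bot") -> str:
    """Build the JSON transfer message for the given extension, queue ID, or phone number."""
    if not target:
        raise ValueError("target must not be empty")
    if on_complete not in ON_COMPLETE_VALUES:
        raise ValueError(f"on_complete must be one of {sorted(ON_COMPLETE_VALUES)}")
    return json.dumps({
        "event": "transfer",
        "transfer": {
            "target": target,
            "context": context,
            "on_complete": on_complete,
        },
    })
```

After sending this message the bot should stop producing audio and wait for the Gateway's stop event rather than closing the WS itself.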

6.2.4 stop

Ends the call at the bot's initiative.

{
  "event": "stop",
  "stop": {
    "reason": "conversation_complete"
  }
}

The Gateway will play the remaining audio in the buffer then end the call.

7. Error Handling

7.1 Close codes

Code	Meaning
`1000`	Normal closure
`1002`	Protocol error — invalid JSON format, missing required fields
`1008`	Policy violation — `api_key` incorrect or missing
`1011`	Internal error
`4001`	Audio format mismatch
`4002`	Timeout

7.2 Validation

The Gateway will close the WS with code 1002 if:

The message is not valid JSON
The event field is missing
event is not in the whitelist
media.payload is not valid base64
Audio decoding fails more than 5 consecutive times
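A bot can mirror most of these checks on its own outgoing messages before sending them. The sketch below assumes the whitelist for bot messages is the four events of section 6.2; the function name protocol_violation is hypothetical, and the consecutive decode failure rule is not covered.

```python
import base64
import binascii
import json

# Assumed whitelist for Bot -> Gateway messages (section 6.2).
BOT_EVENTS = {"media", "mark", "transfer", "stop"}


def protocol_violation(raw: str):
    """Return a short reason if raw would likely trigger close code 1002, else None."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid JSON"
    if not isinstance(message, dict) or "event" not in message:
        return "missing event"
    event = message["event"]
    if not isinstance(event, str) or event not in BOT_EVENTS:
        return "unknown event"
    if event == "media":
        media = message.get("media")
        payload = media.get("payload") if isinstance(media, dict) else None
        if not isinstance(payload, str):
            return "missing media.payload"
        try:
            base64.b64decode(payload, validate=True)
        except (binascii.Error, ValueError):
            return "invalid base64"
    return None
```

Running this check in tests or in a debug build catches formatting mistakes locally instead of through a dropped call.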

7.3 Recommendations

Log the full call_sid and stream_sid in every log line
Validate JSON before processing; do not let a malformed message crash the WS handler
Rate limit audio output: do not send faster than 2x real time
Graceful shutdown: after sending stop or transfer, wait for the Gateway to close the WS
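The 2x real-time limit can be enforced with a simple pacer. This is one possible approach, not the reference implementation: OutputPacer and its methods are hypothetical names, and the clock is injectable only to make the logic testable.

```python
import time


class OutputPacer:
    """Computes how long to wait so audio is never sent faster than max_speed x real time."""

    def __init__(self, max_speed: float = 2.0, clock=time.monotonic):
        self._max_speed = max_speed
        self._clock = clock
        self._started_at = None
        self._audio_sent_s = 0.0

    def reset(self) -> None:
        """Start a new turn so idle time between turns does not build up a burst budget."""
        self._started_at = None
        self._audio_sent_s = 0.0

    def delay_before(self, chunk_ms: int) -> float:
        """Return the seconds to wait before sending the next chunk of chunk_ms audio."""
        now = self._clock()
        if self._started_at is None:
            self._started_at = now
        # Earliest time at which the audio already sent fits within the speed limit.
        earliest = self._started_at + self._audio_sent_s / self._max_speed
        self._audio_sent_s += chunk_ms / 1000
        return max(0.0, earliest - now)
```

In an asyncio handler the call would typically be `await asyncio.sleep(pacer.delay_before(20))` immediately before each send, with reset() at the start of every bot turn.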

8. Reference Implementation

File	Description
`examples/bot_server_python.py`	Reference bot — Python asyncio + websockets
`examples/bot_server_nodejs.js`	Reference bot — Node.js (ws)
`examples/mock_gateway.py`	Mock Gateway for the vendor to test the bot locally

9. Integration Checklist

WS server verifies api_key from the query parameters
Handles connected, start, media, mark, stop
Sends media responses in the correct format (PCM 8kHz mono s16le)
Implements mark to sync on when the caller has finished listening
Implements transfer for handing the call to an agent
Logs the full call_sid and stream_sid
Graceful shutdown
Tests with mock_gateway.py pass
URL + api_key sent to the integration team
Smoke test of 10 calls on staging

10. FAQ

Q: Can the bot send audio in multiple chunks at once?
A: Yes, but the total audio duration must not exceed 500ms/message to avoid jitter.

Q: Can I use binary frames instead of JSON?
A: Not supported in v1.

Q: Expected latency?
A: End-to-end (caller speaks → bot responds): target < 800ms. Network round-trip Gateway↔Bot should be < 100ms.

Q: Is barge-in (caller interrupting the bot) supported?
A: No, v1 operates half-duplex — the bot finishes before the Gateway forwards the caller's audio. It will be supported in v2.

Q: When does the Gateway close WS?
A: After sending stop, or caller hangup, or timeout, or protocol error.

Q: Is support for sample rates other than 8kHz or codecs other than PCM available?
A: Not in v1. The format is fixed at PCM 8kHz mono s16le.

Q: Does the bot need to handle DTMF?
A: No, v1 does not forward DTMF to the bot.

11. Contact

Contact the integration team to:

Review implementation
Configure api_key and URL on staging/production
Debug real calls (call_sid must be provided)

Contact information: please contact Alohub for integration support and to receive credentials.