
Voicebot Integration

Nhữ Hào Nam·5/5/2026

AI Voice Bot integration via WebSocket

This document describes the WebSocket protocol for integrating an AI Voice Bot with Alohub's Voice Gateway. The vendor hosts a WebSocket server that follows the specification below. The Gateway connects to it when a call starts and exchanges audio and control events for the duration of the call.

Note: The protocol is modeled on Twilio Media Streams. Each call uses its own WebSocket connection, which is never reused. Contact Alohub for integration support.

2. Architecture

  • The Voice Gateway is a WebSocket client, initiating a connection to the bot when a new call starts.

  • The AI Voice Bot is a WebSocket server, listening for connections, processing audio, and returning audio responses.

  • Each call = 1 independent WebSocket connection, not reused.

3. WebSocket Connection

3.1 URL + Authentication

The vendor provides the WebSocket URL, and the Gateway passes api_key as a query parameter:

wss://bot.vendor.com/ws/voice?api_key=<KEY>

The bot verifies api_key during the handshake. If the key is incorrect, the bot closes the WebSocket with close code 1008.
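The check above could look like the following minimal sketch. It assumes the bot can read the request path of the incoming handshake; how that path is obtained, and how the connection is then closed with 1008, depends on the WebSocket library in use. `EXPECTED_API_KEY` and `is_authorized` are hypothetical names, not part of the specification.

```python
import hmac
from urllib.parse import parse_qs, urlsplit

# Hypothetical secret; in practice it would come from configuration.
EXPECTED_API_KEY = "example-key"

# Close code the specification asks for when api_key is wrong or missing.
POLICY_VIOLATION = 1008


def is_authorized(request_path: str, expected_key: str = EXPECTED_API_KEY) -> bool:
    """Return True if the query string carries the expected api_key."""
    query = parse_qs(urlsplit(request_path).query)
    supplied = query.get("api_key", [""])[0]
    # Constant-time comparison avoids leaking key prefixes through timing.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```

A handler would call `is_authorized(path)` first and close the connection with `POLICY_VIOLATION` when it returns False.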

3.2 Technical Requirements

| Requirement | Value |
|---|---|
| Protocol | WebSocket (RFC 6455) |
| Scheme | wss:// (TLS required for production) |
| Message format | JSON, text frames, UTF-8 |
| Audio transport | Base64 in field media.payload |

3.3 Timeouts

| Timeout | Default value |
|---|---|
| Connect timeout | 5 seconds |
| Idle timeout (no media) | 30 seconds |
| Max session duration | 900 seconds |

When the call ends, the Gateway sends a stop event, then closes the WebSocket with close code 1000 (normal). The vendor does not need to reconnect.

4. Audio Format

| Property | Value |
|---|---|
| Codec | PCM signed 16-bit little-endian (pcm_s16le) |
| Sample rate | 8000 Hz |
| Channels | 1 (mono) |
| Frame size | 20 ms/chunk (160 samples = 320 bytes) |
| Transport | Base64 string in JSON |

Note: The audio format is fixed in v1. Both the Gateway and the bot must use this exact format.
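The frame-size figures follow directly from the format parameters. The short calculation below reproduces them and also shows how long one 20 ms frame becomes once Base64-encoded; the constant names are illustrative only.

```python
import base64

SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 2  # 16-bit signed PCM
CHANNELS = 1          # mono
FRAME_MS = 20

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000
bytes_per_frame = samples_per_frame * BYTES_PER_SAMPLE * CHANNELS

# One frame of silence, encoded the way it would travel in media.payload.
silence_frame = bytes(bytes_per_frame)
payload = base64.b64encode(silence_frame).decode("ascii")

print(samples_per_frame, bytes_per_frame, len(payload))  # 160 320 428
```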

5. Processing Flow (Sequence Diagrams)

In v1, the bot operates in a half-duplex model: the bot finishes the audio response before the Gateway forwards the caller's audio. The bot does not need to handle barge-in.

5.1 Happy Path & Transfer Agent

  Caller              Voice Gateway             AI Voice Bot
    |                       |                        |
    |        +---------------------------------------+
    |        |       Handshake & session setup       |
    |        +---------------------------------------+
    |                       |  (1) WS connect        |
    |                       |----------------------->|
    |                       |  (2) 101 Switching     |
    |                       |<-----------------------|
    |                       |  (3) connected         |
    |                       |----------------------->|
    |                       |  (4) start (metadata)  |
    |                       |----------------------->|
    |                       |                        |
    +------------------------------------------------+
    | loop [Each conversation turn]                  |
    |    +---------------------------------------+   |
    |    |              Bot speaks               |   |
    |    +---------------------------------------+   |
    |    |                  |  (5) media (chunks)    |
    |    |                  |<-----------------------|
    |    |  (6) play audio  |                        |
    |    |<-----------------|                        |
    |    |                  |  (7) mark {turn_done}  |
    |    |                  |<-----------------------|
    |    |        +---------------------+            |
    |    |        |  Wait for playback  |            |
    |    |        +---------------------+            |
    |    |                  |  (8) mark echo         |
    |    |                  |----------------------->|
    |    +---------------------------------------+   |
    |    |             Caller speaks             |   |
    |    +---------------------------------------+   |
    |    | (9) caller speaks|                        |
    |    |----------------->|                        |
    |    |                  | (10) media (inbound)   |
    |    |                  |----------------------->|
    +------------------------------------------------+
    |                       |                        |
    |    +---------------------------------------+   |
    |    |        Transfer call to agent         |   |
    |    +---------------------------------------+   |
    |                       | (11) transfer target   |
    |                       |<-----------------------|
    | (12) transfer to agent|                        |
    |<----------------------|                        |
    |                       | (13) stop (transferred)|
    |                       |----------------------->|
    |                       | (14) close WS          |
    |                       |-----------X------------|

5.2 End of Conversation & Drain Buffer

          Voice Gateway             AI Voice Bot
                |                        |
      +------------------------------------------+
      |         Termination phase                |
      +------------------------------------------+
                |  (1) media (last chunk)        |
                |<-------------------------------|
                |  (2) stop {conv_complete}      |
                |<-------------------------------|
                |                                |
      +----------------------------+             |
      |        Drain buffer        |             |
      |    play remaining audio    |             |
      +----------------------------+             |
                |                                |
                |  (3) stop (ack)                |
                |------------------------------->|
                |  (4) close WS                  |
                |---------------X----------------|

6. Event Specification

Every message is UTF-8 encoded JSON sent in a WebSocket text frame.

6.1 Gateway → Bot

6.1.1 connected

Sent once when the WS connection is successfully established.

{
  "event": "connected",
  "sequence_number": 0
}

6.1.2 start

Sent once after connected. Contains call metadata. The vendor should cache this information throughout the session.

{
  "event": "start",
  "sequence_number": 1,
  "start": {
    "stream_sid": "MZxxxxxxxxxxxxxxxx",
    "call_sid": "call-abc123",
    "media_format": {
      "encoding": "pcm_s16le",
      "sample_rate": 8000,
      "channels": 1
    },
    "metadata": {
      "phone_number": "0900000000",
      "direction": "outbound",
      "custom": {
        "key1": "value1",
        "key2": "value2"
      }
    }
  }
}

| Field | Type | Description |
|---|---|---|
| stream_sid | string | Unique ID of the WebSocket stream |
| call_sid | string | Call ID, used for logging/tracing |
| media_format | object | Always PCM 8kHz mono s16le in v1 |
| metadata.phone_number | string | Caller's phone number |
| metadata.direction | string | outbound or inbound |
| metadata.custom | object | Dynamic fields depending on each bot's configuration; schema defined during bot registration |
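One way to cache the call metadata for the session is to parse the start event into an immutable record, as sketched below. `CallSession` and `parse_start` are hypothetical names, and treating any other media_format as an error is an assumption based on the format being fixed in v1.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class CallSession:
    """Metadata from the start event, kept for the lifetime of the call."""
    stream_sid: str
    call_sid: str
    phone_number: str
    direction: str
    custom: Dict[str, Any] = field(default_factory=dict)


def parse_start(raw: str) -> CallSession:
    msg = json.loads(raw)
    if msg.get("event") != "start":
        raise ValueError("not a start event")
    start = msg["start"]
    fmt = start["media_format"]
    # v1 fixes the audio format, so anything else is treated as a mismatch.
    if (fmt["encoding"], fmt["sample_rate"], fmt["channels"]) != ("pcm_s16le", 8000, 1):
        raise ValueError(f"unsupported media_format: {fmt}")
    meta = start.get("metadata", {})
    return CallSession(
        stream_sid=start["stream_sid"],
        call_sid=start["call_sid"],
        phone_number=meta.get("phone_number", ""),
        direction=meta.get("direction", ""),
        custom=dict(meta.get("custom", {})),
    )
```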

6.1.3 media

Caller audio streamed to the bot, sent continuously in chunks of about 20 ms.

{
  "event": "media",
  "sequence_number": 42,
  "media": {
    "track": "inbound",
    "chunk": 41,
    "timestamp": 1776326027630,
    "payload": "<base64_pcm_data>"
  }
}

| Field | Type | Description |
|---|---|---|
| track | string | Always "inbound" |
| chunk | int | Chunk sequence number, starting from 0 |
| timestamp | int | Unix time in milliseconds |
| payload | string | Base64 of raw PCM bytes (320 bytes/chunk) |
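To hand the audio to an ASR engine, the bot first has to turn the payload back into PCM samples. The sketch below uses only the standard library; `decode_payload` is a hypothetical helper name.

```python
import array
import base64
import sys


def decode_payload(payload_b64: str) -> array.array:
    """Decode media.payload into signed 16-bit samples."""
    raw = base64.b64decode(payload_b64, validate=True)
    if len(raw) % 2:
        raise ValueError("PCM s16le data must have an even byte count")
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    # The wire format is little-endian; swap on big-endian hosts.
    if sys.byteorder == "big":
        samples.byteswap()
    return samples
```

A full 20 ms chunk decodes to 160 samples.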

6.1.4 mark

Echoes a marker previously sent by the bot. The Gateway sends mark back to the bot once the corresponding audio has finished playing for the caller.

{
  "event": "mark",
  "sequence_number": 80,
  "mark": {
    "name": "greeting_done"
  }
}

6.1.5 stop

The call ends. The Gateway sends this event, then closes the WebSocket.

{
  "event": "stop",
  "sequence_number": 999,
  "stop": {
    "reason": "caller_hangup",
    "call_sid": "call-abc123"
  }
}

| reason | Description |
|---|---|
| caller_hangup | Caller hangs up |
| ai_hangup | Bot requests to stop |
| transferred | The call has been successfully transferred |
| timeout | Idle timeout |
| error | System error |

6.2 Bot → Gateway

6.2.1 media

Audio response the bot plays for the caller.

{
  "event": "media",
  "media": {
    "payload": "<base64_pcm_data>"
  }
}
  • Must be PCM 8kHz mono s16le (matching start.media_format)

  • Chunk size should be 20–100ms to reduce latency

  • No need for chunk/timestamp — the Gateway sequences automatically
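As an illustration of the points above, synthesized PCM can be cut into fixed-size chunks and wrapped in media events as follows. `media_messages` is a hypothetical helper; it assumes the input is already PCM 8kHz mono s16le.

```python
import base64
import json
from typing import Iterator

BYTES_PER_MS = 16  # 8000 samples/s * 2 bytes per sample / 1000


def media_messages(pcm: bytes, chunk_ms: int = 20) -> Iterator[str]:
    """Yield one JSON media event per chunk of PCM audio."""
    if not 20 <= chunk_ms <= 100:
        raise ValueError("chunk_ms should stay within the recommended 20-100 ms")
    step = chunk_ms * BYTES_PER_MS
    for offset in range(0, len(pcm), step):
        chunk = pcm[offset:offset + step]
        yield json.dumps({
            "event": "media",
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        })
```

The last chunk may be shorter than the others when the audio length is not a multiple of the chunk size.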

6.2.2 mark

Sets a checkpoint. The Gateway echoes it back once all audio sent before the mark has finished playing for the caller.

{
  "event": "mark",
  "mark": {
    "name": "question_1_done"
  }
}

The bot uses this to know when the caller has finished hearing the audio, so that it can activate ASR and listen for the reply.
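This handoff can be modeled as a small state tracker. The sketch below is an illustration rather than part of the protocol: `TurnState` and `TurnTracker` are hypothetical names, and it assumes the bot sends exactly one mark at the end of each spoken turn.

```python
from enum import Enum, auto
from typing import Optional


class TurnState(Enum):
    BOT_SPEAKING = auto()     # bot is sending media
    AWAITING_MARK = auto()    # mark sent, waiting for the Gateway's echo
    CALLER_SPEAKING = auto()  # playback finished, ASR should be listening


class TurnTracker:
    def __init__(self) -> None:
        self.state = TurnState.BOT_SPEAKING
        self.pending_mark: Optional[str] = None

    def bot_sent_mark(self, name: str) -> None:
        self.pending_mark = name
        self.state = TurnState.AWAITING_MARK

    def gateway_echoed_mark(self, name: str) -> bool:
        """Return True if this echo ends the bot's turn (start ASR now)."""
        if self.state is TurnState.AWAITING_MARK and name == self.pending_mark:
            self.pending_mark = None
            self.state = TurnState.CALLER_SPEAKING
            return True
        return False

    def bot_starts_reply(self) -> None:
        self.state = TurnState.BOT_SPEAKING
```

Ignoring echoes whose name does not match the pending mark keeps a stale or duplicated echo from opening the microphone early.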

6.2.3 transfer

Transfer the call to an agent or queue.

{
  "event": "transfer",
  "transfer": {
    "target": "agent_extension_or_queue",
    "context": "default",
    "on_complete": "hangup_bot"
  }
}

| Field | Type | Description |
|---|---|---|
| target | string | Extension / queue ID / phone number |
| context | string | Routing context, received from the Gateway during registration |
| on_complete | string | hangup_bot (close the bot's WS) or keep_alive (keep the WS open) |

After receiving transfer, the Gateway will: play the remaining audio in the buffer (drain) → perform the transfer → send stop with reason transferred to the bot → close the WebSocket.
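A small builder can keep malformed transfer requests from reaching the Gateway. The sketch below is illustrative; `transfer_message` is a hypothetical helper, and the two allowed on_complete values are the ones listed in the field table.

```python
import json

ON_COMPLETE_VALUES = {"hangup_bot", "keep_alive"}


def transfer_message(target: str, context: str = "default",
                     on_complete: str = "hangup_bot") -> str:
    """Build the JSON text for a transfer event."""
    if not target:
        raise ValueError("target must not be empty")
    if on_complete not in ON_COMPLETE_VALUES:
        raise ValueError(f"on_complete must be one of {sorted(ON_COMPLETE_VALUES)}")
    return json.dumps({
        "event": "transfer",
        "transfer": {
            "target": target,
            "context": context,
            "on_complete": on_complete,
        },
    })
```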

6.2.4 stop

The bot proactively ends the call.

{
  "event": "stop",
  "stop": {
    "reason": "conversation_complete"
  }
}

The Gateway will play the remaining audio in the buffer then end the call.

7. Error Handling

7.1 Close codes

| Code | Meaning |
|---|---|
| 1000 | Normal closure |
| 1002 | Protocol error: invalid JSON format, missing required fields |
| 1008 | Policy violation: api_key incorrect or missing |
| 1011 | Internal error |
| 4001 | Audio format mismatch |
| 4002 | Timeout |
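In bot code these values are easier to read as named constants. The enum below simply restates the table; the member names and the `is_application_code` helper are illustrative. Codes 4000-4999 are the range RFC 6455 reserves for private use, which is where the two protocol-specific codes fall.

```python
from enum import IntEnum


class CloseCode(IntEnum):
    NORMAL = 1000
    PROTOCOL_ERROR = 1002
    POLICY_VIOLATION = 1008
    INTERNAL_ERROR = 1011
    AUDIO_FORMAT_MISMATCH = 4001
    TIMEOUT = 4002


def is_application_code(code: int) -> bool:
    """True for codes in the private-use range defined by RFC 6455."""
    return 4000 <= code <= 4999
```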

7.2 Validation

The Gateway will close the WebSocket with code 1002 if:

  • The message is not valid JSON

  • The event field is missing

  • event is not in the whitelist

  • media.payload is not valid base64

  • Audio decoding fails more than 5 consecutive times
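A bot can run the same checks on its own outgoing messages before sending them, for example in unit tests. The sketch below mirrors the first four rules; the whitelist shown is an assumption (the Bot → Gateway events from section 6.2), and `validate_message` is a hypothetical name.

```python
import base64
import binascii
import json

# Assumed whitelist: the Bot -> Gateway events described in section 6.2.
ALLOWED_EVENTS = {"media", "mark", "transfer", "stop"}


def validate_message(raw: str) -> dict:
    """Return the parsed message, or raise ValueError naming the broken rule."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("message is not valid JSON") from exc
    if not isinstance(msg, dict) or "event" not in msg:
        raise ValueError("missing field: event")
    if msg["event"] not in ALLOWED_EVENTS:
        raise ValueError(f"event not in whitelist: {msg['event']!r}")
    if msg["event"] == "media":
        media = msg.get("media")
        payload = media.get("payload") if isinstance(media, dict) else None
        if not isinstance(payload, str):
            raise ValueError("missing field: media.payload")
        try:
            base64.b64decode(payload, validate=True)
        except binascii.Error as exc:
            raise ValueError("media.payload is not valid base64") from exc
    return msg
```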

7.3 Recommendations

  • Log the complete call_sid and stream_sid in every log line

  • Validate JSON before processing, do not crash WS handler

  • Rate limit audio output: do not send > 2x real-time

  • Graceful shutdown: when sending stop/transfer, wait for the Gateway to close WS
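The rate-limit recommendation can be met by pacing sends so that consecutive chunks are never closer together than half their playback duration. The sketch below is one possible approach using asyncio; `send_paced` and `min_send_interval` are hypothetical names, and `send` stands in for whatever coroutine the chosen WebSocket library exposes for sending a text frame.

```python
import asyncio
import time
from typing import Awaitable, Callable, Iterable

MAX_SPEEDUP = 2.0  # recommendation: do not send faster than 2x real time


def min_send_interval(chunk_ms: int, speedup: float = MAX_SPEEDUP) -> float:
    """Smallest gap in seconds between chunks of chunk_ms milliseconds."""
    return chunk_ms / 1000 / speedup


async def send_paced(chunks: Iterable[str],
                     send: Callable[[str], Awaitable[None]],
                     chunk_ms: int = 20) -> None:
    """Send chunks in order, never faster than MAX_SPEEDUP times real time."""
    interval = min_send_interval(chunk_ms)
    next_slot = time.monotonic()
    for chunk in chunks:
        delay = next_slot - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        await send(chunk)
        # Schedule the next send relative to this one so bursts cannot build up.
        next_slot = time.monotonic() + interval
```

With 20 ms chunks this leaves at least 10 ms between sends.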

8. Reference Implementation

| File | Description |
|---|---|
| examples/bot_server_python.py | Reference bot (Python asyncio + websockets) |
| examples/bot_server_nodejs.js | Reference bot (Node.js, ws) |
| examples/mock_gateway.py | Mock Gateway for the vendor to test the bot locally |

9. Integration Checklist

  • WS server verifies api_key from query parameters

  • Handles connected, start, media, mark, stop

  • Sends media responses in the correct format (PCM 8kHz mono s16le)

  • Implements mark to sync when the caller has finished listening

  • Implements transfer when the call needs to go to an agent

  • Logs the complete call_sid and stream_sid

  • Graceful shutdown

  • Tests with mock_gateway.py pass

  • Send URL + api_key to the integration team

  • Smoke test 10 calls on staging

10. FAQ

Q: Can the bot send audio in multiple chunks at once?
A: Yes, but the total audio duration must not exceed 500ms/message to avoid jitter.

Q: Can I use binary frames instead of JSON?
A: Not supported in v1.

Q: Expected latency?
A: End-to-end (caller speaks → bot responds): target < 800ms. Network round-trip Gateway↔Bot should be < 100ms.

Q: Is barge-in (caller interrupting the bot) supported?
A: No, v1 operates half-duplex — the bot finishes before the Gateway forwards the caller's audio. It will be supported in v2.

Q: When does the Gateway close WS?
A: After it sends stop, or on caller hangup, timeout, or protocol error.

Q: Are sample rates other than 8kHz or codecs other than PCM supported?
A: Not in v1. The format is fixed at PCM 8kHz mono s16le.

Q: Does the bot need to handle DTMF?
A: No, v1 does not forward DTMF to the bot.

11. Contact

Contact the integration team to:

  • Review implementation

  • Configure api_key and URL on staging/production

  • Debug real calls (need to provide call_sid)

Contact information: please contact Alohub for integration support and to receive credentials.

Updated: 5/5/2026