Custom LLM integration | ElevenLabs Documentation

Overview

ElevenAgents’ native Twilio integration covers the case where ElevenLabs hosts the LLM. Reach for this guide when you need full control of the LLM brain on your own server (your own model, RAG pipeline, function-call routing, or other server-side reasoning) while the agent still answers on a Twilio phone number.

The custom-LLM half is delivered by the Speech Engine SDK, which opens a WebSocket between ElevenLabs and your server so your LLM can stream responses back as the call unfolds. The Twilio half uses Media Streams to relay call audio into the agent.

Architecture

The Speech Engine SDK exposes two WebSocket endpoints in the agent’s conversation system:

The brain WebSocket runs on your server. ElevenLabs connects to it to deliver transcripts and receive LLM-generated text.
The conversation WebSocket runs on ElevenLabs. Clients connect to it to send audio in and receive synthesised audio back. The Twilio bridge connects via a signed URL and relays μ-law audio in both directions.

Because Twilio Media Streams and the Speech Engine both speak ulaw_8000, the bridge relays base64-encoded audio with no transcoding.
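
As a rough illustration of why no transcoding is needed, the relay only has to rewrap the same base64 payload in a different JSON envelope. The sketch below reuses the message shapes from the bridge code in this guide; the helper names `twilio_to_engine` and `engine_to_twilio` are hypothetical and not part of either SDK, and the 20 ms frame size is an assumption about typical Twilio media frames.

```python
import base64
import json


def twilio_to_engine(twilio_msg: str) -> str:
    """Rewrap a Twilio `media` event as a conversation WebSocket audio chunk."""
    event = json.loads(twilio_msg)
    # The payload is already base64-encoded 8 kHz mu-law, so it passes through untouched.
    return json.dumps({"user_audio_chunk": event["media"]["payload"]})


def engine_to_twilio(engine_msg: str, stream_sid: str) -> str:
    """Rewrap a Speech Engine `audio` event as a Twilio `media` event."""
    event = json.loads(engine_msg)
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": event["audio_event"]["audio_base_64"]},
    })


# Assumed frame: 20 ms of 8 kHz mu-law is 160 one-byte samples (0xFF encodes silence).
frame = base64.b64encode(b"\xff" * 160).decode("ascii")
```

Because neither helper decodes the audio, the bridge adds only JSON parsing overhead per frame.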

The bridge and the brain server can run in the same process if it is convenient — the example below combines them.

When to use this pattern

Both this guide and the native Twilio integration put an agent on a Twilio phone number. The difference is who owns the LLM:

Native integration: ElevenLabs hosts the LLM and you configure it through the agent. Simpler.
Custom LLM via Speech Engine SDK (this guide): you host the LLM on your own server. Full control over the model, RAG, function calls, and business logic. More moving parts.

If your LLM logic fits within the standard agent configuration, prefer native integration. Reach for this guide when your brain needs to run code on your own infrastructure.

For a custom LLM that doesn’t require Twilio at all (browser-only), see Custom LLM, which uses an OpenAI-compatible HTTP endpoint instead of the Speech Engine SDK.

Prerequisites

A Twilio account and a voice-capable phone number.
A Speech Engine resource. Follow the Speech Engine quickstart to create one and learn the brain-server pattern.
A public HTTPS tunnel (e.g. ngrok). Twilio dials your bridge over the public internet.
Python 3.9+ or Node.js 18+.

Configure the agent for μ-law audio

Twilio Media Streams uses 8 kHz μ-law audio. Configure the Speech Engine to accept and emit the same format so the bridge does not need to transcode.

import asyncio
import os
from elevenlabs import AsyncElevenLabs

elevenlabs = AsyncElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])


async def update_engine():
    await elevenlabs.speech_engine.update(
        speech_engine_id="seng_8k3m9xr4hjnfg983brhmhkd98n6",
        asr={"user_input_audio_format": "ulaw_8000"},
        tts={
            "model_id": "eleven_flash_v2",
            "agent_output_audio_format": "ulaw_8000",
        },
        speech_engine={
            "request_headers": {"x-api-key": os.environ["SHARED_SECRET"]},
        },
    )


asyncio.run(update_engine())

eleven_flash_v2 keeps text-to-speech latency low, which matters on a phone call. The request_headers block tells ElevenLabs to include x-api-key: <shared-secret> on every brain WebSocket connection — the brain server checks the header to ensure only your Speech Engine can reach it.
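
The header check itself can be sketched as a small helper. This is a minimal illustration, not the SDK's own mechanism: the function name `has_valid_secret` is hypothetical, and it assumes the headers arrive in a mapping keyed by the lowercase name `x-api-key` (aiohttp's request headers are case-insensitive, a plain `dict` is not).

```python
import hmac
from typing import Mapping


def has_valid_secret(headers: Mapping[str, str], expected: str) -> bool:
    """Return True only if the x-api-key header matches the shared secret."""
    supplied = headers.get("x-api-key", "")
    # compare_digest runs in time independent of where the values differ,
    # which avoids leaking the secret through response timing.
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))
```

A plain `!=` comparison also works functionally; the constant-time comparison is a defensive choice for an endpoint exposed on the public internet.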

Build the bridge server

The bridge serves three routes:

POST /incoming-call — Twilio webhook. Returns TwiML telling Twilio to open a Media Stream to /media-stream.
GET /media-stream — Twilio Media Streams WebSocket. Relays audio to and from the Speech Engine conversation WebSocket.
GET /ws — Brain WebSocket. ElevenLabs connects here when a conversation starts. Runs the standard engine.serve() / engine.attach() server.

Install dependencies

$ pip install "elevenlabs" "aiohttp" "twilio" "python-dotenv"

Mint a signed URL for the Speech Engine

The bridge requests a signed URL each time a new call arrives. The URL embeds the Speech Engine ID and a one-time signature, so the API key is used only for this server-side request and never travels over the conversation WebSocket.

import os

from elevenlabs import AsyncElevenLabs

elevenlabs = AsyncElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

async def signed_url() -> str:
    # Request a fresh signed URL for each call.
    response = await elevenlabs.conversational_ai.conversations.get_signed_url(
        agent_id=os.environ["SPEECH_ENGINE_ID"],
    )
    return response.signed_url

Serve the TwiML response

When a call arrives, Twilio POSTs to /incoming-call. The response is TwiML that opens a Media Stream to the bridge’s own /media-stream WebSocket.

import os

from aiohttp import web
from twilio.request_validator import RequestValidator

validator = RequestValidator(os.environ["TWILIO_AUTH_TOKEN"])


async def incoming_call(request: web.Request) -> web.Response:
    form = await request.post()
    signature = request.headers.get("X-Twilio-Signature", "")
    # Twilio signs the public URL it called. Behind a TLS-terminating tunnel the
    # local request is plain HTTP, so rebuild the URL from the forwarded headers.
    proto = request.headers.get("X-Forwarded-Proto") or request.scheme
    host = request.headers.get("X-Forwarded-Host") or request.host
    url = f"{proto}://{host}{request.path_qs}"
    if not validator.validate(url, dict(form), signature):
        return web.Response(status=403, text="forbidden")

    twiml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response><Connect>"
        f'<Stream url="wss://{host}/media-stream"/>'
        "</Connect></Response>"
    )
    return web.Response(text=twiml, content_type="text/xml")

RequestValidator (Python) and twilio.webhook({ validate: true }) (Node) check the X-Twilio-Signature header against TWILIO_AUTH_TOKEN. Without validation, anyone on the public internet could POST to /incoming-call and bill calls to your account.
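
To make the check less of a black box, the sketch below reproduces the signing scheme for form-encoded POST webhooks using only the standard library: the full request URL, followed by each POST parameter name and value sorted by name, is signed with HMAC-SHA1 keyed by the auth token and then base64-encoded. The function names are hypothetical and the sample values are made up; in the bridge itself, prefer the helper library.

```python
import base64
import hashlib
import hmac
from typing import Mapping


def twilio_signature(auth_token: str, url: str, params: Mapping[str, str]) -> str:
    """Compute the signature expected in X-Twilio-Signature for a form POST."""
    # Append each parameter name and value to the URL, sorted by name, no separators.
    data = url + "".join(f"{name}{params[name]}" for name in sorted(params))
    digest = hmac.new(auth_token.encode("utf-8"), data.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def signature_matches(auth_token: str, url: str, params: Mapping[str, str], header: str) -> bool:
    """Compare the computed signature with the header value in constant time."""
    expected = twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected.encode("utf-8"), header.encode("utf-8"))
```

Because the URL is part of the signed data, the scheme and host must match exactly what Twilio called. A request that reaches the server as `http://` behind a tunnel will not validate against a signature made for the public `https://` URL.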

Bridge the Media Stream

The Media Stream is a WebSocket that sends a sequence of JSON events: connected, start, media (the audio payload), and stop. The bridge opens a Speech Engine conversation WebSocket on start and relays audio in both directions until the stream closes.

import asyncio
import json

import aiohttp
from aiohttp import web


async def media_stream(request: web.Request) -> web.WebSocketResponse:
    twilio_ws = web.WebSocketResponse()
    await twilio_ws.prepare(request)

    stream_sid: str | None = None
    el_session: aiohttp.ClientSession | None = None
    el_ws: aiohttp.ClientWebSocketResponse | None = None
    pump_task: asyncio.Task | None = None

    async def pump_el_to_twilio(el: aiohttp.ClientWebSocketResponse):
        # Relay Speech Engine events to Twilio for the lifetime of the call.
        async for msg in el:
            if msg.type != aiohttp.WSMsgType.TEXT:
                continue
            event = json.loads(msg.data)
            etype = event.get("type")
            if etype == "audio":
                # Agent audio: forward the base64 payload unchanged.
                await twilio_ws.send_str(json.dumps({
                    "event": "media",
                    "streamSid": stream_sid,
                    "media": {"payload": event["audio_event"]["audio_base_64"]},
                }))
            elif etype == "interruption":
                # Caller barged in: drop audio Twilio has buffered but not played.
                await twilio_ws.send_str(json.dumps({
                    "event": "clear",
                    "streamSid": stream_sid,
                }))
            elif etype == "ping":
                # Keepalive: echo the event_id back as a pong.
                event_id = event.get("ping_event", {}).get("event_id")
                await el.send_str(json.dumps({
                    "type": "pong", "event_id": event_id,
                }))

    try:
        async for msg in twilio_ws:
            if msg.type != aiohttp.WSMsgType.TEXT:
                continue
            event = json.loads(msg.data)

            if event["event"] == "start":
                # Stream metadata arrived: open the conversation WebSocket.
                stream_sid = event["start"]["streamSid"]
                el_session = aiohttp.ClientSession()
                el_ws = await el_session.ws_connect(await signed_url())
                await el_ws.send_str(json.dumps({
                    "type": "conversation_initiation_client_data",
                }))
                pump_task = asyncio.create_task(pump_el_to_twilio(el_ws))

            elif event["event"] == "media" and el_ws is not None:
                # Caller audio: forward the base64 payload unchanged.
                await el_ws.send_str(json.dumps({
                    "user_audio_chunk": event["media"]["payload"],
                }))

            elif event["event"] == "stop":
                break
    finally:
        # Tear down the Speech Engine side whenever the Twilio stream ends.
        if pump_task:
            pump_task.cancel()
        if el_ws and not el_ws.closed:
            await el_ws.close()
        if el_session and not el_session.closed:
            await el_session.close()

    return twilio_ws

The interruption event from Speech Engine triggers a clear event on the Twilio stream, which discards any buffered audio so barge-in works cleanly. The ping event is answered with pong to keep the conversation WebSocket alive.
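
The same event handling can be written as a pure function, which makes the routing easy to unit-test without opening any sockets. This is a sketch under the message shapes used in the bridge above; `route_engine_event` and the `("twilio" | "engine", message)` return convention are hypothetical.

```python
from typing import Optional, Tuple

# (destination, message): destination is "twilio" or "engine".
Action = Tuple[str, dict]


def route_engine_event(event: dict, stream_sid: str) -> Optional[Action]:
    """Decide what the bridge should send in response to a Speech Engine event."""
    etype = event.get("type")
    if etype == "audio":
        payload = event["audio_event"]["audio_base_64"]
        return ("twilio", {"event": "media", "streamSid": stream_sid,
                           "media": {"payload": payload}})
    if etype == "interruption":
        # Discard audio Twilio has buffered so the caller is not talked over.
        return ("twilio", {"event": "clear", "streamSid": stream_sid})
    if etype == "ping":
        event_id = event.get("ping_event", {}).get("event_id")
        return ("engine", {"type": "pong", "event_id": event_id})
    # Any other event type is ignored by the bridge.
    return None
```

The pump loop would then only need to parse each message, call the function, and send the result to the matching socket.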

Run the brain server alongside

The brain server is the standard Speech Engine server shown in the quickstart. The only addition is the shared-secret check on the WebSocket upgrade — accept the connection only if x-api-key matches the value you set on the Speech Engine.

import os

from aiohttp import web
from elevenlabs import AsyncElevenLabs

elevenlabs = AsyncElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])
SHARED_SECRET = os.environ["SHARED_SECRET"]


async def brain_ws(request: web.Request) -> web.StreamResponse:
    # Reject the upgrade unless the shared secret set on the Speech Engine is present.
    if request.headers.get("x-api-key") != SHARED_SECRET:
        return web.Response(status=401, text="unauthorized")

    ws = web.WebSocketResponse()
    await ws.prepare(request)

    engine = await elevenlabs.speech_engine.get(os.environ["SPEECH_ENGINE_ID"])
    session = engine.create_session(ws)

    async def on_transcript(transcript):
        # Replace this with your own LLM call; see the quickstart.
        await session.send_response("Hello, you've reached the demo.")

    session.on("user_transcript", on_transcript)
    await session.run()
    return ws


def make_app() -> web.Application:
    # incoming_call and media_stream are the handlers defined in the earlier snippets.
    app = web.Application()
    app.router.add_post("/incoming-call", incoming_call)
    app.router.add_get("/media-stream", media_stream)
    app.router.add_get("/ws", brain_ws)
    return app


if __name__ == "__main__":
    web.run_app(make_app(), port=3001)

See the Speech Engine quickstart for the full on_transcript implementation, including an LLM call and streamed response.

Point Twilio at the bridge

Start the bridge and a public tunnel

$ ngrok http 3001
$ python bridge.py

Note the https:// URL ngrok prints — Twilio will POST to it.

Update the Speech Engine ws_url

Set speech_engine.ws_url to the public WebSocket URL of your brain endpoint so ElevenLabs knows where to connect.

await elevenlabs.speech_engine.update(
    speech_engine_id="seng_8k3m9xr4hjnfg983brhmhkd98n6",
    speech_engine={"ws_url": "wss://abc123.ngrok.io/ws"},
)

Configure the Twilio number

In the Twilio console, open your phone number’s Voice Configuration:

A call comes in: Webhook
URL: https://abc123.ngrok.io/incoming-call
HTTP method: POST

If the number is attached to an Elastic SIP Trunk, detach it first — a Twilio number routes either to a trunk or to a webhook, not both.
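
Both the Speech Engine ws_url and the Twilio webhook URL come from the same tunnel hostname, so it can help to derive them in one place. The helper below is a hypothetical convenience, not part of either SDK; it assumes the route paths used in this guide.

```python
from urllib.parse import urlsplit, urlunsplit


def public_endpoints(tunnel_url: str) -> dict:
    """Derive the Twilio webhook URL and brain WebSocket URL from a tunnel URL."""
    parts = urlsplit(tunnel_url)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError("expected an https:// tunnel URL")
    return {
        # Twilio POSTs here when a call comes in.
        "twilio_webhook": urlunsplit(("https", parts.netloc, "/incoming-call", "", "")),
        # ElevenLabs connects here; the wss scheme pairs with the https tunnel.
        "brain_ws_url": urlunsplit(("wss", parts.netloc, "/ws", "", "")),
    }
```

For the example tunnel above, `public_endpoints("https://abc123.ngrok.io")` yields `https://abc123.ngrok.io/incoming-call` and `wss://abc123.ngrok.io/ws`.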

Call the number

Dial the number from any phone. The agent answers; speak into the call and you should hear the agent respond. With debug logging enabled, the bridge logs the call SID, conversation ID, and audio format for each turn.

Production considerations

Webhook validation: always validate the X-Twilio-Signature on /incoming-call. The example above uses Twilio’s helper library; do not skip this step.
Shared secret: enforce the shared secret on the brain WebSocket. Without it, anyone who guesses your ngrok URL can connect and impersonate ElevenLabs.
Stable host: ngrok free tier URLs change on every restart. Use a reserved ngrok domain or a real hostname so you do not need to update the Speech Engine ws_url and the Twilio webhook after every restart.
Latency: each call adds two network hops on top of the LLM time-to-first-token. Use a low-latency model and stream responses to keep perceived latency low.
One process or two: the example colocates the bridge and the brain on the same port so a single ngrok tunnel covers everything. In production, you can split them across two services as long as each has a public URL.
Prompt injection: spoken input from a phone call is untrusted user input. Validate transcripts before they influence tool calls or database writes.
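
One way to act on the prompt-injection point is to check every tool call the LLM proposes against an allowlist before executing it. The sketch below is illustrative only: the tool names, argument schemas, and `validate_tool_call` function are hypothetical and would be replaced by your own registry.

```python
from typing import Any, Dict

# Hypothetical registry: tool name -> required argument names and types.
ALLOWED_TOOLS: Dict[str, Dict[str, type]] = {
    "lookup_order": {"order_id": str},
    "book_callback": {"phone": str, "minutes_from_now": int},
}


def validate_tool_call(name: str, args: Dict[str, Any]) -> bool:
    """Accept a tool call only if the tool is known and its arguments fit the schema."""
    schema = ALLOWED_TOOLS.get(name)
    if schema is None:
        return False  # Unknown tool, possibly suggested by injected text.
    if set(args) != set(schema):
        return False  # Missing or unexpected arguments.
    return all(isinstance(args[key], expected) for key, expected in schema.items())
```

Shape checks like this do not replace authorization; a valid-looking `order_id` should still be checked against what the caller is allowed to see.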

Next steps

Twilio native integration: Use the hosted LLM instead of a custom one.
Speech Engine quickstart: Build the brain server end-to-end with a streaming LLM.
Custom LLM (OpenAI-compatible): An alternative custom-LLM mechanism using an OpenAI-compatible HTTP endpoint.
Python SDK reference: Classes, methods, and events for the Speech Engine Python SDK.
JavaScript SDK reference: Classes, methods, and events for the Speech Engine JavaScript SDK.