Custom LLM integration
Overview
ElevenAgents’ native Twilio integration covers the case where ElevenLabs hosts the LLM. Reach for this guide when you need full control of the LLM brain on your own server (your own model, RAG pipeline, function-call routing, or other server-side reasoning) while the agent still answers on a Twilio phone number.
The custom-LLM half is delivered by the Speech Engine SDK, which opens a WebSocket between ElevenLabs and your server so your LLM can stream responses back as the call unfolds. The Twilio half uses Media Streams to relay call audio into the agent.
Architecture
The Speech Engine SDK exposes two WebSocket endpoints in the agent’s conversation system:
- The brain WebSocket runs on your server. ElevenLabs connects to it to deliver transcripts and receive LLM-generated text.
- The conversation WebSocket runs on ElevenLabs. Clients connect to it to send audio in and receive synthesised audio back. The Twilio bridge connects via a signed URL and relays μ-law audio in both directions.
Because Twilio Media Streams and the Speech Engine both speak ulaw_8000, the bridge relays base64-encoded audio with no transcoding.
The bridge and the brain server can run in the same process if it is convenient — the example below combines them.
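To make the no-transcoding point concrete, here is a minimal sketch of the sizing arithmetic for ulaw_8000 and of a pass-through relay. The function names and the 160-byte example frame are illustrative assumptions, not part of the Speech Engine SDK or of Twilio's API.

```python
import base64

SAMPLE_RATE_HZ = 8000   # ulaw_8000: 8,000 samples per second
BYTES_PER_SAMPLE = 1    # mu-law stores each sample in a single byte


def payload_duration_ms(b64_payload: str) -> float:
    """Return how many milliseconds of audio a base64 mu-law payload holds."""
    raw = base64.b64decode(b64_payload)
    samples = len(raw) / BYTES_PER_SAMPLE
    return samples * 1000 / SAMPLE_RATE_HZ


def relay_payload(b64_payload: str) -> str:
    """Pass the payload through untouched, since both sides use the same codec."""
    return b64_payload


# Illustrative frame: 160 bytes of mu-law silence (0xFF) is 20 ms at 8 kHz.
frame = base64.b64encode(b"\xff" * 160).decode("ascii")
print(payload_duration_ms(frame))  # 20.0
```

Because the relay never decodes the audio, the bridge adds no codec latency and needs no audio library.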
When to use this pattern
Both this guide and the native Twilio integration put an agent on a Twilio phone number. The difference is who owns the LLM:
- Native integration: ElevenLabs hosts the LLM and you configure it through the agent. Simpler.
- Custom LLM via Speech Engine SDK (this guide): you host the LLM on your own server. Full control over the model, RAG, function calls, and business logic. More moving parts.
If your LLM logic fits within the standard agent configuration, prefer native integration. Reach for this guide when your brain needs to run code on your own infrastructure.
For a custom LLM that doesn’t require Twilio at all (browser-only), see Custom LLM, which uses an OpenAI-compatible HTTP endpoint instead of the Speech Engine SDK.
Prerequisites
- A Twilio account and a voice-capable phone number.
- A Speech Engine resource. Follow the Speech Engine quickstart to create one and learn the brain-server pattern.
- A public HTTPS tunnel (e.g. ngrok). Both Twilio and ElevenLabs connect to your server over the public internet.
- Python 3.9+ or Node.js 18+.
Configure the agent for μ-law audio
Twilio Media Streams uses 8 kHz μ-law audio. Configure the Speech Engine to accept and emit the same format so the bridge does not need to transcode.
eleven_flash_v2 keeps text-to-speech latency low, which matters on a phone call. The request_headers block tells ElevenLabs to include x-api-key: <shared-secret> on every brain WebSocket connection — the brain server checks the header to ensure only your Speech Engine can reach it.
Build the bridge server
The bridge serves three routes:
- POST /incoming-call: Twilio webhook. Returns TwiML telling Twilio to open a Media Stream to /media-stream.
- GET /media-stream: Twilio Media Streams WebSocket. Relays audio to and from the Speech Engine conversation WebSocket.
- GET /ws: Brain WebSocket. ElevenLabs connects here when a conversation starts. Runs the standard engine.serve() / engine.attach() server.
Mint a signed URL for the Speech Engine
The bridge requests a signed URL each time a new call arrives. The URL embeds the Speech Engine ID and a one-time signature, so the bridge never needs the raw API key.
Serve the TwiML response
When a call arrives, Twilio POSTs to /incoming-call. The response is TwiML that opens a Media Stream to the bridge’s own /media-stream WebSocket.
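As a rough illustration of the TwiML involved, the sketch below builds the document with the Python standard library only. The helper name build_twiml and the host abc123.ngrok.io are placeholders. In a real bridge you would more likely generate the same XML with Twilio's helper library.

```python
import xml.etree.ElementTree as ET


def build_twiml(public_host: str) -> str:
    """Build TwiML that connects the call to the bridge's /media-stream WebSocket."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    # <Connect><Stream> opens a bidirectional Media Stream to the given wss:// URL.
    ET.SubElement(connect, "Stream", {"url": f"wss://{public_host}/media-stream"})
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


twiml = build_twiml("abc123.ngrok.io")  # placeholder host
print(twiml)
```

Return the string from the /incoming-call handler with a Content-Type of text/xml.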
RequestValidator (Python) and twilio.webhook({ validate: true }) (Node) check the X-Twilio-Signature header against TWILIO_AUTH_TOKEN. Without validation, anyone on the public internet could POST to /incoming-call and bill calls to your account.
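To show what that validation does, here is a standard-library sketch of Twilio's documented signing scheme for form-encoded webhooks: HMAC-SHA1 over the full request URL followed by the POST parameters sorted by name, then base64-encoded. The function names are illustrative. In a real deployment, prefer the helper library's validator over a hand-rolled check.

```python
import base64
import hashlib
import hmac


def compute_twilio_signature(auth_token: str, url: str, params: dict) -> str:
    """Recompute the signature Twilio would send for this URL and form body."""
    # Concatenate the URL with each key and value, keys in sorted order, no separators.
    data = url + "".join(f"{key}{params[key]}" for key in sorted(params))
    digest = hmac.new(auth_token.encode("utf-8"), data.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_twilio_request(auth_token: str, url: str, params: dict, header_signature: str) -> bool:
    """Compare the recomputed signature with X-Twilio-Signature in constant time."""
    expected = compute_twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected.encode("ascii"), header_signature.encode("utf-8"))


# Example values are made up for illustration.
token = "example-auth-token"
url = "https://abc123.ngrok.io/incoming-call"
form = {"CallSid": "CA123", "From": "+15550100", "To": "+15550199"}
signature = compute_twilio_signature(token, url, form)
print(is_valid_twilio_request(token, url, form, signature))
```

The URL must be the exact public URL Twilio requested. Behind a tunnel, that is the public https:// address, not the local one your server sees.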
Bridge the Media Stream
The Media Stream is a WebSocket that sends a sequence of JSON events: connected, start, media (the audio payload), and stop. The bridge opens a Speech Engine conversation WebSocket on start and relays audio in both directions until the stream closes.
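The Twilio-facing half of that loop can be sketched as a pure function that maps each incoming JSON event to an action for the bridge. The event names and the start.streamSid and media.payload fields follow Twilio's Media Streams messages. The action labels such as open_conversation are hypothetical names for this example, and how the audio is actually written to the conversation WebSocket is left to the Speech Engine SDK.

```python
import json
from typing import Optional, Tuple

Action = Tuple[str, Optional[str]]


def handle_twilio_event(raw: str, state: dict) -> Action:
    """Translate one Twilio Media Streams message into a bridge action."""
    msg = json.loads(raw)
    event = msg.get("event")
    if event == "start":
        # Remember the stream SID; it is needed to send audio back to Twilio.
        state["stream_sid"] = msg["start"]["streamSid"]
        return ("open_conversation", state["stream_sid"])
    if event == "media":
        # The payload is base64 mu-law audio; forward it without decoding.
        return ("forward_audio", msg["media"]["payload"])
    if event == "stop":
        return ("close_conversation", state.get("stream_sid"))
    # "connected" and any unknown events need no action.
    return ("ignore", None)
```

Keeping the translation free of I/O makes it easy to unit-test separately from the WebSocket plumbing.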
The interruption event from Speech Engine triggers a clear event on the Twilio stream, which discards any buffered audio so barge-in works cleanly. The ping event is answered with pong to keep the conversation WebSocket alive.
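The reverse direction might look like the sketch below. The Twilio media and clear messages use Twilio's documented shapes. The Speech Engine message layout (a type field, an audio field holding base64 audio, and the pong reply body) is an assumption made for illustration, so check the SDK reference for the real event schema.

```python
from typing import List, Tuple

Outgoing = Tuple[str, dict]  # (destination, message)


def route_engine_message(msg: dict, stream_sid: str) -> List[Outgoing]:
    """Decide what to send, and where, for one Speech Engine event."""
    kind = msg.get("type")  # assumed field name
    if kind == "audio":
        # Relay synthesised audio to the caller as a Twilio media message.
        payload = msg["audio"]  # assumed field holding base64 mu-law audio
        return [("twilio", {"event": "media", "streamSid": stream_sid,
                            "media": {"payload": payload}})]
    if kind == "interruption":
        # Ask Twilio to drop buffered audio so the caller can barge in.
        return [("twilio", {"event": "clear", "streamSid": stream_sid})]
    if kind == "ping":
        # Keep the conversation WebSocket alive; reply shape is assumed.
        return [("engine", {"type": "pong"})]
    return []
```

Returning a list of (destination, message) pairs keeps the routing testable without a live socket.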
Run the brain server alongside
The brain server is the standard Speech Engine server shown in the quickstart. The only addition is the shared-secret check on the WebSocket upgrade — accept the connection only if x-api-key matches the value you set on the Speech Engine.
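A minimal sketch of that check, independent of any web framework: the function receives the upgrade request's headers as a mapping and compares x-api-key with the configured secret in constant time. The function name is hypothetical, and how you obtain the headers depends on the server you use.

```python
import hmac
from typing import Mapping


def is_allowed_brain_request(headers: Mapping[str, str], shared_secret: str) -> bool:
    """Accept the WebSocket upgrade only if x-api-key matches the shared secret."""
    supplied = ""
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "x-api-key":
            supplied = value
            break
    if not shared_secret:
        # Refuse everything if no secret is configured, rather than matching "".
        return False
    return hmac.compare_digest(supplied.encode("utf-8"), shared_secret.encode("utf-8"))
```

Reject the upgrade (for example with HTTP 401) before any brain logic runs when the function returns False.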
See the Speech Engine quickstart for the full on_transcript implementation, including an LLM call and streamed response.
Point Twilio at the bridge
Update the Speech Engine ws_url
Set speech_engine.ws_url to the public WebSocket URL of your brain endpoint so ElevenLabs knows where to connect.
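Since the example serves everything from one host, the public URLs can be derived from a single base address. This sketch assumes the route paths used in this guide (/incoming-call, /media-stream, /ws). The helper name and dictionary keys are illustrative.

```python
from urllib.parse import urlsplit


def derive_endpoints(public_base: str) -> dict:
    """Derive the webhook and WebSocket URLs from one public https:// base URL."""
    parts = urlsplit(public_base)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError("expected a base URL like https://host")
    host = parts.netloc
    return {
        "twilio_webhook": f"https://{host}/incoming-call",
        "media_stream": f"wss://{host}/media-stream",
        "brain_ws_url": f"wss://{host}/ws",  # value for speech_engine.ws_url
    }


endpoints = derive_endpoints("https://abc123.ngrok.io")  # placeholder host
print(endpoints["brain_ws_url"])  # wss://abc123.ngrok.io/ws
```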
Configure the Twilio number
In the Twilio console, open your phone number’s Voice Configuration:
- A call comes in: Webhook
- URL: https://abc123.ngrok.io/incoming-call
- HTTP method: POST
If the number is attached to an Elastic SIP Trunk, detach it first — a Twilio number routes either to a trunk or to a webhook, not both.
Production considerations
- Webhook validation: always validate the X-Twilio-Signature on /incoming-call. The example above uses Twilio’s helper library; do not skip this step.
- Shared secret: enforce the shared secret on the brain WebSocket. Without it, anyone who guesses your ngrok URL can connect and impersonate ElevenLabs.
- Stable host: ngrok free tier URLs change on every restart. Use a reserved ngrok domain or a real hostname so you do not need to update the Speech Engine ws_url and the Twilio webhook after every restart.
- Latency: each call adds two network hops on top of the LLM time-to-first-token. Use a low-latency model and stream responses to keep perceived latency low.
- One process or two: the example colocates the bridge and the brain on the same port so a single ngrok tunnel covers everything. In production, you can split them across two services as long as each has a public URL.
- Prompt injection: spoken input from a phone call is untrusted user input. Validate transcripts before they influence tool calls or database writes.
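As one way to act on the prompt-injection point, the sketch below gates tool calls that originate from a transcript behind an allow-list and a strict argument format. The tool name lookup_order and the order ID pattern are hypothetical; substitute the tools and formats your brain actually exposes.

```python
import re

ALLOWED_TOOLS = {"lookup_order"}         # hypothetical tool name
ORDER_ID = re.compile(r"[A-Z]{2}\d{6}")  # hypothetical ID format, e.g. AB123456


def is_safe_tool_call(tool_name: str, args: dict) -> bool:
    """Allow a tool call only if the tool is known and its arguments are well-formed."""
    if tool_name not in ALLOWED_TOOLS:
        return False
    if tool_name == "lookup_order":
        order_id = str(args.get("order_id", ""))
        # fullmatch rejects trailing text such as injected SQL or instructions.
        return ORDER_ID.fullmatch(order_id) is not None
    return False
```

Run the check after the LLM proposes a tool call and before the call executes, so a caller cannot talk the model into an unlisted tool or a malformed argument.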
Next steps
Use the hosted LLM instead of a custom one.
Build the brain server end-to-end with a streaming LLM.
An alternative custom-LLM mechanism using an OpenAI-compatible HTTP endpoint.
Classes, methods, and events for the Speech Engine Python SDK.
Classes, methods, and events for the Speech Engine JavaScript SDK.