LiveKit integration
This guide shows how to use ElevenLabs Speech Engine as the voice layer for a LiveKit room. A LiveKit Agents worker joins the room as a participant, subscribes to the user’s audio track, opens a WebSocket to Speech Engine, and publishes Speech Engine’s synthesized audio back to the room as its own track.
Architecture
Speech Engine accepts two kinds of WebSocket connections:
- The brain WebSocket that the ElevenLabs API connects to. Your server runs this with the Speech Engine SDK (engine.serve()/engine.attach()) and receives transcripts to respond to.
- The conversation WebSocket that clients connect to. Browsers connect via a WebRTC token; non-browser clients (like a LiveKit Agents worker) connect via a signed URL and stream raw PCM audio in both directions.
The LiveKit worker uses the second connection. It acts as a “client” of Speech Engine on behalf of the participants in the LiveKit room.
The brain server is unchanged from the Speech Engine quickstart — the LiveKit worker replaces the browser as the audio source but the LLM logic stays the same.
When to use this pattern
Reach for the LiveKit bridge when the room itself is part of the experience:
- Multi-participant sessions where users speak with the agent alongside each other
- Existing LiveKit deployments where switching transports would break clients
- Voice agents sharing a room with screen share, video, or text chat
- SIP-to-LiveKit dispatched calls that need an AI agent on the line
If you only need a browser-to-Speech-Engine voice loop with no other participants, the WebRTC client in the Speech Engine quickstart is simpler — Speech Engine speaks WebRTC directly to the browser, no LiveKit room required.
Prerequisites
- A LiveKit project (either LiveKit Cloud or a self-hosted server). The worker needs LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET.
- An ElevenLabs Speech Engine. Follow the Speech Engine quickstart to create one and run the brain server.
- Python 3.9+ or Node.js 18+.
The Node bridge worker uses @livekit/rtc-node, which is currently in Developer Preview. For production deployments, prefer the Python worker.
Configure Speech Engine audio formats
LiveKit’s AudioStream resamples incoming Opus tracks to whatever PCM sample rate you request, so you can match Speech Engine’s input directly. Update the Speech Engine to accept 16 kHz PCM for ASR input and emit 24 kHz PCM for TTS output.
Speech Engine PCM is signed 16-bit little-endian throughout. See the audio format reference for other supported rates.
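To make the format concrete, the sketch below computes how many bytes a chunk of signed 16-bit little-endian PCM occupies at the two rates above, and round-trips a few samples through the standard struct module. The 20 ms chunk duration and the mono channel count are assumptions chosen for illustration, not values taken from the Speech Engine documentation, and the helper names are hypothetical.

```python
import struct

BYTES_PER_SAMPLE = 2  # signed 16-bit PCM


def frame_bytes(sample_rate_hz: int, duration_ms: int, channels: int = 1) -> int:
    """Size in bytes of one PCM chunk of the given duration."""
    samples = sample_rate_hz * duration_ms // 1000
    return samples * channels * BYTES_PER_SAMPLE


def pack_s16le(samples: list) -> bytes:
    """Encode integer samples as signed 16-bit little-endian PCM."""
    return struct.pack(f"<{len(samples)}h", *samples)


def unpack_s16le(data: bytes) -> list:
    """Decode signed 16-bit little-endian PCM back into integers."""
    return list(struct.unpack(f"<{len(data) // BYTES_PER_SAMPLE}h", data))


if __name__ == "__main__":
    print(frame_bytes(16_000, 20))  # 640 bytes of ASR input per 20 ms
    print(frame_bytes(24_000, 20))  # 960 bytes of TTS output per 20 ms
    print(pack_s16le([0, 1, -1]))   # b'\x00\x00\x01\x00\xff\xff'
```

Because the input and output rates differ, buffers sized for one direction should not be reused for the other without recomputing the chunk size.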
Build the bridge worker
The worker is a long-running process that connects to your LiveKit server, waits for jobs, joins assigned rooms, and bridges audio between the room and Speech Engine.
Mint a Speech Engine signed URL
The worker requests a short-lived signed URL for the Speech Engine conversation WebSocket. The signed URL embeds the engine ID and a one-time signature, so the worker can open the WebSocket without exposing your API key.
Define the worker entrypoint
Each time the worker is dispatched to a room, its entrypoint runs. The entrypoint connects to the room, opens a Speech Engine conversation WebSocket, and starts two audio bridges: one for caller audio going to Speech Engine, and one for synthesized audio coming back.
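The shape of that entrypoint can be sketched with plain asyncio. This is a simplified model rather than the actual worker: asyncio.Queue objects stand in for the LiveKit audio stream, the Speech Engine WebSocket, and the published track, and pump, run_bridge, and demo_bridge are hypothetical names. No LiveKit or ElevenLabs API is called.

```python
import asyncio
from typing import Optional


async def pump(src: asyncio.Queue, dst: asyncio.Queue) -> None:
    """Forward chunks from src to dst until a None sentinel arrives."""
    while True:
        chunk: Optional[bytes] = await src.get()
        if chunk is None:
            await dst.put(None)  # propagate end-of-stream downstream
            return
        await dst.put(chunk)


async def run_bridge(mic, to_engine, from_engine, to_room) -> None:
    # Both directions run concurrently, so neither blocks the other.
    await asyncio.gather(
        pump(mic, to_engine),        # caller audio -> Speech Engine
        pump(from_engine, to_room),  # synthesized audio -> room
    )


async def demo_bridge() -> list:
    mic, to_engine, from_engine, to_room = (asyncio.Queue() for _ in range(4))
    for item in (b"user-1", b"user-2", None):
        mic.put_nowait(item)
    for item in (b"tts-1", None):
        from_engine.put_nowait(item)
    await run_bridge(mic, to_engine, from_engine, to_room)

    def drain(q: asyncio.Queue) -> list:
        return [q.get_nowait() for _ in range(q.qsize())]

    return [drain(to_engine), drain(to_room)]


if __name__ == "__main__":
    print(asyncio.run(demo_bridge()))
```

Running the two directions as separate tasks keeps synthesized audio flowing while the caller is still speaking, which is what makes interruptions possible.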
The worker filters out its own published audio in the track_subscribed handler by comparing against the local participant’s identity. Without this check, the worker would try to send its own synthesized audio back to Speech Engine.
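A minimal sketch of that identity check, using a hypothetical Participant dataclass in place of the LiveKit participant objects the real handler receives:

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical stand-in for a LiveKit participant; only identity is modeled."""
    identity: str


def should_forward(publisher: Participant, local: Participant) -> bool:
    """Forward a track to Speech Engine only if someone else published it."""
    return publisher.identity != local.identity


if __name__ == "__main__":
    agent = Participant("speech-engine-bridge")
    publishers = [Participant("alice"), agent, Participant("bob")]
    forwarded = [p.identity for p in publishers if should_forward(p, agent)]
    print(forwarded)  # ['alice', 'bob']
```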
Two ordering details matter for correctness:
- Listener timing: TrackSubscribed is registered before ctx.connect(). LiveKit auto-subscribes to existing tracks during the connection handshake, and a listener registered afterwards may miss the event. The audio pump waits on a Future/Promise for the Speech Engine WebSocket so it can subscribe immediately and forward audio as soon as the connection is open.
- Capture serialization (TypeScript only): @livekit/rtc-node’s AudioSource.captureFrame throws InvalidState if called concurrently. The TypeScript handler serializes captures with a promise chain. Python’s single async for el_to_room loop is naturally sequential and does not need this.
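The listener-timing point can be illustrated without LiveKit. In the sketch below, FakeRoom and FakeSocket are hypothetical stand-ins: the room fires its subscription event during connect(), as described above, and the pump parks on a Future until the socket exists. It models the ordering only and is not the real worker code.

```python
import asyncio


class FakeRoom:
    """Stand-in for a room that emits track_subscribed during connect()."""

    def __init__(self) -> None:
        self._handlers = []

    def on_track_subscribed(self, handler) -> None:
        self._handlers.append(handler)

    async def connect(self) -> None:
        for handler in self._handlers:
            handler("user-mic")  # fired mid-handshake


class FakeSocket:
    """Stand-in for the Speech Engine conversation WebSocket."""

    def __init__(self) -> None:
        self.sent = []

    async def send(self, data: str) -> None:
        self.sent.append(data)


async def demo_listener_timing() -> list:
    room = FakeRoom()
    ws_ready = asyncio.get_running_loop().create_future()
    tasks = []

    async def pump(track: str) -> None:
        ws = await ws_ready  # parked until the socket is open
        await ws.send(f"audio from {track}")

    # Register BEFORE connect so the handshake-time event is not missed.
    room.on_track_subscribed(lambda t: tasks.append(asyncio.create_task(pump(t))))
    await room.connect()

    ws = FakeSocket()  # the socket opens after the track was subscribed
    ws_ready.set_result(ws)
    await asyncio.gather(*tasks)
    return ws.sent


if __name__ == "__main__":
    print(asyncio.run(demo_listener_timing()))  # ['audio from user-mic']
```

If the handler were registered after connect() in this model, no pump task would ever be created and nothing would be sent.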
Dispatch the worker to a room
Because the worker has an agent_name, it uses explicit dispatch — it only joins rooms when your backend tells it to. The simplest pattern is to include a RoomAgentDispatch in the LiveKit access token that the browser uses to connect.
When a browser uses this token to create or join a room, LiveKit dispatches the bridge worker into the same room automatically.
Connect from the browser
The browser only needs the standard LiveKit client — it does not interact with Speech Engine directly.
When the button is clicked, the browser fetches a LiveKit token, joins the room with the microphone enabled, and starts receiving the agent’s audio track. The worker is dispatched, opens its Speech Engine session, and bridges audio in both directions.
Audio format reference
Speech Engine supports the following audio formats. Configure them on the engine via asr.user_input_audio_format and tts.agent_output_audio_format.
AudioStream and AudioSource in LiveKit handle resampling for you — you can request any sample rate from AudioStream and the SDK converts from the underlying 48 kHz Opus track.
Production considerations
- Explicit dispatch: Always set agent_name/agentName on WorkerOptions. Auto-dispatch fires the worker for every room created on your LiveKit project, which is rarely what you want.
- Brain server authentication: Set a shared secret on the Speech Engine and verify it in your brain server, so only the Speech Engine can reach your endpoint. The brain server then checks request.headers["x-api-key"] before accepting the WebSocket upgrade.
- Token server: Mint LiveKit and Speech Engine tokens server-side. Never expose LIVEKIT_API_SECRET or ELEVENLABS_API_KEY to the browser.
- Event loop hygiene: Keep CPU-bound work off the worker’s event loop. AudioSource.capture_frame and AudioStream iteration are time-sensitive; long synchronous calls will delay or drop interruption events. Use asyncio.to_thread() (Python) or worker_threads (Node) for blocking work.
- Shutdown: Register ctx.add_shutdown_callback/ctx.addShutdownCallback to close the ElevenLabs WebSocket cleanly. By default, the room (and the job) is terminated when the last non-agent participant leaves.
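As an illustration of the event loop hygiene point, the sketch below runs a blocking call through asyncio.to_thread() while a heartbeat task keeps ticking on the loop. The blocking_work and heartbeat functions are hypothetical stand-ins: the first for any long synchronous call, the second for time-sensitive audio handling. The durations are arbitrary.

```python
import asyncio
import time


def blocking_work(seconds: float) -> str:
    time.sleep(seconds)  # stands in for CPU-bound or blocking I/O work
    return "done"


async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    """Record a timestamp every 10 ms for as long as the loop stays responsive."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def demo_to_thread() -> tuple:
    ticks = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(ticks, stop))
    # The blocking call runs in a worker thread, so the loop keeps ticking.
    result = await asyncio.to_thread(blocking_work, 0.2)
    stop.set()
    await hb
    return result, len(ticks)


if __name__ == "__main__":
    result, tick_count = asyncio.run(demo_to_thread())
    print(result, tick_count)
```

Calling blocking_work(0.2) directly inside the coroutine instead would stall the heartbeat for the full duration, which is the failure mode the guidance above warns about.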
Next steps
Build the brain server that responds to transcripts.
Use Pipecat as the LLM pipeline behind Speech Engine.
Classes, methods, and events for the Speech Engine Python SDK.
Classes, methods, and events for the Speech Engine JavaScript SDK.