LiveKit integration

Bridge a LiveKit room into Speech Engine using a LiveKit Agents worker.

This guide shows how to use ElevenLabs Speech Engine as the voice layer for a LiveKit room. A LiveKit Agents worker joins the room as a participant, subscribes to the user’s audio track, opens a WebSocket to Speech Engine, and publishes Speech Engine’s synthesized audio back to the room as its own track.

Architecture

Speech Engine accepts two kinds of WebSocket connections:

  • The brain WebSocket that the ElevenLabs API connects to. Your server runs this with the Speech Engine SDK (engine.serve() / engine.attach()) and receives transcripts to respond to.
  • The conversation WebSocket that clients connect to. Browsers connect via a WebRTC token; non-browser clients (like a LiveKit Agents worker) connect via a signed URL and stream raw PCM audio in both directions.

The LiveKit worker uses the second connection. It acts as a “client” of Speech Engine on behalf of the participants in the LiveKit room.

The brain server is unchanged from the Speech Engine quickstart — the LiveKit worker replaces the browser as the audio source but the LLM logic stays the same.

When to use this pattern

Reach for the LiveKit bridge when the room itself is part of the experience:

  • Multi-participant sessions where users speak with the agent alongside each other
  • Existing LiveKit deployments where switching transports would break clients
  • Voice agents sharing a room with screen share, video, or text chat
  • SIP-to-LiveKit dispatched calls that need an AI agent on the line

If you only need a browser-to-Speech-Engine voice loop with no other participants, the WebRTC client in the Speech Engine quickstart is simpler — Speech Engine speaks WebRTC directly to the browser, no LiveKit room required.

Prerequisites

  • A LiveKit project (either LiveKit Cloud or a self-hosted server). The worker needs LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET.
  • An ElevenLabs Speech Engine. Follow the Speech Engine quickstart to create one and run the brain server.
  • Python 3.9+ or Node.js 18+.

The Node bridge worker uses @livekit/rtc-node, which is currently in Developer Preview. For production deployments, prefer the Python worker.

Configure Speech Engine audio formats

LiveKit’s AudioStream resamples incoming Opus tracks to whatever PCM sample rate you request, so you can match Speech Engine’s input directly. Update the Speech Engine to accept 16 kHz PCM for ASR input and emit 24 kHz PCM for TTS output.

import asyncio
import os

from elevenlabs import AsyncElevenLabs

elevenlabs = AsyncElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])


async def update_engine():
    await elevenlabs.speech_engine.update(
        speech_engine_id="seng_8k3m9xr4hjnfg983brhmhkd98n6",
        asr={"user_input_audio_format": "pcm_16000"},
        tts={"agent_output_audio_format": "pcm_24000"},
    )


asyncio.run(update_engine())

Speech Engine PCM is signed 16-bit little-endian throughout. See the audio format reference for other supported rates.
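To make the wire format concrete, the sketch below packs integer samples as signed 16-bit little-endian PCM and wraps them in the user_audio_chunk message shape that the bridge worker sends. The helper names pcm16le_bytes and user_audio_message are illustrative and not part of any SDK.

```python
import base64
import json
import struct


def pcm16le_bytes(samples):
    """Pack integer samples (-32768..32767) as signed 16-bit little-endian PCM."""
    return struct.pack(f"<{len(samples)}h", *samples)


def user_audio_message(pcm: bytes) -> str:
    """Wrap raw PCM bytes in the JSON message shape used by the bridge worker."""
    return json.dumps({"user_audio_chunk": base64.b64encode(pcm).decode()})


# Five samples become ten bytes: two bytes per sample, low byte first.
pcm = pcm16le_bytes([0, 1, -1, 32767, -32768])
message = user_audio_message(pcm)
```

Each sample occupies two bytes, so a buffer's sample count is always len(pcm) // 2 for mono audio.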

Build the bridge worker

The worker is a long-running process that connects to your LiveKit server, waits for jobs, joins assigned rooms, and bridges audio between the room and Speech Engine.

1. Install dependencies

$ pip install "livekit-agents" "livekit-api" "elevenlabs" "aiohttp" "python-dotenv"

2. Mint a Speech Engine signed URL

The worker requests a short-lived signed URL for the Speech Engine conversation WebSocket. The signed URL embeds the engine ID and a one-time signature, so the worker can open the WebSocket without exposing your API key.

import os

from elevenlabs import AsyncElevenLabs

elevenlabs = AsyncElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])


async def signed_url() -> str:
    # Request a short-lived signed URL for the conversation WebSocket.
    response = await elevenlabs.conversational_ai.conversations.get_signed_url(
        agent_id=os.environ["SPEECH_ENGINE_ID"],
    )
    return response.signed_url

3. Define the worker entrypoint

Each time the worker is dispatched to a room, its entrypoint runs. The entrypoint connects to the room, opens a Speech Engine conversation WebSocket, and starts two audio bridges: one for caller audio going to Speech Engine, and one for synthesized audio coming back.

import asyncio
import base64
import json
import os

import aiohttp
from dotenv import load_dotenv
from elevenlabs import AsyncElevenLabs
from livekit import agents, rtc
from livekit.agents import JobContext, WorkerOptions, cli

load_dotenv()

elevenlabs = AsyncElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])
SPEECH_ENGINE_ID = os.environ["SPEECH_ENGINE_ID"]

USER_INPUT_RATE = 16000
AGENT_OUTPUT_RATE = 24000


async def signed_url() -> str:
    response = await elevenlabs.conversational_ai.conversations.get_signed_url(
        agent_id=SPEECH_ENGINE_ID,
    )
    return response.signed_url


async def entrypoint(ctx: JobContext):
    el_ws_ready: asyncio.Future[aiohttp.ClientWebSocketResponse] = (
        asyncio.get_running_loop().create_future()
    )

    async def pump_user_audio(track: rtc.Track):
        el_ws = await el_ws_ready
        stream = rtc.AudioStream(
            track, sample_rate=USER_INPUT_RATE, num_channels=1,
        )
        async for event in stream:
            payload = base64.b64encode(bytes(event.frame.data)).decode()
            await el_ws.send_str(json.dumps({"user_audio_chunk": payload}))

    # Register the subscriber BEFORE ctx.connect() so we don't miss tracks
    # that get auto-subscribed during the connection handshake.
    @ctx.room.on("track_subscribed")
    def on_track_subscribed(track, publication, participant):
        if track.kind != rtc.TrackKind.KIND_AUDIO:
            return
        if participant.identity == ctx.room.local_participant.identity:
            return
        asyncio.create_task(pump_user_audio(track))

    await ctx.connect()

    # Publish a track for the agent's synthesized audio.
    source = rtc.AudioSource(sample_rate=AGENT_OUTPUT_RATE, num_channels=1)
    track = rtc.LocalAudioTrack.create_audio_track("elevenlabs-agent", source)
    await ctx.room.local_participant.publish_track(
        track,
        rtc.TrackPublishOptions(source=rtc.TrackSource.SOURCE_MICROPHONE),
    )

    # Open the Speech Engine conversation WebSocket.
    http = aiohttp.ClientSession()
    el_ws = await http.ws_connect(await signed_url())
    await el_ws.send_str(json.dumps({"type": "conversation_initiation_client_data"}))
    el_ws_ready.set_result(el_ws)

    async def el_to_room():
        async for msg in el_ws:
            if msg.type != aiohttp.WSMsgType.TEXT:
                continue
            event = json.loads(msg.data)
            etype = event.get("type")
            if etype == "audio":
                pcm = base64.b64decode(event["audio_event"]["audio_base_64"])
                samples_per_channel = len(pcm) // 2
                frame = rtc.AudioFrame(
                    pcm, AGENT_OUTPUT_RATE, 1, samples_per_channel,
                )
                await source.capture_frame(frame)
            elif etype == "interruption":
                source.clear_queue()
            elif etype == "ping":
                event_id = event.get("ping_event", {}).get("event_id")
                await el_ws.send_str(json.dumps({
                    "type": "pong", "event_id": event_id,
                }))

    pump_task = asyncio.create_task(el_to_room())

    async def cleanup():
        pump_task.cancel()
        await el_ws.close()
        await http.close()

    ctx.add_shutdown_callback(cleanup)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(
        entrypoint_fnc=entrypoint,
        agent_name="elevenlabs-bridge",
    ))

The worker filters out its own published audio in the track_subscribed handler by comparing against the local participant’s identity. Without this check, the worker would try to send its own synthesized audio back to Speech Engine.
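The message handling inside el_to_room can also be written as a pure function, which makes it possible to unit test without a live WebSocket. This is a sketch only: the event shapes mirror the ones the worker handles, while the function name decode_event and its (action, payload) return values are hypothetical.

```python
import base64
import json


def decode_event(raw: str):
    """Map a Speech Engine text message to an (action, payload) pair.

    Hypothetical helper: returns ("play", pcm_bytes), ("clear", None),
    ("reply", json_text), or ("ignore", None).
    """
    event = json.loads(raw)
    etype = event.get("type")
    if etype == "audio":
        # Synthesized audio arrives base64-encoded inside audio_event.
        return "play", base64.b64decode(event["audio_event"]["audio_base_64"])
    if etype == "interruption":
        # The user barged in; queued agent audio should be dropped.
        return "clear", None
    if etype == "ping":
        event_id = event.get("ping_event", {}).get("event_id")
        return "reply", json.dumps({"type": "pong", "event_id": event_id})
    return "ignore", None
```

In the worker, a "play" result would feed source.capture_frame, "clear" would call source.clear_queue, and "reply" would be sent back over the WebSocket.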

Two ordering details matter for correctness:

  • Listener timing: The track_subscribed listener (TrackSubscribed in TypeScript) is registered before ctx.connect(). LiveKit auto-subscribes to existing tracks during the connection handshake, and a listener registered afterwards may miss the event. The audio pump waits on a Future (a Promise in TypeScript) that resolves to the Speech Engine WebSocket, so the handler can start the pump immediately and forward audio as soon as the connection is open.
  • TypeScript only — capture serialization: @livekit/rtc-node’s AudioSource.captureFrame throws InvalidState if called concurrently. The TypeScript handler serializes captures with a promise chain. Python’s single async for el_to_room loop is naturally sequential and does not need this.
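
The Future-based handoff described in the first bullet can be demonstrated without LiveKit or a real WebSocket. In this sketch, FakeSocket and run_demo are stand-ins invented for illustration. The point is only that a consumer started early blocks on the Future and begins sending once the connection is published.

```python
import asyncio


class FakeSocket:
    """Stand-in for a WebSocket; records what was sent."""

    def __init__(self):
        self.sent = []

    async def send_str(self, data: str):
        self.sent.append(data)


async def run_demo():
    ws_ready = asyncio.get_running_loop().create_future()

    async def pump(chunks):
        ws = await ws_ready  # suspends until the socket is published
        for chunk in chunks:
            await ws.send_str(chunk)

    # The pump starts before the socket exists, like a track handler firing early.
    task = asyncio.create_task(pump(["a", "b"]))
    await asyncio.sleep(0)  # let the pump run up to its await
    waiting = not task.done()

    socket = FakeSocket()
    ws_ready.set_result(socket)  # publish the socket; the pump resumes
    await task
    return waiting, socket.sent


result = asyncio.run(run_demo())
```

Because every pump awaits the same Future, any number of tracks subscribed during the handshake start forwarding as soon as the single set_result call runs.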

4. Start the worker

$ python bridge.py dev

dev enables hot reload and colored logs. Use start in production for JSON logs and graceful shutdown.

The worker connects to your LiveKit server and waits for job assignments. It does not join any rooms until it is dispatched.

Dispatch the worker to a room

Because the worker has an agent_name, it uses explicit dispatch — it only joins rooms when your backend tells it to. The simplest pattern is to include a RoomAgentDispatch in the LiveKit access token that the browser uses to connect.

import os

from dotenv import load_dotenv
from flask import Flask, jsonify, request
from livekit.api import (
    AccessToken,
    RoomAgentDispatch,
    RoomConfiguration,
    VideoGrants,
)

load_dotenv()

app = Flask(__name__)


@app.route("/api/livekit-token")
def get_token():
    room_name = request.args.get("room", "demo-room")
    identity = request.args.get("identity", "web-user")

    token = (
        AccessToken(
            os.environ["LIVEKIT_API_KEY"],
            os.environ["LIVEKIT_API_SECRET"],
        )
        .with_identity(identity)
        .with_grants(VideoGrants(room_join=True, room=room_name))
        # Ask LiveKit to dispatch the named agent when this token joins a room.
        .with_room_config(
            RoomConfiguration(
                agents=[RoomAgentDispatch(agent_name="elevenlabs-bridge")],
            ),
        )
    )

    return jsonify(token=token.to_jwt(), url=os.environ["LIVEKIT_URL"])


if __name__ == "__main__":
    app.run(port=3002)

When a browser uses this token to create or join a room, LiveKit dispatches the bridge worker into the same room automatically.

Connect from the browser

The browser only needs the standard LiveKit client — it does not interact with Speech Engine directly.

App.tsx
import { Room, RoomEvent, Track } from "livekit-client";
import { useCallback, useState } from "react";

export default function App() {
  const [room] = useState(() => new Room());

  const join = useCallback(async () => {
    const response = await fetch("/api/livekit-token");
    const { token, url } = await response.json();

    room.on(RoomEvent.TrackSubscribed, (track) => {
      if (track.kind === Track.Kind.Audio) {
        document.body.appendChild(track.attach());
      }
    });

    await room.connect(url, token);
    await room.localParticipant.setMicrophoneEnabled(true);
  }, [room]);

  return <button onClick={join}>Start conversation</button>;
}

When the button is clicked, the browser fetches a LiveKit token, joins the room with the microphone enabled, and starts receiving the agent’s audio track. The worker is dispatched, opens its Speech Engine session, and bridges audio in both directions.

Audio format reference

Speech Engine supports the following audio formats. Configure them on the engine via asr.user_input_audio_format and tts.agent_output_audio_format.

Format      Sample rate  Encoding              Notes
pcm_8000    8 kHz        Signed 16-bit LE PCM  ASR input only.
pcm_16000   16 kHz       Signed 16-bit LE PCM  Recommended for LiveKit user input.
pcm_22050   22.05 kHz    Signed 16-bit LE PCM
pcm_24000   24 kHz       Signed 16-bit LE PCM  Recommended for LiveKit agent output.
pcm_44100   44.1 kHz     Signed 16-bit LE PCM  TTS output requires Independent Publisher tier or above.
pcm_48000   48 kHz       Signed 16-bit LE PCM  ASR input only.
ulaw_8000   8 kHz        μ-law                 Used by Twilio Media Streams.

AudioStream and AudioSource in LiveKit handle resampling for you — you can request any sample rate from AudioStream and the SDK converts from the underlying 48 kHz Opus track.
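
When sizing buffers, it helps to know how many bytes one frame occupies at each rate. The arithmetic below assumes mono 16-bit PCM, as in the PCM rows of the table, and a 20 ms frame. The frame length is an illustrative choice, not a requirement of either SDK, and the function names are hypothetical.

```python
BYTES_PER_SAMPLE = 2  # signed 16-bit PCM


def frame_bytes(sample_rate: int, frame_ms: int = 20, channels: int = 1) -> int:
    """Bytes in one PCM frame of the given duration."""
    samples = sample_rate * frame_ms // 1000
    return samples * channels * BYTES_PER_SAMPLE


def pcm_duration_ms(num_bytes: int, sample_rate: int, channels: int = 1) -> float:
    """Duration in milliseconds of a 16-bit PCM buffer."""
    samples_per_channel = num_bytes // (BYTES_PER_SAMPLE * channels)
    return samples_per_channel * 1000 / sample_rate


# Frame sizes for the input rate, the output rate, and LiveKit's native rate.
sizes = {rate: frame_bytes(rate) for rate in (16000, 24000, 48000)}
```

With these assumptions a 20 ms frame is 640 bytes at 16 kHz, 960 bytes at 24 kHz, and 1920 bytes at 48 kHz.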

Production considerations

  • Explicit dispatch: Always set agent_name / agentName on WorkerOptions. Auto-dispatch fires the worker for every room created on your LiveKit project, which is rarely what you want.
  • Brain server authentication: Set a shared secret on the Speech Engine and verify it in your brain server, so only the Speech Engine can reach your endpoint:
    await elevenlabs.speech_engine.update(
        speech_engine_id="seng_8k3m9xr4hjnfg983brhmhkd98n6",
        speech_engine={"request_headers": {"x-api-key": os.environ["SHARED_SECRET"]}},
    )
    The brain server then checks request.headers["x-api-key"] before accepting the WebSocket upgrade.
  • Token server: Mint LiveKit and Speech Engine tokens server-side. Never expose LIVEKIT_API_SECRET or ELEVENLABS_API_KEY to the browser.
  • Event loop hygiene: Keep CPU-bound work off the worker’s event loop. AudioSource.capture_frame and AudioStream iteration are time-sensitive; long synchronous calls will delay or drop interruption events. Use asyncio.to_thread() (Python) or worker_threads (Node) for blocking work.
  • Shutdown: Register ctx.add_shutdown_callback / ctx.addShutdownCallback to close the ElevenLabs WebSocket cleanly. By default, the room (and the job) is terminated when the last non-agent participant leaves.
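
For the brain server authentication item, the header check can use a constant-time comparison. This sketch is framework-agnostic: is_authorized is a hypothetical helper, headers is assumed to be a plain mapping with lowercase keys, and x-api-key is the header name configured on the engine.

```python
import hmac
from typing import Mapping


def is_authorized(headers: Mapping[str, str], shared_secret: str) -> bool:
    """Return True only if the x-api-key header matches the shared secret."""
    supplied = headers.get("x-api-key", "")
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), shared_secret.encode())
```

A brain server would call this before accepting the WebSocket upgrade and reject the request when it returns False.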

Next steps