tech/elevenlabs

ELEVENLABS

ElevenLabs voice AI skill. Use when adding spoken interaction (text-to-speech, speech-to-text, voice cloning, or conversational agents) to a project.

Covers: production ElevenLabs API v1, Python SDK, TypeScript/JavaScript SDK, React SDK
improves: tech

ElevenLabs

ElevenLabs is a voice AI platform providing ultra-realistic text-to-speech, voice cloning, speech-to-text, and conversational AI agents. In the 2nth.ai stack it fills the voice layer — adding spoken interaction to Cloudflare Worker-based agents, AWS Connect contact flows, or any web/mobile surface.

Stub — full skill pending. Core patterns documented below.

Capabilities

| Feature | API | Models | Use case |
| --- | --- | --- | --- |
| Text-to-Speech | POST /v1/text-to-speech/:voice_id | eleven_v3, flash_v2_5, turbo_v2_5, multilingual_v2 | Narration, IVR prompts, notifications |
| Streaming TTS | POST /v1/text-to-speech/:voice_id/stream | Same | Real-time voice agents, low-latency playback |
| Speech-to-Text | POST /v1/speech-to-text | Scribe v1 | Transcription, meeting notes, call analytics |
| Voice Cloning | POST /v1/voices/add | | Brand voice, character voices, personalisation |
| Speech-to-Speech | POST /v1/speech-to-speech/:voice_id | | Voice conversion with emotion preservation |
| Conversational AI | Agents API + WebSocket | flash_v2_5 | Phone bots, web chat agents, WhatsApp |
| Dubbing | POST /v1/dubbing | | Localise video/audio to other languages |
| Sound Effects | POST /v1/sound-generation | | UI sounds, game audio, video production |
| Music | POST /v1/music | | Background music, jingles |
| Voice Library | GET /v1/voices | | 10,000+ pre-built voices |

Authentication

# All requests use xi-api-key header
export ELEVENLABS_API_KEY="your-api-key-here"

curl -H "xi-api-key: $ELEVENLABS_API_KEY" \
  https://api.elevenlabs.io/v1/voices
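In TypeScript, the same `xi-api-key` header can be built once and reused across raw fetch calls. A minimal sketch (elevenLabsHeaders is our helper name, not part of any SDK):

```typescript
// Build the headers every ElevenLabs REST call needs.
// elevenLabsHeaders is a hypothetical helper, not an SDK function.
function elevenLabsHeaders(apiKey: string, json = true): Record<string, string> {
  const headers: Record<string, string> = { 'xi-api-key': apiKey };
  if (json) headers['Content-Type'] = 'application/json'; // only for JSON bodies
  return headers;
}

// Usage:
// fetch('https://api.elevenlabs.io/v1/voices', { headers: elevenLabsHeaders(key, false) })
```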

Models

| Model | Latency | Languages | Best for |
| --- | --- | --- | --- |
| eleven_v3 | ~500ms | 70+ | Highest expressiveness, narration, long-form |
| eleven_flash_v2_5 | ~75ms | 70+ | Real-time agents, IVR, live interaction |
| eleven_turbo_v2_5 | ~250ms | 70+ | Interactive use, balanced quality/speed |
| eleven_multilingual_v2 | ~400ms | 32+ | Consistent multilingual quality, up to 10K chars |
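The table suggests a simple selection rule: flash for anything latency-sensitive, v3 for expressive long-form, turbo as a balanced default. A sketch of that rule (pickModel and its option names are ours, not from the SDK):

```typescript
type ModelId = 'eleven_v3' | 'eleven_flash_v2_5' | 'eleven_turbo_v2_5' | 'eleven_multilingual_v2';

// Pick a model from the latency/quality trade-offs in the table above.
// pickModel is a hypothetical helper, not an SDK function.
function pickModel(opts: { realtime?: boolean; longForm?: boolean } = {}): ModelId {
  if (opts.realtime) return 'eleven_flash_v2_5'; // ~75ms, live agents and IVR
  if (opts.longForm) return 'eleven_v3';         // ~500ms, most expressive
  return 'eleven_turbo_v2_5';                    // balanced default
}
```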

Text-to-speech (TypeScript)

import { ElevenLabsClient } from 'elevenlabs';

const client = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY });

// Generate and save to file
const audio = await client.textToSpeech.convert('JBFqnCBsd6RMkjVDRZzb', {
  text: 'Hello from ElevenLabs.',
  model_id: 'eleven_flash_v2_5',
  voice_settings: {
    stability: 0.5,        // 0–1: lower = more expressive variation
    similarity_boost: 0.75, // 0–1: how closely to match the cloned voice
    style: 0.0,
    use_speaker_boost: true,
  },
  output_format: 'mp3_44100_128',
});

// audio is an async-iterable stream — pipe it to a file or a Response
const { createWriteStream } = await import('node:fs');
const writer = createWriteStream('output.mp3');
for await (const chunk of audio) writer.write(chunk);
writer.end();

Streaming TTS (Cloudflare Worker)

// Stream directly to the client — no buffering
export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const { text } = await req.json() as { text: string };

    const upstream = await fetch(
      // output_format is passed as a query parameter on the REST endpoint
      'https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb/stream?output_format=mp3_44100_128',
      {
        method: 'POST',
        headers: {
          'xi-api-key': env.ELEVENLABS_API_KEY,
          'Content-Type': 'application/json',
          'Accept': 'audio/mpeg',
        },
        body: JSON.stringify({
          text,
          model_id: 'eleven_flash_v2_5',
        }),
      }
    );

    if (!upstream.ok) {
      return new Response(await upstream.text(), { status: upstream.status });
    }

    // Transfer-Encoding is hop-by-hop; the Workers runtime handles chunking itself
    return new Response(upstream.body, {
      headers: {
        'Content-Type': 'audio/mpeg',
        'Cache-Control': 'no-store',
      },
    });
  },
};

Speech-to-text (Python)

import os

from elevenlabs import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

with open("audio.mp3", "rb") as f:
    transcript = client.speech_to_text.convert(
        file=f,
        model_id="scribe_v1",
        language_code="en",  # omit for auto-detect
        diarize=True,        # speaker labels
        timestamps_granularity="word",
    )

print(transcript.text)

# With diarize=True, each entry in transcript.words carries a speaker_id
for word in transcript.words:
    print(f"[{word.speaker_id}] {word.text}")
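On the consuming side, a diarized word list can be folded into consecutive speaker turns. A TypeScript sketch (the Word shape mirrors only the response fields used here; groupBySpeaker is our helper name):

```typescript
// Minimal word shape: just the fields this helper reads.
interface Word { text: string; speaker_id: string }

// Fold a diarized word list into consecutive speaker turns.
function groupBySpeaker(words: Word[]): { speaker: string; text: string }[] {
  const turns: { speaker: string; text: string }[] = [];
  for (const w of words) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === w.speaker_id) last.text += ` ${w.text}`;
    else turns.push({ speaker: w.speaker_id, text: w.text });
  }
  return turns;
}
```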

Voice cloning (instant)

// Instant clone — 30s+ audio sample, results in seconds
const formData = new FormData();
formData.append('name', 'My Brand Voice');
formData.append('description', 'Cloned from our CEO recording');
formData.append('files', new Blob([audioBuffer], { type: 'audio/mp3' }), 'sample.mp3');
formData.append('remove_background_noise', 'true');

const response = await fetch('https://api.elevenlabs.io/v1/voices/add', {
  method: 'POST',
  headers: { 'xi-api-key': process.env.ELEVENLABS_API_KEY! },
  body: formData,
});
if (!response.ok) throw new Error(`voice add failed: ${response.status}`);
const { voice_id } = await response.json() as { voice_id: string };
// Use voice_id in subsequent TTS calls
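The multipart assembly above can be factored into a helper. A sketch assuming the WHATWG FormData/Blob globals (Node 18+ or browsers); buildClonePayload is our name, not an SDK function:

```typescript
// Assemble the multipart payload for POST /v1/voices/add.
// buildClonePayload is a hypothetical helper.
function buildClonePayload(name: string, sample: Blob, description = ''): FormData {
  const form = new FormData();
  form.append('name', name);
  if (description) form.append('description', description);
  form.append('files', sample, 'sample.mp3');
  form.append('remove_background_noise', 'true');
  return form;
}
```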

Conversational AI agent (WebSocket)

// React apps: use the useConversation hook from '@elevenlabs/react'
// Below: raw WebSocket, which works in any JS environment

const ws = new WebSocket(
  `wss://api.elevenlabs.io/v1/convai/conversation?agent_id=${AGENT_ID}`
);

ws.onopen = () => {
  // Send audio chunks (PCM 16kHz mono) or text input
  ws.send(JSON.stringify({ type: 'user_audio_chunk', user_audio_chunk: base64Audio }));
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'audio') {
    // Play base64-encoded audio response
    playAudio(msg.audio_event.audio_base_64);
  }
  if (msg.type === 'agent_response') {
    console.log('Agent said:', msg.agent_response_event.agent_response);
  }
};
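user_audio_chunk expects base64-encoded 16 kHz mono PCM. In Node, encoding a 16-bit capture buffer is a one-liner (pcm16ToBase64 is our helper name; Int16Array uses platform byte order, which is little-endian on effectively all current targets):

```typescript
// Encode 16-bit PCM samples as base64 for the user_audio_chunk message.
// Assumes a little-endian platform, which matches the wire format.
function pcm16ToBase64(samples: Int16Array): string {
  return Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength).toString('base64');
}
```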

Pricing (API)

| Model | Per 1,000 characters |
| --- | --- |
| Flash / Turbo (flash_v2_5, turbo_v2_5) | $0.06 |
| Standard / v3 (multilingual_v2, eleven_v3) | $0.12 |
| Speech-to-text | per audio minute (see dashboard) |

Free tier: 10,000 credits/month (~10 min high-quality TTS), no commercial rights.
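The per-character rates translate directly into a pre-flight cost estimate. A sketch using the prices from the table above (estimateTtsCost is our helper name; rates change, so check the dashboard before relying on these numbers):

```typescript
// USD per 1,000 characters, taken from the pricing table above.
const RATE_PER_1K: Record<string, number> = {
  eleven_flash_v2_5: 0.06,
  eleven_turbo_v2_5: 0.06,
  eleven_multilingual_v2: 0.12,
  eleven_v3: 0.12,
};

// Estimate the TTS cost of a piece of text for a given model.
function estimateTtsCost(text: string, modelId: string): number {
  const rate = RATE_PER_1K[modelId];
  if (rate === undefined) throw new Error(`unknown model: ${modelId}`);
  return (text.length / 1000) * rate;
}
```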

Gotchas

See also