Audio Models Overview

Convert speech to text, generate natural-sounding speech from text, and process audio content with advanced AI speech models.

Available Models

Text-to-Audio Models

OpenAI TTS

  • TTS-1: Fast, efficient text-to-speech
  • TTS-1-HD: High-quality voice synthesis

ElevenLabs

  • Eleven Turbo: Ultra-fast voice generation
  • Eleven Multilingual: Support for 29 languages
  • Eleven Monolingual: Premium English voices

Google Cloud

  • WaveNet: Neural voice synthesis
  • Standard: Traditional parametric TTS

Audio-to-Text Models

OpenAI Whisper

  • Whisper Large: Highest accuracy for transcription
  • Whisper Base: Balanced speed and accuracy
  • Whisper Tiny: Fast processing for real-time use

Google Speech

  • Chirp: Latest generation speech recognition
  • Standard: Reliable speech-to-text conversion

AssemblyAI

  • Universal-1: Advanced speech understanding
  • Best: Highest accuracy model

Model Capabilities

Text-to-Speech

Generate natural-sounding speech from text

Speech-to-Text

Transcribe audio files and real-time speech

Voice Cloning

Create custom voices from audio samples

Translation

Translate speech across multiple languages

Text-to-Audio API

Convert text to natural-sounding speech:
POST /v1/audio/speech

Basic Example

curl -X POST "https://api.anyapi.ai/v1/audio/speech" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, this is a test of text-to-speech conversion.",
    "voice": "alloy",
    "response_format": "mp3"
  }' \
  --output speech.mp3
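
The same request can be sketched in Python using only the standard library. The helper names (`build_speech_request`, `synthesize_to_file`) are illustrative, and the code assumes the response body is the raw audio, as the `--output` flag in the curl example suggests.

```python
import json
import urllib.request

API_URL = "https://api.anyapi.ai/v1/audio/speech"  # endpoint from the curl example


def build_speech_request(api_key, text, model="tts-1", voice="alloy",
                         response_format="mp3"):
    """Build the POST request without sending it."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(api_key, text, path, **options):
    """Send the request and write the returned bytes to `path`.

    Assumes the response body is the audio itself.
    """
    request = build_speech_request(api_key, text, **options)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(path, "wb") as out:
        out.write(audio)
    return len(audio)
```

Calling `synthesize_to_file("YOUR_API_KEY", "Hello", "speech.mp3", voice="nova")` would then mirror the curl command with a different voice.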

Available Voices

OpenAI TTS Voices

  • alloy: Neutral, balanced tone
  • echo: Warm, engaging voice
  • fable: Expressive, storytelling voice
  • onyx: Deep, authoritative tone
  • nova: Bright, energetic voice
  • shimmer: Soft, gentle tone

ElevenLabs Voices

  • Rachel: Professional female voice
  • Drew: Conversational male voice
  • Clyde: Middle-aged male voice
  • Paul: Mature, authoritative male
  • Domi: Confident female voice
  • Dave: British male accent

Audio-to-Text API

Transcribe audio files to text:
POST /v1/audio/transcriptions

Basic Example

curl -X POST "https://api.anyapi.ai/v1/audio/transcriptions" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F file="@audio.mp3" \
  -F model="whisper-1" \
  -F language="en"

Response Format

{
  "text": "Hello, this is a transcription of the audio file.",
  "language": "en",
  "duration": 3.2,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.2,
      "text": "Hello, this is a transcription of the audio file.",
      "confidence": 0.98
    }
  ]
}
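
The `segments` list carries enough timing information to build subtitles. The following is a minimal sketch that renders a response of the shape shown above as SRT; the function names and the `min_confidence` filter are illustrative, not part of the API.

```python
def format_timestamp(seconds):
    """Convert seconds (float) to the SRT form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(transcription, min_confidence=0.0):
    """Render the `segments` list as SRT, skipping low-confidence segments."""
    blocks = []
    for segment in transcription.get("segments", []):
        if segment.get("confidence", 1.0) < min_confidence:
            continue
        index = len(blocks) + 1  # SRT cues are numbered from 1
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    # A blank line separates consecutive cues
    return "\n".join(blocks)
```

Passing the parsed JSON response to `segments_to_srt(...)` returns a string that can be written directly to a `.srt` file.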

Translation API

Translate speech from one language to another:
POST /v1/audio/translations

import requests

# Upload Spanish audio and request an English translation
with open("spanish_audio.mp3", "rb") as audio_file:
    response = requests.post(
        "https://api.anyapi.ai/v1/audio/translations",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": audio_file},
        data={
            "model": "whisper-1",
            "source_language": "es",
            "target_language": "en"
        }
    )

response.raise_for_status()  # Fail early on HTTP errors before parsing
print(response.json())

Advanced Features

SSML Support

Use Speech Synthesis Markup Language (SSML) for fine-grained control over pacing and pauses:
{
  "model": "tts-1-hd",
  "input": "<speak><prosody rate='slow'>Hello there!</prosody> <break time='1s'/> How are you today?</speak>",
  "voice": "alloy",
  "response_format": "wav"
}
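
When SSML is assembled from user-supplied text, characters such as `&` and `<` need escaping or the markup becomes invalid. A minimal sketch of small builder helpers follows; the function names are illustrative and only the `<speak>`, `<prosody>`, and `<break>` elements from the example above are covered.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_break(seconds):
    """Return a <break/> element for a pause of the given length."""
    duration = f"{seconds:g}s"
    return f"<break time={quoteattr(duration)}/>"


def ssml_prosody(text, rate="medium"):
    """Wrap escaped text in a <prosody> element with the given rate."""
    return f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"


def ssml_document(*parts):
    """Join already-built SSML fragments inside a <speak> root."""
    return "<speak>" + " ".join(parts) + "</speak>"


ssml = ssml_document(
    ssml_prosody("Hello there!", rate="slow"),
    ssml_break(1),
    escape("How are you today?"),
)
```

The resulting string can be used as the `input` value of the request; a JSON encoder takes care of escaping the double quotes around attribute values.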

Voice Cloning

Create custom voices from audio samples:
import requests

# Upload voice sample (the with-block closes the file after the upload)
with open("voice_sample.wav", "rb") as sample:
    response = requests.post(
        "https://api.anyapi.ai/v1/voices/clone",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"voice_sample": sample},
        data={
            "name": "My Custom Voice",
            "description": "Personal voice clone"
        }
    )
response.raise_for_status()  # Stop here if the upload failed

voice_id = response.json()["voice_id"]

# Use custom voice
speech = requests.post(
    "https://api.anyapi.ai/v1/audio/speech",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "elevenlabs-turbo",
        "input": "This is my cloned voice speaking!",
        "voice": voice_id
    }
)
speech.raise_for_status()

Real-time Streaming

Stream audio for real-time applications:
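const ws = new WebSocket('wss://api.anyapi.ai/v1/audio/stream');

ws.onopen = () => {
  // Send the session configuration first
  ws.send(JSON.stringify({
    model: 'whisper-base',
    language: 'en',
    format: 'pcm16'
  }));

  // Start capturing only after the socket is open, so no chunk is sent too early
  navigator.mediaDevices.getUserMedia({audio: true})
    .then(stream => {
      // Note: MediaRecorder emits container-encoded chunks (typically WebM/Opus),
      // not raw PCM; confirm this matches the format declared above
      const mediaRecorder = new MediaRecorder(stream);
      mediaRecorder.ondataavailable = (event) => {
        if (ws.readyState === WebSocket.OPEN) {
          ws.send(event.data);
        }
      };
      mediaRecorder.start(100); // Send chunks every 100ms
    })
    .catch(err => console.error('Microphone access failed:', err));
};

ws.onmessage = (event) => {
  const result = JSON.parse(event.data);
  console.log('Transcription:', result.text);
};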
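
The streaming configuration declares `pcm16` and the example sends a chunk every 100 ms. For a client that streams raw PCM from a file or buffer, the chunk size follows from the sample rate. This is a sketch of that arithmetic; the sample rate and chunk size the server expects are not stated here, so treat the values as assumptions.

```python
def pcm16_chunk_size(sample_rate_hz, channels=1, chunk_ms=100):
    """Bytes of 16-bit PCM audio covering `chunk_ms` milliseconds."""
    bytes_per_frame = 2 * channels  # 16-bit samples are 2 bytes each
    frames = sample_rate_hz * chunk_ms // 1000
    return frames * bytes_per_frame


def iter_pcm16_chunks(pcm_bytes, sample_rate_hz, channels=1, chunk_ms=100):
    """Yield consecutive fixed-duration chunks; the last one may be shorter."""
    size = pcm16_chunk_size(sample_rate_hz, channels, chunk_ms)
    for offset in range(0, len(pcm_bytes), size):
        yield pcm_bytes[offset:offset + size]
```

At 16 kHz mono, each 100 ms chunk is 3,200 bytes, so one second of audio yields ten chunks.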

Model Comparison

Text-to-Speech Models

Model                    Quality    Speed      Languages  Price/Character
TTS-1                    Good       Fast       1          $0.000015
TTS-1-HD                 Excellent  Medium     1          $0.000030
ElevenLabs Turbo         Excellent  Very Fast  29         $0.000030
ElevenLabs Multilingual  Premium    Medium     29         $0.000100
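
Per-character pricing makes cost easy to estimate before sending a request. The sketch below copies the prices from the table above and assumes every character of the input, including whitespace, is billed; check your invoice for the exact counting rule.

```python
from decimal import Decimal

# Per-character prices in USD, copied from the table above
TTS_PRICE_PER_CHAR = {
    "TTS-1": Decimal("0.000015"),
    "TTS-1-HD": Decimal("0.000030"),
    "ElevenLabs Turbo": Decimal("0.000030"),
    "ElevenLabs Multilingual": Decimal("0.000100"),
}


def estimate_tts_cost(text, model):
    """Return the estimated USD cost of synthesizing `text` with `model`.

    Assumes each character, whitespace included, is billed once.
    """
    return TTS_PRICE_PER_CHAR[model] * len(text)
```

For example, 10,000 characters cost an estimated $0.15 with TTS-1 and $1.00 with ElevenLabs Multilingual.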

Speech-to-Text Models

Model                 Accuracy  Speed   Languages  Price/Minute
Whisper Large         98%       Medium  100+       $0.006
Whisper Base          95%       Fast    100+       $0.004
Google Chirp          97%       Fast    125+       $0.008
AssemblyAI Universal  96%       Medium  100+       $0.005
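
One way to use this table is to pick the lowest-priced model that meets an accuracy floor. The following sketch hard-codes the listed figures; the function names are illustrative, and billing by fractional minute is an assumption.

```python
from decimal import Decimal

# (listed accuracy in %, price per minute in USD), copied from the table above
STT_MODELS = {
    "Whisper Large": (98, Decimal("0.006")),
    "Whisper Base": (95, Decimal("0.004")),
    "Google Chirp": (97, Decimal("0.008")),
    "AssemblyAI Universal": (96, Decimal("0.005")),
}


def cheapest_model(min_accuracy):
    """Return the lowest-priced model whose listed accuracy meets the floor."""
    candidates = [
        (price, name)
        for name, (accuracy, price) in STT_MODELS.items()
        if accuracy >= min_accuracy
    ]
    if not candidates:
        raise ValueError(f"no model lists accuracy >= {min_accuracy}%")
    return min(candidates)[1]


def transcription_cost(model, audio_seconds):
    """Estimated USD cost, assuming billing by fractional minute."""
    price = STT_MODELS[model][1]
    return price * Decimal(str(audio_seconds)) / Decimal(60)
```

With a 96% floor the choice is AssemblyAI Universal; raising the floor to 97% selects Whisper Large.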

Supported Languages

Text-to-Speech

  • English: All models
  • Spanish: ElevenLabs, Google
  • French: ElevenLabs, Google
  • German: ElevenLabs, Google
  • Italian: ElevenLabs, Google
  • Portuguese: ElevenLabs, Google
  • Polish: ElevenLabs, Google
  • Turkish: ElevenLabs, Google
  • Russian: ElevenLabs, Google
  • Dutch: ElevenLabs, Google
  • Japanese: ElevenLabs, Google
  • Chinese: ElevenLabs, Google
  • Korean: ElevenLabs, Google
  • Hindi: ElevenLabs, Google

Speech-to-Text

Whisper supports 100+ languages including:
  • English, Spanish, French, German, Italian
  • Portuguese, Russian, Japanese, Korean, Chinese
  • Arabic, Hindi, Turkish, Polish, Dutch
  • And many more…

Audio Formats

Supported Input Formats

  • WAV: Uncompressed audio
  • MP3: Compressed audio
  • FLAC: Lossless compression
  • M4A: Apple audio format
  • OGG: Open source format
  • WEBM: Web audio format

Output Formats

  • MP3: Most compatible
  • WAV: Uncompressed quality
  • FLAC: Lossless compression
  • OGG: Open source format

Quality Settings

  • Sample Rate: 8kHz, 16kHz, 22.05kHz, 44.1kHz, 48kHz
  • Bit Depth: 16-bit, 24-bit
  • Channels: Mono, Stereo
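
For sizing buffers or storage, the uncompressed data rate follows directly from these settings: sample rate × bytes per sample × channels. A short worked sketch (it applies to WAV output only; compressed formats are smaller):

```python
def pcm_bytes_per_second(sample_rate_hz, bit_depth, channels):
    """Uncompressed PCM data rate in bytes per second."""
    return sample_rate_hz * (bit_depth // 8) * channels


def pcm_size_bytes(duration_s, sample_rate_hz, bit_depth, channels):
    """Approximate payload size of uncompressed audio, excluding the file header."""
    return int(duration_s * pcm_bytes_per_second(sample_rate_hz, bit_depth, channels))
```

One minute of 16 kHz, 16-bit mono audio is about 1.9 MB, while one minute at 48 kHz, 24-bit stereo is about 17.3 MB.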

Rate Limits

Speech model limits by plan:
Plan        TTS Characters/Min  STT Minutes/Hour  Custom Voices
Free        10,000              60                0
Pro         100,000             600               5
Enterprise  Custom              Custom            Unlimited
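
A client can stay under the TTS character limit by tracking its own usage. Below is a minimal sketch of a sliding 60-second window; the class name is illustrative, and how the server actually measures the window (sliding or fixed) is an assumption.

```python
import time
from collections import deque


class CharacterBudget:
    """Client-side sliding-window limiter for TTS characters per minute."""

    def __init__(self, limit_per_minute, clock=time.monotonic):
        self.limit = limit_per_minute
        self.clock = clock  # injectable for testing
        self.events = deque()  # (timestamp, character count) per request
        self.used = 0

    def _expire(self, now):
        # Drop requests that have left the 60-second window
        while self.events and now - self.events[0][0] >= 60:
            _, count = self.events.popleft()
            self.used -= count

    def try_acquire(self, text):
        """Record the request and return True if it fits the remaining budget."""
        now = self.clock()
        self._expire(now)
        if self.used + len(text) > self.limit:
            return False
        self.events.append((now, len(text)))
        self.used += len(text)
        return True
```

Create one instance per API key, for example `CharacterBudget(10_000)`, and call `try_acquire(text)` before each request; when it returns False, wait and retry.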

Common Use Cases

Content Creation

Podcasts, audiobooks, video narration

Accessibility

Screen readers, audio descriptions

Customer Service

IVR systems, chatbot voices

Education

Language learning, audio lessons

Transcription Services

Meeting notes, interview transcripts

Voice Assistants

Smart devices, mobile apps

Media Processing

Subtitle generation, content analysis

Translation

Real-time interpretation, multilingual content

Best Practices

Text-to-Speech Optimization

  • Use punctuation for natural pauses
  • Spell out numbers and abbreviations
  • Add SSML tags for emphasis
  • Choose appropriate voice for content tone
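
Spelling out numbers and abbreviations can be automated before text is sent for synthesis. This is a deliberately small sketch: the abbreviation table is hypothetical, only standalone integers from 0 to 999 are converted, and decimals, years, and ordinals are left untouched.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Hypothetical abbreviation table; extend it for your own content
ABBREVIATIONS = {"Dr.": "Doctor", "approx.": "approximately", "e.g.": "for example"}


def number_to_words(n):
    """Spell out an integer from 0 to 999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0-999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize_for_tts(text):
    """Expand known abbreviations, then spell out standalone 1-3 digit integers."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # Skip digits that are part of longer numbers, decimals, or words
    pattern = r"(?<![\w.,])\d{1,3}\b(?![.,]?\d)"
    return re.sub(pattern, lambda m: number_to_words(int(m.group())), text)
```

For production use, a dedicated text-normalization library is likely a better fit than hand-written rules.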

Speech-to-Text Optimization

  • Use high-quality audio (16kHz+)
  • Minimize background noise
  • Specify language for better accuracy
  • Use appropriate model for use case
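
The sample-rate recommendation can be checked before uploading. The sketch below uses Python's standard `wave` module, so it only covers WAV input; the 16 kHz threshold comes from the list above and the function names are illustrative.

```python
import wave

MIN_SAMPLE_RATE_HZ = 16_000  # "16kHz+" from the list above


def inspect_wav(source):
    """Return (sample_rate_hz, channels, duration_s) for a WAV path or file object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def is_transcription_ready(source):
    """True if the WAV file meets the recommended minimum sample rate."""
    rate, _, _ = inspect_wav(source)
    return rate >= MIN_SAMPLE_RATE_HZ
```

Files that fail the check (for example 8 kHz telephony recordings) can still be submitted, but accuracy may be lower.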

Performance Tips

  • Batch process multiple files
  • Use streaming for real-time applications
  • Cache frequently used audio
  • Implement proper error handling
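
Caching frequently used audio can be sketched as a lookup keyed on the request payload. Here `synthesize` is a hypothetical callable that performs the API request and returns audio bytes, and the cache directory name is arbitrary.

```python
import hashlib
import json
from pathlib import Path


def cache_key(payload):
    """Stable SHA-256 key for a request payload (key order does not matter)."""
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cached_speech(payload, synthesize, cache_dir="tts_cache"):
    """Return audio bytes for `payload`, calling `synthesize` only on a cache miss."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    extension = payload.get("response_format", "mp3")
    path = directory / f"{cache_key(payload)}.{extension}"
    if path.exists():
        return path.read_bytes()
    audio = synthesize(payload)  # hypothetical: performs the actual API request
    path.write_bytes(audio)
    return audio
```

Because the key covers the whole payload, changing the model, voice, or input text produces a separate cache entry.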

Getting Started