Audio Models Overview

Convert speech to text, generate natural-sounding speech from text, and process audio content with advanced AI speech models.

Available Models

Text-to-Audio Models

OpenAI TTS

TTS-1: Fast, efficient text-to-speech
TTS-1-HD: High-quality voice synthesis

ElevenLabs

Eleven Turbo: Ultra-fast voice generation
Eleven Multilingual: Support for 29 languages
Eleven Monolingual: Premium English voices

Google Cloud

WaveNet: Neural voice synthesis
Standard: Traditional parametric TTS

Audio-to-Text Models

OpenAI Whisper

Whisper Large: Highest accuracy for transcription
Whisper Base: Balanced speed and accuracy
Whisper Tiny: Fast processing for real-time use

Google Speech

Chirp: Latest generation speech recognition
Standard: Reliable speech-to-text conversion

AssemblyAI

Universal-1: Advanced speech understanding
Best: Highest accuracy model

Model Capabilities

Text-to-Speech

Generate natural-sounding speech from text

Speech-to-Text

Transcribe audio files and real-time speech

Voice Cloning

Create custom voices from audio samples

Translation

Translate speech across multiple languages

Text-to-Audio API

Convert text to natural-sounding speech:

POST /v1/audio/speech

Basic Example

curl -X POST "https://api.anyapi.ai/v1/audio/speech" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, this is a test of text-to-speech conversion.",
    "voice": "alloy",
    "response_format": "mp3"
  }' \
  --output speech.mp3
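The same request can be sketched in Python. This is a minimal illustration using only the standard library; the helper names `build_speech_request` and `synthesize_to_file` are hypothetical, and the sketch assumes the response body is the raw audio, as the `--output` flag in the curl example implies.

```python
import json
import urllib.request

API_URL = "https://api.anyapi.ai/v1/audio/speech"  # endpoint from the curl example


def build_speech_request(text, api_key, model="tts-1", voice="alloy", response_format="mp3"):
    """Build (but do not send) the POST request for the speech endpoint."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text, api_key, path="speech.mp3", **options):
    """Send the request and write the returned bytes to disk.

    Assumption: the response body is the audio itself.
    """
    request = build_speech_request(text, api_key, **options)
    with urllib.request.urlopen(request, timeout=30) as response:
        with open(path, "wb") as out:
            out.write(response.read())
    return path
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.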

Available Voices

OpenAI TTS Voices

alloy: Neutral, balanced tone
echo: Warm, engaging voice
fable: Expressive, storytelling voice
onyx: Deep, authoritative tone
nova: Bright, energetic voice
shimmer: Soft, gentle tone

ElevenLabs Voices

Rachel: Professional female voice
Drew: Conversational male voice
Clyde: Middle-aged male voice
Paul: Mature, authoritative male
Domi: Confident female voice
Dave: British male accent

Audio-to-Text API

Transcribe audio files to text:

POST /v1/audio/transcriptions

Basic Example

curl -X POST "https://api.anyapi.ai/v1/audio/transcriptions" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F file="@audio.mp3" \
  -F model="whisper-1" \
  -F language="en"

Response Format

{
  "text": "Hello, this is a transcription of the audio file.",
  "language": "en",
  "duration": 3.2,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.2,
      "text": "Hello, this is a transcription of the audio file.",
      "confidence": 0.98
    }
  ]
}
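One common use of the `segments` list is subtitle generation. The sketch below converts a response of the shape shown above into SRT text and can skip segments below a confidence threshold. The function names are hypothetical, and it assumes every segment carries `start`, `end`, and `text` as in the sample.

```python
def format_timestamp(seconds):
    """Convert seconds (float) to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(transcription, min_confidence=0.0):
    """Render the "segments" list as SRT text, skipping low-confidence segments."""
    blocks = []
    index = 1
    for segment in transcription.get("segments", []):
        # A segment without a confidence value is treated as fully confident.
        if segment.get("confidence", 1.0) < min_confidence:
            continue
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(segment['start'])} --> {format_timestamp(segment['end'])}\n"
            f"{segment['text'].strip()}\n"
        )
        index += 1
    return "\n".join(blocks)
```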

Translation API

Translate speech from one language to another:

POST /v1/audio/translations

import requests

# Upload the audio file together with the source and target languages
with open("spanish_audio.mp3", "rb") as audio_file:
    response = requests.post(
        "https://api.anyapi.ai/v1/audio/translations",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": audio_file},
        data={
            "model": "whisper-1",
            "source_language": "es",
            "target_language": "en"
        }
    )

response.raise_for_status()  # fail on HTTP errors before parsing the body
print(response.json())

Advanced Features

SSML Support

Use Speech Synthesis Markup Language (SSML) for fine-grained control over speaking rate and pauses:

{
  "model": "tts-1-hd",
  "input": "<speak><prosody rate='slow'>Hello there!</prosody> <break time='1s'/> How are you today?</speak>",
  "voice": "alloy",
  "response_format": "wav"
}
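When the SSML is assembled from user-supplied text, characters such as `&` and `<` must be escaped or the markup becomes invalid. A minimal sketch using the standard library follows; the helpers `prosody`, `pause`, and `speak` are hypothetical names, not part of any SDK.

```python
import json
from xml.sax.saxutils import escape, quoteattr


def prosody(text, rate):
    """Wrap text in a <prosody> element; text is escaped, rate is quoted."""
    return f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"


def pause(seconds):
    """Return a <break> element for a pause of the given length in seconds."""
    return f"<break time={quoteattr(f'{seconds:g}s')}/>"


def speak(*parts):
    """Join already-formed SSML fragments inside a single <speak> root."""
    return "<speak>" + " ".join(parts) + "</speak>"


# Rebuild the request body shown above from escaped fragments.
ssml = speak(prosody("Hello there!", "slow"), pause(1), escape("How are you today?"))
payload = {"model": "tts-1-hd", "input": ssml, "voice": "alloy", "response_format": "wav"}
print(json.dumps(payload, indent=2))
```

Because `json.dumps` handles the quoting of the `input` string, the SSML attributes can use ordinary double quotes.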

Voice Cloning

Create custom voices from audio samples:

import requests

HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Upload the voice sample; the with-block closes the file after the request
with open("voice_sample.wav", "rb") as sample:
    response = requests.post(
        "https://api.anyapi.ai/v1/voices/clone",
        headers=HEADERS,
        files={"voice_sample": sample},
        data={
            "name": "My Custom Voice",
            "description": "Personal voice clone"
        }
    )
response.raise_for_status()  # stop here if the upload failed

voice_id = response.json()["voice_id"]

# Use the custom voice
speech = requests.post(
    "https://api.anyapi.ai/v1/audio/speech",
    headers=HEADERS,
    json={
        "model": "elevenlabs-turbo",
        "input": "This is my cloned voice speaking!",
        "voice": voice_id
    }
)
speech.raise_for_status()  # speech.content now holds the generated audio bytes

Real-time Streaming

Stream audio for real-time applications:

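const ws = new WebSocket('wss://api.anyapi.ai/v1/audio/stream');

ws.onopen = async () => {
  // Send the session configuration before any audio
  ws.send(JSON.stringify({
    model: 'whisper-base',
    language: 'en',
    format: 'pcm16'
  }));

  // Start capturing only after the socket is open, so no chunk is sent early.
  // Note: MediaRecorder emits a compressed container (typically WebM/Opus),
  // not raw PCM; if the service requires pcm16, capture samples with the
  // Web Audio API instead.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mediaRecorder = new MediaRecorder(stream);
  mediaRecorder.ondataavailable = (event) => {
    if (event.data.size > 0 && ws.readyState === WebSocket.OPEN) {
      ws.send(event.data);
    }
  };
  mediaRecorder.start(100); // Emit a chunk every 100 ms

  // Stop recording and release the microphone when the socket closes
  ws.onclose = () => {
    if (mediaRecorder.state !== 'inactive') mediaRecorder.stop();
    stream.getTracks().forEach((track) => track.stop());
  };
};

ws.onmessage = (event) => {
  const result = JSON.parse(event.data);
  console.log('Transcription:', result.text);
};

ws.onerror = (event) => console.error('WebSocket error:', event);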

Model Comparison

Text-to-Speech Models

Model	Quality	Speed	Languages	Price/Character
TTS-1	Good	Fast	1	$0.000015
TTS-1-HD	Excellent	Medium	1	$0.000030
ElevenLabs Turbo	Excellent	Very Fast	29	$0.000030
ElevenLabs Multilingual	Premium	Medium	29	$0.000100

Speech-to-Text Models

Model	Accuracy	Speed	Languages	Price/Minute
Whisper Large	98%	Medium	100+	$0.006
Whisper Base	95%	Fast	100+	$0.004
Google Chirp	97%	Fast	125+	$0.008
AssemblyAI Universal	96%	Medium	100+	$0.005
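
The prices in the two tables can be turned into a rough cost estimate. The sketch below copies the listed prices; the dictionary keys are illustrative labels rather than confirmed API model identifiers, and it assumes speech-to-text is billed on exact duration without rounding up to whole minutes.

```python
from decimal import Decimal

# Prices copied from the comparison tables; keys are illustrative labels.
TTS_PRICE_PER_CHARACTER = {
    "tts-1": Decimal("0.000015"),
    "tts-1-hd": Decimal("0.000030"),
    "elevenlabs-turbo": Decimal("0.000030"),
    "elevenlabs-multilingual": Decimal("0.000100"),
}
STT_PRICE_PER_MINUTE = {
    "whisper-large": Decimal("0.006"),
    "whisper-base": Decimal("0.004"),
    "google-chirp": Decimal("0.008"),
    "assemblyai-universal": Decimal("0.005"),
}


def tts_cost(model, text):
    """Estimated cost in dollars of synthesizing `text`."""
    return TTS_PRICE_PER_CHARACTER[model] * len(text)


def stt_cost(model, duration_seconds):
    """Estimated cost in dollars of transcribing audio of the given length.

    Assumption: billing is proportional to duration, with no per-minute rounding.
    """
    return STT_PRICE_PER_MINUTE[model] * Decimal(str(duration_seconds)) / Decimal(60)
```

`Decimal` is used instead of `float` so that very small per-character prices do not accumulate rounding error.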

Supported Languages

Text-to-Speech

English: All models
Spanish: ElevenLabs, Google
French: ElevenLabs, Google
German: ElevenLabs, Google
Italian: ElevenLabs, Google
Portuguese: ElevenLabs, Google
Polish: ElevenLabs, Google
Turkish: ElevenLabs, Google
Russian: ElevenLabs, Google
Dutch: ElevenLabs, Google
Japanese: ElevenLabs, Google
Chinese: ElevenLabs, Google
Korean: ElevenLabs, Google
Hindi: ElevenLabs, Google

Speech-to-Text

Whisper supports 100+ languages including:

English, Spanish, French, German, Italian
Portuguese, Russian, Japanese, Korean, Chinese
Arabic, Hindi, Turkish, Polish, Dutch
And many more…

Audio Formats

Supported Input Formats

WAV: Uncompressed audio
MP3: Compressed audio
FLAC: Lossless compression
M4A: Apple audio format
OGG: Open source format
WEBM: Web audio format

Output Formats

MP3: Most compatible
WAV: Uncompressed quality
FLAC: Lossless compression
OGG: Open source format

Quality Settings

Sample Rate: 8kHz, 16kHz, 22kHz, 44.1kHz, 48kHz
Bit Depth: 16-bit, 24-bit
Channels: Mono, Stereo
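
These three settings determine the size of uncompressed audio: bytes = duration x sample rate x (bit depth / 8) x channels. A small sketch of that arithmetic, with a hypothetical function name:

```python
def pcm_size_bytes(duration_seconds, sample_rate_hz, bit_depth, channels):
    """Size of raw PCM audio: frames x bytes per sample x channels."""
    if bit_depth % 8 != 0:
        raise ValueError("bit depth must be a whole number of bytes")
    bytes_per_sample = bit_depth // 8
    frames = int(duration_seconds * sample_rate_hz)
    return frames * bytes_per_sample * channels


# One minute of 16 kHz, 16-bit mono versus 48 kHz, 24-bit stereo
print(pcm_size_bytes(60, 16_000, 16, 1))
print(pcm_size_bytes(60, 48_000, 24, 2))
```

The second configuration is nine times larger than the first, which matters when uploading WAV input or choosing an uncompressed output format.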

Rate Limits

Speech model limits by plan:

Plan	TTS Characters/Min	STT Minutes/Hour	Custom Voices
Free	10,000	60	0
Pro	100,000	600	5
Enterprise	Custom	Custom	Unlimited
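
A client can stay under the TTS character limit by tracking its own usage. The sketch below is a sliding-window budget; the class name `CharacterBudget` is hypothetical, and it assumes the limit is enforced over a rolling 60-second window, which the table does not specify.

```python
import time
from collections import deque


class CharacterBudget:
    """Sliding-window limiter: at most `limit` characters per `window` seconds."""

    def __init__(self, limit, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.events = deque()  # (timestamp, character_count) pairs
        self.used = 0

    def _expire(self, now):
        # Drop spends that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, count = self.events.popleft()
            self.used -= count

    def try_spend(self, characters):
        """Record the spend and return True if it fits, else return False."""
        now = self.clock()
        self._expire(now)
        if self.used + characters > self.limit:
            return False
        self.events.append((now, characters))
        self.used += characters
        return True
```

For the Free plan this would be created as `CharacterBudget(10_000)`, with `try_spend(len(text))` called before each speech request.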

Common Use Cases

Content Creation

Podcasts, audiobooks, video narration

Accessibility

Screen readers, audio descriptions

Customer Service

IVR systems, chatbot voices

Education

Language learning, audio lessons

Transcription Services

Meeting notes, interview transcripts

Voice Assistants

Smart devices, mobile apps

Media Processing

Subtitle generation, content analysis

Translation

Real-time interpretation, multilingual content

Best Practices

Text-to-Speech Optimization

Use punctuation for natural pauses
Spell out numbers and abbreviations
Add SSML tags for emphasis
Choose appropriate voice for content tone
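
Spelling out numbers and abbreviations can be automated with a small preprocessing step. This is an illustrative sketch only: the abbreviation table is a made-up sample, only standalone integers from 0 to 999 are converted, and decimals or comma-grouped numbers are left untouched.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Sample entries; extend for the vocabulary of your own content.
ABBREVIATIONS = {"Dr.": "Doctor", "approx.": "approximately", "e.g.": "for example"}

# A 1-3 digit integer that is not part of a decimal or comma-grouped number.
STANDALONE_INT = re.compile(r"(?<![\d.,])\b\d{1,3}\b(?![.,]?\d)")


def number_to_words(n):
    """Spell out an integer from 0 to 999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0-999 is supported")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")


def normalize_for_tts(text):
    """Expand known abbreviations, then spell out small standalone integers."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return STANDALONE_INT.sub(lambda m: number_to_words(int(m.group())), text)
```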

Speech-to-Text Optimization

Use high-quality audio (16kHz+)
Minimize background noise
Specify language for better accuracy
Use appropriate model for use case
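
The sample-rate guideline can be checked locally before uploading. The sketch below reads WAV headers with Python's standard `wave` module; it covers WAV input only, and the function names are hypothetical. `make_silent_wav` exists only to try the checks without a real file.

```python
import io
import wave

MIN_SAMPLE_RATE_HZ = 16_000  # taken from the "16kHz+" guideline


def inspect_wav(source):
    """Return basic properties of a WAV given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,
            "duration_seconds": wav.getnframes() / rate,
        }


def meets_sample_rate_guideline(source):
    """True when the WAV is sampled at 16 kHz or higher."""
    return inspect_wav(source)["sample_rate_hz"] >= MIN_SAMPLE_RATE_HZ


def make_silent_wav(sample_rate_hz, seconds=1):
    """Build a mono 16-bit silent WAV in memory, for trying the checks."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * sample_rate_hz * seconds)
    buffer.seek(0)
    return buffer
```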

Performance Tips

Batch process multiple files
Use streaming for real-time applications
Cache frequently used audio
Implement proper error handling
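
Caching and error handling can be layered around whatever function performs the request. The sketch below is one possible shape, with hypothetical names: `SpeechCache` keeps audio in memory keyed by a hash of the request parameters, and `with_retries` retries a call with exponential backoff. By default it retries on `OSError`, which covers the standard library's network errors; adjust `retry_on` for the HTTP client you use.

```python
import hashlib
import json
import time


def cache_key(model, voice, text, response_format="mp3"):
    """Stable key: identical requests map to the same cached audio."""
    blob = json.dumps([model, voice, text, response_format], ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class SpeechCache:
    """In-memory cache in front of a synthesize(model, voice, text, fmt) callable."""

    def __init__(self, synthesize):
        self.synthesize = synthesize
        self.store = {}

    def get(self, model, voice, text, response_format="mp3"):
        key = cache_key(model, voice, text, response_format)
        if key not in self.store:
            self.store[key] = self.synthesize(model, voice, text, response_format)
        return self.store[key]


def with_retries(call, attempts=3, base_delay=0.5, sleep=time.sleep, retry_on=(OSError,)):
    """Run call(); on a retryable error wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

For long-running services the dictionary would typically be replaced by a bounded or on-disk store so cached audio does not grow without limit.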

Getting Started

Quick Start

Generate your first speech

Use Cases

See practical examples

SDKs

Use our libraries

API Reference

Explore all endpoints
