Audio Models Overview
Convert speech to text, generate natural-sounding speech from text, and process audio content with advanced AI speech models.
Available Models
Text-to-Audio Models
OpenAI TTS
- TTS-1: Fast, efficient text-to-speech
- TTS-1-HD: High-quality voice synthesis
ElevenLabs
- Eleven Turbo: Ultra-fast voice generation
- Eleven Multilingual: Support for 29 languages
- Eleven Monolingual: Premium English voices
Google Cloud
- WaveNet: Neural voice synthesis
- Standard: Traditional concatenative TTS
Audio-to-Text Models
OpenAI Whisper
- Whisper Large: Highest accuracy for transcription
- Whisper Base: Balanced speed and accuracy
- Whisper Tiny: Fast processing for real-time use
Google Speech
- Chirp: Latest generation speech recognition
- Standard: Reliable speech-to-text conversion
AssemblyAI
- Universal-1: Advanced speech understanding
- Best: Highest accuracy model
Model Capabilities
Text-to-Speech
Generate natural-sounding speech from text
Speech-to-Text
Transcribe audio files and real-time speech
Voice Cloning
Create custom voices from audio samples
Translation
Translate speech across multiple languages
Text-to-Audio API
Convert text to natural-sounding speech.
Basic Example
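As a minimal sketch of what a text-to-speech request might look like, the snippet below only builds and validates a JSON body; it does not call any service. The field names (`model`, `voice`, `input`), the lowercase model id `tts-1`, and the helper `build_tts_request` are assumptions for illustration, not a schema confirmed by this page. The voice names are the OpenAI TTS voices listed in this section.

```python
import json

# Voice names listed for OpenAI TTS in this document.
OPENAI_TTS_VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}


def build_tts_request(text: str, model: str = "tts-1", voice: str = "alloy") -> str:
    """Return a JSON body for a hypothetical text-to-speech endpoint."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if voice not in OPENAI_TTS_VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    # Field names are assumptions, not a documented request schema.
    payload = {"model": model, "voice": voice, "input": text}
    return json.dumps(payload)


body = build_tts_request("Hello, world.", voice="nova")
print(body)
```

Validating the voice name before sending keeps an obvious typo from costing a network round trip.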
Available Voices
OpenAI TTS Voices
- alloy: Neutral, balanced tone
- echo: Warm, engaging voice
- fable: Expressive, storytelling voice
- onyx: Deep, authoritative tone
- nova: Bright, energetic voice
- shimmer: Soft, gentle tone
ElevenLabs Voices
- Rachel: Professional female voice
- Drew: Conversational male voice
- Clyde: Middle-aged male voice
- Paul: Mature, authoritative male
- Domi: Confident female voice
- Dave: British male accent
Audio-to-Text API
Transcribe audio files to text.
Basic Example
Response Format
Translation API
Translate speech from one language to another.
Advanced Features
SSML Support
Use Speech Synthesis Markup Language (SSML) for fine control.
Voice Cloning
Create custom voices from audio samples.
Real-time Streaming
Stream audio for real-time applications.
Model Comparison
Text-to-Speech Models
Model | Quality | Speed | Languages | Price/Character |
---|---|---|---|---|
TTS-1 | Good | Fast | 1 | $0.000015 |
TTS-1-HD | Excellent | Medium | 1 | $0.000030 |
ElevenLabs Turbo | Excellent | Very Fast | 29 | $0.000030 |
ElevenLabs Multilingual | Premium | Medium | 29 | $0.000100 |
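The per-character prices in the table can be turned into a rough cost estimate. This sketch copies the table values and assumes that every character, including spaces and punctuation, is billed; the actual billing rule is not stated here. `estimate_tts_cost` is a hypothetical helper name.

```python
from decimal import Decimal

# USD per character, copied from the comparison table above.
TTS_PRICE_PER_CHAR = {
    "TTS-1": Decimal("0.000015"),
    "TTS-1-HD": Decimal("0.000030"),
    "ElevenLabs Turbo": Decimal("0.000030"),
    "ElevenLabs Multilingual": Decimal("0.000100"),
}


def estimate_tts_cost(model: str, text: str) -> Decimal:
    """Estimate the synthesis cost in USD for one piece of text."""
    # Assumption: every character counts toward the bill.
    return TTS_PRICE_PER_CHAR[model] * len(text)


script = "word " * 2000  # 10,000 characters
for name in TTS_PRICE_PER_CHAR:
    print(f"{name}: ${estimate_tts_cost(name, script):.2f}")
```

`Decimal` is used instead of `float` so that very small per-character prices multiply without binary rounding error.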
Speech-to-Text Models
Model | Accuracy | Speed | Languages | Price/Minute |
---|---|---|---|---|
Whisper Large | 98% | Medium | 100+ | $0.006 |
Whisper Base | 95% | Fast | 100+ | $0.004 |
Google Chirp | 97% | Fast | 125+ | $0.008 |
AssemblyAI Universal | 96% | Medium | 100+ | $0.005 |
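One way to read this table is as a selection problem: pick the cheapest model that meets an accuracy floor. The sketch below uses the table's figures as given; `cheapest_model` and `transcription_cost` are hypothetical helpers, and pro-rata per-second billing is an assumption.

```python
from decimal import Decimal
from typing import NamedTuple, Optional


class SttModel(NamedTuple):
    name: str
    accuracy_pct: int
    usd_per_minute: Decimal


# Rows copied from the comparison table above.
STT_MODELS = [
    SttModel("Whisper Large", 98, Decimal("0.006")),
    SttModel("Whisper Base", 95, Decimal("0.004")),
    SttModel("Google Chirp", 97, Decimal("0.008")),
    SttModel("AssemblyAI Universal", 96, Decimal("0.005")),
]


def cheapest_model(min_accuracy_pct: int) -> Optional[SttModel]:
    """Return the lowest-priced model at or above the accuracy floor."""
    candidates = [m for m in STT_MODELS if m.accuracy_pct >= min_accuracy_pct]
    return min(candidates, key=lambda m: m.usd_per_minute, default=None)


def transcription_cost(model: SttModel, audio_seconds: int) -> Decimal:
    """Estimate cost in USD, assuming pro-rata billing by the second."""
    return model.usd_per_minute * Decimal(audio_seconds) / 60


choice = cheapest_model(96)
if choice is not None:
    print(choice.name, transcription_cost(choice, 90 * 60))
```

Speed and language coverage are left out of the selection on purpose; a real choice would weigh them alongside price.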
Supported Languages
Text-to-Speech
- English: All models
- Spanish: ElevenLabs, Google
- French: ElevenLabs, Google
- German: ElevenLabs, Google
- Italian: ElevenLabs, Google
- Portuguese: ElevenLabs, Google
- Polish: ElevenLabs, Google
- Turkish: ElevenLabs, Google
- Russian: ElevenLabs, Google
- Dutch: ElevenLabs, Google
- Japanese: ElevenLabs, Google
- Chinese: ElevenLabs, Google
- Korean: ElevenLabs, Google
- Hindi: ElevenLabs, Google
Speech-to-Text
Whisper supports 100+ languages, including:
- English, Spanish, French, German, Italian
- Portuguese, Russian, Japanese, Korean, Chinese
- Arabic, Hindi, Turkish, Polish, Dutch
- And many more…
Audio Formats
Supported Input Formats
- WAV: Uncompressed audio
- MP3: Compressed audio
- FLAC: Lossless compression
- M4A: Apple audio format
- OGG: Open source format
- WEBM: Web audio format
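A client can reject unsupported uploads before sending them. The following is a small sketch that checks only the file extension against the formats listed above; it does not inspect file contents, and `is_supported_input` is a hypothetical helper name.

```python
from pathlib import Path

# Input formats listed above, as lowercase file extensions.
SUPPORTED_INPUT_FORMATS = {"wav", "mp3", "flac", "m4a", "ogg", "webm"}


def is_supported_input(path: str) -> bool:
    """Return True if the file extension is one of the listed input formats."""
    # Extension check only; a mislabeled file would still pass.
    return Path(path).suffix.lower().lstrip(".") in SUPPORTED_INPUT_FORMATS


uploads = ["meeting.WAV", "interview.m4a", "notes.txt"]
accepted = [name for name in uploads if is_supported_input(name)]
print(accepted)
```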
Output Formats
- MP3: Most compatible
- WAV: Uncompressed quality
- FLAC: Lossless compression
- OGG: Open source format
Quality Settings
- Sample Rate: 8kHz, 16kHz, 22kHz, 44.1kHz, 48kHz
- Bit Depth: 16-bit, 24-bit
- Channels: Mono, Stereo
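For WAV files, these settings can be read from the file header with Python's standard `wave` module. The sketch below writes one second of silence to memory and then inspects it. It assumes that "22kHz" in the list above refers to the common 22,050 Hz rate; `describe_wav` and `matches_quality_settings` are hypothetical helper names.

```python
import io
import wave

# Assumption: "22kHz" in the list above denotes 22,050 Hz.
SAMPLE_RATES_HZ = {8000, 16000, 22050, 44100, 48000}
BIT_DEPTHS = {16, 24}
CHANNEL_COUNTS = {1, 2}  # mono, stereo


def describe_wav(source) -> dict:
    """Read sample rate, bit depth, and channel count from a WAV header."""
    with wave.open(source, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "bit_depth": wav.getsampwidth() * 8,  # bytes per sample -> bits
            "channels": wav.getnchannels(),
        }


def matches_quality_settings(info: dict) -> bool:
    """Check the header values against the settings listed above."""
    return (
        info["sample_rate_hz"] in SAMPLE_RATES_HZ
        and info["bit_depth"] in BIT_DEPTHS
        and info["channels"] in CHANNEL_COUNTS
    )


# Build one second of 16 kHz, 16-bit mono silence in memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)
buffer.seek(0)

info = describe_wav(buffer)
print(info, matches_quality_settings(info))
```

`describe_wav` accepts either a file path or a file-like object, since `wave.open` handles both.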
Rate Limits
Speech model limits by plan:
Plan | TTS Characters/Min | STT Minutes/Hour | Custom Voices |
---|---|---|---|
Free | 10,000 | 60 | 0 |
Pro | 100,000 | 600 | 5 |
Enterprise | Custom | Custom | Unlimited |
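To see what these limits mean for a large job, the sketch below estimates how many one-minute windows a text-to-speech workload would need. It assumes the character limit resets every minute and that requests can be split freely across windows; neither detail is stated in the table. `minutes_to_synthesize` is a hypothetical helper name.

```python
# TTS limits copied from the table above.
# Enterprise limits are custom, so no fixed number is listed for that plan.
TTS_CHARS_PER_MINUTE = {"Free": 10_000, "Pro": 100_000}


def minutes_to_synthesize(plan: str, total_chars: int) -> int:
    """Minimum number of one-minute windows needed to send total_chars."""
    if plan not in TTS_CHARS_PER_MINUTE:
        raise ValueError(f"no fixed limit known for plan {plan!r}")
    limit = TTS_CHARS_PER_MINUTE[plan]
    # Integer ceiling division: a partly used window still counts as one.
    return -(-total_chars // limit)


chapter_chars = 300_000
for plan in TTS_CHARS_PER_MINUTE:
    print(plan, minutes_to_synthesize(plan, chapter_chars))
```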
Common Use Cases
Content Creation
Podcasts, audiobooks, video narration
Accessibility
Screen readers, audio descriptions
Customer Service
IVR systems, chatbot voices
Education
Language learning, audio lessons
Transcription Services
Meeting notes, interview transcripts
Voice Assistants
Smart devices, mobile apps
Media Processing
Subtitle generation, content analysis
Translation
Real-time interpretation, multilingual content
Best Practices
Text-to-Speech Optimization
- Use punctuation for natural pauses
- Spell out numbers and abbreviations
- Add SSML tags for emphasis
- Choose appropriate voice for content tone
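The pause and emphasis tips above map onto the standard SSML elements `<break>` and `<emphasis>`. The sketch below assembles a small SSML document with Python's `xml.etree.ElementTree`. Which SSML elements a given model accepts is not stated on this page, so treat the tag choice as an assumption; `build_ssml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_ssml(sentence: str, emphasized: str, pause_ms: int = 500) -> str:
    """Return SSML: a plain sentence, a pause, then an emphasized sentence."""
    speak = ET.Element("speak")
    speak.text = sentence + " "
    # <break> inserts an explicit pause of the given length.
    ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    # <emphasis> asks the voice to stress the enclosed text.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = emphasized
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Your order has shipped.", "Delivery is on Friday.")
print(ssml)
```

Building the markup with an XML library instead of string concatenation means characters such as `&` and `<` in the input text are escaped automatically.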
Speech-to-Text Optimization
- Use high-quality audio (16 kHz sample rate or higher)
- Minimize background noise
- Specify language for better accuracy
- Use appropriate model for use case
Performance Tips
- Batch process multiple files
- Use streaming for real-time applications
- Cache frequently used audio
- Implement proper error handling
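The caching tip can be sketched as a small in-memory cache keyed by model, voice, and text. The `synthesize` callable stands in for whatever function actually produces audio; it is passed in, so nothing here assumes a specific client library. `AudioCache` and `fake_synthesize` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict


class AudioCache:
    """In-memory cache of synthesized audio, keyed by model, voice, and text."""

    def __init__(self, synthesize: Callable[[str, str, str], bytes]) -> None:
        self._synthesize = synthesize
        self._store: Dict[str, bytes] = {}

    @staticmethod
    def key(model: str, voice: str, text: str) -> str:
        # The separator avoids collisions such as ("ab", "c") vs ("a", "bc").
        raw = "\x1f".join((model, voice, text)).encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, model: str, voice: str, text: str) -> bytes:
        """Return cached audio, synthesizing only on the first request."""
        k = self.key(model, voice, text)
        if k not in self._store:
            self._store[k] = self._synthesize(model, voice, text)
        return self._store[k]


calls = []


def fake_synthesize(model: str, voice: str, text: str) -> bytes:
    # Stand-in for a real synthesis call; records each invocation.
    calls.append(text)
    return f"{model}/{voice}/{text}".encode("utf-8")


cache = AudioCache(fake_synthesize)
first = cache.get("tts-1", "alloy", "Welcome back!")
second = cache.get("tts-1", "alloy", "Welcome back!")
print(len(calls))
```

A production cache would likely persist audio to disk or object storage and bound its size; the dictionary here only illustrates the keying scheme.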