
Audio Models Overview

Process audio input and generate speech output with multimodal AI models. AnyAPI provides access to models that can understand audio content and generate natural-sounding speech through the chat completions API.

Available Models

Audio Input & Output Models

These models can both understand audio input and generate audio output:
  • GPT-4o Audio Preview (OpenAI): Multimodal model with full audio input/output, function calling, and reasoning
  • GPT Audio Mini (OpenAI): Lightweight audio model for faster and more affordable audio processing

Audio Input Models

These models can process and understand audio content but do not generate audio output:

Google Models

  • Gemini 2.5 Pro: Advanced model with audio understanding, reasoning, and up to 8.4 hours of audio input
  • Gemini 2.5 Flash: Fast model with audio input, vision, and reasoning capabilities
  • Gemini 2.5 Flash Lite: Lightweight version of Gemini Flash with audio support
  • Gemini 3.1 Pro Preview: Latest generation with enhanced audio understanding
  • Gemini 3 Pro Preview: Next-gen model with audio input support
  • Gemini 3 Flash Preview: Fast next-gen model with audio capabilities
  • Gemini 2.0 Flash: Stable model with audio, video, and image understanding
  • Gemini 2.0 Flash Lite: Lightweight multimodal model with audio support

Mistral Models

  • Voxtral Small 24B: Speech-to-text model with audio understanding capabilities

Model Capabilities

Audio Understanding

Analyze and understand spoken content in audio files

Speech Generation

Generate natural-sounding speech within chat conversations

Audio Analysis

Extract insights, transcribe, and summarize audio content

Multimodal Reasoning

Combine audio understanding with text and image reasoning

Audio via Chat Completions API

All audio capabilities are available through the chat completions endpoint:
POST /v1/chat/completions

Audio Input Example

Send audio content for analysis or transcription:
curl -X POST "https://api.anyapi.ai/v1/chat/completions" \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-2.5-flash",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Transcribe this audio and provide a summary"
          },
          {
            "type": "input_audio",
            "input_audio": {
              "data": "<base64-encoded-audio>",
              "format": "mp3"
            }
          }
        ]
      }
    ]
  }'
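
The `<base64-encoded-audio>` placeholder above stands for the raw bytes of the audio file encoded as base64 text. One possible way to build that message in Python is sketched below. The helper names `build_audio_message` and `build_request_body` are illustrative assumptions, not part of any AnyAPI library; the field names are copied from the curl example.

```python
import base64


def build_audio_message(audio_bytes: bytes, audio_format: str, prompt: str) -> dict:
    """Pair a text prompt with base64-encoded audio in one user message.

    Hypothetical helper: the structure mirrors the curl example above.
    """
    # JSON cannot carry raw bytes, so the audio is sent as base64 text.
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "input_audio",
                "input_audio": {"data": encoded, "format": audio_format},
            },
        ],
    }


def build_request_body(model: str, message: dict) -> dict:
    """Wrap a single message in a chat completions request body."""
    return {"model": model, "messages": [message]}
```

To use it, read a local file with something like `pathlib.Path("meeting.mp3").read_bytes()` (the file name is a placeholder), pass the bytes to `build_audio_message`, and send the result of `build_request_body` as the JSON body of the POST request.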

Audio Output Example

Generate speech using models with audio output support:
Python
import requests

response = requests.post(
    "https://api.anyapi.ai/v1/chat/completions",
    headers={
        "Authorization": "Bearer your_api_key",
        "Content-Type": "application/json"
    },
    json={
        "model": "openai/gpt-4o-audio-preview",
        "modalities": ["text", "audio"],
        "audio": {
            "voice": "alloy",
            "format": "mp3"
        },
        "messages": [
            {
                "role": "user",
                "content": "Tell me a short story about a robot learning to paint"
            }
        ]
    }
)

print(response.json())
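
Printing the raw JSON shows the audio only as a long base64 string. The sketch below decodes it to bytes so it can be saved to a file. It assumes the response follows the OpenAI chat completions audio shape, with the payload under `choices[0].message.audio` in `data` and `transcript` fields; this page does not confirm that shape, so check an actual response before relying on it. The helper name `extract_audio` is hypothetical.

```python
import base64
from typing import Optional, Tuple


def extract_audio(response_json: dict) -> Tuple[bytes, Optional[str]]:
    """Return (audio_bytes, transcript) from a chat completions response.

    Assumption: audio is at choices[0].message.audio with base64 "data"
    and an optional "transcript", as in OpenAI's audio responses.
    """
    message = response_json["choices"][0]["message"]
    audio = message.get("audio")
    if not audio or "data" not in audio:
        raise ValueError("response contains no audio payload")
    # Decode the base64 text back into the raw audio bytes.
    return base64.b64decode(audio["data"]), audio.get("transcript")
```

With the request above, `audio_bytes, transcript = extract_audio(response.json())` followed by writing `audio_bytes` to a file opened in `"wb"` mode would produce a playable file in the requested format, provided the assumed response shape holds.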

Model Comparison

| Model | Provider | Audio Input | Audio Output | Strengths | Access |
|---|---|---|---|---|---|
| GPT-4o Audio Preview | OpenAI | Yes | Yes | Full audio I/O, function calling | Basic |
| GPT Audio Mini | OpenAI | Yes | Yes | Speed, affordability | Basic |
| Gemini 2.5 Pro | Google | Yes | No | Quality, long audio (8.4h) | Basic |
| Gemini 2.5 Flash | Google | Yes | No | Speed, long audio (8.4h) | Basic |
| Gemini 3.1 Pro Preview | Google | Yes | No | Latest generation | Basic |
| Gemini 3 Flash Preview | Google | Yes | No | Fast next-gen | Basic |
| Voxtral Small 24B | Mistral | Yes | No | Speech understanding | Basic |

Audio Input Limits

| Parameter | Value |
|---|---|
| Max audio length | 8.4 hours (Gemini models) |
| Max audio files per prompt | 1 |
| Supported formats | MP3, WAV, FLAC, M4A, OGG, WebM |
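
These limits can be checked on the client before uploading. The sketch below is one possible pre-flight check, not an official validator: it assumes the file extension reflects the actual format and that the caller already knows the duration in seconds. The name `check_audio_input` is hypothetical, and the 8.4-hour ceiling is the value listed for Gemini models.

```python
from pathlib import Path

# Values taken from the limits table above.
SUPPORTED_FORMATS = {"mp3", "wav", "flac", "m4a", "ogg", "webm"}
MAX_AUDIO_SECONDS = 8.4 * 3600  # 8.4 hours, listed for Gemini models


def check_audio_input(filename: str, duration_seconds: float) -> list:
    """Return a list of problems found; an empty list means no issue detected."""
    problems = []
    # Assumption: the extension matches the real container/codec.
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if duration_seconds > MAX_AUDIO_SECONDS:
        problems.append("audio is longer than 8.4 hours")
    return problems
```

Because only one audio file is allowed per prompt, a caller with several recordings would run this check on each file and send them in separate requests.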

Common Use Cases

Transcription

Convert audio recordings to text with multimodal models

Audio Summarization

Get summaries and insights from podcasts, meetings, and calls

Speech Generation

Generate natural-sounding speech in chat conversations

Audio Q&A

Ask questions about audio content and get detailed answers

Best Practices

Audio Input Optimization

  • Use high-quality audio (a sample rate of 16 kHz or higher) for better accuracy
  • Minimize background noise
  • Specify the task clearly in your text prompt (e.g., “transcribe”, “summarize”, “translate”)
  • For long audio files, consider Gemini models, which support up to 8.4 hours of audio
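
The sample-rate recommendation above can be verified locally for WAV files with Python's standard `wave` module. This is a minimal sketch: the helper names are hypothetical, it handles WAV only, and other formats would need a third-party audio library.

```python
import wave

MIN_SAMPLE_RATE_HZ = 16_000  # recommended minimum from the list above


def wav_sample_rate(source) -> int:
    """Return the sample rate in Hz of a WAV file (path or binary file object)."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def meets_recommended_rate(source) -> bool:
    """True if the WAV file is sampled at 16 kHz or higher."""
    return wav_sample_rate(source) >= MIN_SAMPLE_RATE_HZ
```

A recording that fails this check may still be accepted by the API; the recommendation concerns accuracy, not a hard limit.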

Audio Output Optimization

  • Choose the appropriate voice for your content tone
  • Use the audio parameter to control voice and format settings
  • Consider GPT Audio Mini for high-volume, cost-sensitive use cases

Getting Started

Quick Start

Process your first audio

SDKs

Use our libraries