POST /audio/transcriptions
```shell
curl --request POST \
  --url https://api.anyapi.ai/v1/audio/transcriptions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: multipart/form-data' \
  --form file='@example-file' \
  --form model=gpt-4o-transcribe
```
```json
{
  "text": "<string>",
  "language": "<string>",
  "duration": 123,
  "words": [
    {
      "word": "<string>",
      "start": 123,
      "end": 123
    }
  ],
  "segments": [
    {
      "id": 123,
      "seek": 123,
      "start": 123,
      "end": 123,
      "text": "<string>",
      "tokens": [
        123
      ],
      "temperature": 123,
      "avg_logprob": 123,
      "compression_ratio": 123,
      "no_speech_prob": 123
    }
  ]
}
```
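The curl request above can also be built with Python's standard library, which has no multipart encoder, so the body is assembled by hand. This is a minimal sketch, assuming the endpoint accepts an ordinary multipart/form-data body exactly as curl sends it; the helper name `build_transcription_request` is hypothetical, and the function only builds the request without sending it.

```python
import urllib.request
import uuid

API_URL = "https://api.anyapi.ai/v1/audio/transcriptions"


def build_transcription_request(api_key: str, filename: str, audio: bytes,
                                model: str = "gpt-4o-transcribe") -> urllib.request.Request:
    """Build (but do not send) a multipart POST mirroring the curl example.

    Hypothetical helper; assumes `filename` contains no double quotes.
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode()
    parts = [
        # Text field: the model ID.
        delimiter,
        b'Content-Disposition: form-data; name="model"',
        b"",
        model.encode(),
        # File field: the raw audio bytes, not just the file name.
        delimiter,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        audio,
        # Closing delimiter ends the multipart body.
        delimiter + b"--",
        b"",
    ]
    body = b"\r\n".join(parts)
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            # The boundary must be declared so the server can split the parts.
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and decode the JSON body of the response.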

Authorizations

Authorization (string, header, required)

Bearer token authentication: send the header as Authorization: Bearer <token>, where <token> is the API key from your dashboard.

Body (multipart/form-data)

file (file, required)

The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
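A client can reject obviously unsupported files before uploading them. The sketch below checks only the file extension against the formats listed above; the function name is hypothetical, and because it never inspects the file contents, the server remains the authority on what it accepts.

```python
from pathlib import Path

# Formats listed for the `file` parameter of this endpoint.
SUPPORTED_AUDIO_FORMATS = frozenset(
    {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
)


def has_supported_extension(path: str) -> bool:
    """Hypothetical client-side pre-check based only on the file extension.

    A mislabeled file can pass this check and still be rejected by the server.
    """
    suffix = Path(path).suffix.lower().lstrip(".")
    return suffix in SUPPORTED_AUDIO_FORMATS
```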

model (string, required)

ID of the model to use.

Example: "gpt-4o-transcribe"

language (string)

The language of the input audio. Supplying it as an ISO-639-1 code (for example, en) improves accuracy and latency.

prompt (string)

Optional text to guide the model's style or to continue a previous audio segment.

response_format (enum<string>, default: json)

The format of the transcript output.

Available options: json, text, srt, verbose_json, vtt
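Because the response body depends on response_format, a client needs to decode it accordingly. This minimal sketch assumes, without confirmation from this page, that json and verbose_json return a JSON object while text, srt and vtt return the transcript as plain UTF-8 text; the function name is hypothetical.

```python
import json
from typing import Union

# Assumption: these formats return a JSON object.
JSON_FORMATS = {"json", "verbose_json"}
# Assumption: these formats return plain text (subtitle formats included).
TEXT_FORMATS = {"text", "srt", "vtt"}


def decode_transcription(body: bytes, response_format: str = "json") -> Union[dict, str]:
    """Decode a response body according to the requested response_format."""
    if response_format in JSON_FORMATS:
        return json.loads(body)
    if response_format in TEXT_FORMATS:
        return body.decode("utf-8")
    raise ValueError(f"unsupported response_format: {response_format!r}")
```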
temperature (number, default: 0)

The sampling temperature, between 0 and 1.

Required range: 0 <= x <= 1

timestamp_granularities (enum<string>[])

The timestamp granularities to populate for this transcription.

Available options: word, segment

stream (boolean, default: false)

If true, partial transcription results are sent as server-sent events.
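When stream is enabled, the client has to split the server-sent event stream into individual events. The sketch below follows the basic SSE framing (events end at a blank line; their payload is the data: lines joined by newlines). This page does not document the payload of each event, so decoding it, for example with json.loads, is left to the caller; the function name is hypothetical.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete server-sent event.

    Comment lines (starting with ":") and fields other than "data" are
    ignored. An event not terminated by a blank line is discarded, as the
    SSE specification requires.
    """
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if any.
            if data:
                yield "\n".join(data)
                data = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single space after the colon is not part of the value.
            data.append(value[1:] if value.startswith(" ") else value)
```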

include (enum<string>[])

Additional data to include in the response.

Available options: logprobs
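The optional body parameters above can be validated and flattened into form fields before the request is built. In this sketch, array parameters are sent as repeated name[] fields; that encoding is an assumption about how the server parses multipart arrays, not something this page specifies. The function name is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}
GRANULARITIES = {"word", "segment"}
INCLUDE_OPTIONS = {"logprobs"}


def optional_form_fields(
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    response_format: str = "json",
    temperature: float = 0.0,
    timestamp_granularities: Sequence[str] = (),
    stream: bool = False,
    include: Sequence[str] = (),
) -> List[Tuple[str, str]]:
    """Validate the optional parameters and return them as (name, value) pairs."""
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unknown response_format: {response_format!r}")
    if not 0 <= temperature <= 1:
        raise ValueError("temperature must satisfy 0 <= x <= 1")
    if not set(timestamp_granularities) <= GRANULARITIES:
        raise ValueError("timestamp_granularities accepts only 'word' and 'segment'")
    if not set(include) <= INCLUDE_OPTIONS:
        raise ValueError("include accepts only 'logprobs'")

    fields: List[Tuple[str, str]] = [
        ("response_format", response_format),
        ("temperature", str(temperature)),
    ]
    if language is not None:
        fields.append(("language", language))  # ISO-639-1 code, e.g. "en"
    if prompt is not None:
        fields.append(("prompt", prompt))
    if stream:
        fields.append(("stream", "true"))
    # Assumed encoding for array parameters: one "name[]" field per item.
    fields += [("timestamp_granularities[]", g) for g in timestamp_granularities]
    fields += [("include[]", item) for item in include]
    return fields
```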

Response

Successful response

text (string, required)

The transcribed text.

language (string)

The language of the input audio.

duration (number)

The duration of the input audio, in seconds.

words (object[])

Extracted words with their start and end timestamps (present when timestamp_granularities includes "word").

segments (object[])

Segments of the transcribed text with their details, such as id, start, end, text and tokens (present when timestamp_granularities includes "segment").
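The words and segments arrays can be read into typed records on the client. This minimal sketch keeps only a subset of the fields shown in the example response and assumes that start and end are in seconds, like the top-level duration field; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Word:
    word: str
    start: float
    end: float


@dataclass
class Segment:
    id: int
    start: float
    end: float
    text: str


def parse_words(payload: Dict[str, Any]) -> List[Word]:
    # "words" is absent unless timestamp_granularities included "word".
    return [Word(w["word"], w["start"], w["end"]) for w in payload.get("words", [])]


def parse_segments(payload: Dict[str, Any]) -> List[Segment]:
    # Keeps id, start, end and text; tokens, avg_logprob, etc. are dropped.
    return [Segment(s["id"], s["start"], s["end"], s["text"])
            for s in payload.get("segments", [])]


def timeline(segments: List[Segment]) -> List[str]:
    # Assumes start/end are in seconds, like the top-level duration field.
    return [f"[{s.start:.2f}s - {s.end:.2f}s] {s.text.strip()}" for s in segments]
```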