The transcribe verb generates real-time transcriptions of speech. This verb can be nested only within the following verbs:
When nested within a dial verb, transcribe provides long-running transcription of a phone call.
dial
{
  "verb": "dial",
  "actionHook": "dial",
  "callerId": "+491173331212",
  "answerOnBridge": true,
  "dtmfCapture": ["*2", "*3"],
  "dtmfHook": {
    "url": "/dtmf",
    "method": "GET"
  },
  "amd": {
    "actionHook": "amd",
    "recognizer": {
      "vendor": "microsoft",
      "language": "en-US"
    }
  },
  "transcribe": {
    "transcriptionHook": "http://example.com/transcribe",
    "recognizer": {
      "vendor": "Google",
      "language": "en-US",
      "interim": true
    }
  },
  "target": [
    {
      "type": "phone",
      "number": "+49XXXXXXXXXXX",
      "trunk": "Twilio"
    },
    {
      "type": "sip",
      "sipUri": "sip:49XXXXXXXXXXX@sip.myTrunk.com",
      "auth": {
        "username": "John",
        "password": "Doe"
      }
    },
    {
      "type": "user",
      "name": "jane@sip.example.com"
    }
  ]
}
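The transcriptionHook in the example above receives an HTTP POST for each partial or final result. The following is a minimal sketch of a receiver using only the Python standard library. It assumes the request body is JSON, which this page does not state, and it makes no assumption about which fields the payload contains. The names parse_transcription_post, TranscribeHandler, and serve, the port 3000, and the 200 response are all illustrative choices rather than part of the platform.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_transcription_post(body: bytes) -> dict:
    """Decode the body of a transcriptionHook POST.

    Assumption: the body is a JSON object. The payload fields are not
    documented here, so the result is returned as a plain dict.
    """
    payload = json.loads(body.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return payload


class TranscribeHandler(BaseHTTPRequestHandler):
    """Accepts POST requests on any path and logs the decoded payload."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = parse_transcription_post(self.rfile.read(length))
        except (ValueError, UnicodeDecodeError):
            # Malformed or non-object body: reject the request.
            self.send_response(400)
            self.end_headers()
            return
        print("transcription event:", payload)
        # Assumption: a plain 200 response is an acceptable acknowledgement.
        self.send_response(200)
        self.end_headers()


def serve(port: int = 3000) -> None:
    """Run the receiver until interrupted (hypothetical port)."""
    HTTPServer(("0.0.0.0", port), TranscribeHandler).serve_forever()
```

Calling serve() starts the receiver. Because interim results can arrive frequently when recognizer.interim is true, a real handler would likely hand the payload to a queue rather than process it inline.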

Configuration

The following table lists the available parameters:
| Parameter | Type | Description | Required |
| --- | --- | --- | --- |
| transcriptionHook | string | A webhook URL where the system sends an HTTP POST whenever a partial or final transcription result is available from the provider. This allows your application to process or store transcripts in real time. | Yes |
| translationHook | string | A webhook URL where the system sends an HTTP POST whenever a translation of the transcribed text is available. Only used if translation is enabled. Useful for multilanguage workflows. | No |
| recognizer | object | Contains configuration options for the speech recognition engine. This includes language selection, hints, diarization, and other advanced settings. | No |
| recognizer.vendor | string | The speech recognition provider to use, for example, Google, Amazon, or Azure. The vendor determines transcription quality, supported languages, and feature availability. | Yes |
| recognizer.label | string | A custom label to identify this recognizer instance in logs or dashboards. Helpful when multiple recognizers are configured. | No |
| recognizer.language | string | The primary language code for transcription, for example, en-US for English, fr-FR for French. Determines how speech is interpreted. | No |
| recognizer.hints | array | An array of words or phrases that may appear in the audio and should be recognized more accurately. Useful for domain-specific terms, names, or technical vocabulary. | No |
| recognizer.hintsBoost | number | A numeric value specifying how strongly the recognizer should prioritize the hint words. Higher numbers give stronger emphasis, improving accuracy for key terms. | No |
| recognizer.altLanguages | array | An array of additional language codes that the recognizer can use for multilingual audio. Allows recognition of mixed-language content. | No |
| recognizer.profanityFilter | boolean | If true, the recognizer will automatically remove or mask profanity from the transcription output. | No |
| recognizer.interim | boolean | If true, returns partial transcription results as the audio is being processed. Useful for live captions or real-time feedback. | No |
| recognizer.punctuation | boolean | If true, punctuation marks, for example, periods or commas, are included in the transcription to improve readability. | No |
| recognizer.diarization | boolean | If true, enables speaker diarization, which assigns segments of the transcript to individual speakers. | No |
| recognizer.diarizationMinSpeakers | number | The minimum number of speakers expected in the audio. Helps the diarization algorithm distinguish between speakers accurately. | No |
| recognizer.diarizationMaxSpeakers | number | The maximum number of speakers expected in the audio. Prevents the algorithm from splitting speech unnecessarily. | No |
| recognizer.vad | object | Voice Activity Detection settings. Determines how the system detects when someone is speaking vs. silence, improving transcription timing and accuracy. | No |
| recognizer.fallbackVendor | string | Specifies an alternative transcription vendor to use if the primary vendor fails. Ensures reliability in critical workflows. | No |
| recognizer.fallbackLanguage | string | Language code to use for the fallback vendor. Must match a language supported by the fallback provider. | No |
| earlyMedia | boolean | If true, transcription starts as soon as audio begins, even before the call is answered. Useful for capturing pre-call audio, for example, IVR prompts. The default value is false. | No |
| channel | number | A number specifying which audio channel to transcribe in multichannel recordings. Each channel is a separate audio track. For example, 0 is left and 1 is right in stereo, or 0, 1, 2 for individual participants or microphones in multi-speaker recordings. | No |
Additional vendor-specific options are available through properties like deepgramOptions, googleOptions, azureOptions, awsOptions, nuanceOptions, ibmOptions, nvidiaOptions, cobaltOptions, sonioxOptions, verbioOptions, speechmaticsOptions, assemblyAiOptions, openaiOptions, and customOptions.
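As a rough illustration of the Required column, the sketch below checks a transcribe configuration before it is sent. It is not part of the platform: validate_transcribe is a hypothetical helper, and the rule that diarizationMinSpeakers should not exceed diarizationMaxSpeakers is an assumption inferred from the parameter descriptions, not a documented constraint.

```python
def validate_transcribe(config: dict) -> list:
    """Return a list of problems found in a transcribe config (empty if none)."""
    problems = []

    # transcriptionHook is marked as required in the parameter table.
    if not isinstance(config.get("transcriptionHook"), str):
        problems.append("transcriptionHook is required and must be a string")

    recognizer = config.get("recognizer")
    if recognizer is None:
        return problems  # recognizer itself is optional
    if not isinstance(recognizer, dict):
        problems.append("recognizer must be an object")
        return problems

    # recognizer.vendor is required whenever a recognizer is supplied.
    if not isinstance(recognizer.get("vendor"), str):
        problems.append("recognizer.vendor is required and must be a string")

    # Assumption: the minimum speaker count should not exceed the maximum.
    lo = recognizer.get("diarizationMinSpeakers")
    hi = recognizer.get("diarizationMaxSpeakers")
    if lo is not None and hi is not None and lo > hi:
        problems.append("diarizationMinSpeakers exceeds diarizationMaxSpeakers")

    return problems


# The transcribe block from the dial example on this page.
config = {
    "transcriptionHook": "http://example.com/transcribe",
    "recognizer": {"vendor": "Google", "language": "en-US", "interim": True},
}
print(validate_transcribe(config))  # prints []
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in one pass.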

More Information