The say verb converts text to speech and plays it to the caller. It supports multiple TTS vendors, languages, voices, and advanced options such as looping, early media, and fallback synthesizers.
{
  "verb": "say",
  "text": [
    "Hi there!",
    "Welcome to our service."
  ],
  "synthesizer": {
    "vendor": "google",
    "label": "primary-tts",
    "language": "en-US",
    "voice": "en-US-Wavenet-F",
    "engine": "neural",
    "gender": "FEMALE",
    "fallbackVendor": "aws",
    "fallbackLabel": "fallback-tts",
    "fallbackLanguage": "en-GB",
    "fallbackVoice": "Amy",
    "options": {
      "speakingRate": 1.0,
      "pitch": 0.0,
      "volumeGainDb": 0.0
    }
  },
  "loop": 2,
  "earlyMedia": true,
  "disableTtsCache": false
}
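
Most of the fields in the example above are optional. As a minimal sketch, assuming the application already has a default TTS vendor and voice configured, a say verb needs only the text parameter. The prompt wording is made up for illustration:

```json
{
  "verb": "say",
  "text": "Your call is important to us. Please stay on the line."
}
```

With no synthesizer given, the voice falls back to the Application-level TTS voice, and loop, earlyMedia, and disableTtsCache take their defaults of 1, false, and false.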

Configuration

The following table lists the available parameters:

| Parameter | Type | Description | Required |
| --- | --- | --- | --- |
| text | string or array | The text to speak, given as a string or an array of strings. May be plain text or SSML, which allows speech effects such as pauses, emphasis, or embedded audio. For example, <speak>Hello, <break time='1s'/> welcome!</speak>. | Yes |
| synthesizer | string or object | The text-to-speech options. Either an object with detailed TTS settings or a string referencing a pre-configured synthesizer. | No |
| synthesizer.vendor | string | The TTS service provider to use, for example google or aws. The vendor determines which voices, languages, and engine options are available. See supported speech vendors or add a custom speech API. | No |
| synthesizer.label | string | A custom label for this synthesizer instance. Useful for logging or for telling apart multiple TTS configurations in a single flow. Can be null or a string. | No |
| synthesizer.language | string | The language code for the speech output, for example en-US or de-DE. Required if a vendor is defined. | No |
| synthesizer.voice | string or object | The voice to use. Either a string holding the vendor-specific voice name, for example en-US-Wavenet-F for Google TTS, or an object with advanced properties. Defaults to the Application-level TTS voice. | No |
| synthesizer.engine | string | The TTS engine type. Options are: standard (the default engine), neural (a high-quality natural voice), generative (an experimental AI voice), and long-form (optimized for long text). | No |
| synthesizer.gender | string | The desired voice gender: MALE, FEMALE, or NEUTRAL. Applies only to vendors that support gender selection. | No |
| synthesizer.fallbackVendor | string | An alternative TTS vendor to use if the primary vendor fails or returns an error. For example, aws. | No |
| synthesizer.fallbackLabel | string | A label for the fallback TTS instance. Helps distinguish the primary and fallback synthesizers in logs. | No |
| synthesizer.fallbackLanguage | string | The language code for the fallback synthesizer, for example en-US or de-DE. Defaults to the primary language. | No |
| synthesizer.fallbackVoice | string or object | The voice for the fallback synthesizer, as a string or an object. For example, Amy for AWS. | No |
| synthesizer.options | object | Vendor-specific TTS options that control speech speed, pitch, and volume, such as speakingRate (0.25 to 4.0), pitch (-20 to 20), or volumeGainDb (-96 to 16). | No |
| loop | number | The number of times to play the utterance. Set to 0 to repeat indefinitely. Defaults to 1. Useful for repeated prompts or hold messages. | No |
| earlyMedia | boolean | If true and the call is not yet answered, plays the audio without answering the call. Useful for IVR previews or ringback messages. Defaults to false. | No |
| disableTtsCache | boolean | If true, disables caching of the synthesized audio for this utterance. Defaults to false. | No |
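
To illustrate how several of these parameters combine, the sketch below plays an SSML prompt twice as early media and slows the speech slightly. The option values are arbitrary picks inside the ranges listed for synthesizer.options. Those options are vendor-specific, so whether they take effect depends on the vendor in use:

```json
{
  "verb": "say",
  "text": "<speak>Hello, <break time='1s'/> welcome!</speak>",
  "synthesizer": {
    "vendor": "google",
    "language": "en-US",
    "voice": "en-US-Wavenet-F",
    "options": {
      "speakingRate": 0.9,
      "pitch": -2.0,
      "volumeGainDb": 3.0
    }
  },
  "loop": 2,
  "earlyMedia": true
}
```

Because language is required whenever a vendor is defined, the two are set together here. The fallback fields are left out, so this sketch has no alternative synthesizer to switch to if the primary vendor fails.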