Explore and learn about voice and multimodal standards without making a huge investment in time and money. Even if you find a standard doesn’t fit the bill, what you learn might be relevant for the next application. And if a standard does meet your requirements, then your project will have a big leg up over applications where the components are proprietary.
Here are some of the more prominent standards for speech, natural language, and multimodal systems.
Speech Synthesis Markup Language (SSML)
A W3C standard that puts the finishing touches on how speech is pronounced at a detailed level. SSML instructs the text-to-speech (TTS) synthesizer about specifics of pronunciation, such as which words should be emphasized and where pauses should fall. These adjustments can make a huge difference in how the synthesized speech affects listeners, much like the difference between the line readings of a skilled actor and an ordinary person. A moving speech recited by James Earl Jones will likely sound flat and unconvincing in the voice of your neighbor’s 15-year-old son. With SSML, developers aren’t limited to a TTS system’s default pronunciations; they can make the speech sound exactly the way they want.
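As a minimal sketch, an SSML document can insert a pause and emphasize a word; the namespace and elements below come from the W3C SSML specification, while the sentence itself is just illustrative:

```xml
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Welcome back.
  <!-- pause for half a second before the next phrase -->
  <break time="500ms"/>
  You have <emphasis level="strong">three</emphasis> new messages.
</speak>
```

A TTS engine that supports SSML will render the pause and stress the word "three" rather than reading the sentence in a flat default intonation.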
SSML is widely supported by TTS systems; the TTS technologies used in the Amazon Alexa Skills Kit, Microsoft Cognitive Services, IBM Watson, and Nuance cloud services all accept SSML markup, and most of these products offer online demos. An open-source TTS platform, the Mary system (http://mary.dfki.de/), allows you to experiment with SSML. Authoring SSML directly can be difficult, but authoring tools like the Chant VoiceMarkup Kit or the open-source SSML builder on GitHub can help.
State Chart Extensible Markup Language (SCXML)
Popular standard with open-source support. SCXML is a powerful tool for defining state-based speech and multimodal dialogues. When a user says something or interacts with the screen, an SCXML-based system can react and move to a new state, triggering a display change or a spoken prompt. The state-based approach is helpful for defining how users progress through an app.
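A minimal SCXML state chart might look like the sketch below; the element names and namespace follow the W3C SCXML specification, but the event names (such as `user.spoke`) are hypothetical placeholders for whatever events your application raises:

```xml
<scxml xmlns="http://www.w3.org/2005/07/scxml"
       version="1.0" initial="welcome">
  <state id="welcome">
    <!-- when the user says something, move on to the main menu -->
    <transition event="user.spoke" target="mainMenu"/>
  </state>
  <state id="mainMenu">
    <!-- an exit request ends the dialogue -->
    <transition event="user.exit" target="done"/>
  </state>
  <final id="done"/>
</scxml>
```

Each `<state>` can also attach entry actions, such as playing a prompt or updating the display, so the chart doubles as a map of the whole interaction.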
To make authoring SCXML easier, several editors and visualizers have been developed, such as SCXMLGUI and VisualSC.
Voice Extensible Markup Language (VoiceXML)
The standard for defining voice dialogues is widely implemented and needs no introduction. An open-source implementation, JVoiceXML, is available and would be a good way to start experimenting with VoiceXML.
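A bare-bones VoiceXML dialogue, sketched here with the standard VoiceXML 2.1 namespace, asks a question and echoes the answer; the grammar file `yesno.grxml` is a hypothetical placeholder for a real SRGS grammar:

```xml
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="weatherQuestion">
    <field name="answer">
      <prompt>Would you like to hear the weather?</prompt>
      <!-- hypothetical SRGS grammar recognizing "yes" and "no" -->
      <grammar src="yesno.grxml" type="application/srgs+xml"/>
      <filled>
        <prompt>You said <value expr="answer"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

A platform such as JVoiceXML interprets this document, plays the prompt, listens for an utterance matching the grammar, and runs the `<filled>` block once the field has a value.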
While SSML, SCXML, and VoiceXML are probably the most widely implemented speech standards, implementations of some of the newer standards can also be found.
Multimodal Architecture and Interfaces (MMI)
A W3C specification that defines how the components of a multimodal system, such as speech recognizers, TTS engines, and graphical displays, communicate with one another through standardized life-cycle events.
Emotion Markup Language (EmotionML)
EmotionML, a language for representing emotion, has also been implemented in the Mary TTS system. EmotionML can change pronunciation at a higher level than SSML; rather than emphasizing a single word, EmotionML can make a voice sound angry or happy. The Mary website has an online demo for using EmotionML to tweak the emotions expressed by a synthetic voice.
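A small EmotionML fragment, sketched from the W3C EmotionML 1.0 specification, annotates content with an emotion category; the "big six" vocabulary URI is the one defined in the companion W3C emotion-vocabularies document:

```xml
<emotionml xmlns="http://www.w3.org/2009/10/emotionml"
           category-set="http://www.w3.org/TR/emotion-voc/xml#big6">
  <emotion>
    <!-- strongly happy, on a 0-to-1 scale -->
    <category name="happiness" value="0.8"/>
  </emotion>
</emotionml>
```

A TTS system such as Mary can use an annotation like this to color the entire rendering of a passage, rather than adjusting individual words the way SSML does.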