Here you will find a short introduction to some of the important concepts of the VoiceXML (VXML) language: the structure of a VXML document, applications and application root documents, dialogs, forms, menus, subdialogs, sessions, grammars, events, links and utterances.
The <vxml> element is the root element of every VXML document. The basic structure of a VXML document is as follows:
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
. . .
</vxml>
The XML declaration specifies the XML version, and the version attribute of the <vxml> tag specifies the VoiceXML version.
An application root document is a VXML document named in the optional application attribute of the <vxml> tag. An application is a set of documents sharing the same application root document. For example:
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" application="../../some-path/vxml/common_application_root.vxml">
Whenever the user interacts with a document in an application, its application root document is also loaded. The root document remains loaded while the user transitions between other documents in the same application, and it is unloaded when the user transitions to a document that is not in the application.
While it is loaded, the application root document's variables are available to the other documents as application variables, and its grammars remain active for the duration of the application.
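As a rough illustration of how a root document shares state, the sketch below pairs a root document with one leaf document. The file names (app_root.vxml, leaf.vxml, main_menu.vxml), the variable name greeting, and the phrase "main menu" are all hypothetical, and the inline grammar assumes a platform that accepts SRGS XML grammars.

```xml
<?xml version="1.0"?>
<!-- app_root.vxml: hypothetical application root document -->
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Visible to every leaf document as application.greeting -->
  <var name="greeting" expr="'Welcome back'"/>

  <!-- This grammar stays active in every document of the application -->
  <link next="main_menu.vxml">
    <grammar mode="voice" version="1.0" root="top">
      <rule id="top" scope="public">main menu</rule>
    </grammar>
  </link>
</vxml>
```

```xml
<?xml version="1.0"?>
<!-- leaf.vxml: hypothetical leaf document that names the root above -->
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US"
      application="app_root.vxml">
  <form id="hello">
    <block>
      <!-- Reads the root document's variable through the application scope -->
      <prompt><value expr="application.greeting"/></prompt>
    </block>
  </form>
</vxml>
```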
A VoiceXML document is primarily composed of top-level elements called dialogs. There are two types of dialogs: forms and menus. A document may also have other elements like <meta>, <metadata>, <var>, <script>, <property>, <catch>, and <link>.
The user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs, which define the next document and dialog to use.
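A transition between dialogs might look like the following sketch. The dialog ids and the file name survey.vxml are made up for illustration. A URI with only a fragment selects a dialog in the current document, while a document URI with a fragment selects a dialog in another document.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="intro">
    <block>
      <prompt>Welcome.</prompt>
      <!-- Fragment-only URI: go to another dialog in this document -->
      <goto next="#details"/>
    </block>
  </form>

  <form id="details">
    <block>
      <prompt>Here are the details.</prompt>
      <!-- Document URI plus fragment: go to a dialog in another document -->
      <goto next="survey.vxml#start"/>
    </block>
  </form>
</vxml>
```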
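Forms are the key component of VoiceXML documents. The <form> tag groups sections of input and output together. There can be multiple <form> tags within a VXML document.
A form contains a set of form items (input items such as <field>, which can be filled by user input, and control items such as <block>, which cannot), declarations of non-form-item variables, event handlers, and <filled> actions that run when certain input item variables have been assigned.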
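A small form could be sketched as follows. The form id, field names, and prompts are invented for this example, and the builtin field types number and boolean are assumed to be supported by the platform, which is not guaranteed everywhere.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="reservation">
    <!-- Control item: plays output, collects nothing -->
    <block>Welcome to the reservation line.</block>

    <!-- Input items: each one fills a variable of the same name -->
    <field name="guests" type="number">
      <prompt>How many guests?</prompt>
    </field>
    <field name="confirm" type="boolean">
      <prompt>A table for <value expr="guests"/>. Is that right?</prompt>
    </field>

    <!-- Form-level filled action: runs once both fields are filled -->
    <filled namelist="guests confirm" mode="all">
      <if cond="confirm">
        <prompt>Your table is booked.</prompt>
      <else/>
        <!-- Clearing the fields makes the form ask for them again -->
        <clear namelist="guests confirm"/>
      </if>
    </filled>
  </form>
</vxml>
```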
A menu can be viewed as a form containing a single field whose grammar and whose <filled> action are constructed from the <choice> elements.
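For comparison, a minimal menu might look like this. The target file names are placeholders. The <enumerate/> element reads out the choices, so the prompt and the grammar are both derived from the <choice> elements.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <menu id="main">
    <prompt>Say one of: <enumerate/></prompt>
    <!-- Each choice contributes a phrase to the grammar and a target URI -->
    <choice next="sports.vxml">sports</choice>
    <choice next="weather.vxml">weather</choice>
    <noinput>Please say one of: <enumerate/></noinput>
  </menu>
</vxml>
```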
A subdialog provides a mechanism for invoking a new interaction and then returning to the original form, much like a function call. Variable instances, grammars, and state information of the calling dialog are saved and are available again upon returning to the calling document.
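The function-call analogy can be sketched as below: <param> passes an argument in, and <return> hands a result back. All names here (checkout, get_account, digit_count, number) are hypothetical, and the builtin digits type is assumed to be available on the platform.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Caller: invokes the subdialog and reads its result -->
  <form id="checkout">
    <subdialog name="account" src="#get_account">
      <!-- Argument: initializes the variable of the same name in the callee -->
      <param name="digit_count" expr="6"/>
      <filled>
        <!-- Returned values appear as properties of the subdialog variable -->
        <prompt>Using account <value expr="account.number"/>.</prompt>
      </filled>
    </subdialog>
  </form>

  <!-- Callee: runs in its own execution context -->
  <form id="get_account">
    <var name="digit_count"/>
    <field name="number" type="digits">
      <prompt>Please enter your <value expr="digit_count"/> digit account number.</prompt>
      <filled>
        <return namelist="number"/>
      </filled>
    </field>
  </form>
</vxml>
```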
A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and processed, and ends when requested by the user, a document, or the interpreter context.
Each dialog has one or more speech and/or DTMF grammars associated with it. A grammar defines the words and patterns of words that a user can say, or the keys or key sequences that can be entered, at any particular point in a dialog.
In machine-directed applications, each dialog's grammars are active only when the user is in that dialog. In mixed-initiative applications, where the user and the machine alternate in determining what to do next, some of the dialogs are flagged to make their grammars active even when the user is in another dialog in the same document, or in another loaded document in the same application. In this situation, if the user says something matching another dialog's active grammars, execution transitions to that other dialog, with the user's utterance treated as if it were said in that dialog.
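A field that accepts both speech and key presses could be sketched like this, assuming the platform supports inline SRGS XML grammars. The field name and rule ids are illustrative. Without semantic tags, the field value is simply the matched text, so a key press yields "1" or "2" rather than "yes" or "no".

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="confirm_order">
    <field name="answer">
      <prompt>Say yes or press 1. Say no or press 2.</prompt>

      <!-- Speech grammar: the words the user may say -->
      <grammar mode="voice" version="1.0" root="yn">
        <rule id="yn" scope="public">
          <one-of>
            <item>yes</item>
            <item>no</item>
          </one-of>
        </rule>
      </grammar>

      <!-- DTMF grammar: the keys the user may press -->
      <grammar mode="dtmf" version="1.0" root="keys">
        <rule id="keys" scope="public">
          <one-of>
            <item>1</item>
            <item>2</item>
          </one-of>
        </rule>
      </grammar>
    </field>
  </form>
</vxml>
```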
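One way to flag a dialog so that its grammars stay active elsewhere is the scope attribute. In the sketch below (dialog ids and prompts are invented, and the builtin number type is an assumption about the platform), the menu's grammar is active throughout the document, so saying "balance" while the transfer form is waiting for an amount moves execution to the menu, which then handles the utterance as one of its own choices.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- scope="document" keeps this menu's grammar active in every dialog below -->
  <menu id="main" scope="document">
    <prompt>Say balance or transfer.</prompt>
    <choice next="#balance">balance</choice>
    <choice next="#transfer">transfer</choice>
  </menu>

  <form id="balance">
    <block>
      <prompt>Here is your balance.</prompt>
      <goto next="#main"/>
    </block>
  </form>

  <form id="transfer">
    <!-- While this field listens, the menu's choices can still be matched -->
    <field name="amount" type="number">
      <prompt>How much would you like to transfer?</prompt>
      <filled>
        <prompt>Transferring <value expr="amount"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```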
VoiceXML provides a form-filling mechanism for handling "normal" user input. In addition, VoiceXML defines a mechanism for handling events not covered by the form mechanism.
Events are thrown by the platform under a variety of circumstances, such as when the user does not respond, doesn't respond intelligibly, requests help, etc. The interpreter also throws events if it finds a semantic error in a VoiceXML document.
Events are caught by <catch> elements or their syntactic shorthand (<noinput>, <nomatch>, <help>, and <error>). Each element in which an event can occur may specify catch elements. Catch elements are also inherited from enclosing elements "as if by copy". In this way, common event handling behavior can be specified at any level, and it applies to all lower levels.
A link supports mixed-initiative interaction. It specifies a grammar that is active whenever the user is in the scope of the link. If user input matches the link's grammar, the link either transfers control to its destination URI or throws an event.
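The inheritance of handlers can be illustrated with the sketch below. The document-level handler applies to every dialog, while the handlers inside the field apply only there. The field name, city names, and prompts are made up, and the builtin-free inline grammar assumes SRGS XML support.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Document level: inherited by every dialog and field below -->
  <catch event="error.badfetch">
    <prompt>Sorry, something went wrong.</prompt>
    <exit/>
  </catch>

  <form id="ask_city">
    <field name="city">
      <prompt>Which city?</prompt>
      <grammar mode="voice" version="1.0" root="city">
        <rule id="city" scope="public">
          <one-of>
            <item>Boston</item>
            <item>Denver</item>
          </one-of>
        </rule>
      </grammar>

      <!-- Shorthand for <catch event="noinput"> -->
      <noinput>I did not hear anything. <reprompt/></noinput>
      <!-- Used from the second unrecognized input onwards -->
      <nomatch count="2">Please say Boston or Denver.</nomatch>
      <help>Say the name of the city you are travelling to.</help>
    </field>
  </form>
</vxml>
```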
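Both uses of a link, going to a URI and throwing an event, might be sketched as follows. The phrases, the key 0, and the file name main_menu.vxml are illustrative only. Placed directly under <vxml>, both links are in scope for every dialog in the document.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Link with a destination: saying "main menu" transfers control -->
  <link next="main_menu.vxml">
    <grammar mode="voice" version="1.0" root="menu">
      <rule id="menu" scope="public">main menu</rule>
    </grammar>
  </link>

  <!-- Link with an event: pressing 0 throws help instead of going anywhere -->
  <link event="help">
    <grammar mode="dtmf" version="1.0" root="zero">
      <rule id="zero" scope="public">0</rule>
    </grammar>
  </link>

  <form id="ask_age">
    <field name="age" type="number">
      <prompt>How old are you?</prompt>
      <help>Please say your age in years.</help>
    </field>
  </form>
</vxml>
```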
An utterance is the summary of what the user said or keyed in, including the specific grammar matched, and a semantic result consisting of an interpretation structure or, where there is no semantic interpretation, the raw text of the input.
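The pieces of an utterance can be inspected through shadow variables, as in this sketch. The field name and city names are hypothetical. The field's shadow variable city$ and the application.lastresult$ variable expose the recognized text, the input mode, and the confidence of the last recognition.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="ask_city">
    <field name="city">
      <prompt>Which city?</prompt>
      <grammar mode="voice" version="1.0" root="city">
        <rule id="city" scope="public">
          <one-of>
            <item>Boston</item>
            <item>Denver</item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <!-- Raw words, input mode (voice or dtmf), and recognition confidence -->
        <log>
          heard: <value expr="city$.utterance"/>
          mode: <value expr="city$.inputmode"/>
          confidence: <value expr="application.lastresult$.confidence"/>
        </log>
      </filled>
    </field>
  </form>
</vxml>
```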