Public Draft Technical Report
Offering CSTA (ECMA-269) Voice Services in Web Browsers

(July 2011)

Introduction

One notable addition to CSTA (ECMA-269) since Edition 6 is the enhanced voice services for automatic speech recognition, speech verification, speaker identification, speaker verification, and text-to-speech synthesis. Although historically these functions have mostly been made available through CSTA switching function implementations in call centers, the strong need to access the Internet from mobile and other web-enabled devices has given rise to developments in which ECMA-269 interactive voice devices are implemented as an integral part of web browsers. One such implementation is Speech Application Language Tags (SALT), made available by Microsoft in 2004 as a browser plug-in for Internet Explorer.

Recently, the World Wide Web Consortium (W3C) has begun work on version 5 of the Hypertext Markup Language (HTML) recommendation, the first major revision since HTML 4.01 was adopted as ISO/IEC 15445 in May 2000. Among the new functionality proposed for HTML 5 is native support for multimedia capabilities. Following this trend, Google, on March 22, 2011, released a new version of the Chrome browser that implements a proposal for an HTML 5 Speech Input Application Program Interface (API).

This Technical Report examines the speech input capabilities in the two web browsers and concludes that they are largely compliant with the interactive voice device specification in ECMA-269.

This TR is part of a suite of CSTA Phase III Standards and Technical Reports. All of the Standards and Technical Reports in this Suite are based upon the practical experience of Ecma member companies and each one represents a pragmatic and widely based consensus.

References

This TR provides examples of how subsets of CSTA Interactive Voice services can be included to facilitate browser-based speech processing. ECMA-269, Services for Computer Supported Telecommunications Applications (CSTA) Phase III, should be used as the definitive reference for CSTA. This TR also makes reference to how CSTA Interactive Voice services, adapted from SALT, can be implemented in web browsers in an object-oriented manner. ECMA TR/88 should be used as the reference.

In addition, this TR refers to an HTML speech input API proposal that has been implemented and distributed via Google's Chrome browser. The proposed specification examined by this TR has been published by the W3C HTML Speech Incubator Group at http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Feb/att-0020/api-draft.html.

ECMA-269 Speech Input Capabilities

The Listener Interactive Voice Device (ECMA-269 Clause 6.1.1.4.9) specifies a CSTA device that provides speech input capabilities. The operational model for the Listener resource consists of the following three states:

  1. the NULL state in which the call and the Listener resource are not interacting,
  2. the STARTED state in which the Listener resource is processing the incoming audio, and
  3. the SPEECH DETECTED state in which speech-like sound is detected in the audio and the Listener resource has started the process of matching the audio to the speech grammar.

The services that can prompt the state transitions (ECMA-269 Clause 26.1) and the events that can be raised as a result of the transitions (ECMA-269 Clause 26.2) are depicted in Figure 6-17 of ECMA-269, reproduced below:

[Figure 6-17 of ECMA-269: state transition model of the Listener resource]

A Listener resource can operate in one of three modes, as illustrated in the sketch after the list:

  1. the “multiple” mode allows a self transition on the SPEECH DETECTED state and multiple Recognized events to be raised,
  2. the “automatic” mode (default) will transition from the SPEECH DETECTED state to the NULL state upon the first Recognized event, and
  3. the “single” mode will only make the transition from the SPEECH DETECTED state to the NULL state, raising a single Recognized event, upon an explicit Stop service, preventing occasional pauses in the user’s speech from inadvertently cutting off the recognition process.
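
The following ECMAScript sketch is purely illustrative (the Listener object and its method names are hypothetical, not part of ECMA-269); it models the three states and shows how the mode setting governs the transition out of the SPEECH DETECTED state:

    // Illustrative model of the ECMA-269 Listener states and modes.
    // All names here are hypothetical, not normative.
    function Listener(mode) {
      this.mode = mode || "automatic";  // "multiple" | "automatic" | "single"
      this.state = "NULL";
    }

    // Start service: begin processing the incoming audio.
    Listener.prototype.start = function () {
      this.state = "STARTED";
    };

    // Speech-like sound detected: grammar matching begins.
    Listener.prototype.onSpeechDetected = function () {
      this.state = "SPEECH DETECTED";
    };

    // Recognized event raised: the mode decides what happens next.
    Listener.prototype.onRecognized = function () {
      if (this.mode === "automatic") {
        this.state = "NULL";  // reset upon the first Recognized event
      }
      // "multiple": self transition, stay in SPEECH DETECTED for more results
      // "single": remain until an explicit Stop service is issued
    };

    // Stop service: return to the NULL state.
    Listener.prototype.stop = function () {
      this.state = "NULL";
    };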

Once the Listener resource leaves the NULL state, CSTA further specifies three timeout conditions that may cause it to reset to the NULL state:

  1. Silence timeout: the maximum time allowed before entering the SPEECH DETECTED state,
  2. Babble timeout: the maximum time allowed to stay in the SPEECH DETECTED state, and
  3. Maximum timeout: the maximum time allowed before returning to the NULL state.

Other than the Silence timeout, whose expiration raises a dedicated Silence Timeout Expired event (Clause 26.2.11), the other two timeout expirations raise the Voice Error Occurred event (Clause 26.2.18).
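
Continuing the illustrative sketch above, the three timers could be armed as follows (the raise() helper and the millisecond arguments are hypothetical):

    // Silence timeout: armed on leaving NULL; fires if the SPEECH DETECTED
    // state is not reached in time.
    function armSilenceTimer(listener, silenceMs) {
      return setTimeout(function () {
        listener.raise("Silence Timeout Expired");  // Clause 26.2.11
        listener.stop();                            // reset to NULL
      }, silenceMs);
    }

    // Babble timeout: armed on entering SPEECH DETECTED; fires if the
    // Listener stays in that state for too long.
    function armBabbleTimer(listener, babbleMs) {
      return setTimeout(function () {
        listener.raise("Voice Error Occurred");     // Clause 26.2.18
        listener.stop();
      }, babbleMs);
    }

    // Maximum timeout: armed on leaving NULL; an overall cap before the
    // Listener must return to the NULL state.
    function armMaxTimer(listener, maxMs) {
      return setTimeout(function () {
        listener.raise("Voice Error Occurred");     // Clause 26.2.18
        listener.stop();
      }, maxMs);
    }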

A Listener resource can additionally be configured with a speech grammar to govern its behaviours. For speech recognition purposes, the W3C Speech Recognition Grammar Specification (SRGS) format with W3C Semantic Interpretation for Speech Recognition (SISR) annotation is required to specify the grammar. Two CSTA services, Activate and Deactivate (Clauses 26.1.1 and 26.1.4, respectively), allow the SRGS grammar rules to be put in the active or dormant state. W3C Extensible Multimodal Annotation (EMMA) or Microsoft Semantic Markup Language is required to describe the outcomes when the Recognized event is raised (Clause 26.2.8). All the configuration parameters of a Listener resource can be set or retrieved using the Set Voice Attribute (Clause 26.1.13) and Query Voice Attribute (Clause 26.1.7) requests, respectively.
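
For illustration, a minimal SRGS grammar with SISR annotation that could serve as the city.grxml file referenced in the samples below (the rule content is hypothetical):

    <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
             root="city" xml:lang="en-US" tag-format="semantics/1.0">
      <rule id="city" scope="public">
        <one-of>
          <item>Seattle <tag>out = "Seattle";</tag></item>
          <item>New York <tag>out = "New York";</tag></item>
        </one-of>
      </rule>
    </grammar>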

Listener Implementation in SALT and Chrome

All CSTA Interactive Voice Devices (IVDs) are implemented by SALT as an Internet Explorer browser plug-in. Specifically, the Listener resource is embodied as a <listen> element. The element may use a “src” attribute or one or more <grammar> sub-elements to specify speech grammars. An optional sub-element <param> allows the application to specify the URI of the remote server resources needed.

All three modes and the three timeout mechanisms are implemented by SALT. The mode of the Listener can be specified with a namesake attribute, and the timeout values can be specified as XML attributes of the element. The names of these timeout attributes are “initialtimeout”, “babbletimeout”, and “maxtimeout”, respectively. When the Recognized event is raised, SALT makes available the “result” and “text” parameters (Clause 26.2.8.1) as the namesake properties of the <listen> element.
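
Putting these together, a sketch of a fully configured <listen> element (the handler names and timeout values are illustrative; the event attribute names follow the mapping table below):

    <salt:listen id="listenCity" mode="automatic"
                 initialtimeout="3000" babbletimeout="15000" maxtimeout="30000"
                 onreco="handleSpeechInput()" onsilence="handleSilence()"
                 onerror="handleError()">
      <salt:grammar src="city.grxml"/>
    </salt:listen>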

The Chrome implementation of speech input adheres to the CSTA Listener operational model. The Listener resource is made available by declaring a “speech” attribute on the HTML <input> or <textarea> element. A “grammar” attribute is used to specify the speech grammar. The current proposal only implements the Silence timeout, which is specified in Chrome through a “nospeechtimeout” attribute of the HTML element. Only the “automatic” mode is considered in the current specification. Instead of the W3C EMMA format, the Chrome implementation makes the recognition outcome available exclusively as an ECMAScript (ECMA-262) object.
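
By way of comparison, a speech-enabled text field under the Chrome proposal might look as follows (the attribute values are illustrative):

    <input id="textBoxCity" type="text" speech grammar="city.grxml"
           nospeechtimeout="5000" onspeechchange="handleSpeechInput(event)">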

The following table summarises the syntactical mapping of the Listener resource service requests and events to the SALT and Chrome implementations:

ECMA-269 (clause)                    SALT in Internet Explorer    Chrome
Start (26.1.14)                      Start                        startSpeechInput
Stop (26.1.15)                       Stop                         stopSpeechInput
Clear (26.1.2)                       Cancel                       cancelSpeechInput
Activate (26.1.1)                    Activate                     -
Deactivate (26.1.4)                  Deactivate                   -
Emptied (26.2.4)                     -                            speechend
Interruption Detected (26.2.5)       audiointerrupt               -
Not Recognized (26.2.6)              noreco                       speecherror (error code 5)
Recognized (26.2.8)                  reco                         speechchange
Silence Timeout Expired (26.2.11)    silence                      speecherror (error code 4)
Speech Detected (26.2.12)            speechdetected               speechstart
Started (26.2.13)                    -                            capturestart
Voice Error Occurred (26.2.18)       error                        speecherror

Sample Programs

This section illustrates the similarities between Chrome’s and SALT’s implementations of the ECMA-269 Listener resource, using the sample code in the SALT and Chrome specifications as examples. As can be seen, the two implementations bear a strong resemblance to each other.

Even though the differences are minor, they notably reflect distinct design philosophies. While the Chrome speech API aims to modify the HTML specification itself and is applicable only to a new version of HTML, SALT is designed as a general XML application that can be embedded into any markup language (e.g. SVG, openXML) to provide speech functionality for its hosting environment. As such, all the SALT samples below are able to introduce speech features without changing the HTML specification.

Click-to-Talk Example

This example shows an HTML web page that allows users to enter a city name into a text input element.

SALT:

Because speech input is an add-on to HTML, applications can assign the speech recognition outcome to any HTML element (or elements) by scripting:

    <html xmlns:salt="urn:schemas.saltforum.org/2002/02/SALT">
    ...
    <script type="text/javascript">
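      // onreco handler: copy the recognized text from the <listen>
      // element that raised the event into the city text box.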
      function handleSpeechInput() {
        textBoxCity.value = event.srcElement.text;
      }
    </script>
    ...
    <input id="textBoxCity" type="text">
    <input type="button" name="q" onclick="listenCity.Start()">
    <salt:listen id="listenCity" src="city.grxml" onreco="handleSpeechInput()">
    ...
   

Chrome:

By changing HTML, applications can speech-enable individual HTML input elements in Chrome in a very straightforward manner:

     <input id="textBoxCity" type="text" speech grammar="city.grxml">
   

Search by voice, with "Did you say..."

Chrome:

This example demonstrates how the second-best recognition result can be parsed out and submitted to the web search engine so that the search result page can display a link with the text "Did you say second_best?". The example is taken from Section 2 of Google's HTML 5 Speech API proposal.

    <script type="text/javascript">
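      // speechchange handler: if the recognizer returned more than one
      // hypothesis, stash the second-best utterance in a hidden field
      // before submitting the form.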
      function startSearch(event) {
        if (event.target.results.length > 1) {
          var second = event.target.results[1].utterance;
          document.getElementById("second_best").value = second;
        }
        event.target.form.submit();
      }
    </script>

    <form action="http://www.google.com/search">
    <input type="search" name="q" speech required onspeechchange="startSearch">
    <input type="hidden" name="second_best" id="second_best">
    </form>
   

SALT:

The same scripting method used in the previous example is equally applicable to this scenario.

    <script type="text/javascript">
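      // onreco handler: read the N-best list from the SALT recoresult DOM,
      // fill the query with the top hypothesis, and stash the second-best
      // hypothesis before submitting the form.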
      function startSearch() {
        var nBest = event.srcElement.recoresult.childNodes;
        q.value = nBest.item(0).text;
        if (nBest.length > 1) {
          second_best.value = nBest.item(1).text;
        }
        event.srcElement.form.submit();
      }
    </script>

    <form action="http://www.google.com/search">
    <input type="search" name="q">
    <input type="button" onclick="listenSearch.Start()">
    <salt:listen id="listenSearch" onreco="startSearch()">
    <input type="hidden" name="second_best" id="second_best">
    </form>
   

