Speech Kit Library Guide
The Speech Kit library provides the classes necessary to perform network-based speech recognition and text-to-speech synthesis. This library provides a simple, high-level speech service API that automatically performs all the tasks necessary for speech recognition or synthesis, including audio recording, audio playback, and network connection management.
Organization of This Document
The following sections describe how to connect to a speech server and perform speech recognition or synthesis:
- “Speech Kit Basics” provides an overview of the Speech Kit library.
- “Connecting to a Speech Server” details the top-level server connection process.
- “Recognizing Speech” describes how to use a network recognizer to transcribe speech.
- “Converting Text to Speech” shows how to use the network-based vocalizer to convert text to speech.
Speech Kit Basics
The Speech Kit library allows you to add voice recognition and text-to-speech services to your applications easily and quickly. This library provides access to speech processing components hosted on a server through a clean asynchronous network service API, minimizing overhead and resource consumption. The Speech Kit library lets you provide fast voice search, dictation, and high-quality, multilingual text-to-speech functionality in your application.
Speech Kit Architecture
The Speech Kit library is a full-featured, high-level library that automatically manages all the required low-level services.
At the application level, there are two main components available to the developer: the recognizer and the text-to-speech synthesizer.
Internally, the library coordinates several processes:
- The library fully manages the audio system for recording and playback.
- The networking component manages the connection to the server and, at the start of a new request, automatically re-establishes connections that have timed out.
- The end-of-speech detector determines when the user has stopped speaking and automatically stops recording.
- The encoding component compresses and decompresses the streaming audio to reduce bandwidth requirements and decrease latency.
The server system is responsible for the majority of the work in the speech processing cycle. The complete recognition or synthesis procedure is performed on the server, consuming or producing the streaming audio. In addition, the server manages authentication as configured through the developer portal.
Using Speech Kit
To use Speech Kit, you will need to have the Android SDK installed. Instructions for installing the Android SDK can be found at http://developer.android.com/sdk/index.html. You can use the Speech Kit library in the same way that you would use any standard JAR library.
To start using the Speech Kit library, add it to your new or existing project, as follows:
- Copy the libs folder into the root of the project folder for your Android project. The libs folder contains an armeabi subfolder that contains the file libnmsp_speex.so.
- From the menu select Project ‣ Properties....
- In the popup menu, select Java Build Path from the menu at the left.
- In the right panel of the popup menu, select the Libraries tab.
- Use the Add External JARs button to add nmdp_speech_kit.jar.
Enabling Javadoc for the Speech Kit Library in Eclipse
To view the Javadoc for Speech Kit in Eclipse, you must tell Eclipse where to find the class documentation. This can be done with the following steps:
- In the Package Explorer tab for your project, expand Referenced Libraries.
- Right-click nmdp_speech_kit.jar and select Properties.
- In the popup menu, select Javadoc Location from the menu at the left.
- In the right panel of the popup menu, select the Javadoc URL option.
- Click the Browse button to the right of the Javadoc location path field.
- Browse to and select the Speech Kit Javadoc.
You also need to add the necessary permissions to AndroidManifest.xml:
- In the Package Explorer tab for your project, open AndroidManifest.xml
- Add the following lines immediately before the end of the manifest tag:
<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE"></uses-permission>
<uses-permission android:name="android.permission.INTERNET"></uses-permission>
<uses-permission android:name="android.permission.RECORD_AUDIO"></uses-permission>
<uses-permission android:name="android.permission.READ_PHONE_STATE"></uses-permission>
...
</manifest>
- If you want to use prompts that vibrate, you will need to include the following additional permission:
<uses-permission android:name="android.permission.VIBRATE"></uses-permission>
You are now ready to start using recognition and text-to-speech services.
Speech Kit Errors
While using the Speech Kit library, you will occasionally encounter errors. In this library, errors are reported as instances of the SpeechError class, which carries an error code from the SpeechError.Codes enumeration.
There are effectively two types of errors that can be expected in this framework.
- The first type is service connection errors, which include the SpeechError.Codes.ServerConnectionError and SpeechError.Codes.ServerRetryError codes.
- The second type is speech processing errors, which include the SpeechError.Codes.RecognizerError and SpeechError.Codes.VocalizerError codes.
It is essential to always monitor for errors, as signal conditions may generate errors even in a correctly implemented application. The application’s user interface needs to respond appropriately and gracefully to ensure a robust user experience.
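For example, a minimal error handler might branch on these two families of codes. This is only a sketch: it assumes SpeechError exposes its code through a getErrorCode() accessor and that the SpeechError.Codes values are integer constants; check the class documentation for the actual accessors.

// Sketch of error triage; getErrorCode() is an assumed accessor.
void handleSpeechError(SpeechError error) {
    switch (error.getErrorCode()) {
        case SpeechError.Codes.ServerConnectionError:
        case SpeechError.Codes.ServerRetryError:
            // Connection problem: ask the user to check connectivity and retry.
            break;
        case SpeechError.Codes.RecognizerError:
        case SpeechError.Codes.VocalizerError:
            // Speech processing problem: ask the user to try speaking again.
            break;
        default:
            // Anything else: present a generic failure message.
            break;
    }
}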
Connecting to a Speech Server
The Speech Kit library is a network service and requires some basic setup before you can use either the recognition or text-to-speech classes.
This setup performs two primary operations:
- First, it identifies and authorizes your application.
- Second, it optionally establishes a connection to the speech server immediately, allowing for fast initial speech requests and thus enhancing the user experience.
Note
This network connection requires authorization credentials and server details set by the developer. The necessary credentials are provided through the Dragon Mobile SDK portal at http://dragonmobile.nuancemobiledeveloper.com.
Speech Kit Setup
The application key, SpeechKitApplicationKey, is the credential that authorizes your application with the speech server.
Your unique credentials, provided through the developer portal, include the necessary line of code to set this value. Thus, this process is as simple as copying and pasting the line into your source file. You must set this key before you initialize the Speech Kit system. For example, you configure the application key as follows:
static final byte[] SpeechKitApplicationKey = {
    (byte)0x12, (byte)0x34, ..., (byte)0x89
};
The setup method, SpeechKit.initialize(), takes the following parameters:
- An application Context (android.content.Context)
- An application identifier
- A server address
- A port
- The SSL setting
- The application key defined above.
The appContext parameter is your application’s context, which can be retrieved as follows:
Context context = getApplication().getApplicationContext();
The ID parameter is the application identifier provided with your credentials through the developer portal.
The host and port parameters specify the speech server address, also provided through the developer portal.
The ssl parameter indicates whether the connection to the server uses SSL.
The applicationKey parameter is the application key defined above.
The library is configured in the following example:
SpeechKit sk = SpeechKit.initialize(context, speechKitAppId, speechKitServer,
        speechKitPort, speechKitSsl, speechKitApplicationKey);
Note
This method is meant to be called once per application execution to configure the underlying network connection. This method does not attempt to establish the connection to the server.
At this point the speech server is fully configured. The connection to the server will be established automatically when needed. To make sure the next recognition or vocalization is as fast as possible, connect to the server in advance using the optional connect method:
sk.connect();
Note
This method does not indicate failure. Instead, the success or failure of the setup is known when the Recognizer and Vocalizer classes are used.
When the connection is opened, it will remain open for some period of time, ensuring that subsequent speech requests are served quickly as long as the user is actively making use of speech. If the connection times out and closes, it will be re-opened automatically on the next speech request or call to connect().
The application is now configured and ready to recognize and synthesize speech.
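As a recap, a minimal end-to-end setup is sketched below. The identifier, server, and port values are placeholders, not working credentials; substitute the values issued with your account through the developer portal.

// Placeholder configuration values; replace with your portal credentials.
static final String speechKitAppId = "your_app_id";          // hypothetical
static final String speechKitServer = "speech.example.com";  // hypothetical
static final int speechKitPort = 443;                        // hypothetical
static final boolean speechKitSsl = false;

// Configure the library once per application execution.
Context context = getApplication().getApplicationContext();
SpeechKit sk = SpeechKit.initialize(context, speechKitAppId, speechKitServer,
        speechKitPort, speechKitSsl, SpeechKitApplicationKey);
// Optionally open the connection early so the first request is fast.
sk.connect();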
Recognizing Speech
The recognizer allows users to speak instead of type in locations where text entry would generally be required. The speech recognizer returns a list of text results. It is not attached to any UI object in any way, so the presentation of the best result and the selection of alternative results are left up to the application’s UI.
Speech Recognition Process
Initiating a Recognition
- Before you use speech recognition, ensure that you have set up the core Speech Kit library with the SpeechKit.initialize method.
- Then create and initialize a Recognizer:
recognizer = sk.createRecognizer(Recognizer.RecognizerType.Dictation,
        Recognizer.EndOfSpeechDetection.Short, "en_US", this, handler);
- The SpeechKit.createRecognizer method creates the recognizer and takes the parameters described below.
- The type parameter is a String, generally one of the recognition type constants defined in the Speech Kit library and available in the class documentation for Recognizer. Nuance may provide you with a different value for your unique recognition needs, in which case you will enter the raw String value.
- The detection parameter determines the end-of-speech detection model and must be one of the Recognizer.EndOfSpeechDetection types.
- The language parameter is a String that defines the spoken language in the format of the ISO 639 language code, followed by an underscore “_”, followed by the ISO 3166-1 country code.
Note
For example, the English language as spoken in the United States is en_US. An up-to-date list of supported languages for recognition is available on the FAQ at http://dragonmobile.nuancemobiledeveloper.com/faq.php.
- The this parameter defines the object to receive status, error, and result messages from the recognizer. It can be replaced with any object that implements the Recognizer.Listener interface.
- handler should be an android.os.Handler instance used to deliver the listener messages; it can be created as follows:
Handler handler = new Handler();
- Start the recognition by calling start(). A complete sketch follows this list.
- The Recognizer.Listener passed into SpeechKit.createRecognizer receives the recognition results or error messages, as described in the following sections.
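Putting these steps together, a minimal sketch of starting a dictation from an Activity that implements Recognizer.Listener might look like the following. The SDK package name in the import is an assumption; the listener bodies are filled in over the next sections.

import android.app.Activity;
import android.os.Handler;
import com.nuance.nmdp.speechkit.*; // package name assumed for the Dragon Mobile SDK

public class DictationActivity extends Activity implements Recognizer.Listener {
    private SpeechKit sk; // initialized as described in “Connecting to a Speech Server”
    private Recognizer recognizer;
    private Handler handler;

    private void startDictation() {
        handler = new Handler(); // delivers listener messages on this thread
        recognizer = sk.createRecognizer(Recognizer.RecognizerType.Dictation,
                Recognizer.EndOfSpeechDetection.Short, "en_US", this, handler);
        recognizer.start(); // begins recording and streaming audio to the server
    }

    // Recognizer.Listener callbacks; see the following sections for real bodies.
    public void onRecordingBegin(Recognizer recognizer) { }
    public void onRecordingDone(Recognizer recognizer) { }
    public void onResults(Recognizer recognizer, Recognition results) { }
    public void onError(Recognizer recognizer, SpeechError error) { }
}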
Using Prompts
Prompts are short audio clips or vibrations that are played during a recognition. Prompts may be played at the following stages of the recognition:
- Recording start: the prompt is played before recording. The moment the prompt completes, recording will begin.
- Recording stop: the prompt is played when the recorder is stopped.
- Result: the prompt is played if a successful result is received.
- Error: the prompt is played if an error occurs.
The SpeechKit.defineAudioPrompt method defines an audio prompt from a raw resource ID packaged with the Android application. Audio prompts may consume significant system resources until release is called, so try to minimize the number of instances. The Prompt.vibrate method defines a vibration prompt.
Call SpeechKit.setDefaultRecognizerPrompts to specify audio or vibration prompts to play during all recognitions by default. To override the default prompts in a specific recognition, call setPrompt on the recognizer prior to calling start(). A sketch follows.
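The following sketch is illustrative only: the raw resource IDs are hypothetical, and the signatures of defineAudioPrompt, Prompt.vibrate, and setDefaultRecognizerPrompts (assumed here to take one prompt per stage, in the order listed above) must be checked against the class documentation.

// Hypothetical resources res/raw/start_tone.wav and res/raw/stop_tone.wav.
Prompt startPrompt = sk.defineAudioPrompt(R.raw.start_tone);
Prompt stopPrompt = sk.defineAudioPrompt(R.raw.stop_tone);
Prompt errorBuzz = Prompt.vibrate(300); // assumed: vibration length in milliseconds

// Assumed parameter order: recording start, recording stop, result, error.
sk.setDefaultRecognizerPrompts(startPrompt, stopPrompt, null, errorBuzz);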
Receiving Recognition Results
To retrieve the recognition results, implement the Recognizer.Listener.onResults method:
public void onResults(Recognizer recognizer, Recognition results) {
    String topResult;
    if (results.getResultCount() > 0) {
        topResult = results.getResult(0).getText();
        // do something with topResult...
    }
}
This method will be called only on successful completion, and the results list will have zero or more results.
Even in the absence of an error, there may be a suggestion from the speech server present in the recognition results object. This suggestion should be presented to the user.
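For instance, a minimal check using the getSuggestion accessor mentioned below might look like this; treating an empty string as “no suggestion” is an assumption.

String suggestion = results.getSuggestion();
if (suggestion != null && suggestion.length() > 0) {
    // Present the server’s suggestion to the user, e.g. in a dialog.
}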
Handling Errors
To be informed of any recognition errors, implement the onError method of the Recognizer.Listener interface. In the case of errors, only this method will be called; conversely, on success this method will not be called. In addition to the error, a suggestion, as described in the previous section, may or may not be present. Note that both the Recognition and the SpeechError classes have a getSuggestion method.
public void onError(Recognizer recognizer, SpeechError error) {
    // Inform the user of the error and suggestion
}
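A slightly fuller handler might surface both pieces of information; getErrorDetail is an assumed accessor for a readable message, so verify it against the SpeechError class documentation.

public void onError(Recognizer recognizer, SpeechError error) {
    String detail = error.getErrorDetail(); // assumed accessor
    String suggestion = error.getSuggestion();
    // Show the detail, plus the suggestion when one is present,
    // then return the UI to its idle state.
}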
Managing Recording State Changes
Optionally, to be informed when the recognizer starts or stops recording audio, implement the onRecordingBegin and onRecordingDone methods of the Recognizer.Listener interface. There may be a delay between initialization of the recognizer and the actual start of recording, so the onRecordingBegin message indicates when the system actually begins listening:
public void onRecordingBegin(Recognizer recognizer) {
    // Update the UI to indicate the system is now recording
}
The onRecordingDone message is sent when recording stops:
public void onRecordingDone(Recognizer recognizer) {
    // Update the UI to indicate that recording has stopped and the speech is still being processed
}
This message is sent both with and without end-of-speech detection models in place; it is sent regardless of whether recording was stopped by a call to the stopRecording method or by the end-of-speech detector.
Power Level Feedback
In some scenarios, especially for longer dictations, it is useful to provide a user with visual feedback of the volume of their speech. The Recognizer interface supports this feature with the getAudioLevel method, which returns the relative power level of the recorded audio in decibels. This value is a float ranging from -90.0 to 0.0 dB, where 0.0 is the highest power level and -90.0 is the lowest. This method should be called during recording, specifically between receiving the onRecordingBegin and onRecordingDone messages.
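One plausible way to drive a level meter is to poll getAudioLevel from a Runnable that re-posts itself on the handler while recording is active. The 50 ms interval, the recording flag, and the updateLevelMeter helper are illustrative choices, not part of the API.

private boolean recording; // set in onRecordingBegin, cleared in onRecordingDone

private final Runnable levelPoller = new Runnable() {
    public void run() {
        if (recording) {
            float db = recognizer.getAudioLevel(); // 0.0 (loudest) to -90.0 dB
            updateLevelMeter(db);                  // hypothetical UI helper
            handler.postDelayed(this, 50);         // poll again in 50 ms
        }
    }
};

Start the poller from onRecordingBegin (for example, with handler.post(levelPoller)) and let it stop itself once onRecordingDone clears the flag.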
Converting Text to Speech
The Vocalizer class provides the text-to-speech interface: text is sent to the speech server and the synthesized audio is streamed back for playback.
Text-to-Speech Process
Initiating Text-To-Speech
- Before you use speech synthesis, ensure that you have set up the core Speech Kit library with the SpeechKit.initialize method.
- Then create and initialize a Vocalizer:
Vocalizer voc = sk.createVocalizerWithLanguage("en_US", this, handler);
- The SpeechKit.createVocalizerWithLanguage method creates a speech synthesizer and takes the following parameters:
- The language parameter is a String that defines the spoken language in the format of the ISO 639 language code, followed by an underscore “_”, followed by the ISO 3166-1 country code. For example, the English language as spoken in the United States is en_US.
Note
An up-to-date list of supported languages for text-to-speech is available at http://dragonmobile.nuancemobiledeveloper.com/faq.php. The list of supported languages will be updated when new language support is added; the new languages will not necessarily require updating an existing Dragon Mobile SDK.
- The this parameter defines the object to receive status and error messages from the speech synthesizer. It can be replaced with any object that implements the Vocalizer.Listener interface.
- handler should be an android.os.Handler instance used to deliver the listener messages; it can be created as follows:
Handler handler = new Handler();
- The SpeechKit.createVocalizerWithLanguage method uses a default voice chosen by Nuance. To select a different voice, use the createVocalizerWithVoice method instead, as sketched after the note below.
- The voice parameter is a String that defines the voice model. For example, the female US English voice is Samantha.
Note
The up-to-date list of supported voices is provided with the supported languages at http://dragonmobile.nuancemobiledeveloper.com/faq.php.
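A sketch of selecting a specific voice is shown here; the parameter order (language, voice, listener, handler) is an assumption to verify against the class documentation.

Vocalizer voc = sk.createVocalizerWithVoice("en_US", "Samantha", this, handler);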
- To begin converting text to speech, use either the speakString or speakMarkupString method:
voc.speakString("Hello world.", context);
Note
The speakMarkupString method is used in exactly the same manner as speakString, except that it takes a String filled with SSML, a markup language tailored to describing synthesized speech. An advanced discussion of SSML is beyond the scope of this document; however, you can find more information from the W3C at http://www.w3.org/TR/speech-synthesis/.
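For illustration, a call with a small SSML document might look like the following; whether the server accepts every SSML element is not covered here.

// Minimal SSML: a greeting with a 300 ms pause in the middle.
String ssml = "<?xml version=\"1.0\"?>"
        + "<speak version=\"1.0\" xml:lang=\"en-US\">"
        + "Hello <break time=\"300ms\"/> world."
        + "</speak>";
voc.speakMarkupString(ssml, context);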
As speech synthesis is a network-based service, these methods are all asynchronous, and in general an error condition is not immediately reported. Any errors are reported as messages to the Vocalizer.Listener that was passed to createVocalizerWithLanguage or createVocalizerWithVoice.
The speakString and speakMarkupString methods may be called multiple times for a single Vocalizer instance. To change the language or voice without creating a new Vocalizer, call setLanguage or setVoice on the existing instance, as sketched below.
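For example, reusing one instance across languages might look like this; the assumption is that setLanguage takes the same language-code format as createVocalizerWithLanguage.

voc.speakString("Hello world.", context);    // spoken with the current en_US settings
voc.setLanguage("fr_FR");                    // switch the same instance to French
voc.speakString("Bonjour tout le monde.", context);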
Managing Text-To-Speech Feedback
The synthesized speech will not immediately start playback. Rather, there will be a brief delay as the request is sent to the speech server and speech is streamed back. For UI coordination, to indicate when audio playback begins, implement the optional Vocalizer.Listener.onSpeakingBegin method:
public void onSpeakingBegin(Vocalizer vocalizer, String text, Object context) {
    // update UI to indicate that text is being spoken
}
The context in the message is a reference to the context that was passed to the speakString or speakMarkupString method; it can be used to match the message to a particular request.
On completion of the speech playback, the Vocalizer.Listener.onSpeakingDone message is sent. This message is always sent, on both successful completion and on error. In the success case, error is null:
public void onSpeakingDone(Vocalizer vocalizer, String text, SpeechError error, Object context) {
    if (error != null) {
        // Present error dialog to user
    } else {
        // Update UI to indicate speech is complete
    }
}