Speech Recognition - Command And Control

"Command & Control" speech recognition allows the user to speak a word, phrase, or sentence from a list of phrases that the computer is expecting to hear. For example, a user might be able to speak the command, "Send mail to Fred Smith", "Send mail to Bob Jones", or "Turn on the television."

The number of different commands a user might speak at any time can easily number in the hundreds. Furthermore, the commands are not just limited to a "list" but can also contain other fields, like "Send mail to {name}", "Call {digits}", or "Record the movie at {time}." With all of the possibilities, the user is able to speak thousands of different commands.
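
To make this concrete, here is a minimal sketch in Python of a command list whose entries contain fields. The slot values, templates, and matching logic are illustrative assumptions rather than part of any particular speech API; a real application would receive the recognized phrase from its recognition engine instead of a hard-coded string.

    import re

    # Toy slot definitions: {name} is matched against a list, {digits} and {time} by pattern.
    SLOT_PATTERNS = {
        "name":   r"(?P<name>Fred Smith|Bob Jones)",
        "digits": r"(?P<digits>[0-9 ]+)",
        "time":   r"(?P<time>\d{1,2}(:\d{2})?\s?(am|pm)?)",
    }

    # The command "list" the recognizer is expected to hear.
    COMMAND_TEMPLATES = [
        "send mail to {name}",
        "call {digits}",
        "record the movie at {time}",
        "turn on the television",
    ]

    def compile_template(template):
        """Turn a template like 'call {digits}' into a regular expression."""
        pattern = re.escape(template)
        for slot, slot_pattern in SLOT_PATTERNS.items():
            pattern = pattern.replace(re.escape("{%s}" % slot), slot_pattern)
        return re.compile("^" + pattern + "$", re.IGNORECASE)

    COMPILED = [(t, compile_template(t)) for t in COMMAND_TEMPLATES]

    def match_command(phrase):
        """Return (template, field values) for the first template the phrase matches."""
        for template, regex in COMPILED:
            match = regex.match(phrase.strip())
            if match:
                return template, match.groupdict()
        return None, {}

    # Prints ('send mail to {name}', {'name': 'Bob Jones'})
    print(match_command("Send mail to Bob Jones"))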

Why Use Command and Control?

In general, use Command and Control recognition when:

  1. It makes the application easier to use.
  2. It makes features in the application easier to get to.
  3. It makes the application more fun/realistic.

If an application uses speech recognition solely to impress people, it will work well for demos but will not be used by real users.

More Specific Reasons for Implementing Speech

Command and Control recognition might be used in some of the following situations:

  • Answering questions. An application can easily be designed to accept voice responses to message boxes and wizard screens. All speech recognition engines can easily identify "Yes," "No," and a few other short responses.
  • Streamlining access to features. Speech recognition enables the user to speak any one of a potentially huge set of commands, making it faster to say, "Change the font to Arial," "Use 13-point bold," or "Print the current page" than to maneuver through the corresponding dialog boxes.
  • Activating macros. Speech recognition lets a user speak a more natural word or phrase to activate a macro. For example, "Spell check the paragraph" is easier for most users to remember than the CTRL+F5 key combination.
  • Accessing large lists. In general, it's faster for a user to speak one of the names on a list, such as "Start running calculator," than to scroll through the list to find it.
  • Facilitating dialogue between the user and the computer. Speech recognition works very well in situations where the computer essentially asks the user "What do you want to do?" and branches according to the reply (somewhat like a wizard). For example, the user might reply, "I want to book a flight from New York to Boston." After the computer analyzes the reply, it clarifies any ambiguous words ("Did you say New York?"). Finally, the computer asks for any information that the user did not supply, such as "What day and time do you want to leave?"
  • Providing hands-free computing. Speech recognition is an essential component of any application that requires hands-free operation; it also can provide an alternative to the keyboard for users who are unable or prefer not to use one. Users with repetitive-stress injuries or those who cannot type may use speech recognition as the sole means of controlling the computer.
  • Humanizing the computer. Speech recognition can make the computer seem more like a person -- that is, like someone whom the user talks to and who speaks back. This capability can make games more realistic and make educational or entertainment applications friendlier.

Potential Uses By Application Category

The specific use of command and control recognition will depend on the application. Here are some sample ideas and their uses:

Games and Edutainment

Game and edutainment software titles will be some of the heaviest users of Command & Control speech recognition in the near term. Christmas 1995 saw the appearance of several games and half a dozen language-learning titles that use speech recognition, although only the high-end machines sold that season could run it. As the memory and CPU power of new machines increase, and with the introduction of Microsoft's Speech API, many more games and edutainment titles in future Christmas markets will use speech recognition.

What do these titles use speech recognition for?

One of the most compelling uses of speech recognition technology is in interactive verbal exchanges and conversation with computer-based characters. With games, for example, traditional computer-based characters can now evolve into characters that the user can actually talk to.

While speech recognition enhances the realism and fun in many computer games, it also provides a useful alternative to keyboard-based control of games and applications. Voice commands provide new freedom for the user in all sorts of applications, from entertainment to productivity.

Data Entry

Many applications, such as database front-ends and spreadsheets, require users to key in paper-based data. It is much easier for users to read the data aloud to the computer, and speech recognition can significantly speed up data entry.

A data-entry application can use speech recognition if the data is specific enough. While speech recognition cannot effectively be used to enter names, it is very good at entering numbers and selecting items from a small list (fewer than 100 items). Some recognizers can even handle spelling fairly well. If an application uses speech recognition, the user no longer has to look at the keyboard. If speech recognition is combined with text-to-speech playback of the recognized entry, the user doesn't even need to look at the screen and can focus entirely on the paper.

Furthermore, because speech recognition is not as "modal" as a keyboard, some applications don't even need to require a specific field to have focus. If the form that is being filled in has fields with mutually exclusive data types -- one field allows "male" or "female", the other is an age, and the third is a city -- then speech recognition can hear the command and automatically determine which field to fill in. After all, if only one field accepts "New York City" as a valid entry and the user speaks "New York City" then the application knows which field to fill in.
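
As a sketch of this idea, the following Python routes a recognized phrase to the only field whose validator accepts it. The field names and validators are hypothetical and chosen purely for illustration.

    # Hypothetical form fields with mutually exclusive validators.
    FIELDS = {
        "sex":  lambda text: text.lower() in ("male", "female"),
        "age":  lambda text: text.isdigit() and 0 < int(text) < 120,
        "city": lambda text: text.title() in ("New York City", "Boston", "Seattle"),
    }

    def route_entry(phrase, form):
        """Fill in whichever single field accepts the phrase; reject ambiguous input."""
        candidates = [name for name, accepts in FIELDS.items() if accepts(phrase)]
        if len(candidates) == 1:
            form[candidates[0]] = phrase
            return candidates[0]
        return None  # ambiguous or invalid; the application should ask the user to clarify

    form = {}
    for spoken in ("female", "34", "New York City"):
        print(spoken, "->", route_entry(spoken, form))
    print(form)  # {'sex': 'female', 'age': '34', 'city': 'New York City'}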

Document Editing

Command and control recognition is useful for document editing when the user wishes to keep his/her hands on the keyboard to type, or on the mouse to drag and select. He/she can simultaneously speak commands for manipulating the data that he/she is working on. A word processor might provide commands like "bold," "italic," "change to Times New Roman font," "use bullet list text style," and "use 18-point type." A paint package might have "select eraser" or "choose a wider brush."

Of course, there are users who won't find speaking a command to be preferable to using keyboard equivalents. People have been using keyboard equivalents for so long that the combinations have become for them a routine part of program control. But for many (if not most) people, keyboard equivalents are a lot of unused shortcuts. Voice commands will provide these users with the means to execute a command without first mousing through cascading menus.

Telephony

For a full description of telephony, see the Text-To-Speech for Telephony article.

Hardware and Software Requirements

A speech application requires certain hardware and software on the user's computer to run. Not all computers have the memory, speed, microphone, or speakers required to support speech, so it is a good idea to design the application so that speech is optional.

These hardware and software requirements should be considered when designing a speech application:

  • Processor speed. The speech recognition and text-to-speech engines currently on the market typically require a 486/33 (DX or SX) or faster processor.
  • Memory. On the average, speech recognition for command and control consumes 1 megabyte (MB) of random-access memory (RAM) in addition to that required by the running application. Speech recognition for dictation requires an additional 8 MB. Text-to-speech uses about 1 MB of additional RAM.
  • Sound card. Almost any sound card will work for speech recognition and text-to-speech, including Sound Blaster™, Media Vision™, and ESS Technology cards that are compatible with the Microsoft Windows Sound System, and the audio hardware built into multimedia computers. A few speech recognition engines still need a DSP (digital signal processor) card.
  • Microphone. The user can choose between two kinds of microphone: either a close-talk or headset microphone that is held close to the mouth or a medium-distance microphone that rests on the computer 30 to 60 centimeters away from the speaker. A headset microphone is needed for noisy environments.
  • Operating system. The Microsoft Speech application programming interface (API) requires either Windows 95 or Windows NT version 4.0.
  • Speech recognition and text-to-speech engine. Speech recognition and text-to-speech software must be installed on the user's system. Many new audio-enabled computers and sound cards are bundled with speech recognition and text-to-speech engines. As an alternative, many engine vendors offer retail packages for speech recognition or text-to-speech, and some license copies of their engines.

For a list of engine vendors that support the Speech API, see the ENGINE.DOC file included with the Speech Software Development Kit.

Limitations

Currently, even the most sophisticated speech recognition engine has limitations that affect what it can recognize and how accurate the recognition will be. The following list illustrates many of the limitations found today. The limitations do pose some problems, but they do not prevent the design and development of applications that use voice commands.

Microphones and sound cards

The microphone is the biggest source of problems for speech recognition. Microphones have the following inherent problems:

  1. Not every user has a sound card. Over time, more and more PCs will ship with a sound card.
  2. Not every user has a microphone. Over time, more and more PCs will ship with a microphone.
  3. Because sound card connectors are on the back of the computer, it is not easy for users to plug in the microphone.
  4. Most microphones that come with computers are cheap and do not perform as well as microphones that retail for $50 to $100. Furthermore, many of the cheap microphones that are designed to be worn are uncomfortable, and a user will not use a microphone if it is uncomfortable.
  5. Users don't know how to use a microphone. Either through inexperience, lack of training, or poor models, they are prone to wearing headsets incorrectly, holding hand-helds too close, and leaning towards desktop microphones.

Most applications can do little about the microphone. Luckily, when the user installs the speech recognition engine, the engine should come with software that makes sure the user's microphone is correctly plugged in and working.

Problems with ambient noise

In general, the user should be using a microphone as close to his/her mouth as possible to reduce noise coming from the environment. Users in quiet environments can afford to have the microphone positioned several feet away, but users in noisier environments, such as office cubicles, will need a headset that positions the microphone a few centimeters from the mouth. Unfortunately, speech recognition is limited in its utility for many people in noisy environments, because they find the headset uncomfortable or its cord restrictive.

Problems with computer-generated sounds

Often worse than ambient noise are intentional sounds, such as audio generated by the user's computer and played through a powerful stereo system.

There are several ways to make sure that the microphone isn't hearing the speakers:

  • If the user is wearing a close-talk headset, then the microphone is so close to the user that it won't pick up the speakers.
  • Make sure the user is wearing headphones. That way the microphone won't pick up the sound coming from the computer.
  • Have a "push-to-talk" button. When the user pushes the button, the computer's audio output is muted and speech recognition is turned on. When the user releases the button speech recognition is turned off and the audio is turned back on. Users get the hang of this pretty quickly. If used in a wireless device, this also saves battery life.

Half-Duplex Sound Cards

Many sound cards are only "half duplex" (as opposed to "full duplex"). If a sound card is "half duplex" it cannot record and play audio at the same time. For speech recognition, half-duplex sound cards cannot be listening while the card is playing sound. Fortunately, with plug-and-play the number of full duplex sound cards is increasing.

Speech Recognition Likes to Hear

Speech recognition engines like to hear -- no surprise. They like to hear so much that if the user is having a phone conversation in the room while speech recognition is listening, the recognizer will think that the user is talking to it, and it will hear random words. Sometimes the speech recognizer even hears a sound, like a slamming door, as words.

There are several ways to overcome this obstacle:

  1. Allow the user to turn speech recognition on/off quickly and easily. This can be done by the keyboard, mouse, voice, or joystick.
  2. Have a "push-to-talk" button. This can be a key on the keyboard, mouse button, hot-spot that the cursor has to be over, a joystick button, or anything else. The user presses the button to get the computer to listen. When the button isn't pressed the computer isn't listening.
  3. Give the computer a name, like "Computer," and make the user say it prior to speaking. It's kind of like "Simon Says". The computer will only act when it hears its name.
  4. Have the computer verify every command with the user. For example, the Microsoft Voice application will display the command that it heard and then in small text display, "Say 'Do it' to accept the command." The command is not actually acted upon unless the user says "Do it" within a few seconds of the command being spoken.
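
Approaches 3 and 4 above can be combined in a small handler like the following sketch. The attention word, confirmation phrase, and timing window are illustrative assumptions, and the recognized phrases would come from the engine rather than from hard-coded strings.

    import time

    ATTENTION_WORD = "computer"
    CONFIRM_PHRASE = "do it"
    CONFIRM_WINDOW_SECONDS = 5   # assumed "few seconds" window

    pending_command = None   # last command heard, waiting for "Do it"
    pending_since = 0.0

    def on_phrase(phrase, execute):
        """Handle one recognized phrase; execute(command) runs a confirmed command."""
        global pending_command, pending_since
        text = phrase.strip().lower()

        # Accept a pending command only if "Do it" arrives within the window.
        if pending_command and text == CONFIRM_PHRASE:
            if time.time() - pending_since <= CONFIRM_WINDOW_SECONDS:
                execute(pending_command)
            pending_command = None
            return

        # Ignore speech that is not addressed to the computer by name.
        if not text.startswith(ATTENTION_WORD + " "):
            return

        # Remember the command and ask the user to confirm it.
        pending_command = text[len(ATTENTION_WORD) + 1:]
        pending_since = time.time()
        print('Heard "%s". Say "Do it" to accept the command.' % pending_command)

    on_phrase("Computer minimize window", lambda cmd: print("Executing:", cmd))
    on_phrase("Do it", lambda cmd: print("Executing:", cmd))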

Command and Control Engines need exact commands

Before an application starts a command and control recognizer listening, it must first give the recognizer a "list" of commands to listen for. The list might include commands like "minimize window," "make the font bold," "call extension {digits}," and "send mail to {name}."

If the user speaks the command as it is written, they are going to get very good accuracy. However, if they word the command differently (and the application hasn't provided the alternate wording), the recognizer will either not recognize anything or, even worse, recognize something completely different. So, if a user speaks "bold that" instead of "make the font bold," there's a pretty good chance that the computer will hear "minimize window."

Applications can work around this problem by:

  • Making sure the command names are intuitive to users. For many operations, like minimizing a window, nine out of ten users will say "minimize" or "minimize window" without prompting.
  • Showing the command on the screen. Sometimes an application will be able to display a list of commands on the screen. Users will naturally speak the same text they see. Microsoft Voice uses the application names shown on the task-bar for the "Switch to <application>" command.
  • Using word spotting. Many speech recognizers can be told to just listen for one keyword, like "mail." This way the user can speak "Send mail" or "Mail a letter," and the recognizer will get it. Of course, the user might say, "I don't want to send any mail" and the computer will still end up sending mail. (A sketch of this approach follows this list.)
  • Having the computer verify every command with the user. The Microsoft Voice application will display the command that it heard and then in small text display, "Say 'Do it' to accept the command." The command is not actually acted upon unless the user says "Do it" within a few seconds of the command being spoken.
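
As a sketch of the word-spotting approach (and its pitfall), the following Python maps a single spotted keyword to an action. The keyword table is a hypothetical example; a real recognizer would typically do the spotting itself, but the mapping logic is the same.

    # Hypothetical keyword-to-action table.
    KEYWORD_ACTIONS = {
        "mail":  "compose_mail",
        "print": "print_document",
        "bold":  "toggle_bold",
    }

    def spot_keyword(phrase):
        """Return the action for the first known keyword heard anywhere in the phrase."""
        for word in phrase.lower().split():
            if word in KEYWORD_ACTIONS:
                return KEYWORD_ACTIONS[word]
        return None

    print(spot_keyword("Send mail"))                       # compose_mail
    print(spot_keyword("Mail a letter"))                   # compose_mail
    print(spot_keyword("I don't want to send any mail"))   # compose_mail -- the pitfall described above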

Over time speech recognizers will start applying natural language processing and this problem will go away.

Speech Recognizers make mistakes

Speech recognizers make mistakes, and they always will. What is changing is the error rate: roughly every two years, recognizers make half as many mistakes as they did before. But no matter how good a recognizer becomes, it will still make mistakes.

An application can minimize some of the misrecognitions by:

  • Designing the list of commands that the computer is listening for so that the commands sound different. As a rule of thumb, the more phonemes that differ between two commands, the more distinct they sound to the computer. The two commands "go" and "no" differ by only one phoneme, so when the user says "Go," the computer is likely to recognize "No." However, if the commands were "Go there" and "No way," recognition would be much better. (A rough sketch of this kind of check follows this list.)
  • Not providing speech commands for destructive operations, such as formatting the user's hard disk.
  • Verifying commands that might be dangerous.
  • Having the computer verify every command with the user. The Microsoft Voice application will display the command that it heard and then in small text display, "Say 'Do it' to accept the command." The command is not actually acted upon unless the user says "Do it" within a few seconds of the command being spoken.
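
As a rough sketch of checking that commands sound different, the following Python compares commands by the edit distance between their phoneme sequences. The tiny phoneme dictionary here is hand-written purely for illustration; a real check would use the engine's own lexicon and thresholds.

    # Toy, hand-written phoneme transcriptions (assumptions for illustration only).
    PHONEMES = {
        "go":    ["g", "ow"],
        "no":    ["n", "ow"],
        "there": ["dh", "eh", "r"],
        "way":   ["w", "ey"],
    }

    def phonemes(command):
        """Concatenate the phoneme transcriptions of each word in the command."""
        return [p for word in command.lower().split() for p in PHONEMES[word]]

    def distance(a, b):
        """Levenshtein distance between two phoneme sequences."""
        rows = list(range(len(b) + 1))
        for i, pa in enumerate(a, 1):
            prev, rows[0] = rows[0], i
            for j, pb in enumerate(b, 1):
                prev, rows[j] = rows[j], min(rows[j] + 1, rows[j - 1] + 1,
                                             prev + (pa != pb))
        return rows[-1]

    def confusable(cmd1, cmd2, threshold=2):
        """Flag command pairs whose phoneme sequences differ by fewer phonemes than the threshold."""
        return distance(phonemes(cmd1), phonemes(cmd2)) < threshold

    print(confusable("go", "no"))            # True  -- only one phoneme apart
    print(confusable("go there", "no way"))  # False -- several phonemes apart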

Other Problems

Some other problems crop up:

  • Having a user spell out words is a bad idea for most recognizers because they are too inaccurate. However, some recognizers are specially tuned for spelling and do a very good job.
  • An engine also cannot tell who is speaking, although some engines may be able to detect a change in the speaker. Voice-recognition algorithms exist that can be used to identify a speaker, but currently they cannot also determine what the speaker is saying.
  • In addition, an engine cannot detect multiple speakers talking over each other in the same digital-audio stream. This means that a dictation system used to transcribe a meeting will not perform accurately during times when two or more people are talking at once.
  • Unlike a human being, an engine cannot hear a new word and guess its spelling.
  • Localization of a speech recognition engine is time-consuming and expensive, requiring extensive amounts of speech data and the skills of a trained linguist. If a language has strong dialects that each represent sizable markets, it is also necessary to localize the engine for each dialect. Consequently, most engines support only five or ten major languages (for example, European languages and Japanese, or possibly Korean).
  • Speakers with accents or those speaking in nonstandard dialects can expect more misrecognitions until they train the engine to recognize their speech. Even then, the engine accuracy will not be as high as it would be for someone with the expected accent or dialect. An engine can be designed to recognize different accents or dialects, but this requires almost as much effort as porting the engine to a new language.

Application Design Considerations

Here are some design considerations for applications using command and control speech recognition.

Design Speech Recognition in From the Start

Don't make the mistake of treating speech recognition as an add-on feature. It's poor design to simply bolt speech recognition onto an application that is designed for a mouse and keyboard. Applications designed for just the keyboard and mouse get little benefit from speech recognition. After all, how many DOS applications that were designed for just the keyboard came up with effective uses for the mouse?

Do Not Replace the Keyboard and Mouse

Speech recognition is not a replacement for the keyboard and mouse. In some, but not all, circumstances it is a better input device than the keyboard or mouse. Speech recognition makes a terrible pointing device, just as the mouse makes a terrible text-entry device and the keyboard is bad for drawing. When speech recognition systems were first bolted onto the PC, it was thought that speaking menu names would be really useful. As it turns out, very few users use speech recognition to access a window menu, because the mouse is much faster and easier.

Generally speaking, every feature in an application should be accessible from all input devices: keyboard, mouse, and speech recognition. Users will naturally use whichever input mechanism provides them the quickest or easiest access to the feature. The ideal input device for a feature may vary from user to user.

Work Around Recognizer Limitations

Speech recognizers have a lot of limitations, as listed in the previous section. Make sure that the application isn't using or requiring speech recognition to be used for purposes where it performs poorly.

Communicate Speech Awareness

Since most applications today do not include speech recognition, it will be a new technology for most users. They probably won't assume that your application has it and won't know how to use it.

When you design a speech recognition application, it is important to communicate to the user that your application is speech-aware and to provide him or her with the commands it understands. It is also important to provide command sets that are consistent and complete.

Managing User Expectations

When users hear that they can speak to their computers, they instantly think of Star Trek and 2001: A Space Odyssey, expecting that the computer will correctly transcribe every word that they speak, understand it, and then act upon it in an intelligent manner.

You should convey as clearly as possible exactly what an application can and cannot do and emphasize that the user should speak clearly, using words the application understands.

Communicating the Command Set

A graphic user interface provides users with tremendous feedback about what they can do by displaying menus, buttons, and other controls on the screen. Furthermore, the keyboard and mouse typically do not send erroneous signals to the application.

This is not so with speech recognition. The number of voice commands that can be recognized at any given time can easily number in the hundreds, perhaps even thousands. Although cues for the most likely commands could be displayed on the screen, it is impossible to display the full set at once. To compensate, an application can provide mechanisms to scan through the large list of active commands or can prompt the user for the most common voice responses through visuals or text-to-speech. For example, the application might say "Do you want to save the file? Say Yes or No." If the application does not recognize a command, it can also provide more extensive help. For example, "Please say either Yes or No, or say Help if you need more help."
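
A prompt-and-reprompt loop of this kind might look like the following sketch. The listen() callable is a placeholder for however the application obtains the recognizer's best guess (returning None when nothing usable was heard); the prompt wording follows the examples above.

    def ask_yes_no(prompt, listen, max_attempts=3):
        """Prompt the user for Yes/No, escalating the help text when recognition fails."""
        print(prompt + " Say Yes or No.")
        for attempt in range(max_attempts):
            reply = listen()
            if reply is not None and reply.lower() in ("yes", "no"):
                return reply.lower() == "yes"
            # Recognition failed or the reply was off-list: give more extensive help.
            print("Please say either Yes or No, or say Help if you need more help.")
        return None  # fall back to keyboard or mouse after repeated failures

    canned = iter(["maybe", "yes"])                     # simulated recognizer output
    print(ask_yes_no("Do you want to save the file?", lambda: next(canned, None)))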

Providing Feedback to the User

Whenever a voice command is spoken, you should give some sort of feedback to the user indicating that the command was understood and acted upon. Visual indications are usually sufficient, but if it is impossible to have noticeable visuals, you should verify with a short text-to-speech or recorded phrase.

Breaking Up Long Series of Numbers

Most engines have a very high error rate for long series of digits that are spoken continuously. For phone numbers or other long series, either break the number into groups of four or fewer digits or have the user speak each digit as an isolated word.
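
A sketch of the grouping step in Python follows; the group sizes and phone numbers are illustrative, and the resulting chunks would then be prompted for, or built into the grammar, one group at a time.

    def group_digits(number, group_size=4):
        """Split a long number into chunks of at most group_size digits."""
        digits = "".join(c for c in number if c.isdigit())
        return [digits[i:i + group_size] for i in range(0, len(digits), group_size)]

    print(group_digits("206-555-0123"))              # ['2065', '5501', '23']
    print(" ".join(group_digits("18005551234", 3)))  # 180 055 512 34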

Avoiding Speech for Directional Commands

Speech input should not be used as a means of moving the mouse cursor because it is inefficient and annoying for the user. For example, the user would need to repeat directional commands, such as up, many times in succession to move the cursor to the desired screen position.

Where the Engine Comes From

Of course, for speech recognition to work on an end user's PC the system must have a speech recognition engine installed on it. The application has two choices:

  • The application can come bundled with a speech recognition engine and install it itself. This guarantees that speech recognition will be installed and also guarantees a certain level of quality from the speech recognizer. However, if an application does this, royalties will need to be paid to the engine vendor.
  • An application can assume that the speech recognition engine is already on the PC or that the user will purchase one if they wish to use speech recognition.

The user may already have speech recognition because many PCs and sound cards will come bundled with an engine. Alternatively, the user may have purchased another application that included an engine. If the user has no speech recognition engine installed then the application can tell the user that they need to purchase a speech recognition engine and install it. Several engine vendors offer retail versions of their engines.
