We have investigated a wide range of information services so far, enabling us to make generalizations and draw conclusions toward developing voice application guidelines. Our prototype system provides access to both network and desktop-based services, such as weather forecasts, stock market results, and personal messages (e-mail, voice mail and faxes). Access is initiated by a telephone call, after which the user issues voice commands for various services. The system responds with the information requested.
For instance, a user may call to find out if she has any messages. She finds that a colleague wants to meet that afternoon. Checking her calendar, she discovers that she is free and responds to the e-mail with a voice attachment. Finally, she checks the West Coast weather for tomorrow because she has a trip to California planned. She sets a meeting reminder for herself and hangs up.
The scenario described above is idealized and resembles the interactions portrayed in many corporate concept videos. For the user, the interaction is natural. She switches between tasks seamlessly without losing the context of recent interactions. There are no excessive demands to remember mappings such as "press or say one for sports scores," familiar to most users of voice-based menu systems.
Without the desired speech recognition technology it might seem impossible to empirically investigate other important aspects of voice-only interaction, such as user acceptance or satisfaction. We want to conduct realistic user studies that lead to guidelines for future developers of voice-only applications, but we do not want to wait for the technology to catch up. Herein lies the crux of the design problem: to design an application with sufficient functionality to determine user acceptance, without the use of this core technology. We decided, therefore, to use a Wizard of Oz approach, with a human operator performing the speech recognition.
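To make the Wizard of Oz arrangement concrete, the sketch below shows one way such a setup might be structured: the wizard listens to the caller over the phone line, types the intent he or she recognized, and the system carries out the rest of the interaction. The service names, prompt wording, and console interaction are illustrative assumptions, not a description of our actual implementation.

```python
# Minimal Wizard of Oz console sketch (illustrative; not the actual prototype).
# The human wizard performs the "speech recognition" by typing the intent
# heard from the caller; the system then produces the spoken response.

def get_weather():
    return "Tomorrow in San Francisco: partly cloudy, high of 65."

def get_stocks():
    return "The Dow closed up 40 points today."

def get_messages():
    return "You have two new e-mail messages and one voice mail."

SERVICES = {
    "weather": get_weather,
    "stocks": get_stocks,
    "messages": get_messages,
}

def speak(text):
    # Stand-in for the text-to-speech / recorded-audio output channel.
    print("SYSTEM SAYS:", text)

def wizard_loop():
    speak("Welcome. How can I help you?")
    while True:
        intent = input("wizard> ").strip().lower()  # wizard types what the caller asked for
        if intent in ("quit", "hangup"):
            speak("Goodbye.")
            break
        handler = SERVICES.get(intent)
        if handler:
            speak(handler())
        else:
            speak("Sorry, I didn't understand. Please try again.")

if __name__ == "__main__":
    wizard_loop()
```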
A large list of services was constructed using input from brainstorming sessions, a literature survey, and the focus groups. Of these, we initially prototyped four: stock market results, US weather forecasts, headline news, and messages (e-mail, voice mail, and faxes). The World Wide Web was the information source for all the services except for the message service.
The prototype was used in a series of usability studies. The purpose of these studies was to determine the usefulness of the provided services, build user grammars for future speech recognition integration, determine common navigation paths between and within services, and, of course, to determine the usability of the prototype. The usability studies were performed in two stages: an initial study with a small group of participants with an early version of the prototype and a second study with a larger group of participants using a mature version of the prototype. The results reported here refer to the second study.
Both stages of the usability studies were conducted in the same manner. The participants were given a short written description of the application and a list of tasks to perform. A participant telephoned the Wizard prototype and attempted to use natural language voice commands to complete the assigned tasks. The user was then given the opportunity to explore the system freely. The prototype generated a log file for each user containing time-stamped requests for data and time-stamped replies. A complete audio record was kept for later analysis. The final portion of the usability test was a short questionnaire.
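As an illustration of the kind of log data such a prototype can produce, the sketch below shows a plausible time-stamped record of requests and replies, and a few lines that pair them to measure response delays. The field layout and sample entries are assumptions for illustration, not the prototype's actual file format.

```python
# Hypothetical log format: ISO timestamp, event type (REQUEST or REPLY), and text.
# The real prototype's format may differ; this only illustrates the idea of
# time-stamped request/reply pairs used to measure response latency.
from datetime import datetime

SAMPLE_LOG = """\
1997-03-12T14:02:11 REQUEST weather san francisco
1997-03-12T14:02:14 REPLY   weather forecast delivered
1997-03-12T14:02:40 REQUEST read new e-mail
1997-03-12T14:02:43 REPLY   two messages read
"""

def parse(log_text):
    events = []
    for line in log_text.splitlines():
        stamp, kind, text = line.split(None, 2)
        events.append((datetime.fromisoformat(stamp), kind, text))
    return events

# Pair each REQUEST with the following REPLY to compute the reply delay.
events = parse(SAMPLE_LOG)
for (t1, k1, txt1), (t2, k2, _) in zip(events, events[1:]):
    if k1 == "REQUEST" and k2 == "REPLY":
        print(f"{txt1!r}: {(t2 - t1).total_seconds():.0f} s to reply")
```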
General principles developed over decades of research obviously still apply; feedback is still important, as is direct control over an application's actions. If anything, working within a voice environment highlights many important, well-known lessons. Through our work, we tried to apply these general principles in order to develop a first cut at guidelines that we hope will form the basis for further discussion and development.
Below are some initial observations that have come out of our usability tests.
A question to ask is whether knowledge that one is talking to a computer implicitly leads a user to speak with a more limited vocabulary. We suspect it does. If so, the recognition problem becomes simpler, and the key design principle is to create a human-computer conversation that implicitly limits user responses to a reduced vocabulary. It is a question of user perception: if users believe the system has a small working vocabulary, they will interact using a small vocabulary, to the speech recognition system's benefit. Conversely, if users believe the system has a large working vocabulary, they will interact using a large vocabulary, making recognition more difficult. This observation is illustrated below.
We believe this was a result of having a human perform natural language speech recognition, rather than recognizing a constrained vocabulary. As users became more confident that the speech recognition "system" could understand a more conversational input language, they took advantage of it. This notion brings us back to the concept of user perception and the research question posed above: what is the relationship between user perception of how "intelligent" the computer is and the vocabulary they use? In order to understand this relationship better, further testing must be done.
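One concrete way conversation design can implicitly limit user responses is through prompt wording. The two phrasings below are our own illustrative examples, not prompts taken from the prototype.

```python
# Illustrative prompts (not from the prototype). An open prompt invites a
# large vocabulary; a constrained prompt implicitly narrows the caller's
# responses to words the recognizer is prepared to handle.
OPEN_PROMPT = "What would you like to do?"
CONSTRAINED_PROMPT = "Would you like weather, stocks, news, or messages?"

# A recognizer grammar matched to the constrained prompt stays small.
EXPECTED_RESPONSES = {"weather", "stocks", "news", "messages"}
```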
The obvious solution is to avoid overloading the user with long lists. But this observation leads to the more general question of how to provide adequate support for users trying to navigate through the system. The support should let users feel in control of the interaction without cognitively overloading them. The interface should work well for novice as well as expert users. We saw users who could adapt to the system and others who struggled. Obviously, one static interface will not be sufficient when dealing with a heterogeneous user population. Rather, a dynamic interface that changes with the user is needed. Prompts should be expanded when users require more help and should be succinct and non-intrusive when users are in control, part of a process known in the education community as scaffolding [2].
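One way to realize such a dynamic interface is to expand or abbreviate prompts based on a simple measure of how well the user is doing. The sketch below is a minimal illustration under assumed prompt wording and a made-up "struggle" heuristic; it is not the prototype's actual mechanism.

```python
# Minimal sketch of scaffolded prompting (illustrative assumptions throughout):
# count recent misrecognitions per user and choose a terse prompt for users
# who are doing fine, a fuller prompt for users who seem to be struggling.

TERSE = "Main menu."
EXPANDED = ("Main menu. You can say weather, stocks, news, or messages. "
            "Say 'help' at any time for more instructions.")

class PromptScaffold:
    def __init__(self, struggle_threshold=2):
        self.failures = 0
        self.threshold = struggle_threshold

    def record(self, understood):
        # Count consecutive misrecognitions; reset the count on success.
        self.failures = 0 if understood else self.failures + 1

    def prompt(self):
        return EXPANDED if self.failures >= self.threshold else TERSE

scaffold = PromptScaffold()
scaffold.record(understood=False)
scaffold.record(understood=False)
print(scaffold.prompt())  # expanded prompt after repeated trouble
```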
With a voice-only interface, only audio feedback can be used. Different cues can tell a user when to speak, when the system has or has not understood a request, when it is fetching information (if delays are long), and so on. Distinct and intuitive sounds are an effective way of providing audio feedback to users [1].
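For example, a system might map each interaction state to a distinct, short sound. The state names and file names below are assumptions for illustration, not the prototype's actual cues.

```python
# Illustrative mapping of interaction states to distinct audio cues
# (state and file names are assumed, not taken from the prototype).
AUDIO_CUES = {
    "ready_to_listen": "listen_tone.wav",   # brief rising tone: the system is listening
    "understood":      "confirm_tone.wav",  # short confirmation chirp
    "not_understood":  "error_tone.wav",    # clearly different error sound
    "fetching":        "working_loop.wav",  # low, repeating sound during long delays
}

def play_cue(state):
    # Stand-in for playback over the telephony audio channel.
    print("playing", AUDIO_CUES[state])
```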
2. Guzdial, M. Software-realized Scaffolding to Facilitate Programming for Science Learning. Interactive Learning Environments, Vol. 4, No. 1, 1995, 1-44.
3. Hansen, B., Novick, D.G., and Sutton, S. Systematic Design of Spoken Prompts. In Proceedings of CHI '96 (Vancouver, Canada, April 1996), ACM Press, 157-164.
4. Marx, M., and Schmandt, C. MailCall: Message Presentation and Navigation in a Nonvisual Environment. In Proceedings of CHI '96 (Vancouver, Canada, April 1996), ACM Press, 165-172.
5. Resnick, P., and Virzi, R.A. Relief from the Audio Interface Blues: Expanding the Spectrum of Menu, List, and Form Styles. ACM Transactions on Computer-Human Interaction, Vol. 2, No. 2, June 1995, 145-176.
6. Stifelman, L.J., Arons, B., Schmandt, C., and Hulteen, E.A. VoiceNotes: A Speech Interface for a Hand-Held Notetaker. In Proceedings of INTERCHI '93 (Amsterdam, The Netherlands, April 1993), ACM Press, 179-186.
7. Wildfire Communications, Inc. Homepage. Available at http://www.wildfire.com.
8. Yankelovich, N., Levow, G., and Marx, M. Designing SpeechActs: Issues in Speech Interfaces. In Proceedings of CHI '95 (Denver, CO, May 1995), ACM Press, 369-376.