Imagine a computer assistant that sees what you see and hears what you hear. This assistant has been with you since you were born, and follows you through life. The assistant can remind and advise you, based on your past experiences. If you want to recall what was said at a previous meeting or in a previous class, the assistant can recreate those experiences. For a student, the assistant could take into account what was really learned (versus what was taught) in previous classes when helping to explain a difficult lecture or work through a problem set. For a doctor or lawyer, the assistant could recall relevant previous cases from training, or from previous practice. For a manager, the assistant could recognize recurrent problems and what preceded them.
I propose building such a Shadow. Early versions will be crude, involving body mounted sensors and displays with wires running to nearby workstations. Later versions will be small and relatively invisible to the human user. Several volunteers will be instrumented and their experiences captured. This will force us to deal with several issues:
The idea of a personalized machine ``servant'' is a very old one, predating the idea of electronic computers. What has changed is that we can now afford to put relatively powerful computing and interface devices on our bodies, and in every room of our homes, offices and classrooms. We can now build ``shadows''. The research challenge is to figure out how to use the information they collect effectively. It is clear that we can use a lifelong personal scribe to deliver higher performance speech, handwriting, and gesture recognition, and also to develop a more accurate model of how each of us interacts with computers. A research question is whether we can do more than that, and develop an assistant that not only sees what we see, and hears what we hear, but also thinks what we think.
To make this vision a little more concrete, I would like to describe an initial experiment. Can we tell where a user is looking, and identify what the user is looking at, during their daily lives? Clearly, we could surgically implant eye coils and put the subject in a special test apparatus to measure gaze direction, but this would not be acceptable to most users. Can we recover gaze direction in a less invasive manner?
I propose a video camera based approach to figuring out what is being attended to at any one moment. The user would be instrumented with at least four video cameras. Two cameras provide a view of the external world, and two other cameras attempt to provide a view of the user's internal world by tracking the position of the eyes in the orbit.
At least two cameras would provide an external visual system for the shadow. One head mounted camera would provide a wide angle view of what was in the potential visual field of the user. A second head mounted camera would provide a narrow angle view of what was directly ahead of the user. The narrow angle and wide angle cameras would play roles similar to those of the fovea and the peripheral visual field, respectively. We would be able to reconstruct high resolution images of whatever the user attended to by taking advantage of the constraint that humans orient their heads to center their eyes during extended viewing of an object. The wide angle camera could provide information about changes in objects that are not currently being attended to.
A second pair of video cameras would be used to track the eye movements of the user. One camera would be mounted on each side of the head, at about the position of the hinge of an eyeglass frame, and would look at the corresponding eye. Each camera would be able to detect the elevation of its eye. A single camera would also be able to measure how far the view direction had moved towards that camera from a straight ahead view. Integrating information from the pair of cameras would provide a measurement of the direction of view.
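One way the integration step might look is sketched below. This is only an illustration under stated assumptions, not a description of the proposed hardware: the function names are hypothetical, each camera is assumed to report its readings in degrees, and each camera is assumed to report zero horizontal rotation when the eye turns away from it. Elevation is taken as the average of the two readings, and positive azimuth is taken to mean a gaze to the user's right.

```python
import math


def combine_eye_cameras(left_elev_deg, left_toward_deg,
                        right_elev_deg, right_toward_deg):
    """Merge two per-eye readings into one (azimuth, elevation) estimate.

    Hypothetical interface: each camera reports the elevation of its eye
    and how far the eye has rotated towards that camera (assumed to be
    zero when the eye rotates away from the camera).
    """
    # Both cameras see elevation, so average the two readings.
    elevation_deg = (left_elev_deg + right_elev_deg) / 2.0
    # Rotation towards the right camera is positive azimuth,
    # rotation towards the left camera is negative azimuth.
    azimuth_deg = right_toward_deg - left_toward_deg
    return azimuth_deg, elevation_deg


def gaze_vector(azimuth_deg, elevation_deg):
    """Convert (azimuth, elevation) to a unit vector in head coordinates.

    Assumed axes: x to the user's right, y up, z straight ahead.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.sin(az),
            math.sin(el),
            math.cos(el) * math.cos(az))


if __name__ == "__main__":
    az, el = combine_eye_cameras(10.0, 0.0, 12.0, 20.0)
    print(az, el, gaze_vector(az, el))
```

A real system would need a calibrated mapping from pupil position in each image to eye rotation; the simple subtraction here only stands in for that step.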
One research question is whether this approach is adequate for estimating what the user is looking at. We will use controlled studies in which the user is asked to gaze at a sequence of targets to assess our measurement system.
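A controlled target study of this kind could be scored with a mean angular error, as in the minimal sketch below. The function names and the representation of directions as 3-D vectors in a common head-centered frame are assumptions made for illustration; the proposal does not specify an error measure.

```python
import math


def angular_error_deg(estimated, target):
    """Angle in degrees between two 3-D direction vectors."""
    dot = sum(e * t for e, t in zip(estimated, target))
    norm_e = math.sqrt(sum(e * e for e in estimated))
    norm_t = math.sqrt(sum(t * t for t in target))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_e * norm_t)))
    return math.degrees(math.acos(cos_angle))


def mean_angular_error_deg(estimates, targets):
    """Average angular error over a sequence of gaze targets."""
    errors = [angular_error_deg(e, t) for e, t in zip(estimates, targets)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical data: known target directions and measured gaze directions.
    targets = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    estimates = [(0.02, 0.0, 1.0), (0.9, 0.05, 1.0)]
    print(mean_angular_error_deg(estimates, targets))
```

Reporting the error per target as well as the mean would show whether accuracy degrades towards the edge of the visual field.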
The second research question is whether we can recognize whatever the user is looking at, in order to provide semantically meaningful labels. We will use additional information from a position measurement system such as differential GPS to provide information on where the user is, and a crude head orientation measurement system (using a compass and a pendulum based gravity direction measurement device) to provide information on which direction the user's head is facing. These additional measurements will help us to build or use maps in addition to the video images, and to better detect when we have looked at something before and already have a label for it.
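The idea of using position and heading to detect a repeat viewing can be sketched as a simple lookup over a log of labelled observations. Everything named here is hypothetical: the `Observation` record, the use of local x/y coordinates in meters, and the distance and heading tolerances are all illustrative assumptions, not measured properties of differential GPS or of the compass.

```python
import math
from dataclasses import dataclass


@dataclass
class Observation:
    x_m: float          # position east of a local origin, in meters
    y_m: float          # position north of a local origin, in meters
    heading_deg: float  # compass heading of the head, in degrees
    label: str          # label previously attached to what was seen


def heading_difference_deg(a, b):
    """Smallest absolute difference between two compass headings."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def find_previous_labels(log, x_m, y_m, heading_deg,
                         max_dist_m=2.0, max_heading_deg=20.0):
    """Return labels of logged observations made near this pose."""
    matches = []
    for obs in log:
        dist = math.hypot(obs.x_m - x_m, obs.y_m - y_m)
        turn = heading_difference_deg(obs.heading_deg, heading_deg)
        if dist <= max_dist_m and turn <= max_heading_deg:
            matches.append(obs.label)
    return matches


if __name__ == "__main__":
    log = [Observation(0.0, 0.0, 350.0, "office door"),
           Observation(10.0, 10.0, 0.0, "lab window")]
    print(find_previous_labels(log, 0.5, 0.5, 5.0))
```

Candidates returned by such a lookup would still need to be confirmed against the video images, since pose alone cannot tell whether the scene has changed.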
We will largely use a memory-based or case-based approach to recognizing objects we have seen a second time. However, instead of matching entire images of objects, we will match image fragments. Essentially, image fragments become the primitives of our object recognition system. Early experiments with this approach indicate it is much more robust than template matching, and can easily learn good matching primitives (the image fragments).
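A minimal sketch of memory-based matching on image fragments follows. It is an illustration only and not the system used in the early experiments: fragments here are fixed grid patches with the mean brightness removed, matching is nearest neighbor on squared pixel difference, and the object label is chosen by majority vote. The class and function names are hypothetical, and the sketch does not attempt to learn good fragments.

```python
import random


def extract_fragments(image, size=4, stride=4):
    """Cut a grayscale image (list of rows) into flattened patches,
    each with its mean brightness subtracted."""
    height, width = len(image), len(image[0])
    fragments = []
    for r in range(0, height - size + 1, stride):
        for c in range(0, width - size + 1, stride):
            patch = [image[r + i][c + j]
                     for i in range(size) for j in range(size)]
            mean = sum(patch) / len(patch)
            fragments.append([p - mean for p in patch])
    return fragments


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class FragmentMemory:
    """Stores every fragment seen, together with its object label."""

    def __init__(self):
        self.entries = []  # list of (fragment, label) pairs

    def remember(self, image, label):
        for frag in extract_fragments(image):
            self.entries.append((frag, label))

    def recognize(self, image):
        """Label each query fragment by its nearest stored fragment,
        then return the label with the most votes (None if no votes)."""
        if not self.entries:
            return None
        votes = {}
        for frag in extract_fragments(image):
            _, label = min(self.entries,
                           key=lambda e: squared_distance(e[0], frag))
            votes[label] = votes.get(label, 0) + 1
        return max(votes, key=votes.get) if votes else None


if __name__ == "__main__":
    rng = random.Random(0)
    mug = [[rng.random() for _ in range(16)] for _ in range(16)]
    door = [[rng.random() for _ in range(16)] for _ in range(16)]
    memory = FragmentMemory()
    memory.remember(mug, "mug")
    memory.remember(door, "door")
    # A second, slightly noisy view of the door.
    noisy_door = [[v + rng.gauss(0.0, 0.01) for v in row] for row in door]
    print(memory.recognize(noisy_door))
```

Voting over many small fragments means a partly occluded or shifted object can still collect enough matches, which is one plausible reason such an approach could be more robust than matching a single whole-object template.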
The performance of the recognition portion of the shadow system will be assessed in two ways. We will first test image based recall. If we go back to a particular room or look at a familiar object, can we access the previous times we have been in that room or seen that object? The second test will be semantic recall. If the user has labelled a room or an object, can we access experiences of that room or object using the label? In order to explore semantic recall we will have to develop paradigms for user labelling. One method is to use a library of labelled objects as the source of labels for new objects. Another is to allow the user to point at an object and provide a label.
This research area is new to me, although it draws on previous work I have done in robot learning of new movements, and numerical approaches to machine learning. As part of re-thinking my research directions I have asked myself the question, ``Where will machine learning be applied?''. My belief is that an important application of machine learning will be modeling humans. We already see machine learning used in handwriting recognition, gesture recognition, and speech recognition. I think machine learning will be applied at higher levels as well, to model user behavior. The work described in this proposal is a key part of my effort to develop a research program in the area of machine learning about humans.
This proposal is also part of my vision of what future computer systems will be like. If we use science fiction as a guide, one vision of future computer systems is a robot that accompanies us and shares in our experiences, as much as a fellow human standing or walking next to us could. It is currently too difficult to build all of this robot, so an easier vision to achieve is the monkey or parrot perched on our shoulder. We could build such a system with independent movement capabilities, but I see that as the next phase in this research. That leaves us with the notion of a third eye in our foreheads that is locked to our own gaze but provides sensory information to our computer assistant.
The shadow concept is well known in science fiction, but as far as I am aware there are no current attempts to implement it. There is complementary work which has similar goals but is oriented around specific events or places, such as using video to record a meeting and correlating the video record with observers' notes. Other research groups are exploring components needed in the shadow concept (Xerox PARC's ``Ubiquitous computing'', CMU's ``Wearable computers'', MIT's ``Intelligent Room'', and work on intelligent agents at many places, for example). However, there is room for many different visions and many players, and I think it is very important to develop our own approach to future computing environments.
This research challenge could be used to link various research efforts and groups in the College. There is a clear relationship between the style of machine learning suggested here and case-based learning and reasoning. I hope to take advantage of the College's expertise in this area, and also provide a resource for experiments on case-based learning. Clearly, the initial emphasis of this research will be very close to the sensory signals. However, the shadow will be much more useful if we can interact with it at a fairly high level. The challenge of bridging the gap between sensory signals and high level information is faced by the Intelligent Systems group, the Cognitive Science group, and GVU. I hope to form bridges between these groups. I am proposing automatically building a database of life experiences. Clearly, interaction with the Database group would be useful. I hope that this project could provide an interesting challenge and an experimental resource for all of these groups.
Students at all levels could be involved in work on this project. In addition to being involved in the research, students are an ideal test group for computer assistants. Imagine the building full of students taking notes assisted by ``shadows''. Multimedia notes could be produced in real time during each lecture. The student could annotate those notes during or after the lecture. The readings could be automatically linked to the notes as well.
I propose to build prototype computer assistants, or {\em shadows}. These assistants would capture a user's experiences in a database. Key challenges include making the acquisition process technically possible and finding ways to utilize the stored information effectively. This project can serve as a bridge project, providing an experimental resource that can be shared by several groups. I would expect that a working prototype would be tremendously helpful in pursuing external funding, so seed money can make a real difference.