Observing a task
being performed or attempted by someone else often accelerates human learning. If
robots can be programmed to use such observations to accelerate learning, their
usability and functionality will increase and their programming and learning time
will decrease. This research explores the use of task primitives in robot
learning from observation. A framework has been developed in which the agent
first uses observed data to learn a task and then increases its performance
through repeated task performance (learning from practice). Data collected
while a human performs a task is parsed into small parts of the task, called
primitives. Modules are created for each primitive type; they encode the
movements required to perform the primitive, as well as when and where the
primitive is performed. The feasibility of this method is currently being
tested with agents that learn to play both a virtual and an actual air hockey
game. The terms robot and agent are used interchangeably to refer to an
algorithm that senses its environment and has the ability to control objects
in either a hardware or software environment.
Observing the Task
The task to be
performed must first be observed. For a human learner, this mostly involves
vision. For a robot to learn from observing a task being performed, it must
have some way to sense what is occurring in the environment. This research does
not seek ways to use the robot's current sensors to observe performance.
Instead, the agents are given whatever equipment is necessary to observe the
performance, or are given information that represents the performance. The
equipment may include a camera or some type of motion capture device. Research
is also being performed in virtual environments, where the state of objects is
directly available.
Robots must generate commands to all of their actuators at regular intervals. The
analog controllers for our 30-degree-of-freedom humanoid robot are given desired
torques for each joint at 420 Hz. Thus, a task with a one-second duration is
parameterized by 30 × 420 = 12,600 parameters. Learning in this high-dimensional
space can be quite slow or can fail totally, and random search in such a space is
hopeless. In addition, since robot movements take place in real time, learning
approaches that require more than hundreds of practice movements are often not
feasible. Special-purpose techniques, such as trajectory learning and learning
from observation, have been developed to deal with this problem.
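To make the arithmetic above concrete, the short Python sketch below computes the number of parameters needed to specify a task directly (joints × control rate × duration). For contrast, it also computes the size of a purely hypothetical primitive-based description. The function names and the "3 primitives with 4 parameters each" example are illustrative assumptions, not figures from this research.

```python
def trajectory_parameter_count(dof: int, rate_hz: int, duration_s: float) -> int:
    """Parameters needed to specify a task directly: one torque command
    per joint per control cycle over the whole task duration."""
    return round(dof * rate_hz * duration_s)


def primitive_parameter_count(n_primitives: int, params_per_primitive: int) -> int:
    """Hypothetical size of a primitive-based description: a short sequence
    of primitives, each described by a few parameters (e.g. a sub-goal)."""
    return n_primitives * params_per_primitive


if __name__ == "__main__":
    # Figures from the text: 30 joints commanded at 420 Hz for one second.
    print(trajectory_parameter_count(30, 420, 1.0))  # 12600
    # Illustrative only: 3 primitives with 4 parameters each.
    print(primitive_parameter_count(3, 4))  # 12
```

The count grows linearly with both the control rate and the task duration. A description in terms of a few primitives does not grow this way.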
It is our hope that
primitives can be used to reduce the dimensionality of the learning problem.
Primitives are solutions to small parts of a task that can be combined to
complete the task. A solution to a task may be made up of many primitives. In
the air hockey environment, for example, there may be primitives for hitting
the puck, capturing the puck, and defending the goal.
The above figure
shows our view of a primitive. Currently, a human, using domain knowledge,
designs the candidate primitive types that are to be used. The primitive
recognition module segments the observed behavior into the chosen primitives.
This segmented data is then used to provide the encoding for the primitive
selection, sub-goal generation, and action generation modules. The primitive
selection module provides the agent with the primitive to use for the observed
state of the environment. After a primitive type has been chosen, the
sub-goal generation module specifies the desired outcome, or goal, of that
primitive. Lastly, the actuators must be
moved to obtain the desired outcome. The action generation module provides the
actuator commands needed to execute the chosen primitive type with the current
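The Python sketch below shows one minimal way to fit the selection, sub-goal generation, and action generation modules together. It makes several assumptions for illustration only:

- each segmented observation is stored as a (state, primitive type, outcome) example;
- primitive selection copies the type used in the nearest observed state;
- the sub-goal is the average outcome of the nearest examples of that type;
- action generation is a simple proportional controller.

The names `Example`, `select_primitive`, `generate_subgoal`, and `generate_action` are hypothetical. This is not the framework's actual implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Example:
    """One segmented observation (hypothetical format)."""
    state: tuple[float, float]    # e.g. puck position when the primitive began
    primitive: str                # primitive type the teacher used
    outcome: tuple[float, float]  # target position the primitive reached (its sub-goal)


def select_primitive(examples: list[Example], state: tuple[float, float]) -> str:
    """Primitive selection: reuse the type chosen in the most similar observed state."""
    nearest = min(examples, key=lambda e: math.dist(e.state, state))
    return nearest.primitive


def generate_subgoal(examples: list[Example], state: tuple[float, float],
                     primitive: str, k: int = 2) -> tuple[float, float]:
    """Sub-goal generation: average the outcomes of the k nearest examples of this type."""
    same_type = [e for e in examples if e.primitive == primitive]
    if not same_type:
        raise ValueError(f"no observed examples of {primitive!r}")
    nearest = sorted(same_type, key=lambda e: math.dist(e.state, state))[:k]
    n = len(nearest)
    return (sum(e.outcome[0] for e in nearest) / n,
            sum(e.outcome[1] for e in nearest) / n)


def generate_action(position: tuple[float, float], subgoal: tuple[float, float],
                    gain: float = 0.5) -> tuple[float, float]:
    """Action generation: a proportional command moving toward the sub-goal."""
    return (gain * (subgoal[0] - position[0]), gain * (subgoal[1] - position[1]))


if __name__ == "__main__":
    observed = [
        Example((0.0, 0.0), "straight shot", (0.0, 1.0)),
        Example((0.1, 0.0), "straight shot", (0.2, 1.0)),
        Example((1.0, 0.0), "left bank shot", (-0.5, 0.5)),
    ]
    state = (0.02, 0.0)
    kind = select_primitive(observed, state)
    goal = generate_subgoal(observed, state, kind)
    print(kind, goal, generate_action((0.0, 0.0), goal))
```

Each module here is a separate function trained only on the segmented examples. Any one of them could be swapped for a different learner without changing the others.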
After the agent has obtained initial training from observing human performance, it should then increase its skill at the task through practice. Up to this point, the agent's only high-level goal is to perform like the teacher; the goal of the entire task is encoded only implicitly in the primitives performed. The learning from execution module contains the information needed to evaluate how well each of the modules contributes toward a high-level task objective. This information is then used to update the modules and improve their performance, possibly beyond that of the teacher.
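The Python sketch below is a hedged illustration of learning from practice. It attaches a hypothetical value estimate to each stored example and moves that value toward the task-level reward observed after the example is used. Selection then trades off similarity against the learned value, so the agent can move away from what the teacher did when that does not work. The value-weighted nearest-neighbor rule and the names `select` and `update_from_practice` are assumptions made for this example only.

```python
import math
from dataclasses import dataclass


@dataclass
class Example:
    """A stored observation plus a value learned from the agent's own practice."""
    state: tuple[float, float]
    primitive: str
    value: float = 0.0  # hypothetical estimate of how well this example serves the task objective


def select(examples: list[Example], state: tuple[float, float],
           value_weight: float = 1.0) -> Example:
    """Prefer nearby examples, but penalize ones that have scored poorly in practice."""
    return min(examples,
               key=lambda e: math.dist(e.state, state) - value_weight * e.value)


def update_from_practice(example: Example, reward: float, rate: float = 0.5) -> None:
    """Move the example's value a fraction `rate` of the way toward the observed reward."""
    example.value += rate * (reward - example.value)


if __name__ == "__main__":
    a = Example((0.0, 0.0), "straight shot")
    b = Example((0.2, 0.0), "left bank shot")
    memory = [a, b]
    state = (0.05, 0.0)
    chosen = select(memory, state)          # imitation: the nearest example wins
    print(chosen.primitive)                 # straight shot
    update_from_practice(chosen, reward=-1.0)  # e.g. the shot was blocked
    print(select(memory, state).primitive)  # left bank shot
```

With `value_weight=0`, this reduces to pure imitation. Larger weights let practice outcomes override the teacher's choices more readily.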
Learning from observation research is currently being performed in a variety of domains. A grid-world maze, an air hockey game, and a marble maze are being explored in virtual environments, and a hardware version of the marble maze is also being explored. These domains were chosen because they are easy to simulate and provide a starting point for learning more about learning from demonstration. They can easily be created on a computer and played using a mouse, and they are also small enough to be operated in a laboratory. Since the basic movements in these domains are only in two dimensions, motion capture and object manipulation are simplified. A camera-based motion capture system can easily be used to collect data in hardware implementations of air hockey and the marble maze, and a stationary arm or a similar robotic device can be programmed to play air hockey on an actual table.
The grid-world maze
consists of a virtual robot in a maze. The robot is put in a starting position
and must find its way through the maze to the goal position. Reinforcement
learning is used in this domain. The software was created with MVC++ and
uses the Tcl/Tk library.
The robot uses a very straightforward Q-learning algorithm. We are using it to explore
techniques in which observed data can be incorporated into this algorithm to
decrease learning time. The robot
decides on the action to perform by looking at the values of the next possible
actions that can be taken from the current state. The value of a state/action pair, Q(s,a), is
the future discounted reward that the agent can expect to receive by taking action
a from state s. Some examples of
state/action pairs would be ((1,1), down) and ((1,3), up). The agent's objective is to reach the goal in
the fewest steps. The agent
receives a reward of -1 for each step that is taken, and the value of the goal state is 0. The values are updated each time a move is
made using the following function:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$
where s' is the state reached by taking action a in state s, r is the reward received (-1 for each step), α is the learning rate, and γ is the discount rate. The learning rate controls how much the state/action value is changed at each step. The discount rate reflects the fact that rewards that can be received soon are more valuable than equivalent ones that can be received much later in the process.
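The Python sketch below shows tabular Q-learning on a small grid maze with a reward of -1 per step and a terminal goal of value 0. It uses the standard one-step update Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') − Q(s,a)]. The maze layout, the ε-greedy exploration, the parameter values, and the function names are assumptions for illustration; this is not the original grid-world software.

```python
import random

# Grid moves as (dx, dy); "down" increases y, matching pairs like ((1,1), down).
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def step(state, action, width, height, walls):
    """Deterministic move; bumping into a wall or the edge leaves the robot in place."""
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    if not (0 <= nxt[0] < width and 0 <= nxt[1] < height) or nxt in walls:
        nxt = state
    return nxt, -1.0  # every step costs -1


def q_learning(width, height, walls, start, goal, episodes=500,
               alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration; unseen pairs start at 0."""
    rng = random.Random(seed)
    q = {}

    def value(s, a):
        return q.get((s, a), 0.0)

    for _ in range(episodes):
        s = start
        while s != goal:
            if rng.random() < epsilon:
                a = rng.choice(list(ACTIONS))
            else:
                a = max(ACTIONS, key=lambda act: value(s, act))
            s2, r = step(s, a, width, height, walls)
            # The goal is terminal, so its value is 0.
            best_next = 0.0 if s2 == goal else max(value(s2, b) for b in ACTIONS)
            q[(s, a)] = value(s, a) + alpha * (r + gamma * best_next - value(s, a))
            s = s2
    return q


def greedy_path(q, start, goal, width, height, walls, max_steps=100):
    """Follow the highest-valued action from each state until the goal is reached."""
    path, s = [start], start
    while s != goal and len(path) <= max_steps:
        a = max(ACTIONS, key=lambda act: q.get((s, act), 0.0))
        s, _ = step(s, a, width, height, walls)
        path.append(s)
    return path


if __name__ == "__main__":
    walls = {(1, 0), (1, 1)}  # a 3x3 maze whose only route goes around the wall
    q = q_learning(3, 3, walls, start=(0, 0), goal=(2, 0))
    print(greedy_path(q, (0, 0), (2, 0), 3, 3, walls))
```

Every step costs -1, so starting all values at 0 is optimistic. Even the greedy choices therefore push the robot to try untested actions early in learning.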
Air Hockey
A cyber air hockey game was created that can be
played on any computer that supports Open Inventor and Tcl/Tk. The game consists
of two paddles, a puck, and a board to play on.
A human player controls one paddle using a mouse. At the other end is a cyber-human.
The following primitives are currently being explored:
- Left Bank Shot: the player hits the puck, and the puck hits the left wall once and then travels toward the goal.
- Straight Shot: the player hits the puck, and the puck travels straight toward the goal without hitting a wall.
- Right Bank Shot: the player hits the puck, and the puck hits the right wall once and then travels toward the goal.
- Block: the player does not make a shot but attempts to block the puck from entering the player's goal.
- Setup: the player positions their paddle in preparation to make a shot.
- Multi-shot: the player has blocked or made a shot, and the puck does not have enough velocity to return to the other side of the board, so the player has the opportunity to make another shot.
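The primitive recognition step for the three shot types can be sketched as follows, under simplifying assumptions. The input is the sequence of puck x-coordinates (x increasing to the shooter's right), sampled from the hit until the puck reaches the far end. A wall hit is inferred from a reversal of horizontal motion. The function names and the jitter threshold `eps` are hypothetical. Real tracking data would need smoothing, and Block, Setup, and Multi-shot would need paddle and velocity information that this sketch ignores.

```python
def walls_hit(xs: list[float], eps: float = 1e-3) -> list[str]:
    """Infer wall contacts from reversals in horizontal motion.

    xs: puck x-coordinates (x grows toward the shooter's right), sampled from
    the hit until the puck reaches the far end. Moves smaller than eps are
    treated as jitter and ignored.
    """
    hits = []
    prev_dx = 0.0
    for x0, x1 in zip(xs, xs[1:]):
        dx = x1 - x0
        if abs(dx) <= eps:
            continue
        if prev_dx < 0 < dx:
            hits.append("left")   # was moving left, now moving right
        elif prev_dx > 0 > dx:
            hits.append("right")  # was moving right, now moving left
        prev_dx = dx
    return hits


def classify_shot(xs: list[float]) -> str:
    """Map the inferred wall contacts onto the three shot primitives."""
    hits = walls_hit(xs)
    if not hits:
        return "Straight Shot"
    if hits == ["left"]:
        return "Left Bank Shot"
    if hits == ["right"]:
        return "Right Bank Shot"
    return "unrecognized"  # e.g. several bounces; not one of the listed primitives


if __name__ == "__main__":
    print(classify_shot([0.5, 0.3, 0.1, 0.0, 0.2, 0.4]))
```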
This research is also being conducted on a hardware version of the game.
The onboard cameras and a vision system that locates colored objects in the image are used to observe the state of the environment. This image shows the four corners of the board and the puck as seen by the vision system.
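A very simple way to locate a colored object, such as the puck, is to threshold each pixel by color and take the centroid of the matching pixels. The Python sketch below does this on an image represented as nested lists of (R, G, B) tuples. The image representation, the per-channel tolerance `tol`, and the function name are assumptions, and the vision system actually used in this work may operate quite differently.

```python
def locate_color(image, target, tol=30):
    """Centroid (row, col) of pixels whose R, G and B are each within tol of target.

    image: list of rows, each a list of (r, g, b) tuples with 0-255 values.
    Returns None when no pixel matches.
    """
    row_sum = col_sum = count = 0
    for r, row in enumerate(image):
        for c, (red, green, blue) in enumerate(row):
            if (abs(red - target[0]) <= tol and abs(green - target[1]) <= tol
                    and abs(blue - target[2]) <= tol):
                row_sum += r
                col_sum += c
                count += 1
    if count == 0:
        return None
    return row_sum / count, col_sum / count


if __name__ == "__main__":
    black, red = (0, 0, 0), (220, 30, 30)
    frame = [[black] * 4 for _ in range(4)]
    for r, c in [(1, 1), (1, 2), (2, 1), (2, 2)]:
        frame[r][c] = red
    print(locate_color(frame, target=(255, 0, 0), tol=40))  # (1.5, 1.5)
```

Running the same function once per marker color would give one image position per tracked object.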
Labyrinth (Marble Maze)
Maze outfitted with motors, encoders, sensors, a camera and vision processor.
Software Labyrinth game.
Learning from observation using primitives is also being explored in the Labyrinth
environment, both in software and on hardware. As a human plays the game,
the board and ball positions are recorded.
Primitives are extracted from this data. The following primitives are
currently being explored:
- Wall Roll Stop: the ball rolls along a wall and stops when it is in a corner.
- Roll Off Wall: the ball rolls along a wall and then rolls off the end of the wall.
- Roll From Wall: the ball is on a wall and then is maneuvered off it.
- No Wall: the ball is guided from one location to another without touching a wall.
- Corner: the ball is in a corner and the board is positioned in preparation to move the ball from the corner.
(Figure: Wall Roll Stop, Roll Off Wall, Roll From Wall.)
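Extracting these primitives from the recorded data starts with splitting the ball trajectory into stretches with and without wall contact. The Python sketch below labels each recorded ball position as "no wall", "wall", or "corner" relative to the outer edges of a rectangular board. It then groups consecutive samples with the same label. Considering only the outer walls, and the `margin` threshold, are simplifying assumptions. A real maze has interior walls, and telling Wall Roll Stop from Roll Off Wall or Roll From Wall would also need the ball's velocity and the wall geometry.

```python
def contact_label(pos, width, height, margin):
    """Classify one ball position relative to the outer walls of the board."""
    x, y = pos
    near_side = x <= margin or x >= width - margin
    near_end = y <= margin or y >= height - margin
    if near_side and near_end:
        return "corner"
    if near_side or near_end:
        return "wall"
    return "no wall"


def segment(positions, width, height, margin=0.5):
    """Group consecutive samples with the same label into (label, first, last) runs."""
    runs = []
    for i, pos in enumerate(positions):
        label = contact_label(pos, width, height, margin)
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1], i)  # extend the current run
        else:
            runs.append((label, i, i))          # start a new run
    return runs


if __name__ == "__main__":
    # Ball crosses the open board, reaches the left wall, and rolls into a corner.
    track = [(5, 5), (3, 5), (0.2, 5), (0.2, 3), (0.2, 0.2), (0.2, 0.2)]
    print(segment(track, width=10, height=10))
    # [('no wall', 0, 1), ('wall', 2, 3), ('corner', 4, 5)]
```

A "wall" run that ends in a "corner" run where the ball comes to rest is the pattern a Wall Roll Stop would produce. A run with no wall contact is a candidate No Wall segment.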
Humanoid Robot Learning and Game Playing Using PC-Based Vision, Darrin C. Bentivegna, Ales Ude, Christopher G. Atkeson, and Gordon Cheng. Presented at IROS 2002, Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland, October 2002.
Learning How to Behave from Observing Others, Darrin C. Bentivegna and Christopher G. Atkeson. Presented at the SAB'02 Workshop on Motor Control in Humans and Robots: on the interplay of real brains and artificial devices, Edinburgh, UK, August 2002.
A Framework for Learning From Observation Using Primitives, Darrin C. Bentivegna and Christopher G. Atkeson. Presented at the RoboCup 2002 Symposium, Fukuoka, Japan, June 2002.
Learning From Observation Using Primitives, Darrin C. Bentivegna and Christopher G. Atkeson. Presented at ICRA 2001, Seoul, Korea, May 2001.
Using Primitives in Learning From Observation, Darrin C. Bentivegna and Christopher G. Atkeson. Presented at Humanoids 2000, Boston, MA, September 2000.
Testbeds Used for Exploring Learning from Observation, Darrin C. Bentivegna and Christopher G. Atkeson. Published in the proceedings of the workshop for the AAAI 2000 Robot Competition and Exhibition.
Using Primitives in Learning from Observation: A Preliminary Report, Darrin C. Bentivegna and Christopher G. Atkeson. Published in the proceedings of the workshop of the Eighth AAAI Mobile Robot Competition and Exhibition, held at AAAI-99.