Searching


Touretzky Reading: None


I worked from Kurt Eiselt's lecture notes on this subject.
Now that we have all this new knowledge about representation in trees, hierarchical structures, networks, and the like, we need some means for exploring these knowledge structures to get at the information we want at the time we want it. How do we do this? The answer is a bunch of techniques which collectively fall under the heading of "search". Search is a concept which permeates computer science. We'll only touch on a couple of kinds of search in this course, but they'll be sufficient to demonstrate the basic difference between brute-force, exhaustive, or "dumb" search and heuristic or "intelligent" search.

Linear search

You probably already know how to do a linear search. You probably did linear searches in previous programming courses. For example, starting at the beginning of a file structure and looking at record after record for a specific entry is a linear search. (If you've ever seen my office, you know that the only way I could find something in there is by linear search: I start at one end of the desk and look at everything until I find what I'm looking for.) Linear searches take a long time -- O(n), that kind of time. (Actually, assuming an even distribution of stuff in the file, you'll look at about n/2 records on average, but the constant factors are more or less unimportant.)
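In LISP -- the language used for the rest of these notes -- a linear search over a list might look like this (a minimal sketch of the idea, mine rather than code from the course):

  (defun linear-search (item items)
    (cond ((null items) nil)                        ; ran off the end: failure
          ((eql item (first items)) item)           ; found it: success
          (T (linear-search item (rest items)))))   ; keep looking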

We can impose a separate indexing scheme on our file structure, so that we can cut down on some search time. For example, we could apply a binary search mechanism to look for an employee record in a file. If the employee's name starts with a letter in the range A-M, we could start the search at the beginning of the file, but if the name starts with a letter in the range N-Z, we would start the search at approximately the midway point in the file. We could continue to divide the big groups into smaller groups, until eventually the time to find a single record is governed not by the O(n) behavior of linear search but by the O(log n) behavior of binary search. There are other indexing mechanisms that we could use, such as hashing functions, that would give us different kinds of advantages.
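Here's a sketch of the binary search idea itself (again mine, not from the notes), assuming for simplicity that the employee names live in a sorted LISP vector rather than a file:

  (defun binary-search (name names &optional (lo 0) (hi (1- (length names))))
    (when (<= lo hi)                                ; nothing left to search: failure
      (let ((mid (floor (+ lo hi) 2)))              ; look at the middle record
        (cond ((string= name (aref names mid)) mid) ; found it: return its index
              ((string< name (aref names mid))      ; it sorts before the middle...
               (binary-search name names lo (1- mid)))
              (T                                    ; ...or after it
               (binary-search name names (1+ mid) hi))))))

Each comparison throws away half of the records still in play, which is what gets us from O(n) down to O(log n).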

Searching a hierarchical structure

As we discussed in the previous lecture, we don't always store our stuff in linear formats. We can also organize knowledge in hierarchies. Consider, for example, the Flintstone Family Tree:



                         Rocky
                         /   \
                        /     \
               has-mom /       \ has-dad
                      /         \
                    \/_         _\/
                 Pebbles       Bam-Bam
                 / \ has-dad       / \
        has-mom /   \     has-mom /   \ has-dad
               /     \           /     \
             \/_     _\/       \/_     _\/
            Wilma    Fred     Betty   Barney

In structures like this, as before, we may want to search for useful information. And structures like this, unlike linear file structures, make it easy to answer questions like "What's the relationship of Barney to Rocky?" or "Who is Rocky's grandfather on his mother's side?"

Depth-first search

The simplest form of search in a hierarchical or network structure is called "depth-first search". Here's an algorithm for depth-first search on a tree, looking for a specific node in the tree:

df-search

  1. look at the root
  2. if it's what you're looking for, then return success
  3. if the root has no descendants, then return failure
  4. call df-search on the subtree whose root is the leftmost descendant and return success if that search is successful
  5. call df-search on the subtree whose root is the rightmost descendant and return success if that search is successful
  6. if neither search succeeded, then return failure

This algorithm should look somewhat familiar, since it's just a variant of the preorder tree traversal algorithm you've all seen before:

preorder

  1. visit the root
  2. call preorder on the left subtree
  3. call preorder on the right subtree

The big differences between the preorder algorithm and the depth-first search algorithm are these:

  1. depth-first search stops before searching the whole tree, if it finds what it's looking for; preorder traversal always examines the entire tree
  2. with depth-first search, searching the right subtree occurs only if the search of the left subtree failed to find what was being looked for; with preorder traversal, the right subtree is always explored (this is sort of a corollary to the first difference listed just above)

These differences make implementation of depth-first search more complicated than preorder traversal, but not drastically so. Here's a simple depth-first search implementation for the Flintstone Family Tree, using the representation format for trees given in problem 1 on your first midterm exam. The tree looks like this in LISP:

  '(rocky (pebbles (wilma nil nil) (fred nil nil))
          (bam-bam (betty nil nil) (barney nil nil)))

And the LISP code itself looks like this:

  (defun dfs (item tree)
    (cond ((done? tree) nil)                             ; empty tree: failure
          ((found-item? item (get-root tree)) item)      ; found it: success
          (T (or (dfs item (get-left-subtree tree))      ; try the left subtree,
                 (dfs item (get-right-subtree tree)))))) ; then the right

  (defun done? (tree) (null tree))

  (defun found-item? (item node) (eql item node))

  (defun get-root (tree) (first tree))

  (defun get-left-subtree (tree) (second tree))

  (defun get-right-subtree (tree) (third tree))

The use of "or" in the "dfs" function is an easy way to fulfill the requirement that the right subtree isn't searched if what we're looking for is found in the left subtree. However, this may not be great programming style. It's not an especially obvious use of "or", which is typically used as a Boolean predicate, not as a program control mechanism. Also, this use of "or" takes advantage of an implementation detail (i.e., that "or" evaluates its arguments left to right, and stops as soon as it finds an argument which evaluates to a non-nil value), which also is not necessarily a great thing to do. Furthermore, this assumes that any given node has at most two children; if you want to cope with any number of children at any node, you'll have to code up a slightly different version of this anyway. For now, we'll leave the "or" there, but feel free to do something better.
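Here, as a sketch of what that "slightly different version" might look like (mine, not part of the original notes), is a variant that copes with any number of children. It uses "some", which applies a function to successive elements of a list and returns the first non-nil result -- the same stop-early behavior we were getting from "or" -- and it assumes a node is still a list whose first element is the root and whose remaining elements are its subtrees:

  (defun dfs-nary (item tree)
    (cond ((null tree) nil)                         ; empty subtree: failure
          ((eql item (first tree)) item)            ; found it at the root
          (T (some #'(lambda (subtree)              ; otherwise search the children
                       (dfs-nary item subtree))     ; left to right, stopping at
                   (rest tree)))))                  ; the first success

Because the leaves of the Flintstone tree are written as (wilma nil nil), this version happens to work on that same representation unchanged.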

Getting past yes or no

Sadly, the search function above doesn't tell me much -- just whether or not an item I'm looking for is in the tree. I'd get more information if I could get the search function to tell me how to get from the root of the tree to the item I'm looking for, assuming the item is in the tree. That path from the root to the item would at least be an approximation of the relationship between those two nodes in the tree; in the case of the Flintstones, for example, the path "Rocky -has-dad-> Bam-Bam -has-dad-> Barney" tells me something about the relationship between Rocky and Barney. How can I get my depth-first search procedure shown above to return this path, instead of just the item itself, when it finds the item in the tree? It's pretty easy. All you do is introduce an additional argument that stores the path from the root to wherever the procedure is looking in the tree. You get that additional argument by adding a helping function, just as with tail recursion. Then it's just a question of building up the result as the procedure searches deeper in the tree:

  (defun dfs (item tree) (dfs-helper item tree nil))

  (defun dfs-helper (item tree result)
    (cond ((done? tree) nil)
          ((found-item? item (get-root tree)) (cons item result))
          (T (or (dfs-helper item (get-left-subtree tree)
                             (cons (get-root tree) result))
                 (dfs-helper item (get-right-subtree tree)
                             (cons (get-root tree) result))))))

  (defun done? (tree) (null tree))

  (defun found-item? (item node) (eql item node))

  (defun get-root (tree) (first tree))

  (defun get-left-subtree (tree) (second tree))

  (defun get-right-subtree (tree) (third tree))

(Two details worth noticing: the recursive calls have to go through "dfs-helper", not "dfs", so that the path built up so far gets passed along; and the path comes out with the item first and the root last, so "reverse" the result if you'd rather read it from the root down.) And note that because I've taken the time to do a great deal of data abstraction, separating the functions that access the LISP data structure from the higher-level algorithm, all I had to do was make a few changes to the top-level procedure; the lower-level ones are untouched, because we didn't make any changes to the LISP data structure.
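For example, if the Flintstone tree from above is bound to a variable called *flintstones* (the name is just for illustration):

  (dfs 'barney *flintstones*)  =>  (BARNEY BAM-BAM ROCKY)

Read backwards, that's the root-to-item path Rocky -> Bam-Bam -> Barney -- the "Rocky -has-dad-> Bam-Bam -has-dad-> Barney" relationship described above, minus the arc labels, which this simple representation doesn't store.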

If you can adapt this procedure to a-lists, and you can figure out how to avoid searching infinitely through cycles in a network, then you've pretty much got the associated homework assignment wired.

The state space

The metaphor of searching a tree is also a convenient one for describing the state of a process (i.e., a program in execution). The state of a process changes over time, and at any given time the state of a process is a little slice of its history.

At a very low level, the state of a process is described by the values of the arguments being passed, the instruction being executed, and if you're programming with side effects, the bindings of variables to values. (Obviously, it's easier to describe the state of a process if you don't have to worry about side effects, as there's just that much less to keep track of.) However, thinking about state at this low level becomes very tedious very quickly. So, we might be better off using a higher-level abstraction in thinking about the state of a process. Consider, for example, a program to solve the 8-tile puzzle. Instead of thinking in terms of which instruction is being executed, the values bound to arguments, and so on, we can look at the process in terms of the state of the puzzle itself. Thus, the initial state of the process would be the initial state of the puzzle. Say the initial state looks like this:


                  2 8 3
                  1 6 4
                  7   5

We could move any of three tiles, the 7, the 6, or the 5, to generate the three possible next states from this one:


                  2 8 3
                  1 6 4
                  7   5
                   /|\
                  / | \
                 /  |  \
                /   |   \
               /    |    \
              /     |     \
             /      |      \
            /       |       \
          2 8 3   2 8 3   2 8 3
          1 6 4   1   4   1 6 4
            7 5   7 6 5   7 5

If we then choose, say, the lower leftmost state of those three new states, and generate the two possible next states from that one, we get this:


                  2 8 3
                  1 6 4
                  7   5
                   /|\
                  / | \
                 /  |  \
                /   |   \
               /    |    \
              /     |     \
             /      |      \
            /       |       \
          2 8 3   2 8 3   2 8 3
          1 6 4   1   4   1 6 4
            7 5   7 6 5   7 5
           / \
          /   \
         /     \
        /       \
       /         \
     2 8 3     2 8 3
       6 4     1 6 4
     1 7 5     7   5

Note that one of these new states is just a repeat of the initial state. We wouldn't want to explore that direction any further, because we'd just be doing work we've already done.
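To make this concrete, here's a sketch of how the boards and the tile-moving operations might look in LISP (mine again, with made-up names, not code from the course). A board is a flat list of nine elements, read row by row, with nil marking the blank square:

  (defparameter *initial-board* '(2 8 3
                                  1 6 4
                                  7 nil 5))

  ;; For each board position 0-8, the positions next to it on the 3x3 grid.
  (defparameter *neighbors* '((0 1 3)   (1 0 2 4)   (2 1 5)
                              (3 0 4 6) (4 1 3 5 7) (5 2 4 8)
                              (6 3 7)   (7 4 6 8)   (8 5 7)))

  (defun swap (board i j)
    "Return a copy of BOARD with positions I and J exchanged."
    (let ((copy (copy-list board)))
      (rotatef (nth i copy) (nth j copy))
      copy))

  (defun next-states (board)
    "All the boards reachable by sliding one tile into the blank."
    (let ((blank (position nil board)))
      (mapcar #'(lambda (tile-pos) (swap board blank tile-pos))
              (rest (assoc blank *neighbors*)))))

With the blank at the bottom middle of the initial board, (next-states *initial-board*) produces exactly the three successor boards drawn above; and testing a new board with something like (member board seen :test #'equal), where seen is a list of boards already visited, is how we'd notice that repeat of the initial state before exploring it again.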

Now, if we think of the movement of tiles as the significant operations in this process, we can describe the history of the process in terms of puzzle boards and the operations necessary to get from one board to the next. And since the nature of the operations in this case is such that at most four new boards can be generated from any given board, we can safely say that the current behavior of the process depends on its history -- the process couldn't have been in the current state without having just been in one of a very few previous states.

If we keep applying operations (i.e., moving tiles) to the leftmost board in the tree, we're going to get a depth-first search. But we're not searching some pre-existing data structure; instead we're searching something that's being "built" as the program executes. This something is called a "state space" (or a "problem space"), and our hypothetical 8-tile program is performing a "state-space search" by following a depth-first search algorithm.

A state space is defined as the set of all possible states generated by the repeated application of a finite set of operations to some initial state. In performing a state-space search, the intention is usually to find a sequence of operations that gets one from the initial state to some goal state. In the case of the 8-tile puzzle, that goal state might be:


                  1 2 3
                  8   4
                  7 6 5
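Putting the pieces together, here's a sketch of a depth-first state-space search built on the next-states function above (mine again, and only a sketch -- plain depth-first search can wander a very long way through this space before stumbling onto the goal). The seen argument holds the boards along the current path, which is what keeps the search from looping through repeated states like the one we noticed earlier:

  (defun state-space-dfs (board goal &optional seen)
    (cond ((equal board goal) (list board))         ; made it: the path ends here
          ((member board seen :test #'equal) nil)   ; already on this path: don't loop
          (T (let ((rest-of-path
                     (some #'(lambda (next)
                               (state-space-dfs next goal (cons board seen)))
                           (next-states board))))
               (when rest-of-path
                 (cons board rest-of-path))))))     ; tack this board onto the front

If it succeeds, (state-space-dfs *initial-board* '(1 2 3 8 nil 4 7 6 5)) returns the sequence of boards from the initial state to the goal shown above -- though not necessarily a short sequence, which is where the heuristic or "intelligent" search mentioned at the beginning comes in.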

Why generate the state space at run-time, and not just have it all built in advance? For some applications, that might not be much of a problem. For example, in the 8-tile puzzle, the number of different ways to arrange the tiles isn't overwhelming. On the other hand, if you were working on a program that could play a decent game of chess, and you wanted to pre-build a data structure that was composed of all possible boards, you'd want to make sure that you set aside a little disk space to store the approximately 10^120 (i.e., a 1 followed by 120 zeros) different boards that are possible. Or maybe you'd be better off writing your program to generate just those boards that were relevant to the specific chess game it was playing at that particular time, and not worry about the rest of them.
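(Incidentally, LISP will compute that number exactly if you ask: (expt 10 120) returns the full 121-digit integer, thanks to bignum arithmetic. Storing the boards is another matter.)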

Copyright 1994 by Kurt Eiselt. All rights reserved.

