The Domain Based Design Documentation and Component Reuse and Their Application to a System Evolution Record project (DARE project) is investigating the recovery of design rationale from software artifacts. The derived information is intended to comprise part of the System Evolution Record for ARL's Rapid Prototyping for System Evolution Record STO.
The DARE project makes use of the Synchronized Refinement (SR) reverse engineering method, basing design recovery on a model of a software system's application domain. Domain analysis is used to drive code analysis, and the two analyses interact to construct an architectural model of the software system. Additionally, throughout all three aspects (domain, code, and architecture), we have explored the automatic generation of interlinked documentation.
As part of the project, we have engaged in two case studies. The first examined the Mosaic Internet web browser. In this case, the source code is written in the C language, comprises about 100KLOC, and is publicly accessible. The second case study examined the domain of software loader/verifiers (SLVs) for embedded systems. In this case, the software we examined was written in the Ada programming language, comprised about 25KLOC, and was obtained from the Army Tank and Automotive Command (TACOM) where it was part of their SPAIDS effort (Software Program for Affordability Improvements in Depot Systems). In addition to the source code, various documents were available including a Software Requirements Specification, a Software Design Document, Operating Instructions, and two Test Reports.
The alternative to formal specifications is informal descriptions; there are
many such informal descriptions of the World Wide Web available both on and off
the Web. All the qualities that make a description informative also make
it a good source for domain analysis: conciseness, completeness, clarity, and
so on. In addition, to allow for machine-aided object analysis, it is helpful
if the material is available in machine-readable form. For the case study, we
chose Chapter 1 from Introduction to CGI/PERL (S. Brenner and E. Aoki, M
& T Books, 1996).
Lexical analysis consisted of word-frequency analysis, a first pass over the
World Wide Web descriptions to determine the important words. After eliminating
noise words such as "the", the remaining words are counted; more frequently
occurring words are assumed to be more important than less frequently occurring
ones. To get a feel for the larger structure of the descriptions, we also
looked at digram (e.g. "server sends") and trigram counts. Table 1 shows some
frequency analysis results.
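The word-frequency and digram counting described above can be sketched in a few lines; the stop-word list and sample text here are illustrative stand-ins, not the ones used in the study:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}  # illustrative noise words

def frequency_analysis(text):
    """Count single words and digrams (adjacent word pairs), skipping noise words."""
    words = [w for w in text.lower().split() if w.isalpha() and w not in STOP_WORDS]
    return Counter(words), Counter(zip(words, words[1:]))

words, digrams = frequency_analysis(
    "the server sends the page and the client renders the page")
print(words.most_common(2))    # most frequent candidate domain terms
print(digrams.most_common(1))  # most frequent word pair
```

Trigram counting is the same idea with `zip(words, words[1:], words[2:])`.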
Objects and Relations
We then performed a noun and verb analysis on the World Wide Web description to
extract possible object classes and associations. The analysis resulted in the
25 object classes and 6 associations depicted in our
Mosaic Object-Model Diagram.
OMT Object-Model Diagram
We have Object Modeling Technique Object-Model Diagrams of the Object
Architecture for Mosaic.
We have an HTMLized version of the Requirements.
Software Refinery Code
We have some Refine and Dialect code to assist in our analyses.
We have some filters that do simple lexical analysis on
text as a prelude to doing a more complete object analysis.
One simple way to bridge the gap between code and domain is to examine one side
for words used in the other side. For example, by examining the domain, it may
be possible to determine that functions containing the word "cache" are used to
cache retrieved pages or open socket connections. In the other direction, the
requirement that caching be used to improve performance raises the expectation
that various parts of the code may be involved with aspects of caching.
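This kind of bridging can be approximated by a trivial cross-reference: scan the identifiers on one side for vocabulary from the other. A minimal sketch (the sample code fragment and keyword are invented for illustration):

```python
import re

def identifiers_mentioning(source_text, domain_word):
    """Return the identifiers in source_text whose names contain domain_word."""
    idents = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source_text))
    return sorted(i for i in idents if domain_word.lower() in i.lower())

code = "void HTCacheInit(void); int cache_lookup(char *url); int parse_url(char *u);"
print(identifiers_mentioning(code, "cache"))
```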
We have examined the possibilities from the domain side by constructing a Web page that performs keyword searches on a design document, returning those pages containing matches to the search. The Web page captures the search request and sends it back to a CGI program on the server side. The CGI program applies the search request to a concordance generated from the design document, noting the pages containing matches. The CGI program then generates a Web page in reply to the search request; the reply page contains a thumbnail of each matching page with the matches highlighted in color. Each thumbnail is too small to read individually, but the density and color of the matches easily stands out to indicate relevance. Clicking on a thumbnail retrieves a full-sized version of the associated page.
In addition to indicating the general usefulness of keyword-directed searching
for domain concepts, this application also nicely ties together several of the
other tools created during this project, including the lexical analysis tools
used to generate the concordance and the free-text formatter used to generate
the full-sized versions of matched pages.
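The concordance at the heart of this search can be sketched as a simple inverted index from words to page numbers; the document content below is invented for illustration:

```python
from collections import defaultdict

def build_concordance(pages):
    """Map each word to the set of page numbers on which it occurs."""
    index = defaultdict(set)
    for page_no, text in enumerate(pages, start=1):
        for word in text.lower().split():
            index[word].add(page_no)
    return index

pages = ["the cache improves performance", "the loader verifies images",
         "cache replacement policy"]
index = build_concordance(pages)
print(sorted(index["cache"]))  # pages matching the keyword "cache"
```

The reply page is then a matter of rendering a thumbnail for each page number the index returns.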
High-level models are useful in the description and comprehension of complex
constructs such as programs. Their utility is due primarily to their
abstraction of the details of the systems they describe. In contrast, the
source code of a program is full of such details. A rough analogy between the
high-level models of an application program (in terms of a domain model, design
decisions, and a prototypical architectural model) and the field of building
architecture is the difference between the blueprint of the classical Frank
Lloyd Wright Prairie House and a particular instance of it. For example, there
may be differences in the instantiation that are derived from pragmatic
concerns due to the site peculiarities.
Synchronized Refinement matches a body of source code to an application-domain
model. Through code analysis, an abstraction of the code is developed that is
resolved with the expectations generated by the models. This is an iterative
process reflected by the dataflow cycle in Figure 1 from the Abstraction
process to Abstract program description to the Detection process to the
Mappings/ annotations and finally back to the Abstraction process.
Additionally, the high-level model is enhanced to fit the source code. In
reference to the building architecture analogy, SR also constructs a blueprint
of a Prairie House that incorporates the peculiarities of this instantiation
(i.e., the deviations from the classical).
The software reflexion model (Murphy, Notkin, and Sullivan, "Software
Reflexion Models: Bridging the Gap between Source and High-Level Models." ACM
SIGSOFT 95 Symposium on the Foundations of Software Engineering, October 1995)
compares an architectural model of a program to its actual implementation. It
does this by using a map between source code and an architectural model
suggested by the analyst. A reflexion model can be used to incrementally refine
an architectural model. The process begins with an analyst specifying what the
suspected architectural model is for the source code. By incrementally adapting
the mappings between major structural elements of the source code (e.g. file
names, directories, name fragments), an architectural model is matched to the
source code. The differences between the specified and derived models reveal
where the analyst's model and the implementation disagree. The mappings may be
modified to bring the two into closer convergence.
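A minimal version of the reflexion computation can be sketched as follows: map source files to architectural components, lift the extracted call edges through that map, and compare them to the edges the analyst declared. All names here are hypothetical:

```python
def reflexion(declared_edges, call_edges, file_to_component):
    """Classify lifted source edges as convergences, divergences, or absences."""
    lifted = {(file_to_component[a], file_to_component[b]) for a, b in call_edges
              if file_to_component[a] != file_to_component[b]}
    convergences = lifted & declared_edges
    divergences = lifted - declared_edges   # in the source, not in the model
    absences = declared_edges - lifted      # in the model, not in the source
    return convergences, divergences, absences

file_to_component = {"fetch.c": "Network", "cache.c": "Cache", "render.c": "UI"}
declared = {("Network", "Cache"), ("UI", "Network")}
calls = [("fetch.c", "cache.c"), ("render.c", "cache.c")]
conv, div, absent = reflexion(declared, set(calls), file_to_component)
```

Refining the model is then a loop: adjust `file_to_component` or `declared` until the divergences and absences are acceptable.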
One way to construct an initial architectural model is to analyze procedure calls. Call-graph analysis determines which functions call which others in a program. For some languages this may be a problem: in C, for example, a function may be called indirectly through a function pointer, and in polymorphic languages the most appropriate version of a function may not be known until runtime. Even with these limitations, call-graph analysis can still be useful in generating abstract program representations.
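As a concrete illustration of call-graph extraction, here is a sketch over Python source using its `ast` module, standing in for the C tooling the project actually used; it only sees direct calls by name, which mirrors the indirect-call limitation noted above:

```python
import ast

def call_graph(source):
    """Map each top-level function to the names it calls directly."""
    graph = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            graph[node.name] = sorted({
                c.func.id for c in ast.walk(node)
                if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)})
    return graph

src = """
def load(path): return verify(read(path))
def verify(img): return img
def read(path): return path
"""
print(call_graph(src))
```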
Analysis of source-code data types may be used to develop an object-oriented
representation of the code. This may be done even if the source language
doesn't natively support object-oriented constructs. Part-of associations
(aggregation data structures) may be derived from analysis of the hierarchical
structure of the types. This may be supplemented by call-graph analysis. For
example, a datatype's subcomponent may refer to a collection. If a member
function associated with the datatype retrieves elements from the collection,
that would indicate a part-of association. This deduction can be made if the
selector interface to the object (or abstract datatype) in question returns the
elemental type of the collection instead of the collection itself and there are
no selectors which do return the collection. In this case the elements of the
collection have a relationship to the object. The collection is a manifestation
of this relationship's many-to-one property. This analysis involves observing
the functions' formal arguments and return types and associating functions to
the abstraction interface of one particular type. Additionally, datatypes
serving as collections have to be identified. Implementation of this in the
Software Refinery requires interaction with both the low level (parse
tree/symbol table) and high level (who-calls, etc.) representations of the
program. The Refine language makes both representations manipulable by many of
the same operations and transformations.
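The selector-based deduction described above can be sketched over a toy signature table; the types and selectors below are invented, and a real implementation would draw them from the parse tree and symbol table:

```python
def part_of_associations(selectors, collections):
    """selectors: (owner_type, return_type) pairs; collections: {collection: element}.
    Infer that owner has a part-of (many) association to an element type when some
    selector on owner returns a collection's element type and no selector on owner
    returns the collection itself."""
    assocs = set()
    for coll, elem in collections.items():
        for owner, ret in selectors:
            returns_coll = any(o == owner and r == coll for o, r in selectors)
            if ret == elem and not returns_coll:
                assocs.add((owner, elem))
    return assocs

collections = {"PageList": "Page"}                   # collection type -> element type
selectors = [("History", "Page"), ("History", "int")]  # e.g. get_page, length
print(part_of_associations(selectors, collections))
```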
sbdump Filter Tools
We have a program that reads sbdump files and produces various kinds of
static program analysis on the data read.
Analysis by Relational Modeling
Refine is an excellent tool for representing program text as abstract syntax
trees, and provides many powerful, but low-level, mechanisms for manipulating
abstract syntax trees. However, the low level with which one engages the
abstract syntax tree, coupled with the number of details involved with using
Refine itself, means that analyses tend to be rather heavy-weight in terms of
the effort required to produce them.
We felt it desirable to be able to perform tentative, speculative program analysis with less overhead than would be required by using Refine, and decided to investigate whether relational database systems (RDBS) could be used to perform the kind of lightweight program analysis we had in mind.
We used the Leap relational database system to prototype our ideas. Leap is a teaching tool for learning about relational databases, but it does offer the usual complement of relational operators, and we were betting its simplicity would exact only a low overhead when implementing our ideas. If we judge our prototype worthy of further consideration, we can move on to a more production-oriented relational database system.
The first step was to extract program information and store it as relational
tables. Our program information came in the form of sbdump
information produced by the Sun SunSoft C compiler. Our relational table
design involved a straightforward translation of the tabular information
produced by sbdump. We wrote several AWK scripts to parse the
sbdump tables and generate Leap
scripts that create the tables.
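The translation step can be sketched as follows; since neither the actual sbdump layout nor Leap's script syntax is reproduced here, both the input and the generated commands are invented stand-ins:

```python
def table_to_script(name, header, rows):
    """Emit a simple script that creates and fills one relational table."""
    lines = [f"create ({name}) ({','.join(header)})"]
    lines += [f"add ({name}) ({','.join(row)})" for row in rows]
    return "\n".join(lines)

header = ["caller", "callee"]
rows = [["main", "HTLoad"], ["HTLoad", "HTParse"]]
print(table_to_script("calls", header, rows))
```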
Some of our analyses require non-relational operations, for example,
regular-expression matching, which meant our analyses had to be specified in
something a bit more involved than just Leap scripts. Leap provides only a
line-oriented client interface; it has no programming-level interface. We used
emacs, which can run Leap as a background process and deal with Leap's output
in buffers with the usual emacs text-manipulation facilities. Using these
techniques, we were able to, for example, create an analysis of which
subroutines access which global variables in the Mosaic World Wide Web browser.
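Combining a stored relation with regular-expression matching, the global-variable query can be sketched like this; the relation's contents are invented for illustration:

```python
import re
from collections import defaultdict

# (subroutine, global variable) access pairs, as they might come from the database
accesses = [("main", "HTAlertDialog"), ("HTLoad", "HTCacheSize"),
            ("HTCacheGC", "HTCacheSize")]

def globals_by_subroutine(accesses, pattern=".*"):
    """Group globals by accessing subroutine, keeping only globals matching pattern."""
    result = defaultdict(set)
    for sub, glob in accesses:
        if re.fullmatch(pattern, glob):
            result[sub].add(glob)
    return dict(result)

print(globals_by_subroutine(accesses, r"HTCache.*"))
```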
The key to domain-driven program understanding is
bridging the gap between the conceptual domain model (the
application description) and the results of code analysis. We
believe an architectural abstraction of the program can
bridge this gap. This raises three primary issues.
During the refinement and elaboration of the domain, architecture, and program
models, expectations are generated about how they are connected. It is likely
that a truth-maintenance (or reason-maintenance) system will be needed to
handle confirmation and refutation of expectations.
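A first cut at such bookkeeping, well short of a full truth-maintenance system, might simply track each expectation's supporting and refuting evidence; this is a hypothetical sketch, not a design the project has committed to:

```python
class Expectation:
    """Track whether evidence so far confirms or refutes a hypothesis."""
    def __init__(self, claim):
        self.claim, self.evidence_for, self.evidence_against = claim, [], []
    def confirm(self, fact): self.evidence_for.append(fact)
    def refute(self, fact): self.evidence_against.append(fact)
    @property
    def status(self):
        if self.evidence_against: return "refuted"
        return "confirmed" if self.evidence_for else "open"

e = Expectation("code contains a caching component")
e.confirm("function HTCacheInit found")
print(e.status)
```

A real reason-maintenance system would also propagate retractions to expectations that depend on this one.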
You can compare the results of our architectural
analysis for the Software Loader/Verifier with a data-flow diagram taken from the Software Loader/Verifier design document.
Part of our architectural analysis is bottom-up, driven by information
extracted from code. We have two resources for bottom-up architectural
analysis: Refine and the sbdump filter tools.
By extracting various information from sbdump files, we can
construct assorted architectural analyses. We have a
program to produce call graphs.
We also have another call-graph generator that extracts
its information from a relational database containing the
sbdump data.
Automatic Linking of Artifacts
We have a filter that reads free-formatted text and
produces text decorated with HTML statements.
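Such a filter can be sketched as a paragraph splitter that wraps blank-line-separated blocks in HTML; this is a minimal stand-in for the project's formatter, not its actual implementation:

```python
import html

def text_to_html(text):
    """Wrap blank-line-separated paragraphs in <p> elements, escaping markup."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paras)

print(text_to_html("First paragraph.\n\nSecond one with <angle> brackets."))
```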
Conclusions and Future Work
The DARE project has made significant progress in satisfying its research
goals, but there are several areas where interesting further work is possible.
One of the original goals of our research was to explore how domain
knowledge could be used to further the goal of program
understanding. One way of accomplishing this is to have the domain
model expressed in a form easily accessible to the tools that will
undertake the source code analyses. In particular, we have
expressed our domain model in the Refine language and stored it in
the Software Refinery's object-oriented repository. We now need to
take the kinds of code analyses that we performed using the SBF
tools and do them using the Refinery's code analysis capabilities.
In particular, the domain model should guide the exploration
process by generating a set of "expectations", and, when the code
analyses detect constructs satisfying the expectations, the
object-oriented domain model can be "instantiated" with the
results. This process is called Synchronized Refinement, and we
have had success applying it manually. We would like to continue
working towards its automation.
We have performed three kinds of code analysis: invocation analysis,
type analysis, and coupling analysis. All of these generated large
amounts of data that we have viewed with the dot graph-drawing
tool. While dot is robust enough to deal with large graphs, the
results are not necessarily useful to the analyst. We would like
to look at several software visualization techniques that might
further the analyst's understanding.
o Make use of the third-dimension and allow the analyst to navigate
through the resulting visualizations to detect patterns.
o Apply some experimental algorithms that propose node-groupings
that might suggest modules or architectural components to the
analyst.
o Make use of information from actual executions to color an
otherwise confusing graph, indicating "hot spots" similar to an
MRI scan of a human brain taken while the subject is performing
a task.
Dynamic View Generation
Our document generation tools make a set of documents easier to
access by converting them to HTML, providing internal links, and
offering a dynamic keyword search capability. One relatively small
missing piece is the ability to take an existing diagram and
generate an HTML image map from it. The resulting page would allow
the software maintainer, using the diagram, to navigate to relevant
documentation. Currently, we generate the image maps by hand.
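Automating this would amount to emitting `<area>` elements from the diagram's node coordinates. A hypothetical sketch, assuming node positions and target pages are available from the drawing tool (all names and coordinates below are invented):

```python
def image_map(name, nodes):
    """nodes: (label, url, x1, y1, x2, y2) rectangles in image coordinates."""
    areas = [f'<area shape="rect" coords="{x1},{y1},{x2},{y2}" '
             f'href="{url}" alt="{label}">'
             for label, url, x1, y1, x2, y2 in nodes]
    return f'<map name="{name}">\n' + "\n".join(areas) + "\n</map>"

nodes = [("Loader", "loader.html", 10, 10, 110, 50),
         ("Verifier", "verifier.html", 130, 10, 230, 50)]
print(image_map("slv-diagram", nodes))
```

The page would then reference the map with `<img usemap="#slv-diagram" ...>`.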
The final, and most ambitious, remaining task is to deal with the
issue of external knowledge. In order to make a tangible
connection between a requirements document mentioning "performance"
and source code implementing a "cache", there needs to be knowledge
somewhere that a cache is one way of improving performance. How
should this knowledge be structured and accessed? One possibility
is to extend our concordance approach to look for common word use
in requirements, architecture, and code documents. Another is to
actually try to encode programming knowledge, in the form of a set
of "program plans", and use cliche recognition techniques to find
them in the code.
This page last modified on 29 October 1997.