DISL
at Georgia Tech and Jim Gray at Microsoft
The SkyServer traffic log is
partially available at http://skyserver.sdss.org/en/tools/traffic.asp
and we get 1000 log entries from the skyserver click stream log using the
following SQL statement:
select top 10 * from weblog..weblog
At http://skyserver.sdss.org/en/tools/search/sql.asp
From the schema of the
Weblog, we select the following four fields as the most significant columns for
understanding these 1000 log records. With larger set of log data, one can
choose all the fields or different set of fields if necessary.
Sequence Number |
ClientIP |
Command |
Error Code |
Sequence Number: represents the time of command arrival. The larger
the sequence number, the earlier the arrival time.
ClientIP: IP of the client who sends the command
Command: all commands are ¡°GET¡± commands, so we only process
the required URL.
Error Code: defined in HTTP, for each command.
¡°200¡± :Success
"302":Moved Temporarily
"304":Not Modified
"404":Not Found
"500":Internal Server Error
Total
Click Stream Log Records : 1000
Here is the sample data we
used: Download
the dataset
Vista reads the sample data
and performs a series of transformations:
Figure
1. Initial visualization
Figure 1 is the initial
visualization. The 4 dimensions are marked by ¡°ClientIP¡±, ¡°Command¡±, ¡°Error¡±
and ¡°Seq(time)¡±. In the initial
visualization, it is not easy to find useful patterns. The first mode of interaction is to
adjust the weight of each dimension by clicking on the blue widget at the end
of the axis. This is called
alpha-adjustment a
value¡± in the above screen.
Additional operations include zooming, random rendering, subset
selecting and marking. Also, the
raw data of current mouse-pointed item is automatically displayed at the bottom
of the window.
Figure 2. Find the distinct clients in the dataset
First, we try to find the
number of clients in the dataset by clicking the blue widget labeled with ¡°ClientIP¡±.
By increasing the weight of this dimension, we can see two clusters of points
forming, as seen in Figure 2. This
means there are 2 distinct IP addresses in this dataset. And also we can see
that client 2 issues more commands than client 1 because there are more points
in the cluster of client2. The number of commands in each cluster can be
observed by clicking the ¡°summary¡± button on the screen. The client IP of each
cluster is seen immediately by positioning the mouse pointer on any point in
the cluster. This is a simple illustration of visualizing clusters in one
subspace. To find clusters in terms of
other dimensions, we need to set the a of ¡°non-interesting¡± dimensions to 0, and then
change the a
of the dimensions of interest.
This is illustrated in the next step.
Figure 3. Observing the distribution of error code
To emphasize the clustering
results of other dimensions, we visually merge the data from the two
clients by setting the a
of ¡°ClientIP¡± to 0. In Figure 3 we see 5 groups along the Error axis. (Note the disappearance of the green
line that represents the ClientIP dimension. This can be imagined as ¡°looking at the points along the
ClientIP axis¡±.) There are four
obvious horizontal lines plus the subtle distinction between ¡°302¡± and ¡°304¡±
shown by a slight vertical difference.
The Error code of each group is shown on the bottom line by positioning
the mouse on any point of the group.
The number of points on each
horizontal line shows the number of each Error (more 200¡¯s than 30X, and some
¡°404¡± errors). The Error codes
from the two clients may be separated by resetting the a of ¡°ClientIP¡± to a positive value.
Figure 4. Analyzing the relation of error code & command
To display the relationship
between different kinds of Commands and Error codes, we set the a of ¡°ClientIP¡± and ¡°Seq(time)¡± to 0 and set the value
of a for ¡°Command¡± to
maximum. This merges the two clients and also eliminates the time difference,
and stretches the Command axis at the same time. It groups all the same commands (over different
times), facilitating a detailed analysis of commands that cause certain Error code. For example, at the right end of the
green cluster we have a point corresponding to Error ¡°302¡±. It means that only one command causes
the Error ¡°302¡±, which is ¡°/default.asp?-¡°. By positioning the mouse pointers on other groups, user can
see the commands that cause other errors, e.g., purple for ¡°404¡± and pink for
¡°500¡± that are serious errors for web pages. Typically, ¡°404¡± indicates a web
page with out of date URLs that are no longer valid. Similarly, ¡°500¡± shows server error that requires debugging.
Figure 4. Time sequence of different dimensions
The relationship
for any two dimensions can be easily repeated. Setting a of ¡°Command¡± and ¡°ClientIP¡± to 0, we can analyze the
time sequence for each Error code.
Although we have found interesting results from the first three steps, we
didn¡¯t see obvious relationships from the visualization in this step.
User can also move
¡°ClientIP¡± or ¡°Command¡± dimension to the position of ¡°seq¡± (by setting the a of seq to 0), we can observe the time
relationship between different clients or different commands. Additional features are described in
the papers below.
.
[1] Keke Chen, Ling Liu:
¡°Cluster rendering of skewed datasets via visualization
¡±, ACM SAC 2003
[2] Keke Chen, Ling Liu:
¡° VISTA: validating and refining clusters via visualization, J. of Information Visualization, Sept. 2004¡±