SWAN: SubWeb Analyzer

[ Last modified: 18 February 1998. Latest version of SWAN: 1.3b, 16 Feb 1998 ]

Introduction
Definitions
Usage
Implementation
Limitations and Future
Related Tools
Acknowledgments

Introduction

SWAN is a program that analyzes a local subweb of the World-Wide Web (WWW). Its purpose is to help maintain independent subwebs. Both terms are defined under Definitions below.

SWAN can collect some statistics about the subweb and it can report internal inconsistencies in the subweb.

SWAN can also generate a cross reference for the subweb, consisting of one (often huge) HTML document, and/or an HTML document for each HTML document in the subweb. These cross references constitute, in a sense, the inverse of the subweb; that is, they enable one to follow hyperlinks in the reverse direction.

Here are two applications for cross references:

When you (as webmaster) change a document in a subweb, you often have to change the documents that refer to it as well. The cross reference provides easy access to all such referencing documents.
When you (as user) read an interesting document in a subweb, you might wonder where this material is actually used. A cross reference helps you find the documents that refer to it.

The example page shows the application of SWAN to the subweb with SWAN material.

SWAN is available as a UNIX shell script; see Implementation for details. The current version is 1.3b, dated 16 February 1998 (the source file contains a version history).

Definitions

A subweb is a subset of the resources on the World-Wide Web (WWW). The subweb rooted in a directory (on a particular net location) contains as resources all world-readable files accessible from that directory via a path of world-readable-and-searchable directories.

We sometimes refer to the directory inducing a subweb as the root directory or just the root of the subweb. Whether the root directory itself is world-readable-and-searchable or not plays no role in the definition above. Of course, the subweb is accessible to the outside world only if its root is accessible.

N.B. Directories themselves are not considered resources of the subweb. See the discussion of independence below for a motivation.

In the context of a given subweb, a hyperlink is called

internal if it originates inside and points inside the subweb;
external if it originates inside and points outside the subweb.

Hyperlinks originating outside a subweb are (necessarily) ignored by SWAN.
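
The distinction between internal and external hyperlinks can be illustrated with a small Python sketch. It is not SWAN's code (SWAN is a shell script); the function name classify_link is hypothetical, and the rule that a relative path climbing above the root counts as external is an assumption derived from the definition above. Absolute URLs and URLs of the form /path are treated as external, matching how this document says SWAN groups them.

```python
import posixpath
from urllib.parse import urlsplit


def classify_link(source_path, href):
    """Return 'internal' or 'external' for a hyperlink found in source_path.

    source_path is the path of the referencing document relative to the
    subweb root, for example 'docs/a.html'.
    """
    parts = urlsplit(href)
    # Absolute URLs and server-rooted paths are grouped with external links.
    if parts.scheme or parts.netloc or parts.path.startswith("/"):
        return "external"
    # Resolve the relative path against the directory of the source document.
    base_dir = posixpath.dirname(source_path)
    resolved = posixpath.normpath(posixpath.join(base_dir, parts.path))
    # Assumption: a path that climbs above the root leaves the subweb.
    if resolved == ".." or resolved.startswith("../"):
        return "external"
    return "internal"


print(classify_link("docs/a.html", "../index.html"))
print(classify_link("docs/a.html", "http://www.w3.org/"))
```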

A subweb is called independent when

it can be accessed via various schemes (in particular, http, ftp, or file) without a user noticing the difference (except possibly for speed), and
it can be transferred to another location without affecting its appearance.

To accomplish independence, a subweb must adhere to the following conventions:

Internal hyperlinks use relative URLs (see RFC 1808).
Hyperlinks to index.html are explicit, that is, use `dirpath/index.html', rather than `dirpath/' or just `dirpath'.

The reason for the latter convention is as follows. When a directory with URL dirpath/ or dirpath is accessed via http, the file dirpath/index.html is returned if present, and a directory listing otherwise. Accessing such a directory via ftp or file always returns a directory listing. A hyperlink to a directory that contains index.html therefore behaves differently depending on the access scheme, which violates independence. Only when a directory contains no file index.html does direct access yield a directory listing under all access schemes.

Currently, SWAN flags all hyperlinks to directories as inconsistencies (even hyperlinks to directories not containing a file index.html). SWAN does not flag as inconsistencies internal hyperlinks using absolute URLs. However, these hyperlinks are grouped among the external hyperlinks where they can easily be found.
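
As an illustration of the directory check described above, here is a minimal Python sketch. It is an assumption-laden stand-in, not SWAN's implementation: the function name links_to_directory and the directories parameter (a set of directory paths relative to the subweb root, with `.' for the root itself) are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def links_to_directory(source_path, href, directories):
    """Return True if a relative hyperlink points to a directory.

    directories is a set of directory paths relative to the subweb root,
    without trailing slash; the root itself is written as '.'.
    """
    parts = urlsplit(href)
    if parts.scheme or parts.netloc or not parts.path:
        return False  # absolute URL, or a fragment within the same document
    if parts.path.startswith("/"):
        return False  # server-rooted path, grouped with external links
    base_dir = posixpath.dirname(source_path)
    resolved = posixpath.normpath(posixpath.join(base_dir, parts.path))
    # `dirpath/' is recognizable by its slash; plain `dirpath' needs a lookup.
    return parts.path.endswith("/") or resolved in directories


known_dirs = {".", "docs"}
print(links_to_directory("index.html", "docs", known_dirs))
print(links_to_directory("index.html", "docs/index.html", known_dirs))
```

Following the convention, the first link would be rewritten as `docs/index.html'.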

Usage

Synopsis

swan [ options ] [ directory ]

Description

SWAN analyzes the subweb rooted in directory. If no directory is supplied, then the current directory is assumed.

Without options, SWAN constructs an inventory of the subweb and reports on standard output all detected inconsistencies and some statistics. The following inconsistencies are reported:

internal hyperlinks to resources that are not available,
internal hyperlinks to document fragments (anchor names) that are not defined,
duplicate fragment definitions, and
resources that are not referenced.

Inconsistencies concerning internal hyperlinks are reported at the destination only. It is for you to determine the actual cause of the problem, which could also be with the source.

N.B. External hyperlinks are not checked for availability. (There are other tools for that.)

Options

-a: Analyze all resources of the subweb, including invisible ones, and all hyperlinks, including those to invisible resources.
-h: Provide help; no processing is done.
-i: Read the subweb inventory from .SWAN in the root directory of the subweb being analyzed.
-o: Write the subweb inventory to .SWAN in the root directory of the subweb being analyzed.
-q: Do not report anything on standard output (quiet mode).
+r / -r: Force world-read permission on/off for cross reference files (by default, your umask is in effect).
-s: Report only statistics on standard output; do not list inconsistencies.
-x: Create an individual cross reference .file.html for each HTML document file.html.
-X: Create an overall cross reference ..SWAN.html.

Files

standard output: The report is written to standard output.
directory/.SWAN: Inventory for subweb rooted in directory.
directory/..SWAN.html: Overall cross reference for subweb rooted in directory.
directory/.file.html: Cross reference for directory/file.html.
/tmp/SWANpid.inv: Temporary file with subweb inventory for -o option.
/tmp/SWANpid.sh: Temporary file with shell script to create individual cross references for -x option.
/tmp/SWANpid.html: Temporary file with overall cross references for -X option.

Linking a SubWeb to Its Cross Reference

There are some pitfalls when linking a subweb to the cross reference generated by SWAN. Once a cross reference has been generated and SWAN is rerun, there is the danger that the old cross reference gets incorporated in the new cross reference (like a virus). However, two features of SWAN help avoid this problem.

The cross references reside in invisible files. They are not seen by SWAN (unless option -a is used) but they can be available to the outside world.
Internal hyperlinks to invisible files are ignored by SWAN (again, unless option -a is used).

The thing you want to avoid is running SWAN with option -a while cross reference files already exist. (Maybe there should be an option to remove them on the fly.)
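
The naming scheme that keeps cross references out of SWAN's sight can be sketched in Python as follows. The file names come from the Files list above; the helper names are hypothetical, and treating `invisible' as `file name starts with a dot' is an assumption based on those names.

```python
import posixpath

OVERALL_XREF = "..SWAN.html"  # overall cross reference, per the Files list


def xref_name(html_path):
    """Map directory/file.html to its cross reference directory/.file.html."""
    directory, name = posixpath.split(html_path)
    return posixpath.join(directory, "." + name)


def is_invisible(path):
    """Assumption: a resource is invisible when its file name starts with '.'."""
    return posixpath.basename(path).startswith(".")


files = ["index.html", "docs/file.html", xref_name("docs/file.html"), OVERALL_XREF]
# Without -a, only the visible files would take part in the analysis.
visible = [path for path in files if not is_invisible(path)]
print(visible)
```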

Implementation

The current version of SWAN is a prototype implemented as a UNIX shell script. Its operation proceeds in four phases. The first three phases (front end) generate an inventory of the subweb. The fourth phase (back end) produces the output from the inventory.

In the first phase, SWAN uses find and egrep
- to supply definitions for all resources in the subweb, and
- to select fragment definitions and hyperlinks, together with their line numbers, from all HTML documents in the subweb.
Definitions and hyperlinks are selected by the occurrence of an anchor or image tag. The output of this phase is a text file with lines of the form
filepath:linenumber:text
where
- filepath always starts with `./',
- linenumber is empty for supplied (implicit) definitions, and
- text contains `<A ' or `<IMG '.
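
The line selection of this phase can be mimicked in Python. The sketch below is only an approximation under stated assumptions: SWAN uses find and egrep, the function name select_lines is hypothetical, and case-insensitive matching is assumed. It covers the selected lines only, not the supplied (implicit) definitions.

```python
import re

# Assumption: tags are matched case-insensitively; the space after the
# tag name is required, as described in this document.
TAG_PATTERN = re.compile(r"<(A|IMG) ", re.IGNORECASE)


def select_lines(filepath, html_text):
    """Return records of the form filepath:linenumber:text."""
    records = []
    for number, line in enumerate(html_text.splitlines(), start=1):
        if TAG_PATTERN.search(line):
            records.append(f"{filepath}:{number}:{line}")
    return records


page = '<HTML>\n<A NAME="top">Top</A>\n<P>text</P>\n<IMG SRC="logo.gif">\n<A\nHREF="x.html">split</A>\n'
for record in select_lines("./page.html", page):
    print(record)
```

The anchor that is split over lines 5 and 6 of the sample is not selected, because `<A' is not followed by a space on its own line.
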
In the second phase, SWAN uses sed to transform the output of the first phase. It does so in four steps:
1. Isolate definitions and hyperlinks, splitting lines when necessary. SWAN looks for the following patterns (in some disguise):
  - <A ...NAME=...
  - <A ...HREF=...
  - <IMG ...SRC=...
  N.B. When the tag name and attribute definition are not on the same line, they are invisible to SWAN.
  Lines now are put in the form
  kind#filepath#linenumber#url#fragment
  where kind equals `d' for definitions and `r' for hyperlinks (references). We switched to `#' as field separator instead of `:' because the former is less likely to cause trouble when splitting URLs (see next step).
2. Split (possibly relative) URLs into their components (see RFC 1808). Lines now have the form
  kind#dirpath#filename#linenumber#scheme#netloc#path#params#query#fragment
3. Resolve relative destination URLs, taking the document's source URL as base URL (see RFC 1808). Lines retain their form.
4. Remove `./' and `segment/../' from paths (see RFC 1808). Lines retain their form.
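
Steps 2 to 4 can be sketched in Python, whose urllib.parse.urlparse happens to return the same six components. This is a rough illustration, not a port of SWAN's sed code: the function name to_record is hypothetical, dirpath and filename are assumed to be the split source file path, and the URL is passed with its fragment still attached.

```python
import posixpath
from urllib.parse import urlparse


def to_record(kind, filepath, linenumber, url):
    """Build kind#dirpath#filename#linenumber#scheme#netloc#path#params#query#fragment."""
    dirpath, filename = posixpath.split(filepath)
    # Step 2: split the URL into its components.
    scheme, netloc, path, params, query, fragment = urlparse(url)
    if not scheme and not netloc and not path.startswith("/"):
        # Step 3: resolve relative to the source document
        # (an empty path refers to the source document itself).
        target = posixpath.join(dirpath, path) if path else filepath
        # Step 4: normpath removes './' and 'segment/../'.
        path = posixpath.normpath(target)
    fields = [kind, dirpath, filename, str(linenumber),
              scheme, netloc, path, params, query, fragment]
    return "#".join(fields)


print(to_record("r", "./docs/a.html", 12, "../img/logo.gif"))
print(to_record("r", "./docs/a.html", 7, "#top"))
```
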
In the third phase, SWAN uses sort to make each definition and all hyperlinks to it consecutive lines. Definitions, if present, appear before hyperlinks. The output of this phase is called the inventory of the subweb.
In the fourth phase, SWAN uses awk to report statistics and inconsistencies, and to create the cross references. For the individual cross reference files, it generates a shell script that is fed into sh (working around the limit of 10 output files in awk).
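
The idea behind the last two phases, sorting so that each definition is followed by the hyperlinks to it and then scanning the groups, can be sketched in Python. The record layout (target, kind, source) is a simplification chosen for this example and is not SWAN's inventory format; the function name report is hypothetical.

```python
from itertools import groupby


def report(records):
    """Group sorted records per target and detect inconsistencies.

    Each record is a tuple (target, kind, source) with kind 'd' for a
    definition (source is '') or 'r' for a hyperlink from source.
    """
    result = {"missing": [], "duplicate": [], "unreferenced": [], "xref": {}}
    # Sorting puts 'd' before 'r' within each target, as in the inventory.
    for target, group in groupby(sorted(records), key=lambda rec: rec[0]):
        group = list(group)
        definitions = sum(1 for rec in group if rec[1] == "d")
        sources = [rec[2] for rec in group if rec[1] == "r"]
        if definitions == 0:
            result["missing"].append(target)
        if definitions > 1:
            result["duplicate"].append(target)
        if not sources:
            result["unreferenced"].append(target)
        result["xref"][target] = sources  # the reverse of the hyperlinks
    return result


records = [
    ("b.html", "r", "index.html"),
    ("index.html", "d", ""),
    ("b.html", "d", ""),
    ("c.html", "r", "b.html"),
    ("orphan.html", "d", ""),
    ("b.html", "r", "orphan.html"),
    ("index.html", "r", "b.html"),
]
outcome = report(records)
print(outcome["missing"], outcome["unreferenced"])
```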

Limitations and Future

The correctness of SWAN hinges on the following assumptions about the subweb:

All HTML documents are syntactically correct. Actually, SWAN is pretty tolerant, since it ignores most HTML code anyway. The only tags that play a role are anchor tags (recognized by `<A ', i.e. a closing `>' is not required, but the space after A is), image tags (recognized by `<IMG '), and (as of version 1.3b) body tags (recognized by `<BODY ') and frame tags (recognized by `<FRAME '). Although the HTML 2.0 Standard requires that values of NAME, HREF, and SRC attributes be properly delimited by double quotes ("), SWAN does not require them.
The complete start tag of each definition and hyperlink in an HTML document appears on a single line.
Resources do not contain embedded base URL definitions (<BASE HREF="..."> in HTML). When resolving embedded relative URLs, SWAN always uses the document's retrieval URL as base URL.

Here are some other limitations and possibilities for future development:

Hyperlinks to URLs of the form /path are counted as external to the subweb, but are listed as internal (due to sorting).
SWAN could be extended to handle other hyperlinks, such as occur in
- <BODY BACKGROUND="URL" (done in 1.3b)
- <IMG USEMAP="URL"
- <IMG LOWSRC="URL"
- <AREA HREF="URL"
- <META URL="URL"
- <EMBED SRC="URL"
- <FORM ACTION="URL"
- <FRAME SRC="URL" (done in 1.3b)
SWAN could be extended to `know' more HTML syntax. That way it could flag syntax errors and be (even) more foolproof when isolating definitions and hyperlinks. Currently, you have to use other tools to help guarantee syntactical correctness.
The definition of which resources and hyperlinks in a subweb are visible to SWAN could be more flexible. Actually, visibility to SWAN is only an issue when generating a cross reference, not when checking consistency. Of course, there may be other reasons why you want to make some resources in a subweb invisible to SWAN.
It would be nice for the cross references to have a facility to link directly to an absolute line number in an HTML document. For this to be a useful feature, browsers should more explicitly indicate where a hyperlink leads to. More particularly, when you follow a hyperlink to a document fragment, many browsers do not show clearly what is actually referenced, especially near the end of a document. (Click here and find out where it is intended to get you.)
SWAN could be extended to check availability of external resources.
Sometimes you might wish that SWAN could also provide the reverse of hyperlinks to the subweb originating from outside it. Problem: How to find out about those in a reasonable amount of time?
In the overall cross reference, `coarser' grouping could be done. Now all fragments are grouped per (internal) resource, but each (internal) resource is separated from others. As of version 1.3b, external resources are grouped per net location (`mailto:' and `news:' links are all put together).
In cross references, it might be nice for users if each entry included, besides the resource file name, also the resource title, and besides the fragment name, also the named text (appearing between <A NAME=...> and </A>).
A UNIX-style manual page could be supplied.
Additional options that might be considered:
- Verify that the subweb is strongly connected, that is, each HTML document is reachable from every other HTML document.
- Omit line numbers from cross references. They are intended for webmasters; they make little sense to users, though their number might be a useful indicator.
- Omit error messages from cross references (same reason as above).
- Include only error messages in overall cross reference.
- Omit unreferenced resources from overall cross reference.
- Omit external resources from overall cross reference.
- Omit internal non-HTML resources from overall cross reference.
- Omit internal HTML resources from overall cross reference, optionally include a link to the individual cross reference page.
- Omit statistics from overall cross reference.
- Include statistics in individual cross references.
- Report only inconsistencies involving certain files (either as source or destination of a hyperlink).
- Make cross references for certain files only.
- Use alternate directory for temporary files.
- Allow the specification of alternate file names for the inventory and overall cross reference.
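
The first of these options, verifying strong connectivity, is not part of SWAN. As a sketch of what such a check might look like, the Python fragment below tests whether every document can reach every other one by searching the link graph in both directions from an arbitrary start. The graph representation (a dictionary from document to linked documents) and the function names are assumptions made for this example.

```python
def reachable(graph, start):
    """Return the set of nodes reachable from start by following links."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for successor in graph.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return seen


def is_strongly_connected(graph):
    """graph maps each HTML document to the documents it links to."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    if not nodes:
        return True
    # Build the reversed graph, which is what a cross reference encodes.
    reverse = {node: [] for node in nodes}
    for source, targets in graph.items():
        for target in targets:
            reverse[target].append(source)
    start = next(iter(nodes))
    # Strongly connected: start reaches all nodes, and all nodes reach start.
    return reachable(graph, start) == nodes and reachable(reverse, start) == nodes


print(is_strongly_connected({"a.html": ["b.html"], "b.html": ["a.html"]}))
```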

Related Tools

Below is a brief list of tools related to SWAN. The Weblint Home Page also has ``information which may be of interest''.

EIT's Link Verifier: An optional extension of the Webmaster's Starter Kit made available by Enterprise Integration Technologies.
htmlchek: Developed and maintained by Henry Churchyard. [ FTP ]
html_analyzer: Developed, but no longer maintained, by James Pitkow. [ FTP ]
MOMspider: Developed (and maintained?) by Roy Fielding. [ FTP ]
weblint: Developed and maintained by Neil Bowers. [ FTP ]
MacWebLint: Ported by Jon S. Stevens. [ FTP ] Also requires MacPerl 5.
webxref: Developed and maintained by Rick Jansen.

Acknowledgments

The support of the MAVERIC Research Group and the Department of Computer Science at the University of Waterloo, Canada, enabled me to develop a prototype for SWAN.

Tom Verhoeff / wstomv@win.tue.nl