SWAN: SubWeb Analyzer

[ Last modified: 18 February 1998. Latest version of SWAN: 1.3b, 16 Feb 1998 ]

Introduction

SWAN is a program that analyzes a local subweb of the WWW. Its purpose is to help maintain independent subwebs. Definitions are given below.

SWAN can collect some statistics about the subweb and it can report internal inconsistencies in the subweb.

SWAN can also generate a cross reference for the subweb, consisting of one (often huge) HTML document, and/or an HTML document for each HTML document in the subweb. These cross references constitute, in a sense, the inverse of the subweb; that is, they enable one to follow hyperlinks in the reverse direction.

Here are two applications for cross references:

The example page shows the application of SWAN to the subweb with SWAN material.

SWAN is available as a UNIX shell script; see Implementation for details. The current version is 1.3b dated 16 February 1998 (the source file contains a version history).

Definitions

A subweb is a subset of the resources on the World-Wide Web (WWW). The subweb rooted in a directory (on a particular net location) contains as resources all world-readable files accessible from that directory via a path of world-readable-and-searchable directories.

We sometimes refer to the directory inducing a subweb as the root directory or just the root of the subweb. Whether the root directory itself is world-readable-and-searchable or not plays no role in the definition above. Of course, the subweb is accessible to the outside world only if its root is accessible.

N.B. Directories themselves are not considered resources of the subweb. See the discussion of independence below for a motivation.
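
As a rough illustration of this definition, the resources of a subweb can be enumerated with find. This is a minimal sketch and not SWAN's own code. The function name list_subweb is hypothetical, and the sketch assumes that "world-readable" corresponds to the read permission bit for others and "world-readable-and-searchable" to the read and execute bits for others.

```shell
# list_subweb ROOT: print the world-readable files below ROOT that can be
# reached through world-readable-and-searchable directories only.
# The permissions of ROOT itself are deliberately not tested.
# (Assumption: ROOT contains no shell pattern characters such as * or ?.)
list_subweb() {
    root=${1:-.}
    find "$root" \
        -type d ! -path "$root" ! -perm -005 -prune \
        -o -type f -perm -004 -print
}

# Example: list_subweb "$HOME/public_html"
```

Only files are printed, in line with the remark that directories are not resources of the subweb.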

In the context of a given subweb, a hyperlink is called

Hyperlinks originating outside a subweb are (necessarily) ignored by SWAN.

A subweb is called independent when

To accomplish independence, a subweb must adhere to the following conventions:

The reason for the latter convention is the following. When a directory with URL dirpath/ or dirpath is accessed via http, the file dirpath/index.html is returned, if present, and a directory listing, otherwise. Accessing such a directory by ftp or file always returns a directory listing. When a directory contains no file index.html, a direct access of it results in a directory listing under all access schemes.

Currently, SWAN flags all hyperlinks to directories as inconsistencies (even hyperlinks to directories not containing a file index.html). SWAN does not flag internal hyperlinks that use absolute URLs as inconsistencies. However, these hyperlinks are grouped among the external hyperlinks, where they can easily be found.

Usage

Synopsis

swan [ options ] [ directory ]

Description

SWAN analyzes the subweb rooted in directory. If no directory is supplied, then the current directory is assumed.

Without options, SWAN constructs an inventory of the subweb and reports on standard output all detected inconsistencies and some statistics. The following inconsistencies are reported:

Inconsistencies concerning internal hyperlinks are reported at the destination only. It is for you to determine the actual cause of the problem, which could also be with the source.

N.B. External hyperlinks are not checked for availability. (There are other tools for that.)

Options

-a
Analyze all resources of the subweb, including invisible ones, and all hyperlinks, including those to invisible resources.
-h
Provide help; no processing is done.
-i
Read the subweb inventory from .SWAN in the root directory of the subweb being analyzed.
-o
Write the subweb inventory to .SWAN in the root directory of the subweb being analyzed.
-q
Do not report anything on standard output (quiet mode).
+r / -r
Force world-read permission on/off for cross reference files (by default, your umask is in effect).
-s
Report only statistics on standard output; do not list inconsistencies.
-x
Create an individual cross reference .file.html for each HTML document file.html.
-X
Create an overall cross reference ..SWAN.html.

Files

standard output
The report is written to standard output.
directory/.SWAN
Inventory for subweb rooted in directory.
directory/..SWAN.html
Overall cross reference for subweb rooted in directory.
directory/.file.html
Cross reference for directory/file.html.
/tmp/SWANpid.inv
Temporary file with subweb inventory for -o option.
/tmp/SWANpid.sh
Temporary file with shell script to create individual cross references for -x option.
/tmp/SWANpid.html
Temporary file with overall cross references for -X option.
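
The naming scheme for individual cross references can be written down as a small helper. This is only an illustration of the convention listed above; the function name xref_path is hypothetical and does not occur in SWAN.

```shell
# xref_path FILE: print the name of the individual cross reference that
# belongs to an HTML document: directory/file.html -> directory/.file.html
xref_path() {
    dir=$(dirname "$1")
    base=$(basename "$1")
    printf '%s/.%s\n' "$dir" "$base"
}

xref_path docs/guide.html   # prints docs/.guide.html
xref_path index.html        # prints ./.index.html
```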

Linking a SubWeb to Its Cross Reference

There are some pitfalls when linking a subweb to the cross reference generated by SWAN. Once a cross reference has been generated and SWAN is rerun, there is the danger that the old cross reference gets incorporated in the new cross reference (like a virus). However, two features of SWAN help avoid this problem.

  1. The cross references reside in invisible files. They are not seen by SWAN (unless option -a is used) but they can be available to the outside world.

  2. Internal hyperlinks to invisible files are ignored by SWAN (again, unless option -a is used).

The thing you want to avoid is running SWAN with option -a while cross reference files already exist. (Maybe there should be an option to remove them on the fly.)
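
Until such an option exists, a small wrapper can guard against this mistake. The following is a sketch under stated assumptions, not part of SWAN: the function names has_xref_files and safe_swan_all are hypothetical, and the test treats every invisible file matching .*.html as a cross reference, which also catches invisible HTML files of your own.

```shell
# has_xref_files ROOT: succeed if ROOT contains files that look like
# SWAN cross references (.file.html or ..SWAN.html).
has_xref_files() {
    [ -n "$(find "${1:-.}" -type f -name '.*.html' -print)" ]
}

# safe_swan_all ROOT: run "swan -a" only when no cross references exist.
safe_swan_all() {
    if has_xref_files "$1"; then
        echo "cross reference files found under $1; remove them first" >&2
        return 1
    fi
    swan -a "$1"
}
```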

Implementation

The current version of SWAN is a prototype implemented as a UNIX shell script. Its operation proceeds in four phases. The first three phases (front end) generate an inventory of the subweb. The fourth phase (back end) produces the output from the inventory.

  1. SWAN uses find and egrep to select candidate lines from the files in the subweb. Definitions and hyperlinks are selected by the occurrence of an anchor or image tag. The output of this phase is a text file with lines of the form
    filepath:linenumber:text
    where

  2. SWAN uses sed to transform the output of the preceding phase. It does so in four steps:

    1. Isolate definitions and hyperlinks, splitting lines when necessary. SWAN looks for the following patterns (in some disguise):
      • <A ...NAME=...
      • <A ...HREF=...
      • <IMG ...SRC=...
      N.B. When the tag name and attribute definition are not on the same line, they are invisible to SWAN.

      Lines now are put in the form

      kind#filepath#linenumber#url#fragment
      where kind equals `d' for definitions and `r' for hyperlinks (references). We switched to `#' as field separator instead of `:' because the former is less likely to cause trouble when splitting URLs (see next step).

    2. Split (possibly relative) URLs into their components (see RFC 1808). Lines now have the form
      kind#dirpath#filename#linenumber#scheme#netloc#path#params#query#fragment

    3. Resolve relative destination URLs, taking the document's source URL as base URL (see RFC 1808). Lines retain their form.

    4. Remove `./' and `segment/../' from paths (see RFC 1808). Lines retain their form.

  3. SWAN uses sort to make each definition and all hyperlinks to it consecutive lines. Definitions, if present, appear before hyperlinks. The output of this phase is called the inventory of the subweb.

  4. SWAN uses awk to report statistics and inconsistencies, and to create the cross references. It generates a shell script which is fed into sh to create the individual cross reference files (working around the limit of 10 output files in awk).
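
The first three phases can be sketched in miniature as a single pipeline. This is not SWAN's actual code but a heavily simplified illustration. The function name inventory_sketch is hypothetical. The sketch uses grep -E in place of egrep, recognizes upper-case tags with double-quoted attribute values only, handles one tag per line, skips invisible files, and assumes that all documents sit in the root directory itself, so that relative URLs need no resolving. The choice to put the document's own path in the url field of a definition is also an assumption.

```shell
# inventory_sketch ROOT: print a simplified inventory with lines of the
# form kind#filepath#linenumber#url#fragment.
inventory_sketch() {
    ( cd "${1:-.}" || exit 1
      # Phase 1: select candidate lines as filepath:linenumber:text.
      # /dev/null makes grep print the file name even for a single file.
      find . -type f -name '*.html' ! -name '.*' \
          -exec grep -E -n '<(A|IMG) ' /dev/null {} + |
      # Phase 2 (first step only): isolate definitions (d) and
      # hyperlinks (r); an empty url refers to the document itself.
      sed -n \
          -e 's|^\./||' \
          -e 's/^\([^:]*\):\([0-9]*\):.*<A [^>]*NAME="\([^"]*\)".*/d#\1#\2#\1#\3/' \
          -e 's/^\([^:]*\):\([0-9]*\):.*<A [^>]*HREF="\([^"#]*\)#\([^"]*\)".*/r#\1#\2#\3#\4/' \
          -e 's/^\([^:]*\):\([0-9]*\):.*<A [^>]*HREF="\([^"#]*\)".*/r#\1#\2#\3#/' \
          -e 's/^\([^:]*\):\([0-9]*\):.*<IMG [^>]*SRC="\([^"]*\)".*/r#\1#\2#\3#/' \
          -e 's/^\(r#\([^#]*\)#[0-9]*#\)#/\1\2#/' \
          -e '/^[dr]#/p' |
      # Phase 3: sort by destination url and fragment; within one
      # destination the definition (d) comes before the hyperlinks (r).
      LC_ALL=C sort -t '#' -k 4,4 -k 5,5 -k 1,1
    )
}

# Example: inventory_sketch "$HOME/public_html"
```

In the sorted output, a hyperlink whose destination has no preceding definition line is a candidate inconsistency, which is the kind of check the awk back end can then perform in a single pass.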

Limitations and Future

The correctness of SWAN hinges on the following assumptions about the subweb:

  1. All HTML documents are syntactically correct. Actually, SWAN is pretty tolerant, since it ignores most HTML code anyway. The only tags that play a role are anchor tags (recognized by `<A ', i.e. a closing `>' is not required, but the space after A is), image tags (recognized by `<IMG '), and (as of version 1.3b) body tags (recognized by `<BODY ') and frame tags (recognized by `<FRAME '). Although the HTML 2.0 Standard requires that values of NAME, HREF, and SRC attributes are properly delimited by double quotes ("), SWAN does not require them.

  2. The complete start tag of each definition and hyperlink in an HTML document appears on a single line.

  3. Resources do not contain embedded base URL definitions (<BASE HREF="..."> in HTML). When resolving embedded relative URLs, SWAN always uses the document's retrieval URL as base URL.
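
The third assumption can be made concrete with a sketch of relative path resolution in which the document's own location serves as the base, followed by the removal of `./' and `segment/../'. This is an illustration only, not SWAN's code. The function names normalize_path and resolve_relative are hypothetical, the sketch handles plain relative paths only (no scheme, net location, parameters, query, or fragment), and it assumes a sed that supports the -E option.

```shell
# normalize_path PATH: remove "./" and "segment/../" from a path.
# A segment that is itself ".." is never removed.
normalize_path() {
    printf '%s\n' "$1" | sed -E \
        -e ':dot' \
        -e 's#(^|/)\./#\1#' \
        -e 't dot' \
        -e ':up' \
        -e 's#(^|/)([^/.][^/]*|\.[^/.][^/]*|\.\.[^/]+)/\.\./#\1#' \
        -e 't up'
}

# resolve_relative DOC REL: resolve the relative path REL against the
# directory of the document DOC, which serves as the base.
resolve_relative() {
    case $1 in
        */*) normalize_path "${1%/*}/$2" ;;
        *)   normalize_path "$2" ;;
    esac
}

resolve_relative docs/guide/intro.html ../img/./logo.gif   # prints docs/img/logo.gif
```

A document that declares a different base with <BASE HREF="..."> would be resolved incorrectly by such a scheme, which is why the assumption is needed.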

Here are some other limitations and possibilities for future development:

Related Tools

Below is a brief list of tools related to SWAN. The Weblint Home Page also has ``information which may be of interest''.

EIT's Link Verifier
An optional extension of the Webmaster's Starter Kit made available by Enterprise Integration Technologies.

htmlchek
Developed and maintained by Henry Churchyard. [ FTP ]

html_analyzer
Developed, but no longer maintained, by James Pitkow. [ FTP ]

MOMspider
Developed (and maintained?) by Roy Fielding. [ FTP ]

weblint
Developed and maintained by Neil Bowers. [ FTP ]

MacWebLint
Ported by Jon S. Stevens. [ FTP ] Also requires MacPerl 5.

webxref
Developed and maintained by Rick Jansen.

Acknowledgments

The support of the MAVERIC Research Group and the Department of Computer Science at the University of Waterloo, Canada, enabled me to develop a prototype for SWAN.


Tom Verhoeff / wstomv@win.tue.nl