

There are also structural Web document characteristics that can be described
and included as parts of Web catalogs.
Catalogers frequently document various physical characteristics of the material for which they provide bibliographic control. These characteristics may include number and type of pages, frequency and type of illustrations, dimensions, binding, as well as other features. Some of these characteristics are always captured; others are only documented if their appearance is unusual or non-standard.
Webpages and Websites can be characterized by their "physical"
attributes. Some of these characteristics are analogous to print; others
are unique to hypertext documents. Software is available, usually
marketed for Website diagnostics, that can provide quantitative
measures for bibliographic control. The software used to capture data for
this paper is WebAnalyzer 2.0, a product of InContext (www.incontext.com).
WebAnalyzer and similar software packages can be used to measure Webpage
and Website object mix, size, and hypertext link depth and density.
Webpages consist of a collection of Web objects. Most Webpages have
a text document as their base, with any number of hypertext-attached Web
objects arrayed from the base object in much the same way as one of Alexander
Calder's mobiles. The number and type of Web objects are growing. These
objects include text, graphics, audio, video, mail, Java, ftp, gopher,
and so on. These can be reduced to five main types: text, graphic,
multimedia, file retrieval, and mail.
Web object types are not evenly distributed within Websites. First,
some Web objects are more frequently found closer to or farther from the analyzed
propositus page. The term propositus is borrowed from the genealogical
lexicon and refers to the individual upon whom a genealogy is built both
forward and back. For example, mail, file retrieval, and multimedia
objects are often farther away on a hypertext basis from the propositus
than are text and graphic objects. This finding may contribute to a diplomatics
analysis.
Second, the average (mean) number of each type of Web object varies, as is shown in Table 1. On average, Websites consist primarily of text and graphics objects. The typical Website in late 1996 contained less than one percent multimedia, file retrieval, and mail objects combined.
Table 1. Website Percent Web Object Statistics, December 1996 and January 1997
| Web Object Type | Mean | Median | Range | Std. Dev. |
|---|---|---|---|---|
| Text | 56.1 | 57 | 7-100 | 21.8 |
| Graphic | 33.8 | 31 | 0-92 | 21.5 |
| Multimedia | 0.78 | 0 | 0-67 | 0.42 |
| File Retrieval | 1.7 | 0 | 0-63 | 0.53 |
| Mail | 7.6 | 3 | 0-93 | 12.6 |
The object dominance standard reported here or one like it can be generated using commercially available off-the-shelf software. Individual Website Web object distributions can be automatically calculated and those statistics can be migrated to Web document catalogs. It must be noted that as Websites and Webpages change, so do their Web object mixes. Individual Websites should therefore be reassessed periodically.
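As a rough illustration of how such a distribution might be calculated, the sketch below tallies a list of object kinds and reports the percentage falling into each of the five main types. It is not the method used by WebAnalyzer. The `MAIN_TYPE` mapping, the function name `object_mix`, and the sample data are assumptions made for the example only.

```python
from collections import Counter

# Assumed mapping from raw object kinds to the five main types named in the
# text. How a given tool assigns kinds to types may differ.
MAIN_TYPE = {
    "text": "text",
    "graphic": "graphic",
    "audio": "multimedia",
    "video": "multimedia",
    "ftp": "file retrieval",
    "gopher": "file retrieval",
    "mail": "mail",
}


def object_mix(kinds):
    """Return the percentage of objects that fall into each main type."""
    counts = Counter(MAIN_TYPE[kind] for kind in kinds)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {main_type: 100.0 * n / total for main_type, n in counts.items()}


# Hypothetical ten-object Website: six text, three graphic, one mail object.
sample = ["text"] * 6 + ["graphic"] * 3 + ["mail"]
mix = object_mix(sample)
```

Because object mixes change over time, a catalog record built this way would need to be regenerated at each reassessment.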
Website and Webpage Size. Website and Webpage size can be used to classify Web documents. Website size can be assessed in one of three ways: byte-weight, number of objects, and the number of Webpages in a Website. The number of Webpages on a Website is analogous to the number of text objects on the site. In addition, the number of hypertext links from various Webpages within a Website to other pages on the site (internal links) and the number of links from a Website to other Websites (external links) can be measured.
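The internal and external link counts could be gathered along the following lines. This is a minimal sketch that treats a link as internal when its host name matches that of the page it appears on, which is only an approximation of the SLD-based definition used in this paper. The function name `count_links` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def count_links(page_url, hrefs):
    """Count links that stay on the page's host and links that leave it.

    Assumption: "same host" stands in for "same Website". Links that are
    not http or https (mail, ftp, gopher) are skipped here; they would be
    counted as other object types.
    """
    site_host = urlparse(page_url).netloc.lower()
    internal = external = 0
    for href in hrefs:
        target = urlparse(urljoin(page_url, href))  # resolve relative links
        if target.scheme not in ("http", "https"):
            continue
        if target.netloc.lower() == site_host:
            internal += 1
        else:
            external += 1
    return internal, external


# Hypothetical links found on one page.
links_found = [
    "about.html",
    "/docs/a.html",
    "http://other.example.com/",
    "mailto:someone@example.org",
]
internal, external = count_links("http://www.example.org/index.html", links_found)
```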
The average number of objects and the byte-weight for each object class per Website are shown in Table 2. As the statistics indicate, the size and construct of Websites vary greatly. The number of text objects ranged for the sample reported in Table 4 from one to more than 13,000. The total number of all objects ranged from two to over 14,000. The total byte-weight of Websites ranged from 292 bytes to over 52 megabytes (exclusive of audios and videos, which averaged 12 megabytes each).
Table 2. Website Size Averages, December 1996 and January 1997
| Web Object | Mean | Median | Std Dev |
|---|---|---|---|
| Number of Objects | | | |
| Text | 564.3 | 106 | 1472.4 |
| Graphics | 181.4 | 52 | 314.5 |
| Audio | 4.8 | 1663 | 37.33 |
| Video | 0.5 | 0 | 3.5 |
| FTP | 12.6 | 0 | 89.5 |
| Gopher | 6.3 | 0 | 21.9 |
| Mail | 61.4 | 4 | 240.2 |
| Total | 833.1 | 217 | 1712.3 |
| Byte-Weight | | | |
| Text | 1,360,733 | 222,829 | 3,336,292 |
| Graphics | 2,174,769 | 465,473 | 4,733,840 |
| Total | 3,539,064 | 977,203 | 6,480,651 |
Website size can be classified in any number of ways. Because the byte-weight of individual Web objects varies both among types and from one like object to another, it may be useful to combine byte-weight and object count. To that end, byte-weight and object count values can be normalized and summed, and ordinal values then assigned to the range. In this case (Koehler 1997b), an average value (4) was assigned to all Websites with values within plus or minus 0.5 standard deviations of the mean. For each increment of one standard deviation from average, the assigned value was increased or decreased by one. This resulted in a size range from 1 to 6, or "smallest" to "bigger." None qualified as "biggest." The Website sample distribution by size is shown in Figure 2.
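One way to read the classification procedure is sketched below. Several details are assumptions rather than statements of the original method: normalization is taken to mean z-scores computed with the population standard deviation, the summed score is standardized again before binning, and the scale is capped at 7 on the inference that "biggest" would be the seventh value. The names `zscores` and `size_classes` are illustrative.

```python
import math
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s if s else 0.0 for v in values]


def size_classes(byte_weights, object_counts):
    """Assign each Website an ordinal size class from 1 to 7 (4 = average)."""
    combined = [b + c for b, c in zip(zscores(byte_weights), zscores(object_counts))]
    classes = []
    # Assumption: the summed score is re-standardized so that the bins are
    # expressed in standard deviations of the combined measure.
    for z in zscores(combined):
        # Within 0.5 SD of the mean: class 4. Each further full SD: one step.
        steps = math.ceil(abs(z) - 0.5) if abs(z) > 0.5 else 0
        value = 4 + int(math.copysign(steps, z))
        classes.append(max(1, min(7, value)))
    return classes


# Hypothetical sample of five Websites.
classes = size_classes([10, 20, 30, 40, 50], [10, 20, 30, 40, 50])
```

With so small and evenly spaced a sample every site lands within one step of average; a skewed sample like the one in Table 2 would spread across more of the scale.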
Figure 2
Webpage size is most usefully measured in byte-weight. Like Websites,
Webpages vary greatly in size. Individual Webpages range in size from zero
kilobytes (kb) to more than 2,000 kb, with no theoretical upper limit.
A "zero kilobyte" Webpage occurs when a URL has been defined but little
or no content has been placed on the Webpage. It is in effect a blank
page.
Webpages also tend to increase in size over time. Data have been collected
weekly since January 1997, and over the period ending on May 22, 1998, the
average byte-weight of a non-comatose Webpage increased from 58.66 kb to
111.17 kb, resulting in an annualized "byte-creep" of more than
34 percent. Much of this is growth by accretion: new material is added
to the Webpage, while the older content is edited but not removed.
Over the same period, the number of non-responding Webpages (both comatose and intermittent) increased from zero to 43.8 percent of the sample. Thus, in an aging Webpage collection, the size of the collection, measured by the number of non-comatose Webpages, decreases over time. But for those Webpages extant at any given time, the average individual byte-weight tends to increase. It may therefore be desirable to indicate not only the intermittence rate of extant Webpages but also their individual growth rates. This is particularly true if byte-weight change is an imperfect yet useful surrogate for content change. The Webpage size and attrition rates are shown in Figure 3.
Link Depth and Density. Website structures can be described by their
directory structures, as is discussed above. They may also be analyzed according
to the hypertext links to and from Website pages as well as according to the
density of Webpages by distance from the propositus page. The hypertext
structure is presented as concentric rings. Those Webpages on the first
ring have immediate hypertext ties from the propositus to themselves. Those
on subsequent concentric rings are connected from the propositus through
intermediate pages and are not directly linked from the propositus. It
is both possible and likely that Webpages are connected through more than
one route. Ring depth is determined as the most direct route. Density is
a measure of the average number of Web objects or byte-weight on each ring
on the Website or from the propositus. Website densities are measured here
from the Website homepage or index page. Home- or index pages are those
which resolve from the SLD only or from an identified point of discontinuity.
Ring depth and density include all Webpages at a Website that are located
on the same SLD. Webpages that are not located on the SLD are included as
part of a Website when and only when they are linked directly to one of
the pages on the SLD. Thus, for this analysis, links from a non-SLD Webpage
are not included within the hypertext structure of a Website unless they
are directly linked from another qualifying page. Such a page might be
linked to the propositus on the third level because it is directly linked
to an on-SLD page linked to the propositus at the second level.
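Taking ring depth as the most direct route amounts to a shortest-path search from the propositus, which a breadth-first traversal provides. The sketch below is one possible reading of the rules described here, not the procedure used by any particular tool. The link map, the `on_site` test, the byte-weights, and the function names are all hypothetical. Off-SLD pages are recorded when an on-SLD page links to them, but their own links are not followed.

```python
from collections import defaultdict, deque


def ring_depths(links, propositus, on_site):
    """Return each reachable page's ring: its shortest link distance."""
    depth = {propositus: 0}
    queue = deque([propositus])
    while queue:
        page = queue.popleft()
        if not on_site(page):
            continue  # an off-SLD page is kept, but its links are not followed
        for target in links.get(page, []):
            if target not in depth:  # first visit is the most direct route
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


def ring_density(depth, byte_weight):
    """Return (page count, total byte-weight) for each ring."""
    rings = defaultdict(lambda: [0, 0])
    for page, d in depth.items():
        rings[d][0] += 1
        rings[d][1] += byte_weight.get(page, 0)
    return {d: tuple(totals) for d, totals in sorted(rings.items())}


# Hypothetical Website: "ext" and "far" are off-SLD pages.
links = {
    "home": ["a", "b"],
    "a": ["c", "home"],
    "b": ["c", "ext"],
    "c": ["d"],
    "ext": ["far"],
}
weights = {"home": 1000, "a": 500, "b": 700, "c": 300, "ext": 50, "d": 20}
depth = ring_depths(links, "home", on_site=lambda p: p not in {"ext", "far"})
density = ring_density(depth, weights)
```

In this example "c" is reachable through both "a" and "b" but is placed once, on the second ring, and "far" is excluded because it is linked only from an off-SLD page.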
The number of hypertext Website levels varies. The minimum number of
levels, including the propositus, encountered in the sample is one, the
maximum fifty-nine. There is no limit to the number of levels possible;
the most the author has measured is 179. This "lord of the rings" is an
English university Website. Most Websites do not exceed eight levels. Website
level statistics for two collection periods are shown in Table 3.
Table 3. Website Levels Statistics, Two Periods
| | December and January 1996/97 | July and August 1997 |
|---|---|---|
| Mean | 4.82 | 5.22 |
| Median | 4 | 5 |
| Range: Min.-Max. | 0-59 | 0-50 |
| Standard Deviation | 4.88 | 4.67 |
| Mean Objects Count Per Level | 159.4 | 231.2 |
| Mean Level Density in Megabytes | 626.3 | 854.8 |
The number of ring levels and densities provide structural data. Structure
implies the organization of information. That, in turn, may provide insights
into the importance of one "piece of information" over another, the emphasis
placed by the Web author on priority and order, as well as suggest information
groupings or clusters. Further research is necessary to establish
whether different ring and density configurations indicate different information
qualities, quantities, authority, or other pertinent considerations. At
minimum, individual Website ring counts and densities can be measured and
reported. Further work can establish whether there are different configuration
types of significance to the library community.