Web Document Characteristics

Web documents have characteristics that differ significantly from those of their more static, traditional counterparts. One of these is their transitory nature: Web documents are both ephemeral and changing. The URx and Permanence pages describe efforts to address the ephemeral nature of the Web. The Longevity page addresses potential responses to both the ephemeral nature and the changing content of the Web.

There are also structural Web document characteristics that can be described and included as parts of Web catalogs.
 

Quantitative Characteristics

(The following discussion is taken from my paper: "Classifying Websites and Webpages: The Use of Metrics and URL Characteristics as Markers," Journal of Librarianship and Information Studies 31, 1 (March 1999), pp 21-31.)

Catalogers frequently document various physical characteristics of the material for which they provide bibliographic control. These characteristics may include number and type of pages, frequency and type of illustrations, dimensions, binding, as well as other features. Some of these characteristics are always captured; others are only documented if their appearance is unusual or non-standard.

Webpages and Websites can be characterized by their "physical" attributes. Some of these characteristics are analogous to print; others are unique to hypertext documents. Software is available, usually marketed for Website diagnostics, that can be used to provide quantitative measures for bibliographic control. The software used to capture data for this paper is WebAnalyzer 2.0, a product of InContext (www.incontext.com). WebAnalyzer and similar software packages can be used to measure Webpage and Website object mix, size, and hypertext link depth and density.
 

Object Mix

Websites consist of a collection of Webpages with related meaning or themes, as specified by the Website author. The pages are located on a server level domain (SLD), but a Website may also incorporate Webpages, often authored by others, on other SLDs. Most SLDs host a single Website (more than 80 percent in the sample). However, a single SLD may host several Websites. These are often distinguished by discontinuity markers, which include directory structure naming and the use of tildes.
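
As a rough illustration of how a tilde can serve as a discontinuity marker, the sketch below groups URLs into Websites by host name plus a leading "~name" path segment. The function name website_key and the example URLs are hypothetical, not part of the study, and directory-name markers are not handled because they depend on local naming conventions.

```python
from urllib.parse import urlsplit

def website_key(url):
    """Return a hypothetical Website identifier for a URL.

    Assumption: a first path segment beginning with '~' marks a separate
    Website hosted on the same SLD; otherwise the host alone is the key.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Collect the non-empty path segments, e.g. ['~smith', 'index.html'].
    segments = [s for s in parts.path.split("/") if s]
    if segments and segments[0].startswith("~"):
        return host + "/" + segments[0]
    return host

# Two pages on one SLD that would be treated as different Websites.
print(website_key("http://www.example.edu/~smith/index.html"))
print(website_key("http://www.example.edu/library/hours.html"))
```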

Webpages consist of a collection of Web objects. Most Webpages have a text document as their base and any number of hypertext-attached Web objects arrayed from the base object, much in the same way as one of Alexander Calder's mobiles. The number and type of Web objects are growing. These objects include text, graphics, audio, video, mail, Java, ftp, gopher, and so on. These can be reduced to five main types: text, graphic, multimedia, file retrieval, and mail.

Web object types are not evenly distributed within Websites. First, some Web objects are more frequently found closer to or further from the analyzed propositus page. The term propositus is borrowed from the genealogical lexicon and refers to the individual upon whom a genealogy is built both forward and back. For example, mail, file retrieval, and multimedia objects are often further away on a hypertext basis from the propositus than are text and graphic objects. This finding may contribute to a diplomatics analysis.

Second, the average (mean) number of each type of Web object varies, as is shown in Table 1. On average, Websites consist primarily of text and graphics objects. The typical Website in late 1996 contained less than one percent multimedia, file retrieval, and mail objects combined.

Table 1. Website Percent Web Object Statistics, December/January 1996-97

Web Object Type    Mean    Median    Range    Std. Dev.
Text               56.1    57        7-100    21.8
Graphic            33.8    31        0-92     215
Multimedia         0.78    0         0-67     0.42
File Retrieval     1.7     0         0-63     0.53
Mail               7.6     3         0-93     12.6

 
Table 1 provides a standard by which individual Websites can be classified. Variation from the "average" number of each type of Web object can be measured and cataloged. Each of the five categories can be divided into ordinal classes: for example, from low to high text, graphic, multimedia, file retrieval, or email content. Each class can be based on individual Web object means and standard deviations. I have suggested one approach (Koehler 1997b) that results in classifying Websites according to their dominant Web objects and variation from the "average" model. Based on a set of ordinal categories for the percent of Web objects, Websites can be reduced to six general types: Average (no dominant Web object), Wordsworth (text dominant), Coffee-Table (graphics dominant), Mogul (multimedia dominant), Retriever (ftp/gopher dominant), and Post Office (email dominant). The names follow WWW naming practices; they are relevant but also slightly irreverent. In December 1996 and January 1997, the distribution of object-dominant archetypes derived from the sample was "Average" 41.0 percent, "Wordsworth" 21.2 percent, "Retriever" 17.4 percent, "Coffee-Table" 13.1 percent, "Mogul" 6.1 percent, and "Post Office" 1.2 percent. These are shown in Figure 1.
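
The general idea of naming a Website by its dominant Web object can be sketched as follows. This is a minimal illustration and not the published procedure: the rule that an object type is "dominant" when its percentage lies more than one standard deviation above the sample mean is an assumption, as are the function name classify_archetype and the reference statistics, which are rounded placeholders rather than the sample values.

```python
# Archetype names come from the text; the mapping keys are hypothetical labels.
ARCHETYPES = {
    "text": "Wordsworth",
    "graphic": "Coffee-Table",
    "multimedia": "Mogul",
    "file_retrieval": "Retriever",
    "mail": "Post Office",
}

# Illustrative reference statistics (placeholders, NOT the sample values).
EXAMPLE_MEANS = {"text": 55, "graphic": 35, "multimedia": 1, "file_retrieval": 2, "mail": 7}
EXAMPLE_STD_DEVS = {"text": 20, "graphic": 20, "multimedia": 5, "file_retrieval": 5, "mail": 10}

def classify_archetype(percents, means, std_devs, threshold=1.0):
    """Name a Website's archetype from its Web object percentages.

    Assumed rule: an object type is dominant when its percentage lies more
    than `threshold` standard deviations above the mean; the type with the
    largest such deviation wins. With no dominant type, return "Average".
    """
    best_type, best_z = None, threshold
    for obj_type, pct in percents.items():
        z = (pct - means[obj_type]) / std_devs[obj_type]
        if z > best_z:
            best_type, best_z = obj_type, z
    return ARCHETYPES[best_type] if best_type else "Average"

text_heavy = {"text": 90, "graphic": 8, "multimedia": 0, "file_retrieval": 0, "mail": 2}
print(classify_archetype(text_heavy, EXAMPLE_MEANS, EXAMPLE_STD_DEVS))
```

A catalog could store both the archetype name and the underlying percentages, so that a different threshold can be applied later without re-measuring the site.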

Figure 1

The object dominance standard reported here or one like it can be generated using commercially available off-the-shelf software. Individual Website Web object distributions can be automatically calculated and those statistics can be migrated to Web document catalogs. It must be noted that as Websites and Webpages change, so do their Web object mixes. Individual Websites should therefore be reassessed periodically.

Website and Webpage Size

Website and Webpage size can be used to classify Web documents. Website size can be assessed in any of three ways: by byte-weight, by number of objects, or by the number of Webpages in a Website. The number of Webpages on a Website is analogous to the number of text objects on the site. In addition, the number of hypertext links from various Webpages within a Website to other pages on the site (internal links) and the number of links from a Website to other Websites (external links) can be measured.

The average number of objects and the byte-weight for each object class per Website are shown in Table 2. As the statistics indicate, the size and construct of Websites vary greatly. The number of text objects in the sample reported in Table 2 ranged from one to more than 13,000. The total number of all objects ranged from two to over 14,000. The total byte-weight of Websites ranged from 292 bytes to over 52 megabytes (exclusive of audios and videos, which averaged 12 megabytes each).

Table 2. Website Size Averages, December 1996 and January 1997

Web Object           Mean         Median     Std. Dev.
Number of Objects
  Text               564.3        106        1472.4
  Graphics           181.4        52         314.5
  Audio              4.8          1663       37.33
  Video              0.5                     3.5
  FTP                12.6                    89.5
  Gopher             6.3                     21.9
  Mail               61.4         4          240.2
  Total              833.1        217        1712.3
Byte-Weight
  Text               1,360,733    222,829    3,336,292
  Graphics           2,174,769    465,473    4,733,840
  Total              3,539,064    977,203    6,480,651

Website size can be classified in any number of ways. Because the byte-weight of individual Web objects varies both among types and from one like object to another, it may be useful to combine both byte-weight and object count. To achieve that end, byte-weight and object count values can be normalized and summed, and ordinal values can then be assigned to the range. In this case (Koehler 1997b), an average value (4) was assigned to all Websites with values plus or minus 0.5 standard deviations from the mean. For each increment of one standard deviation from average, the assigned value was increased or decreased by one. This resulted in a size range from 1 to 6, or "smallest" to "bigger." None qualified as "biggest." The Website sample distribution by size is shown in Figure 2.
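
The size classification described above can be sketched in a few lines. This is one reading of the procedure, not the original computation: z-scores are used as the normalization, the population standard deviation is used throughout, and the scale is assumed to be capped at 1 and 7 (with 7 standing for "biggest"). The function names and the sample figures are hypothetical.

```python
import math
from statistics import mean, pstdev

def z_scores(values):
    """Normalize values to mean 0 and standard deviation 1 (assumes they are not all equal)."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]

def size_classes(byte_weights, object_counts):
    """Assign each Website an ordinal size class, with 4 as the average.

    Byte-weight and object count are normalized and summed; the sum is
    normalized again so that distances are in standard deviations. Sites
    within 0.5 standard deviations of the mean get 4, and each further
    full standard deviation moves the class up or down by one.
    """
    combined = [b + o for b, o in zip(z_scores(byte_weights), z_scores(object_counts))]
    classes = []
    for d in z_scores(combined):
        steps = max(0, math.ceil(abs(d) - 0.5))
        value = 4 + steps if d > 0 else 4 - steps
        classes.append(min(7, max(1, value)))  # assumed 1..7 scale
    return classes

# Hypothetical five-site sample: four small sites and one very large one.
print(size_classes([100, 200, 300, 400, 5000], [10, 20, 30, 40, 500]))
```

Because the classes depend on the sample mean and standard deviation, a site's class can shift when the sample changes, which is one more reason to reassess periodically.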

Figure 2 Web Size Distribution

Webpage size is most usefully measured in byte-weight. Like Websites, Webpages vary greatly in size. Individual Webpages range in size from zero kilobytes (kb) to more than 2,000 kb, with no theoretical upward limit. A "zero kilobyte" Webpage occurs when a URL has been defined but little or no content has been placed on the Webpage. It is in effect a blank page.

Webpages also tend to increase in size over time. Data have been collected weekly since January 1997, and over the period ending on May 22, 1998, the average byte-weight of a non-comatose Webpage increased from 58.66 kb to 111.17 kb, resulting in an annualized "byte-creep" of more than 34 percent. Much of this is growth by accretion: new material is added to the Webpage, while the older content is edited but not removed.

Over the same period, the number of non-responding Webpages (both comatose and intermittent) increased from zero to 43.8 percent of the sample. Thus, in an aging Webpage collection, the size of the collection, as measured by the number of non-comatose Webpages, decreases over time. But for those Webpages extant at any given time, the average individual byte-weight tends to increase. It may therefore be desirable to indicate not only the intermittence rate of extant Webpages but also their individual growth rates. This is particularly true if byte-weight change is an imperfect yet useful surrogate for content change. The Webpage size and attrition rates are shown in Figure 3.

Figure 3 Attrition and Change

Link Depth and Density

Website structures can be described by their directory structures, as discussed above. They may also be analyzed according to the hypertext links to and from Website pages, as well as according to the density of Webpages by distance from the propositus page. The hypertext structure is presented as concentric rings. Webpages on the first ring are linked directly from the propositus. Those on subsequent rings are connected from the propositus through intermediate pages and are not directly linked from it. It is both possible and likely that Webpages are connected through more than one route; ring depth is determined by the most direct route. Density is a measure of the average number of Web objects or byte-weight on each ring of the Website, counted from the propositus.

Website densities are measured here from the Website homepage or index page. Home or index pages are those that resolve from the SLD only or from an identified point of discontinuity. Ring depth and density include all Webpages at a Website that are located on the same SLD. Webpages that are not located on the SLD are included as part of a Website when and only when they are linked directly to one of the pages on the SLD. Thus, for this analysis, links from a non-SLD Webpage are not included within the hypertext structure of a Website unless they are directly linked from another qualifying page. Such a page might be linked to the propositus on the third level because it is directly linked to an on-SLD page linked to the propositus at the second level.

The number of hypertext Website levels varies. The minimum number of levels, including the propositus, encountered in the sample is one; the maximum is fifty-nine. There is no limit to the number of levels possible; the most the author has measured is 179. This "lord of the rings" is an English university Website. Most Websites do not exceed eight levels. Website level statistics for two collection periods are shown in Table 3.
Table 3. Website Levels Statistics, Two Periods

                                   December and January 1996/97    July and August 1997
Mean                               4.82                            5.22
Median                             4                               5
Range: Min.-Max.                   0-59                            0-50
Standard Deviation                 4.88                            4.67
Mean Objects Count Per Level       159.4                           231.2
Mean Level Density in Megabytes    626.3                           854.8
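
Ring depth as "the most direct route" from the propositus corresponds to a shortest-path count, which a breadth-first search computes directly. The sketch below is an illustration under stated assumptions rather than the procedure WebAnalyzer uses: the link map, page names, and byte-weights are hypothetical, the propositus is numbered as ring 0, and off-SLD pages are included only when linked from an on-SLD page, with their own links not followed.

```python
from collections import deque

def ring_depths(links, propositus, on_sld):
    """Return {page: ring depth}, the fewest links needed to reach each page.

    links  -- dict mapping a page to the list of pages it links to
    on_sld -- set of pages located on the Website's SLD; links are only
              followed from these pages (assumed reading of the inclusion rule)
    """
    depths = {propositus: 0}
    queue = deque([propositus])
    while queue:
        page = queue.popleft()
        if page not in on_sld:
            continue  # off-SLD pages are counted but not expanded
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest route
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

def ring_density(depths, byte_weights):
    """Return {ring: (page count, total byte-weight)} for each ring."""
    rings = {}
    for page, depth in depths.items():
        count, weight = rings.get(depth, (0, 0))
        rings[depth] = (count + 1, weight + byte_weights.get(page, 0))
    return rings

# Hypothetical site: "ext" is off-SLD, so "far" (linked only from it) is excluded.
links = {"home": ["a", "b"], "a": ["c", "ext"], "b": ["c"], "c": ["home"], "ext": ["far"]}
depths = ring_depths(links, "home", on_sld={"home", "a", "b", "c"})
print(depths)
print(ring_density(depths, {"home": 10, "a": 20, "b": 30, "c": 5, "ext": 7}))
```

The number of levels for a site is then the largest depth found (plus one if the propositus is counted as a level), and per-ring averages follow from dividing each ring's totals by its page count.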

The number of ring levels and their densities provide structural data. Structure implies the organization of information. That, in turn, may provide insights into the importance of one "piece of information" over another and the emphasis placed by the Web author on priority and order, and may suggest information groupings or clusters. Further research is necessary to establish whether different ring and density configurations indicate different information qualities, quantities, authority, or other pertinent considerations. At minimum, individual Website ring counts and densities can be measured and reported. Further work can establish whether there are different configuration types of significance to the library community.