Copyright 2000 Wallace Koehler All Rights Reserved
Web documents undergo significant change over time. This change
takes two forms. The first is Web document morbidity: pages disappear.
The second is change to the pages that remain.
For the past three years, I have been following the behavior of 360 randomly
selected URLs. The sample was derived using a random URL generator from
WebCrawler (which unfortunately is no longer available). The sample was
stratified so that the distribution of top-level domains (TLDs) matched
that reported in December 1996. The sample has not been increased since
its original collection. For a discussion of the methodology, see my paper
in JASIS (Koehler 1999a). What follows
is a discussion based on my research.
Figure 1: Web Page Morbidity Over 3 Years
Figure 1 is a plot of the survival and demise of the sample over three years. The blue line represents the percent of Web pages that fail to resolve after three attempts on a weekly basis. The red line is a plot of average values for six weekly collections including the one reported. The average plot tends to smooth the curve. The plot shows three trends. The first was one of rapid decline in the number of "functional" Web pages.
The second shows a major return of Web pages. There are two explanations for this. The first is that a number of Web sites did in fact return, one after being dormant for more than 18 months. The second is a change in software and the reporting of 404 errors. The 404 error indicates a Web page not found on an extant server. The way in which 404 errors are reported has changed from a simple one-line message (which my software interpreted correctly) to sometimes page-long declamations. The page-long statements were interpreted as "present."
The third trend, and one following the change in 404 reporting, has been a slower rate of decline over the last year. For the present, the number of Web pages present now hovers at about 50% of the original Web sample. One conclusion one might reach is that after a protracted period of time (perhaps as indicated here, two years) a degree of stability is achieved. That is to say, that if a Web page survives two years there is a fair chance it will continue to do so. Further tracking is necessary to establish the trend.
Because Web pages have a surprising proclivity to return, I prefer to refer to these as "comatose" rather than "dead" pages or links. For the sake of nomenclature, I have taken to calling the returned pages that are the result of elaborate 404 messages "ghosts."
It should also be noted that at any given time, about 5% of those listed as missing will return within six weeks of the first missing report. These are termed "intermittent pages." Fewer than 5% of the sample have never gone "intermittent"; that is to say, fewer than 5% have always answered the call.
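The "ghost" problem can be illustrated with a small sketch. It is not the software used in the study. It assumes the collector has both the HTTP status code and the page text, and it uses a hypothetical list of phrases to catch not-found messages that arrive looking like ordinary pages.

```python
import re

# Hypothetical phrases that often appear in elaborate "not found" pages.
# The list is an assumption for illustration, not taken from the study.
NOT_FOUND_PATTERNS = [
    r"\b404\b",
    r"not\s+found",
    r"no\s+longer\s+(available|exists)",
    r"page\s+(cannot|could\s+not)\s+be\s+found",
]


def classify_response(status_code: int, body: str) -> str:
    """Label one response as 'missing', 'ghost', or 'present'."""
    if status_code == 404:
        # Trust the status code, however long the accompanying message is.
        return "missing"
    text = body.lower()
    if any(re.search(pattern, text) for pattern in NOT_FOUND_PATTERNS):
        # The server answered, but the text reads like a not-found notice.
        return "ghost"
    return "present"
```

A phrase list like this will misfire on legitimate pages that happen to mention "404", so it is a heuristic at best and any flagged page deserves a manual look.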
Figure 2: Web Document Change over 3 Years
Figure 2 is a plot of the changes that the Web page sample has undergone in the past three years. These data reflect only those Web pages that are "present" at the time of collection. Dormant or comatose pages are not included in calculations.
Three types of change are shown. The first, shown as a black line, reflects changes in page content as measured by change in page byte weight. That is, if the number of bytes for a page does not change, it is assumed that the content of the page does not change. That is a fairly safe assumption, but not a perfect one. For example, if I were to change the syllabus from "Turn in papers on July 31" to "Turn in papers om July 21" the byte weight would not change (2 and 3 have the same weight) but the message would. Note too that I included a purposeful typo just before "July 21." The typo would change neither the message nor the byte weight.
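The syllabus example can be checked directly. The sketch below compares byte weight with a content digest; the digest is my addition for contrast and was not one of the measures used in the study.

```python
import hashlib


def byte_weight(page: str) -> int:
    """Number of bytes in the page when encoded as UTF-8."""
    return len(page.encode("utf-8"))


def digest(page: str) -> str:
    """SHA-256 digest of the page; any edit at all changes it."""
    return hashlib.sha256(page.encode("utf-8")).hexdigest()


old = "Turn in papers on July 31"
new = "Turn in papers om July 21"

# Byte weight misses the edit; the digest catches it.
same_weight = byte_weight(old) == byte_weight(new)
same_digest = digest(old) == digest(new)
print(same_weight, same_digest)
```

A digest detects that something changed but, like byte weight, says nothing about whether the meaning changed.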
The second measure is structural. The green and red lines indicate new links from the page and changes to the links on the page respectively on a weekly basis. It is possible to change the link structure without changing byte weight. Typically about 20% of the pages undergo content and link change and 10% experience new links on a weekly basis. There are occasional spikes when major change behavior occurs. Some of these appear to be a result of software changes; others I cannot yet explain.
Web sites undergo significant change over time. Changes in the number of Web site total objects for the sampling period are shown in Table 1.
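A minimal sketch of the structural comparison follows. It assumes that "new links" means link targets that were absent the week before and that a dropped or altered target counts as a change to existing links; that reading, and the sample markup, are assumptions for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every anchor tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return set(collector.links)


def link_changes(previous_html, current_html):
    """Compare two weekly snapshots of the same page."""
    before = extract_links(previous_html)
    after = extract_links(current_html)
    return {"new": after - before, "dropped": before - after}


# Hypothetical snapshots: one link target is swapped for another.
week1 = '<a href="a.html">A</a> <a href="b.html">B</a>'
week2 = '<a href="a.html">A</a> <a href="c.html">C</a>'
changes = link_changes(week1, week2)
```

The two snapshots have identical byte weight, so a byte-weight check alone would miss this change.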
Table 1. Web site Relative Total Object Changes in Orders of Magnitude in Percent
| Order of Magnitude | Percent Dec/Jan 96/97 to June/July 97 | Percent June/July 97 to Dec/Jan 97/98 |
|---|---|---|
| Size Decrease > -2 Orders of Magnitude | 3.1 | 0.6 |
| Size Decrease > -1 < -2 | 3.7 | 6.2 |
| Size Decrease > 0 < -1 | 21.0 | 7.4 |
| No Major Change | 7.5 | 8.0 |
| Size Increase > 0 < 1 | 50.2 | 68.5 |
| Size Increase > 1 < 2 | 11.2 | 6.8 |
| Size Increase > 2 Orders of Magnitude | 3.4 | 2.5 |
Table 1 illustrates two general Web site change trends. First, Web sites are not static. They undergo dramatic size changes. Typically, less than seven percent of Web sites will "implode," that is, decrease in size by more than one order of magnitude per period, while less than fifteen percent will "explode," or increase in size by more than one order of magnitude. Second, while some Web sites either implode or explode, the general trend is toward size increases: well more than half of sampled Web sites increased in size over the sample year.
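The order-of-magnitude bins can be sketched as a base-10 logarithm of the ratio of object counts. The cutoff for "no major change" is not stated above, so the `tolerance` parameter here is hypothetical; with its default of zero, only identical counts fall in that bin. The bin labels are my own wording.

```python
import math


def size_change_bin(old_count: int, new_count: int, tolerance: float = 0.0) -> str:
    """Place a Web site in a size-change bin by orders of magnitude.

    `tolerance` (in log10 units) is an assumed cutoff for 'no major change'.
    """
    if old_count <= 0 or new_count <= 0:
        raise ValueError("object counts must be positive")
    shift = math.log10(new_count / old_count)
    if abs(shift) <= tolerance:
        return "No Major Change"
    direction = "Increase" if shift > 0 else "Decrease"
    magnitude = abs(shift)
    if magnitude > 2:
        return f"Size {direction} > 2 Orders of Magnitude"
    if magnitude > 1:
        return f"Size {direction} between 1 and 2 Orders of Magnitude"
    return f"Size {direction} < 1 Order of Magnitude"
```

For example, a site that grows from 10 objects to 5,000 has "exploded" by more than two orders of magnitude, while one that shrinks from 1,000 to 50 has "imploded" by between one and two.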
Web site catalogs are necessarily challenged by these size swings. As Web sites increase or decrease in size, sometimes as much as by more than two orders of magnitude, the depth and breadth of information contained in those Web sites are modified, amended, increased, or decreased. A dynamic index or catalog must reflect those changes.
Web pages also undergo change. Two quantitative change measures can be identified. These are changes in byte-weight (content change) and changes to the hypertext link structure of the page (structural changes). Byte-weight change is an imperfect surrogate for semantic or meaning change. Structural changes are modifications, additions, or deletions to the hypertext reference structure of the page. As such, structural changes probably represent a subtler source of semantic or meaning change than byte-weight, yet an equally important one.
Figure 2 charts rates of change for Web pages. With a number of interesting and dramatic exceptions, Web pages undergo content and structural change fairly consistently. At any given time, about twenty percent of the sample experiences content and new structural change while about ten percent undergoes modifications to existing hypertext link structures.
These content and structural changes have been labeled as "omega". Omega can be divided into total omega and three iso-types: content omega, new structural omega, and existing structural omega. Each omega value is an average of the periodic binary measure of change. The value "1" is assigned whenever dimension change of any magnitude occurs, "0" when none occurs. For example, if a Web page were to change one week but not the next, it would have an omega of 0.5 over the period. Omega-t is not additive of the three iso-types. It reflects any change in any of the three measures. Thus, a Web page experiencing Omega-c, Omega-n, and Omega-e changes would carry the same Omega-t as one with only Omega-c type change or other combinations.
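The omega calculation can be sketched as follows, assuming each page has one 0/1 flag per collection week for each of the three measures. The function names and the four-week record are illustrative only.

```python
def omega(flags):
    """Average of the periodic binary change flags (1 = changed, 0 = not)."""
    if not flags:
        raise ValueError("at least one observation is required")
    return sum(flags) / len(flags)


def omega_profile(content, new_links, existing_links):
    """Return total omega and the three iso-types for one page."""
    # Omega-t counts a week once if any of the three measures changed.
    total = [int(c or n or e) for c, n, e in zip(content, new_links, existing_links)]
    return {
        "omega_t": omega(total),
        "omega_c": omega(content),
        "omega_n": omega(new_links),
        "omega_e": omega(existing_links),
    }


# Hypothetical four-week record for a single page.
profile = omega_profile(
    content=[1, 0, 1, 0],
    new_links=[0, 0, 1, 0],
    existing_links=[0, 1, 0, 0],
)
```

Here the iso-types are 0.5, 0.25, and 0.25, but Omega-t is 0.75 rather than their sum, because the third week shows two kinds of change and is still counted once.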
Web pages undergo content change more frequently than structural change. However, structural change may occur in the absence of content change. Mean omega values for the Web page sample at the end of the sample year were found to be Omega-t 0.298, Omega-c 0.239, Omega-n 0.074, and Omega-e 0.150 (Koehler 1999a). Similar calculations could also be derived for Web sites.
Omega values can be employed in cataloging in one of two ways. Individual
Web page omega values can be calculated and included as part of the bibliographic
record. These values can provide the user with an indicator of the rate
and type of change the Web page undergoes. Omega values relative to the
Web page population, the content of a digital Web library, or any other
Web document pool can also be calculated. Thus, it may be useful to identify
individual Web pages according to the degree to which they vary from some
standard.
Persistence captures existence -- is the page or site "there"
or is it not? Persistence is not concerned with changes to Web pages or
Web sites other than presence or absence. A Web page or Web site may undergo
significant "alteration," including the complete replacement of all Web
objects on the document, without affecting its persistence status. Those
same Web pages and Web sites may also exhibit intermittence behavior without
change to the document once it returns.
Persistence may take one of three forms: permanence, "comatoseness"
and intermittence. A "permanent" Web site or Web page is one that always
resolves over some period. A comatose Web page or Web site is defined as
one that fails to respond or resolve after six consecutive weekly queries,
including the most recent query. I prefer the term "comatose" to "dead"
or "defunct" because in theory and in practice a Web document once it "disappears"
can "reappear" at the same URL if and whenever the Webmaster or Web author
so chooses.
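The three forms of persistence can be sketched as a classification over a page's weekly query history. The sketch assumes one True/False result per weekly query, oldest first, and treats a page that is currently missing for fewer than six weeks as intermittent; that last choice is my reading of the definitions, not a stated rule.

```python
COMATOSE_WEEKS = 6  # six consecutive failed weekly queries, including the latest


def classify_persistence(history):
    """Classify a page from its weekly results (True = resolved)."""
    if not history:
        raise ValueError("at least one weekly query is required")
    if all(history):
        return "permanent"
    recent = history[-COMATOSE_WEEKS:]
    if len(recent) == COMATOSE_WEEKS and not any(recent):
        return "comatose"
    # Failed at some point, but not for the six most recent queries.
    return "intermittent"
```

A comatose label is never final under this scheme: a single successful query in a later week moves the page back to intermittent.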
Web pages and Web sites appear to have half-lives of less than two and three years respectively. The comatose rate for Web pages appears to undergo phase shifts. As is shown in Figure 1, over the first eighty weeks of data collection, the comatoseness rate was virtually linear. The solid line shown on Figure 1 is the percent of comatose Web pages of the entire sample for the week sampled. The dashed line smooths that data by providing the average percent for the collection week and the preceding five weeks. A "correction" ensued, followed by a slower growth of comatoseness.
Web sites also disappear, although at a slightly reduced rate. By the end of the same sampling year, approximately 25 percent of that sample had "gone" (Koehler 1999a). Not all that fail to respond or resolve at any given time are comatose. An intermittent Web page or Web site is one that has failed to respond at some point but that has returned. Web pages (and Web sites) can be classified according to their measured intermittence behavior over a specified time period. These may be classified and cataloged as never, non-repeater, and by frequency.
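The smoothing described for Figure 1 is a trailing six-week average. A minimal sketch follows; it assumes the earliest weeks are averaged over however many weeks are available, a detail the description above does not settle, and the weekly figures in the example are made up.

```python
def trailing_average(weekly_percent, window=6):
    """Average each week with the preceding (window - 1) weeks."""
    smoothed = []
    for i in range(len(weekly_percent)):
        start = max(0, i - window + 1)
        chunk = weekly_percent[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical weekly percentages of missing pages.
raw = [0, 6, 12, 18, 24, 30, 36]
smooth = trailing_average(raw)
```

Because each value looks backward only, the smoothed line lags the raw line when the rate is rising or falling steadily.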
The majority of Web pages that persist do so without intermittence.
However, a substantial minority are intermittent at least once and sometimes
more often. At the end of the first data collection year 27.7 percent of
the original 361 Web page sample was comatose. At any given time, between
2.5 and 10 percent of the sample is considered to be gone but intermittent;
4.2 percent were so deemed at the fifty-second collection. Of the non-comatose
sample remaining after one year, 57.3 percent were always "present," thus
42.7 percent were at one time or another intermittently gone. Some are
gone more often than others are. Of those intermittently gone over the
year, 56.8 percent were gone for one duration only (one or more consecutive
weeks), 30.6 percent were gone for two durations, 5.4 percent for three
durations, 4.5 percent for four durations, and 2.7 percent for five or
more durations. The intermittence duration ranged from one to twenty-four
weeks. Most intermittences were from one to six weeks in length, and the
average (mean) duration was 3.37 weeks for those present at the end of
one year and intermittent at least once.
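Counting "durations" amounts to finding runs of consecutive missing weeks in a page's record. The sketch below assumes one True/False value per collection week; the seven-week history is hypothetical.

```python
from itertools import groupby


def gone_durations(history):
    """Lengths, in weeks, of each run of consecutive failed queries."""
    return [len(list(run)) for present, run in groupby(history) if not present]


# Hypothetical record: gone for two weeks, back, then gone for one week.
history = [True, False, False, True, True, False, True]
durations = gone_durations(history)
mean_duration = sum(durations) / len(durations)
```

This page would be counted among those gone for two durations, with a mean duration of 1.5 weeks.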
Based on observation and anecdotal evidence, those of us who try to
manage Web documents treat dead links in one of five ways and generally
do very little to address changed content. The five models for dead URL
maintenance are:
1. Do nothing
2. Manual spot checking and response to complaints
3. Periodic sweep and elimination
4. Periodic sweep and transfer
5. Periodic sweep and mark
The "do-nothing" approach to link management is the simplest of
all. Usually a "web cataloger" will provide us an on-line list of "my-favorite-sites"
and then ignore it. Slowly and sometimes not so slowly, the integrity
of the list erodes until a substantial proportion of the links to resources
are bad. These sites infuriate me, especially if they address a subject of
great interest or moment. Either maintain the site or kill it.
Manual spot-checking and response to complaints may be effective depending on the size of the link collection and on its subject matter. Some things are more stable than others are. Collections of government documents or sites created with URL stability in mind may be good candidates. One major site that "collects" professional organization sites, hosted at the University of Waterloo, has a preference for canonical sites (those that reflect the organization name in the domain name -- e.g. www.myorg.org as against www.myuniv.edu/~myorg). They find that canonical URLs have more lasting power than the others do. While the University of Waterloo Scholarly Societies Project uses URL checking that is more sophisticated than the spot-check, spot-checking might be enough in their case.
Those who sweep their collections and then eliminate busted links will maintain fairly clean sites. However, they will commit the "babies and bath water" error. Not only do URLs die, they come back. My preference in fact is not to term them "dead" but comatose. At any given time, about five percent of the Web is comatose but on the mend. If you eliminate dead links, you will eliminate links that are not really gone. Like everything else, there is a pattern to this. The "No DNS" error may very well only be an indicator that the network is overwhelmed or that the target server is down for one reason or another. Neither of these is permanent. The 404 error tends to portend the end. But even that is not certain. The 404 error is more serious for content pages than it is for navigation pages. And with a 404 error, the server is at least still there and it may be possible to find the absent page.
There are two strategies I endorse. The first is to sweep sites, then transfer busted URLs to a separate file, perhaps on-line, perhaps not. The busted URLs are then checked to see if they have returned. When they come live again, they are restored to the on-line public resource. Based on conversations I have had with the maintainers of major Web library catalog sites here in the United States and in Europe, this is a common strategy.
My preference, and the approach I have taken with my library and information manager association ethics Web site, is to mark the busted sites but to leave them as part of the collection. This obviates the need to re-add them when they return. But it also serves the user because it shows that a Web resource once existed and may again exist for a given organization. Moreover, it demonstrates the existence of an organization, and indeed organizations can persist even if they don't have a Web presence.
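The two strategies I endorse can be sketched side by side. This is an illustration under stated assumptions, not the code behind any of the sites mentioned: the `Link` record, the function names, and the `is_alive` checker (any function that takes a URL and returns True when it resolves) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Link:
    url: str
    missing_since: Optional[int] = None  # sweep number when first found missing


def sweep_and_mark(links: List[Link], is_alive: Callable[[str], bool],
                   sweep_number: int) -> None:
    """Keep every link in the collection; mark the busted ones in place."""
    for link in links:
        if is_alive(link.url):
            link.missing_since = None  # the page has returned, clear the mark
        elif link.missing_since is None:
            link.missing_since = sweep_number


def sweep_and_transfer(public: List[str], holding: List[str],
                       is_alive: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """Recheck both lists; live URLs go public, busted URLs go to holding."""
    live, busted = [], []
    for url in public + holding:
        (live if is_alive(url) else busted).append(url)
    return live, busted
```

In practice `is_alive` would wrap a real network check with retries. Passing it in as a parameter keeps the bookkeeping separate from the checking, and lets the sweep logic be exercised without touching the network.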
Bates, M. (1998). Indexing and abstracting for digital libraries and the Internet: Human, database, and domain factors. Journal of the American Society for Information Science 49 (13): 1185-1205.
Brake, D. (1997). Lost in Cyberspace. New Scientist 154 (2088): 12-3.
Chen, Ching-chih (1998). Global digital library: Can the technology havenots claim a place in cyberspace? In Ching-chih Chen, ed., Proceedings NIT '98: 10th International Conference New Information Technology, Hanoi, Vietnam, March 24-26, 1998. West Newton, MA: MicroUse Information, 1998: 9-18.
Chen, Yih-Farn and Eleftherios Koutsofios (n.d.). WebCiao: A Web site Visualization and Tracking System. Available: http://www.research.att.com/~chen/webciao/
Glassel, Aimée and Amy Wells (1998). Scout Report Signposts: Design and development for access to cataloged Internet resources. Journal of Internet Cataloging 1 (3): 15-45.
Gorman, Michael (1998). Metadata or cataloguing? A false choice. Journal of Internet Cataloging 2 (1).
Koehler, Wallace (1997). Internet search note: Specialized retrieval and Web search engines. Searcher 5 (5): 63-6.
Koehler, Wallace (1998). Staleness among Web search engines. Searcher 6 (7): 42-3.
Koehler, Wallace (1999a). An analysis of Web page and Web site constancy and permanence. Journal of the American Society for Information Science 50 (2): 162-180.
Koehler, Wallace (1999b). Classifying Web sites and Web pages: The use of metrics and URL characteristics as markers. Journal of Librarianship and Information Science 31 (1): 297-307.
Lawrence, S. and C. Giles. (1998). Searching the World Wide Web. Science 280 (5360).
McDonnell, J., W. Koehler, and B. Carroll (1999). Cataloging challenges in an Area Studies Virtual Library Catalog (ASVLC). Journal of Internet Cataloging 2 (2).
OCLC (n.d.). Building a Catalog of Internet-Accessible Materials. http://www.oclc.org/oclc/man/catproj/overview.htm.
Oder, Norman (1998). Cataloging the Net: Can we do it? Library Journal (October 1): 47-51.
Olsen, N. (1997). Cataloging Internet Resources: A Manual and Practical Guide, 2nd ed. http://www.oclc.org/oclc/man/catproj/overview.htm.
Ranganathan, S.R. (1933). Chain Indexing
Shafer, Keith (1997). Scorpion helps catalog the Web. Bulletin of the American Society for Information Science 24 (1). Available: http://www.asis.org/Bulletin/Oct-97/shafer.htm
Social Science Information Gateway. Available: http://sosig.ac.uk/about.html.
Star, Susan (1998). Grounded classification: Theory and faceted classification. Library Trends 47 (2): 218-65.
Tennant, Roy (1998). The art and science of digital bibliography. Library Journal digital (October 15, 1998). Available: http://www.bookwire.com/ljdigital.articles?date=current.
Tomaivolo, N. and J. Packer (1996). An analysis of Internet search engines: Assessment of over 200 search queries. Computers in Libraries 16 (6): 58-62.
University of Waterloo, Scholarly Societies Project (1999). URL-Stability Index for the Scholarly Societies Project. http://lib.waterloo.ca/society/URL_stability_index.html.