Copyright 2000 Wallace Koehler All Rights Reserved

Web Document Morbidity and Change


Web documents are unstable over time, and that instability takes two forms. The first is Web document morbidity: documents disappear. The second is change: documents that persist are altered. For the past three years, I have been following the behavior of 360 randomly selected URLs. The sample was derived using a random URL generator from WebCrawler (which unfortunately is no longer available). The sample was stratified so that the distribution of top-level domains (TLDs) matched that reported in December 1996. The sample has not been increased since its original collection. For a discussion of the methodology, see my paper in JASIS. What follows is a discussion based on my research.
 

Document Morbidity

Figure 1: Web Page Morbidity Over 3 Years

Figure 1 is a plot of the survival and demise of the sample over three years. The blue line represents the percent of Web pages that fail to resolve after three attempts on a weekly basis. The red line is a plot of average values for six weekly collections including the one reported. The average plot tends to smooth the curve. The plot shows three trends. The first was one of rapid decline in the number of "functional" Web pages.

The second shows a major return of Web pages. There are two explanations for this. The first is that a number of Web sites did in fact return, one after being dormant for more than 18 months. The second is a change in software and the reporting of 404 errors. The 404 error indicates a Web page not found on an extant server. The way in which 404 errors are reported has changed from a simple one-line message (which my software interpreted correctly) to sometimes page-long declamations. The page-long statements were interpreted as "present."

The third trend, and one following the change in 404 reporting, has been a slower rate of decline over the last year. For now, the number of Web pages present hovers at about 50% of the original Web sample. One conclusion one might reach is that after a protracted period of time (perhaps, as indicated here, two years) a degree of stability is achieved. That is to say, if a Web page survives two years there is a fair chance it will continue to do so. Further tracking is necessary to establish the trend.

Because Web pages have a surprising proclivity to return, I prefer to refer to these as "comatose" rather than "dead" pages or links. For the sake of nomenclature, I have taken to calling the returned pages that are the result of elaborate 404 messages "ghosts."

It should also be noted that at any given time, about 5% of those listed as missing will return within six weeks of the first missing report. These are termed "intermittent pages." Less than 5% of the sample has never gone "intermittent," which is to say that less than 5% have always answered the call.

Web Document Change


Figure 2: Web Document Change over 3 Years
 

Figure 2 is a plot of the changes that the Web page sample has undergone in the past three years. These data reflect only those Web pages that are "present" at the time of collection. Dormant or comatose pages are not included in calculations.

Three types of change are shown. The first, shown as a black line, reflects changes in page content as measured by change in page byte weight. That is, if the number of bytes for a page does not change, it is assumed that the content of the page does not change. That is a fairly safe assumption, but not a perfect one. For example, if I were to change the syllabus from "Turn in papers on July 31" to "Turn in papers om July 21" the byte weight would not change (2 and 3 have the same weight) but the message would. Note too that I included a purposeful typo just before "July 21." The typo would change neither the message nor the byte weight.
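The limits of byte weight as a change measure can be shown with a minimal Python sketch. The function names and the SHA-256 digest are assumptions for illustration only; the study measured byte weight, and hashing is shown here merely as one alternative that would catch same-length edits.

```python
import hashlib


def byte_weight(text: str) -> int:
    """Return the size of the page text in bytes (UTF-8 encoded)."""
    return len(text.encode("utf-8"))


def digest(text: str) -> str:
    """Return a SHA-256 digest; any edit to the text changes it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


old = "Turn in papers on July 31"
new = "Turn in papers om July 21"

# Byte weight misses the edit because both strings have the same length.
same_weight = byte_weight(old) == byte_weight(new)
# A digest comparison (an assumption, not the study's method) catches it.
same_digest = digest(old) == digest(new)
print(same_weight, same_digest)  # prints: True False
```

A digest detects that something changed but, like byte weight, says nothing about whether the change matters to a reader.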

The second measure is structural. The green and red lines indicate, respectively, new links from the page and changes to the existing links on the page on a weekly basis. It is possible to change the link structure without changing byte weight. Typically about 20% of the pages undergo content and link change and 10% experience new links on a weekly basis. There are occasional spikes when major change behavior occurs. Some of these spikes appear to result from software changes; others I cannot yet explain.

Change as Metadata

Quantitatively measured change intervals and the relative degree of change can be used for bibliographic control. Often Web page and Web site change is trivial, but at other times it is substantial. The definition of trivial or substantial change is necessarily subjective. The addition of, say, a bandwidth-demanding revolving logo may be taken as trivial by a reader seeking product price information, but as an important change by someone interested in Web site design. It should be noted that meaning can be changed by substituting one character for another, each with the same byte-weight, without affecting total byte-weight: consider for example the difference in meaning between these two phrases, "the island was deforested" and "the island was reforested."

Content Change

In the first year (December 1996 to January 1998), almost all Web pages (more than 97 percent) and Web sites (more than 99 percent) underwent some kind of change. Change is defined here as variation in byte-weight, in object number, and/or the number and arrangement of hypertext links.

Web sites undergo significant change over time. Changes in the number of Web site total objects for the sampling period are shown in Table 1.

Table 1. Web Site Relative Total Object Changes, by Order of Magnitude (Percent of Web Sites)

Change in size (orders of magnitude)    Dec/Jan 96/97 to June/July 97    June/July 97 to Dec/Jan 97/98
Decrease of more than 2                  3.1                              0.6
Decrease of 1 to 2                       3.7                              6.2
Decrease of 0 to 1                      21.0                              7.4
No major change                          7.5                              8.0
Increase of 0 to 1                      50.2                             68.5
Increase of 1 to 2                      11.2                              6.8
Increase of more than 2                  3.4                              2.5

Table 1 illustrates two general Web site change trends. First, Web sites are not static. They undergo dramatic size changes. Typically, less than seven percent of Web sites will "implode," that is, decrease in size by more than one order of magnitude per period, while less than fifteen percent will "explode," or increase in size by more than one order of magnitude. Second, while some Web sites either implode or explode, the general trend is toward size increases: well more than half of sampled Web sites increased in size over the sample year.

Web site catalogs are necessarily challenged by these size swings. As Web sites increase or decrease in size, sometimes as much as by more than two orders of magnitude, the depth and breadth of information contained in those Web sites are modified, amended, increased, or decreased. A dynamic index or catalog must reflect those changes.

Web pages also undergo change. Two quantitative change measures can be identified. These are changes in byte-weight (content change) and changes to the hypertext link structure of the page (structural changes). Byte-weight change is an imperfect surrogate for semantic or meaning change. Structural changes are modifications, additions, or deletions to the hypertext reference structure of the page. As such, structural changes probably represent a subtler source of semantic or meaning change than byte-weight, yet an equally important one.

Figure 2 charts rates of change for Web pages. With a number of interesting and dramatic exceptions, Web pages undergo content and structural change fairly consistently. At any given time, about twenty percent of the sample experience content and new structural change while about ten percent undergo modifications to existing hypertext link structures.

These content and structural changes have been labeled as "omega." Omega can be divided into total omega and three iso-types: content omega, new structural omega, and existing structural omega. Each omega value is an average of the periodic binary measure of change. The value "1" is assigned whenever dimension change of any magnitude occurs, "0" when none occurs. For example, if a Web page were to change one week but not the next, it would have an omega of 0.5 over the period. Omega-t is not additive of the three iso-types. It reflects any change in any of the three measures. Thus, a Web page experiencing Omega-c, Omega-n, and Omega-e changes would carry the same Omega-t as one with only Omega-c type change or other combinations.
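The omega calculation described above can be sketched in a few lines of Python. The function names and the four-week record are hypothetical; only the definitions (a weekly binary flag, averaged over the period, with Omega-t reflecting change in any measure) come from the text.

```python
from statistics import mean


def omega(changes):
    """Average of weekly binary change flags (1 = changed, 0 = unchanged)."""
    return mean(changes)


def omega_total(content, new_links, existing_links):
    """Omega-t: a week counts as changed if any of the three measures changed."""
    combined = [
        int(c or n or e)
        for c, n, e in zip(content, new_links, existing_links)
    ]
    return omega(combined)


# Hypothetical four-week record for one page.
content = [1, 0, 1, 0]         # Omega-c flags
new_links = [0, 0, 1, 0]       # Omega-n flags
existing_links = [1, 0, 0, 0]  # Omega-e flags

print(omega(content))                                   # 0.5
print(omega(new_links))                                 # 0.25
print(omega_total(content, new_links, existing_links))  # 0.5
```

In this record Omega-c is 0.5, Omega-n and Omega-e are each 0.25, and Omega-t is 0.5 rather than their sum, because weeks in which several measures change are counted once.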

Web pages undergo content change more frequently than structural change. However, structural change may occur in the absence of content change. Mean omega values for the Web page sample at the end of the sample year were found to be Omega-t 0.298, Omega-c 0.239, Omega-n 0.074, and Omega-e 0.150 (Koehler 1999a). Similar calculations could also be derived for Web sites.

Omega values can be employed in cataloging in one of two ways. Individual Web page omega values can be calculated and included as part of the bibliographic record. These values can provide the user with an indicator of the rate and type of change the Web page undergoes. Omega values relative to the Web page population, the content of a digital Web library, or any other Web document pool can also be calculated. Thus, it may be useful to identify individual Web pages according to the degree to which they vary from some standard.
 

Document Persistence


Persistence captures existence: is the page or site "there" or is it not? Persistence is not concerned with changes to Web pages or Web sites other than presence or absence. A Web page or Web site may undergo significant "alteration," including the complete replacement of all Web objects on the document, without affecting its persistence status. Those same Web pages and Web sites may also exhibit intermittence behavior without change to the document once it returns.
Persistence may take one of three forms: permanence, "comatoseness," and intermittence. A "permanent" Web site or Web page is one that always resolves over some period. A comatose Web page or Web site is defined as one that fails to respond or resolve after six consecutive weekly queries, including the most recent query. I prefer the term "comatose" to "dead" or "defunct" because in theory and in practice a Web document, once it "disappears," can "reappear" at the same URL if and whenever the Webmaster or Web author so chooses.

Web pages and Web sites appear to have half-lives of less than two and three years respectively. The comatose rate for Web pages appears to undergo phase shifts. As is shown in Figure 1, over the first eighty weeks of data collection, the growth in comatoseness was virtually linear. The solid line shown on Figure 1 is the percent of comatose Web pages of the entire sample for the week sampled. The dashed line smooths those data by providing the average percent for the collection week and the preceding five weeks. A "correction" ensued, followed by a slower growth of comatoseness.
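The six-week smoothing described above amounts to a trailing average. A minimal sketch in Python follows; the function name and the weekly percentages are invented for illustration, and averaging over fewer weeks at the start of the series is an assumption, since the text does not say how the first five weeks were handled.

```python
def trailing_average(values, window=6):
    """Average each value with up to window-1 preceding values."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)  # early weeks use a shorter window
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical weekly percent of the sample failing to resolve.
weekly_percent = [0, 2, 1, 4, 3, 8, 6]
smoothed = trailing_average(weekly_percent)
print(smoothed[5], smoothed[6])  # prints: 3.0 4.0
```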

Web sites also disappear, although at a slightly reduced rate. By the end of the same sampling year, approximately 25 percent of that sample had "gone" (Koehler 1999a). Not all that fail to respond or resolve at any given time are comatose. An intermittent Web page or Web site is one that has failed to respond at some point but that has returned. Web pages (and Web sites) can be classified according to their measured intermittence behavior over a specified time period. These may be classified and cataloged as never intermittent, as non-repeaters, or by frequency of intermittence.

The majority of Web pages that persist do so without intermittence. However, a substantial minority are intermittent at least once and sometimes more often. At the end of the first data collection year 27.7 percent of the original 361 Web page sample was comatose. At any given time, between 2.5 and 10 percent of the sample is considered to be gone but intermittent; 4.2 percent were so deemed at the fifty-second collection. Of the non-comatose sample remaining after one year, 57.3 percent were always "present," thus 42.7 percent were at one time or another intermittently gone. Some are gone more often than others are. Of those intermittently gone over the year, 56.8 percent were gone for one duration only (one or more consecutive weeks), 30.6 percent were gone for two durations, 5.4 percent for three durations, 4.5 percent for four durations, and 2.7 percent for five or more durations. The intermittence duration ranged from one to twenty-four weeks. Most intermittences were from one to six weeks in length, and the average (mean) duration was 3.37 weeks for those present at the end of one year and intermittent at least once.
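The persistence categories and the count of absence "durations" can be derived from a weekly presence record. The Python sketch below is one possible reading of the definitions above, not the software used in the study: the function names are hypothetical, and treating a page that is currently missing for fewer than six weeks as intermittent is an assumption.

```python
from itertools import groupby


def absence_durations(history):
    """Lengths of each run of consecutive missed weeks."""
    return [len(list(run)) for present, run in groupby(history) if not present]


def classify(history, coma_weeks=6):
    """Classify a weekly presence record (True = the page resolved)."""
    if all(history):
        return "permanent"
    recent = history[-coma_weeks:]
    # Comatose: the most recent six weekly queries all failed.
    if len(recent) == coma_weeks and not any(recent):
        return "comatose"
    # Assumption: anything else that has ever been missing is intermittent.
    return "intermittent"


# Hypothetical record: gone for two weeks, back, then gone for one week.
history = [True, False, False, True, True, False, True]
print(classify(history), absence_durations(history))
```

For this record the page is intermittent with two durations, of two weeks and one week.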
 

Current Practice

Despite the many efforts to bring some kind of stability to URL change and demise, most Web document authors, owners, or publishers have yet to adopt any of the innovations thus far advanced or suggested. These range from the very ambitious archives that Brewster Kahle (see http://www.archive.org) and others have suggested to Kahle's Alexa project (http://www.alexa.com), which redirects the user to the archived Web document in the event of the dreaded No DNS or 404 error. OCLC's PURL is a URL redirection service, but it requires that the Web author first register the Web document and receive a PURL, then maintain the integrity of the URL-PURL tie in the event the URL is moved. The search engine Google provides cached documents (a "GURL" if you will) in the event of a retrieval failure of the native document. DOIs may solve the problem of disappearing e-journal articles.

Based on observation and anecdotal evidence, those of us who try to manage Web documents treat dead links in one of five ways and generally do very little to address changed content. The five models for dead URL maintenance are:
 

1. Do nothing.
2. Manual spot checking and response to complaints
3. Periodic sweep and elimination
4. Periodic sweep and transfer
5. Periodic sweep and mark


The "do-nothing" approach to link management is the simplest of all. Usually a "web cataloger" provides an on-line list of "my-favorite-sites" and then ignores it. Slowly and sometimes not so slowly, the integrity of the list erodes until a substantial proportion of the links to resources are bad. These sites infuriate me, especially if they address subjects of great interest or moment. Either maintain the site or kill it.

Manual spot-checking and response to complaints may be effective depending on the size of the link collection and on its subject matter. Some things are more stable than others are. Collections of government documents or sites created with URL stability in mind may be good candidates. One major site that "collects" scholarly society sites, hosted at the University of Waterloo, has a preference for canonical sites (those that reflect the organization name in the domain name -- e.g. www.myorg.org as against www.myuniv.edu/~myorg). They find that canonical URLs have more lasting power than the others do. While the University of Waterloo Scholarly Societies Project uses URL checking that is more sophisticated than the spot-check, the spot-check might be enough in their case.

Those who sweep their collections then eliminate busted links will maintain fairly clean sites. However, they will commit the "babies and bath water" error. Not only do URLs die, they come back. My preference in fact is not to term them "dead" but comatose. At any given time, about five percent of the Web is comatose but on the mend. If you eliminate dead links, you will eliminate links that are not really gone. Like everything else, there is a pattern to this. The "No DNS" error may very well only be an indicator that the network is overwhelmed or that the target server is down for one reason or another. Neither of these is permanent. The 404 error tends to portend the end. But even that is not certain. The 404 error is more serious for content pages than it is for navigation. And with a 404 error, the server is at least still there and it may be possible to find the absent page.
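The distinction between a DNS failure and a 404 can be made programmatically. The following is a minimal sketch using Python's standard urllib; the function names and the status labels are hypothetical, and a real checker would also retry, since, as noted above, a DNS failure is often temporary.

```python
import socket
import urllib.error
import urllib.request


def classify_failure(exc):
    """Map a urllib exception to a rough link status label."""
    # HTTPError subclasses URLError, so it must be tested first.
    if isinstance(exc, urllib.error.HTTPError):
        return "not found" if exc.code == 404 else f"http error {exc.code}"
    # A failed name lookup surfaces as URLError wrapping socket.gaierror.
    if isinstance(exc, urllib.error.URLError) and isinstance(
        exc.reason, socket.gaierror
    ):
        return "no dns"
    return "other failure"


def check(url, timeout=10):
    """Fetch a URL once and report 'present' or a failure label."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return "present"
    except OSError as exc:  # URLError and HTTPError are OSError subclasses
        return classify_failure(exc)
```

A call such as check("http://www.myorg.org/") then returns one of the labels. Note that "present" here means only that the server answered with a success status; it does not rule out an elaborate error page served in place of the content.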

There are two strategies I endorse. The first is to sweep sites, then transfer busted URLs to a separate file, perhaps on-line, perhaps not. The busted URLs are then checked to see if they have returned. When they come live again, they are restored to the on-line public resource. Based on conversations I have had with major Web library catalog sites here in the United States and in Europe, this is a common strategy.

My preference, and the one I have taken with my library and information manager association ethics web site, is to mark the busted sites but to leave them as part of the collection. This obviates the need to re-add them when they return. But it also serves the user because it shows that a Web resource once existed and may again exist for a given organization. Moreover, it demonstrates the existence of an organization, and indeed organizations can persist even if they don't have a Web presence.
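The "sweep and mark" approach can be sketched as follows in Python. The Link record, its field names, and the is_present callable are all hypothetical; the point is only that a busted link is flagged and kept, and that the flag clears itself when the link returns.

```python
from dataclasses import dataclass


@dataclass
class Link:
    url: str
    busted: bool = False    # marked, but kept in the collection
    missed_sweeps: int = 0  # consecutive sweeps without a response


def sweep_and_mark(links, is_present):
    """Update marks in place; is_present is any callable url -> bool."""
    for link in links:
        if is_present(link.url):
            link.busted = False
            link.missed_sweeps = 0
        else:
            link.busted = True
            link.missed_sweeps += 1
    return links


# Hypothetical collection and a stub checker standing in for a real fetch.
links = [Link("http://www.myorg.org/"), Link("http://www.myuniv.edu/~myorg")]
alive = {"http://www.myorg.org/"}
sweep_and_mark(links, lambda url: url in alive)
print([(link.url, link.busted) for link in links])
```

The "sweep and transfer" variant would differ only in moving the marked entries to a separate list rather than leaving them in place.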

Hunting The Busted Link

There are any number of ways of screening for busted links. The most obvious and most time-consuming is to "click and check." For a site with hundreds of links, like mine, that's a major chore. There are many software solutions to the problem. That old progenitor browser, Mosaic, supported URL checking through its bookmark file. Browser power disappeared for a while, but has since been restored. One can use a variety of software tools designed to check URL efficacy.
The following should not be taken as an endorsement, but these are what I use. I personally like PowerMarks to screen a set of URLs. URLs can be added and deleted easily, its report is intuitively obvious, it can be set up to run automatically, and it's none too expensive. But it's not the only package that does just that. For more sophisticated applications, I prefer to use site maintenance software (WebAnalyzer for example) or a URL "checker" with a wider array of features (FlashSite).

The Busted Link is Harder to Catch Than Once It Was

Once upon a time, identifying the 404 and DNS errors was easier than it is today. Remember when that was all the error message said: No DNS or 404 Error? It's not like that today. We are no longer simply told that the page has gone 404; sometimes we get a cute little message, or we are told to check with the site administrator. These messages, because they do not necessarily follow any given format and because they tend to be long, can be mistaken for a Web page with "real" content. For the moment, my approach to managing these is threefold: (1) using FlashSite, look for Web pages that report a constant size, measured in bytes; (2) using FlashSite or other software, look for sites that have not been updated for a protracted period of time; and (3) about every six months, examine every link on our site. Pages that do not change size or are not updated may be 404ed. We have begun testing more often than every six months, as time allows.
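The first two screening rules can be expressed as a simple heuristic. The Python sketch below is an illustration only, not how FlashSite works: the function name, the six-sweep window, and the 180-day staleness threshold are all assumptions, and a flagged page still needs a manual look, since a page may simply be stable.

```python
from datetime import date, timedelta


def suspect_soft_404(sizes, last_modified, today, min_sweeps=6, stale_days=180):
    """Flag a page whose size never varies or that has not been updated lately.

    sizes is the list of byte sizes reported at successive sweeps.
    """
    recent = sizes[-min_sweeps:]
    constant_size = len(recent) == min_sweeps and len(set(recent)) == 1
    stale = (today - last_modified) > timedelta(days=stale_days)
    return constant_size or stale


# Hypothetical page reporting the same 512 bytes at six consecutive sweeps.
print(suspect_soft_404([512] * 6, date(2000, 1, 1), date(2000, 1, 15)))
```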

Web Document Morbidity, Change, and Catalogs

I have suggested that these change data might be included as part of the bibliographic record for Web pages and sites. It may be valuable to know that a page or site undergoes periodic change and at what rate. I have termed that rate of change the page or site "omega." The omega is a very straightforward statistic. It is the proportion of collection periods in which the document changed from one period to the next. Because these data can be collected automatically, it should be possible to populate and update catalog records on a more or less automatic basis as well.
 
 
Learning objective: Consider how changes in document content change the meaning of those documents. Does frequent change make it impossible to accurately describe such a document?

How useful would it be to know that a described document changes/does not change frequently?

References

Ardito, S. (1998). The Internet: Beginning or end of organized information? Searcher 6 (1): 52-7.

Bates, M. (1998). Indexing and abstracting for digital libraries and the Internet: Human, database, and domain factors. Journal of the American Society for Information Science 49 (13): 1185-1205.

Brake, D. (1997). Lost in Cyberspace. New Scientist 154 (2088): 12-3.

Chen, Ching-chih (1998). Global digital library: Can the technology havenots claim a place in cyberspace? In Ching-chih Chen, ed., Proceedings NIT '98: 10th International Conference New Information Technology, Hanoi, Vietnam, March 24-26, 1998. West Newton, MA: MicroUse Information, 1998: 9-18.

Chen, Yih-Farn and Eleftherios Koutsofios (n.d.). WebCiao: A Web site Visualization and Tracking System. Available: http://www.research.att.com/~chen/webciao/

Glassel, Aimée and Amy Wells (1998). Scout Report Signposts: Design and development for access to cataloged Internet resources. Journal of Internet Cataloging 1 (3): 15-45.

Gorman, Michael (1998). Metadata or cataloguing? A false choice. Journal of Internet Cataloging 2 (1).

Koehler, Wallace (1997). Internet search note: Specialized retrieval and Web search engines. Searcher 5 (5): 63-6.

Koehler, Wallace (1998). Staleness among Web search engines. Searcher 6 (7): 42-3.

Koehler, Wallace (1999a). An analysis of Web page and Web site constancy and permanence. Journal of the American Society for Information Science 50 (2): 162-180.

Koehler, Wallace (1999b). Classifying Web sites and Web pages: The use of metrics and URL characteristics as markers. Journal of Librarianship and Information Science 31 (1): 297-307.

Lawrence, S. and C. Giles. (1998). Searching the World Wide Web. Science 280 (5360).

McDonnell, J., W. Koehler, and B. Carroll (1999). Cataloging challenges in an Area Studies Virtual Library Catalog (ASVLC). Journal of Internet Cataloging 2 (2).

OCLC (n.d.). Building a Catalog of Internet-Accessible Materials. http://www.oclc.org/oclc/man/catproj/overview.htm.

Oder, Norman (1998). Cataloging the Net: Can we do it? Library Journal (October 1): 47-51.

Olsen, N. (1997) Cataloging Internet Resources: A Manual and Practical Guide, 2ed. http://www.oclc.org/oclc/man/catproj/overview.htm.

Ranganathan, S.R. (1933). Chain Indexing

Shafer, Keith (1997). Scorpion Helps Catalog the Web. Bulletin of the American Society for Information Science 24 (1). Available: http://www.asis.org/Bulletin/Oct-97/shafer.htm

Social Science Information Gateway. Available: http://sosig.ac.uk/about.html.

Star, Susan (1998). Grounded classification: Theory and faceted classification. Library Trends 47 (2): 218-65.

Tennant, Roy (1998). The art and science of digital bibliography. Library Journal digital (October 15, 1998). Available: http://www.bookwire.com/ljdigital.articles?date=current.

Tomaivolo, N. and J. Packer (1996). An analysis of Internet search engines: Assessment of over 200 search queries. Computers in Libraries 16 (6): 58-62.

University of Waterloo, Scholarly Societies Project (1999). URL-Stability Index for the Scholarly Societies Project. http://lib.waterloo.ca/society/URL_stability_index.html.