MODULE 4 Characteristics
URx
Permanence
Longevity

There are practical, legal, and ethical issues, as yet unresolved,
in the caching or archiving of Web documents. Discuss the following on-line:
Does caching represent a potential violation of the intellectual property rights of the Web document author, or is it merely a convenient method of conveying information? If Web documents are to be archived, which version should be archived? Are there ethical, legal, or practical problems associated with archiving? Is the Web inherently dynamic, and if so, can we as librarians and information scientists distinguish between different "editions" of the same work? How useful are PURLs for information management? DOIs? "GURLs"? Contrast them.
Web archives fall into a number of classes. Among these are
(1) all-encompassing archives, (2) subject area archives, and (3) archives
of newsgroup and discussion list threads.
General archives
Perhaps the most ambitious of archives is that proposed by Brewster
Kahle in a 1998 article in Scientific American [1]. The Kahle archive
has been created and can be accessed at http://www.archive.org.
By its own reckoning, the Internet Archive held almost 15 terabytes of
Web documents as of February 2000.
The Internet Archive has an "opt out" policy. A Web author can preclude
archiving either by installing a robot exclusion code on the Web page or
by informing the Internet Archive that archiving is disapproved.
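The robot exclusion half of the opt-out can be illustrated with Python's standard urllib.robotparser module. This is only a minimal sketch of how a well-behaved archiving robot might check a site's exclusion rules before copying a page. The robots.txt content, the example.org addresses, and the use of "ia_archiver" as the archiving robot's name are assumptions made for illustration, not a statement of the Internet Archive's actual procedure.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that an author might publish to opt out of archiving.
# The agent name "ia_archiver" is an assumption used for illustration.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_archive(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the exclusion rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The archiving robot is excluded from the whole site.
archiver_allowed = may_archive(
    ROBOTS_TXT, "ia_archiver", "http://www.example.org/ethics.htm")

# Other robots are excluded only from the /private/ directory.
other_allowed = may_archive(
    ROBOTS_TXT, "SomeOtherBot", "http://www.example.org/ethics.htm")
other_private = may_archive(
    ROBOTS_TXT, "SomeOtherBot", "http://www.example.org/private/notes.htm")
```

Note that the exclusion only works if the robot chooses to honor it. The burden of acting still falls on the author, which is what makes the policy "opt out" rather than "opt in."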
What are the potential intellectual property ramifications of an "opt
out" policy?
Do Web document creators have passive or active intellectual property rights and control over reproduction and storage of their property? What are the differences in intellectual property issues among archives, PURLs, DOIs, "GURLs", and ISP caching?
The Internet Archive works in concert with Alexa Internet, also a Kahle initiative. Alexa is a desktop-resident utility that provides information about Web documents, including document statistics and demographics. In its interaction with the Archive, the Alexa software will provide an archived copy of a Web page that returns a "404" (not found) error.
Subject/Geographic Area
If the maintenance of general archives is complex and fraught with
technical difficulties, one possible solution may be the
development of subject or geographic area "special" Web document archives.
Just as many public libraries now collect and maintain archives of local
materials (the local newspapers, church or school published cookbooks,
high school yearbooks, old phone books, ad nauseam -- or maybe a
better term is add museum), those same institutions could maintain
a similar collection of Web documents.
Newsgroups and Discussion List archives
There are any number of WWW archives of newsgroups and discussion lists
accessible. These take two forms: one merely copies the threaded structure
of the discussion and pastes it to the Web; the other provides some
value-added organizational and cosmetic touches. Both serve useful purposes.
For example, see Re: WWW archives of mailing lists at http://www.rutgers.edu/Accounting/anet/lists/anetdev-l/0071.html
OCLC offers an extensive explanation of PURLs at http://purl.oclc.org/. The page also includes statistics on the number of PURLs registered and resolved. OCLC offers PURL software and a registry.
PURLs take the following form: http://purl.oclc.org/aaa/bbb/ccc, where "purl.oclc.org" indicates the PURL server and the balance of the address represents the purled document. Using the OCLC PURL creation software, I registered my professional ethics page with the resulting PURL: http://purl.oclc.org/NET/www.ou.edu/cas/slis/ethics/EthicsBibOrg.htm
Once created, any given PURL remains permanent and may, over time, be associated with one or more different URLs or other resources. Thus, although my professional ethics Web page now resides on an OU server, it has not always done so. For the first two years of its existence, it resided on a commercial ISP. Had I registered that URL with OCLC and received a PURL, I could have disassociated the old URL from the PURL and reassociated the new URL with the existing PURL. Now that the page has a PURL, should I move the Web site in the future, I can associate its new URL with the existing PURL.
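The register-then-reassociate idea can be sketched in a few lines of Python. This is a toy, in-memory stand-in for a PURL resolver, not OCLC's software. The class name, its methods, and both target URLs are hypothetical and chosen only for illustration; only the PURL itself comes from the text above.

```python
class PurlRegistry:
    """Toy in-memory PURL resolver: permanent names, replaceable targets."""

    def __init__(self):
        self._targets = {}  # maps each PURL to the URL it currently points at

    def register(self, purl, url):
        """Create a new PURL; an existing PURL cannot be registered twice."""
        if purl in self._targets:
            raise ValueError(f"{purl} is already registered")
        self._targets[purl] = url

    def reassociate(self, purl, new_url):
        """Point an existing PURL at a new address; the PURL is unchanged."""
        if purl not in self._targets:
            raise KeyError(purl)
        self._targets[purl] = new_url

    def resolve(self, purl):
        """Return the address a request for this PURL is redirected to."""
        return self._targets[purl]


PURL = "http://purl.oclc.org/NET/www.ou.edu/cas/slis/ethics/EthicsBibOrg.htm"
OLD_URL = "http://isp.example.com/~author/EthicsBibOrg.htm"       # hypothetical
NEW_URL = "http://www.example.edu/slis/ethics/EthicsBibOrg.htm"   # hypothetical

registry = PurlRegistry()
registry.register(PURL, OLD_URL)
before = registry.resolve(PURL)      # points at the commercial ISP copy
registry.reassociate(PURL, NEW_URL)  # the page moves; the PURL does not
after = registry.resolve(PURL)       # same PURL, new address
```

Anyone who cited the PURL keeps a working link after the move, provided the owner remembers to update the association.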
The DOI initiative argues that while systems are evolving within
segments of the publishing and indexing industries, most initiatives are
ad hoc and limited to the contracting parties. DOI is offered as a standard.
The proposed standard is XML DTD based and will permit both publishers
and authors to insert DOI-managed citations into text.
DOIs are not designed to be a universal solution to URL migration and loss. They are limited to a specific literature genre, and the occasional Web publisher is precluded from their use. DOIs differ from PURLs not only in who applies them but also in their potential permanence. PURLs are only as permanent as the willingness of the content creator to maintain and update the material and the link from the native document to the PURL; OCLC maintains the resolution capability but not the document. With DOIs, the maintenance of the handle and the document is independent of both author and publisher.
It is inherent in the DOI philosophy that "handled" documents are necessarily static. This is consistent with their target market: published scholarly articles, which by their very nature are unlikely candidates for change. PURLs and updatable caches are not so restricted. Thus, from a Web perspective, DOIs occupy a very small niche, but an important one. I suspect DOIs will prove attractive to publishers, who will not be so pressured to maintain electronic archives; to authors, who will be assured of electronic publication longevity; and to libraries, which will be assured of long-term access to subscribed electronic materials.
The Handle System® is a product of the Corporation for National Research Initiatives (CNRI). According to CNRI, there are several pilot projects underway with major government and other agencies.
Individuals wishing to resolve handles must first download free, client resident software.
The proposed ANSI standard governing DOIs is Z39.84. The draft standard is available at: http://www.niso.org/pdfs/Z3984.pdf
Note the structural resemblance between "GURLs" (for Google URL) and PURLs. The resemblance is but superficial. The key difference between the two is that the PURL offers a permanent identifier for material addressed using impermanent URLs, while the "GURL" simply provides a proxy for a "downed" Web site. The PURL redirects a document request to the address where it resides and is controlled by its owner, while the "GURL" caches the document.
Google explains its caching policy at http://www.google.com/help.html#K. The archived Web document is provided in the event that the original is not available for one reason or another. Google acknowledges that the archived document may not be as recent an edition as the author's own. Implicitly, perhaps something is better than nothing.
To its credit, Google makes it very clear that it is serving up a
cached document and when that document was captured. Google identifies its
cached version as a "snapshot" of the site, when in fact it is not an "image"
but -- with the exception of the Google header -- an exact HTML copy.
A cache is temporary storage. How temporary is temporary? If a document is cached, but the original native document is changed by its author after the previous version is cached, does a disservice occur to the document author or to the end user?
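One way to make the "how temporary" question concrete is to store the capture date with each cached copy and to compare a fingerprint of the cached text with the live page. The following is a minimal sketch under stated assumptions: the Snapshot class, the is_stale and banner functions, and the example.org page are all hypothetical, and nothing here describes how Google actually manages its cache.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Snapshot:
    """A cached HTML copy of a page plus the moment it was captured."""
    url: str
    html: str
    captured_at: datetime

    @property
    def digest(self) -> str:
        # Fingerprint of the cached text, used to detect later changes.
        return hashlib.sha256(self.html.encode("utf-8")).hexdigest()


def is_stale(snapshot: Snapshot, current_html: str) -> bool:
    """True if the live page no longer matches the cached copy."""
    current = hashlib.sha256(current_html.encode("utf-8")).hexdigest()
    return current != snapshot.digest


def banner(snapshot: Snapshot) -> str:
    """Header telling the reader that this is a cached copy, and from when."""
    return (f"Cached copy of {snapshot.url} "
            f"as captured on {snapshot.captured_at:%Y-%m-%d}")


ORIGINAL = "<html><body>Ethics bibliography, first version.</body></html>"
REVISED = "<html><body>Ethics bibliography, corrected version.</body></html>"

snap = Snapshot("http://www.example.org/ethics.htm", ORIGINAL,
                datetime(2000, 2, 1, tzinfo=timezone.utc))

unchanged = is_stale(snap, ORIGINAL)  # live page still matches the cache
changed = is_stale(snap, REVISED)     # author has revised the page since
```

The check tells us that the author's page has changed, but not whether serving the older copy harms the author or helps the reader. That remains a policy question.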
There are also intellectual property rights issues associated with caching. For an interesting view, see W3C, Intellectual Property Rights Overview, 1996, Available: http://orion.deusto.es/~abaitua/konzeptu/bbdd/fairuse4.htm or a Canadian Government site "Copyright Infringement and the Internet" at http://strategis.ic.gc.ca/SSG/it03315e.html
We know that books and other publications are often deselected but are sometimes reduced to the ignominy of microfilm. Microfilm, though, is the most stable of archival media. When we weed a Web site, what happens to it? If a book or serial is owned by a thousand libraries and archives, what is the likelihood that at least one copy will persist over some protracted period of time? If a Web site is archived by a single institution, what are its long-term prospects? The answer is "no longer at best than the prospects of the single archiving institution."
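The contrast between a thousand holding libraries and a single archive can be put in rough numbers. The sketch below assumes, purely for illustration, that each copy survives the period independently and with the same probability. Neither assumption is likely to hold exactly, and the probabilities used are invented, not measured.

```python
def survival_probability(copies: int, p_single: float) -> float:
    """Chance that at least one of `copies` independent copies survives.

    p_single is the assumed chance that any one copy survives the period.
    """
    if copies < 0 or not 0.0 <= p_single <= 1.0:
        raise ValueError("copies must be >= 0 and p_single within [0, 1]")
    # All copies are lost with probability (1 - p_single) ** copies.
    return 1.0 - (1.0 - p_single) ** copies


# A Web site held by one archive with an even chance of surviving.
single_archive = survival_probability(1, 0.5)

# A book held by a thousand libraries, each very likely to discard it.
thousand_libraries = survival_probability(1000, 0.01)
```

Under these assumptions, the widely held book is far more likely to persist than the singly archived Web site, even though each individual library is much less careful with its copy than the archive is. Redundancy, not the care of any one holder, does the work.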
There is a second and perhaps more important problem associated with Web sites. Books are static. Web documents are dynamic. Once a book is published it is done (yes, I know, there are exceptions to the general rule). Subsequent publications, perhaps with corrections, are new printings. When a book is modified, it comes out as a new edition. The fact that a new edition emerges does not erase forever and from everywhere copies of previous editions.
In the Web environment, the general rule is just the opposite. Web documents, we are told, are always works in progress. When a Web document is modified, the previous version of that document ceases to exist. And we know that Web sites and pages are frequently changed, sometimes significantly, sometimes not. But they are changed.
The problem facing any Web archivist has to be how often native documents are saved. Does the archivist overwrite previously stored documents, as I assume Google does when it finds a more recent version of a document than it has in its cache? Or does the archivist maintain all variations of Web documents in his/her collection? If the latter, how often should the archive check for a new version? If one is found, should some kind of qualitative decision be made about the degree or amount of change needed to trigger a document harvest? Minor incremental change might be ignored up to some point. But who decides what is incremental and what is change? If every change is captured and stored, at what point will the archive overwhelm everyone's ability to store the collection?
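One possible harvest policy, offered only as a sketch for discussion, is to keep every stored version and to add a new one only when the page differs from the last stored version by more than some threshold. The VersionedArchive class, the 5 percent threshold, and the sample texts are all hypothetical. Similarity is measured here with the SequenceMatcher class from Python's standard difflib module, which is just one of many ways to quantify "amount of change."

```python
import difflib


class VersionedArchive:
    """Keeps every harvested version whose change exceeds a threshold."""

    def __init__(self, min_change=0.05):
        self.min_change = min_change  # fraction of change that triggers a harvest
        self.versions = {}            # url -> stored texts, oldest first

    def offer(self, url, text):
        """Store text if it differs enough from the last stored version.

        Returns True if a new version was harvested, False if it was ignored.
        """
        history = self.versions.setdefault(url, [])
        if not history:
            history.append(text)      # first sighting is always kept
            return True
        # autojunk=False avoids difflib's heuristic that can distort
        # similarity scores for long, repetitive texts.
        matcher = difflib.SequenceMatcher(None, history[-1], text, autojunk=False)
        change = 1.0 - matcher.ratio()
        if change >= self.min_change:
            history.append(text)
            return True
        return False


URL = "http://www.example.org/ethics.htm"                  # hypothetical page
BASE = "Professional ethics bibliography. " * 20
archive = VersionedArchive(min_change=0.05)

first = archive.offer(URL, BASE)                           # initial harvest
identical = archive.offer(URL, BASE)                       # nothing changed
typo_fix = archive.offer(URL, BASE + "!")                  # trivial change
rewrite = archive.offer(URL, "An entirely rewritten page about microfilm.")
stored = len(archive.versions[URL])
```

Because each new page is compared with the last stored version rather than the last one seen, small changes accumulate until together they cross the threshold. Who sets the threshold, and on what grounds, is exactly the question raised above.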
Consider and discuss these issues.
Stuart Weibel, Erik Jul, and Keith Shafer, PURLs: Persistent Uniform Resource Locators, 1995. Available: http://purl.oclc.org/OCLC/PURL/SUMMARY
Keith Shafer, Stuart Weibel, Erik Jul, Jon Fausey, Introduction to Persistent Uniform Resource Locators, Available: http://purl.oclc.org/OCLC/PURL/INET96
Catherine Lyons and Howard Ratner, DOIs used for Reference Linking: DOI-X, version 1.0, dated October 8, 1999. Available: http://meta.doi.org/doi-x-reflinks_v1-0.PDF