
Copyright © 2000 Wallace Koehler - All Rights Reserved

Setting the Context

 
To paraphrase the Greek philosopher Heraclitus: "You cannot access the same Web twice." 

The Web as Library -- Or Not

There are two schools of thought, one holding that the Web is a library and another that it is not. This course is not about the Web as a library, but it is important to place the Web within the library context. Your professor is of the view that the Web is not a library, but he is tolerant of those who think otherwise. One of our readings is a paper he published in an e-journal: "Digital libraries and World Wide Web sites and page persistence." Information Research 4, 4 (June 1999)
http://www.shef.ac.uk/~is/publications/infres/paper60.html

Whether the Web is a library or a source of material for digital libraries, we must understand the concept of the digital library in order to appreciate the context. Explore the following slide show on digital libraries. For definitions of digital libraries documented by other library school students, see http://www.simmons.edu/~schwartz/530-defs.html

There is an important distinction we need to bear in mind when considering options for the management of digital libraries and the Web environment. As we will see, there are a number of digital library projects in which the library incorporates markup language into the text of the document. For example, consider the Model Editions Partnership (MEP), which marks up digitized versions of historical documents. Note the careful phrasing. MEP does not mark up the historical document itself, but rather a representation of that document in digital format. Given the government propensity for document digitization, perhaps MEP or similar technologies will be applied to government documents, either by the authors themselves or later by historians and librarians. Or perhaps GILS already is a form of native markup of those documents.

Web documents are a thing apart from archived historical documents or live, GILS marked-up government documents. Web documents are created and "published" by millions of people, independent of any formal markup requirement other than to produce material in an HTML format. There are no rules that require a Web publisher to supply even the most rudimentary metatag. We might cache or archive Web documents, and there are many proposals and technologies that do just that. Once captured in some way, these documents can be marked up in any way one might wish. But there remains no obligation on the part of the author or publisher to do anything. And most of the Web remains outside the direct control of those who may seek to mark it up for whatever purpose.

As a Web author, I am a proponent of the redundant use of classification techniques. As an information scientist, I recognize the limits of author-supplied cataloging and indexing. However rigorous I may be in trying to observe the intricacies of metatags, Dublin Core, and various XML markup techniques, I am probably not equal to the task and am intolerant of the time costs. There are also skill and technology barriers. At the time of creation, parts of this document were created with software that does not support editing of the document source code.
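
To make the point about author-supplied metadata concrete, here is a minimal sketch of what "using what we are given" looks like in practice. It assumes Python and its standard html.parser module, neither of which is part of this course's materials. The two sample pages are invented for illustration, and the "DC." prefix follows the common convention for embedding Dublin Core elements in HTML metatags. The sketch simply reports whatever metatags an author chose to supply, which may be nothing at all.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from the <meta> tags of an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # Lowercase the metatag name so "DC.Title" and "dc.title" match.
            self.meta[name.lower()] = content


def extract_meta(html_text):
    """Return a dict of the metatags the author supplied (possibly empty)."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.meta


# Hypothetical page whose author supplied Dublin Core and keyword metatags.
SAMPLE_PAGE = """<html><head>
<title>Setting the Context</title>
<meta name="DC.title" content="Setting the Context">
<meta name="DC.creator" content="Wallace Koehler">
<meta name="keywords" content="Web, digital libraries, cataloging">
</head><body><p>Body text.</p></body></html>"""

# Hypothetical page whose author supplied no metatags at all.
BARE_PAGE = "<html><head><title>Untitled</title></head><body></body></html>"

if __name__ == "__main__":
    print(extract_meta(SAMPLE_PAGE))
    print(extract_meta(BARE_PAGE))
```

The second page is the more typical case: nothing obliges the author to describe the document, so the cataloger is left to discover or infer the description.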

Whether we mark up Web documents or apply classic cataloging techniques, Web librarians are limited to post-publication methodologies, just as they are with most other material maintained in libraries. We use what we are given in the document itself or what we can discover or infer. Over time, authors and publishers have subscribed to both formal and informal metadata standards. Titles, author names, publication data, structure, and many other factors are organized in more or less standard ways. Publishers, again for the most part, subscribe to standard identification schemes for their monographs and serials, for example the ISBN that books carry. No such code exists on the Web. But remember that even in the "organized" traditional world of print, not everything is done the same way. Books published in English, if they have tables of contents, have them at the front. Books published in many other languages have those same tables of contents at the back. Most scholarly works in English contain indexes. This is non-standard in many other publishing traditions.

Post-publication methodologies are hostage to a dynamic medium. Web documents die or change with great frequency. We will suggest here that not only must we acknowledge those changes, we should also take advantage of them to categorize documents where we can.

Finally, what does a library do? What is library science? What is information science? Has the advent of the computer and its ability to rapidly store, process, and retrieve information revolutionized and redefined information science or library science? What then of digital libraries? And can we cope with the WWW within the constraints of the library and information sciences paradigms? If we deal only with Web documents, we no longer need to be too concerned with where to physically put them. We continue to have to consider where to put them in intellectual space. Moreover, unless I am very mistaken, "physical information containers" will be with us for quite some time. Whatever our responsibilities for the physical maintenance of collections may become, we will forever be responsible for the retrieval of appropriate, quality, authoritative information in the service of our patrons and clients. Therein lies much of the challenge.

Cataloging and Indexing Considerations

Whether or not the WWW is a library, what is or is not the responsibility of the library and information sciences community to catalog and/or index that content? I believe there are two basic approaches to management of that information. The first is that we treat the WWW in the same way that we treat "information containers" in other formats. We discriminate among the available resources and select from those resources for inclusion in our various libraries.

The second is to attempt to manage the corpus of the Web, to catalog it as a single collection. This is perhaps a natural conclusion deriving from the encyclopedist tradition, culminating perhaps in the arguments of H.G. Wells in his interesting collection of essays World Brain, published in 1938. There is an ample literature in cataloging and indexing that tells us that the same collection needs to have its cataloging and indexing presented differently for different audiences (e.g., Soergel). In an interesting paper, Colomb argues that the WWW is a "heterogenous and chaotic collection of information." I have no argument with that. Because there are multiple users with multiple needs, Colomb argues that the Web needs multiple indexes. The alternative he sees is an overwhelmingly complex and multi-layered single catalog that is difficult to index, very expensive to maintain, and impossible to use. Bella Haas Weinberg informs us that there is "nothing new under the sun," that the Web poses no new challenges of substance and that it pales in fact when contrasted with everything that has come before it.

There is another problem. I have already invoked the philosopher Heraclitus. As we shall see, the WWW is in constant flux. Just as it is for other publication systems, the pool of materials on the Web continues unceasingly to increase. Estimates in 1996 and 1997 placed the number of Web pages at between 100 and 600 million. Year 2000 estimates have it at between 1 and 1.5 billion public, static pages. Public means accessible without a password and not hidden behind a firewall. Static means not dynamically produced from a database on demand. But new books, journals, magazines, flyers, films, CDs, ad nauseam are added to our pot of information as well.

But unlike those new books, journals, magazines, flyers, films, CDs, and so on, Web documents (an inclusive term for Web pages, sites, and other structures) undergo constant metamorphosis. In any given year almost all Web pages and all Web sites will be changed by their creators in some way. Not only are we faced with a "heterogenous and chaotic collection of information," we are faced with something that is constantly being redefined. To use library-speak, no longer can we mark 'em, park 'em, and forget 'em. We must forever be remarkin' 'em and reparkin' 'em. And there's no forgettin' 'em.
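
One simple way to notice that a Web document has died or changed is to compare a fingerprint of its text across two visits. The following is only a sketch under stated assumptions: it uses Python's standard hashlib module, the function names and the four category labels are hypothetical, and it presumes the page text has already been retrieved by some other means (with None standing for a page that could not be retrieved). It is not a description of any particular project's method.

```python
import hashlib


def fingerprint(page_text):
    """Return a SHA-256 digest of the page text, ignoring whitespace layout."""
    # Collapse runs of whitespace so reformatting alone is not seen as a change.
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify_change(previous, current):
    """Compare two visits to the same URL.

    Each argument is the page text from one visit, or None if the page
    could not be retrieved on that visit. The labels are illustrative.
    """
    if current is None:
        return "gone"
    if previous is None:
        return "new"
    if fingerprint(previous) == fingerprint(current):
        return "unchanged"
    return "changed"


if __name__ == "__main__":
    first_visit = "Setting the Context.  The Web as Library."
    second_visit = "Setting the Context. The Web as Library -- Or Not."
    print(classify_change(first_visit, second_visit))
    print(classify_change(first_visit, None))
```

A fingerprint of this kind only says that something changed, not what changed or whether the change matters to a cataloger; deciding that still requires looking at the document.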

Are there solutions to these issues? Weinberg tells us we can do it. She is right, of course. We have to do it. This course does not offer solutions as such. It does explore the problems and the various solutions that are being offered and explored. I think everyone involved in the process would concede that we are far from the ideal solution but also that progress is being made.

Philosophical Questions

 
Is the Web a library? From the perspective of the bibliographic control of Web documents, does it matter? Discuss the nature of the WWW.

Consider the public policy implications of the "digital divide" from both a domestic and an international perspective. Will "good" management of "Web information space" exacerbate or alleviate the digital divide? Check out the US federal initiative at http://www.digitaldivide.gov/


 

Readings to Set the Context

R. M. Colomb, "A Digital Library Needs Many Indexes" Available: http://archive.csee.uq.edu.au/~colomb/Phronesis.html

R. Kling, "Beyond Outlaws, Hackers and Pirates: Ethical Issues in the Work of Information and Computer Science Professionals," 1995. Available: http://www-slis.lib.indiana.edu/kling/cc/8-ETH1.html

W. Koehler, "The World Wide Web as a Third Information Model: Revolution or Old Wine in New Bottles?" Crimea 98, Libraries and Associations in the Transient World: New Technologies and New Forms of Cooperation, Proceedings. Available: http://www.gpntb.ru/win/inter-events/crimea98/doc1/doc65.html (note abstracts in Ukrainian and Russian, text in English).

F.W. Lancaster, "Second Thoughts on the Paperless Society," Library Journal, September 1999: 48-50.

S. Lawrence and C. L. Giles, "Accessibility of Information on the Web," Nature 400 (8 July 1999): 107-9.

D. Soergel, Organizing Information: Principles of Database and Retrieval Systems. Orlando, FL: Academic Press, 1985.

B.H. Weinberg, "Improved Internet Access: Guidance from Research on Indexing and Classification," Bulletin of the American Society for Information Science 25, 2 (1999). Available: http://www.asis.org/Bulletin/Jan-99/weinberg.html
 
What does Lancaster mean when he says: "The typical library catalog is a pathetic tool for subject access."? Given what Lawrence and Giles report, are search engines "pathetic tools" too? Are other approaches pathetic as well?

What ethical considerations does Kling raise that are pertinent to our examination of management of the WWW?

If Koehler is correct, do we need more rigor or less in our attempts to bibliographically capture the WWW?

Colomb asks whether we can, and whether we should, have more than one access tool. Given digital economies, is there any reason not to? Would this, as Colomb seems to suggest, somehow strike at the foundations of library science?