MODULE 1- INTRODUCTION

Copyright © 2002 Wallace Koehler- All Rights Reserved

Expectations

Consider what managing Web-based information in a Web-based environment is about. Is this process an extension of the library science tradition, or is it something new and independent of both the constraints and understandings of that tradition? Whether the management of Web materials is tied to or independent of traditional library science, is it a subset of digital library management or, again, something separate? Like a true academic, one answers "there are at least two schools of thought." The first argument is that the Web is indeed part of the grander scheme of things library [Wells et al]. If so, some would have it that the Web is a library and some argue not.

Others have it that the management of digital libraries, and with them the Web, is a thing apart from traditional libraries [Welty & Jenkins]. Libraries, they argue, are tied to the need to shelve objects somewhere. Indeed, library science has been and remains a discipline concerned with parking books rather than the retrieval of information.

There is some truth to the assertion that library scientists have been concerned with the physical management of information resources. Until very recently, the information resources managed by libraries have come in some kind of physical package, and those packages must be placed somewhere. The computer scientist as information professional, concerned with the management of ethereal objects, need not, they argue, be constrained by the shackles of the past. Nor are they blinded by its traditions.

Past practices, it is indeed true, were constrained by physical realities. The card catalog was usually designed to support three access points: subject, title, and author. Why only three, when so much other information was available and relevant? In theory it is possible to design a physical card catalog with many more than three access points. These might include date of publication, language, publisher, number of pages, size of illustrations, illustration content, color of cover, font and point size, and so on. In practice card catalogs were limited by physical space: each access point requires its own set of cards, so a catalog with six access points would have to occupy twice the space of one with three.

Electronic catalogs, for whatever kind of collection, have changed that. Objects can be described according to any number of characteristics. The depth to which the cataloger can go, the granularity of description, is in theory limitless. There is an emerging body of theory to suggest that we need to redefine our roles, acknowledge a paradigm shift, or incorporate new approaches and techniques. It is true that what we can do as information and library scientists has expanded. We can apply those new techniques to "brick and mortar" and digital collections. These can be managed separately or together in hybrid libraries.
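To make the contrast with the three-drawer card catalog concrete, here is a minimal sketch in Python of an electronic catalog in which every described characteristic becomes an access point. The records, the field names, and the cover colors are invented for illustration; no real catalog system is being reproduced.

```python
from collections import defaultdict

# Hypothetical catalog records; field names and cover colors are invented.
RECORDS = [
    {"id": 1, "title": "Birds of the West Indies", "author": "Bond, James",
     "language": "English", "cover_color": "green"},
    {"id": 2, "title": "Casino Royale", "author": "Fleming, Ian",
     "language": "English", "cover_color": "red"},
    {"id": 3, "title": "Cien años de soledad", "author": "García Márquez, Gabriel",
     "language": "Spanish", "cover_color": "red"},
]


def build_access_points(records):
    """Index every field of every record, so each field becomes an access point."""
    index = defaultdict(lambda: defaultdict(set))
    for record in records:
        for field, value in record.items():
            if field == "id":
                continue  # the id identifies the record; it is not a description
            index[field][value.lower()].add(record["id"])
    return index


def lookup(index, field, value):
    """Return the sorted ids of records whose `field` equals `value`, ignoring case."""
    return sorted(index.get(field, {}).get(value.lower(), set()))


ACCESS_POINTS = build_access_points(RECORDS)
print(lookup(ACCESS_POINTS, "cover_color", "red"))
```

Adding a seventh or a seventieth access point costs one more key in each record rather than another cabinet of cards, which is the point of the paragraph above.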

The purpose of this course is to promote the idea that the Web can be managed. To do that, we will explore a variety of means developed in recent years to address digital and Web-based collections. As we shall see, all rely on metadata considerations. The strategies adopted range from applying traditional manual cataloging techniques to highly automated metadata harvesting and catalog population. We will touch on this range of options and come to understand their purpose and function. This is a survey course. We will not become experts but rather conversant across a range of options. Likewise, this is not a theory course. We will again touch on theory to be conversant rather than expert. Thus we begin the introduction to a very interesting course of study.

What is this thing called "metadata"

The metadata revolution began in the ninth century C.E. The Diamond Sutra, printed in 868 C.E., is the earliest known dated printed book. Housed at the British Library, the Diamond Sutra is one of the texts defining Mahayana Buddhism.

Our preoccupation with metadata is a twentieth century phenomenon, and with good reason. "Metadata are data about data." Entering library students and the most accomplished catalogers are reminded of this little dictum daily. Metadata, in a nutshell, are schemes to describe how data are to be captured, arranged, and offered. When you ask, "where do I put footnotes, which style shall I use?" you are asking a metadata question. The source you cite is data. The dictates of Chicago, APA, MLA, or any of the other hundreds of style guides are metadata.

Metadata is nothing more than a formal system to organize descriptions of something else. Some systems are complex and some are simple. This is true in the "traditional" print world; it is also true in the digital world of which the Web is a part. An interesting guide to Internet metadata is MetaGuide, at the Lower Saxony State- and University Library, Göttingen site, available at: http://www2.sub.uni-goettingen.de/metaguide/index.html. This is an extraordinarily useful site and is to be explored as part of this course.

For us in libbiz, the metadata "good book cookbook" is AACR2. This is probably the first and last time you will see AACR2 in this course. It is not that it is to be ignored. Some of what we will see in this course relies heavily on AACR2 and similar sets of cataloging rules. Some of what we will see completely ignores formal cataloging. What, for example, is the rule underlying spam indexing? What of AACR2 does one find in XML? The metadata assumptions for spam indexing are very informal. For markup languages they are very formal and unforgiving -- you do it right or they don't work.

We are used to picturing metadata as a MARC record, with its numbers, letters, and characters arrayed in bewildering complexity. MARC, of course, is but one example of metadata application. The populated MARC record defines a given "knowledge product" in exquisite detail. The template one populates with data is the metadata structure of that system.
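The distinction between the template and the populated record can be sketched in a few lines of Python. This is a deliberately simplified, MARC-like illustration and not real MARC encoding: it borrows four familiar field tags (100, 245, 260, 650) but omits indicators, subfield codes, and the leader, and the function names are hypothetical. The data come from the Wells et al. entry in the reading list.

```python
# Simplified, MARC-like template: field tag -> human-readable label.
# Real MARC records also carry indicators, subfields, and a leader (omitted here).
TEMPLATE = {
    "100": "Main entry, personal name",
    "245": "Title statement",
    "260": "Publication information",
    "650": "Subject, topical term",
}


def populate(template, values):
    """Return a record with one slot per template tag; reject tags the template lacks."""
    unknown = set(values) - set(template)
    if unknown:
        raise KeyError(f"tags not in template: {sorted(unknown)}")
    return {tag: values.get(tag) for tag in template}


def display(template, record):
    """List the populated fields as 'tag label: value' lines, skipping empty slots."""
    return [f"{tag} {label}: {record[tag]}"
            for tag, label in template.items()
            if record[tag] is not None]


RECORD = populate(TEMPLATE, {
    "100": "Wells, A. T.",
    "245": "The Amazing Internet Challenge: How Leading Projects Use "
           "Library Skills to Organize the Web",
    "260": "Chicago: American Library Association, 1999",
})

for line in display(TEMPLATE, RECORD):
    print(line)
```

The template is the metadata structure; the record is that structure filled in. The subject slot is left empty here because no subject heading is given in the source.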
 
"Metadata has existed under various names in the computer science and bibliographic description professions for decades, providing enough information to manage and retrieve resources such as files or books." C. F. Thomas & L. S. Griffin, "Who Will Create the Metadata for the Internet?" First Monday, http://www.firstmonday.dk/issues/issue3_12/thomas/index.html#author

Why do we use metadata? We use it for two reasons. The first is that it provides us with a consistent and coherent framework within which to capture information about information. The second, and this only works if we agree to make it work with others, is that it facilitates the transfer of that information about information to other systems. If you classify fiction works as "Q" at your house and I classify books with red covers also as "Q," while James Bond (the spy, not the ornithologist) has a whole different use for the term, we may each have useful local classification schemes, but the transferability of data between them is worse than poor.
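The "Q" problem can be sketched as follows. The two local schemes and the shared vocabulary are invented for this example, as are the function names; the point is only that the same code means different things until both parties map it to an agreed-upon set of terms.

```python
# Two invented local schemes that both use the code "Q", with different meanings.
YOUR_SCHEME = {"Q": "fiction"}
MY_SCHEME = {"Q": "red cover"}

# A shared vocabulary both parties have agreed on (also invented for this sketch).
SHARED = {"fiction": "genre:fiction", "red cover": "cover-color:red"}


def export_code(local_code, local_scheme, shared=SHARED):
    """Translate a local code into the shared vocabulary before sending it elsewhere."""
    meaning = local_scheme[local_code]
    return shared[meaning]


def same_meaning(code_a, scheme_a, code_b, scheme_b):
    """Two local codes agree only if they map to the same shared term."""
    return export_code(code_a, scheme_a) == export_code(code_b, scheme_b)


print(export_code("Q", YOUR_SCHEME))
print(export_code("Q", MY_SCHEME))
print(same_meaning("Q", YOUR_SCHEME, "Q", MY_SCHEME))
```

Comparing the raw codes would wrongly report a match; comparing the exported terms does not.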

What's to be done?

Most of the important work on author-based tools in the digital environment is focused on digital documents in general rather than on Web documents in particular. This body of work addresses the digitizing of existing "analog" material, from historical documents through contemporary ones. The approaches to cataloging these documents are varied and imaginative. They include (but are not limited to) defining the meaning of content by tagging terms according to precise meaning, describing the location of the term(s) or objects in the document, defining the purpose of the object or term within the document, and describing object or term characteristics.

Take, for example, the word "lima." Lima is the capital of Peru; it is also a city in the United States. It is also a bean and the number five in Malay. And that's not all. When we see the term in context, we usually recognize and understand the meaning of that character string. We have developed an entire genre of humor, called puns, around the multiple meanings of similarly sounding or written words. Computers aren't funny (is there a pun here?). They are not very good at parsing terms and understanding them in context. On the retrieval side, various systems have been developed to identify context. Take KWIC, key word in context, a common adjunct to return sets in various proprietary search interfaces like Lexis-Nexis or Dialog. Keywords are presented in context with surrounding text. The human searcher is presented with document options and can narrow the return set in that fashion. Computers can be and have been "taught" to "define" words in context through the coupling of terms. Consider, for example, software taught to retrieve material on automobiles coupled with brand names. How do you suppose it would classify "Lincoln rode to Gettysburg in a railroad car"? Anything in this sentence about Ford Motor Company products?
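Both ideas in the paragraph above can be sketched briefly in Python. The first function is a minimal key-word-in-context display; the second is a deliberately naive term-coupling rule that flags a sentence as automotive when a brand name and a car term occur together. The sample text, the word lists, and the function names are assumptions made for illustration and do not describe how Lexis-Nexis, Dialog, or any other product works.

```python
import re


def kwic(text, keyword, width=3):
    """Return each occurrence of `keyword` with up to `width` words of context per side."""
    words = re.findall(r"[\w'-]+", text)
    target = keyword.lower()
    lines = []
    for i, word in enumerate(words):
        if word.lower() == target:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


# Illustrative word lists for the naive coupling rule.
BRANDS = {"lincoln", "ford"}
AUTO_TERMS = {"car", "automobile"}


def looks_automotive(sentence):
    """Naive coupling: true when a brand name and a car term appear in the same sentence."""
    tokens = {t.lower() for t in re.findall(r"[\w'-]+", sentence)}
    return bool(tokens & BRANDS) and bool(tokens & AUTO_TERMS)


SAMPLE = ("Lima is the capital of Peru. "
          "She planted lima beans in the garden. "
          "In Malay, lima means five.")

for line in kwic(SAMPLE, "lima", width=2):
    print(line)

print(looks_automotive("Lincoln rode to Gettysburg in a railroad car."))
```

The KWIC lines leave the disambiguation to the human reader, who sees at once which "lima" is meant. The coupling rule answers True for the Gettysburg sentence, which is exactly the kind of misclassification the paragraph warns about.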

Search engines and human beings interpreting documents place importance on the location of information. Web documents almost automatically now carry <title>Title</title> tags. They contain headers. Terms appear at some distance from one another (adjacency). The sequence of terms can be identified - does horse come before cart or vice versa? Where in a document does the term or object first appear? Some search engines rank document relevance according to how often a term occurs and where it occurs in the document. Does the term occur in a specific field or metatag? Is it in the top third of the pages in a Web site or in the top third of the terms on a Web page? How often does the term occur?
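A toy version of frequency-and-location ranking might look like the sketch below. It uses Python's standard html.parser module to separate words in the <title> from words elsewhere on the page, then weights each occurrence of a term by where it falls. The weights (5 for the title, 2 for the top third of the page, 1 elsewhere), the sample page, and the names are arbitrary assumptions for illustration; no actual search engine's formula is being described.

```python
import re
from html.parser import HTMLParser


class PageTerms(HTMLParser):
    """Collect words from the <title> and from the rest of the page separately."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_words = []
        self.body_words = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        words = [w.lower() for w in re.findall(r"\w+", data)]
        (self.title_words if self.in_title else self.body_words).extend(words)


def score(html, term, title_weight=5.0, early_weight=2.0):
    """Score a page for `term` by counting occurrences, weighted by location."""
    parser = PageTerms()
    parser.feed(html)
    parser.close()
    term = term.lower()
    body = parser.body_words
    cutoff = len(body) / 3  # boundary of the top third of the terms on the page
    total = title_weight * parser.title_words.count(term)
    for position, word in enumerate(body):
        if word == term:
            total += early_weight if position < cutoff else 1.0
    return total


PAGE = ("<html><head><title>Horse care</title></head>"
        "<body><h1>Feeding a horse</h1>"
        "<p>Hay and oats are staples. Put the cart behind the horse.</p>"
        "</body></html>")

print(score(PAGE, "horse"), score(PAGE, "cart"))
```

On this page "horse" outranks "cart" both because it occurs more often and because it occurs in the title and near the top.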

Defining purpose includes but is not limited to such things as: is it a title, is it a statement of authority, is it organizational? A page labeled "Chapter 1" may refer to the organization of the document, but it could also be identifying the original subgroup within a larger organization. For example, that might be the alpha chapter of a fraternity. Consider the wide array of organizational elements in a typical scholarly monograph. These elements can be used to classify and retrieve information in the print as well as in the electronic environment.

Finally, object characteristics can be used to define knowledge products. This has been less important in the past but not ignored. Under AACR2 guidelines, monographs that are "non-standard" are described. The default is no description, but no description means "within limits, standard." A book with many illustrations will be noted as such, as will an oversized book or one with unusual dimensions. We can and now do describe objects other than text in the electronic world. An immense amount of work has been done to manage graphics as well as audio and video material. This work serves two related purposes. The first is to describe the object itself and thereby facilitate its retrieval. The second is to place individual objects within the context of the larger work and therefore to help define the larger work.
 

Librarians have been classifying documents and objects since there were librarians and documents and objects. In recent years information scientists and computer scientists have joined in the effort to manage information.

As ever, the field takes two approaches to managing this information. The first includes all those efforts that precede the publication and dissemination of any "knowledge product." This includes all the editing, authority control, review, index creation, and classification that these knowledge products may be subjected to. For the sake of brevity, these are labeled "author-based tools."

Once a knowledge product emerges, a whole new set of techniques and technologies is brought to bear to define and retrieve information. These are the catalogers and indexers, and now the search engines and directories, that may or may not use human and/or computer agents in the generation of their products.

Author-based tools

Author-based tools can be further subdivided into (1) "between the <head>ers" and (2) "in the body."
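As a small illustration of the "between the <head>ers" category, the following Python sketch collects the author-supplied meta tags that sit inside a page's head element and ignores anything in the body. It relies only on the standard html.parser module. The sample page, its author name, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class HeadMetadata(HTMLParser):
    """Collect name/content pairs from <meta> tags that appear inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def head_metadata(html):
    """Return the author-applied metadata found between the <head> tags."""
    parser = HeadMetadata()
    parser.feed(html)
    parser.close()
    return parser.meta


# Hypothetical page; the author and keywords are placeholders.
DOC = ('<html><head><title>Example</title>'
       '<meta name="author" content="A. Author">'
       '<meta name="keywords" content="metadata, cataloging">'
       '</head><body><meta name="ignored" content="x"><p>Text</p></body></html>')

print(head_metadata(DOC))
```

Everything this sketch returns was put there by the page's author or owner, which is what makes it an author-based tool rather than an indexer-based one.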

For general guidance, the IFLA Digital Libraries: Metadata Sources page is a definitive resource.

Author-based tools are approaches to Web document management that are author or owner applied. There are at least seven systems either in use or proposed by Web authors and owners. These are
 

In the Body

Between the <heads>

For an extensive list of specific library projects, see Michael's Website, Libraries and Electronic Archives available at: http://www.wam.umd.edu/~mlhall/library.html

Indexer-Based Approaches

Indexer-based approaches are post hoc indexes. They may result from metadata captured from electronic documents, or they may be created de novo. Why might we be interested in editing metadata pulled in from author-created indexed documents, or in creating catalog records from scratch?

Metadata Tools

See http://www.ukoln.ac.uk/metadata/software-tools/non-ukoln.html

And http://www.ifla.org/II/metadata.htm#general-indices
 

Z39.50

Z39.50 is an ANSI/NISO standard (also published as ISO 23950) that defines a protocol for searching and retrieving records across different systems, and so supports cross-platform compatibility for index and catalog records. For example, USMARC and GILS are different cataloging templates. Nevertheless, because both are Z39.50 compatible, it is possible to import and export records from one system to the other. See the Library of Congress Maintenance Agency page for International Standard Z39.50.

Cataloging

Cataloging covers many functions -- go to the catalog page

References and Assigned Readings

Lower Saxony State- and University Library, Göttingen. MetaGuide available at: http://www2.sub.uni-goettingen.de/metaguide/index.html.

C.A. Welty and J. Jenkins. Formal Ontology for Subject. n.d. Available: http://www.cs.vassar.edu/faculty/welty/subjects/subject.html

A.T. Wells, S. Calcari, and T. Koplow, The Amazing Internet Challenge: How Leading Projects Use Library Skills to Organize the Web. Chicago: American Library Association, 1999.



Assignment 3
 
