
Copyright 2000 Wallace Koehler All Rights Reserved

|Innovative Cataloging|concept|relationship|charateristic|

Concepts

Universal and Faceted Classification Schemes

Faceted and universal classification systems might be considered opposite sides of the same information classification/retrieval coin. Faceted classification, also known as colon classification, was first developed by S.R. Ranganathan. According to Brian Vickery (1960, p. 12), it is "the sorting of terms in a given field of knowledge into homogeneous, mutually exclusive facets, each derived from the parent universe by a single characteristic of division." Universal classification schemes, on the other hand, tend to classify works by a single characteristic. Thus a book on forensic medicine might be classified under medicine only, while its legal aspects go unrecognized. Colon or faceted classification schemes would recognize both sets of elements.

Grounded theory is concerned with finding common traits among diverse concepts through a positive feedback loop of empirical testing, evaluation, and retesting. It is necessarily iterative and elastic. It is therefore less phenomenological (particularistic) than synthetic (a combination of elements).

Faceted classification systems identify the individual characteristics of various concepts and systems and recombine them synthetically to create somewhat fluid and iterative definitions and classifications of information. Susan Leigh Star argues that grounded theory and faceted classification are particularly important concepts for information definition and therefore retrieval. In an electronic environment, the objections to the approach lose much of their force.
 
What classification schemes have been developed for the WWW?

What do they have in common?

Are universal or faceted schemes more or less useful?

Are standard librarian classifications useful for/on the Web?

What is "grounded theory" and is it useful?

Chain Indexing

A number of on-line catalogs, Yahoo! among them, employ chain indexing. Chain indexing is a tree structure that refines categories from the more general to the more specific. At each level of specificity, the sub-classes of the super class are provided. This is therefore taxonomic classification. Moreover, chain indexing can support classification based on facets and not only upon single concepts. Thus, a text on forensic medicine might be classified as medicine:pathology:legal:criminal. Following Ranganathan, we might substitute numbers for words so that the classification scheme could cross linguistic boundaries with a minimum of translation.
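As a rough sketch of that last idea, the following Python fragment replaces the words in a chain with numbers and then renders the same chain in a second language. The code table, its numbers, and the German labels are invented for illustration; they are not taken from Colon Classification or any other published scheme.

```python
# Hypothetical code table: the numbers are invented for this example and
# do not come from Colon Classification or any published scheme.
CODES = {"medicine": 1, "pathology": 12, "legal": 3, "criminal": 31}

# Label tables keyed by the same numbers, one per language (illustrative).
LABELS_EN = {code: term for term, code in CODES.items()}
LABELS_DE = {1: "Medizin", 12: "Pathologie", 3: "Recht", 31: "Strafrecht"}


def encode(chain: str) -> str:
    """Turn a colon-separated chain of terms into a chain of numbers."""
    return ":".join(str(CODES[term]) for term in chain.split(":"))


def decode(notation: str, labels: dict) -> str:
    """Render a numeric chain using the labels of one language."""
    return ":".join(labels[int(part)] for part in notation.split(":"))


notation = encode("medicine:pathology:legal:criminal")
print(notation)                     # 1:12:3:31
print(decode(notation, LABELS_DE))  # Medizin:Pathologie:Recht:Strafrecht
```

Only the label tables need translating; the numeric chain itself stays the same from one language to the next.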

Let us take, for example, the countries of the world and work our way down to my office.
 
 
 We could start with the category: All Countries
  All Countries could be subdivided by Region: [World -- North America]. 
  This can be further reduced to Country: [World -- North America -- United States]. 
  By political subdivision: [World -- North America -- United States -- Oklahoma]. 
  By city: [World -- North America -- United States -- Oklahoma -- Norman]. 
  By street address: [World -- North America -- United States -- Oklahoma -- Norman -- 401 West Brooks]. 
  And finally office number: [World -- North America -- United States -- Oklahoma -- Norman -- 401 West Brooks -- Room 23]. 

[Notes and explanations are usually somewhere else, but that is not too bad with hypertext. Follow the links for explanations of the classes and categories.]

At any point we could select all subclasses within the general class. For example, if we were stumped at "North America," we could see a list of all countries that meet the definition: Canada, Greenland, Mexico, and the United States. Each sub-group could be further sub-divided. Cities in Oklahoma could be grouped alphabetically, by SE, SW, NE, NW quadrants, by distance from Oklahoma City, or by proximity to I-35, I-40, I-44, and so on.
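The ability to list all subclasses at any point in the chain can be sketched with a nested Python dictionary. This is only an illustration under simple assumptions: the tree holds just the fragment discussed above, and the function name subclasses is hypothetical, not a feature of any Web directory.

```python
# Illustrative fragment of the chain; empty dicts are left unexpanded.
TREE = {
    "World": {
        "North America": {
            "Canada": {},
            "Greenland": {},
            "Mexico": {},
            "United States": {
                "Oklahoma": {
                    "Norman": {"401 West Brooks": {"Room 23": {}}},
                },
            },
        },
    },
}


def subclasses(tree: dict, chain: list) -> list:
    """Follow a chain from the root and list the subclasses at its end."""
    node = tree
    for link in chain:
        node = node[link]  # raises KeyError if the chain is not in the index
    return sorted(node)


print(subclasses(TREE, ["World", "North America"]))
# ['Canada', 'Greenland', 'Mexico', 'United States']
```

Alternative groupings (by quadrant, by distance, by highway) would amount to building a different tree over the same places.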

Herein lies part of the rub. The chain categories used on the Web are non-standard, and each is designed on an ad hoc basis. Some are less "ad hoc" than others.

Studies have shown that consistency both between and within indexers and catalogers is poor, even for highly trained indexers and catalogers using well-defined guides, classification schemes, and thesauri. This means that two catalogers cataloging the same thing will probably not do it the same way. Moreover, the same cataloger cataloging the same resource at two different times will also probably do it differently. Some Web-based services are maintained by trained catalogers (again, Yahoo!). Others evolve haphazardly.

Librarians are trained to understand and appreciate that one search interface is different from another. For example, the search protocols that underlie Dialog and Lexis-Nexis are very different. Even within Dialog, search protocols differ from one database to another. We know this, and we adjust for it. Moreover, the differences are well documented by the search services. Dialog's Bluesheets tell us what is there and how to get at it. Lexis provides all kinds of help.

The Web poses two problems. First, the various Web-based search services are notorious for their lack of documentation. Second, most people who use the search services assume that there is some sort of inter-service as well as intra-service similarity. Too often that just isn't so.
 
All Countries: We know my office is somewhere on the planet, but hey, who knows where?
Region: Too many countries (more than 200 now), but not so many regions.
Country: Canada, Greenland, Mexico, and the United States.
Sub-Division: 50 of them, more if you count Puerto Rico, Guam, the Virgin Islands, military bases abroad, and embassies.
Cities: Lots of choices.
Street address: Still more choices.
Office: It's only one building.

Key Word/Full Text Indexing

There are many ways to assemble an index. Three sets of related terms define those assembly methods. Each approach has distinct theoretical advantages and disadvantages. In the Web environment, some of these systems may function better than others.

There are also a number of other important functions that search systems can offer in support of information retrieval. These include proximity searching, truncation, and temporality. Proximity searching permits the user to specify the distance and sometimes the order of relationship between search terms. If one were searching, for example, for the US President's residence, many search engines would return an inordinate number of totally irrelevant hits for the request "white AND house." However, if one could specify "white AND house -- no intervening terms and in this order," the return set would be smaller and more useful. Many engines will do this, although syntax and command vocabulary differ.
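The difference between a plain "white AND house" and a proximity request can be sketched in a few lines of Python. The function name near and its max_gap and ordered parameters are assumptions made for this example; they do not reproduce the syntax of any actual search engine.

```python
import re


def tokens(text: str) -> list:
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def near(text: str, first: str, second: str,
         max_gap: int = 0, ordered: bool = True) -> bool:
    """True if the two terms occur with at most max_gap words between them.

    With ordered=True, `second` must also come after `first`.
    """
    words = tokens(text)
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    for i in pos_first:
        for j in pos_second:
            if i == j:
                continue
            gap = abs(j - i) - 1  # number of intervening words
            if gap <= max_gap and (j > i or not ordered):
                return True
    return False


print(near("The White House issued a statement.", "white", "house"))   # True
print(near("A white fence surrounds the house.", "white", "house"))    # False
print(near("She painted her house white.", "white", "house"))          # False
```

The second and third sentences would both satisfy a Boolean "white AND house," which is exactly the noise that proximity operators are meant to remove.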

Truncation (we will use the term generically to cover both pre- and post-truncation as well as substitution) allows the searcher to specify variations in tense, spelling, and number without multiple term entry. These include American and British spelling (color and colour), number (house - houses, man - men, mouse - mice), and tense (has - had). Again, the abilities, syntax, and command languages vary.
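As a small illustration, truncation and substitution can be imitated with shell-style wildcards from Python's standard fnmatch module, where * stands for any run of characters and ? for exactly one. The wildcard symbols and the tiny vocabulary are assumptions for this sketch; each search service defines its own symbols.

```python
from fnmatch import fnmatchcase

# A tiny illustrative vocabulary of index terms.
VOCABULARY = ["color", "colour", "house", "houses", "housing",
              "man", "men", "mouse", "woman", "women"]


def expand(pattern: str, vocabulary: list) -> list:
    """Return every index term that matches a wildcard pattern."""
    return [term for term in vocabulary if fnmatchcase(term, pattern)]


print(expand("hous*", VOCABULARY))   # post-truncation: ['house', 'houses', 'housing']
print(expand("*men", VOCABULARY))    # pre-truncation:  ['men', 'women']
print(expand("m?n", VOCABULARY))     # substitution:    ['man', 'men']
print(expand("colo*r", VOCABULARY))  # spelling:        ['color', 'colour']
```

Note that "hous*" also pulls in "housing," which may or may not be wanted; truncation trades precision for recall.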

Temporality allows the user to specify time ranges for retrieval. For example, one could specify material published between one date and another or from yesterday to today, or just everything. Once again, the abilities, syntax, and command languages vary.
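A date-range limit of this kind might look like the following Python sketch. The records and the function name published_between are invented for the example; a bound left as None means that side of the range is open, so omitting both returns "just everything."

```python
from datetime import date

# Invented records: (title, publication date).
RECORDS = [
    ("Report A", date(1998, 5, 1)),
    ("Report B", date(1999, 11, 15)),
    ("Report C", date(2000, 3, 30)),
]


def published_between(records, start=None, end=None):
    """Titles dated within [start, end]; a bound of None leaves that side open."""
    return [
        title
        for title, published in records
        if (start is None or published >= start)
        and (end is None or published <= end)
    ]


print(published_between(RECORDS, date(1999, 1, 1), date(1999, 12, 31)))
# ['Report B']
print(published_between(RECORDS, start=date(1999, 6, 1)))
# ['Report B', 'Report C']
```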

 Proximity searching, truncation, and temporality are "bells and whistles," albeit very nice bells and whistles. The discussion that follows addresses fundamental theoretical and practical considerations in the building of indexes and their ability to deliver.
 

1. KeyWord Full Text/Fielded Search

Key word retrieval may be either pre- or postcoordinate. In key word retrieval systems, terms are derived either from the document text or from an authoritative thesaurus or similar resource. These terms are then applied to the bibliographic representation of the document. In full text retrieval, the text itself is searched and all terms found in the document are indexed (with the exception of "stop" words). Web directories are key word systems. Web search engines are sometimes full text and sometimes limited to text found in specific fields (e.g. the <title> field).

A number of search engines support fielded searches. It is possible to specify that only index pages or title fields be searched. It may also be possible to specify that only certain portions of a document be searched -- first paragraphs, concluding sentences, etc. The assumption is that the most important information and the document "aboutness" are most often found in specific locations. This can be particularly important when doing full text searches, for the likelihood of the inclusion of less relevant material increases as less important portions of the document are included.
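The effect of restricting a search to particular fields can be sketched as follows. The two pages, their field names, and the search function are hypothetical and serve only to show how a title-only search drops a passing mention in the body.

```python
import re

# Two invented pages, each split into fields.
PAGES = [
    {"url": "page1", "title": "Forensic Medicine",
     "body": "Pathology and the law of evidence."},
    {"url": "page2", "title": "Garden Notes",
     "body": "An aside on forensic botany, then on to roses."},
]


def search(pages, term, fields=("title", "body")):
    """Return the URLs of pages whose chosen fields contain the term as a word."""
    term = term.lower()
    hits = []
    for page in pages:
        words = set()
        for field in fields:
            words.update(re.findall(r"\w+", page[field].lower()))
        if term in words:
            hits.append(page["url"])
    return hits


print(search(PAGES, "forensic"))                     # ['page1', 'page2']
print(search(PAGES, "forensic", fields=("title",)))  # ['page1']
```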

2. Precoordination and postcoordination.

Coordination means the building or assembling of (usually subject) search terms.

In a precoordinate system, the record or document creator designates the search terms from an established authoritative source. Library of Congress Subject Headings (LCSH) is one such source. An important distinguishing characteristic is that concepts are usually represented by a single term drawn from the authoritative source.

In a postcoordinate system, the information creator may create lists of terms or in the case of full-text presentation may allow the text itself to generate search terms. Multiple terms representing concepts may be acceptable (as is the case with the Art & Architecture Thesaurus (AAT)). It is postcoordinate because the end user selects search terms.

3. Vocabulary Control

"Uncontrolled Vocabulary"

Uncontrolled vocabulary is most often associated with postcoordinate full text searching. Key word indexing is by far the most common form of Web index. It is the basis of most of the portal-based "commercial" search engines. Most of these build inverted indexes following crawls of the WWW by spiders or robots. They identify character strings -- words and other strings -- and build indexes. In principle, these vary little from the index at the back of a book.
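A minimal sketch of an inverted index, assuming the crawl has already delivered the page texts, might look like this in Python. The page names, the texts, and the very short stop word list are invented; real engines add ranking, positions, and much else.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "of", "the", "to"}  # tiny illustrative list

# Invented crawl results: page identifier -> page text.
PAGES = {
    "page1": "The white house",
    "page2": "A house painted white and green",
    "page3": "The green mouse",
}


def build_index(pages: dict) -> dict:
    """Map each word (stop words excepted) to the set of pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(url)
    return index


def lookup(index: dict, *terms: str) -> list:
    """Boolean AND: pages that contain every one of the terms."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_index(PAGES)
print(lookup(index, "white", "house"))  # ['page1', 'page2']
print(lookup(index, "green"))           # ['page2', 'page3']
```

As with the index at the back of a book, the search consults the list of words, not the pages themselves.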

What these spiders or robots crawl matters. Some crawl and index "everything" in a Web document. Others limit themselves to header material or just the title field. Author-based indexing tools like Dublin Core and metatags provide data for the search engines to find.

Controlled Vocabulary

Controlled vocabulary is most often associated with precoordinate keyword searching. The BUBL LINK / 5:15 Catalogue of Internet Resources is an interesting use of key words and classification schemes. BUBL, originally based on Dewey Decimal Classification (DDC), has added a new interface: a subject tree of academic subjects, further subdivided by DDC subclasses. It seeks to include a minimum of five and up to fifteen "relevant resources" for each subject area (hence the 5:15). The top-level subject tree is described as derived from the Library of Congress Classification but adapted. While the database is small, it utilizes a controlled vocabulary.

CyberStacks(sm), Available: http://www.public.iastate.edu/~CYBERSTACKS/ Gerry McKiernan's (Iowa State University Curator) catalog of Internet resources.
     "CyberStacks(sm) is a centralized, integrated, and unified collection of significant World Wide Web (WWW) and other Internet resources categorized using the Library of Congress classification scheme."

CyberDewey, Available: http://ivory.lm.com/~mundie/DDHC/CyberDewey.html

Internet Public Library (http://www.ipl.org/) uses a variety of mechanisms to assist information retrieval. These include graphical user interfaces (GUI), pathfinders, formal and informal ("teen collection: dating & stuff") subject classification, and FAQs and frequently asked reference questions (FARQ?).
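To make the idea of a controlled vocabulary concrete, here is a small Python sketch in which variant entry terms are redirected to one preferred term. The thesaurus entries are illustrative only and have not been checked against LCSH or any other published list; the function name normalize is likewise hypothetical.

```python
# Hypothetical thesaurus: entry term -> preferred term (illustrative only).
USE = {
    "movies": "motion pictures",
    "films": "motion pictures",
    "cinema": "motion pictures",
    "cars": "automobiles",
    "autos": "automobiles",
}
PREFERRED = set(USE.values())


def normalize(term: str):
    """Return the preferred term, or None if the term is outside the vocabulary."""
    term = term.strip().lower()
    if term in PREFERRED:
        return term
    return USE.get(term)


print(normalize("Films"))            # motion pictures
print(normalize("motion pictures"))  # motion pictures
print(normalize("zeppelins"))        # None
```

Because every variant leads to the same term, a search on any of them retrieves the same set of records, which is the practical benefit of vocabulary control.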
 

Hybrids

Some search systems offer a combination of approaches. SOSIG, for example, consists of an "Internet Catalogue" of resources selected and indexed by subject experts as well as its "Social Science Search Engine." SOSIG recognizes differences in quality between its catalog and its engine. The catalog may have far fewer entries than the engine, but rigorous quality control can be applied to the former that cannot be maintained for the latter. SOSIG, however, limits its "harvester" to links from Web documents already part of its catalog, thus bringing a degree of quality control to the process. The SOSIG site contains an interesting "product warning" for the search engine:
"However, we cannot guarantee the quality of the resources as we can for the SOSIG Internet Catalogue. Hence we recommend that you only use the Social Science Search Engine if you are not finding a sufficient number of Web resources from your searches on the SOSIG Internet Catalogue itself." Source: http://sosig.esrc.bristol.ac.uk/help/harvester.html
SOSIG employs a straightforward metadata format. The following is the record for my ethics page:
 

                [Title]Ethics Links to Librarian and Information Manager Associations WWW
                Pages
             Description:  Produced by Wallace Koehler, Assistant Professor at the School
                of Library and Information Studies, University of Oklahoma, this site provides a
                set of links to Codes of Ethics and Standards of Practice that have been
                published on the WWW by library and information science related professional
                associations. The site is divided into four sections: ethics pages, mission
                statements, homepages of associations which do not publish ethics or mission
                statements online, and other related links. Each link has a brief description of
                what can be found at that site. This Web site actually consists of a single, rather
                long page, which may consequently be slow to download.
             Keywords:  ethics, professional standards, mission statements, librarians,
                computer personnel, information professional associations
             Classification Scheme: UDC
             Classification Number(s): 174
             Subject Section(s):  Professional and Business Ethics
             Resource Type: Resource Guides
             Admin Name: Wallace Koehler
             Admin Email: wkoehler@ou.edu
             Language:  en
             URL:   http://www.ou.edu/cas/slis/ethics/EthicsBibOrg.htm

I have added emphasis to the metadata field titles. Note too that the title field tag is implicit.
 

Fuzzy Searches

Fuzzy searches attempt to manage "aboutness." Fuzziness is a quest for "similar to" rather than "exact." Fuzzy searches employ algorithms that seek to identify associations based on relevance and rankings. This may be done either alone or in combination with feedback from the user or with probabilistic prognostications. These result in maps of "likeness." The "more like this" searches are examples of user feedback in the selection of relevant (and sometimes pertinent) information clusters.

Fuzzy searches are often presented through graphical user interfaces (GUIs) which seek to demonstrate the relationships among information vectors.
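One very simple way to approximate "more like this" is to rank documents by the share of words they have in common with a document the user has picked. The sketch below uses the Jaccard coefficient for that purpose; it is a stand-in chosen for brevity, not the algorithm of any particular search service, and the documents and function names are invented.

```python
# Invented mini-collection: document identifier -> text.
DOCS = {
    "a": "library ethics codes",
    "b": "codes of ethics for library associations",
    "c": "oklahoma highway map",
    "d": "professional ethics",
}


def jaccard(first: set, second: set) -> float:
    """Shared words divided by all distinct words in the two sets."""
    union = first | second
    return len(first & second) / len(union) if union else 0.0


def more_like_this(seed: str, docs: dict, top: int = 2) -> list:
    """Rank the other documents by word overlap with the seed document."""
    seed_words = set(docs[seed].lower().split())
    scores = [
        (jaccard(seed_words, set(text.lower().split())), name)
        for name, text in docs.items()
        if name != seed
    ]
    ranked = sorted(scores, reverse=True)[:top]
    return [name for score, name in ranked if score > 0]


print(more_like_this("a", DOCS))  # ['b', 'd']
```

Document "b" shares three of six distinct words with the seed (0.5) and "d" one of four (0.25), while "c" shares none and is dropped. The result is a ranking by likeness rather than an exact match.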


