
Copyright 2000 Wallace Koehler All Rights Reserved

|Innovative Cataloging|concept|relationship|charateristic|

Characteristic Cataloging

The URL as Indicator

Stability

The University of Waterloo Scholarly Societies Project (http://www.lib.uwaterloo.ca/society/overview.html) maintains guidelines for inclusion in its catalog (http://www.lib.uwaterloo.ca/society/guidelines.html). These require that the Web site represent a scholarly society, that the site provide a range of information, and that the organization have open membership. In addition, the Project strongly prefers association URLs whose form suggests stability, which it calls canonical URLs (their term). These take the form "www.myorg.org" or "www.myorg.org.xx".

The University of Waterloo Scholarly Societies Project is also notable for a second information management policy. Until November 1998, the Project maintained an alphabetical index of society names. It has since abandoned that index in favor of a search engine limited to the site. The Project has retained its subject guide, which spans agriculture to women's issues (http://www.lib.uwaterloo.ca/society/subjects_soc.html).

Finally, the Project published a Stability Index. It is built by assigning values to the form of the URL: a canonical URL receives the value 1.0, and an association URL that does not contain the association name in any form receives zero. Variations between these two receive values based on their perceived stability. A value for each subject area is then calculated as the mean of the values of its member URLs [cite]. According to the Web site, the actual stability and predictive value of the index have yet to be tested, but a protocol has been developed and a test is planned. Library and information science associations have been accorded an index value of 61.2% and fall in the second half of the associational listings. The Project reports a pattern for all associational groupings: "professions are most likely to have permanent URLs, followed by the sciences and social sciences, and then by the humanities." [cite]
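The index calculation described above can be sketched in a few lines of Python. This is only an illustration and not the Project's own procedure. The text gives just the two endpoint values (1.0 for a canonical URL, zero for a URL without the association name), so the intermediate value of 0.5, the function names, and the sample records are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean
from urllib.parse import urlparse

# Hypothetical value for every form between the two endpoints.
# The Project grades these by perceived stability; its scale is not given here.
INTERMEDIATE_SCORE = 0.5


def url_score(url: str, short_name: str) -> float:
    """Score one association URL by its form (1.0 canonical, 0.0 no name)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.strip("/").lower()
    name = short_name.lower()
    labels = host.split(".")
    # Canonical form: www.<name>.org or www.<name>.org.xx with no file path.
    canonical = (
        len(labels) in (3, 4)
        and labels[0] == "www"
        and labels[1] == name
        and labels[2] == "org"
        and (len(labels) == 3 or len(labels[3]) == 2)
        and path == ""
    )
    if canonical:
        return 1.0
    if name not in host and name not in path:
        return 0.0
    return INTERMEDIATE_SCORE


def stability_index(records):
    """Return the mean URL score per subject area.

    records is an iterable of (subject, short_name, url) tuples.
    """
    by_subject = defaultdict(list)
    for subject, short_name, url in records:
        by_subject[subject].append(url_score(url, short_name))
    return {subject: mean(scores) for subject, scores in by_subject.items()}


# Illustrative records only; the second and third URLs are made up.
sample = [
    ("library science", "ala", "http://www.ala.org/"),
    ("library science", "xyz", "http://www.univ.edu/~smith/"),
    ("history", "abc", "http://www.host.edu/abc/index.html"),
]
index = stability_index(sample)
```

A real implementation would need the Project's full grading scale for the intermediate URL forms and a more careful test for whether the association name appears in the URL.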

The stability reported by the Scholarly Societies Project rests on two assumptions. The first is that organizations that have obtained a canonical URL are more likely to continue to exist. The second is that canonical URLs are inherently more stable than others. There are two reasons for the second assumption. The first is that once an association has adopted a URL in its most abbreviated form, it has little reason to change it (www.ala.org is as simple a URL as the American Library Association might have and still indicate identity). The second is that URLs are portable: they can be moved from host to host. The underlying IP address may change, but the URL need not. Remember too that PURLs (persistent URLs) can lend an additional element of stability.

Finally, my own research suggests that there is some truth to the stability assumptions made by the Scholarly Societies Project. Canonical or near-canonical URLs do appear to have greater lasting power than others.

Fragments

URLs consist of a series of fragments. The server level domain (SLD) URL takes the form http://aaa.bbb.ccc/. The right-most fragment (represented here as "ccc") is the top-level domain (TLD). TLDs come in two types: generic (gTLD) and country code (ccTLD). The gTLDs are .com, .edu, .gov, .mil, .net, .org and, rarely, .int. In principle these indicate the corporate source of the material (.com is private sector, .gov some level of government in the US, .mil the US military, .edu educational, .org non-profit organization, .int international organization, and .net a network entity). Although the explanatory power of the gTLD is eroding, it remains to some degree.

The same holds for ccTLDs. They draw upon ISO 3166, which defines (among other things) two-letter codes for countries and other regions of the world. Thus Australia is indicated by .au, Chile by .cl, France by .fr, Malaysia by .my, United States by .us, and South Africa by .za. Again there is some erosion, but the ccTLD still gives a good indication of the geographic source of the information.

The second fragment from the right is the second-level domain, or 2LD. 2LDs take several forms. Under some ccTLDs they serve as functional tags that parallel the gTLD nomenclature. Thus academic institutions in the United Kingdom are designated .ac.uk, Japanese organizations .or.jp, Mexican government pages .gob.mx (gob for gobierno), and so on. 2LDs under both ccTLDs and gTLDs may also carry names identifying the entity, including trademarks, initials, and names. Thus, the University of Oklahoma is identified as ou.edu.

This is a fairly simplistic explanation of domain names. But suffice it to say that these fragments can be dissected and used for cataloging purposes. They may not be precise, but they do give some sense of what the site is about and often where it originated.
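To make the dissection concrete, here is a minimal Python sketch that splits a host name into its fragments and attaches the meanings described above. The lookup tables cover only the gTLDs, ccTLDs, and functional 2LDs named in this section, and the function name describe_host and the shape of the record it returns are assumptions for illustration, not an established cataloging format.

```python
from urllib.parse import urlparse

# gTLD meanings as listed in this section.
GTLDS = {
    "com": "private sector",
    "edu": "educational",
    "gov": "US government",
    "mil": "US military",
    "net": "network entity",
    "org": "non-profit organization",
    "int": "international organization",
}

# Only the ccTLDs mentioned in this section; the full list is much longer.
CCTLDS = {
    "au": "Australia", "cl": "Chile", "fr": "France", "my": "Malaysia",
    "us": "United States", "za": "South Africa", "uk": "United Kingdom",
    "jp": "Japan", "mx": "Mexico",
}

# Functional 2LD tags under ccTLDs, again only those named in the text.
FUNCTIONAL_2LDS = {
    ("ac", "uk"): "academic",
    ("or", "jp"): "organization",
    ("gob", "mx"): "government",
}


def describe_host(url: str) -> dict:
    """Split the host name of a URL into TLD and 2LD and label them."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    tld = labels[-1]
    second = labels[-2] if len(labels) > 1 else ""
    record = {"host": host, "tld": tld, "2ld": second}
    if tld in GTLDS:
        record["tld_type"] = "gTLD"
        record["sector"] = GTLDS[tld]
    elif len(tld) == 2:
        record["tld_type"] = "ccTLD"
        record["country"] = CCTLDS.get(tld, "unknown")
        if (second, tld) in FUNCTIONAL_2LDS:
            record["sector"] = FUNCTIONAL_2LDS[(second, tld)]
    else:
        record["tld_type"] = "unknown"
    return record


oklahoma = describe_host("http://www.ou.edu/")
# www.example.ac.uk is a made-up host used only to show the ccTLD branch.
british = describe_host("http://www.example.ac.uk/")
```

As the text cautions, the labels are indicative rather than precise: a .com host need not be commercial, and a ccTLD need not reflect where the content was actually produced.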

URLs also carry file information. These files were named by the site creator and may be anything. These file names often carry with them a certain logic and it has been suggested that these names might also be captured as part of the cataloging process. Take a look at the file structure for this course. It is reflected in the site map. Is there any sense to it?

This is discussed also under URx.

For further discussion see:

Marisa Urgo, "A Shape for Internet Information: An Alternative Metaphor for Web Site Information," a paper presented at the ASIS Annual Meeting, Pittsburgh, PA, October 1998.

Wallace Koehler and Logan Barnett, "Domain Name Searching and World Wide Web Search Tactics," Searcher 6, 2 (February 1998), pp 54-62.

Wallace Koehler, "Unraveling the Issues, Actors, & Alphabet Soup of the Great Domain Name Debates," Searcher 7, 5 (May 1999). Available: http://www.infotoday.com/searcher/may99/koehler.htm

Wallace Koehler, "Classifying Websites and Webpages: The Use of Metrics and URL Characteristics as Markers," Journal of Librarianship and Information Science 31, 1 (March 1999), pp 21-31.
 

"Popularity"

The search engine Google is based on the popularity of the Web resource. Popularity here means the frequency with which a site is "hit" compared with others of similar content. Familiarity breeds not contempt but an assumption of user preference. Thus, unlike more "traditional" forms of indexing or cataloging, Google is based on use patterns rather than on other catalog or quality criteria. As an aside, the assumption that use indicates utility is not a new one. One technique used to browse card catalogs - if you can remember card catalogs - was to examine the cards themselves in the drawer. Bright shiny cards indicated new entries. Dog-eared or dirty cards indicated well-used entries, the precursor to Google.
 

Quality Indexing

There are several options for providing quality control on the Web. The peer-reviewed LASE is certainly one option. Others include usability assessment and other forms of quality assessment. For the range of options that have been explored in the literature, see the Ciolek article.

Usability

Usability might be considered a sub-class of quality indexing. Web sites and Web pages can be designed with the impact on the end user in mind. These qualities might be incorporated into the catalog record. For example, SOSIG notes download times and is ever so slightly critical of my page because it is so big and so slow to download. The usability domain is huge. There are multiple concerns: accessibility, speed, organization of information, author-based cataloging and indexing, quality of languages, and so on. Claire McInerney has developed a checklist for evaluating Web documents. Her checklist reflects some of these elements. It is not outside the realm of possibility that some or all of the criteria she identifies could be incorporated as part of the formal description of a document.

Usability is the subject of an important new book. I suggest you take a look at Jakob Nielsen's Designing Web Usability: The Practice of Simplicity before designing Web documents.
 

Temporal Indexing

As is argued in the section on stability, Web pages and sites undergo two basic changes. The first and most basic is that, with great frequency, Web pages and sites cease to exist. Sometimes this is intermittent. Second, with even greater frequency, Web pages and sites undergo content change. They change at different rates. Some change by the second; others have much greater lasting power.

I think there are three not mutually exclusive ways to handle this change from the cataloging perspective. The first is to catalog in a very general way. By that I mean, avoid specific characterization of the page or site content. Abstracts should be just that, very abstract and general.

The second option I have suggested is to capture data on the degree of change that specific Web documents undergo and to provide that information as part of their bibliographic record. This is not a particularly difficult exercise, but it does require frequent monitoring of the Web document, data capture, and repopulating of the rate-of-change metadata field in the bibliographic record. I have, for lack of a better term, labeled this change rate "omega."
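The monitoring step might be sketched as follows. This section does not give a formula for the change rate, so the measure used here (the share of consecutive captures whose content differs) is an assumption chosen for illustration and should not be read as the definition of "omega." The function names are likewise hypothetical, and the captures are passed in as strings so that the example does not depend on fetching live pages.

```python
import hashlib
from typing import Sequence


def fingerprint(content: str) -> str:
    """Return a SHA-256 digest so captures can be compared without storing them."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def change_rate(snapshots: Sequence[str]) -> float:
    """Fraction of consecutive captures of one document that differ.

    snapshots holds the document content at each monitoring visit,
    in chronological order. This measure is an illustrative assumption.
    """
    if len(snapshots) < 2:
        raise ValueError("at least two snapshots are needed to observe change")
    prints = [fingerprint(s) for s in snapshots]
    changes = sum(1 for earlier, later in zip(prints, prints[1:]) if earlier != later)
    return changes / (len(prints) - 1)


# Five made-up weekly captures of the same page: it changed twice.
weekly = ["version A", "version A", "version B", "version B", "version C"]
rate = change_rate(weekly)
```

The resulting value could be written to the rate-of-change field each time the document is revisited. Note that a digest comparison registers any change at all, including trivial ones such as a date stamp, so a cataloger might prefer a measure that weighs how much of the content changed.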

The third option is to perform extensive recataloging as frequently as the Web document changes or based on some other change criterion. Further research is needed to establish what those other change criteria are and whether it is feasible to follow that strategy.


