UR×

We are all familiar with Uniform Resource Locators (URLs). Three other URx's have been proposed to help manage the WWW: the URC, the URN, and the URI. The MARC 856 field is a related subject but is addressed elsewhere in the course. The MARC 856 field carries cataloged information about the URL and, where appropriate, the URN. It should be noted that while the URL is a fully accepted standard, the other URx's have not met with the same degree of success or acceptance.

URL

URLs are the de facto standard by which one addresses Web documents. They take the general form: http://aaa.bbb.ccc. The actual network address of the server that holds a Web document is its IP (Internet Protocol) number. In effect, the server name in a URL is a proxy for that IP number. Put simply, when one enters a URL in a browser, the server name is translated by a domain name server into its IP number. It is that IP number that "locates" the server, which then "returns" the desired Web document to the requesting computer for presentation to the end user.

The "http" part of the address, the hypertext transfer protocol, indicates that the requested resource is of a certain sort and specifies the means of transfer. There are other protocols in common use. Among them are smtp (simple mail transfer protocol), gopher, telnet, and ftp (file transfer protocol).
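As a minimal sketch of the two points above, the protocol and the server name can be separated from a URL with Python's standard urllib.parse module. The helper names split_url and lookup_ip are illustrative assumptions, not part of the course material. The lookup needs a live network connection, so it is defined here but not run.

```python
import socket
from urllib.parse import urlsplit


def split_url(url):
    """Return the transfer protocol and the server name carried by a URL."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname


def lookup_ip(hostname):
    """Ask the local resolver for the IP number behind a server name.

    Requires network access, so it is defined here but not called.
    """
    return socket.gethostbyname(hostname)


protocol, server = split_url("http://www.ou.edu/cas/slis")
print(protocol)  # http
print(server)    # www.ou.edu
```

Calling lookup_ip(server) on a connected machine would return whatever IP number the name currently resolves to. That value can change over time, which is why none is shown here.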

Two major problems of importance to the bibliometric community have been identified with URLs.

A Web site URL may become nonfunctional for a variety of reasons. A Web site or page may be removed or renamed by its author, moved to a new computer, recast using different HTML software and restructured, made accessible through a different port, and so on. It is not uncommon for a Web site or page to be moved to a new server. When that is done, the migrated Web document may contain precisely the same content but will be inaccessible using the old URL.

As we have seen, identifying metadata (author, language, character set, format, size, rights, keywords, abstract) may be located in the document's HTML code [1] [2]. But the URL itself carries none of that information.
 

URL Fragments

URLs may not carry the metadata referred to above, but they do contain a wealth of information. URLs may be "defragmented" and interpreted. As we have already seen, the leftmost code in a URL indicates the transfer protocol.

Second, reading the URL from right to left, are domain markers. The rightmost fragment of the domain name is the top level domain (TLD). There are two types of TLDs: gTLDs and ccTLDs. The gTLDs or generic top level domains are those which carry the .com, .edu, .gov, .int, .mil, .net, and .org tags. These gTLDs segregate the registered Web sites according to the general class of their owners: .com sites are registered by private commercial or economic enterprises, .org by not-for-profit enterprises, .net by network entities, .gov by US government agencies, .mil by the US military, .int by international governmental agencies, and .edu by educational institutions. Or at least that is the theory. Distinctions among .com, .org, and .net have become blurred as competition for "desirable" second level domains, particularly for identifiable trademarks, has increased [3].

Country code top level domains or ccTLDs identify the country of origin or, in some cases now, merely the domain name registry of convenience, again because of pressures for valuable trademark presence on the WWW [3]. ccTLDs are two-letter codes based on ISO 3166. ISO 3166 is an International Organization for Standardization standard that provides codes to identify countries and other political subdivisions. ISO 3166 includes two-letter, three-letter, and numeric indicators. We are only concerned with the two-letter codes. Examples include .ar for Argentina, .au for Australia, .ca for Canada, .cn for the People's Republic of China, .fr for France, .ru for Russia, .us for the United States, and .za for South Africa. The full list can be found in many places, among them on the HOTBOT search engine site or at http://www.purduenc.edu/ce/infofile/country.html

The second-level domain or 2LD may take two forms. For some ccTLDs, the 2LD is a functional tag in the same way a gTLD is. Thus, educational institutions in the United Kingdom are marked ac.uk, organizations in Japan are or.jp, and co.cr marks commercial entities in Costa Rica. There is wide variation in this practice. In the United States, for example, the geographic character of the domain name can extend to the third level: sf.ca.us can designate San Francisco.

The second approach is to provide a name, trademark, or other discriminator at the 2LD. Examples are www.ou.edu, where "ou" stands for the University of Oklahoma and "edu" for an educational institution, and www.presidencia.gob.mx, where "presidencia" indicates the presidency, a Web site on the government (gob for gobierno) server of Mexico.

There is a logic to the domain naming structure. For a fairly straightforward but slightly dated explanation of all this, see Koehler and Barnett [2]. That logic can be used to classify Web sites, albeit imperfectly. Each of the fragments can be parsed from the URL and used to catalog a document.
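The parsing described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a cataloging tool: the function name parse_domain is hypothetical, the country table holds only the ccTLDs mentioned in the text (not the full ISO 3166 list), and the 2LD is taken to be simply the label immediately to the left of the TLD.

```python
from urllib.parse import urlsplit

# The gTLDs named in the text.
GENERIC_TLDS = {"com", "edu", "gov", "int", "mil", "net", "org"}

# Only the ccTLDs mentioned in the text; the real list is far longer.
SAMPLE_CCTLDS = {
    "ar": "Argentina", "au": "Australia", "ca": "Canada", "cn": "China",
    "cr": "Costa Rica", "fr": "France", "jp": "Japan", "mx": "Mexico",
    "ru": "Russia", "uk": "United Kingdom", "us": "United States",
    "za": "South Africa",
}


def parse_domain(url):
    """Split a URL into protocol, TLD, TLD type, and second-level domain."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    tld = labels[-1]
    if tld in GENERIC_TLDS:
        tld_type = "gTLD"
    elif tld in SAMPLE_CCTLDS:
        tld_type = "ccTLD"
    else:
        tld_type = "unknown"
    return {
        "protocol": parts.scheme,
        "tld": tld,
        "tld_type": tld_type,
        "country": SAMPLE_CCTLDS.get(tld),  # None for gTLDs
        "second_level": labels[-2] if len(labels) > 1 else None,
    }


info = parse_domain("http://www.presidencia.gob.mx")
print(info["tld"], info["tld_type"], info["second_level"])  # mx ccTLD gob

info = parse_domain("http://www.ou.edu/cas/slis")
print(info["tld"], info["tld_type"], info["second_level"])  # edu gTLD ou
```

As the text warns, the classification is imperfect: the sketch cannot tell whether a 2LD such as "gob" is a functional tag or a name, and it says nothing about registries of convenience.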

See also the discussion under cataloging.

Directory Structure

There is more to a URL than the server-level domain name (SLD, or what lies between the set of double forward slashes and ends, inclusively, with the TLD). URLs also carry the directory location of the Web page in question. We are addressing here everything that follows the first single forward slash. Take this page for example. It is: /Characteristics/URX.htm. Hierarchy and organization have been expressed. There are of course no rules, but as Marisa Urgo [1] has argued, these data can also be parsed and used as author-supplied descriptor terms.
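A short Python sketch shows one way the directory portion might be turned into candidate descriptor terms. The function name path_terms is an assumption made for illustration, and stripping the file extension from the last segment is a simplification of my own, not something prescribed by Urgo [1].

```python
import posixpath
from urllib.parse import urlsplit


def path_terms(url):
    """Return the directory and file names of a URL as a list of terms."""
    path = urlsplit(url).path
    # Keep the non-empty pieces between forward slashes.
    segments = [piece for piece in path.split("/") if piece]
    if segments:
        # Drop a file extension such as ".htm" from the last segment.
        segments[-1] = posixpath.splitext(segments[-1])[0]
    return segments


print(path_terms("/Characteristics/URX.htm"))  # ['Characteristics', 'URX']
```

The order of the list preserves the hierarchy the author expressed, reading from the top directory down to the page itself.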

Nothing is easy. One of the leading indexers of our day, Bella Hass Weinberg [4], warns us first that "there is nothing new under sun," but also that the road to indexing is fraught with difficulties. She is, of course, right. But every little bit helps. If we parse from URLs all information useful for cataloging and indexing purposes (and we can), however non-specific, I suggest that it is better than nothing.

URN

Uniform Resource Names (URNs) have been developed to provide Web resources with persistent or permanent names. They thus differ from URLs, which change not with the resource but with its address. An analogy would be identifying someone according to his or her home address. I might be known as 401 West Main, Norman, Oklahoma, 73019, office 23. That would be my URL. My "URN" might be my name, but to achieve greater uniqueness, it might better be considered my social security number (SSN). Thus, no matter where I might have an address, I would have the same SSN. And (we hope) no one else would share that number.

URN-based applications include the DOI and handles as well as PURLs.
 

URC

The Uniform Resource Characteristic (URC) is a package of URN, URL, and other identifiers for a resource. The purpose of the URC is to provide enough bundled information to create an identification block for any given resource. URCs come in several "flavors": the whois++URC, the trivial URC, the SGML URC, as well as other proposals.

The whois++URC is a bundle of at least two attributes and values. These attributes and values would include the unique URN, single or multiple URLs (for documents can be and are mirrored at more than one address), and various other characteristics such as content-related data: length, type, language, and other header data.

The trivial URC eliminates the content characteristics from the URC bundle, leaving the unique URN and the single or multiple URLs. Thus to press the analogy, a trivial URC for me could be my SSN as URN, and my home and office addresses as URLs.
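The difference between the fuller bundle and the trivial URC can be pictured as a small data structure. The Python sketch below is only an illustration: the class name URC, the method trivial, the URN string, and the characteristic values are all hypothetical and are not drawn from any URC proposal.

```python
from dataclasses import dataclass, field


@dataclass
class URC:
    """A bundle of one URN, one or more URLs, and optional content data."""
    urn: str
    urls: list
    characteristics: dict = field(default_factory=dict)

    def trivial(self):
        """Return the trivial form: same URN and URLs, no content data."""
        return URC(urn=self.urn, urls=list(self.urls))


full = URC(
    urn="urn:example:metadata-made-easy",  # hypothetical name
    urls=["http://www.ou.edu/cas/slis", "http://www.ou.edu.cn/cas/slis"],
    characteristics={"language": "en", "type": "text/html"},  # illustrative
)

bare = full.trivial()
print(bare.urn)              # urn:example:metadata-made-easy
print(len(bare.urls))        # 2
print(bare.characteristics)  # {}
```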

The SGML URC builds on the trivial URC by adding an SGML DTD to the bundle. This translates into a URC that describes document-like entities, and these resemble Dublin Core metatags. The URC, placed between the <head> and </head> tags, might take the following form. Developers note that the DTD is kept simple to keep parsing simple, and therefore to keep interpretation and the writing of interpreters simple.

<urc>  <!-- activates the URC -->
  <urn>urn object name</urn>  <!-- provides the unique object name -->
  <author>Koehler, Wallace</author>  <!-- gives a DC-like author name -->
  <author type="email">wkoehler@ou.edu</author>  <!-- alternative author id -->
  <title>Metadata made easy</title>  <!-- page/document title -->
  <subject scheme="abstract">  <!-- what is it about -->
    Metadata can be used for fun and profit. Interpreting it results in healthier
    children. Once used, you can feed it to your pets.
  </subject>
  <instance>  <!-- describing some modifiers -->
    <coverage>A note on metadata</coverage>  <!-- what it covers -->
    <url>http://www.ou.edu/cas/slis</url>  <!-- site URL -->
  </instance>
  <instance>
    <coverage>A mirror site in China</coverage>  <!-- mirror site for document -->
    <url>http://www.ou.edu.cn/cas/slis</url>  <!-- mirror site URL for document -->
  </instance>
</urc>  <!-- the end -->

Note the repetition to define each separate or unique qualifier.
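To suggest why a simple DTD keeps interpretation simple, here is a hedged Python sketch that reads a cut-down copy of the URC block above with the standard html.parser module. The class name URCParser is hypothetical, the sample omits the explanatory notes, and a real interpreter would follow whatever DTD the URC proposal finally specified.

```python
from html.parser import HTMLParser

# A cut-down copy of the example URC, without the explanatory notes.
SAMPLE_URC = """
<urc>
<urn>urn object name</urn>
<author>Koehler, Wallace</author>
<instance>
<coverage>A note on metadata</coverage>
<url>http://www.ou.edu/cas/slis</url>
</instance>
<instance>
<coverage>A mirror site in China</coverage>
<url>http://www.ou.edu.cn/cas/slis</url>
</instance>
</urc>
"""


class URCParser(HTMLParser):
    """Collect the URN and the coverage/URL pair of each instance."""

    def __init__(self):
        super().__init__()
        self.current = None   # name of the tag being read, if any
        self.urn = None
        self.instances = []   # one dictionary per <instance>

    def handle_starttag(self, tag, attrs):
        self.current = tag
        if tag == "instance":
            self.instances.append({})

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current == "urn":
            self.urn = text
        elif self.current in ("coverage", "url") and self.instances:
            self.instances[-1][self.current] = text


parser = URCParser()
parser.feed(SAMPLE_URC)
parser.close()

print(parser.urn)  # urn object name
for instance in parser.instances:
    print(instance["coverage"], "->", instance["url"])
```

Each repeated instance block yields its own dictionary, which is how the repetition noted above lets an interpreter keep the original site and its mirror apart.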
 
 
 

URI

Uniform Resource Identifiers (URIs) are in development. URIs are defined as ASCII-based character-string network protocols that represent communications streams. Internationalized Uniform Resource Identifiers (IURIs) are a proposed variation on URIs. They are not ASCII-based but employ the Universal Character Set (UCS) instead, permitting them to support a wider range of non-Roman-character languages.

URIs can be considered URx metadata: their defined syntax strings point to and incorporate the URL, URC, and URN [5]. Resources (the "R" in URx) are anything we can identify, point to, or know about. For us as librarians and information scientists, resources include books, articles, Web documents, a sentence on a page, a serial, comic books, recordings of TV programs, music CDs, and so on. Identifiers are proxies for resources. A human being (resource) is labeled by a name (identifier), books are known by ISBN codes, and Peacock's article is identified here by the number five [5].

The value of the URI lies in its ability to translate identity from one of its constituent parts to another. A resource identified within the URI context, say through its URL, imparts identity to its URN and URC. Thus, it is possible to retain identity even when a URL fails because the URI has also "tagged" the URN and the URC.
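The idea of retaining identity after a URL fails can be sketched as a lookup that goes from the failed URL to its URN and from there to any other URLs recorded for the same resource. Everything in the Python sketch below is an assumption made for illustration: the registry, the URN string, and the function names are hypothetical and do not describe any actual URI resolution service.

```python
# Hypothetical registry: one persistent URN mapped to its known URLs.
REGISTRY = {
    "urn:example:slis-home": [
        "http://www.ou.edu/cas/slis",
        "http://www.ou.edu.cn/cas/slis",
    ],
}


def urn_for(url, registry):
    """Return the URN under which a URL is registered, or None."""
    for urn, urls in registry.items():
        if url in urls:
            return urn
    return None


def alternatives(failed_url, registry):
    """Return the other URLs that share a URN with the failed URL."""
    urn = urn_for(failed_url, registry)
    if urn is None:
        return []
    return [url for url in registry[urn] if url != failed_url]


print(alternatives("http://www.ou.edu/cas/slis", REGISTRY))
# ['http://www.ou.edu.cn/cas/slis']
```

The lookup works only for as long as someone maintains the registry, and it returns nothing once every copy of the resource has gone.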
 

UR× Progression

The URx progression from URL and URN to URC and URI is designed to build toward a permanent locator and identifier system for Web-based documents. These systems are fine so far as they go, except that they do not and cannot address the problem of resource removal or death. They are inherently designed to manage resource movement. This has resulted in the recognition of the need for some permanent marker and for some form of archiving. Responses include a proposal by Brewster Kahle to develop a Web archive and the OCLC PURL initiative. These are discussed further on the course "archives" page.
 

Page References and Required/Recommended Readings

[1] Marisa Urgo, "A Shape for Internet Information: An Alternative Metaphor for Web Site Information," a paper presented at the ASIS Annual Meeting, Pittsburgh, PA, October 1998.

[2] Wallace Koehler and Logan Barnett, "Domain Name Searching and World Wide Web Search Tactics," Searcher 6, 2 (February 1998), pp. 54-62.

[3] Wallace Koehler, "Unraveling the Issues, Actors, & Alphabet Soup of the Great Domain Name Debates," Searcher 7, 5 (May 1999). Available: http://www.infotoday.com/searcher/may99/koehler.htm

[4] Bella Hass Weinberg, "Improved Internet Access: Guidance from Research on Indexing and Classification," Bulletin of the American Society for Information Science 25, 2 (1999). Available: http://www.asis.org/Bulletin/Jan-99/weinberg.html

[5] Ian Peacock, "What is...a URI?" Ariadne 18 (1998). Available: http://www.ariadne.ac.uk/issue18/what-is/