
The "http" part of the address, the hypertext transfer protocol, indicates that the requested document is of a certain sort and specifies the means of transfer. There are other transfer protocols in common use. Among them are smtp (simple mail transfer protocol), gopher, telnet, and ftp (file transfer protocol).
Two major problems of importance to the bibliometric community have been identified with URLs.
As we have seen, identifying metadata (author, language, character set,
format, size, rights, keywords, abstract) may be located in the document's
html code [1] [2].
But the URL itself carries none of that information.
Second, reading the URL from right to left, are domain markers. The rightmost fragment is the top level domain (TLD). There are two types of TLDs: gTLDs and ccTLDs. The gTLDs, or generic top level domains, are those which carry the .com, .edu, .gov, .int, .mil, .net, and .org tags. These gTLDs segregate the registered Web sites according to the general class of their owners: .com sites are registered by private commercial or economic enterprises, .org by not-for-profit enterprises, .net by network entities, .mil by the US military, .gov by US government agencies, .int by international governmental agencies, and .edu by educational institutions. Or at least that is the theory. Distinctions among .com, .org, and .net have become blurred as competition for "desirable" second level domains, particularly for identifiable trademarks, has increased [3].
Country code top level domains, or ccTLDs, identify the country of origin and, in some cases now, merely the domain name registry of convenience, again because of pressures for valuable trademark presence on the WWW [3]. ccTLDs are two-letter codes based on ISO 3166. ISO 3166 is an International Organization for Standardization standard that provides codes to identify countries and other political subdivisions. ISO 3166 includes two-letter, three-letter, and numeric indicators. We are only concerned with the two-letter codes. Examples include .ar for Argentina, .au for Australia, .ca for Canada, .cn for the People's Republic of China, .fr for France, .ru for Russia, .us for the United States, and .za for South Africa. The full list can be found in many places, among them on the HOTBOT search engine site or at http://www.purduenc.edu/ce/infofile/country.html
The second-level domain or 2LD may take two forms. For some ccTLDs, the 2LD is a functional tag in the same way a gTLD is. Thus, educational institutions in the United Kingdom are marked ac.uk, organizations in Japan are or.jp, and commercial entities in Costa Rica are co.cr. There is wide variation in this practice. For example, in the United States the geographic character of the domain name can extend to the third level: sf.ca.us can designate San Francisco.
The second approach is to provide a name, trademark, or other discriminator at the 2LD. Examples are www.ou.edu, where "ou" stands for the University of Oklahoma and "edu" for an educational institution, and www.presidencia.gob.mx, where "presidencia" indicates the presidency, a Web site on the government ("gob" for gobierno) server of Mexico.
There is a logic to the domain naming structure. For a fairly straightforward but slightly dated explanation of all this, see Koehler and Barnett [2]. That logic can be used to classify Web sites, albeit imperfectly. Each of the fragments can be parsed from the URL and used to catalog a document.
See also the discussion under cataloging.
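The parsing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name classify_url and the returned field names are hypothetical, the gTLD set is limited to the seven tags named in the text, and any other two-letter TLD is treated as a ccTLD without checking it against ISO 3166.

```python
from urllib.parse import urlsplit

# Generic TLDs named in the text. Any other two-letter TLD is assumed to be
# a country code; no ISO 3166 lookup is performed in this sketch.
GTLDS = {"com", "edu", "gov", "int", "mil", "net", "org"}


def classify_url(url):
    """Split a URL into its protocol and its domain fragments, right to left."""
    parts = urlsplit(url)
    host = parts.hostname or ""          # hostname is returned in lower case
    labels = host.split(".")
    tld = labels[-1]
    if tld in GTLDS:
        tld_type = "gTLD"
    elif len(tld) == 2 and tld.isalpha():
        tld_type = "ccTLD"
    else:
        tld_type = "unknown"
    return {
        "protocol": parts.scheme,
        "tld": tld,
        "tld_type": tld_type,
        "second_level": labels[-2] if len(labels) >= 2 else "",
        "lower_levels": labels[:-2],     # everything left of the 2LD
    }


print(classify_url("http://www.ou.edu/cas/slis"))
print(classify_url("http://www.presidencia.gob.mx"))
```

For www.presidencia.gob.mx the sketch reports "mx" as a ccTLD and "gob" as the second-level fragment, which is the functional-tag case described earlier. Deciding whether a 2LD is a functional tag or a name would need a per-country table that is not attempted here.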
Nothing is easy. One of the leading indexers of our day, Bella Hass Weinberg [4], warns us first that "there is nothing new under the sun," but also that the road to indexing is fraught with difficulties. She is, of course, right. But every little bit helps. If we parse from URLs all the information useful for cataloging and indexing purposes (and we can), however non-specific, I suggest that it is better than nothing.
URN-based applications include the DOI and handles as well as PURLs.
The whois++ URC is a bundle of at least two attributes and values. These
attributes and values would include the unique URN,
single or multiple URLs (for documents can be, and are, mirrored at more
than one address), and various other characteristics, such as content-related
data: length, type, language, and other header data.
The trivial URC eliminates the content characteristics from the URC bundle, leaving the unique URN and the single or multiple URLs. Thus to press the analogy, a trivial URC for me could be my SSN as URN, and my home and office addresses as URLs.
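The relationship between the fuller bundle and the trivial URC can be pictured as a small data structure. The sketch below is not a whois++ or URC implementation; the class name URC, its field names, and the URN value are all hypothetical and serve only to show which parts of the bundle the trivial form keeps.

```python
from dataclasses import dataclass, field


@dataclass
class URC:
    """A URC bundle: one URN, one or more URLs, optional content data."""
    urn: str
    urls: list
    characteristics: dict = field(default_factory=dict)

    def trivial(self):
        """Return the trivial URC: same URN and URLs, no content data."""
        return URC(urn=self.urn, urls=list(self.urls))


full = URC(
    urn="urn:example:metadata-made-easy",   # hypothetical URN
    urls=["http://www.ou.edu/cas/slis", "http://www.ou.edu.cn/cas/slis"],
    characteristics={"type": "text/html", "language": "en"},  # assumed values
)
print(full.trivial())
```

The trivial form still carries every known address for the resource, so the URN continues to lead to all of its mirrors even though the descriptive data has been dropped.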
The SGML URC incorporates the trivial URC and adds an SGML DTD to the bundle. This translates into a URC that describes document-like entities, and these resemble Dublin Core metatags. The URC, placed between the <head> and </head> tags, might take the following form. The developers note that the DTD is kept simple in order to keep parsing simple, and therefore to keep interpretation and the writing of interpreters simple.
<urc>  <!-- activates the urc -->
<urn>urn object name</urn>  <!-- provides the unique object name -->
<author>Koehler, Wallace</author>  <!-- gives a DC-like author name -->
<author type="email">wkoehler@ou.edu</author>  <!-- alternative author id -->
<title>Metadata made easy</title>  <!-- page/document title -->
<subject scheme="abstract">  <!-- what is it about -->
Metadata can be used for fun and profit. Interpreting it results in healthier children. Once used, you can feed it to your pets.
</subject>
<instance>  <!-- describing some modifiers -->
  <coverage>A note on metadata</coverage>  <!-- what's it about -->
  <url>http://www.ou.edu/cas/slis</url>  <!-- Site URL -->
</instance>
<instance>
  <coverage>A mirror site in China</coverage>  <!-- Mirror site for document -->
  <url>http://www.ou.edu.cn/cas/slis</url>  <!-- Mirror site URL for document -->
</instance>
</urc>  <!-- The end -->
Note the repetition to define each separate or
unique qualifier.
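Because the sample URC uses simple, properly nested tags, reading it back is straightforward. The sketch below assumes the markup is well formed enough to be handled by an XML parser (a simplification, since the URC is defined with an SGML DTD), and the function name read_urc and its output fields are hypothetical. It uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# The sample URC from the text, without the annotations and the abstract.
URC_TEXT = """
<urc>
  <urn>urn object name</urn>
  <author>Koehler, Wallace</author>
  <author type="email">wkoehler@ou.edu</author>
  <title>Metadata made easy</title>
  <instance>
    <coverage>A note on metadata</coverage>
    <url>http://www.ou.edu/cas/slis</url>
  </instance>
  <instance>
    <coverage>A mirror site in China</coverage>
    <url>http://www.ou.edu.cn/cas/slis</url>
  </instance>
</urc>
"""


def read_urc(text):
    """Collect the URN, title, authors, and every instance URL from a URC."""
    root = ET.fromstring(text.strip())
    return {
        "urn": root.findtext("urn"),
        "title": root.findtext("title"),
        # An author element without a type attribute is treated as a name.
        "authors": [(a.get("type", "name"), a.text)
                    for a in root.findall("author")],
        "urls": [inst.findtext("url") for inst in root.findall("instance")],
    }


print(read_urc(URC_TEXT))
```

Each repeated element (author, instance) becomes one entry in a list, which is the practical payoff of repeating a tag for every separate qualifier.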
URIs can be considered URx metadata: their defined syntax strings point to and incorporate the URL, URC, and URN [5]. Resources (the "R" in URx) are anything we can identify, point to, or know about. For us as librarians and information scientists, resources include books, articles, Web documents, a sentence on a page, a serial, comic books, recordings of TV programs, music CDs, and so on. Identifiers are proxies for resources. A human being (resource) is labeled by a name (identifier), books are known by ISBN codes, Peacock's article is a five [5].
The value of the URI lies in its ability to translate identity from
one of its constituent parts to another: a resource identified within
the URI context, say through its URL, imparts identity to its URN and URC.
Thus, it is possible to retain identity even when a URL fails because the
URI has also "tagged" the URN and the URC.
[2] Wallace Koehler and Logan Barnett, "Domain Name Searching and World Wide Web Search Tactics," Searcher 6, 2 (February 1998), pp. 54-62.
[3] Wallace Koehler, "Unraveling the Issues, Actors, & Alphabet Soup of the Great Domain Name Debates," Searcher 7, 5 (May 1999). Available: http://www.infotoday.com/searcher/may99/koehler.htm
[4] Bella Hass Weinberg, "Improved Internet Access: Guidance from Research on Indexing and Classification," Bulletin of the American Society for Information Science 25, 2 (1999). Available: http://www.asis.org/Bulletin/Jan-99/weinberg.html
[5] Ian Peacock, "What is...a URI?" Ariadne 18 (1998). Available: http://www.ariadne.ac.uk/issue18/what-is/