Copyright © 2002 Wallace Koehler - All Rights Reserved
Mark up languages vary greatly in complexity and use. They all have one thing in common: they are meant to be used in the electronic digital environment. As we know, that is a large environment that extends well beyond the Web. We explore some of the non-Web mark up languages because they could be migrated to Web applications, or they may give rise or inspiration to those applications.
An excellent descriptive resource for mark up languages can be found at http://ukoln.ac.uk/metadata/desire/overview/rev_toc.htm
At a minimum, the header for this page looks like:
| <head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Mozilla/4.7 [en] (Win95; U) [Netscape]">
<meta name="Author" content="wallace koehler">
<title>Mark Up Made Easy</title>
<!--This file created 12:01 PM 2/20/00 by Claris Home Page version 3.0-->
<X-CLARIS-WINDOW TOP=0 BOTTOM=435 LEFT=0 RIGHT=788>
</head> |
As we have already seen, a wide range of metatags can be added to provide far more information to catalogers, search engine spiders, and others seeking to define the information content of the page.
As promised, here is a more complex set of headers. These are the Dublin Core headers for the Dublin Core page in this document, repeated from that page. Note first that everything lies between the <head> tags. Each Dublin Core metatag consists of three parts: (1) the meta name, (2) a schema statement, and (3) a reference. Each metatag defines document data of some kind (title or creator name, for example). The schema statement identifies the scheme used. The reference provides the reader with the justification and definition for the tag.
Other header systems may or may not provide this degree of documentation
within the header. Dublin Core is one of the simpler of the formal mark
up systems, as you will see as you explore others listed below.
| <head>
<title> Web Management: Dublin Core </title> <META NAME="DC.Title" CONTENT="Web Document Management"> <LINK REL=SCHEMA.dc HREF="http://purl.org/metadata/dublin_core_elements#title">
<META NAME="DC.Title.Alternative" CONTENT="LIS 5990 Summer 2000">
<META NAME="DC.Creator.PersonalName" CONTENT="Koehler, Wallace">
<META NAME="DC.Creator.PersonalName.Address"
<META NAME="DC.Subject" CONTENT="metadata">
<META NAME="DC.Subject" CONTENT="bibliographic control">
<META NAME="DC.Subject" CONTENT="Web sites">
<META NAME="DC.Subject" CONTENT="WWW">
<META NAME="DC.Subject" CONTENT="(SCHEME=LCSH) Dublin Core,
<META NAME="DC.Subject" CONTENT="(SCHEME=LCSH) Internet">
<META NAME="DC.Subject" CONTENT="(SCHEME=LCSH) Library Catalogs
<META NAME="DC.Subject" CONTENT="(SCHEME=LCSH) Indexing,
<META NAME="DC.Subject" CONTENT="(SCHEME=LCSH) Classification and
<META NAME="DC.Subject" CONTENT="(SCHEME=LCCS) ZA">
<META NAME="DC.Subject" CONTENT="(SCHEME=LCCS) Z">
<META NAME="DC.Description" CONTENT="A library school web-based
<META NAME="DC.Publisher" CONTENT="School of Library and Information
<META NAME="DC.Date" CONTENT="(SCHEME=ISO8601) 2000-05-01">
<META NAME="DC.Type" CONTENT="Text.Index">
<META NAME="DC.Format" CONTENT="(SCHEME=IMT) text/html">
<META NAME="DC.Identifier" CONTENT="http://www.ou.edu/">
<META NAME="DC.Language" CONTENT="(SCHEME=ISO639-1) en">
<META NAME="DC.Coverage" CONTENT="metadata">
<META NAME="DC.Rights" CONTENT="Copyright Wallace Koehler 2000 All
<META NAME="DC.Date.X-MetadataLastModified"
|
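To see how a cataloging program might read headers like these, here is a minimal sketch in Python using only the standard library. It is an illustration, not part of any Dublin Core toolkit: the class name `DublinCoreParser`, the helper `split_scheme`, and the shortened sample header are all assumptions made for this example. The sample keeps only lines that appear complete above.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect <meta name="DC.*" content="..."> pairs from an HTML header."""

    def __init__(self):
        super().__init__()
        # Repeatable elements such as DC.Subject are kept as lists.
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # HTMLParser lowercases attribute names
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.upper().startswith("DC.") and content is not None:
            self.elements.setdefault(name, []).append(content)


def split_scheme(content):
    """Split '(SCHEME=LCSH) Internet' into ('LCSH', 'Internet')."""
    prefix = "(SCHEME="
    if content.startswith(prefix) and ")" in content:
        scheme, _, value = content[len(prefix):].partition(")")
        return scheme, value.strip()
    return None, content  # no scheme qualifier present


# A shortened copy of the header above (assumption: complete lines only).
SAMPLE_HEAD = """<head>
<title>Web Management: Dublin Core</title>
<META NAME="DC.Title" CONTENT="Web Document Management">
<META NAME="DC.Creator.PersonalName" CONTENT="Koehler, Wallace">
<META NAME="DC.Subject" CONTENT="metadata">
<META NAME="DC.Subject" CONTENT="(SCHEME=LCSH) Internet">
<META NAME="DC.Date" CONTENT="(SCHEME=ISO8601) 2000-05-01">
</head>"""

parser = DublinCoreParser()
parser.feed(SAMPLE_HEAD)
parser.close()
dc_elements = parser.elements

for name, values in dc_elements.items():
    for value in values:
        print(name, split_scheme(value))
```

Keeping each element as a list reflects the fact that Dublin Core elements such as DC.Subject may repeat, as they do in the header above.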
Let us take a block of text from within this Web site. It has
the following appearance in the browser or when printed out.
|
The course is designed to familiarize graduate students and working librarians with the various approaches to and problems with bibliographic management of the Web strategies either at the author end or at the cataloger end. The first question necessarily must be "can we do it", followed quickly by "should we do it." If we can do and if we should do it, what is it we should do? These define the purpose of this course. Purpose, Expectations, and
|
Yet embedded within this text are numerous HTML codes, which look like this:
| <a href="#contact"><img SRC="button_gif.gif" BORDER=0 X-CLARIS-USEIMAGEWIDTH
X-CLARIS-USEIMAGEHEIGHT height=40 width=42 align=TEXTTOP></a>
<p>The course is designed to familiarize graduate students and working librarians with the various approaches to and problems with bibliographic management of the Web strategies either at the author end or at the cataloger end. <p>The first question necessarily must be "can we do it", followed quickly by "should we do it." If we can do and if we should do it, what is it we should do? These define the purpose of this course. <br> <br> <br> <table BORDER=0 WIDTH="99%" BGCOLOR="#FFFFFF" > <tr> <td WIDTH="42" HEIGHT="136" BGCOLOR="#FFFFFF"></td> <td WIDTH="99%" HEIGHT="136">
|
From the point of view of the cataloger or indexer, none of this is particularly useful in defining what the paragraph is "about" or how to manage its information content. It tells us only that there are control codes within the text, the size and location of graphics, and the addresses of linked objects. The various SGML DTD-based mark up languages add further codes, which differ from one mark up system to another, to specify metadata categories in the document.
Let's look again at the HTML text, this time with some "mark up." The document is color coded: green for title, red for author, lavender for corporate author and contact, and gray for abstract. This scheme could work, for although the source code is not itself colorful, it carries a color code. For example, the title is rendered <font color="#009900">Web Document Management Course</font>. If our system were defined to populate a catalog's title field each time it encountered <font color="#009900">document title</font>, we would have a start at a cataloging mark up language. We would of course be precluded from using these four colors for other purposes. Moreover, our mark up lexicon would be as limited as our palette.
| <p>Welcome to the Web Document Management Course.
It is designed to be
self-contained, sometimes synchronous, more often asynchronous Web based course offered by <a href="http://www.ou.edu/cas/slis/ethics/CV/Index.htm">Dr. Wallace Koehler</a>, <a href="http://www.ou.edu/cas/slis">MLIS</a>, at the <a href="http://www.ou.edu">Valdosta State University</a>.<a href="#contact"><img SRC="button_gif.gif" BORDER=0 X-CLARIS-USEIMAGEWIDTH X-CLARIS-USEIMAGEHEIGHT height=40 width=42 align=TEXTTOP></a> <p>The course is designed to familiarize graduate students and working librarians with the various approaches to and problems with bibliographic management of the Web strategies either at the author end or at the cataloger end. <p>The first question necessarily must be "can we do it", followed quickly by "should we do it." If we can do and if we should do it, what is it we should do? These define the purpose of this course. <br> <br> <br> <table BORDER=0 WIDTH="99%" BGCOLOR="#FFFFFF" > <tr> <td WIDTH="42" HEIGHT="136" BGCOLOR="#FFFFFF"></td> <td WIDTH="99%" HEIGHT="136">
|
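The color-code idea can be sketched in a few lines of Python. This is only an illustration of the principle: the class name `ColorCodeParser` and the sample sentence are invented for the example, and while #009900 is the green given above for titles, the red value #cc0000 for author is an assumption, since the text does not give its hex code.

```python
from html.parser import HTMLParser

# Color-to-field lexicon. Only "#009900" (title) comes from the text above;
# "#cc0000" for author is a hypothetical value chosen for this example.
COLOR_FIELDS = {"#009900": "title", "#cc0000": "author"}


class ColorCodeParser(HTMLParser):
    """Populate catalog fields from text wrapped in color-coded <font> tags."""

    def __init__(self, color_fields):
        super().__init__()
        self.color_fields = color_fields
        self.record = {}
        self._open_fonts = []  # catalog field (or None) for each open <font>

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            color = (dict(attrs).get("color") or "").lower()
            self._open_fonts.append(self.color_fields.get(color))

    def handle_endtag(self, tag):
        if tag == "font" and self._open_fonts:
            self._open_fonts.pop()

    def handle_data(self, data):
        field = self._open_fonts[-1] if self._open_fonts else None
        text = " ".join(data.split())
        if field and text:
            self.record.setdefault(field, []).append(text)


# Hypothetical sample sentence in the style of the page above.
SAMPLE = (
    '<p>Welcome to the <font color="#009900">Web Document Management '
    'Course</font>, offered by <font color="#cc0000">Dr. Wallace Koehler'
    "</font>.</p>"
)

parser = ColorCodeParser(COLOR_FIELDS)
parser.feed(SAMPLE)
parser.close()
record = parser.record
print(record)
```

The dictionary `COLOR_FIELDS` makes the limitation plain: the lexicon has exactly as many entries as there are colors reserved for cataloging.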
The value and ease of this kind of mark up language is that we can pluck metadata elements from the text of the document. Moreover, we can place terms and concepts in context if our mark up language is sufficiently granular or rich. Consider if we were limited to mark up for subjects and had but one code for it, say <s>subject</s>. Consider the sentence: "Henry Ford built Ford cars." To mark it up "Henry <s>Ford</s> built <s>Ford</s> cars" would be ambiguous at best.
What if we could modify those terms and mark them up as "personalname" and "productname"? It might look like this: "Henry <spern>Ford</spern> built <spron>Ford</spron> cars"
As yet we have not marked up the object "car," the verb "built," nor Ford's first name. We could. "Built" is the past tense of the transitive verb "to build." Maybe we could mark it <ptt>built</ptt>. The word "car" could be defined as a non-specific noun (<nsn> ?) and so on.
In theory we could mark up every word in a document. But consider: if we had a document with its first sentence marked up "<sperfn>Henry</sperfn> <sperln>Ford</sperln> <ptt>built</ptt> <spron>Ford</spron> <nsn>cars</nsn>", we would be able to populate a catalog record with a great deal of information. We could also specify and retrieve records describing actions in the past that involve a specified historical person manufacturing a specified product (cars, not trucks carrying the Ford logo).
As you can see, this gets pretty complicated in a hurry. But that's not the end. Let's address the word "car" and its classification. Let's redefine it. "Car" belongs to the phylum "vehicle," genus "land," species "motorized," race "four-wheel." We might classify it nvlmf. Thus, we might mark the sentence up as "<sperfn>Henry</sperfn> <sperln>Ford</sperln> <ptt>built</ptt> <spron>Ford</spron> <nvlmf>cars</nvlmf>". If our document were only about Henry Ford building cars, we might not need to repeat the code nor add others. Why go to the trouble? Surely the character string "car" can be searched. But that is more ambiguous than "nvlmf AND car." And if your return set is numbered in the tens of thousands, as it often is on the Web, anything that narrows toward relevance should be considered.
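A short Python sketch shows why a search on code plus term is narrower than a search on the character string alone. The codes (sperfn, sperln, ptt, spron, nvlmf) are the invented ones from the discussion above; the function names `index_terms`, `matches`, and `codes_for` are assumptions made for this example, and the pattern handles only simple, non-nested tags.

```python
import re

# Matches <code>term</code> for simple, non-nested tags.
TAGGED = re.compile(r"<(\w+)>(.*?)</\1>")


def index_terms(marked_up):
    """Return a list of (code, term) pairs found in a marked-up sentence."""
    return [(m.group(1), m.group(2)) for m in TAGGED.finditer(marked_up)]


def matches(pairs, code, term):
    """True when the term occurs under the given code ("code AND term")."""
    return any(c == code and t.lower() == term.lower() for c, t in pairs)


def codes_for(pairs, term):
    """List every code under which a term appears."""
    return [c for c, t in pairs if t.lower() == term.lower()]


sentence = (
    "<sperfn>Henry</sperfn> <sperln>Ford</sperln> <ptt>built</ptt> "
    "<spron>Ford</spron> <nvlmf>cars</nvlmf>"
)
pairs = index_terms(sentence)

print(codes_for(pairs, "Ford"))          # the person and the product
print(matches(pairs, "spron", "Ford"))   # Ford as a product name
print(matches(pairs, "nvlmf", "cars"))   # cars as four-wheel motor vehicles
```

A plain string search for "Ford" cannot tell the two occurrences apart; the pair (code, term) can.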
Would Web authors be willing to adopt such a mark up language? I think the answer is yes and no. In the "early years" of word processing, we had to remember all sorts of key strokes to create documents. Word processing is now about as complex as typing. Using an HTML text editor is also much more straightforward than it was not too long ago. It is possible to build a text editor that can embed the appropriate code using drop-down menus or other aids. Will we do it?
ALIWEB does not scan the native page. Its weakness is that it depends upon Web page creators to maintain the integrity of the bibliographic record. But this is also one of its strengths, for the record is author-created.
The template consists of five elements or tags. These are:
| "SITEINFO for a record containing information about the server
ORGANIZATION for a record containing information about the organisation
DOCUMENT for a record containing information on documents (pages) on the server
SERVICE for a record containing information on services available on the server
USER for a record containing information on users at the site" |
An example of a fully marked record can be found at http://www.nexor.com/site.idx
The Harvest SOIF contains twenty elements:
| Abstract | Brief abstract about the object. |
| Author | Author(s) of the object. |
| Description | Brief description about the object |
| File-Size | Number of bytes in the object. |
| Full-Text | Entire contents of the object. |
| Gatherer-Host | Host on which the Gatherer ran to extract information from the object. |
| Gatherer-Name | Name of the Gatherer that extracted information from the object. (eg. Full-Text, Selected-Text, or Terse). |
| Gatherer-Port | Port number on the Gatherer-Host that serves the Gatherer's information. |
| Gatherer-Version | Version number of the Gatherer. |
| Keywords | Searchable keywords extracted from the object. |
| Last-Modification-Time | The time that the object was last modified (in seconds since epoch). |
| MD5 | MD5 16-byte checksum of the object. |
| Partial-Text | Only the selected contents from the object |
| Refresh-Rate | How often the Broker attempts to update the content summary (in seconds relative to Update-Time). |
| Time-to-Live | How long content summary is valid (in seconds relative to Update-Time). |
| Title | Title of the object. |
| Type | Example: Archive, Audio, Awk, Backup, Binary, C, CHeader, Command, Compressed, CompressedTar, Configuration, Data, Directory, DotFile, Dvi, FAQ, FYI, Font, FormattedText, GDBM, GNUCompressed, GNUCompressedTar, HTML, Image, Internet-Draft, MacCompressed, Mail, Makefile, ManPage, Object, OtherCode, PCCompressed, Patch, Perl, PostScript, RCS, README, RFC, SCCS, ShellArchive, Tar, Tcl, Tex, Text, Troff, Uuencoded, and WaisSource |
| Update-Time | The time that Gatherer updated (generated) the content summary from the object (in seconds since the epoch). |
| URL | URL of the object. |
| URL-References | Any URL references present within HTML objects. |
Source: http://xtal1.sdsc.edu/Harvest/brokers/Attributes.html
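Several of these elements are defined relative to one another, and a small Python sketch makes the relationships concrete. This is not the SOIF record format itself, which is not reproduced here: it simply stores a few of the element names from the table in a dictionary. The function names `summarize` and `is_valid`, the sample URL, and the sample times are assumptions made for this example.

```python
import hashlib


def summarize(url, content, update_time, time_to_live):
    """Build a partial content summary using element names from the table.

    content is the object as bytes; times are in seconds since the epoch,
    and time_to_live is in seconds relative to Update-Time.
    """
    return {
        "URL": url,
        "File-Size": len(content),                 # number of bytes
        "MD5": hashlib.md5(content).hexdigest(),   # 16 bytes, shown as hex
        "Update-Time": update_time,
        "Time-to-Live": time_to_live,
    }


def is_valid(summary, now):
    """A summary is valid until Update-Time plus Time-to-Live has passed."""
    return now <= summary["Update-Time"] + summary["Time-to-Live"]


# Hypothetical object and times, chosen only for illustration.
summary = summarize(
    url="http://www.example.org/index.html",
    content=b"<html><title>Example</title></html>",
    update_time=1_000_000,
    time_to_live=86_400,  # one day
)

print(summary["File-Size"], summary["MD5"])
print(is_valid(summary, now=1_000_000 + 3_600))    # one hour later
print(is_valid(summary, now=1_000_000 + 172_800))  # two days later
```

The MD5 checksum is 16 bytes long, which is why its hexadecimal form has 32 characters.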
Harvest/SOIF appears to be no longer supported
at the University of Colorado, Boulder.
David Beckett, IAFA Templates in Use as Internet Metadata. Available: http://www.w3j.com/1/a.052/paper/052.html
Darren R. Hardy, Michael F. Schwartz, Duane Wessels, Harvest User's Manual, v 1.2 dated April 1995. Available: http://www.epi.mh-hannover.de/~bueker/harvest/user-manual/user-manual.html
PUBLISHER'S MARKUP
ONIX International (Online Information Exchange) is a metadata standard developed for and by publishers and online book sellers. It is an Internet-compliant standard, first released in 2000. It can be used to describe both traditional and e-publications. It is an XML-based markup language supported by its own document type definition.
There are a number of specialized RDF applications under development. For example, the Information and Content Exchange (ICE) is a mark up language designed as a B2B (business to business) publishing standard "for use by content syndicators and their subscribers." (http://www.w3.org/TR/NOTE-ice).
The RSS (RDF/Rich Site Summary) is an RDF-based metadata system for news transfer and mark up.
PRISM (Publishing Requirements for Industry Standard Metadata) is an RDF-based software publishing industry mark up standard.
NewsML, developed by the International Press Telecommunications Council, is similar to RSS and facilitates news data exchanges. It is not, however, an RDF-based system.
NITF was developed as a complement to NewsML. NITF is an html-based mark up for the transfer of textual news stories. For additional information on these applications, see http://www.xmlnews.org/ and http://www.ilrt.bris.ac.uk/discovery/rdf-dev/roads/subject-listing/rdfapps.html.
A crosswalk from FGDC Digital Geospatial Metadata to USMARC can be found at http://www.alexandria.ucsb.edu/public-documents/metadata/fgdc2marc.html.
As has already been pointed out, the archival and historical communities have been active in document mark up. The following list is not exhaustive, but these are important applications.
- CES (Corpus Encoding Standard) http://www.cs.vassar.edu/CES/CES1-1.html
- EAD DTD (Encoded Archival Description Document Type Definition) http://www.loc.gov/ead/ead/
- MEP (Model Editions Partnership) http://mep.cla.sc.edu/mepinfo/mep-info.html
- TEI (Text Encoding Initiative) http://www.tei-c.org/
CES follows TEI standards and is described as a TEI subset, or "simplified" TEI.
The CES Header acts as an "electronic title page" and consists of an extensive element tree. For an example of the CES markup for George Orwell's 1984, go to the bottom of the Web page at http://www.cs.vassar.edu/CES/CES1-3.html
MEP markup guidelines are found at http://MEP.cla.sc.edu/MepGuide.html. Like other archival markup languages, MEP conforms to TEI standards.
MEP consists of both a header definition and textual markup within the body of the document. MEP, like the others, places the bibliographic definition between its <mepHeader></mepHeader> tags, but it also marks up the electronic surrogate of the document itself between <doc></doc> tags.
EAD consists of three main elements: header <eadheader></eadheader>, frontmatter <frontmatter>, and archival description <archdesc>. The EAD header consists of tags that define elements and attributes and follow TEI standards. Frontmatter is seen as a bibliographic descriptor that is less formal in structure than the header and need not follow TEI format. Archival description contains information about the document body. (See http://lcweb.loc.gov/ead/tglib/tlover.html)
| <teiHeader>
<fileDesc>
<titleStmt>Title</titleStmt>
<editionStmt>Edition</editionStmt>
<extent>Extent</extent>
<publicationStmt>Pub</publicationStmt>
<seriesStmt>Series</seriesStmt>
<notesStmt>Notes</notesStmt>
<sourceDesc>Source</sourceDesc>
</fileDesc>
<more header here>
</teiHeader> |
The TEI header contains a series of elements. The first is always the file description <fileDesc>. It is designed to provide standard bibliographic referents as shown above.
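Because the TEI header is ordinary XML, a catalog record can be pulled from the file description with a standard XML parser. The following Python sketch assumes the simplified header shown above, with placeholder values such as "Title" and "Pub" and without the "more header here" placeholder, which is not well-formed XML. Real TEI headers nest further elements inside each statement, so this is an illustration of the principle only.

```python
import xml.etree.ElementTree as ET

# Simplified header modeled on the example above (placeholder values).
TEI_HEADER = """<teiHeader>
  <fileDesc>
    <titleStmt>Title</titleStmt>
    <editionStmt>Edition</editionStmt>
    <extent>Extent</extent>
    <publicationStmt>Pub</publicationStmt>
    <seriesStmt>Series</seriesStmt>
    <notesStmt>Notes</notesStmt>
    <sourceDesc>Source</sourceDesc>
  </fileDesc>
</teiHeader>"""

root = ET.fromstring(TEI_HEADER)

# The first child of the header is the file description.
first_element = root[0].tag
file_desc = root.find("fileDesc")

# Map each bibliographic statement to its text.
record = {child.tag: (child.text or "").strip() for child in file_desc}

for tag, text in record.items():
    print(f"{tag}: {text}")
```

The element names double as field names, so the same loop works however many statements the file description carries.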
The second major element is the Encoding Description <encodingDesc>. It is an optional description of the methods and editorial principles underlying the document description. There are nine optional subdivisions to the <encodingDesc>.
The third major element is the Profile Description <profileDesc>. This too is an optional element, and it includes non-bibliographic information surrounding a document. Its subelements describe how a document came to be <creation>, the languages and sub-languages found in the text <langUsage>, and the nature of the text <textClass>. There are additional subelements.
The fourth and final element is the Revision Description <revisionDesc>. This last element pertains not to the described document, but rather to revisions to the description. It is, if you will, a provenance of the bibliographic document description. It too is optional.